DocDB

DocDB module documentation

Goal

Dolibarr provides electronic document management (EDM) functionality.
Users upload documents to the server's file system, and some documents, such as invoices, are generated and stored by the EDM system itself. Management data, meanwhile, is stored in a database: everything entered in forms, such as invoice numbers, client names, amounts, and descriptions. The docDB module stores the documents in the database as well, instead of in the server's file system.

Operating principle

Dolibarr is coded in the PHP language. This language provides programmers with a variety of functions that allow them to understand and interact with the program's environment. For example, the echo function displays text on the browser screen, strpos gives the position of a string within another string, and sort sorts an array of values. These functions are numerous and cover almost all the problems developers encounter.
The PHP code of an application is contained in files, often referred to as scripts, which are used by the web server when the user makes a request in their browser. Without going into too much detail, the user clicks or interacts with their keyboard, the browser sends a request, the server analyzes the corresponding scripts, and returns the results produced by executing these scripts.
Regarding the EDM documents, Dolibarr uses functions dedicated to file management. Examples include opendir, readdir, or readfile, which read directories or files contained on the server's disk, and exif_read_data, imagejpeg, or getimagesize, which handle images. There are many more functions, and I have identified 99 of them as of now.
This system works well but has a number of limitations, including those of the file systems used for storage. These systems do not offer the flexibility and power of databases for preserving and accessing documents.
For example, it is not straightforward to set up replication from one server to another. Replication allows real-time synchronization of two identical data spaces, which is a powerful advantage for security and facilitates periodic backups. Not using replication nowadays implies operating in a mode that becomes increasingly archaic and tolerates the loss of a certain amount of data in the event of a server failure.
It is also not very easy to deploy and maintain two separate data management systems. On one side, we have the documents stored in the file system, and on the other side, we have management data stored in the database. A bug, an incident, or an unintentional operation on either of the two systems can lead to data becoming desynchronized. Again, it works, but it was not easy to develop and would not be easy to evolve.
Furthermore, character sets are not treated the same way in both systems, which introduces issues and limitations regarding document names.

So why was Dolibarr coded this way? The answer would logically come from the original developers, but we can speculate that at the time, databases did not handle large individual data items very well. Such items are stored in what are called blobs (binary large objects), which appeared relatively late in the history of databases, whereas handling large files was solved early on in the design of file systems.

This is the idea behind docDB: databases now handle blobs well, and it is advantageous to use them in Dolibarr.

There was never any question of rewriting a significant part of Dolibarr. Firstly, this would not give a choice to the technicians responsible for deployment. Secondly, it makes sense to test docDB over a sufficiently long period of time. Furthermore, it would be a lot of work.
The implemented solution simply replaces the PHP functions related to document management with functions that return equivalent results to the calling code but work differently underneath. For example, opendir is replaced by opendir_Bypass. Where opendir opened a directory on the disk, opendir_Bypass opens and reads a table in the database. This is called emulation: the original functions are emulated by the replacement functions, which produce the same results with a completely different underlying method. The same approach is applied to all functions related to documents. A more modern terminology might refer to it as aspect-oriented programming.
In practice, once docDB is activated, the document directory becomes almost completely useless. It only contains the Dolibarr event log, the famous install.lock file, and the necessary api subdirectory for certain operations. In any case, it no longer stores anything, and the documents can be deleted. They are now stored in two tables (llx_doc_data and llx_doc_directory) and are naturally associated with any operation related to the database.
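
To make the emulation idea concrete, here is a minimal sketch in Python (the language of the migration scripts) rather than PHP. It is not the code of docDB: it uses an in-memory SQLite database as a stand-in for MariaDB, the two tables are reduced to a few columns, and the function names list_directory and make_demo_db are hypothetical. It only shows how a directory listing can be answered from tables instead of from the disk.

```python
import sqlite3

def make_demo_db():
    """Build an in-memory stand-in for the two docDB tables (simplified columns)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE llx_doc_directory (rowid INTEGER PRIMARY KEY, path_name TEXT)")
    conn.execute("CREATE TABLE llx_doc_data (rowid INTEGER PRIMARY KEY, path_name TEXT, datablob BLOB)")
    conn.executemany(
        "INSERT INTO llx_doc_directory (path_name) VALUES (?)",
        [("",), ("/adherent",), ("/adherent/1",), ("/adherent/10",)],
    )
    conn.executemany(
        "INSERT INTO llx_doc_data (path_name, datablob) VALUES (?, ?)",
        [("/adherent/1/Jean.pdf", b"%PDF-1"), ("/adherent/10/Jeanne.pdf", b"%PDF-2")],
    )
    return conn

def list_directory(conn, dir_path):
    """Emulate a directory listing: names of the entries directly under dir_path."""
    prefix = dir_path.rstrip("/") + "/"
    names = set()
    for table in ("llx_doc_directory", "llx_doc_data"):
        for (path,) in conn.execute(f"SELECT path_name FROM {table}"):
            if path.startswith(prefix):
                rest = path[len(prefix):]
                # Keep direct children only, not entries of sub-directories.
                if rest and "/" not in rest:
                    names.add(rest)
    return sorted(names)

conn = make_demo_db()
print(list_directory(conn, "/adherent"))    # ['1', '10']
print(list_directory(conn, "/adherent/1"))  # ['Jean.pdf']
```

The calling code receives a list of names exactly as it would from the file system, which is the whole point of the approach: nothing above the emulated function needs to know where the data comes from.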

DocDB is activated using two scripts (docDBmigrateScripts.py and docDBmigrateFiles.py), each of which is used once. The first one modifies the Dolibarr script code, and the second one extracts the documents from the directory and loads them into the database. These two operations are reversible, allowing testing and rollback if necessary.
The detailed procedure is explained in the installation document (install-en_EN.html) delivered with the module. The operation is quite simple and fast for those with minimal command-line experience.

Limitations

Currently, docDB has a few limitations, but fortunately, they are minimal.

The main limitation is that it only works with MySQL, or more precisely, MariaDB. It would be entirely possible to develop an adaptation for PostgreSQL, but I'm waiting for a user to express the need for it. Implementing this is not straightforward because the management principle of blobs in PostgreSQL is different from that used in MySQL/MariaDB. In short, blobs cannot be placed in any table; a different approach is required.

Additionally, there is a performance issue that occurs only on Windows when the server is overloaded. If memory fills up and the system starts paging, document management slows down significantly. In practice, this is not a major concern, because the entire system becomes slow and corrective action is needed anyway to use Dolibarr under acceptable conditions. I encountered this issue during testing, as the servers were installed on virtual machines with limited memory. This issue does not occur on Linux.

In terms of minor limitations, empty directories are not recreated after a migration rollback. If you migrate to docDB and later, after deleting the documents on the disk, use the --reverse option to recreate the same documents, only non-empty directories will be generated. In reality, this has no significant impact because Dolibarr can create a directory when it needs to, and it does not affect its functionality.

And that's it. All of Dolibarr's functions are transparently accessible, which is the desired goal.

Differences

There is, however, a slight difference in functionality, which is as follows: When requesting the deletion of a folder in the EDM space, Dolibarr asks for confirmation, notifying if any documents are contained within that folder and indicating that they will be destroyed. However, if confirmed, the deletion fails, and the contents must be deleted one by one.
If docDB is enabled, the deletion of the entire folder structure works, after confirmation, of course.

Advantages and disadvantages

Enabling docDB gives you unified technical management of your favorite ERP.
Whenever data-related issues arise, the technician in charge of operations only needs to focus on the database and nothing else. Personally, I find this highly valuable. It applies to tasks such as implementation, backup management, restore, and security considerations.
This significantly reduces the time spent and the risk of errors.

Additionally, it becomes possible to fully leverage certain capabilities of the database. For example, replication allows for maintaining multiple identical copies of a database simultaneously. Incremental backups, which are more challenging to implement but highly useful for ensuring maximum security, can also be fully utilized.

There is also the issue of transactional operations. When you perform a series of queries and updates on an SQL database, they can be grouped into a transaction, which ensures that all of them are executed entirely or not at all. In other words, your database remains consistent regardless of program or hardware failures. In standard Dolibarr, however, this guarantee does not extend to the documents, which live in the file system outside any transaction. Unpleasant scenarios such as incomplete backups or partially synchronized data can be imagined. Such cases should be very rare in practice, but the idea remains uncomfortable.

Lastly, regarding Dolibarr development, it is challenging to program mass operations concerning documents. They do not follow the usual logic of database programming, requiring double consideration each time. This increases the programming workload and inevitably reduces the feasibility of developing new functions.

The advantage of the docDB module is precisely to store documents in the database alongside other data.
It becomes easier to achieve transactional integrity, replication, and incremental backups.
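
The transactional argument can be illustrated with a short sketch. It is an assumption-laden demonstration, not docDB code: it uses Python's built-in sqlite3 module instead of MariaDB, and the tables invoice and doc_data as well as the function store_invoice are hypothetical, simplified names. It shows that when the management row and the document are written in the same transaction, a failure on one undoes the other.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified tables: one for management data, one for documents.
conn.execute("CREATE TABLE invoice (ref TEXT PRIMARY KEY, amount REAL)")
conn.execute("CREATE TABLE doc_data (path_name TEXT PRIMARY KEY, datablob BLOB)")

def store_invoice(conn, ref, amount, pdf_bytes):
    """Write the invoice row and its PDF in one transaction: both or neither."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("INSERT INTO invoice VALUES (?, ?)", (ref, amount))
        conn.execute(
            "INSERT INTO doc_data VALUES (?, ?)",
            (f"/invoice/{ref}/{ref}.pdf", pdf_bytes),
        )

store_invoice(conn, "FA2304-0001", 120.0, b"%PDF-demo")

# This attempt fails on the document insert (duplicate path),
# so the invoice row inserted just before it is rolled back too.
try:
    with conn:
        conn.execute("INSERT INTO invoice VALUES (?, ?)", ("FA2304-0002", 80.0))
        conn.execute(
            "INSERT INTO doc_data VALUES (?, ?)",
            ("/invoice/FA2304-0001/FA2304-0001.pdf", b"duplicate"),
        )
except sqlite3.IntegrityError:
    pass

invoice_count = conn.execute("SELECT COUNT(*) FROM invoice").fetchone()[0]
print(invoice_count)  # 1
```

With documents on the file system, the second step would be a file write that the database cannot roll back, which is exactly the inconsistency described above.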

Sustainability

Once DocDB is enabled, it is nothing more than a single PHP script. It utilizes standard functions documented in the official PHP documentation. The license under which it is distributed allows for full ownership (see below: license).
The question arises about its evolution if a new function were to appear in Dolibarr that is not supported in DocDB.
While we cannot exclude this possibility, in such a case, DocDB would need to be extended (see below: extending DocDB). I would certainly take care of that, but I am not immortal. However, it can be confidently stated that open-source software developed by a single programmer who has passed away is easier to evolve than proprietary software written by a large team at a bankrupt company. We, as Dolibarr users, know this well, and it is one of the things we appreciate about this software.
Furthermore, the migration scripts are written in the Python language (www.python.org), another widely known and popular scripting language, at least as significant as PHP. Why did I choose Python? Simply due to personal preference. I believe it is more convenient for performing these kinds of operations. The question of Dolibarr's evolution is quite disconnected from the work done by these scripts, as they are not used in production but only during planned migrations, never unexpectedly and under stress. Additionally, these scripts are freely accessible and modifiable. While their functioning is indeed more complex than the PHP part, it is nothing really esoteric. A large number of programmers from various backgrounds would be capable of maintaining and modifying these scripts.

Usage by Third-Party Modules

If you are using a Dolibarr module that is not included in the standard modules, the first consequence is that it has likely not been tested with DocDB, and therefore, you need to ensure that everything works correctly.
Here is one possible procedure to follow if you find yourself in this situation:
- Set up a testing environment. It is important not to test directly in production. If that was your initial plan, I appreciate your trust, but sincerely, it is better to be cautious. This environment should be set up in the usual manner, including the DocDB migration, and you should take care to inject your data into the database, including documents.
- Install your module, preferably in htdocs/custom.
- Use the command docDBmigrateScripts.py to migrate your module. The command will only see the modifications to be made in the module since the Dolibarr scripts will have already been migrated. Feel free to use the --dry-run option during the first attempt to visualize what will happen.
That's all you need to do.
If your module does not manage EDM documents, there is a good chance that it will not be impacted at all. And if it directly handles documents without using Dolibarr's functions, some of its scripts will be modified, but it would be quite unlikely for them to encounter any execution issues.
We can't make any guarantees, especially in the field of computer science. If you encounter a problem, don't hesitate to contact me through the contact form on the website www.apia-asso.fr.

List of modules tested with docDB

The following modules have been successfully tested. If you want yours to be included in this list, use the contact form on the website www.apia-asso.fr to send me its name, the tested version, the test date, and your name or alias so that I can add them to the table below.

| Module  | Version | Test date  | Tester's name |
| Scanner | 17.0.0  | 15/05/2023 | jmbc          |

Tests and measurements conducted

The tests were performed with Dolibarr 17.0.0 on May 15, 2023.

List of operations included in the protocol:
- docDB migration
- document comparison: initial / after reverse
- adding/removing logo
- accented characters in document name
- verification of generating PDF customer invoices
- access to document thumbnails
- access to complete documents
- storage of third-party documents
- cropping of third-party documents
- storage of project documents
- EDM space: folder creation
- EDM space: file upload
- EDM space: sub-folder creation
- EDM space: folder/sub-folder deletion
- EDM space: rename folder/sub-folder/folder with files
- EDM space: rename file with accents
- EDM space: automatic folder structure - development
- EDM space: automatic folder structure - preview
- EDM space: automatic folder structure - access to customer invoices
- EDM space: automatic folder structure - access to projects
- EDM space: automatic folder structure - access to third parties
- test suite - creation of 10 clients via API
- test suite - creation of 100 invoices per client
- test suite - download of 100 documents per client
- access to documents after reverse

Test conditions

Four virtual machines, two with docDB (Linux and Windows), and two with native Dolibarr without docDB (Linux and Windows). The tests are conducted simultaneously on each system.
Operating Systems used: Linux Debian 11, Windows 10 Pro 1909
Database: MariaDB 10.5.18 (Linux), MariaDB 11.1 x64 (Windows)
PHP: 7.4.33 (Linux and Windows)
Python: 3.9.2 (Linux), 3.11.3 (Windows)
RAM Memory: 2GB (Linux), 4GB (Windows)

Performance Comparison

No measurable difference was observed between the four systems for any of the tested functions, with one exception: document download with docDB takes approximately ten percent longer (eleven hundredths of a second instead of ten).
In addition, the Windows systems were memory-constrained, which made the bulk download test (generating 1000 invoices and downloading 1000 documents through the Dolibarr API) approximately five to six times slower. The gap between docDB and native Dolibarr was more pronounced under these conditions: approximately five to seven seconds to generate an invoice and download a document with docDB, about twice as long as without docDB.
The conclusion is that docDB is less efficient in a memory-constrained Windows environment, which does not represent normal operating conditions.

Character Sets

The character set tested is utf8, which appears to be the standard operating mode in Dolibarr. Cases where the database uses a different character set are not tested (e.g., iso-8859-1). Using docDB without conducting additional tests is not recommended in these cases.
To determine the character set used in the database, run the following query, replacing dolibarr_base_name with the name of your Dolibarr database:
select default_character_set_name from information_schema.schemata where schema_name = "dolibarr_base_name";
The response should be utf8mb4 or utf8mb3.
Furthermore, the tables containing binary data for documents use a different character set and collation sequence than the standard ones used in Dolibarr. The reason is that in file systems, accented characters are different from their equivalent unaccented characters (É is different from E, à is different from a, etc.), whereas they are considered equivalent in an SQL query. This is problematic for emulating file system behavior in an SQL query, which is why docDB uses CHARSET=utf8 COLLATE=utf8_bin to achieve the desired results.
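
The difference between the two kinds of comparison can be sketched in a few lines of Python. This is only an approximation for illustration: the function fold_accents_and_case is a hypothetical stand-in for an accent- and case-insensitive collation, not the actual MariaDB algorithm, and equal_binary mimics the character-by-character comparison of a _bin collation.

```python
import unicodedata

def fold_accents_and_case(name):
    """Rough stand-in for an accent- and case-insensitive collation."""
    decomposed = unicodedata.normalize("NFD", name)
    # Drop the combining marks (accents) left by the decomposition.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

def equal_insensitive(a, b):
    """Comparison in which accented and unaccented letters are equivalent."""
    return fold_accents_and_case(a) == fold_accents_and_case(b)

def equal_binary(a, b):
    """Strict comparison on the encoded bytes, as a file system would do."""
    return a.encode("utf-8") == b.encode("utf-8")

print(equal_insensitive("Léo.pdf", "leo.pdf"))  # True
print(equal_binary("Léo.pdf", "leo.pdf"))       # False
```

With the insensitive comparison, a query looking for leo.pdf would also find Léo.pdf, whereas a file system treats them as two distinct files. The binary comparison reproduces the file system behavior.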

Command reference

Please refer to the installation guide install-en_EN.html provided with the module for usage examples.

Migration commands

These commands are used during the migration of the PHP sources and the loading of EDM documents. They should be used outside of production environments.

docDBmigrateScripts
Linux: python3 docDBmigrateScripts.py --dolibarrDir=path_name [--reverse] [--dry-run]
Windows: py -X utf8 docDBmigrateScripts.py --dolibarrDir=path_name [--reverse] [--dry-run]

Modifies the content of the scripts in the Dolibarr htdocs directory in order to enable database access.
For example, an instruction sequence like if (file_exists($file)) will be transformed into if (file_exists_Bypass($file)). This process is applied to all PHP scripts, except those copied from the install/scriptFiles folder of docDB.
The --reverse option restores the scripts to their original state.
--dolibarrDir Full path to the Dolibarr htdocs directory.
If Dolibarr is located at /var/www/dolibarr, then --dolibarrDir should point to /var/www/dolibarr/htdocs.
--reverse Restores the PHP scripts of Dolibarr to their initial state.
Note that this is not a restoration from backup files, but rather a reverse processing of the scripts found in the folder.
--dry-run Requests a test run.
The algorithm is followed exactly up to the point where the processed element would be modified, but no modification is made.
-X utf8 Windows only. Enables Python's UTF-8 mode, so that the script uses the UTF-8 character set.
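
The kind of transformation performed on the PHP sources can be sketched as follows. This is a simplified illustration and not the algorithm of docDBmigrateScripts.py: the FUNCTIONS list is a hypothetical three-item subset, and the regular expressions and the migrate and reverse functions are assumptions made for the example. It shows the forward renaming, the reverse renaming, and what a dry run amounts to (counting the changes without applying them).

```python
import re

# Hypothetical subset of the functions to redirect; the real list is in docDBmigrateScripts.py.
FUNCTIONS = ["file_exists", "opendir", "readfile"]
SUFFIX = "_Bypass"
NAMES = "|".join(FUNCTIONS)

# A bare function name followed by "(", not preceded by an identifier character, "->", "::" or "$".
FORWARD = re.compile(r"(?<![\w>:$])(" + NAMES + r")(?=\s*\()")
BACKWARD = re.compile(r"(?<![\w>:$])(" + NAMES + r")" + SUFFIX + r"(?=\s*\()")

def migrate(source, dry_run=False):
    """Rename the listed calls to their _Bypass version; return (text, number of changes)."""
    new_source, count = FORWARD.subn(lambda m: m.group(1) + SUFFIX, source)
    return (source if dry_run else new_source), count

def reverse(source, dry_run=False):
    """Undo the renaming by processing the text again, without any backup file."""
    new_source, count = BACKWARD.subn(lambda m: m.group(1), source)
    return (source if dry_run else new_source), count

php = "if (file_exists($file)) { readfile($file); }"
migrated, changes = migrate(php)
print(migrated)  # if (file_exists_Bypass($file)) { readfile_Bypass($file); }
print(changes)   # 2
restored, _ = reverse(migrated)
print(restored == php)  # True
```

Because an already migrated call no longer matches the forward pattern, running the migration a second time changes nothing, which is why a newly installed module can be migrated on top of an already migrated Dolibarr.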

docDBmigrateFiles
Linux: python3 docDBmigrateFiles.py --dolibarrDir=path_name [--reverse] [--dry-run]
Windows: py -X utf8 docDBmigrateFiles.py --dolibarrDir=path_name [--reverse] [--dry-run] [--phpdir=php_binaries_path]

This command reads the contents of the Dolibarr documents directory and loads it into the database. The directory path is obtained from the $dolibarr_main_data_root variable in the Dolibarr configuration file (htdocs/conf/conf.php). This file also provides the database access parameters.
If a file is reported as missing, correct the problem either in the database (table llx_ecm_files) or in the directory where the file is missing. Making the correction is preferable to ignoring the report, as it avoids future issues during operation. For example, you can use an SQL command like delete from llx_ecm_files where filename like '%missing_file_name%'; to remove a reference to a non-existing document. Before doing so, verify the safety of the query with a select command like select filepath,filename from llx_ecm_files where filename like '%missing_file_name%'; which lists every row that the delete would remove, including any unrelated documents.
--dolibarrDir Full path to the Dolibarr htdocs directory.
If Dolibarr is located at /var/www/dolibarr, then --dolibarrDir should point to /var/www/dolibarr/htdocs.
--reverse Loads the EDM documents from the database back to the Dolibarr documents directory.
--dry-run Requests a test processing.
The algorithm is followed exactly until the point of modifying the processed element, which does not occur.
-X utf8 Windows. Instructs Python to use the UTF-8 character set in the script.
--phpdir Windows. Specifies the path to the PHP executable for Python.
This option is required if the PHP executable path is not in the Windows PATH. The migration script calls PHP to obtain certain characteristics of image files. Standard Dolibarr reads these characteristics from the file on demand and does not store them; docDB stores them in the database and serves them to Dolibarr when requested.
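
Before migrating, it can be useful to know which references in llx_ecm_files have no file on disk. The sketch below shows one way to check this, under stated assumptions: the rows are given as (filepath, filename) pairs as returned by the select query on llx_ecm_files, the filepath is taken to be relative to the documents directory, and the function find_missing is hypothetical (it is not part of docDB). A temporary directory stands in for $dolibarr_main_data_root.

```python
import os
import tempfile

def find_missing(documents_root, ecm_rows):
    """Return the (filepath, filename) pairs that have no file on disk."""
    missing = []
    for filepath, filename in ecm_rows:
        full_path = os.path.join(documents_root, filepath, filename)
        if not os.path.isfile(full_path):
            missing.append((filepath, filename))
    return missing

# Demo: one referenced file exists on disk, the other does not.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "adherent", "1"))
    with open(os.path.join(root, "adherent", "1", "Jean.pdf"), "wb") as handle:
        handle.write(b"%PDF-demo")
    rows = [("adherent/1", "Jean.pdf"), ("adherent/10", "Jeanne.pdf")]
    missing = find_missing(root, rows)

print(missing)  # [('adherent/10', 'Jeanne.pdf')]
```

Each pair in the result is a candidate for the correction described above, either by restoring the file or by removing the reference.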

SQL Queries

These queries are useful for testing and studying the behavior of Dolibarr and docDB.
Since they are only for querying, they do not imply any risk of data destruction.

Each entry below gives the SQL query, an example result, and an explanation.
select * from llx_doc_directory;
| rowid | path_name    |
|     1 |              |
|     2 | /adherent    |
|     3 | /adherent/1  |
|     4 | /adherent/10 |
|     5 | /adherent/11 |
    
The docDB table llx_doc_directory contains the list of directories that are supposed to contain the documents.
select rowid,path_name,filemtime,
octet_length(datablob) as filesize,
octet_length(exif_data) as exif_data,
octet_length(imagesize) as imagesize
from llx_doc_data order by rowid;
| rowid | path_name                | filemtime           | filesize | exif_data | imagesize |
|     1 | /adherent/1/Jean.pdf     | 2023-01-27 21:36:39 |   153092 |         4 |         4 |
|     2 | /adherent/10/Jeanne.pdf  | 2023-01-27 21:36:39 |    44421 |         4 |         4 |
|     3 | /adherent/11/Léo.pdf     | 2023-01-30 09:56:36 |    16135 |         4 |         4 |
|     4 | /adherent/14/Charlie.pdf | 2023-01-30 09:56:18 |    25034 |         4 |         4 |
|     5 | /adherent/2/Arthur.pdf   | 2023-01-27 21:36:39 |    44844 |         4 |         4 |
    
The docDB table llx_doc_data contains the documents. The datablob column is not displayable. Instead, filesize indicates the size of the document. exif_data and imagesize are encoded strings containing image characteristics if applicable (empty in the case of PDF files).
select filepath,filename from llx_ecm_files;
| filepath    | filename     |
| adherent/1  | Jean.pdf     |
| adherent/10 | Jeanne.pdf   | 
| adherent/11 | Léo.pdf      |
| adherent/14 | Charlie.pdf  |
| adherent/2  | Arthur.pdf   |
    
The Dolibarr table llx_ecm_files contains the list of documents and the corresponding filepaths.
select label from llx_ecm_directories;
| label            |
| Annual documents |
    
The Dolibarr table llx_ecm_directories contains the list of certain directories, those created by users in the EDM space.

Extending docDB

We have explained (Operating principle) that Dolibarr uses PHP functions dedicated to file management.
However, Dolibarr does not use all the relevant functions, and consequently, it was not necessary to program the emulation of all these functions.
Let's take the example where Dolibarr wants to test the presence of the PDF document corresponding to invoice FA2304-0001. To do this, it uses the function file_exists, which is transformed by docDB into file_exists_Bypass. In short, this function executes the query select rowid from llx_doc_data where path_name='/invoice/FA2304-0001/FA2304-0001.pdf'. If the query returns a row, meaning the document exists in the llx_doc_data table, docDB returns the value true.
Now, let's imagine Dolibarr wants to use the function filectime to determine the creation date of a file. This occasionally happens but never for a document. The corresponding docDB function filectime_Bypass simply calls PHP and returns the result obtained from PHP back to Dolibarr.
If, in a future version, Dolibarr were to use PHP to obtain the creation date of a document, it would be necessary to properly implement filectime_Bypass. This would involve testing if the relevant file is a document (if ($useByPass && doc_in_db($filename))), and in that case, using the appropriate SQL query. The docDB functions are relatively simple, and their use and extension are within the reach of a reasonably experienced PHP programmer.
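
The pattern described above (test whether the path is a document, answer from the database if so, otherwise hand over to the native function) can be sketched as follows. The sketch is written in Python with an in-memory SQLite database, so it is an illustration of the logic and not the PHP code of filebypass.php. The implementation of doc_in_db, the sample row, and the choice of filemtime as the second example are assumptions: filemtime is used because the llx_doc_data table is known to have a filemtime column, whereas a filectime emulation would need a source for the creation date that this sketch does not presume.

```python
import os
import sqlite3

USE_BYPASS = True  # plays the role of $useByPass

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE llx_doc_data "
    "(rowid INTEGER PRIMARY KEY, path_name TEXT, filemtime TEXT, datablob BLOB)"
)
# Demo row with made-up values.
conn.execute(
    "INSERT INTO llx_doc_data (path_name, filemtime, datablob) VALUES (?, ?, ?)",
    ("/invoice/FA2304-0001/FA2304-0001.pdf", "2023-04-01 10:00:00", b"%PDF-demo"),
)

def doc_in_db(path):
    """Hypothetical test: the path designates a document held in llx_doc_data."""
    row = conn.execute(
        "SELECT rowid FROM llx_doc_data WHERE path_name = ?", (path,)
    ).fetchone()
    return row is not None

def file_exists_bypass(path):
    """Answer from the database for documents, from the file system otherwise."""
    if USE_BYPASS and doc_in_db(path):
        return True
    return os.path.exists(path)

def filemtime_bypass(path):
    """Return the stored modification time for documents, else ask the file system."""
    if USE_BYPASS and doc_in_db(path):
        row = conn.execute(
            "SELECT filemtime FROM llx_doc_data WHERE path_name = ?", (path,)
        ).fetchone()
        return row[0]
    return os.path.getmtime(path)

print(file_exists_bypass("/invoice/FA2304-0001/FA2304-0001.pdf"))  # True
print(filemtime_bypass("/invoice/FA2304-0001/FA2304-0001.pdf"))    # 2023-04-01 10:00:00
```

Extending an emulated function then amounts to adding the database branch in front of the existing pass-through call.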
If you want to know the list of emulated functions, you can find it in the docDBmigrateScripts.py script, in the variable KEYWORDS_FOUND_16.
The non-implemented functions are listed in filebypass.php after the comment The following functions are not used in the Dolibarr documents context.

Troubleshooting

In case of any issues, activate the "Logs and Traces" module in Dolibarr, with a level of LOG_DEBUG (7).
After execution, you will find lines like 2023-05-12 15:00:02 DEBUG n.n.n.n DOC_DB ++ filesize /project/PJ2303-0025/contract_signed [370816]. In this example, the mention DOC_DB ++ indicates the call to the filesize function for the document /project/PJ2303-0025/contract_signed, returning a size of 370816. If the mention is DOC_DB --, it indicates that the returned data is from PHP without any specific processing performed by docDB.
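
When the log is large, the DOC_DB lines can be extracted with a small script. The sketch below assumes that every line follows exactly the layout of the example above (date, time, level, address, DOC_DB marker, function name, path, result between square brackets); the regular expression and the function parse_docdb_line are hypothetical and may need adjusting to your actual log file.

```python
import re

LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>\w+)\s+(?P<ip>\S+)\s+DOC_DB (?P<mark>\+\+|--) "
    r"(?P<function>\w+) (?P<path>.+?) \[(?P<result>.*)\]$"
)

def parse_docdb_line(line):
    """Split a DOC_DB log line into its fields, or return None for other lines."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    # "++" means the call was handled by docDB, "--" that it was passed to PHP.
    entry["emulated"] = entry.pop("mark") == "++"
    return entry

sample = (
    "2023-05-12 15:00:02 DEBUG n.n.n.n DOC_DB ++ filesize "
    "/project/PJ2303-0025/contract_signed [370816]"
)
entry = parse_docdb_line(sample)
print(entry["function"], entry["path"], entry["result"], entry["emulated"])
# filesize /project/PJ2303-0025/contract_signed 370816 True
```

Filtering on the emulated field quickly shows which calls are answered by docDB and which fall through to PHP.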

During the execution of the docDBmigrateFiles.py script on Windows, you may encounter a crash with a fieldnotfound error message. This indicates that you need to specify the PHP path using the --phpdir option.

On Windows, you may experience issues related to image file processing. Ensure that the PHP exif extension is properly enabled. You can follow the procedure explained in the installation guide for this purpose.

License

DocDB is under the GPL license. Refer to the COPYING file for more details.