Bidirectional File Synchronization Using a Polling Mechanism

Georgii Kapanadze

Introduction

Want to keep files synchronized across file management systems such as Dropbox, Microsoft OneDrive, or Microsoft SharePoint? Let me introduce some basic principles and best practices for bidirectional synchronization of files and documents, and show how to reduce development time and, of course, financial costs.

What do you need?

From the start, you need to deal with the application programming interfaces (APIs) of each system you want to synchronize. You have to handle authentication, learn each system's schema representation and principles, work out how data is manipulated, and bring it all into one common language. This can be achieved either through the hard work of a developer reading hundreds of pages of documentation, or with an integration platform.

As it happens, we offer an integration platform with quite a unique technology, so you might consider finding out more about our Connect Bridge. This software lets you use the APIs of various systems through simple SQL (Structured Query Language), and it does not matter whether you are a .NET, Java, or any other language developer. The schema is visualized in the Query Analyzer tool of Connect Bridge, where the developer can test queries and see the results immediately. Then you just need to maintain a database to track the changes and you are good to go.

Polling Mechanism

The polling mechanism principle is pretty simple: data from the target systems is retrieved and processed once per specified time period or on user action. And that is it.
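
To make the idea concrete, here is a minimal sketch of such a loop in Python. The names `run_polling`, `sync_once`, and `max_cycles` are hypothetical and chosen only for illustration; the actual synchronization work is assumed to live in the callable you pass in.

```python
import time
from typing import Callable, Optional


def run_polling(sync_once: Callable[[], None],
                interval_seconds: float,
                max_cycles: Optional[int] = None,
                sleep: Callable[[float], None] = time.sleep) -> int:
    """Call sync_once repeatedly, waiting interval_seconds between cycles.

    max_cycles=None keeps polling forever; max_cycles=1 behaves like a
    single run triggered by a user action. Returns the number of cycles run.
    """
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        sync_once()  # retrieve and process data from the target systems
        cycles += 1
        # Only wait if another cycle is going to follow.
        if max_cycles is None or cycles < max_cycles:
            sleep(interval_seconds)
    return cycles
```

Passing the `sleep` function as a parameter is a design choice that keeps the loop easy to test without real waiting.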

Advantages

Not all systems provide a way to trigger actions in real time after a file changes. If even one of them lacks this feature, it can cause serious complications. So the main advantage is control over the data and the timing of file synchronization. It gives you a bigger picture of what is happening and opens up the possibility of avoiding unnecessary actions.

Disadvantages

The longer the interval between polls, the greater the chance of conflicts between files.

Conflict handling

In bidirectional synchronization, where each system's changes are applied to the others, the same file may be modified in different systems at the same time, that is, between two synchronization runs. What happens then? Which one is the correct version? In this case you need to specify which system is the Master and which is the Slave in order to decide which version will be overwritten.
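
The rule can be sketched in a few lines of Python. This is only an illustration under assumptions: `FileState`, `resolve`, and the use of modification times compared against the last synchronization time are hypothetical choices, not something prescribed by any particular system.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class FileState:
    path: str
    modified: datetime


def resolve(master: FileState, slave: FileState,
            last_sync: datetime) -> Optional[str]:
    """Return the side whose version should be copied to the other side.

    Returns "master", "slave", or None when nothing changed.
    """
    master_changed = master.modified > last_sync
    slave_changed = slave.modified > last_sync
    if master_changed and slave_changed:
        return "master"  # conflict: the Master version overwrites the Slave
    if master_changed:
        return "master"
    if slave_changed:
        return "slave"
    return None
```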

Core program principle

Change recognition

To track changes, you need a database that maps the items of the target systems that have already been synchronized. An update can be recognized by the modification time, the version, or whatever else is usable and provided by the target systems. Create and delete are pretty simple: if no record of the item exists in the database, the item is new; if the item no longer exists in the target system but has a record in the database, it has been deleted in the target system. And that is it. Some systems let you ask for the changes made in a specified time period, but you would still have to track what has been synchronized, because of failures caused by the target systems or the connection.
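
The comparison described above might look like the following sketch. It assumes, purely for illustration, that both the mapping database and the target system listing can be represented as dictionaries from an item identifier to a version marker (a modification time or a version string); `detect_changes` is a hypothetical name.

```python
from typing import Dict, List, Tuple


def detect_changes(mapping: Dict[str, str],
                   remote: Dict[str, str]) -> Tuple[List[str], List[str], List[str]]:
    """Compare the stored mapping with the current listing of a target system.

    Both arguments map an item id to a version marker.
    Returns (created, updated, deleted) lists of item ids.
    """
    # No record in the database: the item is new.
    created = sorted(i for i in remote if i not in mapping)
    # Record exists but the item is gone: it was deleted in the target system.
    deleted = sorted(i for i in mapping if i not in remote)
    # Present on both sides with a different marker: it was updated.
    updated = sorted(i for i in remote
                     if i in mapping and remote[i] != mapping[i])
    return created, updated, deleted
```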

File synchronization engine

For the main synchronization logic to work with the target systems, it is good to create a Provider class for each of them and have each implement a common interface specifying the basic CRUD (Create-Read-Update-Delete) operations. Then the core algorithm does not need to care which system is which. You can just write the general logic of bidirectional synchronization, and the Provider classes will handle the manipulation itself. If a good core algorithm is implemented, it does not matter how many systems you are syncing; you can just add implementations of other providers. This algorithm needs to follow the hierarchy of masters and slaves in order to handle conflicts correctly. If you synchronize in pairs sorted by superiority, it should be fine.
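
As a rough sketch of this design in Python: the method names on `Provider`, the `InMemoryProvider` stand-in, and the `copy_missing` helper are all assumptions made for illustration. A real provider would wrap the API of Dropbox, OneDrive, SharePoint, or another system behind the same interface.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class Provider(ABC):
    """Common CRUD interface that every target system implements."""

    @abstractmethod
    def list_paths(self) -> List[str]: ...

    @abstractmethod
    def create(self, path: str, content: bytes) -> None: ...

    @abstractmethod
    def read(self, path: str) -> bytes: ...

    @abstractmethod
    def update(self, path: str, content: bytes) -> None: ...

    @abstractmethod
    def delete(self, path: str) -> None: ...


class InMemoryProvider(Provider):
    """Stand-in for a real system, useful for testing the core logic."""

    def __init__(self) -> None:
        self._files: Dict[str, bytes] = {}

    def list_paths(self) -> List[str]:
        return sorted(self._files)

    def create(self, path: str, content: bytes) -> None:
        if path in self._files:
            raise FileExistsError(path)
        self._files[path] = content

    def read(self, path: str) -> bytes:
        return self._files[path]

    def update(self, path: str, content: bytes) -> None:
        if path not in self._files:
            raise FileNotFoundError(path)
        self._files[path] = content

    def delete(self, path: str) -> None:
        del self._files[path]


def copy_missing(source: Provider, target: Provider) -> List[str]:
    """Core logic that only knows the interface, not the concrete system."""
    existing = set(target.list_paths())
    copied = []
    for path in source.list_paths():
        if path not in existing:
            target.create(path, source.read(path))
            copied.append(path)
    return copied
```

Calling `copy_missing(a, b)` and then `copy_missing(b, a)` propagates new files in both directions for one pair; update and delete handling would follow the same pattern.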

Performance

You cannot influence the create and modify operations much; the most important part is data retrieval. There is no need to retrieve all data. You can keep the time of the last file synchronization and ask the server only for items with a newer creation or modification time. Delete operations depend on server logic. Some servers provide bulk delete operations. Moreover, if a whole folder was deleted and the server deletes all sub-items of a deleted item, it does not make sense to delete them one by one.
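
The folder case can be illustrated with a small Python sketch. It assumes that the target server cascades a folder delete to everything inside the folder and that paths use `/` as a separator; `prune_nested_deletes` is a hypothetical helper name.

```python
from typing import Iterable, List


def prune_nested_deletes(deleted_paths: Iterable[str]) -> List[str]:
    """Keep only the top-most deleted paths.

    Anything located under an already deleted folder is dropped, because
    the server is assumed to remove sub-items together with the folder.
    """
    kept: List[str] = []
    # Sorting puts a folder before everything stored inside it.
    for path in sorted(set(deleted_paths)):
        if not any(path.startswith(parent + "/") for parent in kept):
            kept.append(path)
    return kept
```

For example, given `/docs`, `/docs/a.txt`, and `/img/x.png`, only `/docs` and `/img/x.png` need a delete call.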

Data consistency safety

First of all, it is not a good idea to retrieve data at different places in the code. If retrievals are separated by long-lasting operations such as file uploads, the user can change the content of the systems in the meantime, and you will be working with different data within the same program context. This causes serious trouble and might lead to data loss.
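
One way to follow this advice, sketched here under assumptions, is to retrieve the listings of all systems once at the start of a synchronization run and hand the rest of the code a read-only snapshot. The `take_snapshot` function and the dictionary-based listing format are hypothetical.

```python
from types import MappingProxyType
from typing import Callable, Dict, Mapping

Listing = Dict[str, str]  # item id -> version marker


def take_snapshot(
    listers: Dict[str, Callable[[], Listing]]
) -> Mapping[str, Mapping[str, str]]:
    """Retrieve every system's listing once and freeze the result.

    All later steps of the run read from this snapshot instead of asking
    the target systems again in the middle of long operations.
    """
    frozen = {name: MappingProxyType(dict(lister()))
              for name, lister in listers.items()}
    return MappingProxyType(frozen)
```

Changes that users make during the run are then picked up by the next run rather than mixed into the current one.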

During the process, various exceptions can occur that you cannot influence, such as an internal server error in a target system or a loss of connection. The best practice is to divide exception handling into separate units: within a unit, the code tries to run until all of its operations have been attempted, but if any of them failed, it does not continue to the next unit. It is a sort of level tree. Here is an example: your synchronization finds that 10 files in 5 folders were created in the first system. It starts creating those 5 folders in the other systems, but one of the insert operations throws an exception. It can still try to create the other 4 folders, but it should not start inserting files, because the paths of 2 of the files do not exist. This can be handled in different, more complicated ways, but trust me: keep it as simple as possible. The number of possible error scenarios in bidirectional synchronization of multiple systems is very large and, moreover, recursive.
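
The folder and file example can be expressed as a short Python sketch. The `run_stages` function and the representation of an operation as a name plus a callable are assumptions made for illustration; real code would also log the exceptions and retry on the next synchronization run.

```python
from typing import Callable, List, Sequence, Tuple

Operation = Tuple[str, Callable[[], None]]  # (name, action)


def run_stages(stages: Sequence[Sequence[Operation]]) -> Tuple[List[str], List[str]]:
    """Run stages (units) in order.

    Within a stage every operation is attempted, even if some of them fail.
    If any operation in a stage failed, later stages are not started.
    Returns (names of succeeded operations, names of failed operations).
    """
    done: List[str] = []
    failed: List[str] = []
    for stage in stages:
        stage_failed = False
        for name, action in stage:
            try:
                action()
                done.append(name)
            except Exception:
                failed.append(name)
                stage_failed = True
        if stage_failed:
            break  # e.g. do not insert files when a folder is missing
    return done, failed
```

With folder creation as the first stage and file inserts as the second, one failed folder still lets the other four be created, while no file insert is started.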

