# Default pipeline for synchronization

This document describes the default pipeline ELO Sync uses to synchronize data.

Depending on the type and direction of the synchronization job, elements are added to or removed from the pipeline.

## Overview

A synchronization pipeline can consist of an unlimited number of stages; the current default pipeline uses two stages.

Each stage contains a series of middleware components that each perform an individual step for a synchronization entry.

The next stage only begins once all entries in the previous stage have been processed. This ensures that some operations are always performed after others.

```mermaid
flowchart LR
start(Source) --> def-pipeline-stage
subgraph def-pipeline [default pipeline]
    direction TB
    subgraph def-pipeline-stage [default stage]
        direction TB
        automapping(automatic mapping)
        --> itemresolution(handle linked elements)
        --> deduplication(resolve duplicate elements)
        --> entry-limit(observe count limit)
        --> size-limit(observe size limit)
        --> classification(classify elements)
        --> difference(determine differences between elements)
        --> conflicts(resolve conflicts)
        --> execution(apply non-destructive changes)
    end
    subgraph deletion-pipeline-stage [deletion stage]
        execution2(apply destructive changes)
    end
    def-pipeline-stage --> deletion-pipeline-stage
end
```

The default and deletion stages are executed sequentially rather than in parallel.

This ensures that changes are always performed in the correct order and destructive changes do not prevent the successful execution of other changes.

In addition, this makes it easier to use non-destructive changes, as they do not have to be checked for any additional restrictions.

## Reasons for the two-stage pipeline

There is currently only one reason for using a two-stage pipeline:

The combination of moving an element and deleting its (indirect) parent element between two synchronizations.

In all systems currently supported, deleting a folder also deletes its entire contents. This leads to a problem if an element is moved from a folder and the folder is then deleted.

For performance reasons, synchronization is always performed in parallel for each entry, so it is possible that the folder is deleted in the target system before the element has been moved.

Information

An obvious solution would seem to be to replay the recorded changes in the order they occurred, but this can lead to subtle errors in edge cases:

  1. It is not guaranteed that the synchronized systems use precise clocks.
  2. Synchronization may take place well after the changes, and with bidirectional synchronization, the element may be moved in one system while the folder is deleted in the other.

As the synchronization code cannot rely on the precision of the reported timestamp, another solution must be found so that these operations can be performed successfully.

The current solution is to record all the destructive changes in a queue, which is executed once all the other changes have been processed.
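
This deferral can be illustrated with a small sketch (a simplified model, not the actual ELO Sync implementation; the tree representation and function names are hypothetical):

```python
# Sketch: applying recorded changes with all destructive operations
# deferred to a second stage. The structure is modeled as a dict
# mapping each item to its parent folder (None = root).

def apply_changes(tree, changes):
    """Apply non-destructive changes first, then queued deletions."""
    deletions = []
    for change in changes:
        if change[0] == "delete":
            deletions.append(change)      # destructive: defer to the queue
        elif change[0] == "move":
            _, item, target = change
            tree[item] = target           # non-destructive: apply immediately
    for _, folder in deletions:
        # Deleting a folder also deletes its remaining contents.
        doomed = [k for k, parent in tree.items() if parent == folder]
        for k in doomed:
            del tree[k]
        tree.pop(folder, None)
    return tree

# F1 is moved from A to B, then A is deleted. Even if the delete is
# observed before the move, F1 survives because deletions run last.
tree = {"A": None, "B": None, "F1": "A", "F2": "A"}
changes = [("delete", "A"), ("move", "F1", "B")]  # deliberately out of order
result = apply_changes(tree, changes)
```

Because the deletion queue runs only after every other change has been processed, the outcome no longer depends on the order in which the changes were reported.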

### Example

Assume the following initial structure in both systems:

  • Folder A
    • File F1
    • File F2
  • Folder B

Between two synchronization runs, a user moves the file F1 to B and then deletes folder A.

To replay these operations correctly, the changes have to be performed in the same order.

Note that the move and delete actions do not have to be performed on the same system, so it cannot be guaranteed that timestamps (or similar markers) are comparable.

Information

Timestamps can differ between systems, for example, because a system simply has the wrong time set, or an NTP time server was temporarily unavailable.

There are lots of reasons for unreliable timestamps.

A general recommendation for reliable systems is not to use timestamps when deciding on actions.

## Source

The source for the synchronization retrieves the elements from the individual systems for processing and makes them available to other components.

All readable systems are inserted in the processing queue for the source, and then a generator block is created that retrieves the elements and provides them for further processing. The generator block starts a number of parallel workers between 1 and the number of systems to be synchronized.

Each worker retrieves an element from the processing queue and performs different actions, depending on the element type.

For a system, the worker queries its root collection and inserts all the root elements provided by this collection into the processing queue. The elements are grouped in blocks of up to eight so that subsequent processing can combine multiple retrievals or changes.

If the processed element is a provider element from another collection, that collection is queried and its root elements are inserted into the processing queue. Otherwise, the children of the entry are retrieved, divided into groups of up to eight entries, and then processed like the entries of a system.
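
The grouping into blocks of up to eight can be sketched as follows (a simplified model; `BLOCK_SIZE` and `batch` are illustrative names, not actual ELO Sync identifiers):

```python
# Sketch: children of an element are grouped into blocks of at most
# eight entries so that later middleware can combine retrievals.

BLOCK_SIZE = 8  # group size used by the source, per the text

def batch(items, size=BLOCK_SIZE):
    """Split a list of elements into blocks of at most `size` entries."""
    return [items[i:i + size] for i in range(0, len(items), size)]

children = [f"entry-{n}" for n in range(19)]
blocks = batch(children)
# 19 children are split into blocks of 8, 8 and 3 entries
```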

## Executor

The executor executes each configured pipeline stage serially, and all entries within one stage in parallel.

For the first stage, the source block provided by the source is processed; later stages process their own stage queue as the source block.

After receiving an input context for processing, the executor checks whether the current context has a predecessor and, if so, waits until this is completed. Afterwards, the executor starts the actual execution of the middleware components configured for the stage.
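
The executor's scheduling contract can be sketched as follows (a simplified model with hypothetical names; the predecessor-wait logic is omitted for brevity):

```python
# Sketch: stages run strictly one after another, while the entries
# within a single stage are processed in parallel.

from concurrent.futures import ThreadPoolExecutor

def run_stage(entries, middlewares):
    """Run every middleware on every entry; entries are independent."""
    def process(entry):
        for mw in middlewares:
            entry = mw(entry)
        return entry
    with ThreadPoolExecutor() as pool:
        return list(pool.map(process, entries))

def run_pipeline(entries, stages):
    """A stage starts only after the previous stage has processed
    all entries."""
    for middlewares in stages:
        entries = run_stage(entries, middlewares)
    return entries

stages = [
    [lambda e: e + ["classified"], lambda e: e + ["diffed"]],  # default stage
    [lambda e: e + ["deleted"]],                               # deletion stage
]
result = run_pipeline([[], []], stages)
```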

## Middlewares

### Automatic mapping

This middleware automatically maps elements between the different systems based on configurable criteria.

The default configuration for this middleware maps elements based on their relative path within the synchronized structure.

#### Example

If a document library in SharePoint Online contains a file under Project/Marketing/Presentation.pptx, the ELO repository is searched for a Project folder, for a Marketing folder within it, and within that folder for a document with the short name Presentation whose current version has the file extension .pptx.
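
The path-based lookup can be sketched as follows (a hypothetical in-memory repository; the data and function names are illustrative):

```python
# Sketch: mapping an element by walking its relative path segment by
# segment, as the default path-based mapping does.

repo = {
    ("Project",): "folder-1",
    ("Project", "Marketing"): "folder-2",
    ("Project", "Marketing", "Presentation.pptx"): "doc-17",
}

def map_by_path(path):
    """Resolve a relative path to a repository id, checking that
    every intermediate folder exists along the way."""
    segments = tuple(path.split("/"))
    for depth in range(1, len(segments) + 1):
        if segments[:depth] not in repo:
            return None  # no counterpart found: element is treated as new
    return repo[segments]

mapped = map_by_path("Project/Marketing/Presentation.pptx")
unmapped = map_by_path("Project/Sales/Report.docx")
```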

### Handle linked elements

This is required if previous steps in the pipeline could only determine IDs of elements, but did not retrieve the elements themselves.

This middleware then searches for the metadata of these elements in the database and, if possible, queries the systems these elements belong to in order to determine whether the elements still exist (or have been deleted).

This ensures that all the subsequent middleware components have at least access to the metadata and the status of these elements. If elements have been deleted, these elements are replaced with corresponding markers that can be examined by other middleware.
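
The substitution of deletion markers can be sketched as follows (hypothetical names and element shapes, not the actual ELO Sync data model):

```python
# Sketch: resolving bare element IDs against a system and substituting
# a marker when an element turns out to have been deleted.

def resolve_linked(ids, lookup):
    """Return full elements for each id; deleted elements become
    markers that later middlewares can inspect."""
    resolved = []
    for element_id in ids:
        element = lookup(element_id)
        if element is None:
            resolved.append({"id": element_id, "deleted": True})
        else:
            resolved.append(element)
    return resolved

known = {"e1": {"id": "e1", "name": "Report"}}
resolved = resolve_linked(["e1", "e2"], known.get)
```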

### Resolve duplicate elements

This middleware is used for optimization to prevent elements from being synchronized multiple times in the same synchronization.

Information

The worst-case scenario occurs in delta mode (only retrieving differences): the connected systems list an element once for every change made to it, so the same element can appear many times.

If 100 changes have been made to an article since the last synchronization, for example, the system could list this article 100 times.

This would lead to at least 100 synchronizations of this element (99 of which would be unnecessary).
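
The deduplication can be sketched as follows (a simplified model; the entry shape and function name are illustrative):

```python
# Sketch: collapsing repeated delta entries so that each element is
# synchronized only once, keeping the most recent listing.

def deduplicate(entries):
    """Keep one entry per element id; a later entry wins."""
    latest = {}
    for entry in entries:
        latest[entry["id"]] = entry
    return list(latest.values())

# An article changed three times appears three times in the delta feed.
feed = [{"id": "a1", "rev": r} for r in (1, 2, 3)] + [{"id": "a2", "rev": 1}]
unique = deduplicate(feed)
```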

### Observe count limit

This middleware enforces the configured entry limit for a synchronization job.

If no limit is configured, this middleware does nothing.

If the configured limit is reached, an approval is generated automatically, or, if a hard limit is configured, all the other entries are skipped.

No additional elements are synchronized until approval has been confirmed by a user.

Information

All entries are retrieved from the system, even if the limit has already been reached.

This is required to specify the exact number of entries.
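
The limit behavior described above can be sketched as follows (hypothetical names; a simplified model of the soft/hard limit distinction):

```python
# Sketch: once the limit is reached, an approval is requested (soft
# limit) or the remaining entries are simply skipped (hard limit).

def enforce_count_limit(entries, limit, hard=False):
    """Return (entries to sync now, entries held back, approval needed)."""
    if limit is None:
        return entries, [], False        # no limit configured: no-op
    allowed, held = entries[:limit], entries[limit:]
    approval_needed = bool(held) and not hard
    return allowed, held, approval_needed

allowed, held, approval = enforce_count_limit(list(range(10)), limit=3)
# 3 entries proceed, 7 are held back pending a user's approval
```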

### Observe size limit

This middleware enforces the configured file size limit for synchronized files.

If no limit is configured, this middleware does nothing.

An approval is generated for each file that exceeds the configured limit, or, if a hard limit is configured, the file is not synchronized.

The file is not synchronized until approval has been granted by a user.
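
The per-file check can be sketched as follows (a simplified model; files are represented as hypothetical name/size pairs):

```python
# Sketch: oversized files are either held for approval (soft limit)
# or skipped outright (hard limit); files within the limit pass through.

def check_size_limit(files, limit_bytes, hard=False):
    """Return (files within the limit, approval requests for the rest)."""
    within = [f for f in files if f[1] <= limit_bytes]
    oversized = [f for f in files if f[1] > limit_bytes]
    approvals = [] if hard else [name for name, _ in oversized]
    return within, approvals

files = [("small.txt", 10_000), ("huge.iso", 5_000_000_000)]
within, approvals = check_size_limit(files, limit_bytes=100_000_000)
```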

### Classify elements

This middleware ensures that the status classification of objects is correct, or determines their status if unknown.

As with handling linked elements, this middleware is used to ensure that later middlewares have a consistent status for all the synchronized items.

This middleware has the following tasks:

  • To ensure that new elements are marked as new
  • To ensure that changed elements are actually marked as changed
  • To ensure that unchanged elements are actually marked as unchanged
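
The classification can be sketched as follows (a simplified model, assuming a content hash is recorded per element at the last synchronization; the field names are hypothetical):

```python
# Sketch: classifying each element as new, changed or unchanged by
# comparing its current hash with the one recorded last time.

def classify(element, last_known):
    """Return 'new', 'changed' or 'unchanged' for one element."""
    if element["id"] not in last_known:
        return "new"
    if last_known[element["id"]] != element["hash"]:
        return "changed"
    return "unchanged"

last_known = {"e1": "abc", "e2": "def"}
states = [classify(e, last_known) for e in (
    {"id": "e3", "hash": "zzz"},   # never seen before
    {"id": "e1", "hash": "xyz"},   # hash differs
    {"id": "e2", "hash": "def"},   # hash matches
)]
```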

### Determine differences between elements

This middleware performs most of the analysis during synchronization.

It determines which changes were made to elements and whether these changes conflict with other changes.

To begin, a prototype is determined from all the connected elements.

This prototype is used as a comparison template for all other elements.

It then retrieves all fields from the prototype and, for each field, determines the corresponding fields in the other elements.

Each field entry is then checked for changes and, where necessary and possible, its contents are compared.

If there are no conflicting changes, a corresponding object is created that enables the change to be processed later. The modification object contains all the information required to make the change.

If multiple conflicting changes are made in different fields, a conflict is generated. A new conflict is created with all relevant data, including the elements and fields that are affected by this conflict.
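
The prototype-based comparison can be sketched as follows (a strongly simplified model treating elements as flat field dictionaries; the real analysis is more involved):

```python
# Sketch: one element serves as the prototype, each field is compared
# across all elements, and a field with more than one distinct changed
# value yields a conflict instead of a modification.

def diff_fields(elements):
    """Return (modifications, conflicting field names)."""
    prototype = elements[0]
    modifications, conflicts = {}, []
    for field in prototype:
        values = {e.get(field) for e in elements}
        if len(values) == 1:
            continue                      # field unchanged everywhere
        if len(values) == 2:
            # exactly one changed value: record a modification
            modifications[field] = values - {prototype[field]}
        else:
            conflicts.append(field)       # multiple conflicting changes
    return modifications, conflicts

a = {"name": "Report", "owner": "alice"}
b = {"name": "Report-v2", "owner": "alice"}
mods, conflicts = diff_fields([a, b])

c = {"name": "Report-final", "owner": "alice"}
mods3, conflicts3 = diff_fields([a, b, c])  # two competing renames
```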

### Resolve conflicts

This middleware is used to process and resolve conflicts.

If a handler can resolve a conflict, a corresponding modification is created and placed in a queue with the other changes.

If no handler can provide a solution, the conflict is written in the technical log and the synchronization of this entry is canceled to ensure that no incorrect changes are made.
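
The handler chain can be sketched as follows (hypothetical handler and conflict shapes; the `prefer_newer` policy is purely illustrative):

```python
# Sketch: the first handler that can resolve a conflict produces a
# modification; otherwise the conflict is logged and the entry is
# cancelled by returning nothing.

def resolve(conflict, handlers, log):
    for handler in handlers:
        modification = handler(conflict)
        if modification is not None:
            return modification           # queued with the other changes
    log.append(f"unresolved conflict: {conflict['field']}")
    return None                           # synchronization of this entry stops

# Illustrative handler: only knows how to pick the higher revision.
prefer_newer = lambda c: {"apply": max(c["values"])} if c["field"] == "rev" else None

log = []
resolved = resolve({"field": "rev", "values": [1, 2]}, [prefer_newer], log)
dropped = resolve({"field": "name", "values": ["a", "b"]}, [prefer_newer], log)
```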

### Apply non-destructive changes

This middleware performs the actual synchronization after all analyses are complete.

The middleware is added to several stages and makes different types of changes, depending on the stage.

The actual implementation of changes is delegated to the registered change handlers, which can move changes to later stages or even rearrange pending changes within the current stage.

The currently registered handlers behave as follows:

  • In the default stage, all changes are performed directly, with the exception of deletions. The execution order depends on the intra-entry staging of the changes previously defined by the analysis middleware.
  • All destructive changes are rescheduled in the deletion stage.
  • All remaining scheduled changes are performed in the deletion stage.
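
The staging behavior of these handlers can be sketched as follows (a simplified model with hypothetical change records):

```python
# Sketch: in the default stage all changes except deletions are
# applied; deletions are rescheduled, and the deletion stage then
# applies whatever remains.

def run_default_stage(changes):
    applied, rescheduled = [], []
    for change in changes:
        if change["kind"] == "delete":
            rescheduled.append(change)    # destructive: defer
        else:
            applied.append(change)        # non-destructive: apply now
    return applied, rescheduled

def run_deletion_stage(changes):
    return list(changes)                  # apply all remaining changes

changes = [
    {"kind": "move", "item": "F1"},
    {"kind": "delete", "item": "A"},
    {"kind": "update", "item": "F2"},
]
applied, deferred = run_default_stage(changes)
deleted = run_deletion_stage(deferred)
```
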
Last updated: May 16, 2025 at 9:13 AM