Tech Thoughts

Synchronizing MongoDB with Oracle using PySpark

02.09.2024

One of our clients needed to have an updated copy of his data from MongoDB to an Oracle database, in order to integrate with their analytics software and perform analysis. To achieve this, we opted to use Apache Spark, specifically the Python version.

We faced the following challenges:

High Volume of Data: MongoDB contains a substantial number of documents, making it impractical to synchronize all data each time. We needed a method to extract only the documents that had changed since the last synchronization.

Lack of Upsert Functionality: Spark does not natively support upsert operations, which are crucial for updating existing documents and adding new ones in Oracle.

To address these challenges, we developed a PySpark script that runs periodically to manage the synchronization process. The key steps involved are:

The script starts by establishing connections to both MongoDB and Oracle.
Using a “last_update” field, managed at the application level, our PySpark pipeline extracts only the documents that have been modified since the last synchronization from the MongoDB collection. These documents are loaded into a PySpark DataFrame.
The DataFrame is then mapped to the columns of the Oracle table, with each column converted to the correct data type (for example, strings containing dates are converted to a Date type).
The transformed data is written to a temporary table in Oracle, which has the same structure as the final table, using the overwrite mode. This ensures that only the new or updated data is present in the temporary table.
Finally, a merge query is executed from the temporary table to the final table using the MongoDB ID as the key. This allows to update existing rows and add new ones efficiently. The merge operation is crucial as it prevents data duplication and maintains data integrity.

We also considered an alternative approach: extracting the keys of new or modified documents, deleting those rows from the Oracle table and inserting the updated versions in append mode. However, this solution had two major drawbacks:

Deleting and then inserting rows is less efficient compared to the merge operation.

The alternative approach posed a higher risk of data loss, as deleting existing rows before inserting new ones could lead to inconsistencies if any issues occur during the process.

In conclusion, synchronizing MongoDB with Oracle is a complex challenge, but with the approach outlined above we developed a robust and efficient solution. If you wish to learn more about the implementation or have specific questions, feel free to reach out!

Data Synchronization Spark MongoDB PySpark Oracle Database ETL Apache Spark