Tech Thoughts

Synchronizing MongoDB with Oracle using PySpark

02.09.2024


Agentic RAG schema

One of our clients needed to have an updated copy of his data from MongoDB to an Oracle database, in order to integrate with their analytics software and perform analysis. To achieve this, we opted to use Apache Spark, specifically the Python version.


We faced the following challenges:



To address these challenges, we developed a PySpark script that runs periodically to manage the synchronization process. The key steps involved are:


  1. The script starts by establishing connections to both MongoDB and Oracle.
  2. Using a “last_update” field, managed at the application level, our PySpark pipeline extracts only the documents that have been modified since the last synchronization from the MongoDB collection. These documents are loaded into a PySpark DataFrame.
  3. The DataFrame is then mapped to the columns of the Oracle table, with each column converted to the correct data type (for example, strings containing dates are converted to a Date type).
  4. The transformed data is written to a temporary table in Oracle, which has the same structure as the final table, using the overwrite mode. This ensures that only the new or updated data is present in the temporary table.
  5. Finally, a merge query is executed from the temporary table to the final table using the MongoDB ID as the key. This allows to update existing rows and add new ones efficiently. The merge operation is crucial as it prevents data duplication and maintains data integrity.

We also considered an alternative approach: extracting the keys of new or modified documents, deleting those rows from the Oracle table and inserting the updated versions in append mode. However, this solution had two major drawbacks:



In conclusion, synchronizing MongoDB with Oracle is a complex challenge, but with the approach outlined above we developed a robust and efficient solution. If you wish to learn more about the implementation or have specific questions, feel free to reach out!

Data Synchronization Spark MongoDB PySpark Oracle Database ETL Apache Spark