There are two basic ways to enhance performance. One is upgrading the VM's CPU so the database engine can perform to its full potential; the other is optimizing the pipeline processes themselves.
We focus on the second approach by inspecting SQL script performance. One thing we notice is that some calculation steps reference multiple tables only to produce a constant. Since the result is constant, the calculation can be done once a day and the value saved to a table, instead of being recomputed inside every downstream query.
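As a rough sketch of the idea, here is a minimal example using SQLite and illustrative table names (`rates`, `weights`, `daily_constants` are hypothetical, not our real schema): the multi-table calculation runs once in a daily job, and every other query reads the cached value.

```python
import sqlite3

def refresh_daily_constant(conn: sqlite3.Connection) -> None:
    """Run the expensive multi-table calculation once and persist
    the result, so downstream queries read a single cached row."""
    conn.execute("""
        INSERT OR REPLACE INTO daily_constants (name, value)
        VALUES ('weighted_rate',
                (SELECT SUM(r.rate * w.weight)
                 FROM rates r JOIN weights w USING (region)))
    """)
    conn.commit()

# Illustrative setup and daily refresh.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE rates (region TEXT, rate REAL);
    CREATE TABLE weights (region TEXT, weight REAL);
    CREATE TABLE daily_constants (name TEXT PRIMARY KEY, value REAL);
""")
conn.executemany("INSERT INTO rates VALUES (?, ?)", [("eu", 1.2), ("us", 1.0)])
conn.executemany("INSERT INTO weights VALUES (?, ?)", [("eu", 0.4), ("us", 0.6)])
refresh_daily_constant(conn)

# Downstream queries now do a cheap single-row lookup.
value = conn.execute(
    "SELECT value FROM daily_constants WHERE name = 'weighted_rate'"
).fetchone()[0]
```

In production the refresh would be scheduled (e.g. as a daily Dagster job) rather than called inline like this.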
For more complex processes, we break the script up, splitting the work between Python (run on Dagster) and SQL scripts. As a result, we see a massive improvement in query performance without sacrificing the readability or traceability of the pipeline. Breaking up the monolithic process this way also helps us catch errors quickly, since a failure points at one small step instead of one giant script.
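The shape of that split can be sketched as below. This is a simplified illustration, not our actual pipeline: the table names and threshold are made up, and in the real setup each function would be a separate Dagster op or asset so each step is logged and retried independently.

```python
import sqlite3

def extract_orders(conn: sqlite3.Connection) -> list[tuple]:
    """Step 1 (SQL): pull the raw rows the transform needs."""
    return conn.execute("SELECT id, amount FROM orders").fetchall()

def flag_large_orders(rows: list[tuple], threshold: float = 100.0) -> list[tuple]:
    """Step 2 (Python): logic that was awkward to express in SQL."""
    return [(oid, amount, amount >= threshold) for oid, amount in rows]

def load_flags(conn: sqlite3.Connection, flagged: list[tuple]) -> None:
    """Step 3 (SQL): write the result back for downstream queries."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS order_flags "
        "(id INTEGER, amount REAL, is_large INTEGER)"
    )
    conn.executemany("INSERT INTO order_flags VALUES (?, ?, ?)", flagged)
    conn.commit()

# Illustrative run against an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 50.0), (2, 250.0)])
load_flags(conn, flag_large_orders(extract_orders(conn)))
```

Because each step has a name and a narrow responsibility, a failure in the middle of the pipeline is traceable to one function rather than buried in a monolithic SQL script.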
We are far from done. Building an in-house data pipeline when outside parties manage most of the data sources is challenging, and having standards in place to govern every decision is essential. We want the company's goals and the team's goals to move forward together.