DBT Orchestration

Chaim Turkel
Published in Israeli Tech Radar
6 min read · Aug 20, 2023

DBT is an incredible tool worth exploring. If you’re unfamiliar with it, I encourage you to search for it online, give it a try, and then return here.

In a nutshell:

DBT is an abstraction platform for generating SQL ETLs across multiple vendors. It takes ETL to a new level by bringing software engineering best practices to the ELT process.

So in a standard DBT project, you will have hundreds to thousands of SQL models. These files might be separated into multiple modules or even separate git repos.

As we all know, each model or set of models has to be scheduled at different times. Some modules we might run every hour, others on a daily basis, and reports on a monthly basis.

Open-source DBT itself does not give you a framework for this. If you have the cloud version, there is a job scheduling service.

So what are your options for orchestration if you are using the open-source DBT CLI?

DBT CLI

DBT gives you a command-line tool that you can use to run models. For example, to run a single model you would run:

dbt run --select modelname

There is a wide variety of ways to select the models you want to run. You can specify names, tags, paths, and more.

# multiple arguments can be provided to --select
$ dbt run --select my_first_model my_second_model

# these arguments can be projects, models, directory paths, tags, or sources
$ dbt run --select tag:nightly my_model finance.base.*

# use methods and intersections for more complex selectors
$ dbt run --select path:marts/finance,tag:nightly,config.materialized:table

This approach may suffice for a new project, but it may not meet the needs of a larger, more established project.

To begin with, a job involving several models may run for a few hours. To understand its progress, you have to review the logs. Furthermore, you need to examine the graph of models to determine which ones have succeeded and which have failed. In the event of a failure, it is essential to identify the specific model that failed, rather than just noting that the entire job has failed.

All of these issues can be addressed, even if they are a nuisance. But there is one issue that cannot be solved using the DBT CLI (as of version 1.5). In addition to the DBT run command, there is a DBT test command. The test command runs a series of tests on your data; you can write very complex tests, and there is even a Great Expectations-inspired package for DBT (dbt-expectations). With the DBT CLI you can run a tree of models and then run the same tree with the tests. What you cannot do is run each model together with its tests. This is mandatory, since if a test fails I do not want to continue running the graph. This is where we must move to an orchestrator to solve the problem.
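To make the requirement concrete, here is a minimal sketch of what we want the orchestrator to do, assuming the models are already listed in dependency order (the model names below are placeholders):

import subprocess

# Hypothetical models, already sorted in dependency order.
models = ["stg_orders", "orders", "orders_report"]

for model in models:
    # Build the model, then immediately test it.
    subprocess.run(["dbt", "run", "--select", model], check=True)
    # check=True raises on failure, so a failed test stops the
    # downstream models from running.
    subprocess.run(["dbt", "test", "--select", model], check=True)

In practice you also want parallelism and per-model visibility, which is exactly what an orchestrator gives you.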

DBT Manifest

DBT has a useful feature: it creates a JSON manifest file (target/manifest.json) that includes all the models and the dependencies between them. This file can be used to generate an orchestration graph for all the models without any manual work.
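As a rough sketch (assuming the project has been compiled, so target/manifest.json exists), extracting the model graph looks something like this:

import json

# dbt writes the manifest to the target directory on compile/run.
with open("target/manifest.json") as f:
    manifest = json.load(f)

# Keep only models; the manifest also contains tests, seeds, sources, etc.
models = {
    name: node
    for name, node in manifest["nodes"].items()
    if node["resource_type"] == "model"
}

for name, node in models.items():
    # Upstream dependencies are listed under depends_on.nodes.
    parents = [p for p in node["depends_on"]["nodes"] if p in models]
    print(name, "->", parents)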

DBT && Airflow

Airflow is a popular orchestrator based on Python. It can dynamically load Python files that define DAGs. Airflow is often paired with DBT and was a common choice even before the release of DBT version 1.0. There have been several articles on how to use the manifest file to generate Python DAG files for Airflow.
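The general idea is simple: walk the manifest and emit one run task and one test task per model. The sketch below is not taken from any of those articles; it assumes target/manifest.json is available where Airflow parses the DAG file, and it skips credentials, retries, and error handling:

import json
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with open("target/manifest.json") as f:
    manifest = json.load(f)

models = {k: v for k, v in manifest["nodes"].items()
          if v["resource_type"] == "model"}

with DAG(dag_id="dbt_daily", start_date=datetime(2023, 1, 1),
         schedule_interval="@daily", catchup=False):
    run_tasks, test_tasks = {}, {}
    for key, node in models.items():
        model = node["name"]
        run_tasks[key] = BashOperator(task_id=f"run_{model}",
                                      bash_command=f"dbt run --select {model}")
        test_tasks[key] = BashOperator(task_id=f"test_{model}",
                                       bash_command=f"dbt test --select {model}")
        # Test every model immediately after it is built.
        run_tasks[key] >> test_tasks[key]

    for key, node in models.items():
        for parent in node["depends_on"]["nodes"]:
            if parent in models:
                # A model only runs after its parents were built and tested.
                test_tasks[parent] >> run_tasks[key]

With one task per model you get per-model status and retries in the Airflow UI, instead of a single opaque dbt run.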

Astronomer provides a hosted Airflow platform. They recently published an informative article on how to generate an Airflow DAG from the manifest, in order to encourage the use of their platform. You can view the article at Building a Scalable Analytics Architecture With Airflow and dbt.

During one of my previous projects, we used Airflow as the orchestrator and established a complete CI/CD pipeline. After the project was built and thoroughly tested, the Airflow DAGs were automatically generated and deployed.

Issues with Airflow

After working with Airflow for a while, I have come to the conclusion that it is not the best orchestrator for DBT.

The reasons for this are as follows:

  1. A full Airflow graph is generated from all models, even though, in my experience, about 60–70% of the models are actually views. This means that Airflow spends a lot of time (we used Databricks, where spinning up a cluster can take up to 5 min) on views -> something that does not change after the first run.
  2. DAG upgrades -> When deploying a new DAG to Airflow with different schedules, it happens fairly frequently that a new table is run before an upstream table (that is also new) has been run. If table A runs once a day, and table C, which depends on A and B, runs every hour (B is also updated every hour), then C can fail because A has not been built yet.

We were unable to resolve the first issue and left it as is. As for the second issue, we decided that if a table fails because an upstream table has not yet been run, it is not the end of the world, as it will pass on a later run. This is, admittedly, not an ideal solution.

DBT && ArgoWorkflow

For my next DBT project, we opted to switch from Airflow to Argo Workflows. This platform is rooted in Argo and Kubernetes, and it handles scheduled jobs that run as Docker containers. The process entails writing a YAML document that describes the DAG, and Argo manages the container execution. Given that with DBT we bundle all the models into a Docker image (for version control and pre-downloading packages), Argo proved to be an excellent choice.

Argo offers a feature that is not present in Airflow - the ability to deploy a workflow that runs only once. This means you can either set up a workflow to run just once (a Workflow) or schedule it to run on a regular basis (a CronWorkflow).

We set up two workflows on Argo. The first one executes the entire DAG, covering all the views and tables. This ensures that the tables are created in the proper order, eliminating any chance of failure. The second workflow is scheduled and runs only the tables in DBT, excluding the views.
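As a rough sketch of how the scheduled workflow can be generated from the manifest with Python and PyYAML (the image name and the cron schedule are placeholders, and dependencies that pass through a view are simply dropped here):

import json
import yaml

with open("target/manifest.json") as f:
    manifest = json.load(f)

# Only schedule tables and incremental models; views go in the run-once workflow.
tables = {k: v for k, v in manifest["nodes"].items()
          if v["resource_type"] == "model"
          and v["config"]["materialized"] != "view"}

def task_name(node):
    # Argo task names must be DNS-friendly, so swap underscores for hyphens.
    return node["name"].replace("_", "-")

tasks = []
for key, node in tables.items():
    deps = [task_name(tables[p]) for p in node["depends_on"]["nodes"] if p in tables]
    tasks.append({
        "name": task_name(node),
        "template": "dbt-model",
        "dependencies": deps,
        "arguments": {"parameters": [{"name": "model", "value": node["name"]}]},
    })

cron_workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "CronWorkflow",
    "metadata": {"name": "dbt-tables"},
    "spec": {
        "schedule": "0 * * * *",  # placeholder: hourly
        "workflowSpec": {
            "entrypoint": "dbt-dag",
            "templates": [
                {"name": "dbt-dag", "dag": {"tasks": tasks}},
                {"name": "dbt-model",
                 "inputs": {"parameters": [{"name": "model"}]},
                 "container": {
                     "image": "my-registry/dbt-models:latest",  # placeholder image
                     "command": ["/bin/sh", "-c"],
                     "args": ["dbt run --select {{inputs.parameters.model}} && "
                              "dbt test --select {{inputs.parameters.model}}"]}},
            ],
        },
    },
}

print(yaml.safe_dump(cron_workflow, sort_keys=False))

The run-once workflow that covers all the views and tables is essentially the same document with kind: Workflow and no schedule.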

One of the drawbacks of using Argo Workflows is that in order to add logic that runs before or after tasks, you must build a Docker image for it. In contrast, with Airflow, you can simply add Python code in the appropriate place.

Dagster

I haven't had the chance to work with Dagster myself, but according to the documentation, it has an integration for DBT and seems like a promising project. One of its standout features is declarative scheduling, which allows you to specify how often each table should be updated and Dagster will trigger them accordingly based on the graph. This approach to scheduling is quite innovative and definitely worth exploring further.

Summary

After analyzing the pros and cons of various orchestrators, we have found that there is currently no fully satisfactory solution to this problem with DBT. While Airflow is a reliable option, I personally believe that Argo Workflows, which is entirely Docker-based, is a better fit. Additionally, Dagster is an innovative direction worth exploring.
