Trying to make sense of the ETL Landscape

Chaim Turkel · Published in Israeli Tech Radar · Jun 25, 2023

Having been assigned the task of delivering a lightning talk on Benthos, I decided to have a look around to see what else is in the ETL arena.

In the past I have worked with Jenkins, Airflow, and Argo Workflows. I have read about Apache NiFi, Airbyte, Stitch, and Fivetran. I am now officially confused about why there are so many tools that are so similar.

Before you go out and try them all, you first need to define your requirements.

Build Tools

I believe it all started with build tools. Each ecosystem has its own local build tool, whether it be Maven, npm, tox, Cargo, or another.

Once we had a local build system, we needed to move it to a build server and add a GUI to it, so that we could trigger builds manually or automate them as part of a PR.

This brought us tools like Jenkins (Hudson, for those who remember) and MSBuild.

The focus of these tools was to compile our software, run unit tests and maybe some integration tests, and then package the software for production.

Cron Tools / Orchestrators

Once deployed, we found ourselves needing to run tasks on different schedules. This brought us a variety of cron tools, ranging from standard cron to Kubernetes CronJobs and hosted schedulers like Cronhub. Jenkins was a very good tool that moved from the build arena into the cron arena; now we could trigger multiple systems from the same tool.

Flows

But then we needed flows. Say our objective is to transfer data from a specific machine to a designated bucket. Once the data has been successfully transferred, we want to kick off a Spark job to process it. Then, once the Spark job completes, we need to trigger report generation.

In this space, Airflow was born.

Airflow was positioned as an orchestration tool. The selling point was that Airflow came with hundreds of operators built in, so you did not need to write much code, and you could always write your own operators. If you wanted to trigger a Spark job, wait for it to finish, and then trigger something else, Airflow was the perfect tool.
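To make that "trigger Spark, then something else" pattern concrete, here is a minimal sketch of such a DAG. It assumes the apache-airflow-providers-apache-spark package is installed; the application path, connection ID, and report step are placeholders rather than anyone's real setup.

```python
# A minimal sketch: submit a Spark job and only generate reports once it succeeds.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator


def generate_reports():
    print("building reports from the Spark job output...")


with DAG(
    dag_id="spark_then_reports",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_spark = SparkSubmitOperator(
        task_id="run_spark_job",
        application="/opt/jobs/process_data.py",  # placeholder path
        conn_id="spark_default",
    )
    reports = PythonOperator(task_id="generate_reports", python_callable=generate_reports)

    run_spark >> reports  # the report task runs only after the Spark job finishes
```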

During that time, many bloggers were busy discussing the differences between Jenkins and Airflow. The main contrast was that Airflow went beyond being a simple cron job that runs when needed. Unlike Jenkins, Airflow had a data-centric approach. Each job in Airflow was aware of the data window it should handle, and it offered additional features such as backfilling.

Over time, Airflow evolved from being a mere orchestrator to a full-fledged data ETL (Extract, Transform, Load) platform. It provided a convenient way to write Python operators for custom data transformations. Moreover, numerous operators were introduced to facilitate seamless integration with popular cloud platforms. Thus, Airflow transitioned from being primarily an orchestrator to a tool that encompassed aspects of ELT (Extract, Load, Transform). For example, it became possible to use Airflow for moving data from S3 to GCP and then executing Dataflow jobs on that data. While other tools like Luigi existed, Airflow gained significant traction and outpaced its competitors.
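As a rough illustration of that S3-to-GCP example, the sketch below chains a transfer operator with a Dataflow operator from the Amazon and Google provider packages. Bucket names, the Dataflow template, and connection settings are placeholders, and exact operator arguments can vary between provider versions; the templated prefix also hints at the data-window and backfill behavior mentioned earlier.

```python
# A hedged sketch of an ELT-style DAG: copy files from S3 to GCS, then run Dataflow.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.transfers.s3_to_gcs import S3ToGCSOperator
from airflow.providers.google.cloud.operators.dataflow import DataflowTemplatedJobStartOperator

with DAG(
    dag_id="s3_to_gcp_elt",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=True,  # backfilling: past data intervals can be re-run
) as dag:
    copy_to_gcs = S3ToGCSOperator(
        task_id="copy_s3_to_gcs",
        bucket="my-s3-bucket",
        prefix="exports/{{ ds }}/",  # templated with the run's data interval
        dest_gcs="gs://my-gcs-bucket/landing/",
    )
    run_dataflow = DataflowTemplatedJobStartOperator(
        task_id="run_dataflow_job",
        job_name="transform-{{ ds_nodash }}",
        template="gs://my-templates/transform-template",  # placeholder template
        location="us-central1",
    )

    copy_to_gcs >> run_dataflow
```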

Data Pipelines

After Airflow solidified its position in the emerging field, the discourse shifted toward the concept of data pipelines.

A data pipeline refers to a series of steps and actions that enable the smooth movement of data from diverse sources to a designated destination, often a data storage or analytics system. It encompasses the extraction of data from various sources, the conversion of data into a suitable format through transformations, and the loading of data into a target system for subsequent analysis or utilization.

Airflow provided a comprehensive set of building blocks such as tasks, DAGs (Directed Acyclic Graphs), sensors, and cron-style scheduling, which made it remarkably easy to design and assemble data pipelines.
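For a concrete feel of those building blocks, here is a hedged sketch of a small pipeline in which a sensor waits for a file to land before a transform task runs. The bucket, key pattern, and transform logic are placeholders, and it assumes the Amazon provider package for the sensor.

```python
# A small sketch combining a sensor, a task, and a schedule into one DAG.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor


def transform(**context):
    # In a real pipeline this would read the file and load it into the target system.
    print(f"processing data for {context['ds']}")


with DAG(
    dag_id="sensor_driven_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    wait_for_file = S3KeySensor(
        task_id="wait_for_export",
        bucket_name="my-source-bucket",
        bucket_key="exports/{{ ds }}/data.csv",
        poke_interval=60,  # check for the file every minute
    )
    load = PythonOperator(task_id="transform_and_load", python_callable=transform)

    wait_for_file >> load
```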

Airflow gained significant popularity, but it was not without limitations. One notable drawback was weak support for low-latency jobs, as the framework was primarily designed for long-running tasks. This created an opportunity for competitors to enter the market and address these shortcomings, and Prefect and Dagster were among the contenders that emerged. Prefect distinguished itself by offering better support for low-latency jobs and improved concurrency in its scheduler. Dagster, on the other hand, focused on lightweight tasks and emphasized the separation of environments to make development easier. Dagster also introduced a unique approach by placing significant emphasis on data assets: while Airflow primarily focused on tasks, Dagster concentrated on both the tasks and the data they produce, keeping valuable metadata accessible throughout the pipeline.
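To illustrate Dagster's asset-centric approach, here is a minimal sketch assuming a recent Dagster version; the asset names and data are invented for the example. Instead of wiring up tasks, you declare data assets, and Dagster derives the dependency graph from the function signatures while recording metadata about each materialization.

```python
# A minimal Dagster sketch: two assets, with the dependency inferred from the parameter name.
import pandas as pd
from dagster import Definitions, asset


@asset
def raw_orders() -> pd.DataFrame:
    # In a real pipeline this would pull from an API, a bucket, or a database.
    return pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 25.5, 7.25]})


@asset
def total_revenue(raw_orders: pd.DataFrame) -> pd.DataFrame:
    # Depends on raw_orders simply by naming it as a parameter.
    return pd.DataFrame({"total": [raw_orders["amount"].sum()]})


defs = Definitions(assets=[raw_orders, total_revenue])
```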

K8s Revolution

Airflow gained popularity due to its extensive support for a wide range of operators. However, as the industry shifted towards Kubernetes, the demand for such operators diminished. Instead, the focus shifted towards utilizing Kubernetes pods to handle the workload, as Airflow was primarily designed for orchestration rather than performing the actual data processing. Whenever I set up Airflow with Kubernetes, I couldn’t shake the feeling that something was missing. Kubernetes already supports cron scheduling, and I questioned the necessity of Airflow for simply setting up tasks on Kubernetes. That’s when I came across Argo Workflows. In my view, if all your tasks are running on Kubernetes, Argo Workflows serves as the best alternative to Airflow. While it may not be flawless, it represents a step in the right direction.

Warehouse Revolution

The advent of big data warehouses marked the next significant revolution in the data industry. Before this stage, the prevalent approach involved constructing data lakes and relying heavily on Spark jobs to read, transform, and load vast amounts of data into and out of the data lake.

With the emergence of data warehouses such as Redshift, BigQuery, Snowflake, and Databricks, a significant shift occurred in the way data transformations were performed. The availability of these data warehouses enabled us to leverage the ubiquity of the SQL language for all our transformation needs. This return to the familiar realm of relational database management systems (RDBMS) relieved us from the complexities of handling data transformations using Spark jobs or other custom solutions. As a result, we could devote more attention to fulfilling business requirements rather than being overly concerned with the underlying technology, including CPU and memory-related challenges.
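As a small illustration of this shift, the transformation below is plain SQL executed inside the warehouse (BigQuery in this sketch), with Python doing nothing more than submitting the query. It assumes the google-cloud-bigquery client library and configured credentials; the dataset and table names are placeholders.

```python
# The warehouse does the heavy lifting; Python only submits the SQL.
from google.cloud import bigquery

client = bigquery.Client()

transform_sql = """
CREATE OR REPLACE TABLE analytics.daily_revenue AS
SELECT order_date, SUM(amount) AS revenue
FROM raw.orders
GROUP BY order_date
"""

# No Spark cluster to size or tune; the engine handles CPU and memory concerns.
client.query(transform_sql).result()
```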

DBT Revolution

The challenge of modeling data in ELT (Extract, Load, Transform) processes has long been a significant concern. However, in 2018, a game-changing solution called DBT (Data Build Tool) was introduced. DBT revolutionized the entire approach to data transformation and data lineage management. Following its release, many tools in the data industry began prioritizing support for DBT, leading to a noticeable shift in focus.

As the data warehouse revolution gained momentum, the significance of DBT grew even further. Since the majority of data transformations are commonly written in SQL, it made sense to leverage the benefits of version control with Git and utilize DBT graphs for managing those transformations. This approach not only facilitated better collaboration and code management but also provided valuable data lineage capabilities along the way.

Data Ingestion

Previously, there was a trend toward developing tools that offered a holistic end-to-end ELT solution. However, the current landscape showcases a notable shift towards specialization, with a greater emphasis on individual components focusing on their specific areas (although some overlap remains between them).

While products like Fivetran and Stitch have long been established in the market, the recent surge in the no-code and low-code movement has paved the way for a new wave of open-source projects in this field.

As I reviewed the projects, a sense of nostalgia for my past came rushing back. About 10 years ago, there was a surge of initiatives that implemented Enterprise Integration Patterns. These projects went beyond mere code for transformations and introduced various components such as filters, translators, aggregators, message routers, and more. Interestingly, it appears that all of these concepts have now resurfaced in the no-code landscape, albeit without the flashy buzzwords that accompanied them back then.

Tools like Airbyte prioritize the crucial task of efficiently and effectively transporting data from diverse sources to distinct destinations.

Benthos places a strong emphasis on achieving high-performance capabilities and excels in handling complex and sophisticated data transformations.

Mage is a tool that aims to fully replace Airflow, positioning itself as an upgraded version of it. It provides a convenient platform for writing Python code, leveraging annotations (decorators) and storing metadata about the data at each step of the process.
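For a sense of what that annotation style looks like, here is a rough sketch of a Mage transformer block. The decorator import path and names follow Mage's generated block templates as I remember them, so treat them as assumptions; the transformation itself is a placeholder.

```python
# A rough sketch of a Mage transformer block with a test attached to it.
import pandas as pd

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test


@transformer
def clean_orders(df: pd.DataFrame, *args, **kwargs) -> pd.DataFrame:
    # Mage records metadata about the DataFrame flowing through each block.
    return df.dropna(subset=["order_id"])


@test
def test_no_missing_ids(output: pd.DataFrame, *args) -> None:
    assert output["order_id"].notna().all()
```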


Summary

We find ourselves in a truly exciting era where extensive development teams and excessive coding are no longer prerequisites. The availability of numerous open-source solutions enables us to achieve remarkable results with minimal or even no code at all. If I were to invest in a project, my choice would be Benthos, as it adheres to the principle of separation of concerns and excels at data ingestion. For orchestration, a tool like Argo Workflows complements Benthos seamlessly. An added convenience is that Benthos ships as a Docker image, which makes it easy to run from Argo Workflows.
