How to develop a data pipeline in Airflow through TDD (test-driven development)

Marcos Marx · Published in Magrathea · Feb 3, 2021 · 10 min read

I’ve been reading a lot about DataOps and MLOps methodologies lately. One of the pillars of these methodologies is to improve development performance in data science projects. Two techniques widely used in software engineering to achieve this performance are testing and continuous integration. In this article, I will present how to develop a data pipeline in Airflow through tests and, at the end of it, how to configure a simple continuous integration workflow using Github Actions.

I chose to develop the project using Apache Airflow because it is a very popular data engineering tool. In a way, this article is also a hands-on tutorial, so feel free to read and code along 👨‍💻.

You can find all the code developed in this Github repository.

There is also a step-by-step tutorial on developing the pipeline and the tests, which you can find here. It is very detailed and long because I tried to explain my thinking at each stage.

To follow this article, you will need basic knowledge of Airflow, Python, pytest, and Docker. This article won’t introduce those tools. First, I’ll present the sample project that we’ll use as a basis, and right after that, the article’s steps.

The Sample Project

Imagine that you work for an e-commerce start-up. The analytics team needs to create a new dashboard using data from purchases and the product catalog, so they can follow some trends and metrics. You were asked to develop a pipeline that extracts data from the company’s central database and transfers it to the analytics team's data warehouse. As the company doesn’t have many financial resources yet, you were also asked to generate aggregated results for a ranking of best-selling products. The requested tasks are:

  • Transfer data from the transactional database to the data warehouse;
  • Create a table with purchases with product data for the use of the analytics team;
  • Create a table with grouped data of the top 10 best-selling products across the period.

After a brief search, you identified the database that stores order and product information. This database is a PostgreSQL instance and will be called oltp-db in this article. It is the database responsible for the transactional records; it is not recommended to execute heavy queries on it, as they can negatively impact the performance of the central system. For this reason, we will develop a pipeline that fetches data from this database and sends it to the analytics database. The data warehouse used by the analytics team is also a PostgreSQL database and is called olap-db.

A simplified diagram of our pipeline

After a few queries against the database, you can extract the data format and a sample of it.

Sample data used in this project

During the team's weekly planning, everyone took some time to help sketch how this DAG should work. The image below is the DAG developed in this tutorial, but imagine it was drawn on a whiteboard.

DAG developed in this project

We now have our tasks defined and a basic understanding of the data. We can start our development! Before that, though, here is the sequence of topics covered in this article:

  • What is TDD?
  • Why is it difficult to apply TDD in data projects?
  • Set up our development environment
  • Code our DAG through tests!
  • Configure a CI/CD workflow using GitHub Actions
  • Some final considerations

What’s Test-Driven Development

Test-driven development (TDD) is a software development process relying on software requirements being converted to test cases before software is fully developed, and tracking all software development by repeatedly testing the software against all test cases. (wikipedia article)

In summary, we can simplify the stages of TDD as follows:

  • we write a test;
  • we run the test, and it fails;
  • we write code to make the test pass;
  • we run the test and check that it passes;
  • we repeat: add a new test and start a new loop.

The main advantage of this methodology is that it guides the writing of the code towards its objective. It makes you break the problem down into more straightforward steps and quickly validate the result of the code you write. The tests also serve as good documentation for new members joining the project.
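
Since the article embeds its code as gists, here is a tiny, self-contained pytest sketch of one red-green cycle; the function, its name, and the expected values are made up purely for illustration:

```python
# A hypothetical red/green cycle. In step 1 only the test below exists and
# fails (rank_top_products is not defined yet); in step 3 we add the minimal
# implementation shown first, and in step 4 pytest passes.

def rank_top_products(sales, limit=10):
    """Minimal implementation written only after the test has failed."""
    return sorted(sales, key=lambda row: row["quantity"], reverse=True)[:limit]


def test_rank_top_products_returns_the_best_sellers_first():
    sales = [{"product_id": i, "quantity": i * 2} for i in range(1, 50)]
    top = rank_top_products(sales)
    assert len(top) == 10
    assert top[0]["product_id"] == 49  # the highest quantity comes first
```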

This article by David O’Keeffe raises some interesting points about the problems that arise when a project has no tests:

  • You become afraid of changing your code.
  • You forget about prior functionality.
  • Code rot sets in.
  • Production releases turn into rare and often traumatic experiences.

I have been through the situations listed above several times, in projects I developed or participated in, because they had no tests. What about your experience with data projects? It took me a while to learn how to build a test-driven data project and to understand its importance.

Why is it challenging to apply TDD in data projects?

Airflow is a tool (in the modern view of data engineering) that allows you to build a data pipeline using code. However, when we want to apply TDD to this code, a conflict arises because the code has external dependencies. One of the biggest problems with testing a data project's functions is that they depend on the data they run against or on an external system.

For example, the function below accesses the PostgreSQL database and executes an SQL command. We can’t validate it without instantiating PostgresHook. The execution of the SQL also depends strongly on the table and the data contained in it. We could mock the database instance, but then we would not be able to test the SQL command execution, which is the purpose of the function.
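
The original snippet is embedded as a gist; the hedged sketch below only shows what such a function typically looks like (connection ids, table names, and the exact hook calls are assumptions, not the article's actual code, and the PostgresHook import path may differ with your Airflow version):

```python
from airflow.providers.postgres.hooks.postgres import PostgresHook


def load_full_products(**kwargs):
    # This function cannot run without a live database behind each connection
    # id, and the result of the SQL depends entirely on the table's contents.
    oltp_hook = PostgresHook(postgres_conn_id="oltp")
    olap_hook = PostgresHook(postgres_conn_id="olap")
    records = oltp_hook.get_records("SELECT * FROM products")
    olap_hook.insert_rows(table="products", rows=records)
```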

To develop data projects with tests, we need to supply the external dependencies so that we can quickly iterate on our code. Some ways to guarantee this are:

  • create Docker containers and populate them with fake data (we’ll use this approach);
  • build a dev environment on production instances (not recommended);
  • when using managed cloud services, you may be lucky enough to have a library that simulates those services' functions; otherwise, you would have to create a development instance to provide this external dependency.

In this project, as the two databases are PostgreSQL, we will create two Docker containers to simulate a production environment.

Several types of tests can be applied in a data project; some examples are DAG integrity tests, data tests, mock pipeline tests, and unit tests. This article will focus on mock pipeline tests, which, in my opinion, is the type of test that helps the most in pipeline development: it gives quick feedback, guarantees that the pipeline generates the expected result, and builds a better understanding of the data.

A good article about the different types of data testing is Data’s Inferno: 7 Circles of Data Testing Hell with Airflow. Circle 6 in that article is mock pipeline tests. Below is an excerpt from it:

One of the most difficult aspects of designing and implementing reliable data pipelines is the fact that when working with real data, there are two moving parts: the data (and its quality) and your code. In order to be able to reliably test your code, it’s very important to ensure that your code is the only ‘moving part’. This circle of tests fixes the data in place, allowing the tests to focus on the code.

Implementing this layer is quite involved, and requires a good understanding and definition of your data. The usefulness of the tests here will very strongly depend on how well your fake data models the real data.

That’s it! Sorry for talking so much about tests, but I like the topic. Next, we’ll start our journey to solve our problem!

As mentioned above, to carry out our tests on the data project, we must first provide the external dependencies to Airflow. We will use Docker containers to assist us in this step, and then we will create our DAG.

Set up the development environment

For this project, the entire environment is managed using docker-compose: one container for Airflow and one for each database (oltp-db and olap-db). The environment definition file is shown below. It has been simplified to be better presented in the article; the full version can be accessed here.

To make things easier, I created a Makefile with three commands:

  • setup starts the containers and creates the access variables;
  • testing runs the tests;
  • down shuts down all containers.

If you are following the article together with the code, you can execute the command below:

make setup

After a while (240+ seconds, because of the time it takes to start Airflow), we will have our environment ready. You can access the Airflow GUI at localhost:8080. An admin/admin user has been created for development purposes.
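
As an illustration of how the tests can reach these containers, here is a sketch of pytest fixtures that open connections to the two databases; hostnames, ports, and credentials are assumptions and depend on how docker-compose and make setup configure your environment:

```python
# conftest.py -- a sketch of fixtures exposing the two database containers to
# the tests. Host names, ports, and credentials are illustrative only.
import psycopg2
import pytest


def _connect(dbname, port):
    return psycopg2.connect(
        host="localhost", port=port, dbname=dbname, user="root", password="root"
    )


@pytest.fixture
def oltp_conn():
    conn = _connect("oltp", 5432)
    yield conn
    conn.close()


@pytest.fixture
def olap_conn():
    conn = _connect("olap", 5433)
    yield conn
    conn.close()
```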

Create a DAG through tests!

We will now start the development of our DAG. First, we need to create a test, which will fail at first. Below I present the evolution of the code of our first task, load_full_products_data, together with the tests. To better understand the whole process, I recommend reading the step-by-step tutorial here.

Steps to build our DAG through tests
  1. Create the first test, which checks the products table in the olap database. The value compared is the one expected according to the sample data file we created.
  2. Our first error is that the olap_product_size variable doesn’t exist, so we write the code to fix this.
  3. After correcting that error, the next one appears: we need to create the products table in the database.
  4. Finally, we start our DAG development with the function that transfers data from the oltp-db database to olap-db.

This iteration continues until we’re able to test all the tasks we planned for the DAG. Below is the result of the test file at the end of the tutorial.
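
The actual test file lives in the repository; the sketch below only illustrates the shape of a mock pipeline test of this kind, running the DAG once with airflow dags backfill against the fake data and then asserting on what landed in olap-db. The DAG id, dates, table names, credentials, and expected counts are all assumptions:

```python
# test_dag.py -- a sketch of a mock pipeline test: run the whole DAG once
# against the fake data in the containers, then check the results in olap-db.
import subprocess

import psycopg2
import pytest


@pytest.fixture(scope="module", autouse=True)
def run_pipeline():
    # Execute the full pipeline for a single day of fake data.
    subprocess.run(
        ["airflow", "dags", "backfill",
         "-s", "2021-01-01", "-e", "2021-01-01",
         "products_sales_pipeline"],
        check=True,
    )


def _olap_scalar(sql):
    conn = psycopg2.connect(
        host="localhost", port=5433, dbname="olap", user="root", password="root"
    )
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchone()[0]
    finally:
        conn.close()


def test_products_table_is_fully_loaded():
    olap_product_size = _olap_scalar("SELECT COUNT(*) FROM products")
    assert olap_product_size == 10  # expected size of the fake products sample


def test_top_10_products_table_has_ten_rows():
    assert _olap_scalar("SELECT COUNT(*) FROM top_10_products") == 10
```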

And the final code of our DAG is shown below. It’s thoroughly tested and validated.
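
Since the DAG code itself is embedded as a gist, the sketch below only outlines its general structure, assuming PythonOperator tasks that mirror the three requirements; task ids, SQL, connection ids, and the schedule are illustrative, and the tested implementation is the one in the repository:

```python
# dag.py -- a sketch of the final DAG's structure. Task ids, SQL, connection
# ids, and the schedule are assumptions; the tested code lives in the repo.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


def load_full_products_data(**_):
    # Full load: copy the products table from oltp-db into olap-db.
    rows = PostgresHook(postgres_conn_id="oltp").get_records("SELECT * FROM products")
    PostgresHook(postgres_conn_id="olap").insert_rows(table="products", rows=rows)


def load_full_purchases_data(**_):
    rows = PostgresHook(postgres_conn_id="oltp").get_records("SELECT * FROM purchases")
    PostgresHook(postgres_conn_id="olap").insert_rows(table="purchases", rows=rows)


def join_purchases_with_products(**_):
    PostgresHook(postgres_conn_id="olap").run(
        """
        CREATE TABLE IF NOT EXISTS purchase_products AS
        SELECT pu.*, pr.name, pr.price
        FROM purchases pu
        JOIN products pr ON pr.id = pu.product_id;
        """
    )


def rank_top_10_products(**_):
    PostgresHook(postgres_conn_id="olap").run(
        """
        CREATE TABLE IF NOT EXISTS top_10_products AS
        SELECT product_id, SUM(quantity) AS total_sold
        FROM purchase_products
        GROUP BY product_id
        ORDER BY total_sold DESC
        LIMIT 10;
        """
    )


with DAG(
    dag_id="products_sales_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_products = PythonOperator(
        task_id="load_full_products_data", python_callable=load_full_products_data
    )
    load_purchases = PythonOperator(
        task_id="load_full_purchases_data", python_callable=load_full_purchases_data
    )
    join = PythonOperator(
        task_id="join_purchases_with_products",
        python_callable=join_purchases_with_products,
    )
    rank = PythonOperator(
        task_id="rank_top_10_products", python_callable=rank_top_10_products
    )

    [load_products, load_purchases] >> join >> rank
```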

Even though this DAG is a simple one, developing it with tests makes it straightforward to understand. We can clearly see the sequence and logic we expect from this pipeline. The development of the code itself was also easier: building task by task and validating each one on the fly keeps the work in an optimized flow.

Configure CI/CD workflow using Github Actions

The Makefile I created to help with local development already automates the steps needed to implement the continuous integration workflow. We want GitHub Actions to run the same commands whenever we open a pull request.

The setup configuration is straightforward, but it serves as a starting point. Below is an image that brings joy: we have a continuous integration pipeline running our tests, and they have all passed! 🪅🎊🎉

Benefits

  • Using TDD helps a lot when developing the code. It really makes you stop, break your problem into smaller parts, and validate them continuously.
  • Having a project with a local setup and tests helps the team members. Tests play the role of documentation, helping you understand the project more clearly and quickly.
  • Creating fake data has a positive side effect: you learn and understand more about your data's characteristics.
  • Setting up a CI pipeline ensures that tests run whenever someone changes the code and submits a change to production.

Our workflow now

Some Final Considerations

  • Building a mock pipeline scenario can be challenging in some cases. If your stack uses cloud technologies, some services have mock libraries; for the other cases, it would be necessary to create a dev instance to provide the connection, which may generate additional cost, or, in the worst case, there is no Docker image at all. My suggestion for these scenarios is to create a DAG that only transfers this source to a staging area and to run the tests from the staging area onwards. This is not ideal, but I think it is better than having no tests at all.
  • Creating the sample data can become complicated. In a production scenario, there will likely be tables whose data depends on information in other tables. For example: if a DAG joins the products and purchases tables, the fake data sets will depend on each other, since the goal is to check whether there is any product ID in the purchases table that is not in the products table. In that case, libraries like Faker can create this data in a simpler way (see the sketch after this list).
  • Why not test using airflow tasks test? Ideally, the tests would be more granular. However, due to the strong dependency between the tasks, it becomes challenging to configure and keep the pipeline functional enough to apply a test using this command. So, at the moment, I think that running airflow dags backfill is the simplest way.
  • Why not use dbt? The problem we address here is quite generic and can be applied to all stages of the pipeline (E, T, L, or enrichment algorithms, machine learning, etc.). dbt focuses on transformations, but it can certainly be used together with this approach.
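
As promised above, here is a minimal sketch of generating consistent fake data with Faker (plus random): every purchase references a product id that really exists. Column names and sizes are assumptions that only loosely mirror the sample data:

```python
# generate_fake_data.py -- a sketch of generating consistent fake tables with
# Faker: every purchase points to a product id that exists in products.
import random

from faker import Faker

fake = Faker()

products = [
    {"id": i, "name": fake.word(), "price": round(random.uniform(1, 100), 2)}
    for i in range(1, 11)
]

purchases = [
    {
        "id": i,
        "product_id": random.choice(products)["id"],  # guaranteed to exist
        "quantity": random.randint(1, 5),
        "purchased_at": fake.date_time_this_year(),
    }
    for i in range(1, 101)
]
```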

Conclusions

It is not simple to set up a data project with tests, but the investment pays off very quickly. It becomes much simpler to change the code, to help another developer understand the project, and to allow multiple people to work on the same codebase. Even though I used Airflow, the methodology developed to solve the problem is very generic and can easily be adapted to your own problem.

I hope you enjoyed it. If you have another strategy for testing data pipelines, send me a message, as I would love to exchange ideas about it. If you have any doubts, feel free to send me a message here.
