
Code-along 2024-06-19 Creating Data Pipelines with Airflow

This example project illustrates how to create and test an Airflow DAG (Directed Acyclic Graph) from within DataLab.

This example will:

  • Collect a list of IP addresses used by Amazon Web Services
  • Run some basic cleanup
  • Prepare the data for use in later processes

We will test the DAG within Airflow and verify that everything works by reviewing the output of the airflow commands.
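As a preview, here is a minimal sketch of what such a DAG might look like. It is not the code-along's exact implementation: the task names are illustrative, and it assumes the public AWS endpoint https://ip-ranges.amazonaws.com/ip-ranges.json, which serves AWS's current IP ranges as JSON.

import json

import pendulum
import requests
from airflow.decorators import dag, task


@dag(schedule=None, start_date=pendulum.datetime(2024, 6, 19), catchup=False)
def aws_ip_pipeline():

    @task
    def fetch_ip_ranges() -> dict:
        # AWS publishes its current IP ranges as a JSON document
        response = requests.get("https://ip-ranges.amazonaws.com/ip-ranges.json")
        response.raise_for_status()
        return response.json()

    @task
    def clean(raw: dict) -> list:
        # Basic cleanup: keep only the CIDR prefix and region for each entry
        return [
            {"cidr": entry["ip_prefix"], "region": entry["region"]}
            for entry in raw["prefixes"]
        ]

    @task
    def prepare(rows: list) -> None:
        # Prepare the data for later processes as newline-delimited JSON
        with open("/tmp/aws_ip_ranges.ndjson", "w") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")

    prepare(clean(fetch_ip_ranges()))


aws_ip_pipeline()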

Task 0: Setup

For this project, we need to install the apache-airflow package and run a few configuration commands within the DataLab environment.

Instructions

Install the apache-airflow package, currently version 2.9.2

!pip install apache-airflow==2.9.2

Modify the PATH variable to make the airflow command accessible

%set_env PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/repl/.local/bin
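To confirm the shell can now find the executable, a quick sanity check (the exact path in the output may differ):

!which airflow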

Initialize the Airflow database

!airflow db init

Modify the Airflow configuration file airflow.cfg (generated by the previous step) so that Airflow looks for DAGs in /home/repl/workspace/dags and does not load the bundled example DAGs

!perl -p -i -e 's/\/home\/repl\/airflow\/dags/\/home\/repl\/workspace\/dags/' /home/repl/airflow/airflow.cfg
!perl -p -i -e 's/load_examples = True/load_examples = False/' /home/repl/airflow/airflow.cfg
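These two substitutions rewrite the dags_folder setting and turn off load_examples. A quick way to confirm both edits took effect:

!grep dags_folder /home/repl/airflow/airflow.cfg
!grep load_examples /home/repl/airflow/airflow.cfg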

Task 1: Test basic Airflow commands

To verify that our Airflow installation is working as expected, we can run a few commands and check the output.

The airflow command is a shell command that handles almost all interaction with the Airflow environment.

NOTE: Within DataLab, the airflow command should be preceded by a ! character.
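For example, the following commands print the installed version and list the DAGs Airflow can currently see (the list should be empty until a DAG file is added under /home/repl/workspace/dags):

!airflow version
!airflow dags list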