Code-along 2024-06-19: Creating Data Pipelines with Airflow
This project illustrates creating and testing an Airflow DAG (Directed Acyclic Graph) from within DataLab.
This example will:
- Collect a list of IP addresses used by Amazon Web Services
- Run some basic cleanup
- Prepare the data for use in later processes
We will test the DAG within Airflow and verify everything works by reviewing the output of the airflow commands.
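To make the goal concrete, here is a minimal sketch of the kind of DAG this project builds. It is not the finished project code: the DAG and task names, the cleanup logic, and the output path are illustrative assumptions; the URL is the public ip-ranges.json feed AWS uses to publish its IP address ranges.
import json
from datetime import datetime
from urllib.request import urlopen

from airflow.decorators import dag, task

@dag(schedule=None, start_date=datetime(2024, 6, 19), catchup=False)
def aws_ip_pipeline():
    @task
    def fetch_ips():
        # Download the JSON feed of AWS IP ranges
        with urlopen("https://ip-ranges.amazonaws.com/ip-ranges.json") as resp:
            return json.load(resp)

    @task
    def clean_ips(raw):
        # Basic cleanup: keep only the IPv4 prefixes, deduplicated and sorted
        return sorted({entry["ip_prefix"] for entry in raw["prefixes"]})

    @task
    def save_ips(prefixes):
        # Write the cleaned list where later processes can pick it up
        # (the path is an assumption for this sketch)
        with open("/home/repl/workspace/aws_ip_prefixes.txt", "w") as fh:
            fh.write("\n".join(prefixes))

    save_ips(clean_ips(fetch_ips()))

aws_ip_pipeline()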
Task 0: Setup
For this project, we need to install the apache-airflow package and run a few configuration commands within the DataLab environment.
Instructions
Install the apache-airflow package, currently version 2.9.2
!pip install apache-airflow==2.9.2
Modify the PATH variable to make the airflow command accessible
%set_env PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/repl/.local/bin
Initialize the Airflow database
!airflow db init
Modify the Airflow configuration file airflow.cfg to point the DAGs folder at /home/repl/workspace/dags and to disable loading the bundled example DAGs
!perl -p -i -e 's/\/home\/repl\/airflow\/dags/\/home\/repl\/workspace\/dags/' /home/repl/airflow/airflow.cfg
!perl -p -i -e 's/load_examples = True/load_examples = False/' /home/repl/airflow/airflow.cfg
Task 1: Test basic Airflow commands
To verify our Airflow installation is working as expected, we can run a few commands and check the output.
The airflow command is a shell command that handles almost all interaction with the Airflow environment.
NOTE: Within DataLab, the airflow command should be preceded with a ! character.
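For example, a few standard Airflow CLI subcommands make a quick smoke test (the exact output depends on your installation):
!airflow version
!airflow info
!airflow dags list
If these print a version number, environment details, and a DAG list (empty for now, since load_examples is off and our dags folder is still empty) without errors, the installation is working.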