Top 11 Data Engineering Projects for Hands-On Learning

Showcase your data engineering skills through these portfolio projects. Practice and deepen your understanding of various technologies to show potential employers your strengths!
Nov 6, 2024  · 25 min read

Data engineering supports the movement and transformation of data. As companies rely on huge amounts of data to gain insights and drive innovation, the demand for data engineers continues to grow.

For data professionals, diving into data engineering projects offers a wealth of opportunities. Hands-on challenges sharpen your technical skills and provide a tangible portfolio to showcase your knowledge and experience.

In this article, I have curated a selection of data engineering projects designed to help you advance your skills and confidently tackle real-world data challenges!

Why Work on Data Engineering Projects?

Building a solid understanding of data engineering through theory and practice is important. If you’re reading this article, you may already know this, but here are three specific reasons to dive into these projects:

Building technical skills

Data engineering projects provide hands-on experience with technologies and methodologies. You'll develop proficiency in programming languages, database management, big data processing, and cloud computing. These technical skills are fundamental to data engineering roles and highly transferable across the tech industry.

Portfolio development 

Creating a portfolio of data engineering projects demonstrates your practical abilities to potential employers. You provide tangible evidence of your capabilities by showcasing implementations of data pipelines, warehouse designs, and optimization solutions. 

A strong portfolio sets you apart in the job market and complements your resume with real-world accomplishments.

Learning tools and technologies 

The data engineering field employs a diverse array of tools and technologies. Working on projects exposes you to data processing frameworks, workflow management tools, and visualization platforms. 

This practical experience keeps you current with industry trends and enhances adaptability in an evolving technological landscape.

Data Engineering Projects for Beginners

These projects aim to introduce the main tools used by data engineers. Start here if you are new to data engineering or need a refresher.

Project 1: ETL pipeline with open data (CSV to SQL)

This project entails building an ETL pipeline using a publicly available dataset, such as weather or transportation data. You will extract the data from a CSV file, clean and transform it using Python (with a library like Pandas), and load the transformed data into Google BigQuery, a cloud-based data warehouse.

This project is excellent for beginners as it introduces core ETL concepts—data extraction, transformation, and loading—while giving exposure to cloud tools like BigQuery. 

You'll also learn how to interact with cloud data warehouses, a core skill in modern data engineering, using simple tools like Python and the BigQuery API. For an introduction, review the beginner’s guide to BigQuery.

As for the data, you can select an available dataset from either Kaggle or data.gov.
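To make the flow concrete, here is a minimal sketch of the three steps, assuming a hypothetical weather.csv file and placeholder Google Cloud project, dataset, and table names. It uses pandas and the google-cloud-bigquery client, which requires authentication to be configured beforehand.

```python
# A minimal sketch of the CSV -> transform -> BigQuery flow.
# Assumes a Google Cloud project with BigQuery enabled and application-default
# credentials configured; the file, dataset, and table names are placeholders.
import pandas as pd
from google.cloud import bigquery

# Extract: read the raw CSV (e.g., a weather dataset from Kaggle or data.gov)
df = pd.read_csv("weather.csv")

# Transform: basic cleaning with pandas
df = df.dropna(subset=["date", "temperature"])          # drop incomplete rows
df["date"] = pd.to_datetime(df["date"])                  # normalize types
df["temperature_c"] = (df["temperature"] - 32) * 5 / 9   # example unit conversion

# Load: write the DataFrame to a BigQuery table
client = bigquery.Client()
table_id = "my-project.weather_dataset.daily_weather"    # placeholder table ID
job = client.load_table_from_dataframe(df, table_id)
job.result()  # wait for the load job to finish
print(f"Loaded {job.output_rows} rows into {table_id}")
```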

Resources

Here are some resources, including GitHub repositories and tutorials, that provide step-by-step guidance:

GitHub repositories:

  • End-to-End Data Pipeline: This repository demonstrates a fully automated pipeline that extracts data from CSV files, transforms it using Python and dbt, and loads it into Google BigQuery.
  • ETL Pipeline with Airflow and BigQuery: This project showcases an ETL pipeline orchestrated with Apache Airflow that automates the extraction of data from CSV files, transformation using Python, and loading into BigQuery.

Courses:

  • ETL and ELT in Python: Learn more about ETL processes in Python, covering foundational concepts and practical implementations to build data pipelines.
  • Understanding Modern Data Architecture: This course offers a comprehensive overview of modern data architecture, focusing on best practices for moving and structuring data in cloud-based systems like BigQuery.

Skills developed

  • Extracting data from CSV with Python
  • Transforming and cleaning data with Python
  • Loading data into BigQuery with Python and SQL

Project 2: Weather data pipeline with Python and PostgreSQL

This project introduces aspiring data engineers to the fundamental process of building a data pipeline, focusing on three core aspects of data engineering: data collection, cleansing, and storage. 

Using Python, you’ll fetch weather conditions and forecasts for various locations from readily available public weather APIs. Once the weather data is collected, you’ll process the raw data, which may involve converting temperature units, handling missing values, or standardizing location names. Finally, you’ll store the cleansed data in a PostgreSQL database.

This project is a strong starting point for new data engineers. It covers the fundamentals of building a data pipeline using widely used industry tools.
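Below is a rough sketch of the collect, clean, and store loop, assuming the free Open-Meteo forecast endpoint and a local PostgreSQL instance; the table and column names are illustrative only.

```python
# A rough sketch of the collect -> clean -> store loop, assuming the free
# Open-Meteo API and a local PostgreSQL instance; table and column names
# are placeholders.
import requests
import psycopg2

LOCATIONS = {"London": (51.51, -0.13), "New York": (40.71, -74.01)}

def fetch_current_weather(lat, lon):
    resp = requests.get(
        "https://api.open-meteo.com/v1/forecast",
        params={"latitude": lat, "longitude": lon, "current_weather": "true"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["current_weather"]

conn = psycopg2.connect(dbname="weather", user="postgres", password="postgres", host="localhost")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS observations (
            city TEXT, observed_at TIMESTAMP, temperature_c REAL, windspeed_kmh REAL
        )
    """)
    for city, (lat, lon) in LOCATIONS.items():
        w = fetch_current_weather(lat, lon)
        cur.execute(
            "INSERT INTO observations VALUES (%s, %s, %s, %s)",
            (city, w["time"], w["temperature"], w["windspeed"]),
        )
conn.close()
```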

Resources

Here are some valuable resources, including GitHub repositories and tutorials, that provide step-by-step guidance to complete this project:

GitHub repositories:

  • Weather and Air Quality ETL Pipeline: This repository demonstrates an ETL pipeline that extracts weather and air quality data from public APIs, transforms it into a clean, analyzable format, and loads it into a PostgreSQL database.
  • Weather Data Integration Project: An end-to-end ETL pipeline that extracts weather data, transforms it, and loads it into a PostgreSQL database.

Courses:

  • Creating PostgreSQL Databases: This course offers a comprehensive guide to PostgreSQL, covering essential skills for creating, managing, and optimizing databases—a critical step in the weather data pipeline.
  • Data Engineer in Python: This skill track covers foundational data engineering skills, including data collection, transformation, and storage, providing a strong start for building pipelines in Python.

Skills developed

  • Using Python to write data pipeline applications
  • Collecting data from external sources (APIs)
  • Cleaning data to make it consistent and understandable
  • Setting up databases and storing and organizing data in them

Project 3: London transport analysis

This project offers an excellent starting point for aspiring data engineers. It introduces you to working with real-world data from a major public transport network that handles over 1.5 million daily journeys. 

The project's strength lies in its use of industry-standard data warehouse solutions like Snowflake, Amazon Redshift, Google BigQuery, or Databricks. These platforms are crucial in modern data engineering, allowing you to efficiently process and analyze large datasets. 

By analyzing transport trends, popular methods, and usage patterns, you'll learn how to extract meaningful insights from large datasets, a core competency in data engineering.
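The query below is a small example of the kind of aggregation this project involves, run from Python against BigQuery; the table and column names are placeholders to adapt to whichever warehouse and schema you load the transport data into.

```python
# A small sketch of an analytical query against a data warehouse, run from
# Python via BigQuery. The table and column names (journey_type,
# journeys_millions) are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT journey_type,
           SUM(journeys_millions) AS total_journeys_millions
    FROM `my-project.london_transport.journeys`
    GROUP BY journey_type
    ORDER BY total_journeys_millions DESC
"""
results = client.query(sql).to_dataframe()
print(results)
```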

Resources

Here are some resources, including guided projects and courses, that provide step-by-step guidance:

Guided projects:

  • Exploring London’s Travel Network: This guided project teaches you how to analyze London's public transport data, helping you explore trends, popular routes, and usage patterns. You'll gain experience with large-scale data analysis using real-world data from a major public transport network.

Courses:

  • Data Warehousing Concepts: This course covers essential data warehousing principles, including architectures and use cases for platforms like Snowflake, Redshift, and BigQuery. It's an excellent foundation for implementing large-scale data storage and processing solutions.

Skills developed

  • Writing queries with a better understanding of the underlying data.
  • Working with large datasets.
  • Understanding big data concepts.
  • Working with data warehouses and big data tools, like Snowflake, Redshift, BigQuery, or Databricks.

Intermediate Data Engineering Projects

These projects focus on skills like writing better code and integrating different data platforms. These technical skills are essential for contributing to an existing tech stack and working as part of a larger team.

Project 4: Performing a code review

This project is all about reviewing the code of another data engineer. While it may not be as hands-on with the technology as some other projects, being able to review others’ code is an important part of growing as a data engineer. 

Reading and reviewing code is just as important a skill as writing it. Once you understand foundational data engineering concepts and practices, you can apply them when reviewing others’ code to ensure it follows best practices and to catch potential bugs.

Resources

Here are some valuable resources, including projects and articles, that provide step-by-step guidance:

Guided projects:

  • Performing a Code Review: This guided project offers hands-on experience in code review, simulating the code review process as if you were a senior data professional. It’s an excellent way to practice identifying potential bugs and ensuring best practices are followed.

Articles:

  • How to Do a Code Review: This resource provides recommendations on conducting code reviews effectively, based on extensive experience, and covers various aspects of the review process.

Skills developed

  • Reading and evaluating code written by other data engineers
  • Finding bugs and logic errors when reviewing code
  • Providing feedback on code in a clear and helpful manner

Project 5: Building a retail data pipeline

In this project, you'll build a complete ETL pipeline with Walmart's retail data. You'll retrieve data from various sources, including SQL databases and Parquet files, apply transformation techniques to prepare and clean the data, and finally load it into an easily accessible format.

This project is excellent for building practical, end-to-end data engineering knowledge because it covers essential skills like extracting data from multiple formats, transforming it for meaningful analysis, and loading it for efficient storage and access. It reinforces concepts like handling diverse data sources, optimizing data flows, and maintaining scalable pipelines.
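Here is a condensed sketch of the extract, transform, and load steps, assuming a hypothetical SQLite database of sales and a Parquet file of supplementary store data; all file, table, and column names are placeholders.

```python
# A condensed sketch of the retail pipeline's ETL steps, assuming a
# hypothetical SQLite database of grocery sales and a Parquet file of extra
# store data; names and columns are placeholders.
import sqlite3
import pandas as pd

# Extract from two different source formats
conn = sqlite3.connect("walmart.db")
sales = pd.read_sql("SELECT * FROM grocery_sales", conn)
extra = pd.read_parquet("extra_data.parquet")

# Transform: join the sources and clean the result
df = sales.merge(extra, on="index", how="left")   # "index" is a placeholder join key
df = df.dropna(subset=["Weekly_Sales"])
df["Date"] = pd.to_datetime(df["Date"])
df["Month"] = df["Date"].dt.month

# Load: write an analysis-ready file that downstream users can query easily
monthly = df.groupby("Month", as_index=False)["Weekly_Sales"].mean()
monthly.to_csv("agg_monthly_sales.csv", index=False)
```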

Resources

Here are some valuable resources, including guided projects and courses, that provide step-by-step guidance:

Guided projects:

  • Building a Retail Data Pipeline: This guided project takes you through constructing a retail data pipeline using Walmart’s retail data. You’ll learn to retrieve data from SQL databases and Parquet files, transform it for analysis, and load it into an accessible format.

Courses:

  • Database Design: A solid understanding of database design is essential when working on data pipelines. This course covers the basics of designing and structuring databases, which is valuable for handling diverse data sources and optimizing storage.

Skills developed

  • Designing data pipelines for real-world use cases.
  • Extracting data from multiple sources and different formats.
  • Cleaning and transforming data from different formats to improve its consistency and quality.
  • Loading this data into an easily accessible format.

Project 6: Factors influencing student performance with SQL

In this project, you'll analyze a comprehensive database focused on various factors that impact student success, such as study habits, sleep patterns, and parental involvement. By crafting SQL queries, you'll investigate the relationships between these factors and exam scores, exploring questions like the effect of extracurricular activities and sleep on academic performance.

This project builds data engineering skills by enhancing your ability to manipulate and query databases effectively. 

You'll develop skills in data analysis, interpretation, and deriving insights from complex datasets, essential for making data-driven decisions in educational contexts and beyond.
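The snippet below shows the kind of SQL you might write for this analysis, executed against SQLite for simplicity; the student_performance table and its columns are assumptions to adjust to the schema of the dataset you use.

```python
# An example of the kind of SQL query this project involves, run against
# SQLite for simplicity. The student_performance table and its columns
# (hours_studied, sleep_hours, exam_score) are assumptions.
import sqlite3

conn = sqlite3.connect("student_performance.db")
query = """
    SELECT hours_studied,
           AVG(exam_score) AS avg_exam_score,
           COUNT(*)        AS num_students
    FROM student_performance
    WHERE sleep_hours >= 7
    GROUP BY hours_studied
    HAVING COUNT(*) >= 10          -- ignore sparsely populated groups
    ORDER BY hours_studied
"""
for row in conn.execute(query):
    print(row)
conn.close()
```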

Resources

Here are some resources, including guided projects and courses, that provide step-by-step guidance:

Guided projects:

  • Factors that Fuel Student Performance: This guided project enables you to explore the influence of various factors on student success by analyzing a comprehensive database. You’ll use SQL to investigate relationships between study habits, sleep patterns, and academic performance, gaining experience in data-driven educational analysis.

Courses:

  • Data Manipulation in SQL: A strong foundation in SQL data manipulation is key for this project. This course covers SQL techniques for extracting, transforming, and analyzing data in relational databases, equipping you with the skills to handle complex datasets.

Skills developed

  • Writing and optimizing SQL queries to retrieve and manipulate data effectively.
  • Analyzing complex datasets to identify trends and relationships.
  • Formulating hypotheses and interpreting results based on data.

Advanced Data Engineering Projects

One hallmark of an advanced data engineer is the ability to create pipelines that handle many data types across different technologies. These projects focus on expanding your skill set by combining multiple advanced data engineering tools to create scalable data processing systems.

Project 7: Cleaning a dataset with PySpark

Using an advanced tool like PySpark, you can build pipelines that take advantage of Apache Spark's distributed processing capabilities. 

Before you attempt to build a project like this, it's important to complete an introductory course to understand the fundamentals of PySpark. This foundational knowledge will enable you to fully utilize this tool for effective data extraction, transformation, and loading.
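As a starting point, here is a minimal PySpark cleaning sketch, assuming a hypothetical orders.csv with order_date, product, and price columns; adapt the paths and cleaning rules to the dataset you actually use.

```python
# A minimal PySpark cleaning sketch; the file path, column names, and
# cleaning rules are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders-cleaning").getOrCreate()

orders = spark.read.csv("orders.csv", header=True, inferSchema=True)

cleaned = (
    orders
    .dropDuplicates()
    .dropna(subset=["order_date", "product"])
    .withColumn("order_date", F.to_date("order_date"))
    .withColumn("product", F.lower(F.trim(F.col("product"))))
    .filter(F.col("price") > 0)
)

# Write the cleaned data back out in a columnar format for downstream use
cleaned.write.mode("overwrite").parquet("orders_cleaned/")
spark.stop()
```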

Resources

Here are some valuable resources, including guided projects, courses, and tutorials, that provide step-by-step guidance:

Guided projects:

  • Cleaning an Orders Dataset with PySpark: This guided project walks you through cleaning an e-commerce orders dataset using PySpark, helping you understand how to extract, transform, and load data in a scalable way with Apache Spark.

Courses:

  • Introduction to PySpark: This course provides an in-depth introduction to PySpark, covering essential concepts and techniques for effectively working with large datasets in Spark. It's an ideal starting point for building a strong foundation in PySpark.

Tutorials:

  • PySpark Tutorial: Getting Started with PySpark: This tutorial introduces the core components of PySpark, guiding you through the setup and fundamental operations so you can confidently start building data pipelines with PySpark.

Skills developed

  • Expanding experience with PySpark
  • Cleaning and transforming data for stakeholders
  • Ingesting large batches of data
  • Deepening knowledge of Python in ETL processes

Project 8: Data modeling with dbt and BigQuery

A popular and powerful modern tool for data engineers is dbt (Data Build Tool), which allows data engineers to follow a software development approach. It offers intuitive version control, testing, boilerplate code generation, lineage, and environments. dbt can be combined with BigQuery or other cloud data warehouses to store and manage your datasets. 

This project will allow you to create pipelines in dbt, generate views, and link the final data to BigQuery.
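For orientation, a dbt model is essentially a SQL SELECT statement that dbt materializes as a view or table in the warehouse. The sketch below is not dbt itself; it creates an equivalent view directly with the BigQuery Python client to show what a simple model compiles down to, using placeholder project, dataset, and column names.

```python
# Not dbt itself -- a sketch of what a simple dbt model boils down to: a SQL
# SELECT materialized as a view in the warehouse. In the actual project, the
# SELECT would live in a models/*.sql file and `dbt run` would create the
# view. Project, dataset, and column names below are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

view = bigquery.Table("my-project.analytics.daily_orders")  # placeholder view ID
view.view_query = """
    SELECT order_date,
           COUNT(*)         AS num_orders,
           SUM(order_total) AS revenue
    FROM `my-project.raw.orders`
    GROUP BY order_date
"""
client.create_table(view, exists_ok=True)
```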

Resources

Here are some valuable resources, including courses and video tutorials, that provide step-by-step guidance:

YouTube videos:

  • End to End Modern Data Engineering with dbt: In this video, CodeWithYu provides a comprehensive walkthrough of setting up and using dbt with BigQuery, covering the steps for building data pipelines and generating views. It’s a helpful guide for beginners learning to combine dbt and BigQuery in a data engineering workflow.

Courses:

  • Introduction to dbt: This course introduces the fundamentals of dbt, covering basic concepts like Git workflows, testing, and environment management. It’s an excellent starting point for using dbt effectively in data engineering projects.

Skills developed

  • Learn about dbt
  • Learn about BigQuery
  • Understand how to create SQL-based transformations
  • Use software engineering best practices in data engineering (version control, testing, and documentation)

Project 9: Airflow and Snowflake ETL using S3 storage and BI in Tableau

With this project, we’ll use Airflow to pull in data from an API and transfer it into Snowflake via an Amazon S3 bucket. The goal is to handle the ETL orchestration in Airflow and the analytical storage in Snowflake. 

This is an excellent project because it connects multiple data sources through several cloud storage systems, all orchestrated with Airflow. With its many moving parts, it closely resembles a real-world data architecture. It also touches on business intelligence (BI) by adding visualizations in Tableau.
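The pared-down Airflow DAG below sketches the API-to-S3-to-Snowflake flow; the API URL, bucket, Snowflake stage, table, and credentials are placeholders, it assumes a recent Airflow 2.x install, and a production pipeline would split transformations into further tasks.

```python
# A pared-down Airflow DAG sketch for the API -> S3 -> Snowflake flow.
# API URL, bucket, stage, table, and credentials are placeholders.
from datetime import datetime
import json

import boto3
import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

def fetch_to_s3():
    # Pull raw data from a placeholder API and land it in S3
    data = requests.get("https://api.example.com/markets", timeout=30).json()
    boto3.client("s3").put_object(
        Bucket="my-etl-bucket",
        Key="raw/markets.json",
        Body=json.dumps(data),
    )

def load_into_snowflake():
    import snowflake.connector
    conn = snowflake.connector.connect(user="...", password="...", account="...")
    # COPY INTO assumes an external stage pointing at the S3 bucket above
    conn.cursor().execute(
        "COPY INTO analytics.raw_markets FROM @my_s3_stage/raw/ FILE_FORMAT = (TYPE = 'JSON')"
    )
    conn.close()

with DAG(
    dag_id="api_to_snowflake",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # assumes Airflow 2.4+
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="fetch_to_s3", python_callable=fetch_to_s3)
    load = PythonOperator(task_id="load_into_snowflake", python_callable=load_into_snowflake)
    extract >> load
```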

Resources

Here are some valuable resources, including courses and video tutorials, that provide step-by-step guidance:

YouTube videos:

  • Data Pipeline with Airflow, S3, and Snowflake: In this video, Seattle Data Guy demonstrates how to use Airflow to pull data from the PredictIt API, load it into Amazon S3, perform Snowflake transformations, and create Tableau visualizations. This end-to-end guide is ideal for understanding the integration of multiple tools in a data pipeline.

Courses:

  • Introduction to Apache Airflow in Python: This course provides an overview of Apache Airflow, covering essential concepts such as DAGs, operators, and task dependencies. It's a great foundation for understanding how to structure and manage workflows in Airflow.
  • Introduction to Snowflake: This course introduces Snowflake, a powerful data warehousing solution. It covers managing data storage, querying, and optimization. It’s perfect for gaining foundational knowledge before working with Snowflake in data pipelines.
  • Data Visualization in Tableau: This course covers essential Tableau skills for data visualization, allowing you to turn data into insightful visuals—a core step for interpreting data pipeline outputs.

Skills developed

  • Practice creating DAGs in Airflow
  • Practice connecting to an API in Python
  • Practice storing data in Amazon S3 buckets
  • Moving data from Amazon to Snowflake for analysis
  • Simple visualization of data in Tableau
  • Creating a comprehensive, end-to-end data platform

Project 10: Reddit ETL in AWS using Airflow

This project tackles a complex data pipeline with multiple steps using advanced data processing tools in the AWS ecosystem. 

Start by setting up Apache Airflow to pull in data from Reddit and transform it using SQL. Next, you'll connect to AWS by landing the data in an S3 bucket and using AWS Glue for additional formatting. Then, you can use Athena to test queries before storing the data in Redshift for long-term data warehousing and analytical querying.
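As an illustration of the extraction step only, the sketch below pulls top posts from a subreddit via Reddit's public JSON endpoint and lands the raw file in S3; the subreddit, bucket, and key are placeholders, and the Glue, Athena, and Redshift stages would follow as separate tasks in the DAG.

```python
# Extraction step only: pull top posts from a subreddit and land the raw
# JSON in S3. Subreddit, bucket, and key are placeholders.
import json
import boto3
import requests

resp = requests.get(
    "https://www.reddit.com/r/dataengineering/top.json",
    params={"limit": 100, "t": "day"},
    headers={"User-Agent": "reddit-etl-demo/0.1"},  # Reddit requires a User-Agent
    timeout=30,
)
resp.raise_for_status()
posts = [
    {
        "id": child["data"]["id"],
        "title": child["data"]["title"],
        "score": child["data"]["score"],
        "num_comments": child["data"]["num_comments"],
        "created_utc": child["data"]["created_utc"],
    }
    for child in resp.json()["data"]["children"]
]

boto3.client("s3").put_object(
    Bucket="my-reddit-raw",                      # placeholder bucket
    Key="reddit/top/dataengineering.json",
    Body=json.dumps(posts),
)
```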

Resources

Here are some resources, including courses and video tutorials, that provide step-by-step guidance:

YouTube videos:

  • Reddit Data Pipeline Engineering Project: CodeWithYu demonstrates a complete Reddit data pipeline in this video, including data extraction with Airflow, transformations with PostgreSQL, and integration with AWS services like S3, Glue, Athena, and Redshift. This walkthrough is a helpful guide to tackling the multiple layers of a complex data pipeline.

Courses:

  • Introduction to AWS: This course provides a solid foundation in AWS, covering essential concepts and tools. Understanding the basics of AWS services like S3, Glue, Athena, and Redshift will be crucial for successfully implementing this project.
  • Introduction to Redshift: This course offers a comprehensive introduction to Amazon Redshift, focusing on data warehousing concepts, Redshift architecture, and essential skills for managing and querying large datasets. It's an excellent resource for deepening your understanding of Redshift within AWS pipelines.

Skills developed

  • Pull website data into Airflow
  • Use PostgreSQL to transform data
  • Connect Airflow to AWS to transfer data into S3 buckets
  • Use AWS Glue for ETL
  • Use AWS Athena for simple querying
  • Transfer data from S3 to Amazon Redshift for data warehousing

Project 11: Building a real-time data pipeline with PySpark, Kafka, and Redshift

In this project, you’ll create a robust, real-time data pipeline using PySpark, Apache Kafka, and Amazon Redshift to handle high volumes of data ingestion, processing, and storage. 

The pipeline will capture data from various sources in real time, process and transform it using PySpark, and load the transformed data into Redshift for further analysis. Additionally, you’ll implement monitoring and alerting to ensure data accuracy and pipeline reliability.

This project is an excellent opportunity to build foundational skills in real-time data processing and handling big data technologies, such as Kafka for streaming and Redshift for cloud-based data warehousing.
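Here is a minimal Structured Streaming sketch of the ingestion side: it reads JSON events from a Kafka topic, parses and filters them, and appends each micro-batch to Redshift over JDBC. The topic, schema, and connection settings are placeholders, and running it requires the spark-sql-kafka package and a Redshift-compatible JDBC driver on the classpath.

```python
# Minimal Structured Streaming sketch: Kafka -> parse -> filter -> Redshift.
# Topic, schema, and JDBC settings are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("realtime-pipeline").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "transactions")                 # placeholder topic
    .load()
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
    .filter(F.col("amount") > 0)
)

def write_batch(batch_df, batch_id):
    # Append each micro-batch to Redshift over JDBC (settings are placeholders)
    (batch_df.write.format("jdbc")
        .option("url", "jdbc:redshift://my-cluster:5439/dev")
        .option("dbtable", "public.transactions")
        .option("user", "awsuser")
        .option("password", "...")
        .mode("append")
        .save())

query = events.writeStream.foreachBatch(write_batch).start()
query.awaitTermination()
```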

Resources

Here are some resources, including courses and video tutorials, that provide step-by-step guidance:

YouTube videos:

  • Building a Real-Time Data Pipeline with PySpark, Kafka, and Redshift: This video by Darshil Parmar guides you through building a complete real-time data pipeline with PySpark, Kafka, and Redshift. It includes steps for data ingestion, transformation, and loading. The video also covers monitoring and alerting techniques to ensure pipeline performance.

Courses:

  • Introduction to Apache Kafka: This course covers the basics of Apache Kafka, a crucial component for real-time data streaming in this project. It provides an overview of Kafka’s architecture and how to implement it in data pipelines.
  • Streaming Concepts: This course introduces the fundamental concepts of data streaming, including real-time processing and event-driven architectures. It’s an ideal resource for gaining foundational knowledge before building real-time pipelines.

Summary Table of Data Engineering Projects

Here is a summary of the data engineering projects above for quick reference:

| Project | Level | Skills | Tools |
| --- | --- | --- | --- |
| ETL pipeline with open data | Beginner | Reading CSV data with Python and Pandas, cleaning data, loading data into BigQuery | Python, BigQuery |
| Weather data pipeline | Beginner | Python to write pipeline applications, API connections, data cleaning | Python, PostgreSQL |
| London transport analysis | Beginner | Working with large datasets, working with data warehouses | BigQuery |
| Performing a code review | Intermediate | Code review, evaluating code, finding bugs | Coding skills |
| Building a retail data pipeline | Intermediate | Data pipelines, ETL | Python, SQL |
| Factors influencing student performance | Intermediate | SQL queries for data analysis | SQL |
| Cleaning a dataset with PySpark | Advanced | Data cleaning, transformation, and formatting with PySpark | PySpark, Python |
| Data modeling with dbt and BigQuery | Advanced | Using dbt for SQL-based transformations, transferring data across platforms | dbt, BigQuery |
| Airflow and Snowflake ETL using S3 storage | Advanced | Creating complex ETL pipelines with Airflow DAGs, moving data from Airflow to Snowflake | Airflow, Snowflake, Tableau |
| Reddit ETL in AWS using Airflow | Advanced | Connecting to APIs, transforming data with PostgreSQL, moving data through S3, AWS Glue, Athena, and Redshift | Airflow, PostgreSQL, AWS S3, AWS Glue, AWS Athena, Amazon Redshift |
| Building a real-time data pipeline with PySpark, Kafka, and Redshift | Advanced | Real-time data ingestion, processing, monitoring, and loading data into a data warehouse | PySpark, Kafka, Amazon Redshift |

Conclusion

This article presented excellent projects to help you practice your data engineering skills. 

Focus on understanding the fundamental concepts behind how each tool works; this will enable you to use these projects in your job search and explain them successfully. Be sure to review any concepts you find challenging.

Along with building a project portfolio, obtaining a data engineering certification can be a valuable addition to your resume, as it demonstrates your commitment to completing relevant coursework!

FAQs

What skills do I need to start working on data engineering projects?

For beginner-level projects, basic programming knowledge in Python or SQL and an understanding of data basics (like cleaning and transforming) are helpful. Intermediate and advanced projects often require knowledge of specific tools, like Apache Airflow, Kafka, or cloud-based data warehouses like BigQuery or Redshift.

How can data engineering projects help in building my portfolio?

Completing data engineering projects allows you to showcase your ability to work with data at scale, build robust pipelines, and manage databases. Projects that cover end-to-end workflows (data ingestion to analysis) demonstrate practical skills to potential employers and are highly valuable for a portfolio.

Are cloud tools like AWS and Google BigQuery necessary for data engineering projects?

While not strictly necessary, cloud tools are highly relevant to modern data engineering. Many companies rely on cloud-based platforms for scalability and accessibility, so learning tools like AWS, Google BigQuery, and Snowflake can give you an edge and align your skills with industry needs.

How do I choose the right data engineering project for my skill level?

Start by assessing your knowledge and comfort with core tools. For beginners, projects like data cleaning or building a basic ETL pipeline in Python are great. Intermediate projects might involve databases and more complex queries, while advanced projects often integrate multiple tools (e.g., PySpark, Kafka, Redshift) for real-time or large-scale data processing.


Author: Tim Lu

I am a data scientist with experience in spatial analysis, machine learning, and data pipelines. I have worked with GCP, Hadoop, Hive, Snowflake, Airflow, and other data science/engineering processes.
