
Great Expectations Tutorial: Validating Data with Python

Learn how to validate your data with Great Expectations in Python with this end-to-end tutorial!
Nov 28, 2024  · 20 min read

Data quality and consistency are like the foundation of a house—without a solid base, everything built on top risks collapsing. This is where data validation plays an important role. Data validation helps you make sure that your data is accurate, consistent, and reliable.

Great Expectations is an open-source data validation tool that allows you to identify data issues early and ensures your data meets the required quality standards.

In this guide, we will walk you through the process of using Great Expectations for data validation, with a practical end-to-end example to help you get started!

What is Great Expectations?

Great Expectations (GX) is an open-source framework that has become popular for managing and automating data validation in modern data pipelines.

Its Python-based framework is designed to help data teams guarantee the quality and consistency of their data. Users can define "expectations"—rules or tests that describe what valid data should look like—that automatically validate whether the data meets these standards.

Some benefits of Great Expectations include:

  • Automated data validation –  Great Expectations automates the process of validating data, reducing manual effort and minimizing the risk of errors. It ensures that data consistently meets predefined standards.
  • Integration with data pipelines – It easily integrates with various data sources and platforms, including SQL databases, cloud storage, and ETL tools, allowing for data validation across different stages of your pipeline.
  • Clear, actionable validation results – The tool provides transparent validation results, making it easy to spot data quality issues and address them quickly.
  • Data documentation – Great Expectations can generate detailed, accessible documentation of your data validation processes, helping teams align on quality standards and providing a reference for future use.
  • Scalability and flexibility – As an open-source tool, Great Expectations is highly customizable and can scale with your data validation needs, offering flexibility to adjust to various use cases without high costs.

Now, let’s look at an end-to-end example!


Setting Up Great Expectations

In this tutorial, you'll learn how to use GX Core, the open-source version of Great Expectations, to validate a Pandas DataFrame. We'll walk through setting up a context, registering a Pandas data source, defining expectations, and validating data batches.

Note: We recommend you follow along with the DataLab notebook, but you can also create your own Python script.

Prerequisites

  • Python 3.9 to 3.12 installed.
  • A virtual environment (highly recommended, to avoid dependency conflicts; setting one up is beyond the scope of this article).
  • A sample dataset.

Note: If using the provided DataLab notebook, these prerequisites have already been satisfied. Feel free to skip them.

Use the following command to install GX via pip:

pip install great_expectations

This command installs the core package and all necessary dependencies.

Great Expectations requires a data context to manage configurations. We use an ephemeral data context to avoid persisting configurations.

import great_expectations as gx

# Get the Ephemeral Data Context
context = gx.get_context()
assert type(context).__name__ == "EphemeralDataContext"

Creating Your First Data Validation Suite

Now that GX is set up, let's create a data validation suite.

A data source connects Great Expectations to your data, while a data asset represents a specific subset of data (e.g., a table, DataFrame, or file).

In this case, we will prepare everything to connect to a DataFrame called inventory_parts_df. The sample dataset is available in the provided DataLab, where it gets created once we run the notebook's SQL block.

If you’re not using DataLab, create your own DataFrame with sample data.
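For instance, here is a minimal, made-up stand-in you could use. It keeps the two columns validated later in this tutorial (inventory_id and part_num) and adds a quantity column purely for illustration:

import pandas as pd

# A small, made-up stand-in for the inventory parts data.
# Only inventory_id and part_num are used by the expectations in this tutorial;
# quantity is included purely for illustration.
inventory_parts_df = pd.DataFrame(
    {
        "inventory_id": [1, 1, 2, 3, 3],
        "part_num": ["3001", "3002", "3001", "3003", "3004"],
        "quantity": [4, 2, 1, 6, 2],
    }
)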

Now, create your data source and asset:

# Add a Pandas Data Source
data_source = context.data_sources.add_pandas(name="inventory_parts")
# Add a Data Asset to the Data Source
data_asset = data_source.add_dataframe_asset(name="inventory_parts_asset")

A batch definition identifies and organizes your data for validation. Here, we add a batch definition that covers the entire DataFrame:

# Define the Batch Definition name
batch_definition_name = "inventory_parts_batch"
# Add the Batch Definition
batch_definition = data_asset.add_batch_definition_whole_dataframe(batch_definition_name)
assert batch_definition.name == batch_definition_name

A batch is a collection of data tied to a batch definition. To validate data, you'll need to retrieve and link the batch to your DataFrame, in this case inventory_parts_df:

# Define the Batch Parameters
batch_parameters = {"dataframe": inventory_parts_df}
# Retrieve the Batch
batch = batch_definition.get_batch(batch_parameters=batch_parameters)

Expectations are rules for validating data. In this example, we'll define the following simple expectations:

  1. Ensure inventory_id values are non-null.
  2. Ensure part_num values are unique.

# Create an Expectation Suite
expectation_suite_name = "inventory_parts_suite"
suite = gx.ExpectationSuite(name=expectation_suite_name)
# Add Expectations
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="inventory_id")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeUnique(column="part_num")
)
# Add the Expectation Suite to the Context
context.suites.add(suite)

You can explore all the available expectations in the Expectation Gallery. We encourage you to add a few more!
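For instance, here is a minimal sketch of one more expectation you could append to the suite. It assumes the DataFrame also has a numeric quantity column and uses arbitrary bounds, so adjust both to your own data:

# Hypothetical example: require quantity values to fall within a plausible range
# ("quantity" is a placeholder column name; the bounds are arbitrary)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="quantity", min_value=1, max_value=1000
    )
)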

After defining the expectations, GX outputs the expectation suite configuration:

{
  "name": "inventory_parts_suite",
  "id": "b2de0b69-0869-4163-8dde-6c09884483f7",
  "expectations": [
    {
      "type": "expect_column_values_to_not_be_null",
      "kwargs": {
        "column": "inventory_id"
      },
      "meta": {},
      "id": "53d6c42a-d190-412f-a113-783b706531f4"
    },
    {
      "type": "expect_column_values_to_be_unique",
      "kwargs": {
        "column": "part_num"
      },
      "meta": {},
      "id": "362a2bdc-616d-4b3a-b7f0-c73808caee78"
    }
  ],
  "meta": {
    "great_expectations_version": "1.2.4"
  },
  "notes": null
}

The suite includes the following details:

  1. Suite name and ID: A unique name (inventory_parts_suite) and identifier to track and manage the suite.
  2. Expectations: Each rule specifies:
    • The type of check (e.g., ensuring a column has no null values or unique entries).
    • Parameters, such as the column being validated.
    • Metadata and a unique ID for each expectation, allowing for easier tracking and customization.
  3. Metadata: Version information for Great Expectations, ensuring compatibility with the tool.
  4. Notes: A placeholder for adding descriptive comments about the suite (optional).

This structured output acts as both documentation and a reusable configuration for validating your dataset so your expectations are clearly defined, traceable, and ready for future use.
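Because the suite is now stored in the context, you can retrieve it later by name instead of redefining it. A minimal sketch, assuming the same context object from above and the suites.get accessor available in GX 1.x:

# Retrieve the stored suite by name from the data context
retrieved_suite = context.suites.get(name="inventory_parts_suite")
print(retrieved_suite.name)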

Validating the Data

Finally, validate the batch against the defined expectations and evaluate the results.

# Validate the Data Against the Suite
validation_results = batch.validate(suite)
# Evaluate the Results
print(validation_results)

After running the validation, Great Expectations provides a detailed report on whether the dataset meets the defined expectations:

{
  "success": false,
  "results": [
    {
      "success": true,
      "expectation_config": {
        "type": "expect_column_values_to_not_be_null",
        "kwargs": {
          "batch_id": "inventory_parts-inventory_parts_asset",
          "column": "inventory_id"
        },
        "meta": {},
        "id": "53d6c42a-d190-412f-a113-783b706531f4"
      },
      "result": {
        "element_count": 580069,
        "unexpected_count": 0,
        "unexpected_percent": 0.0,
        "partial_unexpected_list": [],
        "partial_unexpected_counts": [],
        "partial_unexpected_index_list": []
      },
      "meta": {},
      "exception_info": {
        "raised_exception": false,
        "exception_traceback": null,
        "exception_message": null
      }
    },
    {
      "success": false,
      "expectation_config": {
        "type": "expect_column_values_to_be_unique",
        "kwargs": {
          "batch_id": "inventory_parts-inventory_parts_asset",
          "column": "part_num"
        },
        "meta": {},
        "id": "362a2bdc-616d-4b3a-b7f0-c73808caee78"
      },
      "result": {
        "element_count": 580069,
        "unexpected_count": 568352,
        "unexpected_percent": 97.98006788847535,
        "partial_unexpected_list": [
          "48379c01",
          "paddle",
          "11816pr0005",
          "2343",
          "3003",
          "30176",
          "3020",
          "3022",
          "3023",
          "30357",
          "3039",
          "3062b",
          "3068b",
          "3069b",
          "3069b",
          "33291",
          "33291",
          "3795",
          "3941",
          "3960"
        ],
        "missing_count": 0,
        "missing_percent": 0.0,
        "unexpected_percent_total": 97.98006788847535,
        "unexpected_percent_nonmissing": 97.98006788847535,
        "partial_unexpected_counts": [
          {
            "value": "3069b",
            "count": 2
          },
          {
            "value": "33291",
            "count": 2
          },
          {
            "value": "11816pr0005",
            "count": 1
          },
          {
            "value": "2343",
            "count": 1
          },
          {
            "value": "3003",
            "count": 1
          },
          {
            "value": "30176",
            "count": 1
          },
          {
            "value": "3020",
            "count": 1
          },
          {
            "value": "3022",
            "count": 1
          },
          {
            "value": "3023",
            "count": 1
          },
          {
            "value": "30357",
            "count": 1
          },
          {
            "value": "3039",
            "count": 1
          },
          {
            "value": "3062b",
            "count": 1
          },
          {
            "value": "3068b",
            "count": 1
          },
          {
            "value": "3795",
            "count": 1
          },
          {
            "value": "3941",
            "count": 1
          },
          {
            "value": "3960",
            "count": 1
          },
          {
            "value": "48379c01",
            "count": 1
          },
          {
            "value": "paddle",
            "count": 1
          }
        ],
        "partial_unexpected_index_list": [
          0,
          3,
          4,
          5,
          6,
          7,
          8,
          9,
          10,
          11,
          12,
          13,
          14,
          15,
          16,
          17,
          18,
          19,
          20,
          21
        ]
      },
      "meta": {},
      "exception_info": {
        "raised_exception": false,
        "exception_traceback": null,
        "exception_message": null
      }
    }
  ],
  "suite_name": "inventory_parts_suite",
  "suite_parameters": {},
  "statistics": {
    "evaluated_expectations": 2,
    "successful_expectations": 1,
    "unsuccessful_expectations": 1,
    "success_percent": 50.0
  },
  "meta": {
    "great_expectations_version": "1.2.4",
    "batch_spec": {
      "batch_data": "PandasDataFrame"
    },
    "batch_markers": {
      "ge_load_time": "20241129T122532.416424Z",
      "pandas_data_fingerprint": "84a1e1939091fcf54324910def3b89cd"
    },
    "active_batch_definition": {
      "datasource_name": "inventory_parts",
      "data_connector_name": "fluent",
      "data_asset_name": "inventory_parts_asset",
      "batch_identifiers": {
        "dataframe": "<DATAFRAME>"
      }
    }
  },
  "id": null
}

This report details the quality of your data, highlighting successes and failures. Here's a simplified explanation of the results:

Overall validation: The validation result was partially successful: 50% of the expectations passed, and 50% failed. A failed expectation indicates a data quality issue that needs attention. In this case, one column did not meet the defined rule.

Expectation 1: inventory_id should have no missing values

  • Result: Passed
  • Explanation: Every value in the inventory_id column is present, with no null or missing entries. This indicates good data completeness for this column.

Expectation 2: part_num should have unique values

  • Result: Failed
  • Explanation: 97.98% of the values in the part_num column are duplicated (they appear more than once), so only a small fraction of the values are unique.
  • Highlights:
    • Example duplicate values include "3069b" and "33291".
    • The tool also shows how frequently these duplicates appear and their row positions, making it easier to locate and fix the issues.

Of course, this is just a sample dataset, and we purposefully included a passing and a failing expectation so you can see both validation results.
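Beyond reading the printed report, you can also inspect the result object programmatically, which is useful when a script needs to branch on the outcome. Here is a short sketch, assuming the validation_results object from above and the attribute names that GX 1.x result objects expose (success, statistics, results, expectation_config):

# Overall pass/fail for the whole suite
print("Suite passed:", validation_results.success)

# Summary statistics and a per-expectation breakdown
print(validation_results.statistics)
for result in validation_results.results:
    status = "passed" if result.success else "failed"
    print(result.expectation_config.type, "->", status)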

That's it! You've successfully run end-to-end data validations.

Integrating Great Expectations into Data Pipelines

In a production setting, validations must be embedded directly into the workflow to continuously monitor data quality at every stage. 

In this section, we’ll discuss how you can integrate Great Expectations into your data pipelines.

These are examples to give you an idea, and extra configurations not included here may be required. Check out each tool's documentation for up-to-date syntax!

Integration with ETL tools

Integrating Great Expectations with popular ETL tools like Apache Airflow or Prefect is relatively straightforward. Embedding validation steps directly into the ETL processes will allow you to catch and address data issues in real time before they affect downstream analysis.

Let’s walk through a simple example of integrating Great Expectations with Prefect to run data validation as part of an automated ETL workflow:

import pandas as pd
from prefect import flow, task
import great_expectations as gx

# Define a task to run Great Expectations validation
@task
def validate_data():
    # Load the data (placeholder path) and get an ephemeral GX context
    df = pd.read_csv("path/to/your/datafile.csv")
    context = gx.get_context()

    # Register a pandas data source, asset, and whole-DataFrame batch
    data_source = context.data_sources.add_pandas(name="your_datasource")
    data_asset = data_source.add_dataframe_asset(name="your_asset")
    batch_definition = data_asset.add_batch_definition_whole_dataframe("your_batch")
    batch = batch_definition.get_batch(batch_parameters={"dataframe": df})

    # Apply the expectation suite (placeholder column name)
    suite = gx.ExpectationSuite(name="your_expectation_suite")
    suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column="your_column"))
    results = batch.validate(suite)

    # Check validation results and raise an alert if validation fails
    if not results.success:
        raise ValueError("Data validation failed!")

# Define your ETL flow
@flow(name="ETL_with_GE_Validation")
def etl_with_ge_validation():
    validate_data()

# Execute the flow
etl_with_ge_validation()

In this example, we define a Prefect flow with a task for running Great Expectations validation.

The validate_data() task loads the Great Expectations context, retrieves the data batch, and applies the expectation suite. 

If the data does not meet the validation criteria, the task raises an alert, stopping the workflow and preventing downstream errors.

Continuous data validation

You can schedule validation jobs using various tools, such as cron jobs on Unix-based systems or managed services like Apache Airflow. For this example, we’ll demonstrate how to schedule validation runs using Airflow, which is well-suited for orchestrating data pipelines.

Here’s how you can set up an Airflow DAG (Directed Acyclic Graph) to run Great Expectations validations daily:

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
import great_expectations as gx

# Define the DAG and set the schedule to run daily
default_args = {
    'owner': 'airflow',
    'start_date': datetime(2024, 1, 1),
    'retries': 1,
}
dag = DAG(
    'great_expectations_validation',
    default_args=default_args,
    schedule_interval='@daily',  # Runs once a day
)

# Define the function to run the validation
def run_validation():
    # Load the data (placeholder path) and get an ephemeral GX context
    df = pd.read_csv("path/to/your/datafile.csv")
    context = gx.get_context()

    # Register a pandas data source, asset, and whole-DataFrame batch
    data_source = context.data_sources.add_pandas(name="your_datasource")
    data_asset = data_source.add_dataframe_asset(name="your_asset")
    batch_definition = data_asset.add_batch_definition_whole_dataframe("your_batch")
    batch = batch_definition.get_batch(batch_parameters={"dataframe": df})

    # Apply the expectation suite (placeholder column name) and fail the task on bad data
    suite = gx.ExpectationSuite(name="your_expectation_suite")
    suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column="your_column"))
    results = batch.validate(suite)
    if not results.success:
        raise ValueError("Data validation failed!")
    return results

# Set up the task in Airflow
validation_task = PythonOperator(
    task_id='run_great_expectations_validation',
    python_callable=run_validation,
    dag=dag,
)

In this example, we define a DAG that schedules a validation run once a day (@daily). 

The run_validation() function executes the validation by loading the Great Expectations context and running the defined expectation suite against the data.

Best Practices for Data Validation with Great Expectations

Following best practices pays off in scalability and efficiency, and data validation with Great Expectations is no different.

Start small and iterate

Begin with foundational data quality checks and gradually expand. It’s better to focus on basic expectations initially, as this helps to avoid overcomplicating the process, which makes for a smoother integration and easier troubleshooting. As your understanding of the dataset improves, you can add more complex validations.

Collaborate across teams

Data quality is not just a technical concern. Collaborate across business teams to define expectations and ensure the implemented validation aligns with the underlying business logic and goals. This cross-functional approach guarantees that data serves its intended purpose and meets the requirements of all stakeholders.

Automate where possible

Automate the process wherever feasible to integrate data validation into data pipelines. Integrating automated validation checks enables continuous monitoring of data quality without manual intervention, which significantly improves efficiency.

Conclusion

Great work! You’ve learned how to configure and validate data in Great Expectations. These techniques will help maintain high data quality and transparency in your workflows.

To continue building your skills, check out DataCamp's data engineering courses and tutorials.


FAQs

How does Great Expectations compare to other data validation tools?

Great Expectations is open-source, flexible, and integrates well with modern data pipelines. It stands out for its extensive library of expectations and strong documentation.

Do I need to know Python to use Great Expectations?

Basic Python knowledge is helpful, since GX Core is configured through Python code, but its readable, declarative API and extensive documentation make it approachable even if you are not an experienced programmer.

What types of data sources does Great Expectations support?

Great Expectations supports a wide range of data sources, including:

  • Relational databases like PostgreSQL, MySQL, and SQL Server.
  • Cloud storage solutions like AWS S3, Google Cloud Storage, and Azure Blob Storage.
  • File formats like CSV, Parquet, and Excel.
  • Big data frameworks like Apache Spark and Databricks.

You can easily connect Great Expectations to these sources using the appropriate configuration for your data source.

Can I use Great Expectations with streaming data?

Great Expectations is primarily designed for batch data validation. While it does not natively support streaming data pipelines, you can integrate it into frameworks like Apache Kafka or Spark Structured Streaming by validating snapshots or micro-batches of data periodically.
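As a rough sketch of the micro-batch approach, assuming a PySpark Structured Streaming DataFrame called streaming_df and reusing the GX Core pandas flow from this tutorial (the column and data source names are placeholders):

import great_expectations as gx

def validate_micro_batch(batch_df, batch_id):
    # Convert the Spark micro-batch to pandas and reuse the GX Core flow
    pdf = batch_df.toPandas()
    context = gx.get_context()
    data_source = context.data_sources.add_pandas(name=f"stream_source_{batch_id}")
    data_asset = data_source.add_dataframe_asset(name="micro_batch_asset")
    batch_definition = data_asset.add_batch_definition_whole_dataframe("micro_batch")
    batch = batch_definition.get_batch(batch_parameters={"dataframe": pdf})

    suite = gx.ExpectationSuite(name="streaming_suite")
    suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column="event_id"))  # placeholder column
    results = batch.validate(suite)
    if not results.success:
        # Handle failures however fits your pipeline: log, alert, or divert the batch
        print(f"Validation failed for micro-batch {batch_id}")

# streaming_df is assumed to be an existing Spark Structured Streaming DataFrame
query = streaming_df.writeStream.foreachBatch(validate_micro_batch).start()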

Is it possible to version control expectations and validation results?

Yes, you can version control expectations and configurations by storing them as YAML or JSON files in a Git repository. For validation results, you can set up a database or file-based store to track results over time and integrate them into your CI/CD pipelines for continuous monitoring.
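For example, a rough sketch of persisting the suite built earlier as a JSON file for Git, assuming the suite object exposes a to_json_dict() method as GX objects generally do:

import json

# Serialize the expectation suite to a plain dict and write it to a Git-tracked file
with open("inventory_parts_suite.json", "w") as f:
    json.dump(suite.to_json_dict(), f, indent=2)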

How does Great Expectations handle schema evolution in datasets?

Great Expectations handles schema evolution through its flexible expectations framework. If your schema changes, you can:

  • Use expect_table_columns_to_match_set or similar expectations to validate column names dynamically (see the sketch after this list).
  • Modify or create new expectation suites to adapt to the new schema.
  • Leverage schema inference tools to automatically update expectations for newly added columns.
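For instance, here is a minimal sketch of the first approach, using this tutorial's column names as placeholders and exact_match=False so newly added columns do not fail the check:

# Require the known columns to be present while tolerating additional ones
suite.add_expectation(
    gx.expectations.ExpectTableColumnsToMatchSet(
        column_set=["inventory_id", "part_num"],
        exact_match=False,
    )
)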

Authors: Kurtis Pykes and Thalia Barrera

Thalia Barrera is a Senior Data Science Editor at DataCamp with a master’s in Computer Science and over a decade of experience in software and data engineering. Thalia enjoys simplifying tech concepts for engineers and data scientists through blog posts, tutorials, and video courses.
