Great Expectations Tutorial: Validating Data with Python
Data quality and consistency are like the foundation of a house: without a solid base, everything built on top risks collapsing. This is where data validation comes in, helping you ensure that your data is accurate, consistent, and reliable.
Great Expectations is an open-source data validation tool that allows you to identify data issues early and ensures your data meets the required quality standards.
In this guide, we will walk you through the process of using Great Expectations for data validation, with a practical end-to-end example to help you get started!
What is Great Expectations?
Great Expectations (GX) is an open-source framework that has become popular for managing and automating data validation in modern data pipelines.
Its Python-based framework is designed to help data teams guarantee the quality and consistency of their data. Users can define "expectations"—rules or tests that describe what valid data should look like—that automatically validate whether the data meets these standards.
Some benefits of Great Expectations include:
- Automated data validation – Great Expectations automates the process of validating data, reducing manual effort and minimizing the risk of errors. It ensures that data consistently meets predefined standards.
- Integration with data pipelines – It easily integrates with various data sources and platforms, including SQL databases, cloud storage, and ETL tools, allowing for data validation across different stages of your pipeline.
- Clear, actionable validation results – The tool provides transparent validation results, making it easy to spot data quality issues and address them quickly.
- Data documentation – Great Expectations can generate detailed, accessible documentation of your data validation processes, helping teams align on quality standards and providing a reference for future use.
- Scalability and flexibility – As an open-source tool, Great Expectations is highly customizable and can scale with your data validation needs, offering flexibility to adjust to various use cases without high costs.
Now, let’s look at an end-to-end example!
Setting Up Great Expectations
In this tutorial, you'll learn how to use GX Core, the open-source version of Great Expectations, to validate a Pandas DataFrame. We'll walk through setting up a context, registering a Pandas data source, defining expectations, and validating data batches.
Note: We recommend you follow along with the DataLab notebook, but you can also create your own Python script.
1. Installing Great Expectations
Prerequisites
- Python 3.9 to 3.12 installed.
- To avoid dependency conflicts, we highly recommend installing Great Expectations inside a virtual environment (setting one up is beyond the scope of this article).
- A sample dataset.
Note: If using the provided DataLab notebook, these prerequisites have already been satisfied. Feel free to skip them.
Use the following command to install GX via pip:
pip install great_expectations
This command installs the core package and all necessary dependencies.
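To confirm the installation worked, you can print the package version (a quick sanity check; your version may differ from the one shown later in this tutorial):

import great_expectations as gx

# Print the installed version to verify the setup
print(gx.__version__)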
2. Initializing the data context
Great Expectations requires a data context to manage configurations. Here we use an ephemeral data context, which lives only in memory, so nothing is persisted to disk.
import great_expectations as gx
# Get the Ephemeral Data Context
context = gx.get_context()
assert type(context).__name__ == "EphemeralDataContext"
Creating Your First Data Validation Suite
Now that GX is set up, let's create a data validation suite.
1. Connecting to a data source and creating a data asset
A data source connects Great Expectations to your data, while a data asset represents a specific subset of data (e.g., a table, DataFrame, or file).
In this case, we will prepare everything to connect to a DataFrame called inventory_parts_df. The sample dataset is available in the provided DataLab, where it is created by running the SQL block. If you’re not using DataLab, create your own DataFrame with sample data.
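Here is a minimal sketch of such a DataFrame with made-up rows. Only the inventory_id and part_num columns are required by the expectations in this tutorial; the quantity column is illustrative:

import pandas as pd

# Hypothetical sample rows mimicking the LEGO inventory_parts table
inventory_parts_df = pd.DataFrame({
    "inventory_id": [1, 1, 2, 3, 4],
    "part_num": ["3069b", "33291", "3069b", "3023", "paddle"],
    "quantity": [4, 2, 1, 10, 1],
})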
Now, create your data source and asset:
# Add a Pandas Data Source
data_source = context.data_sources.add_pandas(name="inventory_parts")
# Add a Data Asset to the Data Source
data_asset = data_source.add_dataframe_asset(name="inventory_parts_asset")
2. Adding a batch definition
A batch definition identifies and organizes your data for validation. Here, we add a batch definition that covers the entire DataFrame:
# Define the Batch Definition name
batch_definition_name = "inventory_parts_batch"
# Add the Batch Definition
batch_definition = data_asset.add_batch_definition_whole_dataframe(batch_definition_name)
assert batch_definition.name == batch_definition_name
3. Retrieving a batch
A batch is a collection of data tied to a batch definition. To validate data, you'll need to retrieve the batch and link it to your DataFrame, in this case inventory_parts_df:
# Define the Batch Parameters
batch_parameters = {"dataframe": inventory_parts_df}
# Retrieve the Batch
batch = batch_definition.get_batch(batch_parameters=batch_parameters)
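Optionally, you can sanity-check that the batch is wired up correctly by previewing its first rows:

# Preview the first few rows of the batch
print(batch.head())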
4. Creating a suite and defining expectations
Expectations are rules for validating data. In this example, we'll define the following simple expectations:
- Ensure inventory_id values are non-null.
- Ensure part_num values are unique.
# Create an Expectation Suite
expectation_suite_name = "inventory_parts_suite"
suite = gx.ExpectationSuite(name=expectation_suite_name)
# Add Expectations
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="inventory_id")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeUnique(column="part_num")
)
# Add the Expectation Suite to the Context
context.suites.add(suite)
You can explore all the available expectations in the Expectation Gallery. We encourage you to add a few more!
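For instance, assuming your DataFrame has a numeric quantity column, you could also require non-negative quantities. The bound is illustrative, and the suite output shown below reflects only the two expectations defined above:

suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(column="quantity", min_value=0)
)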
After defining the expectations, GX outputs the expectation suite configuration:
{
  "name": "inventory_parts_suite",
  "id": "b2de0b69-0869-4163-8dde-6c09884483f7",
  "expectations": [
    {
      "type": "expect_column_values_to_not_be_null",
      "kwargs": {
        "column": "inventory_id"
      },
      "meta": {},
      "id": "53d6c42a-d190-412f-a113-783b706531f4"
    },
    {
      "type": "expect_column_values_to_be_unique",
      "kwargs": {
        "column": "part_num"
      },
      "meta": {},
      "id": "362a2bdc-616d-4b3a-b7f0-c73808caee78"
    }
  ],
  "meta": {
    "great_expectations_version": "1.2.4"
  },
  "notes": null
}
The suite includes the following details:
- Suite name and ID: A unique name (inventory_parts_suite) and identifier to track and manage the suite.
- Expectations: Each rule specifies:
  - The type of check (e.g., ensuring a column has no null values or unique entries).
  - Parameters, such as the column being validated.
  - Metadata and a unique ID for each expectation, allowing for easier tracking and customization.
- Metadata: Version information for Great Expectations, ensuring compatibility with the tool.
- Notes: An optional placeholder for descriptive comments about the suite.
This structured output acts as both documentation and a reusable configuration for validating your dataset, so your expectations are clearly defined, traceable, and ready for future use.
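Because the suite was added to the context, you can retrieve it later by name instead of redefining it, which is what makes it reusable:

# Fetch the stored suite by name for reuse in later validation runs
suite = context.suites.get("inventory_parts_suite")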
5. Validating the data
Finally, validate the batch against the defined expectations and evaluate the results.
# Validate the Data Against the Suite
validation_results = batch.validate(suite)
# Evaluate the Results
print(validation_results)
After running the validation, Great Expectations provides a detailed report on whether the dataset meets the defined expectations:
{
  "success": false,
  "results": [
    {
      "success": true,
      "expectation_config": {
        "type": "expect_column_values_to_not_be_null",
        "kwargs": {
          "batch_id": "inventory_parts-inventory_parts_asset",
          "column": "inventory_id"
        },
        "meta": {},
        "id": "53d6c42a-d190-412f-a113-783b706531f4"
      },
      "result": {
        "element_count": 580069,
        "unexpected_count": 0,
        "unexpected_percent": 0.0,
        "partial_unexpected_list": [],
        "partial_unexpected_counts": [],
        "partial_unexpected_index_list": []
      },
      "meta": {},
      "exception_info": {
        "raised_exception": false,
        "exception_traceback": null,
        "exception_message": null
      }
    },
    {
      "success": false,
      "expectation_config": {
        "type": "expect_column_values_to_be_unique",
        "kwargs": {
          "batch_id": "inventory_parts-inventory_parts_asset",
          "column": "part_num"
        },
        "meta": {},
        "id": "362a2bdc-616d-4b3a-b7f0-c73808caee78"
      },
      "result": {
        "element_count": 580069,
        "unexpected_count": 568352,
        "unexpected_percent": 97.98006788847535,
        "partial_unexpected_list": [
          "48379c01", "paddle", "11816pr0005", "2343", "3003", "30176",
          "3020", "3022", "3023", "30357", "3039", "3062b", "3068b",
          "3069b", "3069b", "33291", "33291", "3795", "3941", "3960"
        ],
        "missing_count": 0,
        "missing_percent": 0.0,
        "unexpected_percent_total": 97.98006788847535,
        "unexpected_percent_nonmissing": 97.98006788847535,
        "partial_unexpected_counts": [
          { "value": "3069b", "count": 2 },
          { "value": "33291", "count": 2 },
          { "value": "11816pr0005", "count": 1 },
          { "value": "2343", "count": 1 },
          { "value": "3003", "count": 1 },
          { "value": "30176", "count": 1 },
          { "value": "3020", "count": 1 },
          { "value": "3022", "count": 1 },
          { "value": "3023", "count": 1 },
          { "value": "30357", "count": 1 },
          { "value": "3039", "count": 1 },
          { "value": "3062b", "count": 1 },
          { "value": "3068b", "count": 1 },
          { "value": "3795", "count": 1 },
          { "value": "3941", "count": 1 },
          { "value": "3960", "count": 1 },
          { "value": "48379c01", "count": 1 },
          { "value": "paddle", "count": 1 }
        ],
        "partial_unexpected_index_list": [
          0, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21
        ]
      },
      "meta": {},
      "exception_info": {
        "raised_exception": false,
        "exception_traceback": null,
        "exception_message": null
      }
    }
  ],
  "suite_name": "inventory_parts_suite",
  "suite_parameters": {},
  "statistics": {
    "evaluated_expectations": 2,
    "successful_expectations": 1,
    "unsuccessful_expectations": 1,
    "success_percent": 50.0
  },
  "meta": {
    "great_expectations_version": "1.2.4",
    "batch_spec": {
      "batch_data": "PandasDataFrame"
    },
    "batch_markers": {
      "ge_load_time": "20241129T122532.416424Z",
      "pandas_data_fingerprint": "84a1e1939091fcf54324910def3b89cd"
    },
    "active_batch_definition": {
      "datasource_name": "inventory_parts",
      "data_connector_name": "fluent",
      "data_asset_name": "inventory_parts_asset",
      "batch_identifiers": {
        "dataframe": "<DATAFRAME>"
      }
    }
  },
  "id": null
}
This report details the quality of your data, highlighting successes and failures. Here's a simplified explanation of the results:
Overall validation: The run was partially successful: 50% of the expectations passed and 50% failed. A failed expectation indicates a data quality issue that needs attention; in this case, one column did not meet its defined rule.
Expectation 1: inventory_id should have no missing values
- Result: Passed
- Explanation: Every value in the inventory_id column is present, with no null or missing entries. This indicates good data completeness for this column.
Expectation 2: part_num should have unique values
- Result: Failed
- Explanation: 97.98% of the values in the part_num column are duplicates, meaning only a small fraction of entries are unique.
- Highlights:
  - Example duplicate values include "3069b" and "33291".
  - The report also shows how frequently these duplicates appear and their row positions, making it easier to locate and fix the issues.
Of course, this is just a sample dataset, and we purposefully included a passing and a failing expectation so you can see both validation results.
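Beyond reading the printed report, you can inspect the result object programmatically; the field names below match the JSON output above:

# Summarize the run and list each failing expectation
print(validation_results.success)  # False: at least one expectation failed
for result in validation_results["results"]:
    if not result["success"]:
        config = result["expectation_config"]
        print(config["type"], "on column", config["kwargs"]["column"])
        print("Unexpected values:", result["result"]["unexpected_percent"], "%")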
That's it! You've successfully run end-to-end data validations.
Integrating Great Expectations into Data Pipelines
In a production setting, validations must be embedded directly into the workflow to continuously monitor data quality at every stage.
In this section, we’ll discuss how you can integrate Great Expectations into your data pipelines.
These examples are meant to give you an idea; extra configuration not shown here may be required. The snippets below use the GX Core fluent API from this tutorial together with current Prefect and Airflow syntax, and the names, paths, and columns are placeholders. Check out each tool's documentation for up-to-date syntax!
Integration with ETL tools
Integrating Great Expectations with popular ETL tools like Apache Airflow or Prefect is relatively straightforward. Embedding validation steps directly into the ETL processes will allow you to catch and address data issues in real time before they affect downstream analysis.
Let’s walk through a simple example of integrating Great Expectations with Prefect to run data validation as part of an automated ETL workflow:
from prefect import flow, task
import great_expectations as gx
import pandas as pd

# Define a task to run Great Expectations validation
@task
def validate_data():
    # Build an in-memory GX context, data source, and batch definition
    context = gx.get_context()
    data_source = context.data_sources.add_pandas(name="etl_datasource")
    data_asset = data_source.add_dataframe_asset(name="etl_asset")
    batch_definition = data_asset.add_batch_definition_whole_dataframe("etl_batch")

    # Load the data to validate (the CSV path is a placeholder)
    df = pd.read_csv("path/to/your/datafile.csv")
    batch = batch_definition.get_batch(batch_parameters={"dataframe": df})

    # Define the expectations to enforce (placeholder column name)
    suite = gx.ExpectationSuite(name="etl_suite")
    suite.add_expectation(
        gx.expectations.ExpectColumnValuesToNotBeNull(column="your_column")
    )

    # Check validation results and raise an alert if validation fails
    results = batch.validate(suite)
    if not results.success:
        raise ValueError("Data validation failed!")

# Define your ETL flow
@flow(name="ETL_with_GE_Validation")
def etl_flow():
    validate_data()

# Execute the flow
if __name__ == "__main__":
    etl_flow()
In this example, we define a Prefect flow with a task for running Great Expectations validation. The validate_data() task sets up the Great Expectations context, loads the data batch, and applies the expectation suite. If the data does not meet the validation criteria, the task raises an error, stopping the workflow and preventing downstream errors.
Continuous data validation
You can schedule validation jobs using various tools, such as cron jobs on Unix-based systems or managed services like Apache Airflow. For this example, we’ll demonstrate how to schedule validation runs using Airflow, which is well-suited for orchestrating data pipelines.
Here’s how you can set up an Airflow DAG (Directed Acyclic Graph) to run Great Expectations validations daily:
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
import great_expectations as gx
import pandas as pd

# Define the DAG and set the schedule to run daily
default_args = {
    'owner': 'airflow',
    'start_date': datetime(2024, 1, 1),
    'retries': 1,
}

dag = DAG(
    'great_expectations_validation',
    default_args=default_args,
    schedule_interval='@daily',  # Runs once a day
)

# Define the function to run the validation
def run_validation():
    # Build an in-memory GX context, data source, and batch definition
    context = gx.get_context()
    data_source = context.data_sources.add_pandas(name="daily_datasource")
    data_asset = data_source.add_dataframe_asset(name="daily_asset")
    batch_definition = data_asset.add_batch_definition_whole_dataframe("daily_batch")

    # Load the data to validate (the CSV path is a placeholder)
    df = pd.read_csv("path/to/your/datafile.csv")
    batch = batch_definition.get_batch(batch_parameters={"dataframe": df})

    # Define the expectations to enforce (placeholder column name)
    suite = gx.ExpectationSuite(name="daily_suite")
    suite.add_expectation(
        gx.expectations.ExpectColumnValuesToNotBeNull(column="your_column")
    )

    # Fail the Airflow task if validation does not pass
    results = batch.validate(suite)
    if not results.success:
        raise ValueError("Data validation failed!")

# Set up the task in Airflow
validation_task = PythonOperator(
    task_id='run_great_expectations_validation',
    python_callable=run_validation,
    dag=dag,
)
In this example, we define a DAG that schedules a validation run once a day (@daily). The run_validation() function loads the Great Expectations context and runs the defined expectation suite against the data; if validation fails, it raises an error, and the Airflow task is marked as failed.
Best Practices for Data Validation with Great Expectations
Following best practices is always advisable for scalability and efficiency, and data validation with Great Expectations is no different.
Start small and iterate
Begin with foundational data quality checks and gradually expand. Focus on basic expectations first; this avoids overcomplicating the process and makes for smoother integration and easier troubleshooting. As your understanding of the dataset improves, you can add more complex validations.
Collaborate across teams
Data quality is not just a technical concern. Collaborate with business teams to define expectations and ensure that the implemented validations align with the underlying business logic and goals. This cross-functional approach helps ensure that data serves its intended purpose and meets the requirements of all stakeholders.
Automate where possible
Integrate data validation into your pipelines and automate it wherever feasible. Automated checks enable continuous monitoring of data quality without manual intervention, which significantly improves efficiency.
Conclusion
Great work! You've learned how to set up Great Expectations and validate data with it. These techniques will help you maintain high data quality and transparency in your workflows.
To continue building your skills, check out these resources:
- ETL and ELT in Python: Learn how to transform and move data effectively.
- Introduction to Data Quality: Explore the fundamentals of data quality management.
- Cleaning Data in Python: Master data cleaning techniques to ensure accuracy and consistency.
- Data Quality Dimensions Cheat Sheet: A handy guide to data quality dimensions.
FAQs
How does Great Expectations compare to other data validation tools?
Great Expectations is open-source, flexible, and integrates well with modern data pipelines. It stands out for its extensive library of expectations and strong documentation.
Do I need to know Python to use Great Expectations?
Basic Python knowledge is helpful, since GX Core is configured through Python code. That said, expectations are declarative and readable, and the extensive documentation makes Great Expectations approachable even with limited programming experience.
What types of data sources does Great Expectations support?
Great Expectations supports a wide range of data sources, including:
- Relational databases like PostgreSQL, MySQL, and SQL Server.
- Cloud storage solutions like AWS S3, Google Cloud Storage, and Azure Blob Storage.
- File formats like CSV, Parquet, and Excel.
- Big data frameworks like Apache Spark and Databricks.

You can easily connect Great Expectations to these sources using the appropriate configuration for your data source.
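As an illustration, registering a PostgreSQL source and a table asset with the fluent API looks roughly like this (the names and connection string are placeholders):

import great_expectations as gx

context = gx.get_context()
# Hypothetical connection string; substitute your own credentials
pg_source = context.data_sources.add_postgres(
    name="warehouse",
    connection_string="postgresql+psycopg2://user:password@localhost:5432/analytics",
)
table_asset = pg_source.add_table_asset(name="inventory_parts", table_name="inventory_parts")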
Can I use Great Expectations with streaming data?
Great Expectations is primarily designed for batch data validation. While it does not natively support streaming pipelines, you can use it alongside frameworks like Apache Kafka or Spark Structured Streaming by periodically validating snapshots or micro-batches of the data.
Is it possible to version control expectations and validation results?
Yes, you can version control expectations and configurations by storing them as YAML or JSON files in a Git repository. For validation results, you can set up a database or file-based store to track results over time and integrate them into your CI/CD pipelines for continuous monitoring.
How does Great Expectations handle schema evolution in datasets?
Great Expectations handles schema evolution through its flexible expectations framework. If your schema changes, you can:
- Use expect_table_columns_to_match_set or similar expectations to validate column names dynamically (see the sketch after this list).
- Modify or create new expectation suites to adapt to the new schema.
- Leverage schema inference tools to automatically update expectations for newly added columns.
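A sketch of the first option, using column names from this tutorial's sample data (the column list is illustrative):

suite.add_expectation(
    gx.expectations.ExpectTableColumnsToMatchSet(
        column_set=["inventory_id", "part_num", "color_id", "quantity", "is_spare"],
        exact_match=True,
    )
)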
Thalia Barrera is a Senior Data Science Editor at DataCamp with a master’s in Computer Science and over a decade of experience in software and data engineering. Thalia enjoys simplifying tech concepts for engineers and data scientists through blog posts, tutorials, and video courses.
Learn more about data engineering with these courses and tutorials!
- Course: ETL and ELT in Python
- Course: Introduction to Data Quality
- Tutorial: Python Tutorial for Beginners
- Tutorial: Python Exploratory Data Analysis Tutorial
- Tutorial: Visualizing Data with Python and Tableau Tutorial
- Tutorial: Python Decorators Tutorial
- Tutorial: Kaggle Tutorial: EDA & Machine Learning
- Code-along: Getting Started with Machine Learning in Python (with George Boorman)