Data quality is the bedrock of trustworthy AI. No matter how advanced your models or pipelines, poor data quality will undermine results and erode user trust. As we covered in our Ultimate Guide to AI Workflow Testing and Validation in 2026, robust data validation is a critical sub-pillar of responsible AI development. This tutorial dives deep into practical, reproducible steps for validating data quality in AI workflows, focusing on frameworks, automation, and actionable checklists tailored for 2026.
Whether you’re building pipelines for machine learning, MLOps, or generative AI, this guide will help you implement best-in-class data quality validation. We’ll use Great Expectations (v0.18+), Pandas (v2.2+), and pytest (v8+), with all code and commands ready to run. For a broader perspective on testing automation, see our sibling article on Best Practices for Automated Regression Testing in AI Workflow Automation.
Prerequisites
- Python 3.10+ installed on your system
- pip (Python package manager)
- Basic understanding of Python and Pandas dataframes
- Familiarity with the command line
- Sample dataset in CSV or Parquet format (provided below for testing)
- Optional: `pytest` for automated testing
You should also be comfortable with the concepts of data pipelines and AI workflow orchestration. If you’re new to workflow orchestration, check out Getting Started with API Orchestration for AI Workflows (Beginner’s Guide 2026).
Step 1: Set Up Your Environment
1. Create a new project directory and set up a virtual environment:

   ```bash
   mkdir ai-data-quality-demo
   cd ai-data-quality-demo
   python3 -m venv venv
   source venv/bin/activate
   ```
2. Install the required dependencies:

   ```bash
   pip install great-expectations==0.18.10 pandas==2.2.2 pytest==8.2.0
   ```
3. Verify the installation:

   ```bash
   python -c "import great_expectations; import pandas; import pytest; print('All set!')"
   ```
Step 2: Prepare a Sample Dataset
For demonstration, let’s use a simple customer data CSV. Save the following as `customers.csv` in your project directory:

```csv
customer_id,first_name,last_name,email,signup_date,age
1,Alice,Smith,alice@example.com,2024-01-15,29
2,Bob,Johnson,bob@example.com,2023-12-20,35
3,Charlie,Lee,charlie@example.com,2024-03-10,22
4,Denise,Kim,,2024-02-01,28
5,Edward,Wong,edward@example.com,2024-01-25,NaN
6,Fay,Li,fay@example.com,2024-03-05,41
```
This dataset intentionally includes missing values to illustrate data quality issues.
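Before wiring up a validation framework, a quick Pandas pass can confirm the issues are really there. The sketch below inlines the CSV so it runs standalone; in your project you would read `customers.csv` from disk instead:

```python
import io

import pandas as pd

# Inline copy of customers.csv so this snippet is self-contained;
# in the project, use pd.read_csv("customers.csv") instead.
CSV = """customer_id,first_name,last_name,email,signup_date,age
1,Alice,Smith,alice@example.com,2024-01-15,29
2,Bob,Johnson,bob@example.com,2023-12-20,35
3,Charlie,Lee,charlie@example.com,2024-03-10,22
4,Denise,Kim,,2024-02-01,28
5,Edward,Wong,edward@example.com,2024-01-25,NaN
6,Fay,Li,fay@example.com,2024-03-05,41
"""

df = pd.read_csv(io.StringIO(CSV))

# Pandas treats both the empty email field and the literal "NaN" age as missing.
print(df.isna().sum())  # email and age each report one missing value
```

This one-liner is useful as a sanity check, but it only counts nulls; the framework-based checks in the following steps cover formats, ranges, and uniqueness as well.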
Step 3: Initialize Great Expectations
1. Initialize a new Great Expectations project:

   ```bash
   great_expectations init
   ```

   Follow the prompts. When asked about your data, select “Local file (e.g., CSV)”.
2. Organize your data:

   ```bash
   mkdir data
   mv customers.csv data/
   ```
3. Confirm the directory structure:

   ```bash
   ls
   ```

   You should see `data/`, `great_expectations/`, and `venv/`.
Step 4: Create a Data Source and Data Asset
1. Add a Pandas data source:

   ```bash
   great_expectations datasource new
   ```

   Choose “Pandas” as your execution engine, and point to `data/customers.csv` when prompted.

2. Test the data source:

   ```bash
   great_expectations datasource list
   ```

   You should see your new data source listed.
Step 5: Build a Data Quality Checklist (Expectation Suite)
Now, let’s define a robust checklist for data quality, using Great Expectations “expectations” to automate validation.
1. Create a new expectation suite:

   ```bash
   great_expectations suite new
   ```

   Name it `customer_data_quality`. Choose “interactive” mode for step-by-step guidance.

2. Explore and add expectations interactively. You’ll be prompted to add expectations; here are some key examples you should include:

   - `expect_column_values_to_not_be_null` on `customer_id`, `first_name`, `last_name`, `email`
   - `expect_column_values_to_match_regex` on `email` (simple email pattern)
   - `expect_column_values_to_be_between` on `age` (e.g., 18 to 99)
   - `expect_column_values_to_be_unique` on `customer_id`
   - `expect_column_values_to_not_be_null` on `signup_date`
   Here’s how to add them manually via Python:

   ```python
   import great_expectations as ge

   # Load the CSV as a Great Expectations dataset with validation methods attached.
   df = ge.read_csv("data/customers.csv")

   df.expect_column_values_to_not_be_null("customer_id")
   df.expect_column_values_to_match_regex("email", r"[^@]+@[^@]+\.[^@]+")
   df.expect_column_values_to_be_between("age", 18, 99)
   df.expect_column_values_to_be_unique("customer_id")
   ```

3. Save your expectation suite:

   ```bash
   great_expectations suite edit customer_data_quality
   ```

   Review and confirm your expectations in the interactive editor.
Step 6: Run Data Validation and Review Results
1. Create a validation checkpoint:

   ```bash
   great_expectations checkpoint new
   ```

   Name it `customer_data_checkpoint` and link it to your suite and data asset.

2. Execute the checkpoint:

   ```bash
   great_expectations checkpoint run customer_data_checkpoint
   ```

   This will generate a validation report in `great_expectations/uncommitted/data_docs/local_site/`.

3. View the validation report:

   ```bash
   open great_expectations/uncommitted/data_docs/local_site/index.html
   ```

   (On Linux, use `xdg-open` in place of macOS’s `open`.)

   [Screenshot Description: The report shows a summary of passed and failed expectations, with details for each column and expectation.]
Step 7: Automate Data Quality Checks with Pytest
Integrate your data quality suite into CI/CD or nightly batch jobs using pytest.
1. Create a test script, `test_data_quality.py`:

   ```python
   import great_expectations as ge

   def test_customer_data_quality():
       context = ge.get_context()
       result = context.run_checkpoint(checkpoint_name="customer_data_checkpoint")
       assert result["success"], "Data quality validation failed!"
   ```

2. Run the test:

   ```bash
   pytest test_data_quality.py
   ```

   [Screenshot Description: Pytest output shows a test pass or fail, with traceback if the checkpoint fails.]
Step 8: Expand Your Checklist—What to Validate in 2026
As data and AI workflows evolve, so do the risks. Here’s a 2026-ready checklist for data quality validation:
- Schema consistency (column names, types, order)
- Value ranges and allowed categories
- Null value thresholds (not just any nulls, but % allowed)
- Uniqueness and primary key enforcement
- Referential integrity (foreign keys, joins)
- Format and regex checks (emails, phone numbers, IDs)
- Outlier detection (statistical or ML-based)
- Bias and representation checks (demographics, class balance)
- Drift detection (distribution changes over time)
- Data freshness and staleness
- Provenance and source tracking
For more on automating these checks, see How to Set Up Automated Data Quality Checks in AI Workflow Automation.
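Drift detection deserves a concrete example. One common lightweight approach, independent of any particular framework, is the Population Stability Index (PSI), which compares the binned distribution of a current sample against a reference sample:

```python
import numpy as np

def psi(ref: np.ndarray, cur: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between two numeric samples.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 major drift."""
    edges = np.histogram_bin_edges(ref, bins=bins)
    ref_frac = np.histogram(ref, bins=edges)[0] / len(ref)
    cur_frac = np.histogram(cur, bins=edges)[0] / len(cur)
    # Floor the bin fractions to avoid log(0) on empty bins.
    ref_frac = np.clip(ref_frac, 1e-6, None)
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 5000)
print(psi(baseline, rng.normal(0, 1, 5000)))  # near zero: same distribution
print(psi(baseline, rng.normal(1, 1, 5000)))  # well above 0.25: mean has shifted
```

This is a rough sketch; production drift monitoring typically uses dedicated libraries, handles categorical features, and tracks PSI over time rather than on a single comparison.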
Common Issues & Troubleshooting
- **Great Expectations can’t find my data file:** Double-check your file path and the datasource configuration in `great_expectations.yml`.
- **Validation fails on missing values:** Review your expectations; if some missing data is acceptable, adjust the expectation (e.g., `mostly=0.95`).
- **Pytest fails but the GE report passes:** Confirm your checkpoint is up to date and includes all expectations. Re-run `great_expectations suite edit` if needed.
- **`ModuleNotFoundError` for Great Expectations:** Ensure your virtual environment is activated (`source venv/bin/activate`) and dependencies are installed.
- **Data docs not rendering:** Try regenerating docs with `great_expectations docs build` and check your browser’s cache.
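The `mostly` parameter mentioned above tells Great Expectations to tolerate a fraction of failing rows instead of requiring every row to pass. Its effect can be mimicked in plain Pandas (a sketch of the semantics, not the library’s implementation):

```python
import pandas as pd

def mostly_not_null(s: pd.Series, mostly: float = 0.95) -> bool:
    """Pass if at least `mostly` of the values are non-null,
    mirroring the tolerance that mostly=0.95 gives an expectation."""
    return s.notna().mean() >= mostly

emails = pd.Series(["a@example.com", "b@example.com", None, "c@example.com"])
print(mostly_not_null(emails, mostly=0.95))  # only 75% non-null -> False
print(mostly_not_null(emails, mostly=0.70))  # 75% clears the 70% bar -> True
```

Choosing the threshold is a policy decision: set it from what your downstream models and reports can actually tolerate, not from what the current data happens to pass.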
Next Steps
You’ve now built a reproducible, automated data quality validation workflow using modern frameworks and a 2026-ready checklist. Next, consider:
- Integrating these checks into your CI/CD pipeline for every data update
- Extending expectations to new datasets and evolving schemas
- Adding advanced checks (e.g., bias, drift, outlier detection)
- Exploring additional frameworks or custom rules for domain-specific needs
- Reviewing the Ultimate Guide to AI Workflow Testing and Validation in 2026 for a broader perspective
- Learning about automated regression testing in AI workflows for holistic validation
Data quality is never “done”—it’s an ongoing process. With the right frameworks and checklists, you can build AI systems that are robust, fair, and trustworthy.
