In modern AI development, data quality is the linchpin of model accuracy, stability, and regulatory compliance. As data pipelines scale and automation increases, automated data quality checks become essential to catch errors early, reduce costly re-labeling, and maintain trust in AI outputs. This tutorial walks you through building robust, automated data quality checks into your AI workflows using open-source tools, Python, and practical automation strategies.
For a broader look at the evolving landscape of data labeling, automation, and best practices, see our parent guide: AI Data Labeling in 2026: Best Practices, Tools, and Emerging Automation Trends.
Prerequisites
- Python 3.9+ (tested with 3.10)
- Pandas (v1.5+), Great Expectations (v0.17+), pytest (v7.0+)
- Basic familiarity with command line, Python scripting, and dataframes
- Access to a sample dataset (CSV or Parquet), e.g., tabular data for classification or NLP
- Optional: Familiarity with CI/CD tools (e.g., GitHub Actions, GitLab CI) if you want to automate checks in pipelines
Note: This guide focuses on open-source tools and hands-on scripting. For enterprise platforms and advanced data cleansing, see our roundup: Best AI Data Cleansing Tools and Platforms for Enterprise Use in 2026.
Step 1: Install Required Tools and Set Up Your Project
We'll use Great Expectations for automated data quality checks, along with `pandas` for data handling. Start by creating a new project directory and setting up a virtual environment:

```
$ mkdir ai-data-quality-demo
$ cd ai-data-quality-demo
$ python3 -m venv venv
$ source venv/bin/activate
```

Now install the required Python packages:

```
(venv) $ pip install pandas great_expectations pytest
```

Confirm the installations:

```
(venv) $ python -c "import pandas; import great_expectations; import pytest; print('All packages installed!')"
```

Tip: If you use Jupyter, you can install `ipykernel` to run notebooks in this environment.
Step 2: Initialize Great Expectations in Your Project
Great Expectations is a leading open-source tool for data quality and validation. It lets you define, run, and automate data checks ("expectations") on any dataset. Initialize it in your project directory:
```
(venv) $ great_expectations init
```

You should see a folder structure like:

```
ai-data-quality-demo/
  great_expectations/
    expectations/
    checkpoints/
    ...
  venv/
```

Screenshot Description: The terminal displays “Great Expectations directory created!” and the new `great_expectations/` folder appears in your project.
Step 3: Prepare a Sample Dataset
For this tutorial, we'll use a simple CSV file (`data/sample_data.csv`) with the columns `id`, `text`, and `label`:

```
id,text,label
1,This is a positive example,positive
2,This is a negative example,negative
3,This example is missing a label,
4,Another positive sample,positive
```

Place this file in a new `data/` directory:

```
(venv) $ mkdir data
(venv) $ nano data/sample_data.csv
```

Paste the above content and save.
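If you'd rather generate the file programmatically than paste it into an editor, a short pandas script produces the same CSV. This is a minimal sketch; the path `data/sample_data.csv` matches the tutorial layout, and the deliberately empty label in row 3 gives the later checks something to catch:

```python
import os

import pandas as pd

# Same rows as the hand-written CSV above; row 3 has a missing label on purpose.
rows = [
    (1, "This is a positive example", "positive"),
    (2, "This is a negative example", "negative"),
    (3, "This example is missing a label", None),
    (4, "Another positive sample", "positive"),
]

df = pd.DataFrame(rows, columns=["id", "text", "label"])
os.makedirs("data", exist_ok=True)
df.to_csv("data/sample_data.csv", index=False)  # None becomes an empty field
```

Either route produces an identical file, so the rest of the tutorial works the same way.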
Step 4: Create and Save Data Quality Expectations
Now, let's create automated data quality checks ("expectations") for your dataset. We'll check:
- No missing `id` or `label` values
- `label` is always one of "positive" or "negative"
- `id` is unique

Start an interactive session to create expectations:

```
(venv) $ great_expectations suite new
```

When prompted, enter a suite name (e.g., `sample_suite`) and choose "Pandas" as the backend. When asked for a data path, enter `data/sample_data.csv`.

You'll enter a Jupyter notebook or CLI session to define expectations. Add the following code:

```python
validator.expect_column_values_to_not_be_null("id")
validator.expect_column_values_to_not_be_null("label")
validator.expect_column_values_to_be_in_set("label", ["positive", "negative"])
validator.expect_column_values_to_be_unique("id")
```

Save and exit. The expectations will be stored in `great_expectations/expectations/sample_suite.json`.
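Under the hood, each expectation is just a boolean predicate over the dataframe. For intuition (or as a lightweight fallback when Great Expectations isn't available), here is a plain-pandas sketch of the same four checks. The function name `run_basic_checks` is our own, not part of any library:

```python
import pandas as pd


def run_basic_checks(df: pd.DataFrame) -> dict:
    """Plain-pandas equivalents of the four expectations defined above."""
    return {
        "id_not_null": bool(df["id"].notna().all()),
        "label_not_null": bool(df["label"].notna().all()),
        # dropna() so the null-check, not this one, reports missing labels
        "label_in_set": bool(df["label"].dropna().isin(["positive", "negative"]).all()),
        "id_unique": bool(df["id"].is_unique),
    }


sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "text": ["some example text"] * 4,
    "label": ["positive", "negative", None, "positive"],
})
print(run_basic_checks(sample))
```

The missing label in row 3 makes `label_not_null` come back `False`, mirroring the checkpoint failure you'll see in the next step.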
Step 5: Run Automated Data Quality Checks
You can now run your checks on any new dataset automatically. Create a checkpoint:

```
(venv) $ great_expectations checkpoint new sample_checkpoint
```

When prompted, link it to `sample_suite` and `data/sample_data.csv`. Now run the checkpoint:

```
(venv) $ great_expectations checkpoint run sample_checkpoint
```

Screenshot Description: The terminal outputs a summary table showing passed and failed expectations. In our example, the missing label in row 3 will cause a failure.

Sample Output:

```
3 of 4 expectations passed
FAILED: expect_column_values_to_not_be_null on column 'label'
```
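In a pipeline, what usually matters is the process exit code: CI treats nonzero as failure. As a hedged sketch (the helper `run_gate` and its check names are our own, styled after the output above, and do not require Great Expectations), a gate script can print a similar summary and return a status:

```python
import pandas as pd


def run_gate(df: pd.DataFrame, checks: dict) -> int:
    """Evaluate named boolean checks; print a summary; return an exit code."""
    failed = [name for name, fn in checks.items() if not fn(df)]
    print(f"{len(checks) - len(failed)} of {len(checks)} expectations passed")
    for name in failed:
        print(f"FAILED: {name}")
    return 1 if failed else 0


checks = {
    "expect_column_values_to_not_be_null on column 'label'":
        lambda d: d["label"].notna().all(),
    "expect_column_values_to_be_unique on column 'id'":
        lambda d: d["id"].is_unique,
}

df = pd.DataFrame({"id": [1, 2, 3], "label": ["positive", "negative", None]})
code = run_gate(df, checks)  # nonzero because row 3 has no label
```

Wrapping this in a script and calling `sys.exit(code)` lets any CI system fail the build on bad data, just as the Great Expectations checkpoint does.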
Step 6: Automate Checks in Your AI Workflow
To ensure data quality at every step, integrate these checks into your data ingestion, labeling, or model training pipelines. For example, add a `pytest` test in `tests/test_data_quality.py`:

```python
import great_expectations as ge


def test_sample_data_quality():
    context = ge.get_context()
    results = context.run_checkpoint(checkpoint_name="sample_checkpoint")
    assert results["success"], "Data quality checks failed!"
```

Run it with:

```
(venv) $ pytest tests/test_data_quality.py
```

For CI/CD integration, add a step in your pipeline (example for GitHub Actions):

```yaml
- name: Run data quality checks
  run: |
    source venv/bin/activate
    great_expectations checkpoint run sample_checkpoint
```

For more on automating data labeling and annotation pipelines, see: Best Practices for Automating Data Labeling Pipelines in 2026 and How to Build Annotation Pipelines that Scale: Tooling, Automation, and QA for 2026.
Step 7: Customize and Extend Your Data Quality Checks
Great Expectations supports a wide range of checks, including:
- Text length constraints (`expect_column_value_lengths_to_be_between`)
- Regex pattern matching for emails, phone numbers, etc.
- Statistical checks (mean, min, max, quantiles)
- Custom Python functions for domain-specific logic

Example: Require all `text` values to be at least 10 characters:

```python
validator.expect_column_value_lengths_to_be_between("text", min_value=10)
```

For advanced scenarios (e.g., image data, NLP, or multi-modal checks), you can write custom expectations or integrate with other validation libraries.

Pro Tip: For regulated industries, you may need to log all data quality failures and provide audit trails. For more on compliance, see: Best Practices for Data Labeling in Highly Regulated Industries (Finance, Pharma, Defense).
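For intuition, the length and regex expectations each map to a one-line pandas predicate. A minimal sketch follows; the `email` column and its pattern are illustrative assumptions, not part of the tutorial's dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "text": ["too short", "this one is comfortably long enough"],
    "email": ["not-an-email", "user@example.com"],
})

# Roughly what expect_column_value_lengths_to_be_between(min_value=10) asserts
length_ok = df["text"].str.len() >= 10

# Roughly what a regex expectation (e.g. expect_column_values_to_match_regex) asserts;
# this simple pattern is for illustration only, not production email validation
email_ok = df["email"].str.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+").fillna(False)
```

Prototyping a predicate like this in pandas first, then translating it into an expectation, is a quick way to iterate on domain-specific checks.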
Common Issues & Troubleshooting
- Great Expectations fails to find your dataset: Double-check the file path in your checkpoint configuration. Use absolute paths if needed.
- Python import errors: Ensure you’re running commands inside your virtual environment (`source venv/bin/activate`).
- Checkpoint not running all expectations: Make sure your expectation suite is linked to the correct checkpoint and dataset.
- Jupyter notebook won’t launch: Install `ipykernel` and restart your kernel. Alternatively, use the CLI mode for expectation creation.
- CI/CD pipeline fails on the data quality step: Review the logs for failed expectations. Use `great_expectations checkpoint list` and `great_expectations suite list` to debug.
Next Steps
- Expand to multiple datasets: Create separate expectation suites for different data sources or labeling tasks.
- Monitor data drift: Schedule regular checks to detect changes in data distribution over time.
- Integrate with annotation tools: Many modern labeling platforms offer webhooks or API callbacks to trigger data quality checks automatically.
- Explore advanced automation: For prompt chaining, RAG pipelines, and more, see Optimizing Prompt Chaining for Business Process Automation and How to Monitor RAG Systems: Automated Evaluation Techniques for 2026.
- Stay informed: The field of AI data quality is rapidly evolving. For the latest trends and tools, bookmark our parent guide: AI Data Labeling in 2026: Best Practices, Tools, and Emerging Automation Trends.
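As a concrete starting point for the drift monitoring mentioned above, here is a hedged sketch that compares label distributions between a reference batch and a new batch using total variation distance (0 means identical distributions, 1 means disjoint). The function name `label_drift` is our own, and any alerting threshold you pick is a judgment call for your data:

```python
import pandas as pd


def label_drift(ref: pd.Series, new: pd.Series) -> float:
    """Total variation distance between two label frequency distributions."""
    p = ref.value_counts(normalize=True)
    q = new.value_counts(normalize=True)
    labels = p.index.union(q.index)
    return 0.5 * sum(abs(p.get(lbl, 0.0) - q.get(lbl, 0.0)) for lbl in labels)


ref = pd.Series(["positive", "negative", "positive", "negative"])
new = pd.Series(["positive", "positive", "positive", "positive"])
drift = label_drift(ref, new)  # half the probability mass has shifted
```

Scheduling a check like this alongside your expectation suite turns one-off validation into ongoing monitoring.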
By following this hands-on tutorial, you’ve set up a reproducible, automated data quality check pipeline for your AI workflows. This foundation helps catch errors before they reach model training, streamlines annotation, and supports compliance and auditability at scale.
