Data quality management with great_expectations

Mlbits
3 min read · Nov 3, 2021

In recent years there has been a growing trend of using big data to solve complex problems: business questions, recommendations, or a modern AI that learns to play sudoku. What do these tasks have in common? Data, the holy grail of today's world.

You can collect tons of data from your users or your product and build big data pipelines to extract valuable insights. But what if the nature of your data changes? For example, something breaks in the pipeline and you suddenly start receiving anomalous values.

Why is that bad? There are a couple of reasons:

  • Say you train a machine learning model on data inputs with a strict format. If that format suddenly changes, the model may start predicting wrong values, and you will make wrong decisions based on them.
  • If you ignore these anomalies, something in your pipeline may be quietly failing: an ingestion lag into your database, servers being down, or many other causes. A proper data quality setup helps you catch these situations early.

At MLbits we rely heavily on data quality control in our projects, and today we would like to talk a bit about our favorite tool: great_expectations (https://github.com/great-expectations/great_expectations).

Why is it so good? Here are the main ways it helps you build robust data quality monitoring:

  • Know immediately when the data isn't normal.
  • Identify exactly what changed in the data.
  • Determine the specific step where it went wrong.
  • Prevent downstream steps from executing, limiting bad data exposure and future cleanup effort.

You have probably already faced some of these problems, or may face them in the future. Now let's look at the key features of the library that help with them:

  • Lets you create rules for validating your data
  • Simplifies debugging data pipelines if (when) they break
  • Accelerates ETL and data normalization
  • Can run validation as part of your CI checks
  • Automates verification of new data deliveries from vendors and other teams
  • Auto-generates UI docs
  • Supports both CLI and code-level usage
  • Provides alerts out of the box
  • Is pluggable and extensible (supports CSV/pandas, S3, Redshift, BigQuery, etc.) and can be integrated into existing Airflow DAGs
[Image: rule example]
[Image: auto-generated UI docs]

OK, this looks exciting. How do I get started?

First of all, great_expectations has really good docs, and they are worth going through. But here is a quick tour.

First, install it via pip install great_expectations, then initialize a project with great_expectations init. You will get the following project structure:

  • great_expectations.yml — stores all the config data
  • expectations — stores the list of your rules
  • checkpoints — stores checkpoints data
  • plugins — for custom stuff
  • uncommitted — all the sensitive data you don’t want to share, like access keys
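On disk, the generated layout looks roughly like this (a sketch; exact contents vary between versions):

```
great_expectations/
├── great_expectations.yml   # main configuration
├── expectations/            # your expectation suites (rules)
├── checkpoints/             # checkpoint definitions
├── plugins/                 # custom extensions
└── uncommitted/             # secrets, data docs, validation results (git-ignored)
```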

[Image: CLI usage]

Key concepts you need to know before you start using it:


  • Datasource — the name of your data source (can be a local folder, a database, etc.)
  • Data asset — one unit of data within the source (can be a CSV file, a table, etc.)
  • Expectation suite — a set of rules which you can reuse multiple times
  • Checkpoint — allows running different sets of rules against your data
  • Profiling — automatically analyzes your data asset using 1000 random rows and creates a UI report and a notebook with generated rules (a good starting point)
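To make these concepts concrete, here is a tiny pure-Python sketch. None of these names come from the great_expectations API; it only illustrates how the pieces relate: a suite is a reusable bundle of rules, and a checkpoint runs a suite against one data asset:

```python
# Toy versions of "expectation suite" and "checkpoint" (illustrative only).

def expect_not_null(rows, column):
    """Rule: no missing values in the given column."""
    return all(row.get(column) is not None for row in rows)

def expect_between(rows, column, lo, hi):
    """Rule: every value in the column falls within [lo, hi]."""
    return all(lo <= row[column] <= hi for row in rows)

# An "expectation suite": a reusable set of rules.
user_suite = [
    lambda rows: expect_not_null(rows, "id"),
    lambda rows: expect_between(rows, "age", 0, 120),
]

def run_checkpoint(rows, suite):
    """A "checkpoint": run every rule in a suite against one data asset."""
    return all(rule(rows) for rule in suite)

asset = [{"id": 1, "age": 25}, {"id": 2, "age": 47}]
print(run_checkpoint(asset, user_suite))  # True
```

In the real library, profiling can bootstrap such a suite for you from a sample of the data instead of you writing every rule by hand.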

This was just a brief intro to great_expectations. If you're interested in working with us or want to set up data quality management for your organization, feel free to reach us at http://ml-bits.com/.
