Skip to content

tokopedia/dq-whistler

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

28 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Overview

Whistler is an open source data quality and profiling tool. Along with profiling it also supports running custom constraints on the data. With underlining support of Apache Spark execution engine Whistler can be extended to data in magnitudes of GB's, TB's. .

🐣 Getting Started

1. Install Whistler

pip install dq-whistler

2. Create a Spark dataframe for you data

# Sample Data
Age,Description
1,"abc"
2,"abc1"
3,
4,"abc4"
10,"xyz"
12,"null"
17,"abc"
20,"abc3"
23,
# You can read data from all the supported sources as per Apache Spark module
df = spark.read.option("header", "true").csv("<your path>")

3. Create a config in the form of python dict or read it from any json file

config = [
   {
      "name": "Age",
      "datatype": "number",
      "constraints":[
         {
            "name": "gt_eq",
            "values": 5
         },
         {
            "name": "is_in",
            "values": [1, 23]
         }
         
      ]
   },
   {
      "name": "Description",
      "datatype": "string",
      "constraints":[
         {
            "name": "regex",
            "values": "([A-Za-z]+)"
         },
         {
            "name": "contains",
            "values": "abc"
         }
         
      ]
   }
]

4. Build an instance of Data Quality Analyzer and execute the checks

from dq_whistler import DataQualityAnalyzer

output = DataQualityAnalyzer(df, config).analyze()

print(output)
[
    {
        "col_name": "Age",
        "total_count": 9,
        "null_count": 0,
        "unique_count": 9,
        "topn_values": {
            "1": 1,
            "2": 1,
            "3": 1,
            "4": 1,
            "10": 1,
            "12": 1,
            "17": 1,
            "20": 1,
            "23": 1
        },
        "min": 1,
        "max": 23,
        "mean": 10.222222222222221,
        "stddev": 8.303279138054101,
        "quality_score": 0,
        "constraints": [
            {
                "name": "gt_eq",
                "values": 5,
                "constraint_status": "failed",
                "invalid_count": 4,
                "invalid_values": [
                    "1",
                    "2",
                    "3",
                    "4"
                ]
            },
            {
                "name": "is_in",
                "values": [
                    1,
                    23
                ],
                "constraint_status": "failed",
                "invalid_count": 7,
                "invalid_values": [
                    "2",
                    "3",
                    "4",
                    "10",
                    "12",
                    "17",
                    "20"
                ]
            }
        ]
    },
    {
        "col_name": "Description",
        "total_count": 9,
        "null_count": 2,
        "unique_count": 7,
        "topn_values": {
            "abc": 2,
            "abc1": 1,
            "xyz": 1,
            "abc4": 1,
            "abc3": 1
        },
        "quality_score": 0,
        "constraints": [
            {
                "name": "regex",
                "values": "([A-Za-z]+)",
                "constraint_status": "success",
                "invalid_count": 0,
                "invalid_values": []
            },
            {
                "name": "contains",
                "values": "abc",
                "constraint_status": "failed",
                "invalid_count": 2,
                "invalid_values": [
                    "xyz",
                    "null"
                ]
            }
        ]
    }
]

📦 Roadmap

The list below contains the functionality that contributors are planning to develop for this module

  • Visualization
    • Visualization of profiling output

🎓 Important Resources

👋 Contributing

✨ Contributors

About

A comprehensive DQ framework

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •