Generate datasets that mimic realistic workloads
The goal of datamimic is to provide a flexible, easy-to-use open source tool for generating datasets whose statistical properties resemble those of realistic data. It can generate numerical and categorical data from a number of configurable distributions, with custom NDV (number of distinct values) ratios, nullability and sortedness.
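To make these properties concrete, here is a minimal sketch of how a single integer column with a given NDV ratio, null ratio and sortedness could be produced. It is not datamimic's implementation: the function name `generate_column`, its parameters, the uniform distribution, and the choice to measure the NDV ratio and sortedness over non-null rows are all assumptions made for illustration. It uses only the Python standard library.

```python
import random


def generate_column(n_rows, ndv_ratio, null_ratio, sortedness, seed=0):
    """Return a list of n_rows ints (or None for nulls).

    Hypothetical helper, not part of datamimic. Assumptions:
      - ndv_ratio  = distinct non-null values / non-null rows
      - null_ratio = null rows / all rows
      - sortedness = fraction of non-null rows left in sorted order
    """
    rng = random.Random(seed)

    n_null = round(n_rows * null_ratio)
    n_valid = n_rows - n_null
    ndv = min(n_valid, max(1, round(n_valid * ndv_ratio)))

    # Pick the distinct values, use each once, then fill the remaining
    # rows by drawing uniformly from those same values.
    domain = rng.sample(range(10 * n_rows), ndv)
    values = domain + [rng.choice(domain) for _ in range(n_valid - ndv)]

    # Start fully sorted, then permute a (1 - sortedness) fraction of rows
    # among themselves, which keeps the set of values unchanged.
    values.sort()
    positions = rng.sample(range(n_valid), round(n_valid * (1.0 - sortedness)))
    displaced = [values[i] for i in positions]
    rng.shuffle(displaced)
    for pos, val in zip(positions, displaced):
        values[pos] = val

    # Scatter the nulls at random row positions.
    null_positions = set(rng.sample(range(n_rows), n_null))
    remaining = iter(values)
    return [None if i in null_positions else next(remaining)
            for i in range(n_rows)]


if __name__ == "__main__":
    column = generate_column(1000, ndv_ratio=0.1, null_ratio=0.2, sortedness=0.9)
    non_null = [v for v in column if v is not None]
    print(len(column), column.count(None), len(set(non_null)))
```

Assigning every distinct value once before filling the rest guarantees the requested number of distinct values exactly, rather than only in expectation.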
The main motivation behind datamimic is to study how effective the encoding, compression and indexing techniques used in the Parquet format are for table scans with a predicate, across a variety of datasets with distinct statistical properties.
It also serves as part of the testing infrastructure for exploring the idea of bringing additional indexes and data structures to Parquet.
To install the dependencies, run
pip install -r requirements.txt
inside the project root directory.
To list the available command-line options, run
python bin/main.py --help
Datamimic accepts a YAML configuration file (see example.yaml for an example) that specifies the desired data schema, statistical properties, data size, output format, etc.