SP24-CS511-Final-Project/datamimic

datamimic

Generate datasets that mimic realistic workloads

Goals

The goal of datamimic is to provide a flexible, easy-to-use open source tool for generating datasets whose statistical properties resemble those of realistic data. It can generate numerical and categorical data from a number of configurable distributions, with custom NDV (number of distinct values) ratios, nullability, and sortedness.
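
To make the three knobs above concrete, here is a minimal standard-library sketch of how an NDV ratio, a null ratio, and a sortedness level could shape a single integer column. The function `generate_column` and all of its parameters are hypothetical names chosen for illustration; this is not datamimic's API or implementation, and it uses a uniform distribution only.

```python
import random


def generate_column(n_rows, ndv_ratio=0.1, null_ratio=0.05, sortedness=1.0, seed=0):
    """Hypothetical illustration, not datamimic's API.

    ndv_ratio:  target number of distinct values divided by n_rows.
    null_ratio: fraction of rows replaced with None.
    sortedness: 1.0 keeps the column fully sorted; lower values apply
                more random swaps to a sorted column.
    """
    if n_rows < 1:
        raise ValueError("n_rows must be at least 1")
    rng = random.Random(seed)

    # Number of distinct values implied by the NDV ratio (at least one).
    ndv = min(n_rows, max(1, round(n_rows * ndv_ratio)))
    pool = list(range(ndv))

    # Use every pool value once, then fill the rest uniformly at random.
    values = pool + [rng.choice(pool) for _ in range(n_rows - ndv)]

    # Start from a sorted column and disturb it with random swaps.
    values.sort()
    n_swaps = round((1.0 - sortedness) * n_rows)
    for _ in range(n_swaps):
        i, j = rng.randrange(n_rows), rng.randrange(n_rows)
        values[i], values[j] = values[j], values[i]

    # Replace a fixed fraction of rows with nulls.
    n_nulls = round(n_rows * null_ratio)
    for idx in rng.sample(range(n_rows), n_nulls):
        values[idx] = None
    return values


column = generate_column(1000, ndv_ratio=0.1, null_ratio=0.05, sortedness=1.0, seed=1)
non_null = [v for v in column if v is not None]
print(len(column), column.count(None), len(set(non_null)))
```

Because nulls are injected last, the number of distinct non-null values can fall slightly below the target when a null overwrites the only occurrence of a value; a real generator would need to decide how to handle that interaction.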

The main motivation behind datamimic is to study how effective the encoding, compression, and indexing techniques used in the Parquet format are for table scans with a predicate, across a variety of datasets with distinct statistical properties.
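
As a rough illustration of why a property such as sortedness matters for a predicate scan, the sketch below counts how many fixed-size row groups a scan would have to read if it could skip any group whose minimum and maximum fall outside the predicate range. This is a simplified stand-in for min/max statistics pruning, written with the standard library only; `row_groups_scanned` is a hypothetical helper and does not reflect how Parquet readers or datamimic are implemented.

```python
import random


def row_groups_scanned(values, group_size, lo, hi):
    """Count row groups whose [min, max] range overlaps the predicate lo <= x <= hi."""
    scanned = 0
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        # A group can be skipped only if its value range misses the predicate range.
        if min(group) <= hi and max(group) >= lo:
            scanned += 1
    return scanned


rng = random.Random(0)
sorted_col = list(range(10_000))
shuffled_col = sorted_col[:]
rng.shuffle(shuffled_col)

# Same data and same predicate; only the row order differs.
print("sorted:  ", row_groups_scanned(sorted_col, 1000, 2500, 2600))
print("shuffled:", row_groups_scanned(shuffled_col, 1000, 2500, 2600))
```

On the sorted column only the one group covering the predicate range needs to be read, while on the shuffled column nearly every group overlaps it. Being able to vary such properties in generated data is what allows this kind of comparison.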

It also serves as part of the testing infrastructure for exploring the idea of bringing additional indexes and data structures to Parquet.

Build

Run

pip install -r requirements.txt

inside the project root directory.

Usage

python bin/main.py --help

datamimic accepts a YAML configuration file that specifies the desired data schema, statistical properties, data size, output format, and so on. See example.yaml for an example.
