datamimic

Generate datasets that mimic realistic workloads.

Goals

The goal of datamimic is to provide a flexible, easy-to-use open source tool for generating datasets whose statistical properties resemble those of realistic data. It can generate numerical and categorical data from a number of configurable distributions, with custom NDV (number of distinct values) ratios, nullability, and sortedness.
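To make these properties concrete, here is a minimal standard-library sketch of a single integer column with a target NDV ratio, null fraction, and degree of sortedness. It is an illustration only, not datamimic's implementation: the function `generate_column` and its parameters `ndv_ratio`, `null_fraction`, and `sortedness` are hypothetical names chosen for this example, and the uniform draw stands in for whichever distribution is configured.

```python
import random


def generate_column(n_rows, ndv_ratio, null_fraction, sortedness, seed=0):
    """Return a list of ints/None with roughly the requested properties.

    Hypothetical helper for illustration; these are not datamimic's options.
    """
    rng = random.Random(seed)
    # Size of the value domain: NDV ratio = distinct values / rows (at least 1).
    ndv = max(1, round(n_rows * ndv_ratio))
    # Uniform draw over the domain; another distribution could be swapped in here.
    values = [rng.randrange(ndv) for _ in range(n_rows)]
    # Start fully sorted, then disorder the column by swapping random pairs.
    values.sort()
    n_swaps = round((1.0 - sortedness) * n_rows)
    for _ in range(n_swaps):
        i, j = rng.randrange(n_rows), rng.randrange(n_rows)
        values[i], values[j] = values[j], values[i]
    # Replace a fixed fraction of positions with nulls.
    n_nulls = round(n_rows * null_fraction)
    for idx in rng.sample(range(n_rows), n_nulls):
        values[idx] = None
    return values


col = generate_column(n_rows=1000, ndv_ratio=0.05, null_fraction=0.1, sortedness=0.9)
```

Here `sortedness=1.0` leaves the column fully sorted and lower values perform proportionally more random swaps. This is only one of several reasonable ways to define sortedness.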

The main motivation behind datamimic is to study how effective the encoding, compression, and indexing techniques used in the Parquet format are for table scans with a predicate, across a variety of datasets with distinct statistical properties.

It also serves as part of the testing infrastructure for exploring the idea of bringing additional indexes and data structures to Parquet.
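As a rough illustration of why statistical properties such as sortedness matter for encodings, the sketch below applies a plain run-length encoding to the same low-NDV column in sorted and unsorted order. It is a simplified stand-in written for this README, not Parquet's actual RLE/bit-packing hybrid encoding and not part of datamimic. The helper `run_length_encode` is a hypothetical name.

```python
import random
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(v, sum(1 for _ in group)) for v, group in groupby(values)]


rng = random.Random(0)
# A low-NDV column: 10,000 rows drawn from at most 10 distinct values.
shuffled = [rng.randrange(10) for _ in range(10_000)]
sorted_col = sorted(shuffled)

runs_sorted = run_length_encode(sorted_col)
runs_shuffled = run_length_encode(shuffled)

# Fewer runs means a smaller run-length-encoded representation.
print(len(runs_sorted), len(runs_shuffled))
```

The sorted column collapses into at most one run per distinct value, while the unsorted column produces far more runs. The same data can therefore encode very differently depending on its ordering, which is the kind of effect that varying dataset properties is meant to expose.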

Build

Run

pip install -r requirements.txt

inside the project root directory.

Usage

python bin/main.py --help

Datamimic accepts a YAML configuration file that specifies the desired data schema, statistical properties, data size, output format, and so on. See example.yaml for an example.
