GitHub - mzelling/repliclust: High-level synthetic data generation with data set archetypes.

██████  ███████ ██████  ██      ██  ██████ ██      ██    ██ ███████ ████████ 
██   ██ ██      ██   ██ ██      ██ ██      ██      ██    ██ ██         ██    
██████  █████   ██████  ██      ██ ██      ██      ██    ██ ███████    ██    
██   ██ ██      ██      ██      ██ ██      ██      ██    ██      ██    ██    
██   ██ ███████ ██      ███████ ██  ██████ ███████  ██████  ███████    ██

High-Level Synthetic Data Generation with Data Set Archetypes

repliclust is a Python package for generating synthetic datasets with clusters based on high-level descriptions. Instead of manually setting low-level parameters like cluster centroids or covariance matrices, you can simply describe the desired characteristics of your data, and repliclust will automatically generate datasets that match those specifications.
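For contrast, here is a minimal sketch of the manual, low-level approach that repliclust aims to replace. It uses only the Python standard library and does not call repliclust; the helper name `sample_clusters_manually` and all parameter values are hypothetical, chosen purely for illustration.

```python
import random

def sample_clusters_manually(centers, stds, n_per_cluster, seed=0):
    """Draw isotropic Gaussian clusters from hand-picked low-level parameters.

    Hypothetical helper for illustration only; not part of repliclust.
    """
    rng = random.Random(seed)
    X, y = [], []
    for label, (center, std) in enumerate(zip(centers, stds)):
        for _ in range(n_per_cluster):
            # One point: independent Gaussian noise around each coordinate of the center.
            X.append([rng.gauss(c, std) for c in center])
            y.append(label)
    return X, y

# Every number below has to be chosen by hand to obtain the desired overlap and shapes.
centers = [(0.0, 0.0), (6.0, 0.0), (3.0, 5.0)]
stds = [0.5, 1.0, 1.5]
X, y = sample_clusters_manually(centers, stds, n_per_cluster=100)
```

Changing the amount of overlap or the number of dimensions in this style means re-tuning every center and spread, which is the kind of work a high-level description is meant to avoid.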

What can this software do for you?

Simplify Synthetic Data Generation: Eliminate the need to fine-tune low-level simulation parameters. Describe your desired scenario, and let repliclust handle the rest.
Enhance Benchmark Quality: By controlling high-level aspects of the data, you can create more informative benchmarks that reveal the strengths and weaknesses of clustering algorithms under various conditions.
Accelerate Research: Quickly generate diverse datasets to test hypotheses, validate models, and perform robustness checks.

Key Features

Generate Data from High-Level Descriptions: Create datasets by specifying scenarios such as "clusters with very different shapes and sizes" or "highly overlapping oblong clusters."
Data Set Archetypes: Use archetypes to define the overall geometry of your datasets with intuitive parameters that summarize cluster overlaps, shapes, sizes, and distributions.
Flexible Cluster Shapes: Go beyond convex, blob-like clusters by applying nonlinear transformations, such as random neural networks for distortion or stereographic projections to create directional data.
Reproducible and Informative Benchmarks: Independently manipulate different aspects of the data to create benchmarks that effectively evaluate and compare clustering algorithms under various conditions.

Demo

Try our demo here!

Installation

pip install repliclust

Quickstart

The easiest way to get started with repliclust is to create synthetic datasets from high-level descriptions in English. These features build on the OpenAI API, so you must provide an OpenAI API key to use them. You can set it as OPENAI_API_KEY=<your-api-key> in a .env file, or pass it to individual functions as a keyword argument openai_api_key="<your-api-key>".
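If you prefer to resolve the key yourself before calling repliclust, something along these lines may work. This is a sketch under assumptions: `resolve_openai_key` is a hypothetical helper, not part of repliclust, and it assumes OPENAI_API_KEY is already present in the process environment (for example, exported in your shell or loaded from the .env file by a tool of your choice).

```python
import os

def resolve_openai_key(explicit_key=None):
    """Return the explicit key if given, else OPENAI_API_KEY from the environment.

    Hypothetical helper for illustration only; not part of repliclust.
    """
    key = explicit_key or os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "No OpenAI API key found: pass one explicitly or set OPENAI_API_KEY."
        )
    return key
```

The result can then be passed on through the documented keyword argument, e.g. openai_api_key=resolve_openai_key().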

Generating data directly:

import repliclust as rpl

X, y, _ = rpl.generate("three highly separated oblong clusters in 10D", openai_api_key="<your-api-key>")
rpl.plot(X, y)

Creating an archetype:

archetype = rpl.Archetype.from_verbal_description(
    "seven gamma-distributed clusters in 2D of very different shapes",
    openai_api_key="<your-api-key>"
)

Generating data from the archetype:

X, y = archetype.synthesize()

Making cluster shapes more irregular:

X_irregular = rpl.distort(X)
X_directional = rpl.wrap_around_sphere(X)
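
To give a feel for how a stereographic projection produces directional data, here is a small standalone sketch of the textbook inverse stereographic projection, which maps points in n dimensions onto the unit sphere in n+1 dimensions. It does not use repliclust, and whether rpl.wrap_around_sphere applies exactly this map is an assumption; the function name `inverse_stereographic` is hypothetical.

```python
import math

def inverse_stereographic(point):
    """Map a point in R^n onto the unit sphere in R^(n+1).

    Textbook formula: x -> (2x / (|x|^2 + 1), (|x|^2 - 1) / (|x|^2 + 1)).
    Hypothetical illustration; not repliclust's implementation.
    """
    sq_norm = sum(c * c for c in point)
    denom = sq_norm + 1.0
    return [2.0 * c / denom for c in point] + [(sq_norm - 1.0) / denom]

points_2d = [(0.0, 0.0), (1.0, 0.0), (3.0, -4.0)]
points_on_sphere = [inverse_stereographic(p) for p in points_2d]

# Every image has unit Euclidean norm, so only its direction carries information.
norms = [math.sqrt(sum(c * c for c in p)) for p in points_on_sphere]
```

Points near the origin land near one pole of the sphere and far-away points approach the opposite pole, so a cluster in flat space becomes a patch of directions.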

Documentation

User Guide: Learn how to generate datasets from high-level descriptions in the User Guide.
Reference: Explore the package Reference.

Citation

To reference repliclust in your work, please cite:

@article{Zellinger:2023,
  title   = {High-Level Synthetic Data Generation with Data Set Archetypes},
  author  = {Zellinger, Michael J and B{\"u}hlmann, Peter},
  journal = {arXiv preprint arXiv:2303.14301},
  doi     = {10.48550/arXiv.2303.14301},
  year    = {2023}
}
