Skip to content

High-level synthetic data generation with data set archetypes.

License

Notifications You must be signed in to change notification settings

mzelling/repliclust

Repository files navigation

Tests Coverage

██████  ███████ ██████  ██      ██  ██████ ██      ██    ██ ███████ ████████ 
██   ██ ██      ██   ██ ██      ██ ██      ██      ██    ██ ██         ██    
██████  █████   ██████  ██      ██ ██      ██      ██    ██ ███████    ██    
██   ██ ██      ██      ██      ██ ██      ██      ██    ██      ██    ██    
██   ██ ███████ ██      ███████ ██  ██████ ███████  ██████  ███████    ██    

High-Level Synthetic Data Generation with Data Set Archetypes

repliclust is a Python package for generating synthetic datasets with clusters based on high-level descriptions. Instead of manually setting low-level parameters like cluster centroids or covariance matrices, you can simply describe the desired characteristics of your data, and repliclust will automatically generate datasets that match those specifications.

What can this software do for you?

  • Simplify Synthetic Data Generation: Eliminate the need to fine-tune low-level simulation parameters. Describe your desired scenario, and let repliclust handle the rest.

  • Enhance Benchmark Quality: By controlling high-level aspects of the data, you can create more informative benchmarks that reveal the strengths and weaknesses of clustering algorithms under various conditions.

  • Accelerate Research: Quickly generate diverse datasets to test hypotheses, validate models, and perform robustness checks.

Key Features

  • Generate Data from High-Level Descriptions: Create datasets by specifying scenarios such as "clusters with very different shapes and sizes" or "highly overlapping oblong clusters."

  • Data Set Archetypes: Use archetypes to define the overall geometry of your datasets with intuitive parameters that summarize cluster overlaps, shapes, sizes, and distributions.

  • Flexible Cluster Shapes: Go beyond convex, blob-like clusters by applying nonlinear transformations, such as random neural networks for distortion or stereographic projections to create directional data.

  • Reproducible and Informative Benchmarks: Independently manipulate different aspects of the data to create benchmarks that effectively evaluate and compare clustering algorithms under various conditions.

Demo

Try our demo here!

Installation

pip install repliclust

Quickstart

The easiest way to get started using repliclust is to create synthetic datasets from high-level descriptions in English. We build on on the OpenAI API, so to use these features you must provide an OpenAI API key. You can set it as OPENAI_API_KEY=<your-api-key> in an .env file, or pass it to individual functions as a keyword argument openai_api_key="<your-api-key>".

  • Generating data directly:
import repliclust as rpl

X, y, _ = rpl.generate("three highly separated oblong clusters in 10D", openai_api_key="<your-api-key>")
rpl.plot(X,y)
  • Creating an archetype:
archetype = rpl.Archetype.from_verbal_description(
    "seven gamma-distributed clusters in 2D of very different shapes",
    openai_api_key="<your-api-key>"
)
  • Generating data from the archetype:
X, y = archetype.synthesize()
  • Making cluster shapes more irregular:
X_irregular = rpl.distort(X)
X_directional = rpl.wrap_around_sphere(X)

Documentation

  • User Guide: Learn how to generate datasets from high-level descriptions in the User Guide.
  • Reference: Explore the package Reference.

Citation

To reference repliclust in your work, please cite:

@article{Zellinger:2023,
  title   = {High-Level Synthetic Data Generation with Data Set Archetypes},
  author  = {Zellinger, Michael J and B{\"u}hlmann, Peter},
  journal = {arXiv preprint arXiv:2303.14301},
  doi     = {10.48550/arXiv.2303.14301},
  year    = {2023}
}

About

High-level synthetic data generation with data set archetypes.

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages