Machine Learning and Artificial Intelligence are increasingly being applied to solve health-related problems, such as predicting prognosis from Electronic Health Records or detecting patterns in multi-omics data. Data plays a significant role in the development of such systems, but concerns have been raised when dealing with patients' data, with regulators underlining the need to protect patients' privacy. To this end, in recent years there have been growing proposals to replace original data (derived from real patients) with synthetic data that mimic the main statistical characteristics of their real counterparts. Regardless of the methods employed to generate them, it is essential to assess the quality of the synthetic data. To address this need, we have created a Dash application that users can install and run on their own computers. The application allows users to upload both original and synthetic data and generates various metrics to assess resemblance, utility, and privacy. Furthermore, users can download a report containing the obtained results. (DOI: 10.5220/0012558700003657)
This repository provides a Conda environment configuration file (`synthcheck_env.yml`) to streamline the setup process. Follow these steps to create the environment:
> [!IMPORTANT]
> Make sure you have Conda installed. If not, install Conda before proceeding.
- **Create the Conda Environment**

  Run the following command to create the environment using the provided `synthcheck_env.yml` file:

  ```
  conda env create -f synthcheck_env.yml
  ```

  This command will set up a Conda environment named according to the specifications in the `synthcheck_env.yml` file.

- **Activate the Environment**

  Once the environment is created, activate it using:

  ```
  conda activate synthcheck_env
  ```

- **Run the Application**

  Once the virtual environment is activated, you can run the application with:

  ```
  python SynthCheck_app.py
  ```

- **Deactivate the Environment**

  To deactivate the environment, simply use:

  ```
  conda deactivate
  ```

You can now work within this Conda environment to run the application.
The application is organized into two main sections:
The data upload process for quality evaluation is divided into several components:
Users are prompted to upload two CSV files:
- Original Dataset: it contains the dataset used when generating the synthetic data (example original dataset).
- Synthetic Dataset: it comprises the synthetic data for quality evaluation purposes (example synthetic dataset).
> [!TIP]
> Ensure that categorical feature categories are encoded with numerical values (e.g., 'benign' = 0 and 'malign' = 1).
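For example, such an encoding can be applied with pandas before uploading. This is a toy sketch (the column name and category labels are illustrative, not part of SynthCheck):

```python
import pandas as pd

# Toy data: a categorical feature with string labels.
df = pd.DataFrame({"diagnosis": ["benign", "malign", "benign"]})

# Map each category to a numeric code ('benign' = 0, 'malign' = 1).
df["diagnosis"] = df["diagnosis"].map({"benign": 0, "malign": 1})

# Save the encoded dataset ready for upload.
df.to_csv("original_encoded.csv", index=False)
```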
In addition to the datasets, users are required to upload a descriptor file in CSV format (example feature type file). This file is structured with two columns:
| Feature   | Type        |
|-----------|-------------|
| Age       | numerical   |
| Gender    | categorical |
| Income    | numerical   |
| Education | categorical |
> [!WARNING]
> The accepted values in the 'Type' column are exclusively 'numerical' and 'categorical'. Additionally, the file must include column headers.
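A quick pre-upload check can catch descriptor files that violate these constraints. The helper below is hypothetical (not part of SynthCheck) and assumes the two-column layout shown above:

```python
import csv

def validate_descriptor(path):
    """Return (feature, type) pairs, raising ValueError on an invalid type."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)  # relies on the required column headers
        rows = [(row["Feature"], row["Type"]) for row in reader]
    for feature, ftype in rows:
        if ftype not in ("numerical", "categorical"):
            raise ValueError(f"{feature!r}: Type must be 'numerical' or 'categorical'")
    return rows
```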
The second section lets users perform a comprehensive quality assessment of the uploaded synthetic data. It comprises three subsections, each dedicated to a distinct quality analysis.
The resemblance subsection provides access to three analyses:
- **URA Analysis**: conducts various statistical tests and distance-metric comparisons for both numerical and categorical features.
- **MRA Analysis**: computes metrics related to Multiple Resemblance Analysis, such as correlation matrices, outlier analysis, explained-variance analysis, and UMAP method implementations.
- **DLA Analysis**: presents, for each classifier used in the Data Labeling Analysis, the values of the performance metrics.
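The idea behind a Data Labeling Analysis can be sketched as follows: label real records 0 and synthetic records 1, train a classifier to tell them apart, and inspect its performance. This minimal sketch assumes scikit-learn and uses random toy data as stand-ins for the real and synthetic datasets; it is not SynthCheck's actual implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
real = rng.normal(0, 1, size=(200, 5))       # stand-in for the original data
synthetic = rng.normal(0, 1, size=(200, 5))  # stand-in for the synthetic data

# Label real rows 0 and synthetic rows 1, then train a classifier on the mix.
X = np.vstack([real, synthetic])
y = np.array([0] * len(real) + [1] * len(synthetic))
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
# Accuracy close to 0.5 means the classifier cannot separate real from
# synthetic records, i.e. the two datasets resemble each other.
```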
The utility subsection implements the TRTR (Train on Real, Test on Real) and TSTR (Train on Synthetic, Test on Real) approaches for a selected target class and machine learning model.
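The TRTR/TSTR comparison can be sketched as below: both models are evaluated on the same held-out real test set, once after training on real data and once after training on synthetic data. This is a minimal sketch assuming scikit-learn, with random toy data standing in for the real and synthetic datasets:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
# Toy stand-ins: feature matrices and a binary target for each dataset.
real_X, real_y = rng.normal(size=(300, 4)), rng.integers(0, 2, 300)
synth_X, synth_y = rng.normal(size=(300, 4)), rng.integers(0, 2, 300)

# Hold out part of the real data as the common test set.
X_test, y_test = real_X[200:], real_y[200:]

# TRTR: train on real data, test on real data.
trtr = DecisionTreeClassifier(random_state=0).fit(real_X[:200], real_y[:200])
acc_trtr = accuracy_score(y_test, trtr.predict(X_test))

# TSTR: train on synthetic data, test on the same real test set.
tstr = DecisionTreeClassifier(random_state=0).fit(synth_X, synth_y)
acc_tstr = accuracy_score(y_test, tstr.predict(X_test))
# Comparable TRTR and TSTR scores suggest the synthetic data preserves utility.
```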
The privacy subsection consists of three components dedicated to privacy evaluation:
- **SEA Analysis**: computes metrics like cosine similarity, Euclidean distance, and Hausdorff distance, displaying the corresponding density plots or values.
- **MIA Simulation**: simulates Membership Inference Attacks with adjustable attacker parameters and showcases the attacker's performance.
- **AIA Simulation**: allows simulation of Attribute Inference Attacks in which the user sets the attacker's access to features, displaying reconstruction performance metrics.
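The distance metrics named in the SEA Analysis can be computed as sketched below. This minimal sketch assumes SciPy and uses random toy arrays as stand-ins for the real and synthetic datasets; it is not SynthCheck's actual code:

```python
import numpy as np
from scipy.spatial.distance import cdist, directed_hausdorff

rng = np.random.default_rng(2)
real = rng.normal(size=(50, 3))       # stand-in for the original records
synthetic = rng.normal(size=(50, 3))  # stand-in for the synthetic records

# Pairwise similarity/distance between every synthetic and real record.
cos_sim = 1 - cdist(synthetic, real, metric="cosine")  # cosine similarity
eucl = cdist(synthetic, real, metric="euclidean")

# Hausdorff distance: worst-case nearest-neighbour gap between the two sets.
hausdorff = max(directed_hausdorff(synthetic, real)[0],
                directed_hausdorff(real, synthetic)[0])
# Very high similarities or very small distances to real records could
# indicate that synthetic rows leak information about real patients.
```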
Each section provides options to download reports containing the displayed graphs and tables.
Distributed under the MIT License. See `LICENSE` for more information.
If you use SynthCheck, please cite
Santangelo, G.; Nicora, G.; Bellazzi, R. and Dagliati, A. (2024). SynthCheck: A Dashboard for Synthetic Data Quality Assessment. In Proceedings of the 17th International Joint Conference on Biomedical Engineering Systems and Technologies - HEALTHINF; ISBN 978-989-758-688-0; ISSN 2184-4305, SciTePress, pages 246-256. DOI: 10.5220/0012558700003657