Machine Learning and Artificial Intelligence are increasingly being applied to solve health-related problems, such as predicting prognosis from Electronic Health Records or detecting patterns in multi-omics data. Data plays a significant role in the development of such systems, but concerns have been raised when dealing with patients' data, with regulators underlining the need to protect patients' privacy. To this end, in recent years there have been growing proposals to replace original data (derived from real patients) with synthetic data that mimic the main statistical characteristics of their real counterparts. Regardless of the methods employed to generate them, it is essential to assess the quality of the synthetic data. To address this need, we have created a Dash application that users can install and run on their own computers. The application allows users to upload both original and synthetic data and generates various metrics to assess resemblance, utility, and privacy. Furthermore, users can download a report containing the obtained results. (DOI: 10.5220/0012558700003657)
This repository provides a Conda environment configuration file (`synthcheck_env.yml`) to streamline the setup process. Follow these steps to create the environment:
> [!IMPORTANT]
> Make sure you have Conda installed. If not, install Conda before proceeding.
- **Create the Conda Environment**

  Run the following command to create the environment using the provided `synthcheck_env.yml` file:

  ```
  conda env create -f synthcheck_env.yml
  ```

  This command will set up a Conda environment named according to the specifications in the `synthcheck_env.yml` file.

- **Activate the Environment**

  Once the environment is created, activate it using:

  ```
  conda activate synthcheck_env
  ```

- **Run the Application**

  Once the virtual environment is activated, you can run the application with:

  ```
  python SynthCheck_app.py
  ```

- **Deactivate the Environment**

  To deactivate the environment, simply use:

  ```
  conda deactivate
  ```

You can now work within this Conda environment to run the application.
The application is organized into two main sections:
The data upload process for quality evaluation is divided into several components:
Users are prompted to upload two CSV files:
- Original Dataset: it contains the dataset used when generating the synthetic data (example original dataset).
- Synthetic Dataset: it comprises the synthetic data for quality evaluation purposes (example synthetic dataset).
> [!TIP]
> Ensure that categorical feature categories are encoded with numerical values (e.g., 'benign' = 0 and 'malign' = 1).
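For example, such an encoding can be applied with pandas before uploading. This is a toy sketch (the column name and category labels are illustrative, not part of SynthCheck):

```python
import pandas as pd

# Toy data: a categorical feature with string labels.
df = pd.DataFrame({"diagnosis": ["benign", "malign", "benign"]})

# Map each category to a numeric code ('benign' = 0, 'malign' = 1).
df["diagnosis"] = df["diagnosis"].map({"benign": 0, "malign": 1})

# Save the encoded dataset ready for upload.
df.to_csv("original_encoded.csv", index=False)
```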
In addition to the datasets, users are required to upload a descriptor file in CSV format (example feature type file). This file is structured with two columns:
| Feature   | Type        |
|-----------|-------------|
| Age       | numerical   |
| Gender    | categorical |
| Income    | numerical   |
| Education | categorical |
> [!WARNING]
> The accepted values in the 'Type' column are exclusively 'numerical' and 'categorical'. Additionally, the file must include column headers.
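A quick pre-upload check can catch descriptor files that violate these constraints. The helper below is hypothetical (not part of SynthCheck) and assumes the two-column layout shown above:

```python
import csv

def validate_descriptor(path):
    """Return (feature, type) pairs, raising ValueError on an invalid type."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)  # relies on the required column headers
        rows = [(row["Feature"], row["Type"]) for row in reader]
    for feature, ftype in rows:
        if ftype not in ("numerical", "categorical"):
            raise ValueError(f"{feature!r}: Type must be 'numerical' or 'categorical'")
    return rows
```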
The second section lets users perform a comprehensive quality assessment of the uploaded synthetic data. It comprises three subsections, each dedicated to a distinct quality analysis.
The resemblance subsection provides access to three analyses:
- **URA Analysis**: conducts various statistical tests and distance-metric comparisons for both numerical and categorical features.
- **MRA Analysis**: computes metrics related to Multiple Resemblance Analysis, such as correlation matrices, outlier analysis, explained-variance analysis, and UMAP method implementations.
- **DLA Analysis**: presents, for each classifier used in the Data Labeling Analysis, the values of the performance metrics.
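The idea behind a Data Labeling Analysis can be sketched as follows: label real records 0 and synthetic records 1, train a classifier to tell them apart, and inspect its performance. This minimal sketch assumes scikit-learn and uses random toy data as stand-ins for the real and synthetic datasets; it is not SynthCheck's actual implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
real = rng.normal(0, 1, size=(200, 5))       # stand-in for the original data
synthetic = rng.normal(0, 1, size=(200, 5))  # stand-in for the synthetic data

# Label real rows 0 and synthetic rows 1, then train a classifier on the mix.
X = np.vstack([real, synthetic])
y = np.array([0] * len(real) + [1] * len(synthetic))
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
# Accuracy close to 0.5 means the classifier cannot separate real from
# synthetic records, i.e. the two datasets resemble each other.
```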
The utility subsection implements the TRTR (Train on Real, Test on Real) and TSTR (Train on Synthetic, Test on Real) approaches for a selected target class and machine learning model.
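The TRTR/TSTR comparison can be sketched as below: both models are evaluated on the same held-out real test set, once after training on real data and once after training on synthetic data. This is a minimal sketch assuming scikit-learn, with random toy data standing in for the real and synthetic datasets:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
# Toy stand-ins: feature matrices and a binary target for each dataset.
real_X, real_y = rng.normal(size=(300, 4)), rng.integers(0, 2, 300)
synth_X, synth_y = rng.normal(size=(300, 4)), rng.integers(0, 2, 300)

# Hold out part of the real data as the common test set.
X_test, y_test = real_X[200:], real_y[200:]

# TRTR: train on real data, test on real data.
trtr = DecisionTreeClassifier(random_state=0).fit(real_X[:200], real_y[:200])
acc_trtr = accuracy_score(y_test, trtr.predict(X_test))

# TSTR: train on synthetic data, test on the same real test set.
tstr = DecisionTreeClassifier(random_state=0).fit(synth_X, synth_y)
acc_tstr = accuracy_score(y_test, tstr.predict(X_test))
# Comparable TRTR and TSTR scores suggest the synthetic data preserves utility.
```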
The privacy subsection consists of three components dedicated to privacy evaluation:
- **SEA Analysis**: computes metrics like cosine similarity, Euclidean distance, and Hausdorff distance, displaying the corresponding density plots or values.
- **MIA Simulation**: simulates Membership Inference Attacks with adjustable attacker parameters and showcases the attacker's performance.
- **AIA Simulation**: allows simulation of Attribute Inference Attacks in which the user sets the attacker's access to features, displaying reconstruction performance metrics.
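The distance metrics named in the SEA Analysis can be computed as sketched below. This minimal sketch assumes SciPy and uses random toy arrays as stand-ins for the real and synthetic datasets; it is not SynthCheck's actual code:

```python
import numpy as np
from scipy.spatial.distance import cdist, directed_hausdorff

rng = np.random.default_rng(2)
real = rng.normal(size=(50, 3))       # stand-in for the original records
synthetic = rng.normal(size=(50, 3))  # stand-in for the synthetic records

# Pairwise similarity/distance between every synthetic and real record.
cos_sim = 1 - cdist(synthetic, real, metric="cosine")  # cosine similarity
eucl = cdist(synthetic, real, metric="euclidean")

# Hausdorff distance: worst-case nearest-neighbour gap between the two sets.
hausdorff = max(directed_hausdorff(synthetic, real)[0],
                directed_hausdorff(real, synthetic)[0])
# Very high similarities or very small distances to real records could
# indicate that synthetic rows leak information about real patients.
```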
Each section provides options to download reports containing the displayed graphs and tables.
Distributed under the MIT License. See `LICENSE` for more information.
If you use SynthCheck, please cite
Santangelo, G.; Nicora, G.; Bellazzi, R. and Dagliati, A. (2024). SynthCheck: A Dashboard for Synthetic Data Quality Assessment. In Proceedings of the 17th International Joint Conference on Biomedical Engineering Systems and Technologies - HEALTHINF; ISBN 978-989-758-688-0; ISSN 2184-4305, SciTePress, pages 246-256. DOI: 10.5220/0012558700003657