- `AB_RESULTS_DIR` is the path to the test configuration directory.
- `AB_PYTHON_DIR` is the path to the directory containing the python test programs.
- `AB_SHARP_DIR` is the path to the directory containing the AB Sharp SynDiffix code (the `adaptive-buckets-sharp` directory).
- `SYNDIFFIX_PY_DIR` is the path to the root directory of the SynDiffix Python implementation.

These tests all assume a root directory whose location is defined by the environment variable `AB_RESULTS_DIR`.
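For concreteness, a minimal sketch (not part of the test code) of how a script might resolve these variables:

```python
import os

# Resolve the environment variables this README assumes.
def get_ab_paths() -> dict:
    required = ["AB_RESULTS_DIR", "AB_PYTHON_DIR", "AB_SHARP_DIR", "SYNDIFFIX_PY_DIR"]
    paths = {}
    for var in required:
        if var not in os.environ:
            raise KeyError(f"environment variable {var} is not set")
        paths[var] = os.environ[var]
    return paths
```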
Underneath `AB_RESULTS_DIR` is a set of directories, each of which can be used to build and measure synthetic data for some set of datasets. Each of these directories can be thought of as an experiment with respect to the corresponding set of datasets. We refer to these as experiment directories.

Under each experiment directory is a number of directories with prescribed names. There are eight different types of directories:
- `csv`: contains CSV datasets for testing. The datasets must be split into `train` and `test` files, each containing half of the data, randomly selected (see `misc/splitFile.py`). These two sets are in directories labeled `train` and `test`.
- `results`: contains the results of building synthetic data. Has one sub-directory per method (or per distinct set of method parameters).
- `measures`: contains the results of measuring data quality, ML efficacy, and privacy from the synthetic data.
- `summaries`: contains graphs and csv files that summarize the measurements.
- `runs`: contains the commands used to run the tests, including SLURM jobs.
- `origMl`: the ML model scores against the original data.
- `origMl_samples`: temporary directory for the separate measurement samples for the original ML measures.
- `measures_samples`: temporary directory for the separate measurement samples for the ML measures of the synthetic data.
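For illustration only, a hypothetical helper (not part of the test code) that creates this skeleton for a new experiment directory:

```python
import os

# Hypothetical helper: create the prescribed sub-directories under a
# new experiment directory beneath AB_RESULTS_DIR.
def make_experiment_dir(name: str) -> str:
    exp_dir = os.path.join(os.environ["AB_RESULTS_DIR"], name)
    for sub in ["csv/train", "csv/test", "results", "measures", "summaries",
                "runs", "origMl", "origMl_samples", "measures_samples"]:
        os.makedirs(os.path.join(exp_dir, sub), exist_ok=True)
    return exp_dir
```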
To build synthetic data with SynDiffix, `github/diffix/syndiffix` needs to be installed.
We use the SDV library to build synthetic data with learning-based methods (see `sdmManager.py` and `sdmTools.py`).
`quicktester.py` can build CSV datasets according to a variety of specifications. Beyond these generated datasets, one can use any CSV dataset they wish. `quicktester.py` can also build synthetic data and run quality measures on it.
`oneModel.py` is used to build synthetic data on a SLURM cluster.
To build synthetic data with focus columns, it is necessary to first run `sdmManager.py makeFocusRuns`. This will build the files `focusBuildJobs.json` and `batchFocus` in the run directory. Then run `sbatch batchFocus`. This will place the resulting output in the directory `syndiffix_focus` in the results directory. (Note that prior to running `sdmManager.py makeFocusRuns`, you must have already run `sdmManager.py makeMlRuns`. This is because `makeFocusRuns` requires the file `mlJobs.json`.)
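The required ordering can be scripted; a sketch, assuming the commands are issued from the experiment directory with SLURM available, and that `runs` is the run directory:

```python
import subprocess

# makeMlRuns must precede makeFocusRuns, because makeFocusRuns
# reads mlJobs.json.
subprocess.run(["python", "sdmManager.py", "makeMlRuns"], check=True)
subprocess.run(["python", "sdmManager.py", "makeFocusRuns"], check=True)
# Submit the generated SLURM batch file from the runs directory.
subprocess.run(["sbatch", "batchFocus"], cwd="runs", check=True)
```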
This is currently just test software (prior to integrating the features measure into syndiffix).

Run `sdmManager.py makeFeatures --featuresDir=<featuresDir> --featuresType=<featuresType>`, where `<featuresDir>` is the directory holding the features json files, and `<featuresType>` is 'univariate' or 'ml' or whatever else we decide. This creates the featuresDir, a file called `featuresJobs.json` in the runs directory, a SLURM batch file called `batch_<featuresType>` in the runs directory, and a directory `logs_<featuresType>` in the runs directory.

Run `sbatch batch_<featuresType>`. This creates the SLURM jobs with `oneFeaturesJob.py`.
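A sketch of the two steps, assuming `featuresDir` is named `features` (an example value) and `featuresType` is `univariate`:

```python
import subprocess

features_type = "univariate"  # or "ml", per the README
# Create the features directory, featuresJobs.json, the SLURM batch
# file, and the logs directory.
subprocess.run(["python", "sdmManager.py", "makeFeatures",
                "--featuresDir=features", f"--featuresType={features_type}"],
               check=True)
# Submit the generated batch file from the runs directory.
subprocess.run(["sbatch", f"batch_{features_type}"], cwd="runs", check=True)
```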
There is a way to produce many syndiffix tables from the same original table, each with a different combination of columns.
- Place a file `colCombs.json` in the runs directory. See the example under `misc`.
- Run `sdmManager makeColCombs --synMethod=sdx_whatever`. This produces the files `colcombJobs.json` and `batchCombs` in the runs directory.
- Do `sbatch batchCombs`.
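The schema of `colCombs.json` is given by the example under `misc`; purely to illustrate the idea of column combinations, a sketch (file name hypothetical):

```python
import itertools
import pandas as pd

# Each combination of columns would yield one syndiffix table.
df = pd.read_csv("original.csv")  # hypothetical input table
for cols in itertools.combinations(df.columns, 3):
    print(list(cols))
```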
- To create csv files with columns in order of least to most cardinality, edit and run `misc/prepCsvForSynthpop.py`.
- Run `sdmManager.py makeSynthpopRuns`. This populates the directory `runs/synthpop_jobs` with R scripts, one per csv file. It also creates `runs/batchSynthpop`.
- Run `sbatch batchSynthpop`. This generates the synthpop output in `synthpop_builds`.
- To convert the synthpop output into the appropriate results files, edit and run `Rscripts/extractSynthpop.py`.
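A sketch of the cardinality ordering performed by `misc/prepCsvForSynthpop.py` (file names are placeholders):

```python
import pandas as pd

# Reorder columns from least to most cardinality (distinct values).
df = pd.read_csv("original.csv")
ordered = sorted(df.columns, key=lambda col: df[col].nunique())
df[ordered].to_csv("forSynthpop.csv", index=False)
```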
`quicktester.py` can measure quality.
To do measures on the SLURM cluster, we have the following workflow:
- Use `sdmManager.py updateCsvInfo` whenever new tables are added to a csv directory, or when code has changed and you want to rerun the original ML measures. This creates the file `csvOrder.json` in the measures directory. Note that if a table is removed, then `csvOrder.json` should be removed first. Also remove `focusColumns.json` in the runs directory before running this.
- Run `sdmManager.py makeOrigMlRuns` to create `batchOrigMl`. Select a temporary directory to hold the measure samples.
- Run `sbatch batchOrigMl` to run ML measures over the original (not synthesized) datasets. The purpose of this is to determine which ML measures (i.e. column and method) have the best quality. These become the baseline measures for comparison with synthetic data. This creates the `origMl` directory and populates it with one json file per model. Note that `batchOrigMl` creates multiple measures per table/column/method combination, each such measure in a separate file.
- Run `sdmManager.py makeMlRuns` to build the SLURM configuration information for doing ML measures (creates the files `mlJobs.json`, `batchMl`, and `batchMl.sh` in the `runs` directory).
  - The command line parameter `--synMethod=method` can be used to limit the created jobs to those of the given synMethod only. In any event, `makeMlRuns` will not schedule runs if measures already exist. You must manually remove existing measures if you want to rerun them.
  - The command line parameter `--limitToFeatures=True|False` determines whether the measure is made over the entire table, or only over the K features found with `makeFeatures`. If `--limitToFeatures=True`, then the file `gatherFeatures.sh` is additionally created in the `runs` directory.
  - NOTE: `makeMlRuns` needs to be run on a machine with more memory than the "submit" machines. Suggest pinky03 or brain03.
- Run `batchMl.sh` to do the ML measures. Note that this makes multiple measures per table/column/method combination.
- Run `sdmManager.py mergeMlMeasures`, which selects the best score of the multiple ML measures, creates a json file containing that score, and places it in the appropriate `measures` subdirectory (see the sketch after this list).
- Run `sdmManager.py makeQualRuns` to build the SLURM configuration information for doing quality measures (creates the file `batchQual` in the `runs` directory).
- In the `runs` directory, do `sbatch batchQual` to do the 1dim and 2dim quality measures.
- Run `summarize.py` to summarize the performance measures and place them in various plots.
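The selection step in `mergeMlMeasures` can be pictured as follows; this is a sketch only, assuming each sample file carries a numeric `score` field (the actual file layout is defined by the tools):

```python
import json
from pathlib import Path

# Pick the best of the multiple measures made for one
# table/column/method combination ("score" is an assumed field name).
def best_measure(sample_files: list[Path]) -> dict:
    measures = [json.loads(p.read_text()) for p in sample_files]
    return max(measures, key=lambda m: m["score"])
```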
We use the python package `anonymeter` to measure privacy. This requires that we split all of our data files into `train` and `test` subsets. The basic idea is that `train` is synthesized, and then we use `test` as a control to test the effectiveness of attacks.
- Run `misc/splitFiles.py` to generate the csv split files. These are placed into the directories `train` and `test` under `csv` respectively (see the sketch after this list).
- Generate synthesized data (e.g. using `oneModel.py`) from `train` and put it in the `results` directory.
- (Run `sdmManager.py updateCsvInfo` if you haven't before.)
- Run `sdmManager.py makePrivRuns` to create a jobs file `privJobs.json` and a batch script `batchPriv`, both placed in the `runs` directory. Do `sbatch batchPriv`. This places the privacy measures in the `measures` directory.
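The split produced by `misc/splitFiles.py` can be sketched as follows, per the `csv` directory description above (half the rows, randomly selected; paths are placeholders):

```python
import pandas as pd

# Split one CSV into random halves: train (synthesized) and
# test (control for attacks).
df = pd.read_csv("csv/mytable.csv")
train = df.sample(frac=0.5, random_state=0)
test = df.drop(train.index)
train.to_csv("csv/train/mytable.csv", index=False)
test.to_csv("csv/test/mytable.csv", index=False)
```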
`summarize.py` reads in data from the `measures` directory and produces a set of summary graphs. It reads a configuration file `summarize.json`. An example of `summarize.json` can be found in `misc/`. `summarize.json` can be used to do a number of things:
- Ignore listed synMethods
- Rename synMethods (used to change labels in the plots)
- Select the best ML measures from a pair of synMethods (note when using this feature, 2col quality scores are lost)
- Select which combinations of synMethods should be plotted together
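Purely for illustration, a hypothetical `summarize.json` might be generated like this; the real key names are defined by `summarize.py`, so consult the example under `misc/`:

```python
import json

# HYPOTHETICAL config -- key names and values are invented for illustration.
config = {
    "ignoreSynMethods": ["sdx_old"],               # synMethods to skip
    "renameSynMethods": {"sdx_v2": "SynDiffix"},   # relabel plots
    "plotTogether": [["SynDiffix", "ctGan"]],      # combinations to plot
}
with open("summarize.json", "w") as f:
    json.dump(config, f, indent=4)
```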
The tests run off of prebuilt tables in `.csv` files. These files are by default in the directory `csvLib`, but they can be in another directory as well.
The `synResults` directory contains the output of the adaptive buckets programs. This directory has its own structure, defined below. The test routines `syntest.py` and `quicktester.py` (among possibly others) place their output here by default.

Note that `syntest.py` operates either with command line parameters or from a `json` config file. The file can be found in the `AB_RESULTS_DIR`, and by default is named `syntest.json`. An example of the file can be found in `sampleConfigs`.
This directory contains the output of tests that measure the quality of the data in `synResults`.

Note that `synmeasure.py` operates either with command line parameters or from a `json` config file. The file can be found in the `AB_RESULTS_DIR`. An example of the file can be found in `sampleConfigs/synmeasure.json`.
Each test operates over a given data source, with given versions of the algorithms (forest, harvest, clustering), given parameters for each algorithm, and the implementation (python or F#).
`synResults` contains one directory for each combination of versions, parameters, and implementation. The directory name is formatted as:

`implementation.for_param_version.har_version.cl_param_version.md_version`

`implementation` is either 'py' or 'fs' (for python or F#). Later we will have 'pg' as well.

Each `param` is a tag defining the parameter set used for the respective algorithm (`for` for forest, `har` for harvest, `cl` for clustering, and `md` for microdata). (Note that at the moment, the default parameters have the tag `g0`.)

Each `version` is the version of the algorithm (`v1`, `v2`, ...).

An example is `py.for_g0_v3.har_v5.cl_g0_v1.md_v1`.
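A minimal sketch (not part of the test code) that parses such a directory name:

```python
# Parse a results directory name such as "py.for_g0_v3.har_v5.cl_g0_v1.md_v1".
def parse_results_dir(name: str) -> dict:
    parts = name.split(".")
    algs = {}
    for part in parts[1:]:
        fields = part.split("_")   # e.g. ['for', 'g0', 'v3'] or ['har', 'v5']
        param = fields[1] if len(fields) == 3 else None
        algs[fields[0]] = {"param": param, "version": fields[-1]}
    return {"implementation": parts[0], "algorithms": algs}

print(parse_results_dir("py.for_g0_v3.har_v5.cl_g0_v1.md_v1"))
```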
The json files containing the results have the format:

`sourceDataFile.directoryName.json`

An example is `4dimAirportCluster.csv.py.for_g0_v3.har_v5.cl_g0_v1.md_v1.json`.
The results file produced by `controller.py` has the following structure:
{
    "elapsedTime": 0.5984406471252441,
    "colNames": [
        "'c0'",
        "'c1'",
        "'c2'",
        "'c3'"
    ],
    "originalRows": [
        [
            "'a'",
            "'b'",
            0.9787379841057392,
            2.240893199201458
        ],
        [
            "'a'",
            "'b'",
            0.9500884175255894,
            -0.1513572082976979
        ],
        ...
    ],
    "synRows": [
        [
            "'a'",
            "'a'",
            -0.48094763821008923,
            1.3420975576293672
        ],
        [
            "'a'",
            "'a'",
            1.6458741734699058,
            -1.6682721485003236
        ],
        ...
    ],
    "params": {
        "forest": {
            "sing": 10,
            "range": 50,
            "lcf": 5,
            "dependence": 10,
            "threshSd": 1.0,
            "noiseSd": 1.0,
            "lcdBound": 2,
            "pName": "g0",
            "version": "v3"
        },
        "cluster": {
            "thresholds": [
                null,
                null,
                0.05,
                0.2,
                0.35
            ],
            "pName": "g0"
        },
        "baseConfig": "4dimDependentTextWithNoise.csv",
        "clustered": true,
        "fileName": "4dimDependentTextWithNoise.csv.py.for_g0_v3.har_v5.cl_g0_v1.md_v1",
        "harvest": {
            "version": "v5"
        },
        "microdata": {
            "version": "v1"
        }
    },
    ... other stuff
}
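A minimal sketch for loading such a results file (key names taken from the structure above; the file name is the example from earlier):

```python
import json

with open("4dimAirportCluster.csv.py.for_g0_v3.har_v5.cl_g0_v1.md_v1.json") as f:
    results = json.load(f)

print("columns:       ", results["colNames"])
print("original rows: ", len(results["originalRows"]))
print("synthetic rows:", len(results["synRows"]))
print("elapsed (s):   ", results["elapsedTime"])
```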
Python code is formatted using:

`autopep8 --in-place --recursive test_syndiffix/`