This repository contains the datasets from the paper "Small molecule machine learning: All models are wrong, some may not even be helpful", alongside Jupyter notebooks for visualizing MCES distances. Files too large for GitHub are hosted at OSF.
file | description |
---|---|
biostructures.csv | Biomolecular structures (SMILES and InChI-key first block) |
biostructures_20k.csv | Subsample of biomolecular structures used throughout the paper |
subsampled_instances_20k.csv | Subsample of pairs of biomolecular structures used for runtime and threshold evaluations |
mces_distances.npz | Compressed numpy object containing all computed MCES distances alongside SMILES. Hosted externally at doi:10.17605/OSF.IO/5SXFE. |
umap_df.csv | Computed UMAP embeddings for various datasets |
umap_embedding_biostructures.pkl | umap-learn object allowing projection of new structures onto the computed UMAP embedding. Hosted externally at doi:10.17605/OSF.IO/5SXFE. |
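
The layout of mces_distances.npz (array names, shapes, dtypes) is not documented above, so it can help to inspect the archive before working with it. The following is a minimal sketch, not part of the repository: the helper name `summarize_npz` is hypothetical, and it assumes the file has already been downloaded from OSF into the working directory.

```python
import os

import numpy as np


def summarize_npz(path):
    """Return {array_name: (shape, dtype)} for every array stored in an .npz file."""
    summary = {}
    # allow_pickle=True is required if the archive holds object arrays
    # (for example SMILES strings); only enable it for files you trust.
    with np.load(path, allow_pickle=True) as archive:
        for name in archive.files:
            arr = archive[name]
            summary[name] = (arr.shape, str(arr.dtype))
    return summary


if __name__ == "__main__":
    # Assumed local path; adjust to wherever the OSF download was saved.
    path = "mces_distances.npz"
    if os.path.exists(path):
        for name, (shape, dtype) in summarize_npz(path).items():
            print(f"{name}: shape={shape}, dtype={dtype}")
    else:
        print(f"{path} not found; download it from OSF first.")
```

Note that this loads each array fully into memory once, which may take a while for a large distance file.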
Precomputed UMAP embeddings, as well as embeddings of new structures, can be visualized with the Python script umap_vis.py. If you just want to use the visualization, download this repository and run `python umap_vis.py`.
To project MCES distances of a new dataset onto the existing UMAP embedding, use the Jupyter Notebook umap_embedding.ipynb.
A Python installation with version >= 3.9 is required (3.9.18 was used in development). Required packages are:
umap-learn=0.5.3
numba=0.53.1
scipy=1.7.1
pandas
numpy
plotly
rdkit
dash
gunicorn
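
To check an existing environment against the list above, a short script along these lines may be useful. It is a sketch only: the helper name `check_environment` is hypothetical, and it assumes each package registers metadata under the name listed here, which may not hold for every conda build (older rdkit builds in particular).

```python
import sys
from importlib.metadata import PackageNotFoundError, version

# Versions pinned in the list above; the remaining packages are unpinned.
PINNED = {"umap-learn": "0.5.3", "numba": "0.53.1", "scipy": "1.7.1"}
UNPINNED = ["pandas", "numpy", "plotly", "rdkit", "dash", "gunicorn"]


def check_environment(pinned=PINNED, unpinned=UNPINNED):
    """Return {package: (installed_version, expected_version, ok)}."""
    report = {}
    for name in list(pinned) + list(unpinned):
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # not installed, or metadata registered under another name
        expected = pinned.get(name)
        ok = installed is not None and (expected is None or installed == expected)
        report[name] = (installed, expected, ok)
    return report


if __name__ == "__main__":
    if sys.version_info < (3, 9):
        print("Python >= 3.9 is required")
    for name, (installed, expected, ok) in check_environment().items():
        status = "ok" if ok else "CHECK"
        print(f"{status:5} {name}: installed={installed}, expected={expected or 'any'}")
```

A package flagged as CHECK is not necessarily broken; it only means the installed version differs from the pinned one or could not be found under that name.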
A conda (or mamba) environment with all necessary packages installed can be created with
conda env create -f conda_env.yml
# to activate:
conda activate umap_mces
A docker container for the visualization can be built with the provided Dockerfile.
For the special case of self-hosting the Docker container behind a reverse proxy, the environment variable PROXY_PREFIX_REQUESTS might have to be set with the `-e` option of docker run: `docker run -e PROXY_PREFIX_REQUESTS='...' ...`.
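
How umap_vis.py consumes PROXY_PREFIX_REQUESTS is not described here. As an illustration only, the sketch below assumes the value is forwarded to Dash's `requests_pathname_prefix` argument, which must start and end with a slash. The helper name `normalized_prefix` is hypothetical.

```python
import os


def normalized_prefix(raw):
    """Return a path prefix that starts and ends with '/', or None if unset."""
    if not raw:
        return None
    core = raw.strip("/")
    return f"/{core}/" if core else "/"


if __name__ == "__main__":
    prefix = normalized_prefix(os.environ.get("PROXY_PREFIX_REQUESTS"))
    # Assumption: an app could then be created along the lines of
    #   dash.Dash(__name__, requests_pathname_prefix=prefix)
    # when a prefix is set, and without that argument otherwise.
    print(f"requests prefix: {prefix!r}")
```

Normalizing the value this way means the variable can be given with or without surrounding slashes.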