Skip to content
/ eDICE Public

Epigenomic Data Imputation via Contextualised Embeddings

License

Notifications You must be signed in to change notification settings

alex-hh/eDICE

Repository files navigation

eDICE

DOI

Epigenomic Data Imputation via Contextualised Embeddings

eDICE architecture

This repository contains the code for the model presented in the paper Getting Personal with Epigenetics: Towards Individual-Specific Epigenomic Imputation with Machine Learning.

Overview

The edice folder contains the source code used to perform the experiments presented in the paper.

The scripts folder contains the code to train eDICE on the Roadmap dataset as well as the code used to apply transfer learning for individualized predictions on the ENTEx dataset.

The r folder contains the code used to perform the differential peak analysis using R and DiffBind.

System requirements

Hardware requirements

eDICE requires only a standard computer with enough RAM to support the in-memory operations. However, the use of a GPU accelerator is recommended for the analysis of larger datasets.

Software requirements

The eDICE models were trained on computers operating on Ubuntu 16.04 and Ubuntu 22.04.

eDICE was developed using python 3.9. We recommend setting up a suitable environment using Anaconda. The environment and the package can be setup from the cloned eDICE folder as follows

conda create -n edice python==3.9
conda activate edice
pip install -r requirements.txt
python setup.py install

This operation will install all the package dependencies for eDICE, which should require only a few minutes on a typical computer. Alternatively, the requirements.txt file lists the dependencies used to perform the experiments presented in the paper.

The r folder contains the pipeline used for the differential peak analysis, which was performed using R version 4.1.0 (2021-05-18) and DiffBind version 3.2.7.

Demo

Sample data for a minimal run of the training script is provided in the folder edice/data/roadmap.

Ensure that a suitable environment is setup and active (see Software requirements).

To run the sample training script, the command is:

python scripts/train_roadmap.py --experiment_name "myRoadmapExperiment" --train_splits "train" --epochs 20 --transformation "arcsinh" --embed_dim 256 --lr 0.0003 --n_targets 120

The sample script produces a trained edice model located in the oputputs folder, as well as saving the predictions for the test tracks as a .npz file. A typical run of this example script requires approximately 40 minutes on a standard laptop.

Full data and trained models to run the Roadmap training and ENTEx transfer learning scripts are available at Data for reproducing the training of eDICE model.

To reproduce the model used for validation on the Roadmap dataset, download the roadmap_tracks_shuffled.h5 file from the linked dataset, move it to a data directory e.g. data/roadmap/roadmap_tracks_shuffled.h5 together with the idmap.json and the predict_splits.json files, include the annotations folder in the data folder, and run:

python scripts/train_roadmap.py --experiment_name "eDICE_Roadmap" --dataset "RoadmapRnd" --data_dir "data" --split_file "data/roadmap/predictd_splits.json" --train_splits "train" "val" --epochs 50 --transformation "arcsinh" --embed_dim 256 --lr 0.0003 --n_targets 120

Run eDICE on your data

To run eDICE on custom data, the epigenomic tracks must be provided in a suitable HDF5 format. Utility functions to preprocess data in a suitable manner are under development. Once the data is processed, run the train_eDICE.py script as:

python scripts/train_eDICE.py --experiment_name "myCustomExperiment" --dataset_filepath "roadmap/SAMPLE_chr21_roadmap_train.h5" --data_dir "sample_data" --idmap "sample_data/roadmap/idmap.json" --dataset_name "mySampleRoadmap" --split_file "sample_data/roadmap/predictd_splits.json" --gap_file "annotations/hg19gap.txt" --blacklist_file "annotations/hg19-blacklist.v2.bed" --train_splits "train" --epochs 20 --transformation "arcsinh" --embed_dim 256 --lr 0.0003 --n_targets 120

License

This project is covered under the MIT License.

Funding

This project has received funding from the European Union's Framework Programme for Research and Innovation Horizon 2020 (2014-2020) under the Marie Skłodowska-Curie Grant Agreement No. 813533-MSCA-ITN-2018

Citation

For usage of this package please cite the original paper Getting personal with epigenetics: towards individual-specific epigenomic imputation with machine learning.

About

Epigenomic Data Imputation via Contextualised Embeddings

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages