The DART-Eval preprint is available here: (Insert Preprint Link)
All data is available for download at Synapse project syn59522070.
The commands in this section reproduce the results for each task in the paper. The output files mirror the structure of the Synapse project.
Prior to running analyses, set the $DART_WORK_DIR environment variable. This directory will be used to store intermediate files and results. Additionally, download the genome reference files from syn60581044 into $DART_WORK_DIR/refs, keeping the file names. These genome references are used across all tasks.
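Downstream commands all resolve files relative to $DART_WORK_DIR. A small path helper can make missing setup fail fast; the helper name is illustrative and not part of the dnalm_bench package:

```python
import os
from pathlib import Path

def dart_path(*parts: str) -> Path:
    """Resolve a path under $DART_WORK_DIR (illustrative helper).

    Raises immediately if the environment variable is unset, so a
    missing setup step fails loudly instead of scattering files.
    """
    work_dir = os.environ.get("DART_WORK_DIR")
    if work_dir is None:
        raise RuntimeError("set the DART_WORK_DIR environment variable first")
    return Path(work_dir).joinpath(*parts)

# For illustration only; in practice, export DART_WORK_DIR in your shell.
os.environ.setdefault("DART_WORK_DIR", "/tmp/dart_work_dir")

# The genome references from syn60581044 are expected under refs/.
refs_dir = dart_path("refs")
```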
In the following commands, $MODEL represents the evaluated DNALM architecture, one of dnabert2, gena_lm, hyenadna, or nucleotide_transformer. $MODEL_SPECIFIC_NAME represents the specific version of each model, namely one of DNABERT-2-117M, gena-lm-bert-large-t2t, hyenadna-large-1m-seqlen-hf, or nucleotide-transformer-v2-500m-multi-species.
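When scripting runs over all four architectures, the pairing between $MODEL and $MODEL_SPECIFIC_NAME can be kept in a lookup table; the dict below is simply a restatement of the lists above:

```python
# Maps each $MODEL architecture name to its $MODEL_SPECIFIC_NAME version.
MODEL_SPECIFIC_NAMES = {
    "dnabert2": "DNABERT-2-117M",
    "gena_lm": "gena-lm-bert-large-t2t",
    "hyenadna": "hyenadna-large-1m-seqlen-hf",
    "nucleotide_transformer": "nucleotide-transformer-v2-500m-multi-species",
}

for model, specific in MODEL_SPECIFIC_NAMES.items():
    print(f"{model} -> {specific}")
```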
All inputs, intermediate files, and outputs for this task are available for download at syn60581046.
This task utilizes the set of ENCODE v3 candidate cis-regulatory elements (cCREs). A BED-format file of cCRE genomic coordinates is available at syn62153306. This file should be downloaded to $DART_WORK_DIR/task_1_ccre/input_data/ENCFF420VPZ.bed.
Generate the dataset of length-expanded cCREs:
python -m dnalm_bench.task_1_paired_control.dataset_generators.encode_ccre --ccre_bed $DART_WORK_DIR/task_1_ccre/input_data/ENCFF420VPZ.bed --output_file $DART_WORK_DIR/task_1_ccre/processed_inputs/ENCFF420VPZ_processed.tsv
This script expands each element to 350 bp, centered on the midpoint of the element. The output file is a TSV with the following columns:
chrom: chromosome
input_start: start position of the length-expanded element
input_end: end position of the length-expanded element
ccre_start: start position of the original cCRE
ccre_end: end position of the original cCRE
ccre_relative_start: start position of the original cCRE relative to the length-expanded element
ccre_relative_end: end position of the original cCRE relative to the length-expanded element
reverse_complement: 1 if the element is reverse complemented, 0 otherwise
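The column definitions imply simple coordinate invariants that can be used to sanity-check the TSV. The sketch below assumes reverse_complement == 0; flipped rows may index from the opposite end, and the example values are made up:

```python
def check_row(row: dict) -> None:
    """Verify coordinate invariants implied by the column definitions
    (for non-reverse-complemented rows)."""
    input_start, input_end = int(row["input_start"]), int(row["input_end"])
    ccre_start, ccre_end = int(row["ccre_start"]), int(row["ccre_end"])
    rel_start, rel_end = int(row["ccre_relative_start"]), int(row["ccre_relative_end"])
    assert input_end - input_start == 350          # fixed expansion length
    assert ccre_start - input_start == rel_start   # relative = absolute - input_start
    assert ccre_end - input_start == rel_end

# A synthetic row matching the schema:
check_row({
    "chrom": "chr1", "input_start": "1000", "input_end": "1350",
    "ccre_start": "1100", "ccre_end": "1250",
    "ccre_relative_start": "100", "ccre_relative_end": "250",
    "reverse_complement": "0",
})
```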
Run the zero-shot evaluation for each model:
python -m dnalm_bench.task_1_paired_control.zero_shot.encode_ccre.$MODEL
Extract final-layer embeddings
python -m dnalm_bench.task_1_paired_control.supervised.encode_ccre.extract_embeddings.probing_head_like
Train probing-head-like ab initio model
python -m dnalm_bench.task_1_paired_control.supervised.encode_ccre.ab_initio.probing_head_like
Evaluate probing-head-like ab initio model
python -m dnalm_bench.task_1_paired_control.supervised.encode_ccre.eval_ab_initio.probing_head_like
Extract final-layer embeddings from each model
python -m dnalm_bench.task_1_paired_control.supervised.encode_ccre.extract_embeddings.$MODEL
Train probing models
python -m dnalm_bench.task_1_paired_control.supervised.encode_ccre.train_classifiers.$MODEL
Evaluate probing models
python -m dnalm_bench.task_1_paired_control.supervised.encode_ccre.eval_probing.$MODEL
Train fine-tuned models
python -m dnalm_bench.task_1_paired_control.supervised.encode_ccre.finetune.$MODEL
Evaluate fine-tuned models
python -m dnalm_bench.task_1_paired_control.supervised.encode_ccre.eval_finetune.$MODEL
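The supervised steps above (extract, train, evaluate, for both probing and fine-tuning) can be driven per model with a short script. The stage order is taken from the commands above; the driver itself is a sketch, with actual execution left commented out:

```python
import subprocess
import sys

# Module suffixes, in the order the commands above run them.
STAGES = [
    "extract_embeddings",
    "train_classifiers",
    "eval_probing",
    "finetune",
    "eval_finetune",
]

def task1_commands(model: str) -> list:
    """Build the python -m invocations for one model's Task 1 pipeline."""
    base = "dnalm_bench.task_1_paired_control.supervised.encode_ccre"
    return [[sys.executable, "-m", f"{base}.{stage}.{model}"] for stage in STAGES]

for cmd in task1_commands("dnabert2"):
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to actually run each stage
```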
All inputs, intermediate files, and outputs for this task are available for download at syn60581043.
This task utilizes the set of HOCOMOCO v12 transcription factor sequence motifs. A MEME-format file of motifs is available at syn60756095. This file should be downloaded to $DART_WORK_DIR/task_2_footprinting/input_data/H12CORE_meme_format.meme.
Additionally, this task utilizes a set of sequences and shuffled negatives generated from Task 1.
Extract the sequences from the Task 1 embeddings file:
python -m dnalm_bench.task_2_5_single.dataset_generators.transcription_factor_binding.h5_to_seqs $DART_WORK_DIR/task_1_ccre/embeddings/probing_head_like.h5 $DART_WORK_DIR/task_2_footprinting/processed_data/raw_seqs_350.txt
Generate the motif footprinting dataset:
python -m dnalm_bench.task_2_5_single.dataset_generators.motif_footprinting_dataset --input_seqs $DART_WORK_DIR/task_2_footprinting/processed_data/raw_seqs_350.txt --output_file $DART_WORK_DIR/task_2_footprinting/processed_data/footprint_dataset_350.txt --meme_file $DART_WORK_DIR/task_2_footprinting/input_data/H12CORE_meme_format.meme
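Conceptually, the dataset generator implants each motif into background sequences; whether it centers motifs exactly as below is an assumption, and the real script reads motifs from the MEME file and also produces shuffled controls. A minimal sketch of the implanting idea:

```python
def implant_motif(background: str, motif: str) -> str:
    """Place `motif` at the center of `background`, preserving total length."""
    if len(motif) > len(background):
        raise ValueError("motif longer than background sequence")
    start = (len(background) - len(motif)) // 2
    return background[:start] + motif + background[start + len(motif):]

# Toy example: a 350 bp background with an AP-1-like consensus implanted.
seq = implant_motif("A" * 350, "TGACTCA")
```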
Extract embeddings for the footprinting dataset with each model:
python -m dnalm_bench.task_2_5_single.experiments.task_2_transcription_factor_binding.embeddings.$MODEL
Run the embedding-based footprinting evaluation:
python -m dnalm_bench.task_2_5_single.experiments.task_2_transcription_factor_binding.footprint_eval_embeddings --input_seqs $DART_WORK_DIR/task_2_footprinting/processed_data/footprint_dataset_350_v1.txt --embeddings $DART_WORK_DIR/task_2_footprinting/outputs/embeddings/$MODEL_SPECIFIC_NAME.h5 --output_file $DART_WORK_DIR/task_2_footprinting/outputs/evals/embeddings/$MODEL_SPECIFIC_NAME.tsv
Compute sequence likelihoods with each model:
python -m dnalm_bench.task_2_5_single.experiments.task_2_transcription_factor_binding.likelihoods.$MODEL
Run the likelihood-based footprinting evaluation:
python -m dnalm_bench.task_2_5_single.experiments.task_2_transcription_factor_binding.footprint_eval_likelihoods --input_seqs $DART_WORK_DIR/task_2_footprinting/processed_data/footprint_dataset_350_v1.txt --likelihoods $DART_WORK_DIR/task_2_footprinting/outputs/likelihoods/$MODEL_SPECIFIC_NAME.tsv --output_file $DART_WORK_DIR/task_2_footprinting/outputs/evals/likelihoods/$MODEL_SPECIFIC_NAME.tsv
dnalm_bench/task_2_5_single/experiments/eval_footprinting_likelihood.ipynb - figure production for likelihood-based evaluation
dnalm_bench/task_2_5_single/experiments/eval_footprinting_embedding.ipynb - figure production for embedding-based evaluation
dnalm_bench/task_2_5_single/experiments/footprinting_pairwise.ipynb - cross-model pairwise comparison plots
dnalm_bench/task_2_5_single/experiments/footprinting_conf_intervals.ipynb - confidence interval calculation
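The confidence-interval notebook presumably resamples per-sequence scores; a generic percentile-bootstrap sketch of that idea (not the notebook's exact procedure, and the score values below are made up):

```python
import random

def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement n_boot times and collect the resampled means.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

lo, hi = bootstrap_ci([0.61, 0.72, 0.68, 0.55, 0.80, 0.66, 0.74, 0.59])
```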
All inputs, intermediate files, and outputs for this task are available for download at syn60581042.
This task utilizes ATAC-seq experimental readouts from five cell lines. Input files are available at syn60581166. This directory should be cloned to $DART_WORK_DIR/task_3_peak_classification/input_data.
Using the input peaks from ENCODE, generate a consensus peakset:
python -m dnalm_bench.task_2_5_single.dataset_generators.peak_classification.make_consensus_peakset
Then, generate individual counts matrices for each sample, using input BAM files from ENCODE and the consensus peakset:
python -m dnalm_bench.task_2_5_single.dataset_generators.peak_classification.generate_indl_counts_matrix GM12878 $BAM_FILE
python -m dnalm_bench.task_2_5_single.dataset_generators.peak_classification.generate_indl_counts_matrix H1ESC $BAM_FILE
python -m dnalm_bench.task_2_5_single.dataset_generators.peak_classification.generate_indl_counts_matrix HEPG2 $BAM_FILE
python -m dnalm_bench.task_2_5_single.dataset_generators.peak_classification.generate_indl_counts_matrix IMR90 $BAM_FILE
python -m dnalm_bench.task_2_5_single.dataset_generators.peak_classification.generate_indl_counts_matrix K562 $BAM_FILE
Concatenate the counts matrices and generate DESeq inputs:
python -m dnalm_bench.task_2_5_single.dataset_generators.peak_classification.generate_merged_counts_matrix
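The merging step amounts to aligning the per-sample count columns on the shared consensus-peak index; a pandas sketch under that assumption (sample and column names are illustrative, not the script's actual schema):

```python
import pandas as pd

def merge_counts(per_sample: dict) -> pd.DataFrame:
    """Join per-sample count columns on the shared consensus-peak index."""
    merged = pd.concat(
        {name: df["count"] for name, df in per_sample.items()}, axis=1
    )
    # Peaks absent from a sample get a count of zero.
    return merged.fillna(0).astype(int)

# Toy example: two samples counted over the same three consensus peaks.
peaks = ["peak1", "peak2", "peak3"]
gm = pd.DataFrame({"count": [10, 0, 5]}, index=peaks)
k562 = pd.DataFrame({"count": [2, 8, 1]}, index=peaks)
counts = merge_counts({"GM12878": gm, "K562": k562})
```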
Finally, run DESeq for each cell type to obtain its differentially accessible peaks:
Rscript dnalm_bench/task_2_5_single/dataset_generators/peak_classification/DESeqAtac.R
The final output consists of the differentially accessible peaks, available at syn61788656.
Use FIMO to generate motif scores for each peak sequence.
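The clustering baseline consumes per-peak motif counts. A sketch of tallying hits from FIMO-style records (the field names follow FIMO's fimo.tsv output, but the helper and toy records are illustrative):

```python
from collections import Counter

def motif_counts(fimo_rows):
    """Count motif hits per (sequence, motif) pair.

    Each record is assumed to carry at least `sequence_name` and
    `motif_id` fields, as in FIMO's fimo.tsv output.
    """
    return Counter((r["sequence_name"], r["motif_id"]) for r in fimo_rows)

rows = [
    {"sequence_name": "peak1", "motif_id": "CTCF"},
    {"sequence_name": "peak1", "motif_id": "CTCF"},
    {"sequence_name": "peak2", "motif_id": "GATA1"},
]
counts = motif_counts(rows)
```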
The following notebook shows how to produce the zero-shot clustering results, using the motif counts from FIMO:
dnalm_bench/task_2_5_single/experiments/task_3_peak_classification/baseline/zero_shot_clustering_baseline.ipynb
This depends on the final-layer embeddings generated for the probed models.
Run clustering on each model's embeddings:
python -m dnalm_bench.task_2_5_single.experiments.task_3_peak_classification.cluster.run_clustering_subset $DART_WORK_DIR/task_3_peak_classification/embeddings/$MODEL_SPECIFIC_NAME.h5 $DART_WORK_DIR/task_3_peak_classification/processed_inputs/peaks_by_cell_label_unique_dataloader_format.tsv $DART_WORK_DIR/task_3_peak_classification/processed_inputs/indices_of_new_peaks_in_old_file.tsv $DART_WORK_DIR/task_3_peak_classification/clustering/$MODEL_SPECIFIC_NAME/
Here, $AB_INITIO_MODEL is one of probing_head_like or chrombpnet_like (ChromBPNet-like).
Extract final-layer embeddings (probing_head_like only)
python -m dnalm_bench.task_2_5_single.experiments.task_3_peak_classification.extract_embeddings.$AB_INITIO_MODEL
Train ab initio models
python -m dnalm_bench.task_2_5_single.experiments.task_3_peak_classification.baseline.$AB_INITIO_MODEL
Evaluate ab initio models
python -m dnalm_bench.task_2_5_single.experiments.task_3_peak_classification.eval_baseline.$AB_INITIO_MODEL
Extract final-layer embeddings from each model
python -m dnalm_bench.task_2_5_single.experiments.task_3_peak_classification.extract_embeddings.$MODEL
Train probing models
python -m dnalm_bench.task_2_5_single.experiments.task_3_peak_classification.train.$MODEL
Evaluate probing models
python -m dnalm_bench.task_2_5_single.experiments.task_3_peak_classification.eval_probing.$MODEL
Train fine-tuned models
python -m dnalm_bench.task_2_5_single.experiments.task_3_peak_classification.finetune.$MODEL
Evaluate fine-tuned models
python -m dnalm_bench.task_2_5_single.experiments.task_3_peak_classification.eval_finetune.$MODEL
All inputs, intermediate files, and outputs for this task are available for download at syn60581041.
This task utilizes DNase-seq experimental readouts from five cell lines. Input files are available at syn60581050. This directory should be cloned to $DART_WORK_DIR/task_4_peak_classification/input_data.
For this task, let $CELL_TYPE represent one of the following cell lines: GM12878, H1ESC, HEPG2, IMR90, or K562.
Extract final-layer embeddings from each model. This should be done for each value of $CATEGORY in ['peaks', 'nonpeaks', 'idr'].
python -m dnalm_bench.task_2_5_single.experiments.task_4_chromatin_activity.extract_embeddings.$MODEL $CELL_TYPE $CATEGORY
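Stepping $CELL_TYPE and $CATEGORY through all combinations gives fifteen extraction runs per model; a small driver sketch for enumerating them:

```python
import itertools

CELL_TYPES = ["GM12878", "H1ESC", "HEPG2", "IMR90", "K562"]
CATEGORIES = ["peaks", "nonpeaks", "idr"]

def extraction_args():
    """Yield every (cell_type, category) pair for the extraction step."""
    yield from itertools.product(CELL_TYPES, CATEGORIES)

jobs = list(extraction_args())  # 5 cell types x 3 categories = 15 runs
for cell_type, category in jobs:
    print(cell_type, category)
```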
Train probing models
python -m dnalm_bench.task_2_5_single.experiments.task_4_chromatin_activity.train.$MODEL
Evaluate probing models
python -m dnalm_bench.task_2_5_single.experiments.task_4_chromatin_activity.eval_probing.$MODEL
Train fine-tuned models
python -m dnalm_bench.task_2_5_single.experiments.task_4_chromatin_activity.finetune.$MODEL
Evaluate fine-tuned models
python -m dnalm_bench.task_2_5_single.experiments.task_4_chromatin_activity.eval_finetune.$MODEL
Evaluate ChromBPNet models
python -m dnalm_bench.task_2_5_single.experiments.task_4_chromatin_activity.eval_ab_initio.chrombpnet_baseline $CELL_TYPE $CHROMBPNET_MODEL_FILENAME
All inputs, intermediate files, and outputs for this task are available for download at syn60581045.
This task utilizes genomic QTL variants from two studies: African caQTLs (DeGorter et al.) and Yoruban dsQTLs (Degner et al.). Input TSV files of variants and experimental effect sizes are available at syn60756043 and syn60756039. These files should be downloaded to $DART_WORK_DIR/task_5_variant_effect_prediction/input_data/Afr.CaQTLS.tsv and $DART_WORK_DIR/task_5_variant_effect_prediction/input_data/yoruban.dsqtls.benchmarking.tsv, respectively.
python -m dnalm_bench.task_2_5_single.experiments.task_5_variant_effect_prediction.zero_shot_embeddings.$MODEL $DART_WORK_DIR/task_5_variant_effect_prediction/input_data/Afr.CaQTLS.tsv Afr.CaQTLS $DART_WORK_DIR/refs/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta
python -m dnalm_bench.task_2_5_single.experiments.task_5_variant_effect_prediction.zero_shot_embeddings.$MODEL $DART_WORK_DIR/task_5_variant_effect_prediction/input_data/yoruban.dsqtls.benchmarking.tsv yoruban.dsqtls.benchmarking $DART_WORK_DIR/refs/male.hg19.fa
python -m dnalm_bench.task_2_5_single.experiments.task_5_variant_effect_prediction.zero_shot_likelihoods.$MODEL $DART_WORK_DIR/task_5_variant_effect_prediction/input_data/Afr.CaQTLS.tsv Afr.CaQTLS $DART_WORK_DIR/refs/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta
python -m dnalm_bench.task_2_5_single.experiments.task_5_variant_effect_prediction.zero_shot_likelihoods.$MODEL $DART_WORK_DIR/task_5_variant_effect_prediction/input_data/yoruban.dsqtls.benchmarking.tsv yoruban.dsqtls.benchmarking $DART_WORK_DIR/refs/male.hg19.fa
python -m dnalm_bench.task_2_5_single.experiments.task_5_variant_effect_prediction.probed_log_counts.$MODEL $DART_WORK_DIR/task_5_variant_effect_prediction/input_data/Afr.CaQTLS.tsv $DART_WORK_DIR/task_5_variant_effect_prediction/outputs/probed/$MODEL/Afr.CaQTLS.tsv $DART_WORK_DIR/refs/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta
python -m dnalm_bench.task_2_5_single.experiments.task_5_variant_effect_prediction.probed_log_counts.$MODEL $DART_WORK_DIR/task_5_variant_effect_prediction/input_data/yoruban.dsqtls.benchmarking.tsv $DART_WORK_DIR/task_5_variant_effect_prediction/outputs/probed/$MODEL/yoruban.dsqtls.benchmarking.tsv $DART_WORK_DIR/refs/male.hg19.fa
python -m dnalm_bench.task_2_5_single.experiments.task_5_variant_effect_prediction.finetuned_log_counts.$MODEL $DART_WORK_DIR/task_5_variant_effect_prediction/input_data/Afr.CaQTLS.tsv $DART_WORK_DIR/task_5_variant_effect_prediction/outputs/finetuned/$MODEL/Afr.CaQTLS.tsv $DART_WORK_DIR/refs/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta
python -m dnalm_bench.task_2_5_single.experiments.task_5_variant_effect_prediction.finetuned_log_counts.$MODEL $DART_WORK_DIR/task_5_variant_effect_prediction/input_data/yoruban.dsqtls.benchmarking.tsv $DART_WORK_DIR/task_5_variant_effect_prediction/outputs/finetuned/$MODEL/yoruban.dsqtls.benchmarking.tsv $DART_WORK_DIR/refs/male.hg19.fa
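Zero-shot variant scoring contrasts model likelihoods of the reference and alternate alleles. A minimal log-likelihood-ratio sketch of that idea (the function, its inputs, and the sign convention are illustrative, not the benchmark's actual schema):

```python
import math

def variant_scores(ref_likelihoods, alt_likelihoods):
    """Log-likelihood ratio per variant: log P(alt) - log P(ref).

    Positive scores favor the alternate allele; the sign convention
    here is a choice, not necessarily the one used in the notebooks.
    """
    return [
        math.log(alt) - math.log(ref)
        for ref, alt in zip(ref_likelihoods, alt_likelihoods)
    ]

# Toy likelihoods for two variants.
scores = variant_scores([0.20, 0.05], [0.10, 0.10])
```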
Helper functions called in the evaluation notebooks: dnalm_bench/task_2_5_single/experiments/task_5_variant_effect_prediction/variant_tasks.py
Zero-shot evaluation notebook: dnalm_bench/task_2_5_single/experiments/task_5_variant_effect_prediction/Zero_Shot_Final.ipynb
Probed evaluation notebook: dnalm_bench/task_2_5_single/experiments/task_5_variant_effect_prediction/Probed_Final_Counts.ipynb
Fine-tuned evaluation notebook: dnalm_bench/task_2_5_single/experiments/task_5_variant_effect_prediction/Finetuned_Final_Counts.ipynb