-
Notifications
You must be signed in to change notification settings - Fork 34
Generate contribution score bigwigs
ChromBPNet models have two heads - (1) profile head (to predict the shape of a profile) and (2) counts head (to predict the total counts in a profile). Please use the chrombpnet_nobias.h5
model for this. Here we will provide tools to derive sequence contribution scores for either or both the heads.
chrombpnet contribs_bw [-h] -m MODEL_H5 -r REGIONS -g GENOME -c CHROM_SIZES -op OUTPUT_PREFIX [-pc {counts,profile} [{counts,profile} ...]] [-os OUTPUT_PREFIX_STATS] [-t TQDM]
[-d DEBUG_CHR [DEBUG_CHR ...]]
required arguments:
-m MODEL_H5, --model-h5 MODEL_H5
Path model .h5 file
-r REGIONS, --regions REGIONS
10 column bed file of regions for contribution score predictions
-g GENOME, --genome GENOME
Genome fasta
-c CHROM_SIZES, --chrom-sizes CHROM_SIZES
Chromosome sizes 2 column tab-separated file
-op OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX
Output prefix for bigwig files
optional arguments:
-pc {counts,profile} [{counts,profile} ...], --profile-or-counts {counts,profile} [{counts,profile} ...]
use either counts or profile or both for running shap (defaults to running both)
-os OUTPUT_PREFIX_STATS, --output-prefix-stats OUTPUT_PREFIX_STATS
Output stats on bigwig
-t TQDM, --tqdm TQDM Use tqdm. If yes then you need to have it installed.
-d DEBUG_CHR [DEBUG_CHR ...], --debug-chr DEBUG_CHR [DEBUG_CHR ...]
Run for specific chromosomes only (e.g. chr1 chr2) for debugging
Note: Note that prefix can include a directory path and prefix for the output file. Make sure that the directory in output_prefix exists. Make sure that regions in the input bed file can be expanded to inputlen (default to 2114) regions without overflowing out of the chromosomes. If this condition is not satisfied the program will return with a error or provided unexpected output. From version 0.1.6 onwards this case is handled by the code.
Note: If you are doing large scale runs, keep in mind that the disk usage of
output_prefix
.profile_scores.h5 andoutput_prefix
.counts_scores.h5 can be lot around 3Gb each. One way to compress this is to store onlyshap
field in the .h5 files and delete theprojected_shap
andraw
key fields. This structure is however needed for TFModisco runs, so you can reconstruct them temporarily as follows - theraw
field is simply the one-hot encoding of the regions inoutput_prefix
.interpreted_regions.bedand the
projected_shapis equivalent to
shapmultiplied by the
raw` value. For more information refer to this link - https://github.com/kundajelab/chrombpnet/blob/a101087d90f193e1cc6dba1bad1b5fc9761aa129/chrombpnet/evaluation/interpret/interpret.py#L43.
The following two files are created using the output_prefix
as prefix for the output when both profile_or_counts
includes both profile
and counts
. If only one of the option is used, only one of the files is generated. Note that prefix can include a directory path and prefix for the output file.
-
output_prefix
.profile_scores.h5: A h5 file containing a dictionary with contribution scores for the profile head of the chrombpnet model. The dictionary consists of the following keysraw
,shap
andprojected_shap
. The values of each of these keys are explained below. All the values have the shape(L,4,inputlen)
where L is the number of inputs given inregions
bed file andinputlen
is inferred from the input length of themodel_h5
. This.h5
file can be input tochrombpnet modisco_motifs
command for running TF-Modisco.-
raw
key consists of one hot encoding of a sequence from theregions
bed file. -
shap
key consists of hypothetical contribution scores obtained from doing deepshap. These values will be used in the MODISCO pipeline (for de-novo motif discovery) next. -
projected_shap
key consists of values obtained by multiplyingshap
values withraw
values. These values will be used in building browser tracks.
-
-
output_prefix
.profile_scores.bw: Contribution scores stored in bigwig format. Can be visualized with dynseq track in Washu Genomics browser. -
output_prefix
.counts_scores.h5: A h5 file containing a dictionary with contribution scores for the count head of the chrombpnet model. Has similar structure as theoutput_prefix
.profile_scores.h5. This.h5
file can be input tochrombpnet modisco_motifs
command for running TF-Modisco -
output_prefix
.counts_scores.bw: Contribution scores stored in bigwig format. Can be visualized with dynseq track in Washu Genomics browser. -
output_prefix
.interpreted_regions.bed: A bed file with regions used for interpretations. The script filters regions whose resulting sequences dont have the same length asinput_len
inferred from the model. This bed file might have regions fewer thanregions
provided as input.