Question about Gene Expression Training Preprocessing #161

Open
wkl1990 opened this issue May 5, 2023 · 3 comments

wkl1990 commented May 5, 2023

Hi Dave (@davek44),

I recently read your 2018 Basenji paper, in which you discuss cell-type-specific gene expression. In the paper, you mention making predictions in the 128-bp bin containing each transcription start site (TSS) and, for each gene outside the training set, summing the values across its TSSs to compute accuracy statistics.

I was wondering whether you filtered the bigwig data to remove signal outside the TSSs, or instead filtered the training set to exclude regions outside the TSSs. I'm new to Basenji and would greatly appreciate your help in understanding this part of the preprocessing.

Thank you!

Contributor

davek44 commented May 6, 2023

I'm not sure what you mean by "filter the bigwig data". We train on the whole genome, other than highly repetitive and unmappable regions.

Author

wkl1990 commented May 6, 2023

Hello @davek44, thank you for your response. To clarify, do you mean that the model is trained on the entire genome but makes predictions only at TSS regions? I am also curious how you generated the bigwig files for the expression data. Were they created the same way as for the DNase data, directly from the BAM files? If I use regular RNA-seq data, should I keep only the reads at TSSs when generating the bigwig signal?

Contributor

davek44 commented May 18, 2023

We train on the entire genome, and we make predictions across entire sequences. The model doesn't understand the concept of a TSS. You, the analyst, need to go in afterwards and pull out the predictions at TSSs if that's what you're interested in.

All BigWig files were created using a similar workflow from BAM files.

You cannot use RNA-seq. Only 5' RNA sequencing techniques like CAGE, GRO-seq, or PRO-seq will work.
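
For readers following along, the post-hoc step described above (pulling predictions out at TSSs and summing them per gene) could look roughly like the sketch below. It is only an illustration, not Basenji's own code: it assumes the predictions for one sequence and one target are available as a plain list of per-bin values at 128-bp resolution, and the function names, coordinates, and data layout are all hypothetical. It also assumes that TSSs of the same gene falling in the same bin should be counted once, which may differ from what the paper did.

```python
BIN_SIZE = 128  # bp per prediction bin, as mentioned in the thread


def tss_bin(seq_start, tss_pos, bin_size=BIN_SIZE):
    """Index of the prediction bin containing tss_pos (0-based coordinates)."""
    return (tss_pos - seq_start) // bin_size


def gene_predictions(preds, seq_start, tss_by_gene, bin_size=BIN_SIZE):
    """Sum per-bin predictions over the TSS bins of each gene.

    preds       -- per-bin predicted values for one sequence (hypothetical layout)
    seq_start   -- genomic coordinate of the first base of the sequence
    tss_by_gene -- dict mapping a gene name to a list of its TSS positions
    """
    totals = {}
    for gene, positions in tss_by_gene.items():
        # Collect distinct in-range bins so a bin is counted once per gene
        # (an assumption of this sketch).
        bins = set()
        for pos in positions:
            idx = tss_bin(seq_start, pos, bin_size)
            if 0 <= idx < len(preds):
                bins.add(idx)
        if bins:
            totals[gene] = sum(preds[i] for i in sorted(bins))
    return totals


# Toy example: four bins covering positions 1000-1511.
preds = [0.1, 2.0, 0.3, 4.0]
tss_by_gene = {"geneA": [1130, 1400], "geneB": [1200], "geneC": [5000]}
print(gene_predictions(preds, 1000, tss_by_gene))
```

Here `geneA` has TSSs in bins 1 and 3, `geneB` has one in bin 1, and `geneC` falls outside the sequence and is skipped. The resulting per-gene sums are what one would compare against measured values when computing accuracy statistics.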
