Question and suggestion for running with large bulk sample set #92

Open

FGOSeal opened this issue Jul 10, 2024 · 1 comment
@FGOSeal

FGOSeal commented Jul 10, 2024

Dear developers:
First of all, thank you for developing this great algorithm; it has really helped my research.
Recently, I have been trying to run BayesPrism with a single-cell reference of 115000 cells × 19500 genes on 17500 bulk samples × 39000 genes, using 16029 protein-coding genes. The program ran on an HPC node with 80 cores and 2 TB RAM through LSF and was killed at about the 2/3 point (about 16 days) of the 'Estimated time to complete' (about 25 days) reported in the 'Run Gibbs sampling' part of stdout. I am not sure whether the HPC was unstable or the memory was insufficient. I tested different bulk sample sizes and watched the information displayed by top; I think the memory usage of BayesPrism can be divided into 4 stages.
Stage-1: Before 'Run Gibbs sampling', when inputting 17500 bulk samples, the max VIRT is about 70 GB and the max RES is about 60 GB.
Stage-2: 'Run Gibbs sampling', when inputting 40/100/17500 bulk samples, there are up to 80 processes, each with VIRT about 14 GB and RES about 3.5 GB.
Stage-3: probably 'Run Gibbs sampling using updated reference', when inputting 40/100 bulk samples, there are up to 80 processes, each with VIRT about 14 GB and RES about 1.5 GB.
Stage-4: after all Stage-3 processes disappeared, there is a single new process. When inputting 40 bulk samples, it used about RES = 48 GB ≈ 1.2 GB × 40. When inputting 100 bulk samples, it used about RES = 120 GB ≈ 1.2 GB × 100.
Runs with 40/100/200 input samples all finished successfully. Now I am running all bulk samples again.
So my first question is: does the Stage-4 RES really scale as about 1.2 GB × N_bulk, so that the run with 17500 samples will definitely fail when it reaches Stage-4? If so, I will stop my current calculation.
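For a rough sanity check of that scaling (this is only an extrapolation from my 40- and 100-sample observations, not a measurement):

```r
# Rough extrapolation, assuming Stage-4 resident memory really is ~1.2 GB per bulk sample
per_sample_gb <- 1.2
n_bulk        <- 17500
per_sample_gb * n_bulk / 1024   # ~20.5 TB, far above the 2 TB available on the node
```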
I tried calculating the first 100 samples and the first 200 samples in two separate runs. The results for the same sample are not exactly the same. Repeating the calculation of the first 100 samples gives exactly the same results, so it must be the normalization across the input bulk samples that makes the difference between the 100-sample run and the 200-sample run.
So my second question is: is it possible for me to split my bulk sample set, run it in several batches (roughly as in the sketch below), and get the same/similar results as inputting all bulk samples in one run? Right now all I can do is subsampling, but it would be good if BayesPrism could better support large sample sets.
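To illustrate what I mean by batching, here is a minimal sketch. It assumes the standard BayesPrism entry points new.prism(), run.prism(), and get.fraction(); the object names sc.dat/bk.dat/cell.type.labels/cell.state.labels, the batch size of 500, and key = NULL are placeholders for my own setup, not a recommendation from the package.

```r
library(BayesPrism)

# bk.dat: bulk count matrix (samples in rows, genes in columns)
# sc.dat: single-cell reference count matrix (cells in rows, genes in columns)
batch.size <- 500
batches    <- split(seq_len(nrow(bk.dat)),
                    ceiling(seq_len(nrow(bk.dat)) / batch.size))

theta.list <- lapply(batches, function(idx) {
  my.prism <- new.prism(reference         = sc.dat,
                        mixture           = bk.dat[idx, , drop = FALSE],
                        input.type        = "count.matrix",
                        cell.type.labels  = cell.type.labels,
                        cell.state.labels = cell.state.labels,
                        key               = NULL)   # no malignant cell type in my reference
  bp.res <- run.prism(prism = my.prism, n.cores = 80)
  get.fraction(bp = bp.res, which.theta = "final", state.or.type = "type")
})

# Combine per-batch cell-type fractions into one matrix over all bulk samples
theta.all <- do.call(rbind, theta.list)
```

My worry is exactly the point above: because each batch is normalized on its own subset of bulk samples, the per-batch fractions may not match a single run over all samples.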
Finally, if Stage-4 RES ≈ 1.2 GB × N_bulk is right, maybe you could add a function that estimates the approximate maximum RAM consumption and warns the user at an early time point.
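Something along these lines would already help; this is a hypothetical helper (not part of BayesPrism), based only on my empirical ~1.2 GB-per-sample observation:

```r
# Hypothetical helper: warn early if the projected final-stage memory would
# exceed the RAM available on the node. The 1.2 GB/sample default is the
# empirical figure from my 40- and 100-sample runs, not a documented constant.
check.stage4.ram <- function(n.bulk, available.gb, gb.per.sample = 1.2) {
  needed.gb <- gb.per.sample * n.bulk
  if (needed.gb > available.gb)
    warning(sprintf("Projected final-stage RAM ~%.0f GB exceeds the %.0f GB available.",
                    needed.gb, available.gb))
  invisible(needed.gb)
}

check.stage4.ram(n.bulk = 17500, available.gb = 2048)
```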

Best regards
Yi-hua Jiang

@tinyi
Collaborator

tinyi commented Jul 18, 2024 via email
