
Annotate big mutation file w/o failure #216

Open

inodb opened this issue Aug 16, 2022 · 4 comments
inodb (Member) commented Aug 16, 2022

We can use this file to test:

https://github.com/cBioPortal/datahub/blob/master/public/difg_glass_2019/data_mutations.txt

ozguzMete (Contributor):
I needed a 20G heap (and was probably about to need a lot more) just to load a ~1.5G file (pog570_bcgsc_2020/data_mutations.txt). We shouldn't touch a single character of a line until we really need it, because simply reading that file with a buffered reader already takes 3-5 seconds.
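As a rough sketch of that lazy approach (the file layout and method names here are illustrative, not the pipeline's actual parser), a buffered stream lets each raw line stay an opaque String until a record is actually selected for work:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class LazyMafReader {
    // Count mutation records without materializing the file in memory.
    // Each raw line is kept as an opaque String; no splitting or parsing
    // happens unless a record is actually needed.
    public static long countDataLines(Path maf) throws IOException {
        try (Stream<String> lines = Files.lines(maf)) {   // lines are streamed lazily
            return lines.filter(l -> !l.startsWith("#"))  // skip comment lines
                        .skip(1)                          // skip the column header row
                        .count();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("maf", ".txt");
        Files.write(tmp, java.util.List.of(
            "#version 2.4",
            "Hugo_Symbol\tChromosome\tStart_Position",
            "TP53\t17\t7577120",
            "BRAF\t7\t140453136"));
        System.out.println(countDataLines(tmp)); // prints 2
        Files.delete(tmp);
    }
}
```

`Files.lines` never holds more than a buffer's worth of the file at once, so heap use stays flat regardless of file size.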

ozguzMete (Contributor) commented Sep 9, 2022

The runtime problem is solved by #227.
The memory problem could be solved by not keeping a single giant Map<String, VariantAnnotation> (gnResponseVariantKeyMap).

ozguzMete (Contributor) commented Sep 9, 2022

@inodb @rmadupuri @sheridancbio

For some reason we build a giant Map<String, VariantAnnotation> (gnResponseVariantKeyMap).
Do we really need it? I have my doubts...

Let me show you what's going on, step by step:

  1. we sort and partition the query data
  2. for each partition we send a POST request
  3. each POST request returns a list of responses, one per OriginalVariantQuery in the partition
  4. we store every item of this list in gnResponseVariantKeyMap
  5. we repeat steps 2-4 until there is no partition left
  6. only then do we use the map to convert each mutation record into an annotated record

These steps imply that there are fewer distinct OriginalVariantQuery keys than genomicLocations, and that, for some reason, the last inserted entry for a key is the one that wins.
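A minimal sketch of the accumulate-everything pattern described above (class and method names are stand-ins for illustration, not the pipeline's real code):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AccumulateAllDemo {
    // Stand-in for the pipeline's annotation response type.
    record VariantAnnotation(String variantKey, String effect) {}

    // Simulated POST: annotate one partition of variant keys.
    static List<VariantAnnotation> postAnnotations(List<String> partition) {
        List<VariantAnnotation> out = new ArrayList<>();
        for (String key : partition) out.add(new VariantAnnotation(key, "missense"));
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> partitions = List.of(
            List.of("17:g.7577120C>T", "7:g.140453136A>T"),
            List.of("12:g.25398284C>A", "17:g.7577120C>T")); // duplicate key

        // Steps 2-5: every response is poured into one map that lives until
        // the whole file has been queried. A duplicate key silently keeps
        // only the last inserted annotation.
        Map<String, VariantAnnotation> gnResponseVariantKeyMap = new HashMap<>();
        for (List<String> partition : partitions)
            for (VariantAnnotation a : postAnnotations(partition))
                gnResponseVariantKeyMap.put(a.variantKey(), a);

        // Step 6: only now is the map consulted, so its peak size is the
        // number of distinct variants in the entire file.
        System.out.println(gnResponseVariantKeyMap.size()); // prints 3
    }
}
```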

It seems to me that this is unnecessary. The steps could instead be:

  1. we sort and partition the query data
  2. for each partition we send a POST request
  3. each POST request returns a list of responses, one per OriginalVariantQuery in the partition
  4. we store the items of this list in a new, partition-local data structure
  5. we use this data structure to convert that partition's mutation records into annotated records
  6. we repeat steps 2-5

Now the garbage collector can clean up each POST response as soon as its partition has been processed.
We could also introduce multithreading (though without fixing the memory issue first, that would be "meh").

If these steps can't be changed, using a slimmed-down version of VariantAnnotation 'might' help.
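The proposed per-partition flow could be sketched like this (again with illustrative stand-in types, assuming each response is consumed immediately after its POST):

```java
import java.util.ArrayList;
import java.util.List;

public class PerPartitionDemo {
    // Illustrative stand-ins, not the pipeline's real classes.
    record VariantAnnotation(String variantKey, String effect) {}
    record AnnotatedRecord(String variantKey, String effect) {}

    // Simulated POST: annotate one partition of variant keys.
    static List<VariantAnnotation> postAnnotations(List<String> partition) {
        List<VariantAnnotation> out = new ArrayList<>();
        for (String key : partition) out.add(new VariantAnnotation(key, "missense"));
        return out;
    }

    // Steps 2-5 of the proposal: annotate one partition, convert its records
    // right away, and let the response become unreachable before the next POST.
    static List<AnnotatedRecord> annotatePartition(List<String> partition) {
        List<AnnotatedRecord> annotated = new ArrayList<>();
        for (VariantAnnotation a : postAnnotations(partition))
            annotated.add(new AnnotatedRecord(a.variantKey(), a.effect()));
        return annotated; // the VariantAnnotation list can now be collected
    }

    public static void main(String[] args) {
        List<List<String>> partitions = List.of(
            List.of("17:g.7577120C>T", "7:g.140453136A>T"),
            List.of("12:g.25398284C>A"));

        int written = 0;
        for (List<String> partition : partitions) {
            // Peak memory is one partition's worth of responses; independent
            // partitions could also be handed to worker threads here.
            written += annotatePartition(partition).size(); // e.g. flush to the output file
        }
        System.out.println(written); // prints 3
    }
}
```

Because each partition's result is written out (or otherwise consumed) before the next request, heap use is bounded by partition size rather than file size.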
