
Annotate big mutation file w/o failure #216

Open

inodb opened this issue Aug 16, 2022 · 4 comments
inodb (Member) commented Aug 16, 2022

We can use this file to test:

https://github.com/cBioPortal/datahub/blob/master/public/difg_glass_2019/data_mutations.txt

ozguzMete (Contributor):
I needed a 20G heap (and was probably about to need a lot more) just to load a ~1.5G file (pog570_bcgsc_2020/data_mutations.txt). We shouldn't touch a single character of a line until we really need it, because simply reading that file with a buffered reader already takes 3-5 seconds.
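As a rough sketch of that lazy approach (the file layout and method names here are illustrative, not the pipeline's actual parser), a buffered stream lets each raw line stay an opaque String until a record is actually selected for work:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class LazyMafReader {
    // Count mutation records without materializing the file in memory.
    // Each raw line is kept as an opaque String; no splitting or parsing
    // happens unless a record is actually needed.
    public static long countDataLines(Path maf) throws IOException {
        try (Stream<String> lines = Files.lines(maf)) {   // lines are streamed lazily
            return lines.filter(l -> !l.startsWith("#"))  // skip comment lines
                        .skip(1)                          // skip the column header row
                        .count();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("maf", ".txt");
        Files.write(tmp, java.util.List.of(
            "#version 2.4",
            "Hugo_Symbol\tChromosome\tStart_Position",
            "TP53\t17\t7577120",
            "BRAF\t7\t140453136"));
        System.out.println(countDataLines(tmp)); // prints 2
        Files.delete(tmp);
    }
}
```

`Files.lines` never holds more than a buffer's worth of the file at once, so heap use stays flat regardless of file size.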

ozguzMete (Contributor) commented Sep 9, 2022

The runtime problem is solved by #227.
The memory problem could be solved by not keeping a single giant Map<String, VariantAnnotation> (gnResponseVariantKeyMap).

ozguzMete (Contributor) commented Sep 9, 2022

@inodb @rmadupuri @sheridancbio

For some reason we build a giant Map<String, VariantAnnotation> (gnResponseVariantKeyMap).
Do we really need it? I have my doubts...

Let me show you what's going on, step by step:

  1. we sort and partition the query data
  2. for each partition we send a POST request
  3. each POST request returns a list of responses, one per OriginalVariantQuery in the partition
  4. we store every item of this list in gnResponseVariantKeyMap
  5. we repeat steps 2-4 until there is no partition left
  6. only then do we use the map to convert each mutation record into an annotated record

These steps imply that there are fewer distinct OriginalVariantQuery keys than genomicLocations, and that, for some reason, the last inserted entry for a key is the one that wins.
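A minimal sketch of the accumulate-everything pattern described above (class and method names are stand-ins for illustration, not the pipeline's real code):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AccumulateAllDemo {
    // Stand-in for the pipeline's annotation response type.
    record VariantAnnotation(String variantKey, String effect) {}

    // Simulated POST: annotate one partition of variant keys.
    static List<VariantAnnotation> postAnnotations(List<String> partition) {
        List<VariantAnnotation> out = new ArrayList<>();
        for (String key : partition) out.add(new VariantAnnotation(key, "missense"));
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> partitions = List.of(
            List.of("17:g.7577120C>T", "7:g.140453136A>T"),
            List.of("12:g.25398284C>A", "17:g.7577120C>T")); // duplicate key

        // Steps 2-5: every response is poured into one map that lives until
        // the whole file has been queried. A duplicate key silently keeps
        // only the last inserted annotation.
        Map<String, VariantAnnotation> gnResponseVariantKeyMap = new HashMap<>();
        for (List<String> partition : partitions)
            for (VariantAnnotation a : postAnnotations(partition))
                gnResponseVariantKeyMap.put(a.variantKey(), a);

        // Step 6: only now is the map consulted, so its peak size is the
        // number of distinct variants in the entire file.
        System.out.println(gnResponseVariantKeyMap.size()); // prints 3
    }
}
```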

It seems to me that this is unnecessary. The steps could instead be:

  1. we sort and partition the query data
  2. for each partition we send a POST request
  3. each POST request returns a list of responses, one per OriginalVariantQuery in the partition
  4. we store the items of this list in a new, partition-local data structure
  5. we use this data structure to convert that partition's mutation records into annotated records
  6. we repeat steps 2-5

Now the garbage collector can clean up each POST response as soon as its partition has been processed.
We could also introduce multithreading (though without fixing the memory issue first, that would be "meh").

If these steps can't be changed, using a slimmed-down version of VariantAnnotation 'might' help.
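The proposed per-partition flow could be sketched like this (again with illustrative stand-in types, assuming each response is consumed immediately after its POST):

```java
import java.util.ArrayList;
import java.util.List;

public class PerPartitionDemo {
    // Illustrative stand-ins, not the pipeline's real classes.
    record VariantAnnotation(String variantKey, String effect) {}
    record AnnotatedRecord(String variantKey, String effect) {}

    // Simulated POST: annotate one partition of variant keys.
    static List<VariantAnnotation> postAnnotations(List<String> partition) {
        List<VariantAnnotation> out = new ArrayList<>();
        for (String key : partition) out.add(new VariantAnnotation(key, "missense"));
        return out;
    }

    // Steps 2-5 of the proposal: annotate one partition, convert its records
    // right away, and let the response become unreachable before the next POST.
    static List<AnnotatedRecord> annotatePartition(List<String> partition) {
        List<AnnotatedRecord> annotated = new ArrayList<>();
        for (VariantAnnotation a : postAnnotations(partition))
            annotated.add(new AnnotatedRecord(a.variantKey(), a.effect()));
        return annotated; // the VariantAnnotation list can now be collected
    }

    public static void main(String[] args) {
        List<List<String>> partitions = List.of(
            List.of("17:g.7577120C>T", "7:g.140453136A>T"),
            List.of("12:g.25398284C>A"));

        int written = 0;
        for (List<String> partition : partitions) {
            // Peak memory is one partition's worth of responses; independent
            // partitions could also be handed to worker threads here.
            written += annotatePartition(partition).size(); // e.g. flush to the output file
        }
        System.out.println(written); // prints 3
    }
}
```

Because each partition's result is written out (or otherwise consumed) before the next request, heap use is bounded by partition size rather than file size.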
