Optimize BAM split generation for cloud stores #169

Open
tomwhite opened this issue Nov 23, 2017 · 3 comments

@tomwhite
Member

Finding BAM split boundaries is currently slow for cloud stores like S3 and GCS. The goal of this issue is to characterize the problem, and implement fixes (e.g. finding splits in parallel on the client).
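
As a rough illustration of "finding splits in parallel on the client", here is a minimal Python sketch. It is not hadoop-bam code. `read_range(offset, length)` is a hypothetical callable standing in for a ranged read against S3 or GCS, and `window` and `max_workers` are made-up parameters. It only scans for the 4-byte gzip header that BGZF blocks start with; a real split guesser has to validate candidates much more carefully, since those bytes can also occur inside compressed data.

```python
from concurrent.futures import ThreadPoolExecutor

# BGZF blocks begin with the gzip header bytes ID1=0x1f, ID2=0x8b, CM=8, FLG=4.
BGZF_MAGIC = b"\x1f\x8b\x08\x04"


def next_block_start(read_range, offset, window=64 * 1024):
    """Return the first offset >= `offset` where the magic appears, or None.

    `read_range(offset, length)` is a hypothetical ranged-read callable.
    """
    chunk = read_range(offset, window)
    idx = chunk.find(BGZF_MAGIC)
    return None if idx < 0 else offset + idx


def find_splits(read_range, file_size, split_size, max_workers=8):
    """Probe every nominal split offset concurrently instead of one by one."""
    starts = range(0, file_size, split_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        found = pool.map(lambda off: next_block_start(read_range, off), starts)
        # Drop misses and duplicates (two probes can land on the same block).
        return sorted({b for b in found if b is not None})
```

Threads are used here because each probe spends most of its time waiting on a remote read, so the round trips overlap rather than add up.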

@tomwhite added this to the 8.0.0 milestone on Nov 23, 2017

@ryan-williams
Contributor

Two important bits of spark-bam that deal with this, fwiw:

  1. computing splits on workers, in parallel (cf. diagrams)
  2. using a block-LRU-caching inputstream/channel abstraction
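
For readers unfamiliar with the second point, a block-LRU-caching reader can be sketched roughly as below. This is a Python illustration under stated assumptions, not spark-bam's actual abstraction: `fetch_block(index)` is a hypothetical callable that returns the bytes of one fixed-size block (for example via a ranged GET), and the class name, `block_size`, and `max_blocks` are invented for the example.

```python
from collections import OrderedDict


class BlockCachedReader:
    """Serve arbitrary (offset, length) reads from an LRU cache of fixed-size blocks."""

    def __init__(self, fetch_block, block_size=1024 * 1024, max_blocks=16):
        self.fetch_block = fetch_block  # hypothetical: fetch_block(index) -> bytes
        self.block_size = block_size
        self.max_blocks = max_blocks
        self.cache = OrderedDict()      # block index -> bytes, oldest first
        self.fetches = 0                # number of remote fetches, for inspection

    def _block(self, index):
        if index in self.cache:
            self.cache.move_to_end(index)       # mark as most recently used
        else:
            self.cache[index] = self.fetch_block(index)
            self.fetches += 1
            if len(self.cache) > self.max_blocks:
                self.cache.popitem(last=False)  # evict least recently used
        return self.cache[index]

    def read(self, offset, length):
        out = bytearray()
        end = offset + length
        while offset < end:
            index, start = divmod(offset, self.block_size)
            piece = self._block(index)[start:start + (end - offset)]
            if not piece:
                break  # ran past the end of the file
            out += piece
            offset += len(piece)
        return bytes(out)
```

The point of the design is that split guessing issues many small, overlapping reads near each candidate position, and with a cache like this they hit memory instead of each becoming a separate request to the store.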

@ryan-williams
Contributor

I guess another thing worth adding here is that I had to guard against unreasonably large memory allocations in BAMRecordCodec. At positions that are not record starts, the first 4 bytes of the candidate BAM record are arbitrary data, but they are still interpreted as a 4-byte int, and an array of that many bytes is allocated.

Without optimizing around that, evaluating hadoop-bam's guessing logic on all positions in a file often slowed to a crawl. This seemed to happen in parts of files where the 4-byte windows tended to decode to large integers, which caused a large, bogus-sized allocation at each checked virtual position and resulted in memory pressure and slowdowns.

Here's some relevant code in a BAMRecordCodec shim that I wrote for this reason.
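
The kind of guard described above can be sketched as follows. This is a Python illustration, not the shim itself. It assumes the BAM record layout in which each record begins with a little-endian int32 `block_size`, followed by at least 32 bytes of fixed-length fields. `MAX_RECORD_BYTES` is an arbitrary cap chosen for the example, and `read_candidate_record` is a hypothetical helper name.

```python
import struct

MAX_RECORD_BYTES = 10 * 1024 * 1024  # assumed upper bound, chosen for illustration
MIN_RECORD_BYTES = 32                # fixed-length fields that follow block_size


def read_candidate_record(buf, pos, max_bytes=MAX_RECORD_BYTES):
    """Return the record body at `pos`, or None if the length field is implausible.

    The length is validated *before* anything of that size is allocated or
    copied, so arbitrary bytes decoded as a huge int are rejected cheaply.
    """
    if pos + 4 > len(buf):
        return None
    (block_size,) = struct.unpack_from("<i", buf, pos)
    if not MIN_RECORD_BYTES <= block_size <= max_bytes:
        return None  # negative, too small, or absurdly large: not a record start
    end = pos + 4 + block_size
    if end > len(buf):
        return None  # claimed record would run past the available data
    return bytes(buf[pos + 4:end])
```

In a JVM codec the equivalent check would sit in front of the byte-array allocation, so a bogus length at a non-record position fails fast instead of triggering a large allocation.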

@tomwhite
Member Author

Thanks for the info @ryan-williams! That's very helpful.
