High memory usage for huge files #160

Closed
lcoombe opened this issue Nov 21, 2017 · 21 comments

@lcoombe commented Nov 21, 2017

Hello,

I'm running the following command:
mlr --tsvlite filter '$Depth < 5' preARCS.bed.depth.tsv

I would expect this command to stream through the file, but it appears that the 400GB file being filtered is being read into memory?
From 'top':
0.399t 0.248t 808 R 76.3 10.1 1819:03 mlr --tsvlite filter $Depth < 5 preARCS.bed.depth.tsv

The output file is being written to OK:

Rname   Pos     Depth
1       1       0
1       2       0
1       3       0
1       4       0

The command is also going quite slowly -- it has been running for ~30h now. Any idea why the memory usage is so high?

(cc: @sjackman)

@johnkerl (Owner)

sounds like a memory leak. can you send me at least some of the file contents as a paste perhaps, for a repro?

@sjackman (Contributor)

The contents of the input are the same as the output listed above. Rname, Pos, and Depth are all integers. There are something like 20 billion rows.

@johnkerl (Owner)

@lcoombe valgrind isn't showing me anything on CentOS :(. What platform is this on?

@sjackman (Contributor) commented Nov 21, 2017

I believe (correct me if I'm wrong)

❯❯❯ uname -a
Linux hpce706 3.10.0-229.14.1.el7.x86_64 #1 SMP Tue Sep 15 15:05:51 UTC 2015 x86_64 GNU/Linux
❯❯❯ mlr --version
Miller 5.2.2

@lcoombe (Author) commented Nov 21, 2017

Yup, that's right @sjackman

@johnkerl (Owner)

got a repro; will keep digging

@johnkerl (Owner)

valgrind shows no memory leaks at exit but if I run a large enough file, I can see RSS growth in htop. Which would explain why I haven't seen this in valgrind runs. :^/

@johnkerl (Owner) commented Nov 23, 2017

Short answer: try

mlr --tsvlite filter '$Depth < 5' < preARCS.bed.depth.tsv

or

mlr --no-mmap --tsvlite filter '$Depth < 5' preARCS.bed.depth.tsv

The issue is that Miller uses mmap by default, as it's maybe 10% faster than using stdio. But that maps the whole file, and as Miller walks through it the paged-in data accumulates in resident memory rather than being released. Quite obvious in retrospect. :^/

Besides this being a great FAQ issue, the better fix would be either to (a) disable mmap if input files are over a certain size, or (b) make mmap simply not be the default ever.
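
For anyone curious why an mmap-backed reader behaves this way, here is a minimal sketch of the access pattern described above (not Miller's actual reader code; the file name and the line-counting loop are illustrative only). The whole file is mapped once, and every page touched during the scan is faulted in and stays resident until the kernel chooses to reclaim it:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
    const char *path = argc > 1 ? argv[1] : "input.tsv";
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    // Map the entire file read-only, as an mmap-based reader would.
    char *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    // Scan sequentially. Each page touched here is faulted in and counted
    // against RSS; nothing releases it afterward, so resident memory grows
    // with progress through the file.
    long nlines = 0;
    for (off_t i = 0; i < st.st_size; i++) {
        if (base[i] == '\n') nlines++;
    }
    printf("%ld lines\n", nlines);

    munmap(base, st.st_size);
    close(fd);
    return 0;
}
```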

@sjackman (Contributor) commented Nov 23, 2017

Ah, cool. In that case, the high memory usage may be a red herring, and not the cause of the slowness.
@lcoombe Can you test whether < or --no-mmap is in fact faster than mmap?
@johnkerl You could try posix_madvise POSIX_MADV_SEQUENTIAL to hint that the pages may be freed after they're read.
http://man7.org/linux/man-pages/man3/posix_madvise.3.html
Note that even <foo.tsv could use mmap on the stdin file descriptor, when it's backed by a file rather than a pipe/stream.
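
For anyone wanting to try that, a minimal sketch of applying the hint to an existing mapping might look like this (base and length are assumed to come from the mmap call that maps the input file; this is not Miller's code):

```c
#include <stdio.h>
#include <sys/mman.h>

/* base/length come from the mmap() that maps the input file.
 * POSIX_MADV_SEQUENTIAL is purely advisory: it tells the kernel the
 * region will be read front-to-back, so it may read ahead aggressively
 * and drop pages behind the read pointer sooner. */
static void advise_sequential(void *base, size_t length) {
    int rc = posix_madvise(base, length, POSIX_MADV_SEQUENTIAL);
    if (rc != 0) {
        /* posix_madvise returns the error number directly, not via errno. */
        fprintf(stderr, "posix_madvise: error %d\n", rc);
    }
}
```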

@lcoombe (Author) commented Nov 23, 2017

Looks like using --no-mmap fixed the issue!

I tested this command:
mlr --no-mmap --tsvlite filter '$Depth < 5' preARCS.bed.depth.tsv
The run finished in just under 8 hours and used negligible memory. I killed my original command after 48h.

Thanks @johnkerl !

@sjackman (Contributor)

That's super-interesting to me. At least in theory that shouldn't be the behaviour. The OS should map the file to virtual memory, page it in as it's accessed, detect the sequential access pattern, and page out old pages as they're no longer needed. I can't explain this unexpected behaviour. I'd be curious to learn whether posix_madvise POSIX_MADV_SEQUENTIAL resolves it.

@johnkerl (Owner)

No change in htop usage using any madvise flags on either MacOSX or CentOS. :( Thanks for the idea though @sjackman!

@sjackman (Contributor)

Ah, well. Worth a shot.

> Besides this being a great FAQ issue, the better fix would be either to (a) disable mmap if input files are over a certain size, or (b) make mmap simply not be the default ever.

Either of these would suit me. Making --no-mmap the default and adding a --mmap option would be the easier of the two.

@johnkerl (Owner)

Ahoy, I was too impatient. With various madvise flags I saw memory usage shoot straight up just as before. But if I wait longer ... MADV_DONTNEED lets pages get reclaimed, not right away, but as soon as there starts to be page pressure, which is precisely the right situation.

@sjackman (Contributor) commented Nov 27, 2017

MADV_DONTNEED: Do not expect access in the near future. https://linux.die.net/man/2/madvise
Do you mean advising MADV_DONTNEED after each page has been processed by mlr?
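
To make the question concrete, advising after each processed chunk would look roughly like this (a sketch only; the chunk bookkeeping and names are made up, not Miller's internals). For a read-only, file-backed mapping MADV_DONTNEED is safe in the sense that dropped pages can always be faulted back in from the file:

```c
#include <stddef.h>
#include <sys/mman.h>

/* Call after the reader has finished with everything before 'done' bytes
 * into the mapping. madvise() wants page-aligned addresses, so only the
 * fully-processed pages are released; page_size comes from
 * sysconf(_SC_PAGESIZE). */
static void release_processed(char *base, size_t done, size_t page_size) {
    size_t aligned = done & ~(page_size - 1);
    if (aligned > 0) {
        madvise(base, aligned, MADV_DONTNEED);
    }
}
```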

@johnkerl (Owner)

... same with MADV_SEQUENTIAL.

These are advisory. In cat/filter/put/ etc. (streaming) contexts, pages can be after-freed. In sort/tac/ etc. (non-streaming) contexts, if pages are after-freed during ingest, then when record fields with mmap-backed pointers are later accessed in random order, the pages can be faulted back in.

@johnkerl (Owner)

... also, I'm overnarrating. I should dig a bit more before posting. I ran without madvise flags and also saw RSS dropping off after the initial ramp-up, on my Mac laptop. I need to experiment more thoroughly.

@johnkerl (Owner) commented Nov 29, 2017

OK so: I ran with the as-is code, with madvise and MADV_DONTNEED, and with madvise and MADV_SEQUENTIAL. This was on MacOSX. In all three cases the RSS shot up steadily with progress through the file, then began to come back down in the face of page pressure. But that is false comfort, because in all three cases the Miller executable was nonetheless OOM-killed. (The data file was larger than system memory + swap.)

So. This burns, really, because (a) I should have caught it sooner (it's obvious in retrospect), and (b) I put serious time a couple years ago into supporting mmapped I/O for its performance benefits. If I make mmapped I/O non-default then essentially no one will use it, and Miller will be suddenly slower (not a lot, but it will be a performance regression) as of the next release.

My thought is to use mmap below some file-size threshold and stdio above, where the threshold defaults to something like a few GB but is itself specifiable. This way we get (out of the box) faster-by-default for non-huge files, and non-OOM for huge files -- and detailed control for those who seek it out.
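
A sketch of that selection logic, with the threshold value and names as placeholders rather than a committed interface:

```c
#include <sys/stat.h>

/* Default cutoff: mmap for regular files up to a few GB, stdio above.
 * The actual value would be overridable from the command line. */
#define DEFAULT_MMAP_MAX_BYTES (4LL * 1024 * 1024 * 1024)

/* Returns 1 to read via mmap, 0 to fall back to stdio. Anything that
 * can't be stat'ed or isn't a regular file (pipes, stdin) uses stdio,
 * which always streams.
 * e.g. should_use_mmap(filename, DEFAULT_MMAP_MAX_BYTES) */
static int should_use_mmap(const char *path, long long max_bytes) {
    struct stat st;
    if (stat(path, &st) != 0) return 0;
    if (!S_ISREG(st.st_mode)) return 0;
    return (long long)st.st_size <= max_bytes;
}
```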

@sjackman (Contributor)

This behaviour is so strange. I don't understand it at all. The old pages should be dropped from resident memory, and there's no reason the OOM killer should be invoked.

Your workaround seems reasonable to me.

@johnkerl (Owner)

madvise is advisory, not mandatory, and maybe MacOSX isn't participating in this. Maybe it would work fine on other platforms. But I don't want to get platform-specific in the code ...

@johnkerl (Owner) commented Dec 5, 2017

P.S. I tried a huge file (> RAM+swap) on CentOS and the mmap failed before the madvise was even reached.

@johnkerl johnkerl changed the title Filter: high memory usage High memory usage for huge files Jan 1, 2018
@johnkerl johnkerl removed the active label Sep 2, 2019