High memory usage for huge files #160
Comments
Sounds like a memory leak. Can you send me at least some of the file contents as a paste, perhaps, for a repro?
The contents of the input are the same as the output listed above. Rname, Pos, and Depth are all integers. There are around 20 billion rows.
@Icoombe valgrind is not showing me anything on CentOS :(. What platform is this on?
I believe (correct me if I'm wrong) [...]
Yup, that's right, @sjackman.
Got a repro; will keep digging.
Valgrind shows no memory leaks at exit, but if I run a large enough file, I can see RSS growth in htop, which would explain why I haven't seen this in valgrind runs. :^/
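For context on why valgrind stays quiet while RSS climbs: resident pages of a memory-mapped file count toward RSS even though nothing is heap-allocated, so there is no leak for a leak checker to report. A minimal standalone sketch (not Miller's code) that reproduces the effect on any large file:

```c
/* Map a file and touch one byte per page, front to back. RSS grows as pages
 * become resident even though no heap memory is allocated, which is why a
 * leak checker such as valgrind has nothing to report. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char** argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s {file}\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    char* p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Watch RSS climb in top/htop while this loop runs. */
    unsigned long sum = 0;
    long pagesize = sysconf(_SC_PAGESIZE);
    for (off_t i = 0; i < st.st_size; i += pagesize) {
        sum += (unsigned char)p[i];
    }
    printf("checksum %lu\n", sum);

    munmap(p, (size_t)st.st_size);
    close(fd);
    return 0;
}
```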
Short answer: try [...] or [...].

The issue is that Miller uses mmap for file input. Besides this being a great FAQ issue, the better fix would be either to (a) disable mmap if input files are over a certain size, or (b) make mmap simply not be the default ever.
Ah, cool. In that case, the high memory usage may be a red herring, and not the cause of the slowness.
Looks like using [...] works. I tested this command: [...] Thanks @johnkerl!
That's super-interesting to me. At least in theory that shouldn't be the behaviour. The OS should map the file to virtual memory, page it in as it's accessed, detect the sequential access pattern, and page out pages as they're no longer needed. I can't explain this unexpected behaviour. I'd be curious to learn whether [...]
No change in [...]
Ah, well. Worth a shot.
Either of these would suit me. Making [...]
Ahoy, I was too impatient. With various [...]
... same with [...]. These are advisory. In [...]
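A minimal sketch of the kind of advisory hint under discussion here, assuming the elided calls are madvise(2) with flags such as MADV_SEQUENTIAL and MADV_DONTNEED; the helper names are hypothetical and this is not Miller's actual code:

```c
/* Sketch only: advisory paging hints on an mmapped input region. The kernel
 * is free to ignore them, which is consistent with "these are advisory". */
#define _DEFAULT_SOURCE /* exposes madvise() in glibc */
#include <sys/mman.h>

/* Addresses passed to madvise must be page-aligned; the base address
 * returned by mmap always is. */

static void advise_sequential_read(void* base, size_t length) {
    /* Hint that the region will be read front to back, so the kernel may
     * read ahead aggressively and evict already-consumed pages sooner. */
    (void)madvise(base, length, MADV_SEQUENTIAL); /* advisory; failure is non-fatal */
}

static void drop_consumed_pages(void* chunk, size_t chunk_length) {
    /* Hint that these pages will not be touched again. For a read-only,
     * file-backed mapping the kernel can release the resident pages; a later
     * access would simply fault them back in from the file. */
    (void)madvise(chunk, chunk_length, MADV_DONTNEED); /* also advisory */
}
```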
... also, I'm overnarrating. I should dig a bit more before posting. I ran without [...]
OK, so: I ran with the as-is code, with [...]. So.

This burns, really, because (a) I should have caught it sooner (it's obvious in retrospect), and (b) I put serious time a couple of years ago into supporting mmapped I/O for its performance benefits. If I make mmapped I/O non-default then essentially no one will use it, and Miller will suddenly be slower (not by a lot, but it will be a performance regression) as of the next release.

My thought is to use mmap below some file-size threshold and stdio above it, where the threshold defaults to something like a few GB but is itself specifiable. This way we get, out of the box, faster-by-default behaviour for non-huge files and no OOM for huge files, plus detailed control for those who seek it out.
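A minimal sketch of that threshold idea, assuming a plain stat(2) size check; the constant, option plumbing, and function name below are hypothetical, not Miller's actual implementation:

```c
/* Sketch only: choose the file reader by size, with an overridable threshold. */
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical default: mmap files up to ~4 GB, use stdio above that.
 * In the proposal this would be settable from the command line. */
#define DEFAULT_MMAP_THRESHOLD_BYTES (4LL * 1024 * 1024 * 1024)

static bool use_mmap_for_file(const char* path, long long threshold_bytes) {
    struct stat st;
    if (stat(path, &st) != 0) {
        return false; /* can't stat: fall back to stdio */
    }
    if (!S_ISREG(st.st_mode)) {
        return false; /* pipes, FIFOs, etc. can't be mmapped sensibly */
    }
    return (long long)st.st_size <= threshold_bytes;
}
```

A side effect of gating on the regular-file check is that stdin and pipes, which cannot be mmapped anyway, fall through to the stdio reader automatically.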
This behaviour is so strange. I don't understand it at all. The old pages should be dropped from resident memory, and there's no reason the OOM killer should be invoked. Your workaround seems reasonable to me.
P.S. I tried a huge file (> RAM+swap) on CentOS and the [...]
Hello,
I'm running the following command:
mlr --tsvlite filter '$Depth < 5' preARCS.bed.depth.tsv
I would expect this command to stream through the file, but it appears that the 400 GB file being filtered is being read into memory?
From 'top':
  VIRT    RES   SHR S  %CPU %MEM    TIME+  COMMAND
0.399t 0.248t   808 R  76.3 10.1  1819:03  mlr --tsvlite filter $Depth < 5 preARCS.bed.depth.tsv
The output file is being written OK: [...]
The command is also going quite slowly -- it has been running for ~30h now. Any idea why the memory usage is so high?
(cc: @sjackman)