Optimize Pipes #164
There's some discussion here about ringbuf being slow compared to this other lock-free ring buffer. Could be an interesting option once I look into things.
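For reference, a minimal sketch of what swapping in the rtrb crate could look like, assuming a single-producer/single-consumer handoff; a real pipe buffer would move whole slices with rtrb's chunk API rather than per-element push/pop, and none of this is the actual RustPOSIX code:

```rust
// Sketch of the rtrb SPSC lock-free ring buffer mentioned above.
// Per-element push/pop is shown only to illustrate the lock-free handoff;
// a real pipe implementation would use the chunk-based API for bulk copies.
use rtrb::RingBuffer;
use std::thread;

fn main() {
    // One producer (the write end of the pipe) and one consumer (the read end).
    let (mut producer, mut consumer) = RingBuffer::<u8>::new(65536);

    let writer = thread::spawn(move || {
        for byte in 0..=255u8 {
            // Spin until there is room; no mutex is taken on either side.
            while producer.push(byte).is_err() {}
        }
    });

    let mut received = 0usize;
    while received < 256 {
        if consumer.pop().is_ok() {
            received += 1;
        }
    }
    writer.join().unwrap();
    println!("moved {} bytes through the lock-free ring", received);
}
```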
Here is where I left off with testing when we first got pipes up in RustPOSIX, so a similar slowdown. I need to break this down more, but I'm inclined to think that the ringbuf mechanism is what needs to be optimized.
I experimented with getting rid of NaCl's VMIOStart and VMIOEnded functions, since they were causing overhead and, from my understanding, seemed unnecessary. This didn't improve performance at the topline, but it did get rid of the performance divergence at increased writes. Now we seem to have a flat ~1.5x overhead.
This is slightly odd. I would have expected Lind to do better as buffer sizes get smaller.
I went back and profiled the initial test case from here, and I also re-made that test with the rtrb crate that I mentioned above (and profiled it): see "Profile for original pipe ringbuf". The results were negligible; they ran at basically the exact same speed. The profiles seem to suggest that the proportion of the write and read sections taken up by memcpy (which is a good estimate of how efficient each is) looks the same as well. Looking at the graph from the original test, the isolated Rust ringbuf was barely faster than native, and that's without all the other cruft that comes from setting up bash etc., so some slowdown should be expected from there. So my question now is whether I can juice one of these implementations for a significant performance gain.
At @moyix's suggestion I modified my write_to_pipe/read_from_pipe programs so that the actual piping is timed in isolation from the other parts of the program (loading, exit, etc.). This is the first time we can actually see that piping is currently faster in Lind. For a write with buffer sizes of 2^16 for 1GB, the piping portion is 34% faster in Lind. It still slows down a bit as buffers get smaller: for 2^8 buffers, Native is 4% faster, and for 2^2, Native is 15% faster. This at least has an explanation, as you can see from the following flamegraphs: we begin to see some lock contention in NaCl while retrieving the Descs on each write.
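A rough sketch of that kind of isolation, assuming a writer that pushes a fixed amount of data through stdout; the buffer size and 1 GB total are the figures from the comment, everything else (stderr reporting, the exact program shape) is illustrative rather than the actual write_to_pipe code:

```rust
// Time only the piping loop, excluding process startup, allocation, and exit.
use std::io::{self, Write};
use std::time::Instant;

fn main() -> io::Result<()> {
    const BUF_SIZE: usize = 1 << 16; // 2^16-byte writes
    const TOTAL: usize = 1 << 30;    // 1 GB of data
    let buf = vec![0u8; BUF_SIZE];
    let stdout = io::stdout();
    let mut out = stdout.lock();

    // Start the clock only once setup is done, so loading costs are excluded.
    let start = Instant::now();
    let mut written = 0usize;
    while written < TOTAL {
        out.write_all(&buf)?;
        written += BUF_SIZE;
    }
    out.flush()?;
    let elapsed = start.elapsed();

    // Report on stderr so the timing doesn't pollute the piped data.
    eprintln!("piped {} bytes in {:?}", written, elapsed);
    Ok(())
}
```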
Some more promising results. I realized that the 1GB of data transfer I've been testing is arbitrary, and with recent results suggesting that startup/shutdown were slowing Lind down vs Native, I decided to just increase the amount of data transferred. Transferring 100 GB with 2^16-byte buffers has Lind clocking in at 16 seconds vs 20 seconds for Native, which is the first real result we've seen where a full Lind run is faster than Native without any timing tricks. Pretty neat! I was also able to make Lind run faster at 2^8 bytes by removing some locks in NaCl that are unnecessary now with how RustPOSIX is set up. OTOH it's still slower at 2^2 bytes because of even more NaCl locking in their DescRef/Unref mechanism, which is kind of a headache.
Bad news/good news. I got more weird timing results, so I delved deeper and finally realized that I had never merged in the monotonic timer for Lind that we had switched to previously. This explains some of the weird results I've had over the past month or so. I think the good news is better than the lost headbanging time: with an actually comparable timer, Lind seems to be faster than Native for full program runs at all buffer sizes (all the way down to 2^2). Great success!
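For anyone hitting the same thing, the distinction that matters here is monotonic vs wall-clock time. A small illustration using the standard library (the `expensive_work` placeholder is hypothetical):

```rust
// Instant is monotonic and suited to measuring elapsed intervals;
// SystemTime follows the wall clock and can jump (NTP adjustments, etc.),
// which skews benchmark comparisons.
use std::time::{Instant, SystemTime};

fn expensive_work() {
    // Stand-in for the pipe benchmark body.
    std::thread::sleep(std::time::Duration::from_millis(10));
}

fn main() {
    let mono_start = Instant::now();
    let wall_start = SystemTime::now();

    expensive_work();

    // Monotonic elapsed time: never affected by clock adjustments.
    println!("monotonic: {:?}", mono_start.elapsed());

    // Wall-clock elapsed time: can be wrong (or error) if the clock stepped.
    match wall_start.elapsed() {
        Ok(d) => println!("wall clock: {:?}", d),
        Err(e) => println!("wall clock went backwards by {:?}", e.duration()),
    }
}
```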
Above I referenced NaCl's VMIOStart and VMIOEnded functions, which accumulated significant overhead in experiments that used smaller buffers (more writes/reads). I was able to speed this up significantly (obviously) by just tossing them, thinking they were unnecessary with how we have RustPOSIX set up. @jesings pointed out that this isn't true. NaCl uses them to track memory regions that are in use for Read/Write so that it can't mmap/mprotect those regions while they're in use (which would break the security model). These functions use an interval tree to keep track of regions. I spent some time successfully getting NaCl to compile with C11 so that we could use stdatomic to solve some of the concurrency issues, but in the case of these VMIO functions it's the actual interval tree that's causing a ton of overhead, not the locks. We've been brainstorming solutions here. The best one I can think of is to toss the interval tree and disallow threads within a cage from entering mmap/mprotect while another thread is doing read/write, using atomic spinlocks to avoid kernel accesses (see the sketch below). Would love to hear some feedback regarding this solution, or any other ideas.
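Roughly what I have in mind, sketched in Rust for readability even though the real change would live in NaCl; all names here are hypothetical, and the idea is just an atomic gate where any number of read/write operations can be in flight but mmap/mprotect must wait until none are (and vice versa):

```rust
// Per-cage gate replacing the interval tree: a counter of in-flight
// reads/writes, with a sentinel value meaning "mmap/mprotect in progress".
use std::sync::atomic::{AtomicIsize, Ordering};

const MMAP_LOCKED: isize = -1;

pub struct VmIoGate {
    // > 0: that many reads/writes in flight; 0: idle; -1: mmap/mprotect running.
    state: AtomicIsize,
}

impl VmIoGate {
    pub const fn new() -> Self {
        Self { state: AtomicIsize::new(0) }
    }

    // Called at the start of a read/write; spins while mmap/mprotect holds the gate.
    pub fn vmio_start(&self) {
        loop {
            let cur = self.state.load(Ordering::Relaxed);
            if cur != MMAP_LOCKED
                && self
                    .state
                    .compare_exchange_weak(cur, cur + 1, Ordering::Acquire, Ordering::Relaxed)
                    .is_ok()
            {
                return;
            }
            std::hint::spin_loop();
        }
    }

    // Called when the read/write finishes.
    pub fn vmio_end(&self) {
        self.state.fetch_sub(1, Ordering::Release);
    }

    // Called before mmap/mprotect; spins until no read/write is in flight.
    pub fn mmap_start(&self) {
        while self
            .state
            .compare_exchange_weak(0, MMAP_LOCKED, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            std::hint::spin_loop();
        }
    }

    pub fn mmap_end(&self) {
        self.state.store(0, Ordering::Release);
    }
}
```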
What information do the nodes of the interval tree need to contain? If it's just a boolean "is this page currently being used for R/W?", it might be faster to use a bitmap, which would only be 128KiB per cage (1 bit per page; each cage has a 4GiB address space, so (4GiB / 4096 bytes) bits are needed). Of course, if it needs to set/clear a big range of pages on every I/O operation this may not be faster...
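For scale, a sketch of that 1-bit-per-page bitmap (4 GiB / 4 KiB pages = 2^20 bits = 128 KiB per cage); the type and method names are illustrative, not project code, and the per-page loop in set/clear is exactly the cost caveat mentioned above:

```rust
const PAGE_SIZE: usize = 4096;
const PAGES_PER_CAGE: usize = 1 << 20;       // 4 GiB address space / 4 KiB pages
const WORDS: usize = PAGES_PER_CAGE / 64;    // packed into u64 words: 128 KiB total

pub struct PageBitmap {
    bits: Vec<u64>,
}

impl PageBitmap {
    pub fn new() -> Self {
        Self { bits: vec![0u64; WORDS] }
    }

    // Mark every page overlapping [addr, addr + len) as in use for R/W.
    pub fn set_range(&mut self, addr: usize, len: usize) {
        for page in (addr / PAGE_SIZE)..((addr + len + PAGE_SIZE - 1) / PAGE_SIZE) {
            self.bits[page / 64] |= 1 << (page % 64);
        }
    }

    pub fn clear_range(&mut self, addr: usize, len: usize) {
        for page in (addr / PAGE_SIZE)..((addr + len + PAGE_SIZE - 1) / PAGE_SIZE) {
            self.bits[page / 64] &= !(1 << (page % 64));
        }
    }

    // mmap/mprotect would check this before touching a region.
    pub fn any_in_use(&self, addr: usize, len: usize) -> bool {
        ((addr / PAGE_SIZE)..((addr + len + PAGE_SIZE - 1) / PAGE_SIZE))
            .any(|page| self.bits[page / 64] & (1 << (page % 64)) != 0)
    }
}
```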
That was something else we discussed, though we weren't sure it made sense to allocate that much memory for this per cage. I'm going to try this solution out.
Alright, well after digging into the docs further, it seems like this is actually Windows-specific, as the Linux mmap would never return pages that actually had data in them. So this was a bit of a false alarm. I'm going to remove these functions as before and leave some notes that this was done.
The work here is done (at least for now). I still need to collect and format data from here, so I'll finish this up over the next day or two and leave this open as a reminder.
Either way, I went back to diagnose why our advantage actually decreases as buffer size decreases (i.e., as the number of writes increases), which is against our hypothesis. It's easy to see from these flamegraphs that all the concurrency primitives we have to use in RustPOSIX add slowdown per operation, which makes runs with more writes slower. We've looked into using Dashmap and Parking Lot for concurrent hashmaps and better mutexes. I think it's worth trying these out, even though it adds more
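For concreteness, roughly what that swap could look like, assuming the `dashmap` and `parking_lot` crates; the types and field names below are illustrative, not the actual RustPOSIX structures:

```rust
// A std Mutex<HashMap> replaced by DashMap for per-key concurrency, and
// std::sync::Mutex replaced by parking_lot::Mutex (no poisoning, cheaper
// uncontended path).
use dashmap::DashMap;
use parking_lot::Mutex;
use std::sync::Arc;

struct PipeEnd {
    bytes_moved: usize,
}

fn main() {
    // Before: Arc<std::sync::Mutex<HashMap<i32, PipeEnd>>>, where every fd
    // lookup serializes on one lock. DashMap shards the map so lookups on
    // different fds rarely contend.
    let fd_table: Arc<DashMap<i32, Arc<Mutex<PipeEnd>>>> = Arc::new(DashMap::new());

    fd_table.insert(3, Arc::new(Mutex::new(PipeEnd { bytes_moved: 0 })));

    if let Some(entry) = fd_table.get(&3) {
        // parking_lot's lock() returns the guard directly; no .unwrap() needed.
        let mut pipe = entry.lock();
        pipe.bytes_moved += 65536;
        println!("fd 3 has moved {} bytes", pipe.bytes_moved);
    }
}
```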
Bash is now able to run pipe scripts. Initial results have us ~1.5x slower than native at higher buffer sizes, though it seems to diverge as buffer sizes get smaller.
I'll dig into optimizing this in this issue.