
Optimize Pipes #164

Closed
rennergade opened this issue Oct 22, 2021 · 16 comments

@rennergade
Contributor

Bash is now able to run pipe scripts. Initial results have us ~1.5x slower than native at higher buffer sizes, though the gap seems to widen as buffer sizes get smaller.

[graph: lindvsnative-1]

I'll dig into optimizing this in this issue.
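For reference, the shape of the test: push a fixed amount of data through a pipe in chunks of a given buffer size and time the transfer. A minimal, self-contained sketch (not the actual harness; it assumes the os_pipe crate for the pipe itself):

```rust
use std::io::{Read, Write};
use std::thread;
use std::time::Instant;

fn main() {
    let total: usize = 1 << 30;      // 1 GiB of data
    let buf_size: usize = 1 << 16;   // vary this: 2^16, 2^8, 2^2, ...

    // os_pipe::pipe() gives an OS-level pipe; in Lind this path would go
    // through RustPOSIX's pipe implementation instead.
    let (mut reader, mut writer) = os_pipe::pipe().expect("pipe failed");

    let start = Instant::now();
    let writer_thread = thread::spawn(move || {
        let chunk = vec![0u8; buf_size];
        let mut written = 0;
        while written < total {
            writer.write_all(&chunk).expect("write failed");
            written += buf_size;
        }
        // writer is dropped here, which closes the write end.
    });

    let mut buf = vec![0u8; buf_size];
    let mut read = 0;
    while read < total {
        let n = reader.read(&mut buf).expect("read failed");
        if n == 0 {
            break;
        }
        read += n;
    }
    writer_thread.join().unwrap();
    println!("{} bytes in {:?}", read, start.elapsed());
}
```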

@rennergade rennergade self-assigned this Oct 22, 2021
@rennergade
Contributor Author

There's some discussion here about ringbuf being slow compared to this other lock-free ring buffer. Could be an interesting option once I look into things:

mgeier/rtrb#39
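For reference, both crates expose the same single-producer/single-consumer split, so swapping one for the other inside the pipe code should be mostly mechanical. A rough sketch (rtrb per its current README; ringbuf per the ~0.2 API we're on, so exact method signatures may differ by version):

```rust
fn main() {
    // ringbuf (~0.2): construct, then split into producer/consumer halves.
    let rb = ringbuf::RingBuffer::<u8>::new(1 << 16);
    let (mut prod, mut cons) = rb.split();
    let _ = prod.push(42); // lock-free push from the writing side
    let _ = cons.pop();    // lock-free pop from the reading side

    // rtrb: new() hands back the producer/consumer pair directly.
    let (mut producer, mut consumer) = rtrb::RingBuffer::<u8>::new(1 << 16);
    producer.push(42).unwrap();
    assert_eq!(consumer.pop(), Ok(42));
}
```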

@rennergade
Contributor Author

Here is where I left off with testing when we first got pipes up in RustPOSIX, which showed a similar slowdown. I need to break this down more, but I'm inclined to think the ringbuf mechanism is what needs to be optimized.

@rennergade
Contributor Author

I experimented with getting rid of NaCl's VMIOStart and VMIOEnded functions, since they were causing overhead and, from my understanding, seemed unnecessary. This didn't improve topline performance, but it did get rid of the performance divergence as the number of writes increased. Now we seem to have a flat ~1.5x overhead.

[graph: LindRustPipe-novm]

@JustinCappos
Member

> Bash is now able to run pipe scripts. Initial results have us ~1.5x slower than native at higher buffer sizes, though the gap seems to widen as buffer sizes get smaller.

This is slightly odd. I would have expected Lind to get better as buffer sizes get smaller.

@rennergade
Contributor Author

I went back and profiled the initial test case from here, and then I also re-made that test with the rtrb crate that I mentioned above (and profiled it).

Profile for original pipe ringbuf

Profile for rtrb ringbuf

The difference was negligible. They ran at basically the exact same speed. The profiles seem to suggest that the proportion of the write and read sections taken up by memcpy (which is a good estimate of how efficient each is) looks the same as well.

Looking at the graph from the original test, the isolated Rust ringbuf was barely faster than native, and that's without all the other cruft that comes from setting up bash, etc. So some slowdown should be expected from there.

So my question now is whether I can juice one of these implementations for a significant performance gain.

@rennergade
Contributor Author

At @moyix's suggestion, I modified my write_to_pipe/read_from_pipe programs so that the actual piping is isolated from the other parts of the program (loading, exit, etc.).

This is the first time we can actually see the piping itself being faster in Lind. For a 1 GB transfer with 2^16-byte buffers, the piping portion is 34% faster in Lind.

It still slows down a bit as buffers get smaller. For 2^8 buffers, Native is 4% faster, and for 2^2, Native is 15% faster. This at least has an explanation: as you can see from the following flamegraphs, we begin to see some lock contention in NaCl while retrieving the Descs on each write.

[flamegraph: 2^8 buffers]
[flamegraph: 2^4 buffers]
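Concretely, the change is just to bracket the timer around the piping loop itself, so loading and exit don't count. A sketch of the idea (illustrative; not the actual write_to_pipe source):

```rust
use std::io::Write;
use std::time::{Duration, Instant};

// Time only the piping portion: setup happens before the clock starts.
fn write_phase<W: Write>(pipe: &mut W, total: usize, buf_size: usize) -> Duration {
    let chunk = vec![0u8; buf_size]; // buffer allocation: not timed
    let start = Instant::now();      // clock brackets only the write loop
    let mut written = 0;
    while written < total {
        pipe.write_all(&chunk).expect("write failed");
        written += buf_size;
    }
    start.elapsed()
}

fn main() {
    // Stand-in sink; in the real test this is the pipe's write end.
    let elapsed = write_phase(&mut std::io::sink(), 1 << 20, 1 << 8);
    println!("piping portion took {:?}", elapsed);
}
```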

@rennergade
Contributor Author

Some more promising results.

I realized that the 1 GB of data transfer I've been testing is arbitrary, and with recent results suggesting that startup/shutdown was what slowed Lind down vs. Native, I decided to just increase the amount of data transferred.

Transferring 100 GB with 2^16-byte buffers has Lind clocking in at 16 seconds vs. 20 seconds for Native, which is the first real result we've seen where a full Lind run is faster than Native without any timing tricks. Pretty neat!

I also was able to make Lind run faster at 2^8 bytes by removing some locks in NaCl that are unnecessary now with how RustPOSIX is set up. OTOH it's still slower at 2^2 bytes because of even more NaCl locking in their DescRef/Unref mechanism, which is kind of a headache.

@rennergade
Contributor Author

Bad news/good news.

I got more weird timing results, so I delved deeper and finally realized that I had never merged in the monotonic timer for Lind that we had switched to previously. This explains some of the weird results I've had over the past month or so.

I think the good news is better than the lost head-banging time. With an actual comparable timer, Lind seems to be faster than Native for full program runs at all buffer sizes (all the way down to 2^2). Great success!

[graph: lindvsnative]
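For anyone hitting the same thing: the distinction is wall-clock vs. monotonic time. A quick sketch of the difference (illustrative only; the actual Lind timer lives on the NaCl/RustPOSIX side):

```rust
use std::thread::sleep;
use std::time::{Duration, Instant, SystemTime};

fn main() {
    let wall = SystemTime::now(); // wall clock: can be adjusted (NTP, etc.) mid-run
    let mono = Instant::now();    // monotonic: always moves forward, use for benchmarks

    sleep(Duration::from_millis(50)); // stand-in for the benchmarked work

    println!("wall-clock delta: {:?}", wall.elapsed()); // can be skewed by clock changes
    println!("monotonic delta:  {:?}", mono.elapsed()); // what the comparisons should use
}
```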

@rennergade
Contributor Author

Above I referenced NaCl's VMIOStart and VMIOEnded functions, which accumulated significant overhead in experiments that used smaller buffers (more writes/reads). I was able to speed this up significantly (obviously) by just tossing them, thinking they were unnecessary with how we have RustPOSIX set up.

@jesings pointed out that this isn't true. NaCl uses this to track memory regions that are in use for Read/Write so that it can't mmap/mprotect those regions while they're in use (which would break the security model). These functions use an interval tree to keep track of the regions.

I spent some time successfully getting NaCl to compile with C11 so that we could use stdatomic to solve some of the concurrency issues, but in the case of these VMIO functions it's the actual interval tree that's causing a ton of overhead, not the locks.

We've been brainstorming solutions here. The best one I can think of is to toss the interval tree and disallow threads within a cage from entering mmap/mprotect while another thread is doing a read/write, using atomic spinlocks so we avoid kernel accesses.
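Roughly, the idea looks like the following (a sketch only; the real thing would live in NaCl's C code, this just shows the shape): read/write bumps an in-flight counter, and mmap/mprotect spins until that counter drains while holding a flag that keeps new I/O from starting.

```rust
use std::hint;
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

// Hypothetical per-cage state for the scheme described above.
struct CageVmState {
    io_in_flight: AtomicUsize, // read/write calls currently touching user memory
    mapping: AtomicBool,       // an mmap/mprotect is in progress
}

impl CageVmState {
    fn begin_io(&self) {
        loop {
            // Wait out any in-progress mmap/mprotect.
            while self.mapping.load(Ordering::Acquire) {
                hint::spin_loop();
            }
            self.io_in_flight.fetch_add(1, Ordering::AcqRel);
            // If a mapper slipped in between the check and the increment,
            // back out and try again.
            if self.mapping.load(Ordering::Acquire) {
                self.io_in_flight.fetch_sub(1, Ordering::AcqRel);
            } else {
                return;
            }
        }
    }

    fn end_io(&self) {
        self.io_in_flight.fetch_sub(1, Ordering::AcqRel);
    }

    fn begin_mapping(&self) {
        // One mapper at a time; then wait for outstanding read/writes to drain.
        while self
            .mapping
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .is_err()
        {
            hint::spin_loop();
        }
        while self.io_in_flight.load(Ordering::Acquire) != 0 {
            hint::spin_loop();
        }
    }

    fn end_mapping(&self) {
        self.mapping.store(false, Ordering::Release);
    }
}

fn main() {
    let state = CageVmState {
        io_in_flight: AtomicUsize::new(0),
        mapping: AtomicBool::new(false),
    };
    state.begin_io(); // around each read/write into user memory
    state.end_io();
    state.begin_mapping(); // around each mmap/mprotect in the cage
    state.end_mapping();
}
```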

Would love to hear some feedback regarding this solution, or any other ideas.

@moyix

moyix commented Dec 19, 2021

What information do the nodes of the interval tree need to contain? If it's just a boolean "is this page currently being used for R/W?" it might be faster to use a bitmap, which would only be 128KiB per cage (1 bit per page; each cage has 4GiB address space so this is (4GiB / 4096 bytes) bits needed).

Of course, if it needs to set/clear a big range of pages on every I/O operation this may not be faster...
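Concretely, something like this (a sketch of the data structure only; the real thing would be C inside NaCl): 2^20 pages per 4 GiB cage, one bit each, so a 128 KiB array, with a set/clear over the page range touched by each read/write.

```rust
const PAGE_SHIFT: usize = 12;                // 4 KiB pages
const PAGES: usize = 1 << (32 - PAGE_SHIFT); // 2^20 pages in a 4 GiB cage
const WORDS: usize = PAGES / 64;             // 16384 u64s = 128 KiB per cage

struct PageBitmap {
    bits: Vec<u64>,
}

impl PageBitmap {
    fn new() -> Self {
        PageBitmap { bits: vec![0u64; WORDS] }
    }

    // Mark every page overlapping [addr, addr + len) as in use for I/O.
    fn set_range(&mut self, addr: usize, len: usize) {
        for page in (addr >> PAGE_SHIFT)..=((addr + len - 1) >> PAGE_SHIFT) {
            self.bits[page / 64] |= 1u64 << (page % 64);
        }
    }

    fn clear_range(&mut self, addr: usize, len: usize) {
        for page in (addr >> PAGE_SHIFT)..=((addr + len - 1) >> PAGE_SHIFT) {
            self.bits[page / 64] &= !(1u64 << (page % 64));
        }
    }

    // mmap/mprotect would refuse (or wait) if any bit in its range is set.
    fn any_in_use(&self, addr: usize, len: usize) -> bool {
        ((addr >> PAGE_SHIFT)..=((addr + len - 1) >> PAGE_SHIFT))
            .any(|page| self.bits[page / 64] & (1u64 << (page % 64)) != 0)
    }
}

fn main() {
    let mut bm = PageBitmap::new();
    bm.set_range(0x0040_0000, 1 << 16); // a 64 KiB write buffer
    assert!(bm.any_in_use(0x0040_0000, 4096));
    bm.clear_range(0x0040_0000, 1 << 16);
    assert!(!bm.any_in_use(0x0040_0000, 4096));
    println!("bitmap size: {} KiB", WORDS * 8 / 1024);
}
```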

@rennergade
Contributor Author

That was something else we discussed, though we weren't sure it made sense to allocate that much memory for this per cage. I'm going to try this solution out.

@rennergade
Contributor Author

Alright, well after digging into the docs further, it seems like this is actually Windows-specific, since on Linux mmap would never return pages that actually had data in them. So this was a bit of a false alarm.

I'm going to remove these functions as before and leave some notes that this was done.

@rennergade
Contributor Author

The work here is done (at least for now). I still need to collect and format data from here, so I'll finish this up over the next day or two and leave this open as a reminder.

@rennergade
Contributor Author

Going back to collect data showed that a more recent RustPOSIX commit (which implemented refcounting for advisory locks) somehow made this slower at smaller buffer sizes, even though that change seems completely unrelated. We think it's a very strange memory-layout thing, but it's proved basically impossible to diagnose.

Either way, I went back to diagnose why our advantage actually decreases as buffer size decreases (i.e., as the number of writes increases), which goes against our hypothesis. It's easy to see from the flamegraphs below that the concurrency primitives we have to use in RustPOSIX add a cost to every call, so more writes means more total slowdown.

We've looked into using DashMap and parking_lot for concurrent hashmaps and better mutexes. I think it's worth trying these out, even though it adds more unsafe code to our codebase. A rough sketch of how they'd slot in follows the flamegraphs below.

[flamegraph: 2^4 buffers]
[flamegraph: 2^8 buffers]
[flamegraph: 2^16 buffers]
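The sketch (hypothetical types, not the actual RustPOSIX structures): the cage's fd table as a DashMap, with per-entry state behind parking_lot locks, so the hot path of each write is a sharded lookup plus a cheap read lock rather than a lock over the whole table.

```rust
use dashmap::DashMap;
use parking_lot::RwLock;
use std::sync::Arc;

// Hypothetical stand-in for whatever per-fd state a pipe end carries.
struct PipeEnd {
    bytes_written: usize,
}

struct FdTable {
    fds: DashMap<i32, Arc<RwLock<PipeEnd>>>,
}

impl FdTable {
    fn new() -> Self {
        FdTable { fds: DashMap::new() }
    }

    fn insert(&self, fd: i32, end: PipeEnd) {
        self.fds.insert(fd, Arc::new(RwLock::new(end)));
    }

    // Hot path on every write(): a sharded map lookup plus a parking_lot
    // read lock, instead of taking a global table mutex.
    fn with_entry<R>(&self, fd: i32, f: impl FnOnce(&PipeEnd) -> R) -> Option<R> {
        let entry = self.fds.get(&fd)?;
        let guard = entry.read();
        Some(f(&*guard))
    }
}

fn main() {
    let table = FdTable::new();
    table.insert(4, PipeEnd { bytes_written: 0 });
    let n = table.with_entry(4, |e| e.bytes_written);
    assert_eq!(n, Some(0));
}
```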

@rennergade
Contributor Author

[graph: dashvsnative]
[graph: dashvsnative-zoom]

The above graphs show the improvement in pipe speed after adding dashmap/parking lot. Seems like a 10-25% improvement over native, which is awesome.

Should be able to generate final data once dashmap is merged, and close this issue.

@rennergade
Contributor Author

rennergade commented Apr 29, 2022

[graph: small-plot]

We get a very similar graph now that DashMap is merged, even with the addition of Vmmap/EFAULT checking in NaCl. The caching seems to work well!

I'll need to spruce up these graphs for the paper, but I think this issue can finally be closed!
