
Optimize Pipes #164

Closed
rennergade opened this issue Oct 22, 2021 · 16 comments

@rennergade
Contributor

Bash is now able to run pipe scripts. Initial results have us ~1.5x slower than native at higher buffer sizes, though the gap seems to widen as buffer sizes get smaller.

[graph: lindvsnative-1]

I'll dig into optimizing this in this issue.
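For reference, the shape of the test: push a fixed amount of data through a pipe in chunks of a given buffer size and time the transfer. A minimal, self-contained sketch (not the actual harness; it assumes the os_pipe crate for the pipe itself):

```rust
use std::io::{Read, Write};
use std::thread;
use std::time::Instant;

fn main() {
    let total: usize = 1 << 30;      // 1 GiB of data
    let buf_size: usize = 1 << 16;   // vary this: 2^16, 2^8, 2^2, ...

    // os_pipe::pipe() gives an OS-level pipe; in Lind this path would go
    // through RustPOSIX's pipe implementation instead.
    let (mut reader, mut writer) = os_pipe::pipe().expect("pipe failed");

    let start = Instant::now();
    let writer_thread = thread::spawn(move || {
        let chunk = vec![0u8; buf_size];
        let mut written = 0;
        while written < total {
            writer.write_all(&chunk).expect("write failed");
            written += buf_size;
        }
        // writer is dropped here, which closes the write end.
    });

    let mut buf = vec![0u8; buf_size];
    let mut read = 0;
    while read < total {
        let n = reader.read(&mut buf).expect("read failed");
        if n == 0 {
            break;
        }
        read += n;
    }
    writer_thread.join().unwrap();
    println!("{} bytes in {:?}", read, start.elapsed());
}
```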

@rennergade rennergade self-assigned this Oct 22, 2021
@rennergade
Contributor Author

There's some discussion here about ringbuf being slow compared to this other lock-free ring buffer. Could be an interesting option once I look into things:

mgeier/rtrb#39
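For reference, both crates expose the same single-producer/single-consumer split, so swapping one for the other inside the pipe code should be mostly mechanical. A rough sketch (rtrb per its current README; ringbuf per the ~0.2 API we're on, so exact method signatures may differ by version):

```rust
fn main() {
    // ringbuf (~0.2): construct, then split into producer/consumer halves.
    let rb = ringbuf::RingBuffer::<u8>::new(1 << 16);
    let (mut prod, mut cons) = rb.split();
    let _ = prod.push(42); // lock-free push from the writing side
    let _ = cons.pop();    // lock-free pop from the reading side

    // rtrb: new() hands back the producer/consumer pair directly.
    let (mut producer, mut consumer) = rtrb::RingBuffer::<u8>::new(1 << 16);
    producer.push(42).unwrap();
    assert_eq!(consumer.pop(), Ok(42));
}
```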

@rennergade
Contributor Author

Here is where I left off with testing when we first got pipes up in RustPOSIX, which showed a similar slowdown. I need to break this down more, but I'm inclined to think the ringbuf mechanism is what needs to be optimized.

@rennergade
Contributor Author

I experimented with getting rid of NaCl's VMIOStart and VMIOEnded functions, since they were causing overhead and, from my understanding, seemed unnecessary. This didn't improve topline performance, but it did get rid of the performance divergence as the number of writes increased. Now we seem to have a flat ~1.5x overhead.

[graph: LindRustPipe-novm]

@JustinCappos
Member

> Bash is now able to run pipe scripts. Initial results have us ~1.5x slower than native at higher buffer sizes, though the gap seems to widen as buffer sizes get smaller.

This is slightly odd. I would have expected Lind to get better as buffer sizes get smaller.

@rennergade
Contributor Author

I went back and profiled the initial test case from here, and then I also re-made that test with the rtrb crate that I mentioned above (and profiled it).

Profile for original pipe ringbuf

Profile for rtrb ringbuf

The difference was negligible. They ran at basically the exact same speed. The profiles seem to suggest that the proportion of the write and read sections taken up by memcpy (which is a good estimate of how efficient each is) looks the same as well.

Looking at the graph from the original test, the isolated Rust ringbuf was barely faster than native, and that's without all the other cruft that comes from setting up bash, etc. So some slowdown should be expected from there.

So my question now is whether I can juice one of these implementations for a significant performance gain.

@rennergade
Contributor Author

At @moyix's suggestion, I modified my write_to_pipe/read_from_pipe programs so that the actual piping is isolated from the other parts of the program (loading, exit, etc.).

This is the first time we can actually see the piping itself being faster in Lind. For a 1 GB transfer with 2^16-byte buffers, the piping portion is 34% faster in Lind.

It still slows down a bit as buffers get smaller. For 2^8 buffers, Native is 4% faster, and for 2^2, Native is 15% faster. This at least has an explanation: as you can see from the following flamegraphs, we begin to see some lock contention in NaCl while retrieving the Descs on each write.

[flamegraph: 2^8 buffers]
[flamegraph: 2^4 buffers]
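Concretely, the change is just to bracket the timer around the piping loop itself, so loading and exit don't count. A sketch of the idea (illustrative; not the actual write_to_pipe source):

```rust
use std::io::Write;
use std::time::{Duration, Instant};

// Time only the piping portion: setup happens before the clock starts.
fn write_phase<W: Write>(pipe: &mut W, total: usize, buf_size: usize) -> Duration {
    let chunk = vec![0u8; buf_size]; // buffer allocation: not timed
    let start = Instant::now();      // clock brackets only the write loop
    let mut written = 0;
    while written < total {
        pipe.write_all(&chunk).expect("write failed");
        written += buf_size;
    }
    start.elapsed()
}

fn main() {
    // Stand-in sink; in the real test this is the pipe's write end.
    let elapsed = write_phase(&mut std::io::sink(), 1 << 20, 1 << 8);
    println!("piping portion took {:?}", elapsed);
}
```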

@rennergade
Contributor Author

Some more promising results.

I realized that the 1 GB of data transfer I've been testing is arbitrary, and with recent results suggesting that startup/shutdown was what slowed Lind down vs. Native, I decided to just increase the amount of data transferred.

Transferring 100 GB with 2^16-byte buffers has Lind clocking in at 16 seconds vs. 20 seconds for Native, which is the first real result we've seen where a full Lind run is faster than Native without any timing tricks. Pretty neat!

I also was able to make Lind run faster at 2^8 bytes by removing some locks in NaCl that are unnecessary now with how RustPOSIX is set up. OTOH it's still slower at 2^2 bytes because of even more NaCl locking in their DescRef/Unref mechanism, which is kind of a headache.

@rennergade
Contributor Author

Bad news/good news.

I got more weird timing results, so I delved deeper and finally realized that I had never merged in the monotonic timer for Lind that we had switched to previously. This explains some of the weird results I've had over the past month or so.

I think the good news is better than the lost head-banging time. With an actual comparable timer, Lind seems to be faster than Native for full program runs at all buffer sizes (all the way down to 2^2). Great success!

[graph: lindvsnative]
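For anyone hitting the same thing: the distinction is wall-clock vs. monotonic time. A quick sketch of the difference (illustrative only; the actual Lind timer lives on the NaCl/RustPOSIX side):

```rust
use std::thread::sleep;
use std::time::{Duration, Instant, SystemTime};

fn main() {
    let wall = SystemTime::now(); // wall clock: can be adjusted (NTP, etc.) mid-run
    let mono = Instant::now();    // monotonic: always moves forward, use for benchmarks

    sleep(Duration::from_millis(50)); // stand-in for the benchmarked work

    println!("wall-clock delta: {:?}", wall.elapsed()); // can be skewed by clock changes
    println!("monotonic delta:  {:?}", mono.elapsed()); // what the comparisons should use
}
```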

@rennergade
Contributor Author

Above I referenced NaCl's VMIOStart and VMIOEnded functions, which accumulated significant overhead in experiments that used smaller buffers (more writes/reads). I was able to speed this up significantly (obviously) by just tossing them, thinking they were unnecessary with how we have RustPOSIX set up.

@jesings pointed out that this isn't true. NaCl uses this to track memory regions that are in use for Read/Write so that it can't mmap/mprotect those regions while they're in use (which would break the security model). These functions use an interval tree to keep track of the regions.

I spent some time successfully getting NaCl to compile with C11 so that we could use stdatomic to solve some of the concurrency issues, but in the case of these VMIO functions it's the actual interval tree that's causing a ton of overhead, not the locks.

We've been brainstorming solutions here. The best one I can think of is to toss the interval tree and disallow threads within a cage from entering mmap/mprotect while another thread is doing a read/write, using atomic spinlocks so we avoid kernel accesses.
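Roughly, the idea looks like the following (a sketch only; the real thing would live in NaCl's C code, this just shows the shape): read/write bumps an in-flight counter, and mmap/mprotect spins until that counter drains while holding a flag that keeps new I/O from starting.

```rust
use std::hint;
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

// Hypothetical per-cage state for the scheme described above.
struct CageVmState {
    io_in_flight: AtomicUsize, // read/write calls currently touching user memory
    mapping: AtomicBool,       // an mmap/mprotect is in progress
}

impl CageVmState {
    fn begin_io(&self) {
        loop {
            // Wait out any in-progress mmap/mprotect.
            while self.mapping.load(Ordering::Acquire) {
                hint::spin_loop();
            }
            self.io_in_flight.fetch_add(1, Ordering::AcqRel);
            // If a mapper slipped in between the check and the increment,
            // back out and try again.
            if self.mapping.load(Ordering::Acquire) {
                self.io_in_flight.fetch_sub(1, Ordering::AcqRel);
            } else {
                return;
            }
        }
    }

    fn end_io(&self) {
        self.io_in_flight.fetch_sub(1, Ordering::AcqRel);
    }

    fn begin_mapping(&self) {
        // One mapper at a time; then wait for outstanding read/writes to drain.
        while self
            .mapping
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .is_err()
        {
            hint::spin_loop();
        }
        while self.io_in_flight.load(Ordering::Acquire) != 0 {
            hint::spin_loop();
        }
    }

    fn end_mapping(&self) {
        self.mapping.store(false, Ordering::Release);
    }
}

fn main() {
    let state = CageVmState {
        io_in_flight: AtomicUsize::new(0),
        mapping: AtomicBool::new(false),
    };
    state.begin_io(); // around each read/write into user memory
    state.end_io();
    state.begin_mapping(); // around each mmap/mprotect in the cage
    state.end_mapping();
}
```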

Would love to hear some feedback regarding this solution, or any other ideas.

@moyix

moyix commented Dec 19, 2021

What information do the nodes of the interval tree need to contain? If it's just a boolean "is this page currently being used for R/W?" it might be faster to use a bitmap, which would only be 128KiB per cage (1 bit per page; each cage has 4GiB address space so this is (4GiB / 4096 bytes) bits needed).

Of course, if it needs to set/clear a big range of pages on every I/O operation this may not be faster...
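Concretely, something like this (a sketch of the data structure only; the real thing would be C inside NaCl): 2^20 pages per 4 GiB cage, one bit each, so a 128 KiB array, with a set/clear over the page range touched by each read/write.

```rust
const PAGE_SHIFT: usize = 12;                // 4 KiB pages
const PAGES: usize = 1 << (32 - PAGE_SHIFT); // 2^20 pages in a 4 GiB cage
const WORDS: usize = PAGES / 64;             // 16384 u64s = 128 KiB per cage

struct PageBitmap {
    bits: Vec<u64>,
}

impl PageBitmap {
    fn new() -> Self {
        PageBitmap { bits: vec![0u64; WORDS] }
    }

    // Mark every page overlapping [addr, addr + len) as in use for I/O.
    fn set_range(&mut self, addr: usize, len: usize) {
        for page in (addr >> PAGE_SHIFT)..=((addr + len - 1) >> PAGE_SHIFT) {
            self.bits[page / 64] |= 1u64 << (page % 64);
        }
    }

    fn clear_range(&mut self, addr: usize, len: usize) {
        for page in (addr >> PAGE_SHIFT)..=((addr + len - 1) >> PAGE_SHIFT) {
            self.bits[page / 64] &= !(1u64 << (page % 64));
        }
    }

    // mmap/mprotect would refuse (or wait) if any bit in its range is set.
    fn any_in_use(&self, addr: usize, len: usize) -> bool {
        ((addr >> PAGE_SHIFT)..=((addr + len - 1) >> PAGE_SHIFT))
            .any(|page| self.bits[page / 64] & (1u64 << (page % 64)) != 0)
    }
}

fn main() {
    let mut bm = PageBitmap::new();
    bm.set_range(0x0040_0000, 1 << 16); // a 64 KiB write buffer
    assert!(bm.any_in_use(0x0040_0000, 4096));
    bm.clear_range(0x0040_0000, 1 << 16);
    assert!(!bm.any_in_use(0x0040_0000, 4096));
    println!("bitmap size: {} KiB", WORDS * 8 / 1024);
}
```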

@rennergade
Contributor Author

That was something else we discussed, though we weren't sure it made sense to allocate that much memory for this per cage. I'm going to try this solution out.

@rennergade
Contributor Author

Alright, well after digging into the docs further, it seems like this is actually Windows-specific, since on Linux mmap would never return pages that actually had data in them. So this was a bit of a false alarm.

I'm going to remove these functions as before and leave some notes that this was done.

@rennergade
Contributor Author

The work here is done (at least for now). I still need to collect and format data from here, so I'll finish this up over the next day or two and leave this open as a reminder.

@rennergade
Contributor Author

Going back to collect data showed that a more recent RustPOSIX commit (which implemented refcounting for advisory locks) somehow made this slower at smaller buffer sizes, even though that change seems completely unrelated. We think it's a very strange memory-layout thing, but it's proved basically impossible to diagnose.

Either way, I went back to diagnose why our advantage actually decreases as buffer size decreases (i.e., as the number of writes increases), which goes against our hypothesis. It's easy to see from the flamegraphs below that the concurrency primitives we have to use in RustPOSIX add a cost to every call, so more writes means more total slowdown.

We've looked into using DashMap and parking_lot for concurrent hashmaps and better mutexes. I think it's worth trying these out, even though it adds more unsafe code to our codebase. A rough sketch of how they'd slot in follows the flamegraphs below.

[flamegraph: 2^4 buffers]
[flamegraph: 2^8 buffers]
[flamegraph: 2^16 buffers]
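The sketch (hypothetical types, not the actual RustPOSIX structures): the cage's fd table as a DashMap, with per-entry state behind parking_lot locks, so the hot path of each write is a sharded lookup plus a cheap read lock rather than a lock over the whole table.

```rust
use dashmap::DashMap;
use parking_lot::RwLock;
use std::sync::Arc;

// Hypothetical stand-in for whatever per-fd state a pipe end carries.
struct PipeEnd {
    bytes_written: usize,
}

struct FdTable {
    fds: DashMap<i32, Arc<RwLock<PipeEnd>>>,
}

impl FdTable {
    fn new() -> Self {
        FdTable { fds: DashMap::new() }
    }

    fn insert(&self, fd: i32, end: PipeEnd) {
        self.fds.insert(fd, Arc::new(RwLock::new(end)));
    }

    // Hot path on every write(): a sharded map lookup plus a parking_lot
    // read lock, instead of taking a global table mutex.
    fn with_entry<R>(&self, fd: i32, f: impl FnOnce(&PipeEnd) -> R) -> Option<R> {
        let entry = self.fds.get(&fd)?;
        let guard = entry.read();
        Some(f(&*guard))
    }
}

fn main() {
    let table = FdTable::new();
    table.insert(4, PipeEnd { bytes_written: 0 });
    let n = table.with_entry(4, |e| e.bytes_written);
    assert_eq!(n, Some(0));
}
```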

@rennergade
Contributor Author

[graph: dashvsnative]
[graph: dashvsnative-zoom]

The above graphs show the improvement in pipe speed after adding dashmap/parking lot. Seems like a 10-25% improvement over native, which is awesome.

Should be able to generate final data once dashmap is merged, and close this issue.

@rennergade
Contributor Author

rennergade commented Apr 29, 2022

[graph: small-plot]

We get a very similar graph now that DashMap is merged, even with the addition of Vmmap/EFAULT checking in NaCl. The caching seems to work well!

I'll need to spruce up these graphs for the paper, but I think this issue can finally be closed!
