Ring Buffer Between Two Threads
For generating the output, the data flow is unidirectional and converging, i.e. we are always passing data from one computation to the next without any feedback, and each computation has at most one consumer. Because of this, we only need synchronization between two threads at a time and only need to pass data from one to the next.
On this page, we'll test which strategy does this best. In general, we want to make sure the workers are always doing useful work (or, if there is an imbalance in the workload, that the one with more work is never idling). This means that,
- The threads are not spending time waiting for one another.
  There should be enough buffer between each worker so that time jitter in one worker does not directly affect another.
- The threads are not fighting with each other.
  We need to minimize the bus traffic and cache-line ping-ponging caused by the synchronization. These could cause one thread to spend a lot of time on synchronization even though both sides are actually ready to go.
More concretely, here is a list of issues/options to investigate.
- Whether a worker is memory bound or compute bound.
  Note that here, "computation" includes all the work that is not used for synchronizing with the other thread. It might include real computation, I/O, or synchronization with a third thread. What matters is whether the thread is saturating the bandwidth of a memory channel.
  Based on what we measured with a single thread, I would assume that we are most likely not going to be memory bound unless the buffer is in the main memory. OTOH, this might change if a thread is consuming the output from more than one thread.
- The balance of the workload on the producer and consumer side.
  Of course we want the workload to be as balanced as possible but it'll never be perfect. It'll be interesting to see how the workload balance affects the performance. Due to the symmetry of the ring buffer, I would expect the cases where the producer or the consumer is the faster one to be very similar to each other.
- Buffer size
  A larger buffer means a lower probability of conflict. OTOH, we would most likely still want the buffer to be in the L3 cache to maximize the memory bandwidth. (Do note that if we are compute bound overall we might get away with always hitting the main memory. This might be the case since even at a 1.25 GS/s sampling rate we only need 1.25 GS/s × 4 B = 5 GB/s of float32, which is within the bandwidth limit.)
- Frequency and granularity of buffer synchronization
  For a ring buffer, each side owns a section of the buffer and passes it to the other side when it finishes working on it (whether that's reading or writing). Two related questions are how often this synchronization should be done and what the basic unit of data passed at one time should be.
  More frequent synchronization minimizes the wait time on the other thread but also causes more slowdown on the current thread. Similarly, a finer granularity allows a more flexible synchronization strategy but also means more frequent synchronization. OTOH, with a very coarse granularity, say if the whole buffer is split into a few blocks that are synchronized one at a time, we could simplify the synchronization logic and potentially even use different flags in memory to control the ownership of the blocks to reduce contention.
- Back-off strategy
  As in a spin lock, when one thread fails to acquire a new section of memory in the ring buffer, it needs to back off from checking again to reduce bus contention and to allow the other thread to release the buffer. We may need to implement an exponential back-off strategy using `pause` on x86. This may not be necessary on ARM when using the `wfe` and `sev` pair.
- Cache hint / manipulation
  Before handing off memory from one thread to another, we in principle want to make sure that it is not in a modified state in our private cache. This is ideally done with a "push for sharing" instruction, which will be coming in the next generation of Intel server CPUs but is unfortunately not available at the moment. The next best option may be the `clwb` instruction, which should be available on Skylake-X (i.e. "Skylake (server)") as well as Zen 2. However, it is a mandatory control instead of a hint, so it may actually have a negative impact on the performance. On ARM, `dc cvac` (AArch64) or `dccmvac` (AArch32) should have a similar function to `clwb` on x86. (A sketch of the write-back approach is shown after this list.)
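As an illustration of the write-back approach mentioned in the last item, here is a minimal sketch of how a producer could use the `_mm_clwb` intrinsic on a finished block before handing it off. The helper name and block layout are hypothetical, and, as the measurements further below show, this kind of forced write-back actually hurt performance in practice.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: write back (without invalidating) every cache line of a
// block the producer just finished, so the handed-off lines are clean in cache.
// Requires CLWB support (Skylake-X / Zen 2 or newer) and `-mclwb` when compiling.
inline void writeback_block(const void *block, size_t size)
{
    constexpr uintptr_t line = 64;
    uintptr_t p = reinterpret_cast<uintptr_t>(block) & ~(line - 1);
    const uintptr_t end = reinterpret_cast<uintptr_t>(block) + size;
    for (; p < end; p += line)
        _mm_clwb(reinterpret_cast<const void*>(p));
    // CLWB is only ordered by store-fencing instructions, so fence before
    // publishing the block (e.g. via a release store to an ownership flag).
    _mm_sfence();
}
```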
Note that we'll only test two threads here so the communication channel between them can use the full memory bandwidth available to a core. This will likely change when more threads are added.
First, we'll test the case where there's no computation on either the consumer or the producer thread. This should tell us the maximum memory bandwidth available to us. We'll also test the case where no actual memory is accessed (i.e. simply passing the synchronization counter back and forth) to see the maximum stress we can put on the synchronization itself.
For the back-off strategy, we use a loop of `pause` instructions on x86. As we'll see in the tests below, for most reasonable buffer sizes the synchronization isn't the most time-consuming part, so we saw minimal differences between back-off strategies. We currently use a starting point of 8 pauses, with each iteration increasing the count by 1/4.

I tested `clwb`, `clflush` and `clflushopt` on the machines that support them. All seem to yield much worse performance and much higher cache miss counts than without. I guess we'll basically have to wait for `cldemote` to have any useful cache hint on the producer thread.

All the tests run with the producer thread pinned to core 0 and the consumer thread pinned to core 1. I think this does not matter on current Intel CPUs since the core-to-core delays are fairly constant. AMD (and likely ARM) CPUs may have clusters within the package/die, and one may need to be more careful when connecting threads together.
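For reference, a minimal sketch of this back-off loop (the `Backoff` type and the `try_acquire()` call in the usage comment are illustrative names; the actual code is in the test file linked further below):

```cpp
#include <immintrin.h>
#include <cstdint>

// Sketch of the exponential back-off used when the other side of the ring
// buffer isn't ready: start at 8 pauses and grow the count by 1/4 per retry.
struct Backoff {
    uint32_t npause = 8;
    void wait()
    {
        for (uint32_t i = 0; i < npause; i++)
            _mm_pause();
        npause += npause / 4; // increase by a factor of 1.25 for the next retry
    }
    void reset() { npause = 8; }
};

// Usage sketch (try_acquire() stands in for whatever check the ring provides):
//     Backoff backoff;
//     while (!ring.try_acquire())
//         backoff.wait();
//     backoff.reset();
```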
We use two different implementations of the ring buffer. The first is `BlockRing`, which only allows passing "blocks" of memory at once; the ownership of each block is recorded as a separate flag (aligned to 64 bytes so that each flag is in a different cache line to reduce false sharing). The second is `DataPipe`, which allows the producer/consumer to transfer an arbitrary number of elements at a time. The ownership of the memory is recorded by two read/write pointers, with optimizations to minimize conflicting access.

In addition to the normal tests that use the buffer to do memory transfer (with the producer writing to the buffer and the consumer reading from it), we also run a test at each condition without any memory read/write on the buffer. This allows us to estimate the overhead due to the synchronization or due to the two threads fighting each other.
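To make the `BlockRing` scheme concrete, here is a stripped-down sketch with one cache-line-aligned ownership flag per block. It is an illustration of the idea described above, not the actual `BlockRing` class (needs C++17 for the over-aligned vector elements).

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// One ownership flag per block, padded to its own cache line to avoid
// false sharing between the producer and the consumer.
struct alignas(64) BlockFlag {
    std::atomic<bool> owned_by_consumer{false};
};

// Minimal block ring sketch: the producer fills a block, releases it to the
// consumer, and both sides walk the ring in order.
struct BlockRingSketch {
    std::vector<float> buffer;    // nblocks * block_size elements
    std::vector<BlockFlag> flags; // one flag per block
    size_t block_size;

    BlockRingSketch(size_t nblocks, size_t blksz)
        : buffer(nblocks * blksz), flags(nblocks), block_size(blksz) {}

    // Producer: returns the block pointer if the block is free, nullptr otherwise.
    float *try_get_write_block(size_t blkidx)
    {
        if (flags[blkidx].owned_by_consumer.load(std::memory_order_acquire))
            return nullptr; // consumer hasn't finished with this block yet
        return &buffer[blkidx * block_size];
    }
    void publish_block(size_t blkidx)
    {
        flags[blkidx].owned_by_consumer.store(true, std::memory_order_release);
    }

    // Consumer side mirrors the producer with the flag sense inverted.
    const float *try_get_read_block(size_t blkidx)
    {
        if (!flags[blkidx].owned_by_consumer.load(std::memory_order_acquire))
            return nullptr; // producer hasn't filled this block yet
        return &buffer[blkidx * block_size];
    }
    void release_block(size_t blkidx)
    {
        flags[blkidx].owned_by_consumer.store(false, std::memory_order_release);
    }
};
```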
We vary the granularity of the data transfer by changing the number of blocks. For `BlockRing`, this is simply the number of blocks in the ring. For `DataPipe`, the effective block size is the maximum allowed transfer size, and due to an implementation detail, the producer must remain at least one element behind the consumer. Because of this, the `BlockRing` transfer will have exactly one transfer per block (shown as R/W per Block in the plots below) whereas this will be equal to or greater than one for the `DataPipe` implementation, which mostly happens when the producer catches up to the consumer. Other terminology we'll use in the data below includes:
- Pipe Stall
  When a read/write attempt does not immediately return enough memory.
- Sync
  Attempts made during a pipe stall (including the first one) to obtain new memory from the other thread. (It is so named since for `DataPipe` each additional attempt will cause the thread to sync its copy of the read/write pointers by reading from the shared one; see the sketch after this list.) Each sync will trigger an increase in the back-off (by the factor of 1.25 mentioned above).
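To illustrate what a "sync" means for `DataPipe`, here is a rough sketch of the pointer-caching scheme described above. It is a guess at the general idea, not the actual `DataPipe` code; only the producer side is shown and the names are made up.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Rough sketch of a DataPipe-style SPSC ring: positions are monotonically
// increasing counters; each side caches the other side's position and only
// re-reads the shared atomic (a "sync") when the cached value shows no space.
struct DataPipeSketch {
    std::vector<float> buffer;
    alignas(64) std::atomic<size_t> write_pos{0}; // written by the producer
    alignas(64) std::atomic<size_t> read_pos{0};  // written by the consumer
    alignas(64) size_t cached_read = 0;           // producer's view of read_pos

    explicit DataPipeSketch(size_t n) : buffer(n) {}

    // Producer: number of elements writable right now. Keeps a one-element gap
    // so the producer never fully wraps onto unread data.
    size_t writable_size()
    {
        size_t wr = write_pos.load(std::memory_order_relaxed);
        size_t space = buffer.size() - 1 - (wr - cached_read);
        if (space == 0) {
            // Pipe stall: sync our cached copy of the consumer's position.
            cached_read = read_pos.load(std::memory_order_acquire);
            space = buffer.size() - 1 - (wr - cached_read);
        }
        return space;
    }
    // After writing `n` elements starting at write_pos % buffer.size(), publish them.
    void advance_write(size_t n)
    {
        write_pos.store(write_pos.load(std::memory_order_relaxed) + n,
                        std::memory_order_release);
    }
    // The consumer side mirrors this with a cached copy of write_pos.
};
```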
We also vary the total buffer size. Note that since we would like to make use of all cores, it is the buffer sizes below L3/core that are the most interesting to us.

The code for the test is in `cpu-two-threads-ring-buffer.cpp`.
.-
Intel Core i9-10885H
For the raw data, see these CSV files when using `BlockRing` (for read and write) with the corresponding no-memory-access test (for read and write), and using `DataPipe` (for read and write) with the corresponding no-memory-access test (for read and write).

First, let's look at the performance.
The solid lines show the time it takes (per KiB of data) when the data is actually passed through the buffer, whereas the dashed lines show the corresponding condition when there's no memory access on the buffer.

Looking at the overhead from the no-memory-access data at small buffer sizes, it seems that the overhead for `BlockRing` depends more strongly on the number of blocks (potentially due to the larger number of flag variables) whereas the overhead for `DataPipe` is fairly constant. For block counts up to about 32, `BlockRing` seems to have a lower overhead. Possibly because of/related to this, the performance of `BlockRing` is slightly better over a wider range of buffer sizes below L3/core, even though the two have roughly equal performance for 1 to 4 MiB buffer sizes. We can also see the drop in performance when the buffer gets bigger than 4 MiB, before getting into the main memory, as we've previously seen in the write bandwidth tests.

The maximum achieved bandwidth is about 25 B/ns, which is lower than the 34 B/ns measured for the write bandwidth but isn't that far from the 28 B/ns when writing to a larger buffer in L3. Maybe the drop in performance in both cases is due to interaction with other cores, L3 slices, the memory controller, or cache associativity.

Next, let's check the values for some other counters.
For `BlockRing`

For `DataPipe`
The solid lines in each of the plots correspond to the number for the read/consumer thread, whereas the dashed line is for the write/producer thread.
It's unsurprising that there are cache misses when the buffer size is equal to or larger than L3. However, what's interesting is that the cache misses start as early as 1/2 the L3 size (also about the same 4 to 5 MiB threshold where we observed a decrease in L3 bandwidth before). The read (but not really the write) miss rate also depends significantly on the block number, with smaller blocks causing fewer misses. This trend isn't surprising since a smaller block means that it can be read by the consumer before being evicted from the cache, but I don't think this should happen when the whole buffer can fit in L3. Maybe the cache miss counter also includes some events where the data has to be transferred from a different L3 slice? Or maybe it's something related to how cache associativity works at the L3 level.

For the R/W per block, the number for `BlockRing` is exactly 1 as expected. It is higher, and more so for fewer blocks, for `DataPipe`. The write count seems to be consistently higher than the read count, suggesting that the producer is more likely to catch up to the consumer. This is despite the higher cache miss rate for the producer, which is a bit strange. The trend also seems to be fairly regular, with the write count for large buffer sizes approaching the read/normal number for half the block number (twice the block size).

The pipe stall percentage clearly shows three different regimes.
- When the buffer is small, the throughput is limited by the synchronization overhead.
- When the buffer is big and there are a lot of cache misses, the throughput is limited by the main memory bandwidth.
- In between the two cases above, the throughput is limited by the L3 bandwidth. This is also the case we care about the most.
Somehow for cases 1 and 2 the producer can catch up with the consumer, whereas for case 3 the consumer catches up with the producer most of the time. I do understand that for intermediate sizes, where the synchronization overhead is insignificant, the consumer thread has an advantage when it can consume "hot" data that is still in the cache, which might help it catch up. However, I don't really see a good reason for the producer to consistently catch up in the other cases (maybe because stores don't have to wait for the results??), i.e. why the synchronization and cache miss overhead would hit the consumer thread harder than the producer thread.
Comparing the two implementations regarding the stall percentage, `DataPipe` appears to have a lower stall percentage for case 3. However, since it issues more reads/writes, the total number of stalls is actually about the same. On the other hand, the write stall seems to be much higher for `DataPipe` in this region whereas it's almost zero for `BlockRing`. It also seems that for 8 to 64 blocks, there is a fairly wide range of buffer sizes for `BlockRing` where there are almost no stalls at all.

Finally, we can look at the syncs per stall. There's a fairly strong anti-correlation between the syncs per stall and the stall count, so it's likely that most of the stalls actually have a fairly small number of syncs but there are a few outliers that hang for a long time. This could be related to other things running on the machine, or to the initial part of the run, so I'll not pay too much attention to it as long as it doesn't affect the performance too much.
- Intel Core i7-6700K
For the raw data, see these CSV files when using `BlockRing` (for read and write) with the corresponding no-memory-access test (for read and write), and using `DataPipe` (for read and write) with the corresponding no-memory-access test (for read and write).

First, the performance.
The solid lines in each of the plots correspond to the number for the read/consumer thread, whereas the dashed line is for the write/producer thread.
And then the more detailed numbers.
For `BlockRing`

For `DataPipe`
The solid lines in each of the plots correspond to the number for the read/consumer thread, whereas the dashed line is for the write/producer thread.
Overall very similar, just faster. The average syncs per stall is lower. The R/W per block for `DataPipe` seems cleaner and shows the switch-over for large buffer sizes more clearly (still not exactly sure why...).
- Intel Core i9-7900X
For the raw data, see these CSV files when using `BlockRing` (for read and write) with the corresponding no-memory-access test (for read and write), and using `DataPipe` (for read and write) with the corresponding no-memory-access test (for read and write).

First, the performance.
The solid lines in each of the plots correspond to the number for the read/consumer thread, whereas the dashed line is for the write/producer thread.
And then the more detailed numbers.
For `BlockRing`

For `DataPipe`
The solid lines in each of the plots correspond to the number for the read/consumer thread, whereas the dashed line is for the write/producer thread.
There are some differences this time. We did not measure to a large enough buffer size for the producer to catch up to the consumer again. The pipe stalls for the producer also end at much lower buffer sizes but seem to go back up a little for 4 to 16 blocks.
In general, using a buffer size that's about or slightly below L3/core, it seems that `BlockRing` gives slightly more consistent performance. A block count of about 16 also seems to be about the best tradeoff. Another slight advantage of using `BlockRing` is that this way of managing the buffer is more likely what we'll need to do for the GPU, since we have much less flexibility when scheduling a job there. Using it on the CPU would make the two more consistent, though there may not be too much code sharing.
Next, let's run the tests between different CPU pairs to see if there's any structure in the cpu-cpu bandwidth.
For all the tests we'll use a 1 MiB buffer and 16 blocks.

The code for the test is the same file as above, updated to allow changing the cores we pin the threads to: `cpu-two-threads-ring-buffer.cpp`.
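For reference, pinning a thread to a specific core on Linux can be done roughly as follows. This is a sketch using `pthread_setaffinity_np`; the helper name is made up and the test code may do this differently.

```cpp
#include <pthread.h>
#include <sched.h>
#include <thread>

// Pin the given std::thread to a single CPU core (Linux-specific; needs
// _GNU_SOURCE, which g++ defines by default).
static void pin_to_core(std::thread &t, int core)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    pthread_setaffinity_np(t.native_handle(), sizeof(cpu_set_t), &cpuset);
}

// Usage sketch: run the producer on `core_a` and the consumer on `core_b`,
// then sweep over (core_a, core_b) pairs to map the core-to-core bandwidth.
```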
- Intel Core i9-10885H
For the raw data, see these CSV files when using `BlockRing` (for read and write), and using `DataPipe` (for read and write).
- Intel Core i7-6700K
For the raw data, see these CSV files when using `BlockRing` (for read and write), and using `DataPipe` (for read and write).
- Intel Core i9-7900X
For the raw data, see these CSV files when using `BlockRing` (for read and write), and using `DataPipe` (for read and write).
I don't really see any obvious structure, which is what we expect, and the variation seems to be about 10 to 15 %, so it doesn't matter that much. It does seem that writing from a core with a lower number to one with a higher number somehow has slightly better bandwidth, especially between the "middle" cores (the upper right part in the middle is generally brighter than the lower left part), though this isn't very significant for the i9-7900X. Maybe it has something to do with the ring bus.
Finally, let's add computation to the test.
The producer thread will compute the sum of a number of sine functions (channels) (`amp * sin(freq * t)`, same as the one used in the single-thread performance test). The consumer thread will read from the ring buffer and compute its sum with a potentially different number of sine functions. We will record the performance as the time it takes to compute each sine function on the thread with more channels.

We'll test using buffer sizes around and below L3/core, and we'll use only the `BlockRing` implementation since it gives more consistent results in this range.

Optionally, each thread may have a local buffer for the sums of the sine functions. When this is enabled, instead of using `pause` to back off, the thread will first fill this local buffer with computation results, which are later written to or added to the ring buffer to produce the final results. For simplicity, we fixed the local buffer size at 32 KiB, and the thread checks for ring buffer availability once after filling the first 16 KiB of it.

This test should be very close to what we'll actually do to generate the data (a sketch of the producer loop is shown below). The real one will need more logic on each thread to generate the parameters and to accumulate the time/phase. There'll also be more threads, and at least some threads will need to consume and produce at the same time and potentially consume from more than one other thread.
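As a rough sketch of the producer side described above: `Ring` stands in for a `BlockRing`-style interface like the one sketched earlier, and the function names and sample spacing are illustrative, not the actual test code.

```cpp
#include <cmath>
#include <cstddef>

// Producer loop sketch: sum `nchn` sine channels (amp * sin(freq * t)) into
// each block of a block ring. `Ring` must provide try_get_write_block(),
// publish_block() and block_size, as in the earlier sketch.
template<typename Ring>
void produce_blocks(Ring &ring, size_t nblocks, size_t nrepeat,
                    const float *amp, const float *freq, size_t nchn)
{
    float t = 0;
    const float dt = 1.6e-9f; // hypothetical sample spacing (625 MS/s)
    for (size_t rep = 0; rep < nrepeat; rep++) {
        for (size_t blk = 0; blk < nblocks; blk++) {
            float *out;
            // Spin (with back-off, or by filling the local buffer when enabled)
            // until the consumer has released this block.
            while (!(out = ring.try_get_write_block(blk)))
                ;
            for (size_t i = 0; i < ring.block_size; i++, t += dt) {
                float v = 0;
                for (size_t c = 0; c < nchn; c++)
                    v += amp[c] * std::sin(freq[c] * t);
                out[i] = v;
            }
            ring.publish_block(blk);
        }
    }
}
```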
We run all the tests by writing from core 0 to core 1.
The code for the test is in `cpu-two-threads-ring-buffer-compute.cpp`.

For the plots below, the performance/throughput plots (left) use solid lines for no local buffer and dashed lines for 32 KiB of local buffer. The pipe stall plots (right) use solid lines for read with no local buffer, dotted lines for write with no local buffer, dashed lines for read with 32 KiB of local buffer, and dash-dotted lines for write with 32 KiB of local buffer. The colors are the same as those used above.

The red dotted horizontal line marks the performance calculated from the pure computation test. Note that due to a frequency change on the i7-6700K, the performance number for the red line there is scaled accordingly.
- Intel Core i9-10885H
CPU frequency governor: powersave
For the raw data, see these CSV files for read and write.
With equal amount of computation on the producer and consumer.
With 3 times more computation on the producer than the consumer.
With 3 times less computation on the producer than the consumer.
- Intel Core i7-6700K
CPU frequency governor: powersave. However, the frequency appears to be 4.6 GHz rather than the 4.11 GHz observed in the previous test. The performance limit below has been scaled accordingly, though it still seems to be off a little...

For the raw data, see these CSV files for read and write.
With equal amount of computation on the producer and consumer.
(Note that the drop in the performance for 8 channels does not seem to be caused by the CPU frequency change.)
With 3 times more computation on the producer than the consumer.
With 3 times less computation on the producer than the consumer.
- Intel Core i9-7900X
For the raw data, see these CSV files for read and write.
With equal amount of computation on the producer and consumer.
With 3 times more computation on the producer than the consumer.
With 3 times less computation on the producer than the consumer.
There are very few things worth discussing for each CPU separately, so I'll just do a summary for all of them together. The performance for a higher number of channels is much closer to the limit. This is especially true for smaller (about 512 KiB) buffers, maybe because of better usage of the cache? (Still, things have to go through L3, so I'm not sure.) The asymmetric cases get closer to the limit likely because of the larger number of channels on one thread.

The local buffer does seem to reduce the stalls in the pipe, though the effect on the overall performance is minimal. There don't seem to be too many stalls in general, which is good. Maybe we'll include a small one to help with some edge cases.
In general, there isn't anything too surprising, and we are fairly close to the computation-limited throughput overall, which is the regime we want to be in. Unless there's some catastrophic interference when we add more pipes to each thread, we should be able to sustain this performance on all cores. The use of complex AVX512 instructions on all threads will of course downclock the CPUs of this generation, but we should still be able to generate more than 100 traps at 625 MS/s on the CPU only.

Next, we'll add more threads and test combining the computation from multiple threads into a single output.