Issues with multithreaded code and CPU dispatching. #65
Comments
Hi, thanks for raising this issue. Always interesting to get some perspective from a library user. I'll address your points specifically below, but first I want to clarify the design philosophy behind this library. It runs a bit counter to the way you intend to use it.

The concept behind this library is "compile once, run anywhere". It was intended to be compiled for all the SIMD architectures that the compiler will support, and then at runtime, use CPU feature detection to decide which codecs to actually use. This should make the library portable across all similar architectures, and make it distributable as part of a package system or binary distribution. This benefit trickles down to the user if they also distribute their software through a centralized repo.

The optimization focus in this library is not multithreaded, but multicore. It supports OpenMP, which greatly accelerates encoding and decoding of large pieces of data. (But I think you raise a valid point here. The library should assume that it can be called from multiple threads, and not share common state.)

Are these good design principles? Perhaps not, and I'm personally a bit unhappy with the early decision to implement my own CPU dispatcher, but that's the way things currently stand. I do have some future plans to allow the user to do their own CPU dispatching and call architecture-specific codecs directly, but they won't be finished in the short term.

There is indeed a user flags option to force a codec, but this is a bit of a misfeature. The flags are mainly there as a testing tool so that I can run a specific individual codec for testing and benchmarking. The idea was that end users should not be using the flags, they should be using runtime CPU detection. I recognize that it's not the best choice in hindsight, and intend to fix it, but here we are.

As to the specific points:
static __thread struct codec codec = { NULL, NULL }; This will make
I think this library could be made to fit your use-case, but as you see there are some different tradeoffs in play which would require either some redesign on my end (slow, unfortunately), or patching from your end. I'm open to suggestions on how and where to improve.
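As an illustration of the "compile once, run anywhere" idea described above, here is a minimal sketch of runtime codec selection. It is not the library's own dispatcher: it assumes GCC or Clang on x86 and uses the compiler builtins `__builtin_cpu_init` and `__builtin_cpu_supports` as a stand-in, and the names `codec_id`, `detect_codec` and `codec_name` are hypothetical.

```c
/* Hypothetical codec identifiers, from most to least specialized. */
enum codec_id { CODEC_AVX2, CODEC_SSSE3, CODEC_PLAIN };

/* Pick a codec at run time based on what the CPU reports.
 * Assumption: GCC/Clang builtins on x86; everything else falls back
 * to the plain codec. */
enum codec_id detect_codec(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return CODEC_AVX2;
    if (__builtin_cpu_supports("ssse3"))
        return CODEC_SSSE3;
#endif
    return CODEC_PLAIN;
}

/* Human-readable name, handy for logging which codec was chosen. */
const char *codec_name(enum codec_id id)
{
    switch (id) {
    case CODEC_AVX2:  return "avx2";
    case CODEC_SSSE3: return "ssse3";
    default:          return "plain";
    }
}
```

The same binary can then be shipped to machines with different SIMD support, since the choice is made on the machine that runs it rather than the one that compiled it.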
Thank you for the answer! The question is resolved.
To be honest, I wouldn't mind if you got rid of the mutex and used thread-local storage for the codec. In an HPC environment you can have 200 cores doing the same thing, which in this case would cause a huge performance hit because of false sharing, and I'm not keen on maintaining a separate fork. This is also why I hope you'll eventually merge the CMake build: we'll probably use it as a library, and that is one other thing I'm not keen on maintaining :).
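A rough sketch of the thread-local approach mentioned above, assuming a C11 compiler. It uses the standard `_Thread_local` keyword where the snippet quoted earlier uses the GCC-specific `__thread`; the `struct codec` layout, `copy_bytes` and `thread_codec` are hypothetical stand-ins, not the library's definitions.

```c
#include <stddef.h>

/* Hypothetical codec layout; the library's real struct differs. */
typedef size_t (*codec_fn)(const char *src, size_t len, char *out);
struct codec {
    codec_fn enc;
    codec_fn dec;
};

/* Stand-in for a real encoder/decoder: copies the input unchanged. */
static size_t copy_bytes(const char *src, size_t len, char *out)
{
    for (size_t i = 0; i < len; i++)
        out[i] = src[i];
    return len;
}

/* Each thread gets its own copy, so no thread ever writes to a
 * location that another thread reads. */
static _Thread_local struct codec tl_codec = { NULL, NULL };

/* Return the calling thread's codec, filling it in on first use. */
const struct codec *thread_codec(void)
{
    if (tl_codec.enc == NULL) {
        /* A real dispatcher would choose by CPU features or flags. */
        tl_codec.enc = copy_bytes;
        tl_codec.dec = copy_bytes;
    }
    return &tl_codec;
}
```

Because the state is per thread, neither a mutex nor atomics are needed, at the cost of repeating the codec selection once in every thread.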
Suppose we are calling `base64_encode` or `base64_decode` in a loop (for different inputs) and doing it from multiple threads (for different data). If we pass non-zero `flags` to these routines, the library writes to a single global variable repeatedly in the `codec_choose_forced` function, which leads to "false sharing" and poor scalability.
There is no method to pre-initialize the choice of codec. (Actually, there is: we can simply call one of the encode/decode routines in advance with empty input, but it looks silly.) If we don't do that and run our code with ThreadSanitizer, it will complain about a data race on the codec function pointers. In fact, it is safe, because each is a single pointer: a single machine word that is (supposedly) placed in an aligned memory location. But we have to annotate it as `_Atomic` and store/load it with `memory_order_relaxed`. Look at the similar issue here: Make dynamic dispatch free of TSan warnings simdjson/simdjson#256
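To make the `_Atomic` suggestion concrete, here is a minimal C11 sketch of a lazily initialized function pointer that is loaded and stored with `memory_order_relaxed`. The names `encode_fn`, `encode_plain`, `choose_encoder`, `chosen_encoder` and `encode` are hypothetical and do not correspond to the library's internals.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical codec signature; the library's real types differ. */
typedef size_t (*encode_fn)(const char *src, size_t len, char *out);

/* Stand-in for a real codec: copies the input unchanged. */
static size_t encode_plain(const char *src, size_t len, char *out)
{
    for (size_t i = 0; i < len; i++)
        out[i] = src[i];
    return len;
}

/* NULL means "not chosen yet". */
static _Atomic(encode_fn) chosen_encoder = NULL;

/* A real dispatcher would query CPU features here. */
static encode_fn choose_encoder(void)
{
    return encode_plain;
}

size_t encode(const char *src, size_t len, char *out)
{
    /* Relaxed ordering is enough: the pointer targets code that exists
     * before any thread starts, and racing threads store the same value. */
    encode_fn fn = atomic_load_explicit(&chosen_encoder, memory_order_relaxed);
    if (fn == NULL) {
        fn = choose_encoder();
        atomic_store_explicit(&chosen_encoder, fn, memory_order_relaxed);
    }
    return fn(src, len, out);
}
```

On common platforms a relaxed load or store of a pointer compiles to an ordinary move, so the annotation mainly tells the compiler and ThreadSanitizer that the concurrent access is intentional.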
Suppose we use these routines in a loop for short inputs. They have a branch to check whether the encoders/decoders were initialized. We want to move these branches out of the loop: check the CPU once and call the specialized implementation directly. But the architecture-specific methods are not exported, so we cannot do that. We also have to pay for two non-inlined function calls.
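What hoisting the check out of the loop could look like, as a sketch only: it assumes the library exposed some way to obtain a codec function pointer, which it currently does not. `select_encoder`, `encode_generic` and `encode_many` are hypothetical names introduced for illustration.

```c
#include <stddef.h>

/* Hypothetical codec signature; the library's real types differ. */
typedef size_t (*encode_fn)(const char *src, size_t len, char *out);

/* Stand-in for an architecture-specific codec: copies the input. */
static size_t encode_generic(const char *src, size_t len, char *out)
{
    for (size_t i = 0; i < len; i++)
        out[i] = src[i];
    return len;
}

/* Hypothetical accessor: resolve the codec once, e.g. by CPU features. */
encode_fn select_encoder(void)
{
    return encode_generic;
}

/* Encode n short inputs back to back into out; returns bytes written.
 * The "is a codec chosen?" branch runs once, not once per input. */
size_t encode_many(const char *const *inputs, const size_t *lens,
                   size_t n, char *out)
{
    encode_fn fn = select_encoder();
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += fn(inputs[i], lens[i], out + total);
    return total;
}
```

The loop body is then a single indirect call per input, with no initialization check and no extra wrapper call in between.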
All these issues were found while integrating this library into ClickHouse: ClickHouse/ClickHouse#8397