[C++] Add support for disabling threading for emscripten #35176
Comments
I'd say this is non-trivial but should be doable. Most non-I/O components generally have a way to disable threading. There are probably some exceptions however and we could clean those up. I/O, on the other hand, tends to rely on the I/O thread pool being available. What does I/O look like in emscripten? I'm thinking of both "local disk I/O" (e.g. read a parquet file from local disk) and network I/O (e.g. read a parquet file from S3). Does emscripten have APIs for these things? Do they have async variants?
On emscripten: in the browser, local disk is memory based and may or may not be synced to some kind of permanent storage (via an asynchronous syncfs call). Access to this disk is synchronous, but very quick because it is in memory. In node, you can use the real file system directly. Network is weird because emscripten is typically hosted in browsers: for http / https one can call out to JavaScript to use the Fetch API, which is asynchronous. Right now there's only async I/O for network, with the exception of XMLHttpRequest if you're in a web worker, which is a hacky workaround for synchronous http access. In theory there's also a websockets wrapper which turns socket calls in C into websocket calls to the hosting server, but I don't know how well it works. Basically, as I understand it, the potential in emscripten for arrow is:
Personally, for what I want, I just want core arrow with file support to work on emscripten. I think that is a decent starting point before getting into complexities.
One thought: if ThreadPool were to wrap a singleton SerialExecutor when ARROW_DISABLE_THREADING is set, things might just work well enough for a first attempt at an unthreaded build. I don't know whether there'd be any deadlocks in I/O vs compute tasks, though.
If everything is tasks and a single That being said, I'm sure there are a few bugs / things that will need to be converted. However, I'm not sure we can just wrap the global thread pools with a serial executor. The challenge with the serial executor is that it has to co-opt the calling thread. This means we create the executor when the call starts.
However, if we are combining I/O and CPU into a single pool...then it should be possible to create a special executor that creates a serial executor lazily, when a task is first submitted.
Something like...
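A minimal sketch of what such a lazily started serial executor could look like. This is illustrative only: it is not the code from this comment and not Arrow's executor API, and the names `LazySerialExecutor` and `Submit` are hypothetical. The first call to `Submit` co-opts the calling thread as the run loop; tasks submitted while that loop is active are only queued.

```cpp
#include <deque>
#include <functional>
#include <utility>

// Hypothetical illustration; not Arrow's executor API.
// Single-threaded by design: no locking, and no exception handling
// (a throwing task would leave running_ set).
class LazySerialExecutor {
 public:
  // Queue a task. If no run loop is active, the caller's thread becomes the
  // run loop and drains the queue before Submit returns.
  void Submit(std::function<void()> task) {
    queue_.push_back(std::move(task));
    if (running_) {
      return;  // Nested submission: the active loop below will run it.
    }
    running_ = true;
    while (!queue_.empty()) {
      std::function<void()> next = std::move(queue_.front());
      queue_.pop_front();
      next();  // May call Submit again, which only enqueues.
    }
    running_ = false;
  }

 private:
  std::deque<std::function<void()>> queue_;
  bool running_ = false;
};
```

With this shape, a top-level `Submit` returns only once the task and everything it spawned have run, so callers never need a second thread to make progress.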
I did some work on this. It's currently functional for quite a lot of things but is failing some tests, which I'm working through. It has to keep the concept of multiple executors because loads of the other logic relies on that, but all active tasks from any executor are dispatched in turn whenever anything waits for any task or future.
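The dispatch-on-wait behaviour described here can be sketched roughly as below. This is an assumption-laden illustration, not the patch itself: `TaskQueue`, `IntFuture` and `Executor` are hypothetical names, and the future carries only an `int` to keep the example short. Several executor objects keep their identity but feed one shared queue, and waiting on any future runs queued tasks in order until that future is finished.

```cpp
#include <deque>
#include <functional>
#include <memory>
#include <stdexcept>
#include <utility>

// All names here are hypothetical; this is not Arrow's Future/Executor API.

// One queue shared by every executor.
class TaskQueue {
 public:
  void Push(std::function<void()> task) { tasks_.push_back(std::move(task)); }

  // Run the oldest pending task, if any. Returns false if the queue is empty.
  bool RunOne() {
    if (tasks_.empty()) return false;
    std::function<void()> task = std::move(tasks_.front());
    tasks_.pop_front();
    task();
    return true;
  }

 private:
  std::deque<std::function<void()>> tasks_;
};

// A future whose Wait() makes progress by running queued tasks itself.
class IntFuture {
 public:
  explicit IntFuture(TaskQueue* queue)
      : queue_(queue), state_(std::make_shared<State>()) {}

  void MarkFinished(int value) {
    state_->value = value;
    state_->done = true;
  }

  int Wait() {
    while (!state_->done) {
      if (!queue_->RunOne()) {
        throw std::runtime_error("no runnable tasks; future cannot finish");
      }
    }
    return state_->value;
  }

 private:
  struct State {
    bool done = false;
    int value = 0;
  };
  TaskQueue* queue_;
  std::shared_ptr<State> state_;  // Shared between the future and its task.
};

// Distinct executors (for example "CPU" and "I/O") that share one queue.
class Executor {
 public:
  explicit Executor(TaskQueue* queue) : queue_(queue) {}

  IntFuture Submit(std::function<int()> fn) {
    IntFuture future(queue_);
    queue_->Push([future, fn = std::move(fn)]() mutable {
      future.MarkFinished(fn());
    });
    return future;
  }

 private:
  TaskQueue* queue_;
};
```

Nothing runs at submission time in this sketch; a task submitted to the "I/O" executor may end up being run by a `Wait()` on a future from the "CPU" executor, which is what allows single-threaded progress.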
…35672)
As previously discussed in #35176 this is a patch that adds an option `ARROW_ENABLE_THREADING`. When it is turned off, arrow threadpool and serial executors don't spawn threads, and instead run tasks in the main thread when futures are waited for. It doesn't mess with threading in projects included as dependencies, e.g. multithreaded malloc implementations because if you're building for a non threaded environment, you can't use those anyway.
Basically where this is at is that it runs the test suite okay, and I think should work well enough to be a backend for pandas on emscripten/pyodide.
What this means is:
1) It is possible to use arrow in non-threaded emscripten/webassembly environments (with some build patches specific to emscripten which I'll put in once this is in)
2) Most of arrow just works, albeit slower in parts.
Things that don't work and probably won't:
1) Server stuff that relies on threads. Not a massive problem I think because environments with threading restrictions are currently typically also restricted from making servers anyway (i.e. they are web browsers)
2) Anything that relies on actually doing two things at once (for obvious reasons)
Things that don't work yet and could be fixed in future:
1) use of asynchronous file/network APIs in emscripten which would mean I/O could work efficiently in one thread.
2) asofjoin - right now the implementation relies on std::thread - it needs refactoring to work with threadpool like everything else in arrow, but I'm not sure I am expert enough in the codebase to do it well.
* Closes: #35176
Lead-authored-by: Joe Marshall <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Weston Pace <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
Describe the enhancement requested
I've built most of arrow (pyarrow and dependencies) for emscripten. It would be good to have a way to disable threading, as a lot of emscripten use is in browsers where threading may not be available.
At the moment I just put in dummy pthreads, which means some functionality in e.g. datasets fails because it assumes threading is available.
Component(s)
C++