Share data.table among R sessions by reference #3104
There might be some way to use
Thank you for the suggestion. I made similar attempts, but after reading up I believe you simply can't do this directly, since these are externalptr objects, which are of no use outside their corresponding R session.
File-backed data.tables (#1336) might help (?), though I imagine they would have to be locked for editing/writing.
There is only one way to share memory across processes: one of the processes has to allocate a new shared memory region, and the other processes then attach to it. Now, a difficult part is to have a data.table that is backed by such a region instead of regular R memory. These are all quite hard questions, and I don't know the answers. However, it is possible in principle; at least Python demonstrates that it can be done.
Moreover, how could process B know whether the memory is still valid (rather than garbage), since process A may already have been killed?
@st-pasha and @shrektan, there is one caveat to be aware of with this type of file-backed shared-memory object: some (all?) HPC clusters with multiple nodes hate them. If you create a file-backed shared-memory object and try to access it from multiple R sessions, the object essentially locks up those processes due to the consistency checks made by the HPC filesystem (because those processes might be spread over multiple nodes, even if you explicitly ask for the same node).

This is something I had the (dis)pleasure of learning when trying to publish my NetRep R package during my PhD. After paper acceptance and passing software review, I discovered this problem and ended up having to rip out the internals and quickly learn C++ so I could parallelise the code (by casting to a C++ Armadillo matrix and writing multithreaded code that operated on those C++ objects in shared memory).
Actually, we could probably just do something simple with the bigmemory package: write a function that converts each column to a big.matrix object, and another that can load those matrices/columns and wrap them in a data.table in your new R session. I might play with this over the weekend.
I used bigmemory for a while and it's not ideal. Attaching a big.matrix can take quite a while. Additionally, the package tends to accumulate temp files that you might have to clean up yourself once in a while.
Since you mentioned feather, you may want to have a look at the fst package (https://github.com/fstpackage/fst) if you haven't already. The roadmap looks promising: fstpackage/fst#117
fst is great; I used it in my last project and never had any issues with it. I didn't know about their roadmap, though. Feather/Apache Arrow is interesting due to its promise of sharing data by reference within quite a rich ecosystem of languages and services.
It is my understanding that both At the same time,
Follow here for the R implementation of arrow.
The C API offers
Do you feel this is a problem? Properly
This might be more of a headache. Maybe I'm mostly name-checking here. Does anyone with intimate familiarity with the
@nbenn Thanks, this hits the mark. If R has a custom memory-allocator mechanism, then it will certainly know to call the user-provided custom de-allocator when the time comes.
@sritchie73 Can you shed more light on your experiences with file-backed shared-memory objects in HPC environments? I gather you had problems with file-backed objects. I would not expect the file system to interfere with management of shared memory. Furthermore, consider, for example, applying a function to a data.table in parallel.
I guess this depends on
@nbenn Digging into my old emails, the file system was GPFS; it was something to do with a conflict with the way shared memory was set up by the Boost headers. From my limited understanding and experience, it seemed like the filesystem would lock I/O access to the file-backed shared-memory objects if multiple processes were trying to access them. My understanding is this was the filesystem's way of ensuring consistency of files across multiple physical nodes. This problem was present whether or not you actually created a backing file on disk or let the package create one.

The way I got around this was to move all my parallel code from R into C++. I wrote a multithreaded procedure where each thread gained access to my large matrices via a pointer passed to it. Use of shared memory in this way worked fine. However, this was a completely different problem from sharing objects across R sessions.
What about using disk.frame (https://github.com/xiaodaigh/disk.frame)? It supports most dplyr verbs and data.table syntax.
I'm looking into ways of sharing a data.table among several R processes on the same machine by reference. Is there already one that I missed? I'm looking for functionality analogous to this:
https://www.rdocumentation.org/packages/bigmemory/versions/3.12/topics/describe%2C%20attach.big.matrix
Thank you for the great work on this package.