Add section to the patterns doc about avoiding simultaneous updates to single file #2040
Conversation
@quentinsf, thanks for your PR! By analyzing the history of the files in this pull request, we identified @ulzha, @dhurlburtusa and @Tarrasch to be potential reviewers.
@quentinsf I'm curious why you need to have multiple tasks writing to the same file. Maybe I'm missing something, but that feels like a strange construction to me. I like the idea of making …
Hi @chrispalmer - yes, I agree, it isn't immediately obvious why you might want to do this! But my searches would suggest I'm not completely alone.

In our case, we're building up HDF5 files from experimental sessions. If you're not familiar with HDF, imagine a ZIP file with a directory-type hierarchy, but the things at the nodes are tables of data rather than generic files. It's a great format, and you can easily come back and replace or add to a file later on, so it's arguably more like a filesystem - one that can cope with terabytes of data and can have interesting metadata attached to each table. But it is still a single file, and it gets confused if more than one process tries to update it at once. (I guess you'd get something similar with a big XML or JSON file, where you can't simply append to the end of it, or, indeed, if you were adding stuff to a ZIP or tar file.)

We're creating one file for each of 50 experimental sessions, but some of the tables that will go in it take a long time to create (e.g. an hour), and some of the source data needed to create them is still coming in. I want my colleagues to be able to use the data available so far, but to add to the file when new data or new analyses arrive. ("I've added the ECG data to each session now if you want to use it.") Each file is currently a couple of gigabytes, so recreating them is a nuisance, and part of the idea of using this format is that you don't need to.

It's quick and easy to find out whether a particular table exists in a file, so I have an HDF5TableTarget which is created from an HDFFile and a table name. When new data comes in, the dependencies all work nicely, and the only problem is ensuring that no two workers are trying to create targets that share the same HDFFile. One way is to use only a single worker, but that slows me down and stops me updating 10 HDF files with 10 workers. So this seemed like a good solution?

Apologies for the long essay, but I hope that explains why I wanted to do it :-) All the best,
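For anyone curious, here is a minimal sketch of the kind of target described above. `HDF5TableTarget` is the name used in the comment, but this particular implementation is illustrative only - Luigi itself ships no such target, and it assumes PyTables (h5py would work equally well):

```python
import os

import luigi
import tables  # PyTables


class HDF5TableTarget(luigi.Target):
    """Exists when a named table is present inside an HDF5 file."""

    def __init__(self, hdf_path, table_path):
        self.hdf_path = hdf_path
        self.table_path = table_path  # e.g. '/session01/ecg'

    def exists(self):
        # The file itself may not have been created yet.
        if not os.path.exists(self.hdf_path):
            return False
        # PyTables File objects support `in` for checking node paths.
        with tables.open_file(self.hdf_path, mode='r') as f:
            return self.table_path in f
```

Because `exists()` is cheap, a task can declare one of these as its output and Luigi's normal dependency resolution takes care of deciding which tables still need to be built.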
If this is to be merged, it should come with a warning:
(something like that). Does that sound fair @quentinsf? :)
Yes, it certainly makes sense to warn the inexperienced. If people still have concerns, then I'm happy to write it up as a more detailed blog post instead, which people could find by Googling. But I do think it should be online somewhere, since this method allowed me to keep using Luigi rather than finding another tool. If others would like to keep it, then I suggest the following text. I'll work out how to tweak the PR...
Looks good to me now! :)
@Tarrasch @quentinsf: Nice contribution.
Description
Adds a section to the 'Luigi patterns' documentation about how to avoid multiple workers trying to write to a single destination simultaneously.
Motivation and Context
I had created tasks and targets that would create new tables in an existing HDF5 file. The problem was that if two simultaneous processes tried to write to the same file, it would be corrupted. But I couldn't find examples of how to avoid this, except by using `workers=1`.
Finally, I came across daveFNbuck's example of making `resources` dynamic, at the end of #1362, which was a perfect, neat solution to my file problem, and I thought something along those lines deserved a place in the documentation: it would have saved me many, many hours...
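For readers who don't want to dig through #1362, the idea is to declare `resources` as a property whose key is derived from the destination file, so the scheduler never runs two tasks that write to the same file at the same time. A minimal sketch (the task and parameter names here are hypothetical, not taken from #1362):

```python
import luigi


class AddTableToHDF5(luigi.Task):
    hdf_path = luigi.Parameter()
    table_name = luigi.Parameter()

    @property
    def resources(self):
        # Claim one unit of a resource named after the destination file.
        # A resource with no configured capacity defaults to 1, so two
        # tasks writing to the same file can never run concurrently,
        # while tasks targeting different files still run in parallel.
        return {'hdf5_{}'.format(self.hdf_path): 1}

    def run(self):
        # ... open self.hdf_path and add the table self.table_name ...
        pass
```

With this in place you can keep `workers=10` and still safely update 10 HDF files at once, which was exactly my use case.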
Have you tested this? If so, how?
Built the docs with `tox -e docs`, and the output looks good.