
Add section to the patterns doc about avoiding simultaneous updates to single file #2040

Merged
merged 2 commits into spotify:master on May 10, 2017

Conversation

quentinsf
Contributor

Description

Adds a section to the 'Luigi patterns' documentation about how to avoid multiple workers trying to write to a single destination simultaneously.

Motivation and Context

I had created tasks and targets that would create new tables in an existing HDF5 file. The problem was that if two simultaneous processes tried to write to the same file, it would be corrupted. But I couldn't find examples of how to avoid this, except by using workers=1.

Finally, I came across daveFNbuck's example on making 'resources' dynamic, at the end of #1362, which was a perfect and neat solution to my file problem, and thought something along those lines deserved a place in the documentation: it would have saved me many, many hours...

Have you tested this? If so, how?

Have built the docs with tox -e docs and the output looks good.

@mention-bot

@quentinsf, thanks for your PR! By analyzing the history of the files in this pull request, we identified @ulzha, @dhurlburtusa and @Tarrasch to be potential reviewers.

@chrispalmer

@quentinsf I'm curious as to why you need multiple tasks writing to the same file. Maybe I'm missing something, but that feels like a strange construction to me.

I like the idea of making resources a dynamic property, and it's something I've used myself. I'm just not sure that the example of writing to a shared file is the best example.

@quentinsf
Contributor Author

Hi @chrispalmer - yes, I agree, it isn't immediately obvious why you might want to do this! But my searches would suggest I’m not completely alone.

In our case, we’re building up HDF5 files from experimental sessions. If you’re not familiar with HDF, imagine a ZIP file with a directory-type hierarchy, but the things at the nodes are tables of data rather than generic files. It’s a great format, and you can easily come back and replace or add to a file later on, so it’s arguably more like a filesystem - one that can cope with terabytes of data and can have interesting metadata attached to each table. But it is still a single file, and it gets confused if more than one process tries to update it at once. (I guess you’d get something similar with a big XML or JSON file, where you can’t simply append to the end of it, or, indeed, if you were adding stuff to a ZIP or tar file.)

We’re creating one file for each of 50 experimental sessions, but some of the tables that will go in it take a long time to create (e.g. an hour), and some of the source data needed to create them is still coming in. I want my colleagues to be able to use the data available so far, but to add to the file when new data or new analyses arrive. (“I’ve added the ECG data to each session now if you want to use it.”) Each file is currently a couple of gigabytes, so recreating them is a nuisance, and part of the idea of using this format is that you don’t need to.

It’s quick and easy to find out whether a particular table exists in a file, so I have an HDF5TableTarget which is created from an HDFFile and a table name. When new data comes in, the dependencies all work nicely, and the only problem is ensuring that no two workers are trying to create targets that share the same HDFFile. One way is to only use a single worker, but that also slows me down and stops me updating 10 HDF files with 10 workers.
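A rough, simplified sketch of the idea (the use of h5py and all of the internals here are illustrative; the real target isn't shown in this thread):

```python
import os

import h5py   # assumed HDF5 library; the real implementation may differ
import luigi


class HDF5TableTarget(luigi.Target):
    """Complete once a named table exists inside a given HDF5 file."""

    def __init__(self, hdf_path, table_name):
        self.hdf_path = hdf_path
        self.table_name = table_name

    def exists(self):
        # Checking for a table is cheap: open the file read-only and look
        # the name up, without touching the table data itself.
        if not os.path.exists(self.hdf_path):
            return False
        with h5py.File(self.hdf_path, "r") as f:
            return self.table_name in f
```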

So this seemed like a good solution? Apologies for the long essay, but I hope that explains why I wanted to do it :-)

All the best,
Quentin

@Tarrasch
Contributor

Tarrasch commented Mar 5, 2017

If this is to be merged, it should come with a warning:

Writing to the same file is fragile and difficult; luckily, in 99% of cases you don't need this. If, however, you are convinced that you're required to do this, please read on.

(something like that). Does that sound fair @quentinsf? :)

@quentinsf
Contributor Author

Yes, it certainly makes sense to warn the inexperienced. If people still have concerns, then I'm happy to write it up as a more detailed blog post instead, which people could find by Googling. But I do think it should be online somewhere, since this method allowed me to keep using Luigi rather than finding another tool.

If others would like to keep it, then I suggest the following text. I'll work out how to tweak the PR...

Updating a single file from several tasks is almost always a bad idea, and you
need to be very confident that no other good solution exists before doing this.
If, however, you have no other option, then you will probably at least need to ensure that
no two tasks try to write to the file simultaneously.

By turning 'resources' into a Python property, it can return a value dependent on
the task parameters or other dynamic attributes:
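For example (a simplified sketch; the class name, parameter and resource key here are illustrative rather than the exact code that went into the docs):

```python
import luigi


class SingleFileWriter(luigi.Task):
    output_path = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget(self.output_path)

    @property
    def resources(self):
        # One named resource per destination file. A resource that isn't
        # given an explicit limit in the scheduler config defaults to 1,
        # so no two instances writing to the same path will run at once.
        return {"write_to_%s" % self.output_path: 1}

    def run(self):
        ...  # write/append to the shared file here
```

Since each distinct output path gets its own resource, tasks writing to different files can still run in parallel; only tasks sharing a destination are serialised.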

@Tarrasch
Contributor

Looks good to me now! :)

@visjedi

visjedi commented May 10, 2017

@Tarrasch @quentinsf : Nice contribution.
Do we already have a small testcase to check this with luigi local targets? I shall be happy to contribute if one of you can give me some leads.
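For example, the shape of such a test might be something like this, using local targets (the task, paths and resource key below are placeholders, and it doesn't yet assert that the two runs were actually serialised):

```python
import luigi


class SharedFileWriter(luigi.Task):
    name = luigi.Parameter()

    @property
    def resources(self):
        # Both instances claim the same resource, so the scheduler should
        # never run them at the same time, even with two workers.
        return {"write_to_shared_file": 1}

    def output(self):
        return luigi.LocalTarget("/tmp/%s.done" % self.name)

    def run(self):
        with self.output().open("w") as f:
            f.write("done\n")


if __name__ == "__main__":
    luigi.build([SharedFileWriter(name="a"), SharedFileWriter(name="b")],
                workers=2, local_scheduler=True)
```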

@Tarrasch merged commit ecc96a5 into spotify:master on May 10, 2017