
Add section to the patterns doc about avoiding simultaneous updates to single file #2040

Merged
merged 2 commits into spotify:master on May 10, 2017

Conversation

quentinsf
Contributor

Description

Adds a section to the 'Luigi patterns' documentation about how to avoid multiple workers trying to write to a single destination simultaneously.

Motivation and Context

I had created tasks and targets that would create new tables in an existing HDF5 file. The problem was that if two simultaneous processes tried to write to the same file, it would be corrupted. But I couldn't find examples of how to avoid this, except by using workers=1.

Finally, I came across daveFNbuck's example on making 'resources' dynamic, at the end of #1362, which was a perfect and neat solution to my file problem, and thought something along those lines deserved a place in the documentation: it would have saved me many, many hours...

Have you tested this? If so, how?

Have built the docs with tox -e docs and the output looks good.

@mention-bot

@quentinsf, thanks for your PR! By analyzing the history of the files in this pull request, we identified @ulzha, @dhurlburtusa and @Tarrasch to be potential reviewers.

@chrispalmer

@quentinsf I'm curious as to why you need multiple tasks writing to the same file. Maybe I'm missing something, but that feels like a strange construction to me.

I like the idea of making resources a dynamic property, and it's something I've used myself. I'm just not sure that the example of writing to a shared file is the best example.

@quentinsf
Contributor Author

Hi @chrispalmer - yes, I agree, it isn't immediately obvious why you might want to do this! But my searches would suggest I’m not completely alone.

In our case, we’re building up HDF5 files from experimental sessions. If you’re not familiar with HDF, imagine a ZIP file with a directory-type hierarchy, but the things at the nodes are tables of data rather than generic files. It’s a great format, and you can easily come back and replace or add to a file later on, so it’s arguably more like a filesystem - one that can cope with terabytes of data and can have interesting metadata attached to each table. But it is still a single file, and it gets confused if more than one process tries to update it at once. (I guess you’d get something similar with a big XML or JSON file, where you can’t simply append to the end of it, or, indeed, if you were adding stuff to a ZIP or tar file.)

We’re creating one file for each of 50 experimental sessions, but some of the tables that will go in it take a long time to create (e.g. an hour), and some of the source data needed to create them is still coming in. I want my colleagues to be able to use the data available so far, but to add to the file when new data or new analyses arrive. (“I’ve added the ECG data to each session now if you want to use it.”) Each file is currently a couple of gigabytes, so recreating them is a nuisance, and part of the idea of using this format is that you don’t need to.

It’s quick and easy to find out whether a particular table exists in a file, so I have an HDF5TableTarget which is created from an HDFFile and a table name. When new data comes in, the dependencies all work nicely, and the only problem is ensuring that no two workers are trying to create targets that share the same HDFFile. One way is to only use a single worker, but that also slows me down and stops me updating 10 HDF files with 10 workers.
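A rough, simplified sketch of the idea (the use of h5py and all of the internals here are illustrative; the real target isn't shown in this thread):

```python
import os

import h5py   # assumed HDF5 library; the real implementation may differ
import luigi


class HDF5TableTarget(luigi.Target):
    """Complete once a named table exists inside a given HDF5 file."""

    def __init__(self, hdf_path, table_name):
        self.hdf_path = hdf_path
        self.table_name = table_name

    def exists(self):
        # Checking for a table is cheap: open the file read-only and look
        # the name up, without touching the table data itself.
        if not os.path.exists(self.hdf_path):
            return False
        with h5py.File(self.hdf_path, "r") as f:
            return self.table_name in f
```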

So this seemed like a good solution? Apologies for the long essay, but I hope that explains why I wanted to do it :-)

All the best,
Quentin

@Tarrasch
Contributor

Tarrasch commented Mar 5, 2017

If this is to be merged, it should come with a warning:

Writing to the same file is fragile and difficult; luckily, in 99% of cases you don't need this. If, however, you are convinced that you're required to do this, please read on.

(something like that). Does that sound fair @quentinsf? :)

@quentinsf
Contributor Author

Yes, it certainly makes sense to warn the inexperienced. If people still have concerns, then I'm happy to write it up as a more detailed blog post instead, which people could find by Googling. But I do think it should be online somewhere, since this method allowed me to keep using Luigi rather than finding another tool.

If others would like to keep it, then I suggest the following text. I'll work out how to tweak the PR...

Updating a single file from several tasks is almost always a bad idea, and you
need to be very confident that no other good solution exists before doing this.
If, however, you have no other option, then you will probably at least need to ensure that
no two tasks try to write to the file simultaneously.

By turning 'resources' into a Python property, it can return a value dependent on
the task parameters or other dynamic attributes:
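For example (a simplified sketch; the class name, parameter and resource key here are illustrative rather than the exact code that went into the docs):

```python
import luigi


class SingleFileWriter(luigi.Task):
    output_path = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget(self.output_path)

    @property
    def resources(self):
        # One named resource per destination file. A resource that isn't
        # given an explicit limit in the scheduler config defaults to 1,
        # so no two instances writing to the same path will run at once.
        return {"write_to_%s" % self.output_path: 1}

    def run(self):
        ...  # write/append to the shared file here
```

Since each distinct output path gets its own resource, tasks writing to different files can still run in parallel; only tasks sharing a destination are serialised.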

@Tarrasch
Contributor

Looks good to me now! :)

@visjedi

visjedi commented May 10, 2017

@Tarrasch @quentinsf : Nice contribution.
Do we already have a small testcase to check this with luigi local targets? I shall be happy to contribute if one of you can give me some leads.
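For example, the shape of such a test might be something like this, using local targets (the task, paths and resource key below are placeholders, and it doesn't yet assert that the two runs were actually serialised):

```python
import luigi


class SharedFileWriter(luigi.Task):
    name = luigi.Parameter()

    @property
    def resources(self):
        # Both instances claim the same resource, so the scheduler should
        # never run them at the same time, even with two workers.
        return {"write_to_shared_file": 1}

    def output(self):
        return luigi.LocalTarget("/tmp/%s.done" % self.name)

    def run(self):
        with self.output().open("w") as f:
            f.write("done\n")


if __name__ == "__main__":
    luigi.build([SharedFileWriter(name="a"), SharedFileWriter(name="b")],
                workers=2, local_scheduler=True)
```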

@Tarrasch merged commit ecc96a5 into spotify:master on May 10, 2017