Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

targets: Implement temporary_path decorator #1909

Merged
merged 1 commit into from
Nov 9, 2016

Conversation

Tarrasch
Copy link
Contributor

@Tarrasch Tarrasch commented Nov 3, 2016

Description

As has been detailed in #1519, there are many ways to do
temporary files in luigi. The use case for temporary files are always
about achieving atomic writes.

This PR finally describes the problem in the luigi documentation and
tries to interlink essential parts. Also it implements a decorator
that is meant to be able to unify many of the existing implementations,
and as a new, clean and recommended way to do temporary files and atomic
moves using luigi.

Have you tested this? If so, how?

There are included unit tests, although they only run with a lot mocked. I've run something similar in production for over half a year now. I intend to also try this a little bit (at least for hdfs+local disk)

@mention-bot
Copy link

@Tarrasch, thanks for your PR! By analyzing the history of the files in this pull request, we identified @jcrobak, @freider and @gpoulin to be potential reviewers.

@Tarrasch
Copy link
Contributor Author

Tarrasch commented Nov 3, 2016

@ulzha, do you mind reading the docs I included in this. Maybe there's been other experiences or best practices from Spotify you want to share?

@erikbern
Copy link
Contributor

erikbern commented Nov 3, 2016

looks reasonable, although consider using contextlib.contextmanager for this

@Tarrasch
Copy link
Contributor Author

Tarrasch commented Nov 4, 2016

@erikbern, I tried but contextlib.contextmanager doesn't really work here. The reason is because this contexmanager is not as simple as init-run-finalize. temporary_path() will finalize conditionally, depending on if there was an exception raised, or not.

@erikbern, does the docs section look good?

@Tarrasch
Copy link
Contributor Author

Tarrasch commented Nov 8, 2016

@ulzha @dlstadther @daveFNbuck, mind reviewing this? Or at least say you have not had big problems with atomic writes like I've had.

@daveFNbuck
Copy link
Contributor

I like the documentation and agree with the importance of atomic writes, but I've mostly just used core code for it and not had to write any myself so I haven't had any big problems. I'm still in favor of anything that makes it easier though.

if exc_type is None:
# There were no exceptions
self.target.fs.rename_dont_move(self._temp_path, self.target.path)
return False # False means we don't suppress the exception
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we clean up the temporary file on failure?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've thought about this actually. And I think it's best to not clean them up. I usually find the temporary files to be invaluable traces left when debugging. I'm however up for in an upcoming patch add a kwarg like delete_on_failure=False.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that's reasonable. I have a daily cron job to clean up all the leftover tmp files from MRs so HDFS doesn't fill up, so I'm mostly ok either way.

self._temp_path = '{}-luigi-tmp-{:010}{}'.format(
self.target.path.rstrip('/'),
num,
'/' if self.target._has_trailing_slash() else '')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way you're handling slashes here could cause problems with targets on Windows machines.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch!

Copy link
Collaborator

@dlstadther dlstadther left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside from @daveFNbuck 's comment about deleting tmp file parts upon failure, nothing stands out to me as odd, lacking, or error prone. Personally, I don't have issues with file writes not completing (perhaps my output isn't as large/complex as others').

with target.temporary_path() as tmp_path:
assert re.match(pattern, tmp_path)
self.fs.rename_dont_move.assert_called_once_with(tmp_path, target.path)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To save you the trouble of reading the Travis report, your only issue was a flake8 complaint. You included too may \n here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dooooh

As has been detailed in spotify#1519, there are many ways to do
temporary files in luigi. The use case for temporary files are always
about achieving atomic writes.

This PR finally describes the problem in the luigi documentation and
tries to interlink essential parts. Also it implements a decorator
that is meant to be able to unify many of the existing implementations,
and as a new, clean and recommended way to do temporary files and atomic
moves using luigi.
@Tarrasch Tarrasch merged commit de9230f into spotify:master Nov 9, 2016
@Tarrasch Tarrasch deleted the add-temporary-file-cm branch November 9, 2016 02:42
This was referenced Jun 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants