Skip to content
This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Commit

Permalink
fix race condition when extracting files with cached_path (#5227)
Browse files Browse the repository at this point in the history
* fix race condition when extracting files with cached_path

* add warning when directory already exists
  • Loading branch information
epwalsh authored May 27, 2021
1 parent d662977 commit 12155c4
Show file tree
Hide file tree
Showing 2 changed files with 19 additions and 1 deletion.
2 changes: 2 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

- When `PretrainedTransformerIndexer` folds long sequences, it no longer loses the information from token type ids.
- Fixed documentation for `GradientDescentTrainer.cuda_device`.
- Fixed the potential for a race condition with `cached_path()` when extracting archives. Although the race condition
is still possible if used with `force_extract=True`.
- Fixed `wandb` callback to work in distributed training.


Expand Down
18 changes: 17 additions & 1 deletion allennlp/common/file_utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -247,6 +247,10 @@ def cached_path(
force_extract : `bool`, optional (default = `False`)
If `True` and the file is an archive file, it will be extracted regardless
of whether or not the extracted directory already exists.
!!! Warning
Use this flag with caution! This can lead to race conditions if used
from multiple processes on the same file.
"""
if cache_dir is None:
cache_dir = CACHE_DIRECTORY
Expand Down Expand Up @@ -325,12 +329,24 @@ def cached_path(

if extraction_path is not None:
# If the extracted directory already exists (and is non-empty), then no
# need to extract again unless `force_extract=True`.
# need to create a lock file and extract again unless `force_extract=True`.
if os.path.isdir(extraction_path) and os.listdir(extraction_path) and not force_extract:
return extraction_path

# Extract it.
with FileLock(extraction_path + ".lock"):
# Check again if the directory exists now that we've acquired the lock.
if os.path.isdir(extraction_path) and os.listdir(extraction_path):
if force_extract:
logger.warning(
"Extraction directory for %s (%s) already exists, "
"overwriting it since 'force_extract' is 'True'",
url_or_filename,
extraction_path,
)
else:
return extraction_path

logger.info("Extracting %s to %s", url_or_filename, extraction_path)
shutil.rmtree(extraction_path, ignore_errors=True)

Expand Down

0 comments on commit 12155c4

Please sign in to comment.