Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DataPipe] pipe opener #403

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

[DataPipe] pipe opener #403

wants to merge 2 commits into from

Conversation

tmbdev
Copy link
Contributor

@tmbdev tmbdev commented May 13, 2022

This PR adds a filter that works similarly to FileOpener but uses subprocesses for opening files. The filter also opens local files.

The use of subprocesses allows for easy asynchronous I/O and avoids having to use third party libraries to access object stores and cloud storage.

Files are specified as URLs and the subprocess command line is constructed based on the URL schema.

Out of the box, PipeOpener permits access to local files, S3, GCS, HTTP, HTTPS, and AIStore. Other URL schemas can be supported simply by adding "schema=commandline" arguments to the pipeline. The generic "pipe:" schema allows arbitrary commands to be used as data sources.

PipeOpener is particularly useful with WebDataset, since WebDataset benefits greatly from asynchronous I/O and location independent datasets. But PipeOpener can be used with any remote files and provides a simple, portable way of accessing S3 and other such servers.

FileCache is a useful complement for PipeOpener, allowing the construction of pipelines that incrementally download datasets and still provide local file-based access to the dataset.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 13, 2022
@VitalyFedyunin VitalyFedyunin changed the title pipe opener [DataPipe] pipe opener May 19, 2022
Copy link
Contributor

@VitalyFedyunin VitalyFedyunin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Putting placeholder review to return back later. Several concerns:

  1. We are not controlling pool size, meaning any buffer type DataPipe (ex shuffle) will spawn crazy amount of subprocesses.
  2. Popen is error prone in terms of security (ex: pipe: rm -rf /)
  3. It will be hard to implement snapshotting as we would need to create synchronization points somehow.

@tmbdev
Copy link
Contributor Author

tmbdev commented May 20, 2022

(1) I'm not sure why you think that. You can watch the subprocesses being spawned one at a time by using the verbose flag.

(2) Other parts of torch aren't safe against malicious arguments. Furthermore, use of the class is under developer control.

In any case, how about I add an allow_pipe flag to make this feature more explicit to users? The pipe: feature really is very useful in practice, allowing data access with scripts like "pipe:ssh host cat /..." etc.

(3) If you want to support snapshotting of either compressed or streaming files, snapshots always need to restart opening and reading the current file at the start in the opener class. It's the tar archive reader that needs to deal with restarting in the middle. It can do that by keeping a set of files in the current stream it has already returned and skipping them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants