Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

followup changes to allow unsupported datasets #261

Merged
merged 2 commits into from
Apr 24, 2024

Conversation

tianyu-l
Copy link
Contributor

@tianyu-l tianyu-l commented Apr 24, 2024

Stack from ghstack (oldest at bottom):

  1. Remove build_dataloader_fn as we only use HF data loading for now. And it helps allow unsupported dataset if user specify a dataset_path
  2. If user uses an unsupported dataset without specifying dataset_path, we should still throw.

tianyu-l added a commit that referenced this pull request Apr 24, 2024
ghstack-source-id: ce288a19c67fccd0751c6fd92ae14a161da8bfa3
Pull Request resolved: #261
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 24, 2024
raise ValueError(
f"Dataset {dataset_name} is not supported. "
f"Supported datasets are: {list(_supported_datasets.keys())}."
)
Copy link
Contributor

@fegin fegin Apr 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also error out if users pass both a supported dataset_name and dataset_path?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think this was a decision discussed long ago. The point is for any dataset, we'd like to offer two ways to use data, one is to download from HF hub, the other is to use local files if a path is provided. In the latter case, user still needs to be clear about the correspondence between dataset name and dataset path. Let's still keep this to make it less error-prone.

1. Remove `build_dataloader_fn` as we only use HF data loading for now. And it helps allow unsupported dataset if user specify a `dataset_path`
2. If user uses an unsupported `dataset` without specifying `dataset_path`, we should still throw.

[ghstack-poisoned]
tianyu-l added a commit that referenced this pull request Apr 24, 2024
ghstack-source-id: 34b380d251e0a80ac5328fdaeb33a1e488f9c735
Pull Request resolved: #261
@tianyu-l tianyu-l merged commit 9ea236b into gh/tianyu-l/10/base Apr 24, 2024
4 checks passed
tianyu-l added a commit that referenced this pull request Apr 24, 2024
ghstack-source-id: 34b380d251e0a80ac5328fdaeb33a1e488f9c735
Pull Request resolved: #261
@tianyu-l tianyu-l deleted the gh/tianyu-l/10/head branch April 24, 2024 18:16
tianyu-l added a commit to tianyu-l/torchtitan_intern24 that referenced this pull request Aug 16, 2024
ghstack-source-id: 34b380d251e0a80ac5328fdaeb33a1e488f9c735
Pull Request resolved: pytorch#261
philippguevorguian pushed a commit to YerevaNN/YNNtitan that referenced this pull request Aug 17, 2024
ghstack-source-id: 34b380d251e0a80ac5328fdaeb33a1e488f9c735
Pull Request resolved: pytorch#261
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Meta Open Source bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants