-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Datasets] Support None
partition field name
#28417
[Datasets] Support None
partition field name
#28417
Conversation
None
partitioning field nameNone
partition field name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems reasonable. What happens if you specify not enough labels for directory partitioning? E.g., there are 4 dir labels, but you only specify ["one", "two", "three"]? Do we currently crash or ignore the 4th field?
I'm wondering if we should just require a prefix to be specified, ignoring all dirs after that. Then, DirPartitioning([]) is equivalent to no partitioning.
The program raises an
We could do that. It'd satisfy the ImageNet scenario ( That being said, the prefix approach won't support paths like |
Hmm, it seems bad to raise assertion error due to directory structure. The prefix strategy seems better to me, but I'd defer to @clarkzinzow @c21 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, I think that allowing None
to indicate a non-partition directory is a good way to go; it's more flexible than the prefix which wouldn't support intermixed partitions and non-partitions.
Agreed with @ericl that we should raise a proper error rather than an AssertionError
, any chance we could fix that in this PR?
I changed it in the "Add |
@bveeramani Perfect, thank you! This PR looks good to merge, then. |
Some datasets are stored like ``` root/cat/images/cat1.png root/cat/images/cat2.png ... root/dog/images/dog.png ... ``` To read these datasets, we want to parse the label (e.g., `"cat"` or `"dog"`) but not `"images"`: ``` read_images(..., partitioning=Partitioning("dir", labels=["label", None], base_dir=root])) ``` This PR makes it possible to exclude certain directory partitions (e.g., `"images"`). Signed-off-by: PaulFenton <[email protected]>
Signed-off-by: Balaji Veeramani [email protected]
Why are these changes needed?
Some datasets are stored like
To read these datasets, we want to parse the label (e.g.,
"cat"
or"dog"
) but not"images"
:This PR makes it possible to exclude certain directory partitions (e.g.,
"images"
).Related issue number
See #28413 and #28256
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.