-
Notifications
You must be signed in to change notification settings - Fork 168
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove name ambiguity in dataset configuration #1111
Conversation
Dataset configuration i.e. data_configs are represeneted as objects in config and so have two name entries - - the key of the object itself - DataConfig::name field Use cases around the code use either/or and expect both these names to be identifical with no enforcement during validation. Removing the duplication by making the data_configs a list of objects so the key isn't required anymore. Updated the code to use DataConfig::name for all look-ups.
} | ||
"data_configs": [ | ||
{ "name": "dataset_1", ...}, | ||
{ "name": "dataset_2", ...}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@trajepl , should we keep the key or keep the name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is better to direcly remove the name field from data_configs, and keep the original "data_configs": {'xx': {}...}
"pad_to_max_len": false | ||
} | ||
"data_configs": [{ | ||
"name": "oasst1_train", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using the name instead of key would have the possibility that there are multiple dataset have the same name. we need dedup them in our code.
For me, it seems we could leverage key since the unique key is guaranteed by json loading itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dedup them in our code
Yes, but this happens once during validation. Not necessarily a hot code path.
Having DataConfig::name makes it easier to identify object when debugging/printing. Having the key doesn't provide any such advantage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hitesh and I previously discussed the two options. I was originally for removing the name field and keeping the dict key. But the name for debugging purpose sounds useful too.
I think it the decision depends on whether the debugging usefulness outweights the benefits of keeping the keys (1. consistent with other fields like systems, evaluators, etc, 2.guaranteed unique keys 3. direct index into data_configs)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
## Remove name ambiguity in dataset configuration Dataset configuration i.e. data_configs are represeneted as objects in config and so have two name entries - - the key of the object itself - DataConfig::name field Use cases around the code use either/or and expect both these names to be identifical with no enforcement during validation. Removing the duplication by making the data_configs a list of objects so the key isn't required anymore. Updated the code to use DataConfig::name for all look-ups. Note: Pending document update. ## Checklist before requesting a review - [ ] Add unit tests for this change. - [x] Make sure all tests can pass. - [ ] Update documents if necessary. - [x] Lint and apply fixes to your code by running `lintrunner -a` - [ ] Is this a user-facing change? If yes, give a description of this change to be included in the release notes. - [ ] Is this PR including examples changes? If yes, please remember to update [example documentation](https://github.com/microsoft/Olive/blob/main/docs/source/examples.md) in a follow-up PR. ## (Optional) Issue link
Remove name ambiguity in dataset configuration
Dataset configuration i.e. data_configs are represeneted as objects in config and so have two name entries -
Use cases around the code use either/or and expect both these names to be identifical with no enforcement during validation. Removing the duplication by making the data_configs a list of objects so the key isn't required anymore. Updated the code to use DataConfig::name for all look-ups.
Note: Pending document update.
Checklist before requesting a review
lintrunner -a
(Optional) Issue link