Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove name ambiguity in dataset configuration #1111

Merged
merged 1 commit into from
Apr 26, 2024
Merged

Remove name ambiguity in dataset configuration #1111

merged 1 commit into from
Apr 26, 2024

Conversation

shaahji
Copy link
Contributor

@shaahji shaahji commented Apr 25, 2024

Remove name ambiguity in dataset configuration

Dataset configuration i.e. data_configs are represeneted as objects in config and so have two name entries -

  • the key of the object itself
  • DataConfig::name field

Use cases around the code use either/or and expect both these names to be identifical with no enforcement during validation. Removing the duplication by making the data_configs a list of objects so the key isn't required anymore. Updated the code to use DataConfig::name for all look-ups.

Note: Pending document update.

Checklist before requesting a review

  • Add unit tests for this change.
  • Make sure all tests can pass.
  • Update documents if necessary.
  • Lint and apply fixes to your code by running lintrunner -a
  • Is this a user-facing change? If yes, give a description of this change to be included in the release notes.
  • Is this PR including examples changes? If yes, please remember to update example documentation in a follow-up PR.

(Optional) Issue link

Dataset configuration i.e. data_configs are represeneted as objects
in config and so have two name entries -
  - the key of the object itself
  - DataConfig::name field

Use cases around the code use either/or and expect both these names
to be identifical with no enforcement during validation. Removing
the duplication by making the data_configs a list of objects so the
key isn't required anymore. Updated the code to use DataConfig::name
for all look-ups.
}
"data_configs": [
{ "name": "dataset_1", ...},
{ "name": "dataset_2", ...},
Copy link
Collaborator

@guotuofeng guotuofeng Apr 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@trajepl , should we keep the key or keep the name?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better to direcly remove the name field from data_configs, and keep the original "data_configs": {'xx': {}...}

"pad_to_max_len": false
}
"data_configs": [{
"name": "oasst1_train",
Copy link
Collaborator

@guotuofeng guotuofeng Apr 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using the name instead of key would have the possibility that there are multiple dataset have the same name. we need dedup them in our code.

For me, it seems we could leverage key since the unique key is guaranteed by json loading itself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dedup them in our code

Yes, but this happens once during validation. Not necessarily a hot code path.

Having DataConfig::name makes it easier to identify object when debugging/printing. Having the key doesn't provide any such advantage.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jambayk, @trajepl, what's your opinion?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hitesh and I previously discussed the two options. I was originally for removing the name field and keeping the dict key. But the name for debugging purpose sounds useful too.

I think it the decision depends on whether the debugging usefulness outweights the benefits of keeping the keys (1. consistent with other fields like systems, evaluators, etc, 2.guaranteed unique keys 3. direct index into data_configs)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jambayk, I will let you and @trajepl to make decision since I am ok with either approach.

@shaahji shaahji merged commit b01f806 into main Apr 26, 2024
35 checks passed
@shaahji shaahji deleted the shaahji/dc1 branch April 26, 2024 01:11
DavitGrigoryan132 pushed a commit to DavitGrigoryan132/Olive that referenced this pull request Aug 14, 2024
## Remove name ambiguity in dataset configuration
Dataset configuration i.e. data_configs are represeneted as objects in
config and so have two name entries -
  - the key of the object itself
  - DataConfig::name field

Use cases around the code use either/or and expect both these names to
be identifical with no enforcement during validation. Removing the
duplication by making the data_configs a list of objects so the key
isn't required anymore. Updated the code to use DataConfig::name for all
look-ups.

Note: Pending document update.

## Checklist before requesting a review
- [ ] Add unit tests for this change.
- [x] Make sure all tests can pass.
- [ ] Update documents if necessary.
- [x] Lint and apply fixes to your code by running `lintrunner -a`
- [ ] Is this a user-facing change? If yes, give a description of this
change to be included in the release notes.
- [ ] Is this PR including examples changes? If yes, please remember to
update [example
documentation](https://github.com/microsoft/Olive/blob/main/docs/source/examples.md)
in a follow-up PR.

## (Optional) Issue link
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants