Gracefully handle non-ASCII chars in dataset names 🌶️ #487

Merged on Sep 10, 2020 (17 commits)

Member

@deepyaman deepyaman commented Aug 16, 2020

Description

  • Simpler, yet more robust, handling of the "friendly" conversion of data set names
  • Avoids clashes between datasets such as "perchè" and "perché", or even "jalapeño" and "jalape@o"
  • On those last two examples: if you do want special handling of accents, it should be explicit and live elsewhere, not here
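
To illustrate the clash described above, here is a minimal sketch. The ASCII-only pattern is an assumption made for comparison and is not taken from this PR; the `\W+` pattern is the one quoted later in this conversation. In Python 3, `\W` on a `str` pattern is Unicode-aware, so precomposed accented letters count as word characters and are left alone.

```python
import re
import unicodedata

# Assumed ASCII-only pattern, for illustration only (not taken from this PR).
ASCII_ONLY = re.compile(r"[^0-9a-zA-Z_]+")
# Unicode-aware pattern: on a str pattern, \W does not match accented letters.
UNICODE_AWARE = re.compile(r"\W+")

# Normalize to NFC so each accented letter is a single precomposed code point.
names = [
    unicodedata.normalize("NFC", name)
    for name in ["jalapeño", "jalape@o", "perchè", "perché"]
]

# Map each original name to its "friendly" form under both patterns.
ascii_friendly = {name: ASCII_ONLY.sub("__", name) for name in names}
unicode_friendly = {name: UNICODE_AWARE.sub("__", name) for name in names}

# Count how many distinct friendly names survive under each pattern.
print(len(set(ascii_friendly.values())), "distinct names with the ASCII-only pattern")
print(len(set(unicode_friendly.values())), "distinct names with the Unicode-aware pattern")
```

With the ASCII-only pattern the four names collapse into fewer friendly names, while the Unicode-aware pattern keeps all four distinct. Note that names stored in decomposed (NFD) form would behave differently, since combining marks are not word characters.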

Development notes

Dropped a Python 3.5 Windows relic allowing only ASCII files

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change and added my name to the list of supporting contributions in the RELEASE.md file
  • Added tests to cover my changes

Notice

  • I acknowledge and agree that, by checking this box and clicking "Submit Pull Request":

  • I submit this contribution under the Apache 2.0 license and represent that I am entitled to do so on behalf of myself, my employer, or relevant third parties, as applicable.

  • I certify that (a) this contribution is my original creation and / or (b) to the extent it is not my original creation, I am authorised to submit this contribution on behalf of the original creator(s) or their licensees.

  • I certify that the use of this contribution as authorised by the Apache 2.0 license does not violate the intellectual property rights of anyone else.

@deepyaman deepyaman marked this pull request as ready for review August 16, 2020 19:30
@deepyaman deepyaman requested a review from idanov as a code owner August 16, 2020 19:30
@deepyaman deepyaman changed the title: Can you handle "jalapeño"? 🌶️ → Gracefully handle non-ASCII chars in dataset names 🌶️ (Aug 16, 2020)
@deepyaman deepyaman changed the title: Gracefully handle non-ASCII chars in dataset names 🌶️ → Gracefully handle non-ASCII chars in dataset names 🌶️ (Aug 16, 2020)
Contributor

@limdauto limdauto left a comment

Nice catch! Could you mention this in the release note as well?

Contributor

@921kiyo 921kiyo left a comment

LGTM! It's also worth noting this in the release note :)

Review thread on tests/io/test_data_catalog.py (outdated, resolved)
Review thread on kedro/io/data_catalog.py (outdated, resolved)
@deepyaman
Member Author

One more question: Would it make sense to move this into a function?

I came across this because 0.16.2 broke my code, so I'm handling it like so: deepyaman/kedro-accelerator@86897b0

Having it as a function would make changes to my (and similar) code less likely, although I'm not sure this should change that frequently anyway.

@921kiyo
Contributor

921kiyo commented Aug 18, 2020

Yeah, you can wrap datasets = {re.sub(r"\W+", "__", key): value for key, value in datasets.items()} as a private function, or leave it as is. Since it is one line, I'm happy with either.

@deepyaman
Member Author

Yeah, you can wrap datasets = {re.sub(r"\W+", "__", key): value for key, value in datasets.items()} as a private function, or leave it as is. Since it is one line, I'm happy with either.

Sorry, didn't do it last night as promised. I'll do it shortly

@921kiyo 921kiyo changed the title: Gracefully handle non-ASCII chars in dataset names 🌶️ → [DO NOT MERGE THIS YET] Gracefully handle non-ASCII chars in dataset names 🌶️ (Aug 21, 2020)
@deepyaman deepyaman changed the title: [DO NOT MERGE THIS YET] Gracefully handle non-ASCII chars in dataset names 🌶️ → Gracefully handle non-ASCII chars in dataset names 🌶️ (Sep 10, 2020)
@deepyaman deepyaman self-assigned this Sep 10, 2020
@921kiyo 921kiyo merged commit 93e4cc3 into master Sep 10, 2020