Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid exponential call to rewrite dataset names when creating _FrozenDatasets #953

Merged
merged 6 commits into from
Oct 26, 2021

Conversation

limdauto
Copy link
Contributor

@limdauto limdauto commented Oct 12, 2021

Description

Fixes #951

Development notes

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change and added my name to the list of supporting contributions in the RELEASE.md file
  • Added tests to cover my changes

Notice

  • I acknowledge and agree that, by checking this box and clicking "Submit Pull Request":

  • I submit this contribution under the Apache 2.0 license and represent that I am entitled to do so on behalf of myself, my employer, or relevant third parties, as applicable.

  • I certify that (a) this contribution is my original creation and / or (b) to the extent it is not my original creation, I am authorised to submit this contribution on behalf of the original creator(s) or their licensees.

  • I certify that the use of this contribution as authorised by the Apache 2.0 license does not violate the intellectual property rights of anyone else.

@limdauto limdauto changed the title Make sure _FrozenDataSets get created only once Make sure _FrozenDataset get created only once Oct 12, 2021
@limdauto limdauto changed the title Make sure _FrozenDataset get created only once Only expand _FrozenDataset into a full dictionary on first read Oct 12, 2021
kedro/io/data_catalog.py Outdated Show resolved Hide resolved
kedro/io/data_catalog.py Outdated Show resolved Hide resolved
@deepyaman
Copy link
Member

Is this the right place to optimize? Should add_all and add_feed_dict even construct a new FrozenDatasets on each add?

add encapsulates some logic around checking whether or not dataset exists in catalog, which could be encapsulated into some _exists_in_catalog. Then, the add/update is pretty straightforward, and add_all/add_feed_dict can just update a plain dictionary as opposed to constructing FrozenDatasets each time.

@limdauto
Copy link
Contributor Author

@deepyaman I think the main motivation (and I'm guessing) was that FrozenDatasets is completely immutable, so not sure how this would look in practice add_all/add_feed_dict can just update a plain dictionary as opposed to constructing FrozenDatasets each time.

That said, I think you are right in the sense that the approach proposed here is not the right one. I'm trying another one.

@deepyaman
Copy link
Member

@deepyaman I think the main motivation (and I'm guessing) was that FrozenDatasets is completely immutable, so not sure how this would look in practice add_all/add_feed_dict can just update a plain dictionary as opposed to constructing FrozenDatasets each time.

I don't see this as an issue. From a user perspective, you should not be able to update FrozenDatasets, sure; from a framework perspective, if you can merge the dictionaries more smartly before constructing the FrozenDatasets that the user interacts with (under the hood), that's fine. Framework code exists to do this kind of optimization. :)

@limdauto
Copy link
Contributor Author

@deepyaman

from a framework perspective, if you can merge the dictionaries more smartly before constructing the FrozenDatasets that the user interacts with (under the hood), that's fine

Yep, going to push something along this line. Thank you

@limdauto limdauto changed the title Only expand _FrozenDataset into a full dictionary on first read Avoid exponential call to rewrite dataset names when creating _FrozenDataset Oct 25, 2021
@limdauto
Copy link
Contributor Author

@deepyaman I have pushed a fix which avoid any extra processing when creating _FrozenDatasets from another _FrozenDatasets. The extra processing of key name only happens now for newly added datasets.

Did some benchmark for call counts to the _sub_nonword_chars method. Same catalog.

Before:
Screenshot 2021-10-26 at 00 07 32

After
Screenshot 2021-10-26 at 00 08 12

@limdauto limdauto marked this pull request as ready for review October 25, 2021 23:37
Copy link
Member

@deepyaman deepyaman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -56,6 +56,7 @@

CATALOG_KEY = "catalog"
CREDENTIALS_KEY = "credentials"
WORDS_REGEX_PATTERN = re.compile(r"\W+")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI this isn't really necessary AFAIK, as the compiled regex gets cached anyway, and you don't have a ton of regexes here (see https://docs.python.org/3/library/re.html#re.compile)

cc @datajoely just as an FYI

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah interesting - didn't know that, I still think this is nice for readability

@deepyaman deepyaman changed the title Avoid exponential call to rewrite dataset names when creating _FrozenDataset Avoid exponential call to rewrite dataset names when creating _FrozenDatasets Oct 26, 2021
Copy link
Contributor

@lorenabalan lorenabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Brilliant!! Thank you so much for this. 🙏

Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice one! 👏

@limdauto limdauto merged commit 8b06dc6 into master Oct 26, 2021
@limdauto limdauto deleted the fix/frozen-dataset branch October 26, 2021 10:45
Galileo-Galilei pushed a commit to Galileo-Galilei/kedro that referenced this pull request Feb 19, 2022
lvijnck pushed a commit to lvijnck/kedro that referenced this pull request Apr 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Slow startup because of catalog processing
6 participants