Avoid exponential call to rewrite dataset names when creating `_FrozenDatasets` #953

limdauto · 2021-10-12T22:15:47Z

Description

Fixes #951

Development notes

Checklist

Read the contributing guidelines
Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Updated the documentation to reflect the code changes
Added a description of this change and added my name to the list of supporting contributions in the RELEASE.md file
Added tests to cover my changes

Notice

I acknowledge and agree that, by checking this box and clicking "Submit Pull Request":
I submit this contribution under the Apache 2.0 license and represent that I am entitled to do so on behalf of myself, my employer, or relevant third parties, as applicable.
I certify that (a) this contribution is my original creation and / or (b) to the extent it is not my original creation, I am authorised to submit this contribution on behalf of the original creator(s) or their licensees.
I certify that the use of this contribution as authorised by the Apache 2.0 license does not violate the intellectual property rights of anyone else.

kedro/io/data_catalog.py

deepyaman · 2021-10-13T12:11:31Z

Is this the right place to optimize? Should add_all and add_feed_dict even construct a new FrozenDatasets on each add?

add encapsulates some logic for checking whether or not a dataset already exists in the catalog, which could be extracted into something like _exists_in_catalog. Then, the add/update is pretty straightforward, and add_all/add_feed_dict can just update a plain dictionary as opposed to constructing FrozenDatasets each time.
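A minimal sketch of the suggestion above, assuming a hypothetical `SketchCatalog` class rather than Kedro's real `DataCatalog`. The existence check sits in one helper, and a bulk add merges into a plain dictionary so the read-only view is rebuilt once per call. Apart from the method names mentioned in the comment, every name here (including the `ValueError` and the `SimpleNamespace` stand-in for the frozen view) is illustrative.

```python
from types import SimpleNamespace
from typing import Any, Dict


class SketchCatalog:
    """Hypothetical stand-in for a data catalog; not Kedro's DataCatalog."""

    def __init__(self) -> None:
        self._data_sets: Dict[str, Any] = {}
        self.frozen_builds = 0  # how many times the read-only view was rebuilt
        self._refresh_view()

    def _exists_in_catalog(self, name: str) -> bool:
        return name in self._data_sets

    def _refresh_view(self) -> None:
        # Stand-in for constructing the frozen, attribute-style view.
        self.datasets = SimpleNamespace(**self._data_sets)
        self.frozen_builds += 1

    def add(self, name: str, data_set: Any, replace: bool = False) -> None:
        self.add_all({name: data_set}, replace=replace)

    def add_all(self, data_sets: Dict[str, Any], replace: bool = False) -> None:
        # Validate every name first, then merge into the plain dictionary.
        for name in data_sets:
            if self._exists_in_catalog(name) and not replace:
                raise ValueError(f"Dataset '{name}' has already been registered")
        self._data_sets.update(data_sets)
        self._refresh_view()  # one rebuild per bulk add, not one per dataset


catalog = SketchCatalog()
catalog.add_all({f"ds_{i}": object() for i in range(100)})
print(catalog.frozen_builds)
```

In this sketch the view is built once in the constructor and once for the whole bulk add, regardless of how many datasets the call registers.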

limdauto · 2021-10-21T16:50:31Z

@deepyaman I think the main motivation (and I'm guessing) was that FrozenDatasets is completely immutable, so not sure how this would look in practice:

> add_all/add_feed_dict can just update a plain dictionary as opposed to constructing FrozenDatasets each time.

That said, I think you are right in the sense that the approach proposed here is not the right one. I'm trying another one.

deepyaman · 2021-10-22T18:57:08Z

> @deepyaman I think the main motivation (and I'm guessing) was that FrozenDatasets is completely immutable, so not sure how this would look in practice add_all/add_feed_dict can just update a plain dictionary as opposed to constructing FrozenDatasets each time.

I don't see this as an issue. From a user perspective, you should not be able to update FrozenDatasets, sure; from a framework perspective, if you can merge the dictionaries more smartly before constructing the FrozenDatasets that the user interacts with (under the hood), that's fine. Framework code exists to do this kind of optimization. :)

limdauto · 2021-10-25T13:57:21Z

@deepyaman

> from a framework perspective, if you can merge the dictionaries more smartly before constructing the FrozenDatasets that the user interacts with (under the hood), that's fine

Yep, going to push something along these lines. Thank you.

limdauto · 2021-10-25T23:14:49Z

@deepyaman I have pushed a fix which avoids any extra processing when creating _FrozenDatasets from another _FrozenDatasets. The extra processing of key names now only happens for newly added datasets.
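The idea described above can be sketched roughly as follows. This is an illustration under assumptions, not the code from the PR: `FrozenSketch` is a hypothetical class name, and replacing non-word characters with `__` is assumed for the rewrite step.

```python
import re
from typing import Any, Dict, Union


def _sub_nonword_chars(data_set_name: str) -> str:
    """Replace runs of non-word characters with `__` (assumed replacement)."""
    return re.sub(r"\W+", "__", data_set_name)


class FrozenSketch:
    """Illustrative read-only collection; an assumption, not Kedro's exact code."""

    def __init__(self, *collections: Union["FrozenSketch", Dict[str, Any]]) -> None:
        for collection in collections:
            if isinstance(collection, FrozenSketch):
                # Keys were already rewritten when that instance was built,
                # so they are copied over without further processing.
                self.__dict__.update(collection.__dict__)
            else:
                # Only newly added names go through the rewrite.
                self.__dict__.update(
                    {_sub_nonword_chars(k): v for k, v in collection.items()}
                )

    def __setattr__(self, key: str, value: Any) -> None:
        # Block attribute assignment so the collection stays read-only.
        raise AttributeError("datasets cannot be modified after creation")


first = FrozenSketch({"cars@spark": "spark dataset"})
second = FrozenSketch(first, {"ns.boats": "csv dataset"})
print(sorted(vars(second)))
```

Writing through `self.__dict__` inside the constructor bypasses `__setattr__`, which is what lets the object be populated once and then refuse later changes.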

I benchmarked the call counts to the _sub_nonword_chars method, using the same catalog.

Before: [screenshot not preserved]

After: [screenshot not preserved]
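The original benchmark figures are not available here, but the effect can be illustrated with a small, self-contained count. The sketch below is an assumption-laden toy, not the benchmark from the PR: `CountingRewriter` is a hypothetical counted stand-in for `_sub_nonword_chars`, and the catalog is simulated with plain dictionaries.

```python
import re
from typing import Dict, List


class CountingRewriter:
    """Counted stand-in for the name-rewriting helper (hypothetical)."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, name: str) -> str:
        self.calls += 1
        return re.sub(r"\W+", "__", name)


def rebuild_on_every_add(names: List[str]) -> int:
    """Rewrite all known names again each time one dataset is added."""
    rewrite = CountingRewriter()
    raw: Dict[str, str] = {}
    for name in names:
        raw[name] = name
        _frozen = {rewrite(key): value for key, value in raw.items()}
    return rewrite.calls


def rewrite_only_new(names: List[str]) -> int:
    """Reuse already-rewritten keys and rewrite only the new name."""
    rewrite = CountingRewriter()
    frozen: Dict[str, str] = {}
    for name in names:
        frozen = {**frozen, rewrite(name): name}
    return rewrite.calls


names = [f"namespace.dataset_{i}@pandas" for i in range(200)]
print(rebuild_on_every_add(names), rewrite_only_new(names))
```

For 200 names this toy makes 20100 rewrite calls when everything is rebuilt on each add and 200 calls when only new names are rewritten. How steeply the real catalog's call count grew depends on how Kedro nested these constructions, which this sketch does not model.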

deepyaman

LGTM

deepyaman · 2021-10-26T07:22:14Z

kedro/io/data_catalog.py

@@ -56,6 +56,7 @@

 CATALOG_KEY = "catalog"
 CREDENTIALS_KEY = "credentials"
+WORDS_REGEX_PATTERN = re.compile(r"\W+")


FYI this isn't really necessary AFAIK, as the compiled regex gets cached anyway, and you don't have a ton of regexes here (see https://docs.python.org/3/library/re.html#re.compile)

cc @datajoely just as an FYI

Ah interesting - didn't know that, I still think this is nice for readability
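To make the point in this thread concrete, here is a small sketch comparing the two spellings. Both give the same result; per the linked `re` documentation, the module-level functions cache recently compiled patterns, so the precompiled constant is mostly a readability choice. The `__` replacement string and the function names are assumptions for illustration.

```python
import re

# Module-level compiled pattern, as in the diff above.
WORDS_REGEX_PATTERN = re.compile(r"\W+")


def rewrite_with_module_function(name: str) -> str:
    # re.sub compiles the pattern on first use and reuses the cached result.
    return re.sub(r"\W+", "__", name)


def rewrite_with_compiled_pattern(name: str) -> str:
    # Uses the named, precompiled pattern directly.
    return WORDS_REGEX_PATTERN.sub("__", name)


for name in ("cars@spark", "ns.boats", "plain_name"):
    assert rewrite_with_module_function(name) == rewrite_with_compiled_pattern(name)
    print(name, "->", rewrite_with_compiled_pattern(name))
```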

lorenabalan

Brilliant!! Thank you so much for this. 🙏

merelcht

Nice one! 👏

limdauto mentioned this pull request Oct 12, 2021

Slow startup because of catalog processing #951

Closed

limdauto requested review from idanov, lorenabalan, datajoely and deepyaman October 12, 2021 22:16

limdauto changed the title ~~Make sure _FrozenDataSets get created only once~~ Make sure _FrozenDataset get created only once Oct 12, 2021

limdauto force-pushed the fix/frozen-dataset branch from 5599115 to 2594da1 Compare October 12, 2021 22:37

limdauto changed the title ~~Make sure _FrozenDataset get created only once~~ Only expand _FrozenDataset into a full dictionary on first read Oct 12, 2021

datajoely reviewed Oct 13, 2021

mzjp2 reviewed Oct 13, 2021

Reduce the number of unnecessary rewriting of dataset names when creating frozen datasets

5dd5e00

limdauto force-pushed the fix/frozen-dataset branch from 2594da1 to 5dd5e00 Compare October 25, 2021 23:10

limdauto changed the title ~~Only expand _FrozenDataset into a full dictionary on first read~~ Avoid exponential call to rewrite dataset names when creating _FrozenDataset Oct 25, 2021

Use re.compile

2be4fc5

Add test

6d7bce8

limdauto marked this pull request as ready for review October 25, 2021 23:37

Merge branch 'master' into fix/frozen-dataset

1c2108b

deepyaman approved these changes Oct 26, 2021

deepyaman changed the title ~~Avoid exponential call to rewrite dataset names when creating _FrozenDataset~~ Avoid exponential call to rewrite dataset names when creating _FrozenDatasets Oct 26, 2021

lorenabalan approved these changes Oct 26, 2021

Merge branch 'master' into fix/frozen-dataset

5db72c9

merelcht approved these changes Oct 26, 2021

Merge branch 'master' into fix/frozen-dataset

c8bfc0c

limdauto merged commit 8b06dc6 into master Oct 26, 2021

limdauto deleted the fix/frozen-dataset branch October 26, 2021 10:45

Galileo-Galilei pushed a commit to Galileo-Galilei/kedro that referenced this pull request Feb 19, 2022

[KED-1748] Namespace exclusion for modular pipelines inputs/outputs (k…

40267d5

…edro-org#953)

lvijnck pushed a commit to lvijnck/kedro that referenced this pull request Apr 7, 2022

Avoid exponential call to rewrite dataset names when creating `_Froze…

ae8ef63

…nDatasets` (kedro-org#953) Signed-off-by: Laurens Vijnck <[email protected]>

deepyaman mentioned this pull request May 2, 2022

Improve performance when many datasets are missing kedro-org/kedro-viz#832

Merged

5 tasks
