[DataCatalog]: `add_feed_dict()` performance bottleneck #3912

ElenaKhaustova · 2024-06-03T11:11:58Z

Description

The current implementation of add_feed_dict() is a performance bottleneck: it calls the add() method for every entry, and each add() call rebuilds _FrozenDatasets by copying the existing collection. Adding N datasets therefore costs O(N^2) overall, which causes unnecessary slowdowns, especially for catalogs with many entries.

We propose a more efficient approach that updates the datasets collection directly, without copying _FrozenDatasets structures.

Context

kedro/kedro/io/data_catalog.py

Line 694 in 27f5405

self.add(dataset_name, dataset, replace)

kedro/kedro/io/data_catalog.py

Line 626 in 27f5405

self.datasets = _FrozenDatasets(self.datasets, {dataset_name: dataset})

kedro/kedro/io/data_catalog.py

Line 108 in 27f5405

for collection in datasets_collections:
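Read together, the three references above describe a copy-on-add pattern. The following is a minimal sketch of that pattern, not Kedro's code: `CopyOnAddCollection` and `count_copies` are hypothetical names, and the class only counts how many entries get copied when a collection is rebuilt on every addition.

```python
class CopyOnAddCollection:
    """Hypothetical stand-in for a collection that is rebuilt on every add."""

    copied = 0  # total number of entries copied across all constructions

    def __init__(self, *collections):
        self.items = {}
        for collection in collections:
            # Accept either another collection or a plain dict.
            if isinstance(collection, CopyOnAddCollection):
                source = collection.items
            else:
                source = collection
            CopyOnAddCollection.copied += len(source)
            self.items.update(source)


def count_copies(n_adds):
    """Count entries copied while adding n_adds items one at a time."""
    CopyOnAddCollection.copied = 0
    collection = CopyOnAddCollection({})
    for i in range(n_adds):
        # Each add copies everything stored so far plus the new entry.
        collection = CopyOnAddCollection(collection, {f"new_{i}": i})
    return CopyOnAddCollection.copied
```

In this sketch the i-th addition copies i entries, so N additions copy N(N+1)/2 entries in total, which is where the quadratic growth comes from.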

Steps to Reproduce

from time import monotonic

from kedro.io.data_catalog import _FrozenDatasets


f = _FrozenDatasets({f"k_{i}": i for i in range(100)})
now = monotonic()
for i in range(10000):
    f = _FrozenDatasets(f, {f"new_{i}": i})
print(monotonic() - now)
# 0.191298125020694

f = _FrozenDatasets({f"k_{i}": i for i in range(100)})
now = monotonic()
for i in range(100000):
    f = _FrozenDatasets(f, {f"new_{i}": i})
print(monotonic() - now)
# 21.588158333004685

f = _FrozenDatasets({f"k_{i}": i for i in range(100)})
now = monotonic()
for i in range(120000):
    f = _FrozenDatasets(f, {f"new_{i}": i})
print(monotonic() - now)
# 33.06306695801322
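As a rough check of the claimed complexity, the timings above can be compared with what linear and quadratic growth would predict. The helper below is only an illustration: `scaling` is a hypothetical name, the numbers are copied from the runs above, and the 100 initial entries are ignored.

```python
# Timings in seconds, copied from the runs above, keyed by number of additions.
timings = {
    10_000: 0.191298125020694,
    100_000: 21.588158333004685,
    120_000: 33.06306695801322,
}


def scaling(n_small, n_large):
    """Return (observed, linear, quadratic) slowdown factors between two runs."""
    observed = timings[n_large] / timings[n_small]
    linear = n_large / n_small
    quadratic = (n_large / n_small) ** 2
    return observed, linear, quadratic


for small, large in [(10_000, 100_000), (100_000, 120_000)]:
    observed, linear, quadratic = scaling(small, large)
    print(f"{small} -> {large}: observed x{observed:.2f}, "
          f"linear x{linear:.2f}, quadratic x{quadratic:.2f}")
```

Going from 10,000 to 100,000 additions makes the run roughly 113 times slower, much closer to the quadratic prediction (100) than to the linear one (10). From 100,000 to 120,000 the slowdown is about 1.53 against a quadratic prediction of 1.44.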

Suggested Implementation

Modify the _FrozenDatasets constructor so that it accepts only a dict[str, AbstractDataset]. Keep using self.__dict__.update() in the constructor to add datasets to _FrozenDatasets. When extending the collection, as in the add() method, call __dict__.update() on the existing _FrozenDatasets instance instead of creating a new one. We could also add a _FrozenDatasets._update() method that wraps the __dict__.update() logic and use it both in the constructor and in add().
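A minimal sketch of what this could look like, assuming a simplified stand-in class: `FrozenDatasetsSketch` is hypothetical and is not the Kedro implementation. It only shows a dict-only constructor, an `_update()` helper that writes through `__dict__`, and a `__setattr__` that still rejects direct assignment.

```python
from __future__ import annotations

from typing import Any


class FrozenDatasetsSketch:
    """Hypothetical, simplified stand-in for _FrozenDatasets."""

    def __init__(self, datasets: dict[str, Any]) -> None:
        # The constructor takes a plain dict only and delegates to _update().
        self._update(datasets)

    def _update(self, datasets: dict[str, Any]) -> None:
        # Writing through __dict__ bypasses __setattr__, so no copy of the
        # existing entries is needed: the cost depends only on len(datasets).
        self.__dict__.update(datasets)

    def __setattr__(self, key: str, value: Any) -> None:
        # Direct attribute assignment stays blocked for users.
        raise AttributeError("Operation not allowed!")


frozen = FrozenDatasetsSketch({"cars": "cars_dataset"})
frozen._update({"boats": "boats_dataset"})  # internal, in-place extension
```

With this shape, an add()-style caller would invoke `_update()` on the existing instance, so each addition costs time proportional to the number of new entries rather than to the size of the whole collection.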


astrojuanlu · 2024-06-06T07:30:23Z

xref #3930 which also discusses catalog mutability

ankatiyar · 2024-06-12T16:50:07Z

Interestingly I found this from 2021 :P #951

DimedS · 2024-06-20T15:50:45Z

As I understand it, we use the _FrozenDatasets class to make the datasets attribute of the DataCatalog class immutable. This means that every modification of the datasets takes O(n) time, because a new _FrozenDatasets instance is created by copying the previous one.

In my opinion, to solve this issue we should first consider dropping immutability. Regarding the current ticket, it's unclear to me why the current behaviour is a problem. Do we frequently encounter scenarios where add_feed_dict() is called multiple times?

astrojuanlu · 2024-06-20T16:58:59Z

> Do we frequently encounter scenarios where add_feed_dict() is called multiple times?

Indeed, that's the topic of #3930 (some examples from plugins linked there)

Admittedly, some follow-up questions could be asked to better understand in which cases we want to allow mutability in the DataCatalog.

DimedS · 2024-06-24T11:24:51Z

@ElenaKhaustova, could you please comment: is it correct that the solution proposed in this ticket leads to the loss of dataset immutability? If so, before implementing it we need to agree on that loss, as discussed in #3930.

ElenaKhaustova · 2024-06-24T11:59:20Z

> As I understand it, we use the _FrozenDatasets class to make the datasets attribute of the DataCatalog class immutable. This means that every modification of the datasets takes O(n) time, because a new _FrozenDatasets instance is created by copying the previous one.
>
> In my opinion, to solve this issue we should first consider dropping immutability. Regarding the current ticket, it's unclear to me why the current behaviour is a problem. Do we frequently encounter scenarios where add_feed_dict() is called multiple times?

That's the case when people use a multi-runner for parameter tuning, which under the hood creates numerous similar namespaced datasets and thus calls add_feed_dict() extensively. It is indeed different from typical pipeline usage.

ElenaKhaustova · 2024-06-24T12:06:22Z

> @ElenaKhaustova, could you please comment: is it correct that the solution proposed in this ticket leads to the loss of dataset immutability? If so, before implementing it we need to agree on that loss, as discussed in #3930.

Well, technically one can already modify it using private methods. The suggestion for now is just to use _FrozenDatasets.__dict__.update() instead of recreating the object. For users it remains immutable, since it does not allow attributes to be set directly, but we modify it internally inside the add() method. In the future it will probably be redesigned.

DimedS · 2024-06-24T14:58:22Z

> > @ElenaKhaustova, could you please comment: is it correct that the solution proposed in this ticket leads to the loss of dataset immutability? If so, before implementing it we need to agree on that loss, as discussed in #3930.
>
> Well, technically one can already modify it using private methods. The suggestion for now is just to use _FrozenDatasets.__dict__.update() instead of recreating the object. For users it remains immutable, since it does not allow attributes to be set directly, but we modify it internally inside the add() method. In the future it will probably be redesigned.

@ElenaKhaustova, thank you for the explanation. I have two questions:

1. Is it correct that we do not need to modify the _FrozenDatasets class constructor to solve the current issue?
2. Is your proposal to modify the add() function inside the DataCatalog class from:
   self.datasets = _FrozenDatasets(self.datasets, {dataset_name: dataset})
   to:
   self.datasets.__dict__.update({dataset_name: dataset})

If we make this modification, we would start treating the datasets attribute as mutable within the Kedro codebase. I am concerned about this change because the immutability of datasets might matter not only for user experience but also for Kedro's internal logic. I don't know the full list of reasons behind the original decision to make them immutable. Could you clarify whether this approach is acceptable for the Kedro DataCatalog architecture?

ElenaKhaustova · 2024-06-24T15:33:20Z

> > > @ElenaKhaustova, could you please comment: is it correct that the solution proposed in this ticket leads to the loss of dataset immutability? If so, before implementing it we need to agree on that loss, as discussed in #3930.
> >
> > Well, technically one can already modify it using private methods. The suggestion for now is just to use _FrozenDatasets.__dict__.update() instead of recreating the object. For users it remains immutable, since it does not allow attributes to be set directly, but we modify it internally inside the add() method. In the future it will probably be redesigned.
>
> @ElenaKhaustova, thank you for the explanation. I have two questions:
>
> 1. Is it correct that we do not need to modify the _FrozenDatasets class constructor to solve the current issue?
> 2. Is your proposal to modify the add() function inside the DataCatalog class from:
>    self.datasets = _FrozenDatasets(self.datasets, {dataset_name: dataset})
>    to:
>    self.datasets.__dict__.update({dataset_name: dataset})
>
> If we make this modification, we would start treating the datasets attribute as mutable within the Kedro codebase. I am concerned about this change because the immutability of datasets might matter not only for user experience but also for Kedro's internal logic. I don't know the full list of reasons behind the original decision to make them immutable. Could you clarify whether this approach is acceptable for the Kedro DataCatalog architecture?

The overall suggestion is to replace object re-creation with an in-place update.
We need to modify the constructor as well. Currently it accepts either _FrozenDatasets or dict[str, AbstractDataset], but it could accept only dict[str, AbstractDataset].
In the add() method we can add a loop that updates self.datasets.__dict__, as suggested above.
It doesn't change anything in terms of immutability from the user's perspective, because _FrozenDatasets will still have the same setter, which does not allow direct modification.
The datasets attribute will be treated as mutable within the Kedro codebase, yes, but at first glance that shouldn't break anything. It's a good time to check whether I'm wrong.
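One concrete way to check the internal-mutability concern is to compare what happens to a reference that was taken before an addition. The sketch below uses hypothetical stand-ins (`FrozenSketch`, `CatalogSketch`, `add_by_rebuild`, `add_in_place`) rather than Kedro's classes, and only illustrates the difference in aliasing between the two approaches.

```python
class FrozenSketch:
    """Hypothetical stand-in that can be built from dicts or other instances."""

    def __init__(self, *collections):
        for collection in collections:
            if isinstance(collection, FrozenSketch):
                source = collection.__dict__
            else:
                source = collection
            self.__dict__.update(source)


class CatalogSketch:
    """Hypothetical stand-in for a catalog holding a datasets attribute."""

    def __init__(self, datasets):
        self.datasets = FrozenSketch(datasets)

    def add_by_rebuild(self, name, dataset):
        # Rebuild approach: a new object replaces the old one.
        self.datasets = FrozenSketch(self.datasets, {name: dataset})

    def add_in_place(self, name, dataset):
        # In-place approach: the existing object is extended.
        self.datasets.__dict__.update({name: dataset})


catalog = CatalogSketch({"a": 1})

before = catalog.datasets
catalog.add_by_rebuild("b", 2)
old_reference_sees_b = hasattr(before, "b")  # the old object is untouched

held = catalog.datasets
catalog.add_in_place("c", 3)
held_reference_sees_c = hasattr(held, "c")  # the held object was mutated
```

In this sketch, code that keeps a reference to `catalog.datasets` behaves differently under the two approaches: after a rebuild the old reference is a stale snapshot, while after an in-place update it reflects the new entry. Any internal code that relies on the snapshot behaviour would be the place to look for breakage.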

merelcht added this to the Redesign the API for IO (catalog) milestone Jun 3, 2024

ElenaKhaustova mentioned this issue Jun 6, 2024

Research summary of insights for redesigning Kedro's data catalog API #3934

Open

merelcht assigned ankatiyar and DimedS Jun 7, 2024

github-actions bot mentioned this issue Jul 1, 2024

Monthly issue metrics report #3975

Open
