Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DataCatalog]: add_feed_dict() performance bottleneck #3912

Open
ElenaKhaustova opened this issue Jun 3, 2024 · 9 comments
Open

[DataCatalog]: add_feed_dict() performance bottleneck #3912

ElenaKhaustova opened this issue Jun 3, 2024 · 9 comments
Assignees

Comments

@ElenaKhaustova
Copy link
Contributor

ElenaKhaustova commented Jun 3, 2024

Description

The current implementation of add_feed_dict() leads to performance bottlenecks because it calls add() method which duplicates the structure of _FrozenDatasets, resulting O(N^2) complexity thus unnecessary slowdowns, especially in case of many catalog entries.

We propose implementing a more efficient approach that directly updates datasets collection without the need for copying _FrozenDatasets structures.

Context

self.add(dataset_name, dataset, replace)

self.datasets = _FrozenDatasets(self.datasets, {dataset_name: dataset})

for collection in datasets_collections:

Steps to Reproduce

from time import monotonic


f = _FrozenDatasets({f"k_{i}": i for i in range(100)})
now = monotonic()
for i in range(10000):
    f = _FrozenDatasets(f, {f"new_{i}": i})
print(monotonic() - now)
# 0.191298125020694

f = _FrozenDatasets({f"k_{i}": i for i in range(100)})
now = monotonic()
for i in range(100000):
    f = _FrozenDatasets(f, {f"new_{i}": i})
print(monotonic() - now)
# 21.588158333004685

f = _FrozenDatasets({f"k_{i}": i for i in range(100)})
now = monotonic()
for i in range(120000):
    f = _FrozenDatasets(f, {f"new_{i}": i})
print(monotonic() - now)
# 33.06306695801322

Suggested Implementation

Modify _FrozenDatasets constructor, so it only inputs a dict[str, AbstractDataset]. Keep using self.__dict__.update() in the constructor to add datasets into the _FrozenDatasets. In case of extending _FrozenDatasets collection as in add() method use _FrozenDatasets.__dict__.update(). We can also consider adding _FrozenDatasets._update() method wrapping _FrozenDatasets.__dict__.update() logic and use it in the constructor and upon add().

@astrojuanlu
Copy link
Member

xref #3930 which also discusses catalog mutability

@ankatiyar
Copy link
Contributor

Interestingly I found this from 2021 :P #951

@DimedS
Copy link
Contributor

DimedS commented Jun 20, 2024

As I understand it, we use the _FrozenDatasets class to achieve immutability for the datasets attribute of the DataCatalog class. This means that if someone wants to modify the datasets, it will take O(n) complexity because a new _FrozenDatasets instance is created by copying the previous one.

In my opinion, to solve this issue, we should first consider cancelling immutability. Regarding the current ticket, it's unclear to me why the current situation is a problem. Do we frequently encounter scenarios where add_feed_dict() is called multiple times?

@astrojuanlu
Copy link
Member

astrojuanlu commented Jun 20, 2024

Do we frequently encounter scenarios where add_feed_dict() is called multiple times?

Indeed, that's the topic of #3930 (some examples from plugins linked there)

Admittedly, some follow-up questions could be asked to better understand in which cases do we want to allow mutability in the DataCatalog

@DimedS
Copy link
Contributor

DimedS commented Jun 24, 2024

@ElenaKhaustova, could you please comment: Is it correct that the solution proposed in that ticket leads to the loss of dataset immutability? Before implementing it, we need to agree on this loss as described in #3930 ?

@ElenaKhaustova
Copy link
Contributor Author

As I understand it, we use the _FrozenDatasets class to achieve immutability for the datasets attribute of the DataCatalog class. This means that if someone wants to modify the datasets, it will take O(n) complexity because a new _FrozenDatasets instance is created by copying the previous one.

In my opinion, to solve this issue, we should first consider cancelling immutability. Regarding the current ticket, it's unclear to me why the current situation is a problem. Do we frequently encounter scenarios where add_feed_dict() is called multiple times?

That's the case when people use multi-runner for tuning parameters which under the hood creates numerous similar datasets with namespaces and thus extensively uses the add_feed_dict() method. It's indeed different from the typical usage of pipelines.

@ElenaKhaustova
Copy link
Contributor Author

@ElenaKhaustova, could you please comment: Is it correct that the solution proposed in that ticket leads to the loss of dataset immutability? Before implementing it, we need to agree on this loss as described in #3930 ?

Well, technically one can modify it now using private methods as well. The suggestion for now is just to use _FrozenDatasets.__dict__.update() instead of recreating an object. So for users, it's still immutable since it doesn't have a setter but we modify it internally inside add() method. But in future, it will probably be redesigned.

@DimedS
Copy link
Contributor

DimedS commented Jun 24, 2024

@ElenaKhaustova, could you please comment: Is it correct that the solution proposed in that ticket leads to the loss of dataset immutability? Before implementing it, we need to agree on this loss as described in #3930 ?

Well, technically one can modify it now using private methods as well. The suggestion for now is just to use _FrozenDatasets.__dict__.update() instead of recreating an object. So for users, it's still immutable since it doesn't have a setter but we modify it internally inside add() method. But in future, it will probably be redesigned.

@ElenaKhaustova, thank you for the explanation. I have two questions:

  1. Is it correct that we do not need to modify the _FrozenDatasets class constructor to solve the current issue?
  2. Is your proposal to modify the add() function inside the DataCatalog class from:
    self.datasets = _FrozenDatasets(self.datasets, {dataset_name: dataset})
    to:
    self.datasets.__dict__.update({dataset_name: dataset})
    If we make this modification, we would start treating the datasets attribute as mutable within the Kedro codebase. I am concerned about this change because the immutability of datasets might be important not only for user experience but also for the internal logic of Kedro. I don't know the full list of reasons behind the decision to make them immutable initially. Could you clarify if this approach is acceptable for the Kedro DataCatalog architecture?

@ElenaKhaustova
Copy link
Contributor Author

@ElenaKhaustova, could you please comment: Is it correct that the solution proposed in that ticket leads to the loss of dataset immutability? Before implementing it, we need to agree on this loss as described in #3930 ?

Well, technically one can modify it now using private methods as well. The suggestion for now is just to use _FrozenDatasets.__dict__.update() instead of recreating an object. So for users, it's still immutable since it doesn't have a setter but we modify it internally inside add() method. But in future, it will probably be redesigned.

@ElenaKhaustova, thank you for the explanation. I have two questions:

  1. Is it correct that we do not need to modify the _FrozenDatasets class constructor to solve the current issue?
  2. Is your proposal to modify the add() function inside the DataCatalog class from:
    self.datasets = _FrozenDatasets(self.datasets, {dataset_name: dataset})
    to:
    self.datasets.__dict__.update({dataset_name: dataset})
    If we make this modification, we would start treating the datasets attribute as mutable within the Kedro codebase. I am concerned about this change because the immutability of datasets might be important not only for user experience but also for the internal logic of Kedro. I don't know the full list of reasons behind the decision to make them immutable initially. Could you clarify if this approach is acceptable for the Kedro DataCatalog architecture?
  1. The overall suggestion is to replace object re-creation with an update.
  2. We need to modify the constructor as well. Now, it inputs either _FrozenDatasets or dict[str, AbstractDataset], but we can input only dict[str, AbstractDataset].
  3. In the add() method we can add a loop to update self.datasets.__dict__ as suggested above.
  4. It doesn't change anything in terms of immutability from the user perspective cause _FrozenDatasets will still have the same setter - not allowing modification directly.
  5. datasets attribute will be treated as mutable within the Kedro codebase, yes, but at first glance, it shouldn't break anything. But it's a good time to check if I'm wrong.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: No status
Development

No branches or pull requests

5 participants