Slow startup because of catalog processing #951
Comments
We'll look into this but this is the first time I've seen 13k catalog entries! |
13k! Indeed we have quite a lot. But this is the only part that creates an issue with it. |
@Rodolphe-cambier if you're not using an interactive workflow (i.e. trying to access …) |
@Rodolphe-cambier to add to what @lorenabalan suggested, as a workaround you can also patch out `_sub_nonword_chars`:

```python
from unittest import mock

from kedro.io import DataCatalog

# Inside e.g. your project's _get_catalog(), where catalog, credentials,
# load_versions, save_version and journal are already in scope:
with mock.patch("kedro.io.data_catalog._sub_nonword_chars", new=lambda s: s):
    return DataCatalog.from_config(
        catalog, credentials, load_versions, save_version, journal
    )
```
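One caveat (my reading, not stated explicitly in the thread): replacing the function with an identity lambda means dataset names keep their non-word characters, so attribute-style access to transcoded/prefixed entries via `DataCatalog.datasets` no longer works; presumably that is why @lorenabalan scoped the workaround to non-interactive workflows.
|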
Regarding how to profile the issue: I ran a profiler on a simple run and sorted by own_time descending. [profiler screenshot omitted] As you can see, nearly all of the startup time is spent in `_sub_nonword_chars`.
|
The offending code is here, in `kedro.io.data_catalog`:

```python
import re


def _sub_nonword_chars(data_set_name: str) -> str:
    """Replace non-word characters in data set names since Kedro 0.16.2.

    Args:
        data_set_name: The data set name registered in the data catalog.

    Returns:
        The name used in `DataCatalog.datasets`.
    """
    return re.sub(r"\W+", "__", data_set_name)


class _FrozenDatasets:
    """Helper class to access underlying loaded datasets"""

    def __init__(self, datasets):
        # Non-word characters in dataset names are replaced with `__`
        # for easy access to transcoded/prefixed datasets.
        datasets = {_sub_nonword_chars(key): value for key, value in datasets.items()}
        self.__dict__.update(**datasets)

    # Don't allow users to add/change attributes on the fly
    def __setattr__(self, key, value):
        msg = "Operation not allowed! "
        if key in self.__dict__.keys():
            msg += "Please change datasets through configuration."
        else:
            msg += "Please use DataCatalog.add() instead."
        raise AttributeError(msg)
```

As I see it we have two options to resolve this. Option 1: Introduce a …
|
I do think there is also an underlying issue: the fact that we re-process the same keys over and over again. Maybe I failed to express this in the opening post. To make it even more explicit: suppose the catalog already holds 9 datasets and we add a tenth. Every time we make such a `_FrozenDatasets`, we process all the keys with the regex, so here we would process all 10 keys, not just the new one. Notice how this is quadratic in the number of keys in the catalog. This is why our 13k catalog entries lead to millions of calls. I don't think the regex is the problem: if each key was processed once, we would only need to process 13k strings instead of 200 million.
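A small standalone simulation (an illustrative sketch I am adding here, not Kedro code) makes the quadratic growth concrete:

```python
# Hypothetical simulation of the pattern described above: each add()
# rebuilds the frozen mapping and re-processes every key seen so far.
calls = 0

def process(key: str) -> str:
    """Stand-in for _sub_nonword_chars; we only count invocations."""
    global calls
    calls += 1
    return key

datasets = {}
for i in range(13_000):
    datasets[f"dataset_{i}"] = object()
    # Rebuilding from scratch re-processes all i + 1 keys seen so far,
    # just like _FrozenDatasets.__init__ does.
    frozen = {process(k): v for k, v in datasets.items()}

# 1 + 2 + ... + 13000 = 13000 * 13001 / 2 = 84,506,500 calls.
# Running this takes a noticeable amount of time on its own, which is
# exactly the reported symptom.
print(f"{calls:,}")
```
|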
I'm not familiar enough with the codebase to offer a concrete solution. Locally, what I did is to wrap the … But I still think that the code architecture should be changed to avoid those unnecessary calls.
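One plausible shape for such a wrapper (an assumption on my part, since the exact snippet is truncated above) is to memoise the function so each distinct key is processed at most once:

```python
import functools

from kedro.io import data_catalog

# Hypothetical local workaround: memoise the substitution so each distinct
# dataset name is run through the regex at most once per process. This
# makes the calls cheap but does not remove their quadratic number.
data_catalog._sub_nonword_chars = functools.lru_cache(maxsize=None)(
    data_catalog._sub_nonword_chars
)
```
|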
@lorenabalan @limdauto Prefer not to patch it out entirely, because we're using it to generate projects for other teams and don't want them to have different behavior from what's standard. When I saw @Rodolphe-cambier's aforementioned solution to wrap the …
+1 to seeing if we can fix the calls scaling non-linearly, and also +1 to us figuring out why we have 13k entries to begin with (maybe another issue 🤦). |
@Rodolphe-cambier oh no... thank you for your example. I was not aware that this was happening 😮. |
@Rodolphe-cambier I have fixed this in #953. Thanks again for the report! |
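For illustration, a minimal sketch of one way to remove the quadratic re-processing (the actual change in #953 may differ): let `_FrozenDatasets` reuse the already-substituted keys of a previous instance, so each key pays the regex cost only once.

```python
import re


def _sub_nonword_chars(data_set_name: str) -> str:
    return re.sub(r"\W+", "__", data_set_name)


class _FrozenDatasets:
    """Sketch: accept previous _FrozenDatasets instances whose keys were
    already substituted, and only run the regex on genuinely new keys."""

    def __init__(self, *datasets_collections):
        for collection in datasets_collections:
            if isinstance(collection, _FrozenDatasets):
                # These keys were processed when that instance was built.
                self.__dict__.update(collection.__dict__)
            else:
                # Only new raw keys pay the substitution cost.
                self.__dict__.update(
                    {_sub_nonword_chars(k): v for k, v in collection.items()}
                )
```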
Description
When starting a Kedro pipeline with a big catalog, it can take multiple minutes before the pipeline actually starts. This time is lost parsing the catalog files: for a total catalog size of 13000 entries, the code will call the function `_sub_nonword_chars` on the order of 100 million times. Here is how it happens (a condensed sketch of the chain follows the list):

1. `add_feed_dict()` is called in `_get_catalog()`. This happens 1 time.
2. `add_feed_dict()` calls `add()` for each dataset. Each call creates a `_FrozenDatasets` with all the existing keys plus the newly added key. In our case this happens 13000 times: first with 1 key, then 2 keys, then 3 keys, and so on up to 13000 keys.
3. Every time we call `add()` and create a `_FrozenDatasets`, all the keys are re-processed, even the keys that were already in the previous `_FrozenDatasets`. Since we are adding 13000 datasets, this amounts to 1 + 2 + ⋯ + 13000 = 84,506,500 calls to `_sub_nonword_chars`, i.e. on the order of 100 million.
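A condensed sketch of that chain (a hypothetical simplification, not the verbatim Kedro source):

```python
import re

def _sub_nonword_chars(name: str) -> str:
    return re.sub(r"\W+", "__", name)

class _FrozenDatasets:
    def __init__(self, datasets):
        # Re-processes EVERY key each time an instance is built.
        self.__dict__.update(
            {_sub_nonword_chars(k): v for k, v in datasets.items()}
        )

class DataCatalog:
    def __init__(self):
        self._data_sets = {}
        self.datasets = _FrozenDatasets(self._data_sets)

    def add(self, data_set_name, data_set):
        self._data_sets[data_set_name] = data_set
        # Step 3: rebuilds the frozen view from all keys, old ones included.
        self.datasets = _FrozenDatasets(self._data_sets)

    def add_feed_dict(self, feed_dict):
        # Step 2: one add() per dataset -> 13000 calls for a 13000-entry dict.
        for name, data in feed_dict.items():
            self.add(name, data)
```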
Steps to Reproduce
Your Environment
Include as many relevant details about the environment in which you experienced the bug: