Slow startup because of catalog processing #951

Closed
Rodolphe-cambier opened this issue Oct 12, 2021 · 11 comments · Fixed by #953
Labels
Issue: Bug Report 🐞 Bug that needs to be fixed

Comments

@Rodolphe-cambier

Rodolphe-cambier commented Oct 12, 2021

Description

When starting a Kedro pipeline with a big catalog, it can take multiple minutes before the pipeline actually runs. This time is lost processing the catalog files: for a total catalog size of 13000 entries, the code ends up calling the function _sub_nonword_chars about 100 million times.

Here is how it happens:

  1. When starting, Kedro calls add_feed_dict() in _get_catalog(). Here. This happens 1 time.
  2. add_feed_dict() calls add() for each dataset. Here. This creates a _FrozenDatasets with all the existing keys plus the newly added key. Here. In our case this happens 13000 times: first with 1 key, then 2 keys, then 3 keys, ... up to 13000 keys.
  3. The issue is that every time you call add() and create a _FrozenDatasets, all the keys are re-processed, here, even the keys that were already added to the previous _FrozenDatasets. Since we are adding 13000 datasets, this amounts to about 100 million calls to _sub_nonword_chars.
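Back-of-the-envelope, assuming one _sub_nonword_chars call per key per rebuild: 1 + 2 + ... + 13000 = 13000 × 13001 / 2 ≈ 84.5 million substitutions, which is the order of magnitude seen in the profile further down this thread.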

Steps to Reproduce

  1. kedro run

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used: 0.17.5
  • Python version used: 3.7.11
  • Operating system and version: Any
@Rodolphe-cambier added the Issue: Bug Report 🐞 Bug that needs to be fixed label Oct 12, 2021
@datajoely
Contributor

datajoely commented Oct 12, 2021

We'll look into this, but this is the first time I've seen 13k catalog entries!

@Rodolphe-cambier
Author

13k! Indeed, we have quite a lot. But this is the only part that has an issue with it.

@lorenabalan
Contributor

lorenabalan commented Oct 12, 2021

@Rodolphe-cambier if you're not using an interactive workflow (i.e. trying to access catalog.my_dataset in an IPython or Jupyter session), you can define a custom DataCatalog class instead, to avoid the datasets attribute / _FrozenDatasets business. To use it, you'll have to make register_catalog return the new object instead (in hooks.py).
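For illustration, a rough sketch of that idea, assuming the register_catalog hook from the 0.17.x project template; the subclass name and its add() override (which writes straight to the private _data_sets dict and never rebuilds the frozen datasets view) are assumptions, not an official recipe:

# hooks.py -- sketch only
from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog


class NoFrozenDataCatalog(DataCatalog):
    """Hypothetical catalog that skips the frozen `datasets` attribute rebuild."""

    def add(self, data_set_name, data_set, replace=False):
        # Assumes the private `_data_sets` dict from the 0.17.x source;
        # note this also drops the parent's duplicate-name check.
        self._data_sets[data_set_name] = data_set


class ProjectHooks:
    @hook_impl
    def register_catalog(self, catalog, credentials, load_versions, save_version, journal):
        return NoFrozenDataCatalog.from_config(
            catalog, credentials, load_versions, save_version, journal
        )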

@limdauto
Contributor

@Rodolphe-cambier to add to what @lorenabalan suggested, as a workaround, you can also patch out _sub_nonword_chars and see if it helps:

from unittest import mock

# Inside register_catalog() in hooks.py:
with mock.patch("kedro.io.data_catalog._sub_nonword_chars", new=lambda s: s):
    from kedro.io import DataCatalog

    return DataCatalog.from_config(
        catalog, credentials, load_versions, save_version, journal
    )

@Rodolphe-cambier
Author

Rodolphe-cambier commented Oct 12, 2021

Regarding how I profiled the issue: running a profiler on a simple kedro run --pipeline pipeline_name highlighted the following functions as taking most of the time.

Sorted by own_time descending.

| function_name | call_count | time (ms) | own_time (ms) |
| --- | --- | --- | --- |
| `<method 'sub' of 're.Pattern' objects>` | 106849876 | 302915 | 302898 |
| `_compile` | 106829650 | 55367 | 39719 |
| `get_data` | 6806 | 38226 | 38029 |
| `sub` | 106804688 | 391190 | 33344 |
| `<dictcomp>` | 12992 | 455294 | 32677 |
| `_sub_nonword_chars` | 106800726 | 422711 | 31564 |
| `<built-in method builtins.isinstance>` | 110328963 | 15883 | 15810 |
| `<built-in method time.sleep>` | 111 | 11336 | 11336 |
| `<method 'recv_into' of '_socket.socket' objects>` | 1191 | 9296 | 9296 |
| `<built-in method _imp.create_dynamic>` | 361 | 7686 | 7671 |
| `<method 'update' of 'dict' objects>` | 69905 | 4050 | 4050 |
| `<built-in method io.open>` | 2408 | 4114 | 3957 |
| `add` | 12990 | 462070 | 2695 |

As you can see, _sub_nonword_chars gets called over 106 million times.
@limdauto I did patch it out in my local setup. I observed a substantial speedup, and the profiling then looked like this:

| function_name | call_count | time (ms) | own_time (ms) |
| --- | --- | --- | --- |
| `<dictcomp>` | 12992 | 15782 | 15659 |
| `<method 'recv_into' of '_socket.socket' objects>` | 1192 | 7851 | 7851 |
| `get_data` | 6805 | 4198 | 4078 |
| `<method 'update' of 'dict' objects>` | 69906 | 2481 | 2481 |
| `<built-in method _imp.create_dynamic>` | 361 | 1997 | 1994 |
| `<built-in method time.sleep>` | 15 | 1542 | 1542 |
| `<built-in method io.open>` | 2408 | 1572 | 1427 |
| `add` | 12990 | 19459 | 1030 |

@datajoely
Contributor

datajoely commented Oct 12, 2021

The offending code is here in kedro.io.data_catalog; we run this substitution on every dataset. In most uses this won't be a problem, but for 13K datasets it feels painful:

def _sub_nonword_chars(data_set_name: str) -> str:
    """Replace non-word characters in data set names since Kedro 0.16.2.

    Args:
        data_set_name: The data set name registered in the data catalog.

    Returns:
        The name used in `DataCatalog.datasets`.
    """
    return re.sub(r"\W+", "__", data_set_name)


class _FrozenDatasets:
    """Helper class to access underlying loaded datasets"""

    def __init__(self, datasets):
        # Non-word characters in dataset names are replaced with `__`
        # for easy access to transcoded/prefixed datasets.
        datasets = {_sub_nonword_chars(key): value for key, value in datasets.items()}
        self.__dict__.update(**datasets)

    # Don't allow users to add/change attributes on the fly
    def __setattr__(self, key, value):
        msg = "Operation not allowed! "
        if key in self.__dict__.keys():
            msg += "Please change datasets through configuration."
        else:
            msg += "Please use DataCatalog.add() instead."
        raise AttributeError(msg)

As I see it we have two options to resolve this:

Option 1: Introduce a re.compile step

Stack Overflow suggests that for any pattern we apply over and over again, there is an advantage to compiling it up front:
https://stackoverflow.com/a/47269110/2010808
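
For illustration, a minimal sketch of that option (worth noting that the re module caches compiled patterns internally, so this mainly avoids the repeated cache lookup that shows up as _compile in the profile above):

import re

# Compile the pattern once at import time instead of on every call.
_NONWORD_PATTERN = re.compile(r"\W+")


def _sub_nonword_chars(data_set_name: str) -> str:
    return _NONWORD_PATTERN.sub("__", data_set_name)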

Side point: Are we sure this doesn't cause issues for accented catalog entries?

Option 2: Do we need regex?

I actually think we can achieve the same outcome without regular expressions by relying on the string class itself; I have not, however, run any profiling to prove this is faster.

\W matches any non-word character (equivalent to [^a-zA-Z0-9_])

The following code will remove any punctuation (except _) using the string.translate method:

>>> from string import punctuation
>>> test = 'a!becDef_'
>>> characters_to_remove = punctuation.replace('_', '')
>>> test.translate(str.maketrans('', '', characters_to_remove))
'abecDef_'

It would be interesting, @Rodolphe-cambier, if you had a chance to patch this approach in and re-run the profiling step.

@Rodolphe-cambier
Author

Rodolphe-cambier commented Oct 12, 2021

I do think there is also an underlying issue: the fact that we re-process the same keys over and over again. Maybe I failed to express this in the opening post.

To make it even more explicit:

  • If you have a catalog with 4 keys: {key1: val1, key2: val2, key3: val3, key4: val4}
  • Then the code will create 4 _FrozenDatasets:
    • first _FrozenDatasets({key1: val1})
    • then _FrozenDatasets({key1: val1, key2: val2})
    • then _FrozenDatasets({key1: val1, key2: val2, key3: val3})
    • then _FrozenDatasets({key1: val1, key2: val2, key3: val3, key4: val4})

And every time we make such a _FrozenDatasets, we process all of its keys with the regex. So here we would process 10 keys in total. Notice how this is quadratic in the number of keys in the catalog. This is why our 13k catalog entries lead to millions of calls.

I don't think the regex itself is the problem. If each key was processed once, we would only need to process 13k strings instead of 200 million.
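
To make the quadratic growth concrete, here is a small self-contained sketch (a stand-in for the real code, with 1000 keys instead of 13k so it runs quickly):

import re

calls = 0


def sub_nonword_chars(name):
    # Stand-in for kedro.io.data_catalog._sub_nonword_chars, instrumented to count calls.
    global calls
    calls += 1
    return re.sub(r"\W+", "__", name)


datasets = {}
for i in range(1000):
    datasets[f"dataset_{i}"] = object()
    # Mimics what add() does today: rebuild the frozen view and re-process every key seen so far.
    frozen = {sub_nonword_chars(key): value for key, value in datasets.items()}

print(calls)  # 500500 == 1000 * 1001 / 2; with 13000 keys this is ~84.5 million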

@Rodolphe-cambier
Author

I'm not familiar enough with the codebase to offer a concrete solution.

Locally, what I did was to wrap the _sub_nonword_chars function with an lru_cache. This completely removes the problem at the cost of some memory.
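
For reference, a sketch of that workaround; where exactly to apply it (e.g. near the top of hooks.py, before the catalog is built) is my assumption:

import functools

from kedro.io import data_catalog

# Memoise the substitution so each distinct dataset name is only processed once.
data_catalog._sub_nonword_chars = functools.lru_cache(maxsize=None)(
    data_catalog._sub_nonword_chars
)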

But I still think that the code architecture should be changed to avoid those unnecessary calls.

@deepyaman
Member

deepyaman commented Oct 12, 2021

> @Rodolphe-cambier if you're not using an interactive workflow (i.e. trying to access catalog.my_dataset in an IPython or Jupyter session), you can define a custom DataCatalog class instead, to avoid the datasets attribute / _FrozenDatasets business. To use it, you'll have to make register_catalog return the new object instead (in hooks.py).

@lorenabalan @limdauto I'd prefer not to patch it out entirely, because we're using Kedro to generate projects for other teams and don't want those projects to behave differently from standard. When I saw @Rodolphe-cambier's aforementioned solution of wrapping _sub_nonword_chars in lru_cache, I mentioned that @limdauto had patched it out in the past, but that I preferred @Rodolphe-cambier's approach because it keeps the behaviour the same. edit: I think this is convoluted; let's just say I would also prefer not to lose the IPython/Jupyter workflow, since others want to use our generated projects the standard Kedro way.

+1 to seeing if we can fix the calls scaling non-linearly, and also +1 to us figuring out why we have 13K entries to begin with (maybe another issue 🤦).

@limdauto
Contributor

limdauto commented Oct 12, 2021

@Rodolphe-cambier oh no... thank you for your example. I was not aware that this was happening 😮. This has an embarrassingly easy fix to make the complexity grow linearly: #953. Looking at the git history, this piece of code hasn't been touched for 2 years, so this is definitely an oversight that could only have been surfaced by a huge catalog.

Actually, thinking about this a bit more, I can understand why it was done that way originally: it wants to make sure that every time catalog.add is called, we get a _FrozenDatasets with "nice" keys back. I have updated #953 with a lazy approach: the _FrozenDatasets is only expanded into a full dictionary with nice keys on first read. This is still not ideal (if the user explicitly calls .add after the first read, the string substitution routine won't run), but I will think about it more tomorrow.
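
Roughly, the lazy idea looks like the sketch below (an illustration of the approach only, not the actual #953 diff): the key substitution is deferred until the first attribute access.

from kedro.io.data_catalog import _sub_nonword_chars


class _LazyFrozenDatasets:
    """Sketch: keys are normalised on first read instead of on every add()."""

    def __init__(self, datasets):
        object.__setattr__(self, "_raw", dict(datasets))
        object.__setattr__(self, "_nice", None)

    def _expand(self):
        # Build the "nice"-keyed dictionary once, on first access.
        if self._nice is None:
            nice = {_sub_nonword_chars(key): value for key, value in self._raw.items()}
            object.__setattr__(self, "_nice", nice)
        return self._nice

    def __getattr__(self, key):
        try:
            return self._expand()[key]
        except KeyError as exc:
            raise AttributeError(key) from exc

    # Don't allow users to add/change attributes on the fly
    def __setattr__(self, key, value):
        raise AttributeError("Operation not allowed! Please use DataCatalog.add() instead.")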

@limdauto
Contributor

@Rodolphe-cambier I have fixed this in #953. Thanks again for the report!
