Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataCatalog can be mutated but changes are not reflected in the session #2728

Open
astrojuanlu opened this issue Jun 26, 2023 · 9 comments
Open

Comments

@astrojuanlu
Copy link
Member

Today I wanted to apply an "advanced" hook use case: storing the catalog and then injecting datasets on the fly. However, it doesn't work:

class MissingDatasetHooks:
    @hook_impl
    def after_catalog_created(self, catalog: DataCatalog):
        self._catalog = catalog

    @hook_impl
    def before_dataset_loaded(self, dataset_name):
        dataset = self._catalog._get_dataset(dataset_name)
        try:
            dataset.load()
        except DataSetError:
            # Create EmptyDataset on the fly
            logger.warning("Attempted to load dataset %s which doesn't exist yet, injecting it", dataset_name)
            missing_dataset = MissingDataSet(dataset=dataset)
            self._catalog.add(dataset_name, missing_dataset, replace=True)

the self._catalog that gets saved receives the .add(..., replace=True) correctly, but the catalog.load that comes immediately after the before_dataset_loaded hook still has the old dataset:

hook_manager.hook.before_dataset_loaded(dataset_name=name, node=node)
inputs[name] = catalog.load(name)

Using after_context_created gets the same result.

Context: I was trying to give a workaround for https://stackoverflow.com/q/76557758/554319.

Is this behavior expected?

Originally posted by @astrojuanlu in #2690 (comment)

@gitgud5000
Copy link

I've been trying to achieve this aswell. It would be a great feature to add.

@astrojuanlu
Copy link
Member Author

@gitgud5000 Can you detail a bit more what are you trying to achieve and why? We are trying to come up with use cases for this change.

@merelcht merelcht added the Community Issue/PR opened by the open-source community label Jul 17, 2023
@gitgud5000
Copy link

gitgud5000 commented Jul 23, 2023

@astrojuanlu I'm trying to pass additional save_args based on the output of the node that creates a pandas.SQLTableDataSet as output.

My plan was to modify the catalog in a before_dataset_saved hook.

Similar to this: #910
This is also related: #898

@gitgud5000
Copy link

What I ended up doing was implementing a subclass of pandas.SQLTableDataSet

@noklam
Copy link
Contributor

noklam commented Jul 31, 2023

I don't think mutating DataCatalog is a good thing to do. I believe the hook system was designed in a way to avoid this exactly (correct me if I am wrong).

I am unsure if this example correct though because dataset.load() just load the dataset but not returning anything, how would it work? Maybe need an example to play with to investigate this.

There are however use cases that is very useful.
i.e. If you developing in a notebook environment, it is very convenient if you can inject/change data to test your pipeline without re-running the whole pipeline

@astrojuanlu
Copy link
Member Author

I don't think mutating DataCatalog is a good thing to do.

That's fair enough, but not the impression that I got when I saw a DataCatalog.add method existed 🤔 How could we reduce confusion here?

@noklam
Copy link
Contributor

noklam commented Jul 31, 2023

That's fair enough, but not the impression that I got when I saw a DataCatalog.add method existed 🤔 How could we reduce confusion here?

I think we need to clarify what's not working here. DataCatalog.add should work in a standalone mode? What doesn't work here is because catalog is not stored in KedroContext but rather hot-reload everytime it gets called.

I think there are two conflicting features here and we need to think of it carefully.

In any case, we should list out why we want to make this a singleton. I don't have the full context why it was designed that way, may be good to dig out the old issues and PR, but I don't have it now.

@astrojuanlu
Copy link
Member Author

Another user was confused by this but found a workaround that worked for their use case: using runner.run(..., catalog=new_catalog) https://linen-slack.kedro.org/t/16062852/hi-all-how-can-i-update-replace-catalog-entries-from-an-exis#e163f246-de25-405d-9462-7fbc757bf927

@astrojuanlu astrojuanlu changed the title Make DataCatalog a context-wide singleton? DataCatalog can be mutated but changes are not reflected in the session Nov 15, 2023
@noklam
Copy link
Contributor

noklam commented Mar 19, 2024

Maybe the action item for this ticket is:

  • Create documentation to explain DataCatalog immutability

Is there a good reason Kedro should start supporting this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: No status
Development

No branches or pull requests

4 participants