Decide on syntax to allow adding new attributes #2439

Closed
merelcht opened this issue Mar 20, 2023 · 10 comments
@merelcht
Member

merelcht commented Mar 20, 2023

Description

#1076

Context

  • Should we go for a top-level key or free-form keys? We'll go for a top-level key, e.g.:

my_dataset:
  type: ...
  filepath: ...
  metadata:  # <- this would be the top-level key
    kedro-viz:
      layer: ...
    kedro-mlflow: ...
    kedro-great-expectations: ...

or

my_dataset:
  type: ...
  filepath: ...
  kedro-viz:
    layer: ...
  kedro-mlflow: ...
  kedro-great-expectations: ...

A top-level key would make it clearer that any entries under it aren't core Kedro, but custom attributes that will be handled by a plugin. However, the top-level key does add extra nesting, and especially for the layer attribute this could be annoying for users.

  • If we go for a top-level key, what should this be called? metadata, custom_data, ...? The key will be called metadata.

Possible Implementation

@Galileo-Galilei mentioned:

  • I'd go for a top-level key rather than storing everything low level. This will make it explicit to users that the information stored here is not provided by kedro (it will save you a lot of time on questions not related to kedro, or on possible "inconsistencies" between your documentation and their catalog due to third-party plugins).

  • For this key, I prefer the name metadata, which is more explicit than the other propositions and kind of a standard, but honestly I'd be fine with other propositions.


@merelcht
Member Author

The top-level key is a much simpler solution to handle on the Kedro side, because it would only require parsing the metadata key, not the unlimited possibilities of keys that could go under it.
Regardless of where we add the key parsing (e.g. on the individual dataset level, the catalog, or AbstractDataset), on the Kedro side we can just ignore whatever is under it, and plugins can look specifically for the metadata key and then for whatever sub-key they have chosen, e.g. metadata.kedro-viz.
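As an illustration of that split of responsibilities (the entry dict and plugin key below are hypothetical, not Kedro API), a plugin could pick out its own namespace like this:

```python
# A catalog entry as it might look after YAML loading (shapes assumed).
entry = {
    "type": "pandas.CSVDataSet",
    "filepath": "data/01_raw/companies.csv",
    "metadata": {"kedro-viz": {"layer": "raw"}},
}

PLUGIN_KEY = "kedro-viz"  # each plugin would claim its own sub-key

# Kedro core can ignore everything under `metadata`; a plugin reads only
# its own namespace and leaves the other sub-keys untouched.
plugin_metadata = entry.get("metadata", {}).get(PLUGIN_KEY, {})
print(plugin_metadata)  # -> {'layer': 'raw'}
```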

I suggest we go for the name metadata for the top-level key, similar to how other tools refer to extra information about datasets.

On the point of the layer key for the kedro-viz layers, I'd suggest we add a solution that can handle the layer being both inside a kedro-viz key and at the level where it is now, to ensure that adding the metadata level doesn't break anything.
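A minimal sketch of such a backwards-compatible lookup (the function name and entry shapes are assumptions for illustration, not Kedro code):

```python
# Accept `layer` either nested under metadata.kedro-viz (new style)
# or at the top level of the entry (current style), preferring the new one.
def get_layer(entry: dict):
    nested = entry.get("metadata", {}).get("kedro-viz", {})
    return nested.get("layer", entry.get("layer"))

print(get_layer({"layer": "raw"}))                               # -> raw
print(get_layer({"metadata": {"kedro-viz": {"layer": "raw"}}}))  # -> raw
```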

@merelcht
Member Author

An alternative option to the metadata key would be, mentioned by @idanov:

It's still a vague idea, but something along the lines of extending the @... syntax.
E.g. if you have an output called output, you can have the following in your catalog:

output:
  type: ...
  ...
output@plotly:
  type: ...
  ...
output@preview:
  type: ...
  ...
output@summary:
  type: ...
  ...

@merelcht
Member Author

I found an old discussion about this on the private-kedro repo where @lorenabalan suggested to treat "metadata" similarly to credentials: "i.e. you can specify it in a separate file & interpolate (main preference), but technically you can also expand it within the catalog.yml entry"

So you could have a catalog.yml file and a metadata.yml file. I feel like that could be a good mid-way solution where the catalog file wouldn't get too bloated.

I think we've established now that there's a legitimate need for a way to add new attributes to datasets, with the dataset preview being one of the use cases.

@AntonyMilneQB and @idanov could you comment on what your preferred solution would be?

@yetudada
Contributor

So if I summarise correctly, there are four approaches:

  • Introducing metadata to each catalog.yml entry, either as a top-level metadata field (A) or as free-form keys (B)
  • Extending the @... syntax (C)
  • Creating a metadata.yml file (D)

I want to ground this conversation in an example. If we apply these approaches to the layer attribute for Kedro-Viz (because this design should affect that too), the examples would look like:

Approach A

# catalog.yml

companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv
  metadata:
    kedro-viz:
      layer: raw

Approach B

# catalog.yml

companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv
  kedro-viz:
    layer: raw

Approach C

I was confused about how the @... syntax would work here: would we have to specify type and filepath?

# catalog.yml

companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv

companies@layer: 
  type: pandas.CSVDataSet
  layer: raw
  filepath: data/01_raw/companies.csv

Approach D

# catalog.yml

companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv

# metadata.yml

companies_metadata:
  layer: raw

General thoughts

  • When I look at the examples at this level, then Approach A is my favourite because:
    • It allows users to interact with this metadata in one place and does not force users to work in two files; we're trying to reduce the number of files in the project template (see "Drastically simplify the default project template" #2149)
    • It follows a pattern used by other packages
    • It's very descriptive to users, so they know exactly what the concept is

@antonymilne
Contributor

antonymilne commented Mar 27, 2023

tl;dr: I like the metadata top-level key approach without a separate metadata.yml file (@yetudada Approach A). The only question in my mind is how things are structured below that point, i.e. which of these do we prefer?

# Option A1. Clear separation between different plugin responsibilities, gives more grouping.
dataset:
  type: ...
  metadata:
    kedro-viz:
      layer: raw
      preview: ...

# Option A2. Less nested. Means that e.g. layer could be used in other plugins rather than being kedro-viz specific.
# Might lead to naming conflicts though
dataset:
  type: ...
  metadata:
    layer: raw
    preview: ...

Probably option 1 is best, but we can worry about this later anyway. Basically, for the purposes of the kedro framework, we don't use anything that's in metadata, so we don't really care about the structure there. It can be up to the individual user/plugin author to decide what they call their keys and how they structure them.
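To make the naming-conflict point concrete (a hypothetical clash, not a real plugin conflict): under Option A2 all plugins share one flat namespace, while Option A1 gives each plugin its own sub-key.

```yaml
# Option A1, sketched: per-plugin sub-keys mean two plugins can both use
# a key called `layer` without clashing. Under Option A2 the flat metadata
# block would be a single shared namespace, so only one could win.
dataset:
  type: ...
  metadata:
    kedro-viz:
      layer: raw        # namespaced to kedro-viz
    some-other-plugin:
      layer: bronze     # same key name, no conflict
```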


metadata.yml file (@yetudada Approach D): no need to do this now or maybe ever

I was also looking for this discussion to remind myself on it because it's got many valuable points I think. So thank you for bringing it up 👍

I was a big fan of this at the time and still think it has its merits, but IMO its main advantage was that it would reduce repetition in catalog.yml by allowing you to inject repeated metadata into many entries without having to explicitly repeat it across many catalog entries.

However, the situation is a bit different now compared to before, since we now have OmegaConf and are actively trying to solve #2423. Assuming we can solve #2423, the issue of repeating metadata across many catalog entries would be much less significant.
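For instance, plain YAML anchors (which work today, independently of any Kedro feature) already reduce this kind of repetition; variable interpolation could play a similar role. A sketch, with hypothetical entry names:

```yaml
# Sharing one metadata block across entries with YAML anchors.
_raw_metadata: &raw_metadata
  kedro-viz:
    layer: raw

companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv
  metadata: *raw_metadata

shuttles:
  type: pandas.ExcelDataSet
  filepath: data/01_raw/shuttles.xlsx
  metadata: *raw_metadata
```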

The suggestion was to use the same mechanism to inject metadata that we already use for credentials. This behaviour of magically injecting credentials is already a little bit weird, but makes sense in that special case (to stop you from leaking credentials). Since we now have a new mechanism for variable interpolation (OmegaConf), I am less keen to promote our own custom mechanism in another place.

If we do want that, we can always add it later anyway, because putting metadata inline in catalog.yml should still remain possible. Otherwise I think we aren't solving the simple case here at all (@Galileo-Galilei: "I can create an extra configuration file where the user declares all these attributes, but this would be a much better user experience to be able to declare this schema in the catalog.").

@ syntax (@yetudada Approach C): I prefer the metadata key

To be honest I don't really understand this well enough to offer much of an informed opinion on it, but from what I see here I am not a big fan. A metadata key seems much more intuitive and obvious to users, and something similar is done in many other tools. The @ syntax feels kind of weird and unnatural. To me it sort of makes sense for transcoding to have a special @ syntax, because it determines the topology of the execution graph, but the other bits of metadata don't (there could be particular confusion here with @layer, which affects the kedro-viz topology but not node execution order).

Purely from an implementation point of view, it also sounds more complex to me (we'd need to parse all catalog entries with @ in their name in a special way, doing transcoding in some cases but not in others) vs. just ignoring the metadata key.
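A rough sketch of that extra special-casing (the suffix set and function below are purely illustrative, not Kedro's actual transcoding logic):

```python
# Every "name@suffix" catalog key would need to be split and classified:
# transcoding keeps its current meaning, anything else becomes metadata.
TRANSCODING_SUFFIXES = {"spark", "pandas"}  # hypothetical examples

def classify(catalog_key: str) -> str:
    name, sep, suffix = catalog_key.partition("@")
    if not sep:
        return "plain"
    return "transcoding" if suffix in TRANSCODING_SUFFIXES else "metadata"

print(classify("companies"))        # -> plain
print(classify("companies@spark"))  # -> transcoding
print(classify("companies@layer"))  # -> metadata
```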

@antonymilne
Contributor

antonymilne commented Mar 27, 2023

Looks like I posted at the same time as @yetudada but I basically agree with her here. Both my options 1 and 2 are sub-types of Approach A, which is my preferred one too.

I think the code snippets for Approach C and D should look more like this though:

Approach C

I'm pretty sure you wouldn't repeat type and filepath here, though it would be good if @idanov could give an example to clarify, since I also don't really understand the idea.

# catalog.yml

companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv

companies@layer: 
  layer: raw

Approach D

Here you would still need to provide metadata: companies_metadata in the catalog entry to provide the mapping to metadata.yml. Just like credentials, it's not done automatically (useful e.g. if you want to provide the same metadata to many catalog entries):

# catalog.yml

companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv
  metadata: companies_metadata  # Need this line

# metadata.yml

companies_metadata:
  layer: raw
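A sketch of the credentials-style resolution this implies (the function and dict shapes are assumed for illustration, not Kedro's implementation):

```python
# Replace a string reference under `metadata` with the matching block
# from metadata.yml, mirroring how credentials are resolved today.
def resolve_metadata(entry: dict, metadata_conf: dict) -> dict:
    ref = entry.get("metadata")
    if isinstance(ref, str):
        return {**entry, "metadata": metadata_conf[ref]}
    return entry

entry = {"type": "pandas.CSVDataSet", "metadata": "companies_metadata"}
resolved = resolve_metadata(entry, {"companies_metadata": {"layer": "raw"}})
print(resolved["metadata"])  # -> {'layer': 'raw'}
```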

@merelcht
Member Author

Thanks @yetudada and @AntonyMilneQB for the comments! Looking at all the examples, my preference still goes to Approach A.

I like the idea of A2, described by Antony, but I worry about the clarity of its use. As a new user coming into a project, you'd have to go and find out which metadata key does what, because some might be kedro-viz specific, others kedro-mlflow, etc. And since we'll let the plugins implement these, we wouldn't be able to document all key options on our side.

@noklam
Contributor

noklam commented Mar 27, 2023

I also favor the A1 approach since that's the most intuitive one to me.

I share @AntonyMilneQB's concern about namespace conflicts, and whether we need to take an extra step to protect the namespace in case there are multiple plugins. But I am not overly concerned about this, since we are also not handling this for plugins right now (plugins can potentially override each other and cause weird behaviour) and no one is complaining.

@Galileo-Galilei
Member

Galileo-Galilei commented Mar 27, 2023

Just an additional argument against the @ syntax: it would tightly couple the framework (catalog and ConfigLoader) and the datasets, because there would be some extra magic happening while loading the catalog with the ConfigLoader. This is not necessarily bad, but it creates some inconsistencies between the YAML and the Python API. While the Python API is not the recommended way to create datasets, I find it useful for a couple of use cases:

  • when I am debugging a complex catalog entry, I like creating the entry in a notebook with the Python API instead of the YAML API and handling it interactively
  • when I am creating unit tests, the Python API is much easier to handle than creating an entire kedro catalog with the proper config loader parsing just to test a dataset.

So I'd be more inclined to keep consistency between the two APIs as much as possible.
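A toy sketch (not Kedro code) of how an opaque metadata constructor argument would keep the two APIs aligned: the Python API receives exactly the same dict a YAML entry would declare, with no ConfigLoader magic in between.

```python
# Stand-in for a real dataset class: `metadata` is stored as-is and
# ignored by core, so the YAML entry and the Python call mirror each other.
class FakeCSVDataSet:
    def __init__(self, filepath, metadata=None):
        self.filepath = filepath
        self.metadata = metadata or {}  # opaque to core, read by plugins

ds = FakeCSVDataSet(
    filepath="data/01_raw/companies.csv",
    metadata={"kedro-viz": {"layer": "raw"}},
)
print(ds.metadata["kedro-viz"]["layer"])  # -> raw
```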

Regarding the extra file, this may become useful if plugins were to introduce very long and complex attributes (e.g. the entire schema of a big table), but it will be very easy to add later if needed. For now, as @AntonyMilneQB mentions, I feel as a kedro user that it would be much more readable to have these attributes near the dataset they belong to, so I can see them at a glance.

Just for clarity : I'd vote for approach A1 too ;)

@merelcht
Member Author

Thanks everyone for sharing your thoughts and preferences! It's very clear that everyone who commented here prefers the option with the top-level metadata key. I will now close this issue, as the decision on the syntax has been made; the implementation will be done in #2440.
