Add metadata #2961

Closed
bkamins opened this issue Dec 11, 2021 · 12 comments

@bkamins
Member

bkamins commented Dec 11, 2021

This issue is opened to finally close the discussions in #35 and #2276 about metadata and to add this feature in the 1.4 release.

People need metadata on:

  • data frame level (e.g. creation date, source, ...)
  • column level (e.g. verbose label of a column, unit of measurement, notes)

For a single data frame or its column one can annotate them with metadata using https://github.com/Tokazama/Metadata.jl.
However, what we need are mechanisms in DataFrames.jl that define how metadata should be handled in transformations, which calls for a custom metadata implementation.
Finally, it seems that the most commonly needed metadata is column labels (verbose names) that could be used in pretty printing tables.
Also, users occasionally ask for the ability to add row level metadata (a.k.a. row names) to a data frame. I think we could add it as long as we clearly indicate that it will not be fast and that such row names should not be considered a part of the data frame's data (this is speculative and we could discuss whether we really agree here - this is just an idea).

To make the design right I would like to raise the following issues:

  • we need a way to save metadata if it is to be useful; of course one can always serialize a data frame, but I think it is essential that we are consistent with Arrow.jl support for Apache Arrow metadata (so that if we have metadata in a data frame then Arrow.jl can correctly save it to disk and read it back); since Arrow.jl also uses metadata to encode custom types I would like to ask @quinnj to comment on which metadata format, if adopted in DataFrames.jl, would be compatible with the status and plans for Arrow.jl development;
  • Relatedly, I think that if we do this we should agree on a metadata format for Tables.jl tables (of course as an opt-in, again defaulting to no metadata)
  • I think we need to make PrettyTables.jl metadata aware, and allow users to switch default printing of a data frame to pretty printing that uses column labels (and potentially row labels if we agree that we want them). Thus I am adding @ronisbr to the discussion (but probably the decision on what to do and how is for later, as first we need to resolve the points about Arrow.jl and Tables.jl).
  • Finally - we need metadata propagation rules. I know that @pdeffebach has some thoughts on how they should be defined.
  • As usual I add @nalimilan for general comments.

In summary: I think that in order to resolve this issue consistently we first need a decision at the Tables.jl/Arrow.jl level on what we want to support generically, and then we can implement it in DataFrames.jl. The point is that unless metadata can be saved and loaded, its usefulness is limited.

@pdeffebach
Contributor

Here is a gist with some Stata commands that show how labels propagate across a merge. I can use this gist to answer more questions about how metadata works in Stata, in order to guide behavior in DataFrames.jl.

@bkamins
Member Author

bkamins commented May 7, 2022

Given the recent requests, we need to add at least minimal metadata support in the 1.4 release, which I want to have soon.
The design should be consistent with https://arrow.juliadata.org/stable/reference/#Arrow.getmetadata-Tuple{Arrow.Table} and https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata (at least as much as possible).

Arrow supports Dict{String, String} to store metadata at the table and column level, and uses : as the namespace separator, with ARROW being a reserved top-level namespace. I propose to try to follow the same rules, as this would allow saving DataFrames.jl metadata in Arrow.jl easily (and also the reverse - populating DataFrames.jl metadata from Arrow.jl files).
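To make the key convention concrete, here is a minimal sketch using only Base Julia. The helper names `is_reserved` and `set_user_metadata!` and the `myapp` namespace are hypothetical, made up for illustration; they are not part of Arrow.jl or DataFrames.jl.

```julia
# Table-level metadata as a flat String => String dictionary.
meta = Dict{String,String}(
    "source" => "survey 2021",
    "myapp:created" => "2021-12-11",  # "myapp" is a hypothetical user namespace
)

# A key is treated as reserved when it starts with the "ARROW:" namespace prefix.
is_reserved(key::AbstractString) = startswith(key, "ARROW:")

# Hypothetical helper: store a user entry, refusing keys in the reserved namespace.
function set_user_metadata!(meta::Dict{String,String}, key::AbstractString, value::AbstractString)
    is_reserved(key) && throw(ArgumentError("key \"$key\" uses the reserved ARROW namespace"))
    meta[key] = value
    return meta
end

set_user_metadata!(meta, "myapp:unit", "EUR")
```

Keeping both keys and values as strings is what would let such a dictionary round-trip through an Arrow file without any extra encoding step.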

This means that I propose that, for now, we have DataFrame level metadata only (it is uncontroversial and clearly needed - we can add other metadata later if we want) and store a single Dict{String, String} in the DataFrame object (again - we can make it more flexible later if needed). I would use a getmetadata(::AbstractDataFrame) function to get the metadata dictionary (it would always be initialized - the cost is 80ns and 512 bytes - I hope we think it is acceptable). A small additional benefit would be that metadata would be type stable. For now I would assume that users manually work with the returned dictionary (add/remove elements from it).

The things to do are:

  • add the functionality to DataFrames.jl - I will do it;
  • make a decision about propagation of metadata under various operations (I will discuss the list of operations that should preserve metadata with @pdeffebach);
  • discuss with @quinnj how to integrate reading/writing of this metadata with Arrow.jl;
  • move getmetadata definition to DataAPI.jl so that Arrow.jl and DataFrames.jl can just add methods to this function (@quinnj - I hope you are OK with this); DataFrames.jl would export getmetadata;

Please comment if you are OK with this plan and I will move forward with it then.

@pdeffebach
Contributor

I think this plan looks good. Yes.

w.r.t. propagation rules, please see the gist above for some examples of propagation.

@bkamins
Member Author

bkamins commented May 7, 2022

> w.r.t. propagation rules, please see the gist above for some examples of propagation.

AFAICT the gist shows column-level metadata and, at least for now, we are adding only whole data frame level metadata. My thinking on how operations should treat this metadata was the following:

  • indexing, views, sorting, filtering, flattening: all preserved
  • reshaping: metadata not preserved
  • joining: left metadata updated with right metadata (but if right is the same as left then left is kept)
  • select/transform: I was not sure, but I think it should be preserved
  • combine: I was not sure, as now the interpretation can change (as opposed to select/transform) but maybe also preserve it for consistency?
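A rough sketch of three of these rules in Base Julia follows. `MetaTable` and the function names are hypothetical stand-ins, not DataFrames.jl API. For joins the sketch assumes that "updated with" means the right-hand value wins when both sides define the same key; that conflict rule is an assumption, not something settled in this thread.

```julia
# Hypothetical stand-in for a table that carries table-level metadata.
struct MetaTable
    values::Vector{Int}
    metadata::Dict{String,String}
end

# Filtering: metadata is preserved (copied, so the result does not alias the source).
keep_values(pred, t::MetaTable) = MetaTable(filter(pred, t.values), copy(t.metadata))

# Reshaping: metadata is not preserved, so the result starts with an empty dictionary.
reshaped(t::MetaTable, new_values::Vector{Int}) = MetaTable(new_values, Dict{String,String}())

# Joining: left metadata updated with right metadata.
# ASSUMPTION: on a key conflict the right-hand value wins (this is what `merge` does).
join_metadata(left::Dict{String,String}, right::Dict{String,String}) = merge(left, right)

t = MetaTable([1, 2, 3, 4], Dict("source" => "left.csv"))
evens = keep_values(iseven, t)
wide = reshaped(t, [10, 20])
joined = join_metadata(t.metadata, Dict("source" => "right.csv", "note" => "merged"))
```

Note that when the right-hand metadata is identical to the left-hand metadata, `merge` returns a dictionary equal to the left one, which matches the "left is kept" case.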

@quinnj
Member

quinnj commented May 8, 2022

This sounds like a good plan to me; I'm planning on doing another blitz on Arrow.jl in the next month or so, in case we need any additional support there.

@bkamins
Member Author

bkamins commented May 8, 2022

@quinnj - the starting point is to move getmetadata out to DataAPI.jl and make Arrow.jl just add a method to this function. I think the fallback getmetadata(::Any) = nothing should then be defined in DataAPI.jl (and this fallback should then be removed from Arrow.jl).
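In plain Julia the pattern would look roughly like the sketch below. Everything is defined in one file to keep it self-contained: in the proposal the generic function and its fallback would live in DataAPI.jl, and packages would add methods for their own types. `MyTable` is a hypothetical type used only for illustration.

```julia
# Generic function with a fallback: objects without metadata return `nothing`.
# (In the proposal this definition would be owned by DataAPI.jl.)
getmetadata(::Any) = nothing

# Hypothetical table type owned by some downstream package.
struct MyTable
    columns::Dict{Symbol,Vector{Int}}
    metadata::Dict{String,String}
end

# The downstream package only adds a method for its own type.
getmetadata(t::MyTable) = t.metadata

tbl = MyTable(Dict(:a => [1, 2, 3]), Dict("source" => "example"))
```

With this layout, generic code can call `getmetadata(x)` on any object and only needs to handle the `nothing` case.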

see JuliaData/DataAPI.jl#48

@nalimilan
Member

Sounds good. I would store metadata as a Union{Dict, Nothing} field to avoid the overhead of allocating a new dict even when metadata isn't used. Another solution mentioned by @quinnj in another issue would be to return the empty NamedTuple(), which is essentially as fast but allows the user to call keys, isempty, etc. without checking whether there are actually metadata or not. This could also work for the fallback definition. Not sure which approach is best.
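As a sketch of the first option (a Union{Dict, Nothing} field that is only allocated on first write), assuming a hypothetical `LazyMetaTable` type and hypothetical accessor names:

```julia
# Hypothetical table type whose metadata field stays `nothing` until it is needed.
mutable struct LazyMetaTable
    data::Vector{Int}
    metadata::Union{Nothing,Dict{String,String}}
end

# Convenience constructor: no dictionary is allocated up front.
LazyMetaTable(data::Vector{Int}) = LazyMetaTable(data, nothing)

# Returns `nothing` when no metadata has been set, otherwise the dictionary.
lazy_getmetadata(t::LazyMetaTable) = t.metadata

# Allocates the dictionary on first use, then stores the entry.
function lazy_setmetadata!(t::LazyMetaTable, key::AbstractString, value::AbstractString)
    if t.metadata === nothing
        t.metadata = Dict{String,String}()
    end
    t.metadata[key] = value
    return t
end

lt = LazyMetaTable([1, 2, 3])
before = lazy_getmetadata(lt)
lazy_setmetadata!(lt, "source", "example")
after = lazy_getmetadata(lt)
```

The trade-off is that callers have to check for `nothing` before calling keys, isempty, etc., which is exactly what the empty NamedTuple() alternative tries to avoid.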

Also, let's keep in mind that given that storing metadata that applies to the whole data frame (as opposed to column-level metadata) can be more generally useful for any kind of object (as discussed at JuliaData/DataAPI.jl#22 (comment) and below), the API could be extended to any type in the future via a generic mechanism such as Metadata.jl. DataFrame would implement via an internal field an API that other types would implement using a fallback method that uses a global dict.

@pdeffebach
Contributor

> NamedTuple(), which is essentially as fast but allows the user to call keys, isempty, etc. without checking whether there are actually metadata or not. This could also work for the fallback definition. Not sure which approach is best.

I feel like metadata only really gets useful when you have 1000+ columns. So I don't think a named tuple is a good idea.

@nalimilan
Member

A NamedTuple would only be returned when there's no metadata, instead of returning nothing. But that's indeed a bit weird.

(I don't think the number of columns is relevant here as we're talking about global metadata, there isn't one entry for each column.)

@bkamins
Member Author

bkamins commented May 9, 2022

I would go for a union with Nothing then

> the API could be extended to any type in the future via a generic mechanism such as Metadata.jl. DataFrame would implement via an internal field an API that other types would implement using a fallback method that uses a global dict.

If I understand your comment correctly then we should discuss it in JuliaData/DataAPI.jl#48 as that is the place to define the general API we want to follow and DataFrames.jl would just implement this API. Right?

@ronisbr
Member

ronisbr commented May 9, 2022

Hi @bkamins !

Sorry for the time I went missing :D I had some problems...

With respect to:

> I think we need to make PrettyTables.jl metadata aware, and allow users to switch default printing of data frame to pretty printing that uses column labels (and potentially row labels if we agree that we want it). Thus I am adding @ronisbr to the discussion (but probably the decision what and how to do it is for later as first we need to resolve the points about Arrow.jl and Tables.jl).

We can use two approaches:

  1. (The better one) Add the code inside DataFrames to support printing the metadata as soon as we decide what it is and how to print it.
  2. (The slower one) After a global metadata interface in Tables.jl is defined, I add support in PrettyTables.jl and then DataFrames will have it by default.

@bkamins
Member Author

bkamins commented Sep 20, 2022

Closed with #3055

bkamins closed this as completed Sep 20, 2022