Add metadata #2961

bkamins · 2021-12-11T10:49:41Z

This issue is opened to finally close the discussions in #35 and #2276 about metadata and to add this feature in the 1.4 release.

People need metadata on:

- data frame level (e.g. creation date, source, ...)
- column level (e.g. verbose label of a column, unit of measurement, notes)

For a single data frame or its column one can annotate them with metadata using https://github.com/Tokazama/Metadata.jl.
However, what we need are mechanisms in DataFrames.jl that define rules for how metadata should be handled in transformations, which calls for a custom metadata implementation.
Finally, it seems that the most commonly needed metadata is column labels (verbose names) that could be used in pretty printing tables.
Also, users occasionally ask for the ability to add row level metadata (a.k.a. row names) to a data frame. I think we could add it as long as we clearly indicate that it will not be fast and that such row names should not be considered a part of data frame data (this is speculative and we could discuss if we really agree here - this is just an idea).

To make the design right I would like to raise the following issues:

- We need a way to save metadata if it is to be useful. Of course one can always serialize a data frame, but I think it is essential that we are consistent with Arrow.jl support for Apache Arrow metadata (so that if we have metadata in a data frame then Arrow.jl can correctly save it to disk and read it back). Since Arrow.jl also uses metadata to encode custom types, I would like to ask @quinnj to comment on what metadata format adopted in DataFrames.jl would be compatible with the status and plans for Arrow.jl development.
- Relatedly, I think that if we do it we should agree on a metadata format for Tables.jl tables (of course as an opt-in, again defaulting to no metadata).
- I think we need to make PrettyTables.jl metadata aware, and allow users to switch default printing of a data frame to pretty printing that uses column labels (and potentially row labels if we agree that we want them). Thus I am adding @ronisbr to the discussion (but probably the decision on what to do and how is for later, as first we need to resolve the points about Arrow.jl and Tables.jl).
- Finally, we need metadata propagation rules. I know that @pdeffebach has some thoughts on how they should be defined.
- As usual I add @nalimilan for general comments.

In summary: I think that in order to resolve this issue consistently we first need a decision at the Tables.jl/Arrow.jl level on what we want to support generically, and then we can implement it in DataFrames.jl. The point is that unless metadata can be saved/loaded its usefulness is limited.

pdeffebach · 2022-01-31T17:07:25Z

Here is a gist with some Stata commands to show how labels propagate across a merge. I can use this gist to answer more questions about how metadata works in Stata in order to guide behavior in DataFrames.jl.

bkamins · 2022-05-07T16:09:34Z

Given the recent requests, we need to add at least minimal metadata support in the 1.4 release, which I want to have out soon.
The design should be consistent with https://arrow.juliadata.org/stable/reference/#Arrow.getmetadata-Tuple{Arrow.Table} and https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata (at least as much as possible).

Arrow supports Dict{String, String} to store metadata on table and column level and uses : as the namespace separator, with ARROW being the reserved top-level namespace. I propose to try to follow the same rules, as this would allow saving DataFrames.jl metadata in Arrow.jl easily (and also the reverse - populating DataFrames.jl metadata from Arrow.jl files).
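To make the key convention concrete, here is a minimal sketch in plain Julia of a string-to-string metadata dictionary with : separated namespaces. The helper names (top_namespace, is_reserved) and the example keys are hypothetical and are not part of Arrow.jl or DataFrames.jl; only the Dict{String, String} shape, the : separator and the reserved ARROW namespace come from the description above.

```julia
# Reserved top-level namespace, as described for Arrow custom metadata.
const RESERVED_NAMESPACE = "ARROW"

# Hypothetical helper: the part of the key before the first ':'
# (or the whole key when there is no separator).
top_namespace(key::AbstractString) = String(first(split(key, ':'; limit=2)))

# Hypothetical helper: does the key live in the reserved namespace?
is_reserved(key::AbstractString) = top_namespace(key) == RESERVED_NAMESPACE

# Table-level metadata in the proposed shape (illustrative keys and values).
table_meta = Dict{String, String}(
    "source" => "survey 2021",
    "myapp:created" => "2021-12-11",
)

# User keys that would be safe to write under this convention.
user_keys = [k for k in keys(table_meta) if !is_reserved(k)]
```

Under these assumptions, is_reserved("ARROW:extension:name") is true while both keys in table_meta pass the check.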

This means that I propose we have only DataFrame-level metadata for now (as it is uncontroversial and clearly needed - we can add other metadata later if we want) and store a single Dict{String, String} in the DataFrame object (again - we can make it more flexible later if needed). I would use a getmetadata(::AbstractDataFrame) function to get the metadata dictionary (which would always be initialized - the cost is 80ns and 512 bytes - I hope we think it is acceptable). A small additional benefit would be that metadata would be type stable. For now, I would assume that users manually work with the returned dictionary (adding/removing elements from it).
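A rough sketch of what this proposal could look like, using a toy type rather than the real DataFrame. ToyTable and its fields are hypothetical stand-ins; the getmetadata defined here is a local function written for illustration, not the DataFrames.jl or DataAPI.jl one.

```julia
# Hypothetical stand-in for a data frame that always carries a metadata dict.
struct ToyTable
    columns::Dict{Symbol, Vector}
    metadata::Dict{String, String}
end

# The metadata dictionary is always initialized, even when unused.
ToyTable(columns) = ToyTable(columns, Dict{String, String}())

# Always returns a Dict{String, String}, so the return type is stable.
getmetadata(t::ToyTable) = t.metadata

t = ToyTable(Dict{Symbol, Vector}(:x => [1, 2, 3]))

# Users work with the returned dictionary directly.
meta = getmetadata(t)
meta["source"] = "sensor A"
meta["created"] = "2022-05-07"
delete!(meta, "created")
```

Because getmetadata hands back the stored dictionary itself, the changes made through meta are visible on the next getmetadata(t) call.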

The things to do are:

- add the functionality to DataFrames.jl - I will do it;
- make a decision about propagation of metadata under various operations (I will discuss with @pdeffebach the list of operations that should preserve metadata);
- discuss with @quinnj how to integrate reading/writing of this metadata with Arrow.jl;
- move the getmetadata definition to DataAPI.jl so that Arrow.jl and DataFrames.jl can just add methods to this function (@quinnj - I hope you are OK with this); DataFrames.jl would export getmetadata.

Please comment if you are OK with this plan and I will move forward with it then.

pdeffebach · 2022-05-07T16:19:57Z

I think this plan looks good. Yes.

w.r.t. propagation rules, please see the gist above for some examples of propagation.

bkamins · 2022-05-07T16:31:48Z

> w.r.t. propagation rules, please see the gist above for some examples of propagation.

AFAICT the gist shows column-level metadata and, at least for now, we are adding only metadata at the level of the whole data frame. My thinking was that metadata should be handled as follows:

- indexing, views, sorting, filtering, flattening: all preserved
- reshaping: metadata not preserved
- joining: left metadata updated with right metadata (but if right is the same as left then left is kept)
- select/transform: I was not sure, but I think it should be preserved
- combine: I was not sure, as now the interpretation can change (as opposed to select/transform) but maybe also preserve it for consistency?
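The rules listed above can be sketched on bare dictionaries. The function names below are hypothetical and only illustrate the proposed behavior for data-frame-level metadata; they are not DataFrames.jl functions, and whether an operation copies or shares the dictionary is an assumption made here for simplicity.

```julia
# Indexing, sorting, filtering, flattening (and, tentatively, select/transform/
# combine): the result carries the same metadata, modeled here as a copy.
propagate_preserving(meta::Dict{String, String}) = copy(meta)

# Reshaping: metadata is not preserved, so the result starts empty.
propagate_reshape(::Dict{String, String}) = Dict{String, String}()

# Joining: left metadata updated with right metadata. On a key collision the
# right value wins; when both sides agree, the result equals the left.
propagate_join(left::Dict{String, String}, right::Dict{String, String}) =
    merge(left, right)

left = Dict("source" => "A", "unit" => "kg")
right = Dict("source" => "B", "note" => "cleaned")

joined = propagate_join(left, right)
```

With these inputs, joined keeps "unit" from the left, gains "note" from the right, and takes "source" from the right.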

quinnj · 2022-05-08T04:05:27Z

This sounds like a good plan to me; I'm planning on doing another blitz on Arrow.jl in the next month or so, in case we need any additional support there.

bkamins · 2022-05-08T06:20:51Z

@quinnj - the starting point is to move getmetadata out to DataAPI.jl and make Arrow.jl just add a method to this function. I think the fallback getmetadata(::Any) = nothing should then be defined in DataAPI.jl (and this fallback should then be removed from Arrow.jl).

See JuliaData/DataAPI.jl#48
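The arrangement described here, one package owning the generic function and its fallback while other packages add methods, might look roughly like this. ToyDataAPI and ToyArrow are hypothetical stand-in modules written for illustration; they do not reproduce the actual contents of DataAPI.jl or Arrow.jl.

```julia
# Hypothetical stand-in for the package that owns the generic function.
module ToyDataAPI
# Fallback: objects without metadata support return nothing.
getmetadata(::Any) = nothing
end

# Hypothetical stand-in for a package that supports metadata.
module ToyArrow
import ..ToyDataAPI: getmetadata

struct Table
    meta::Dict{String, String}
end

# Add a method to the shared function instead of defining a new one.
getmetadata(t::Table) = t.meta
end

tbl = ToyArrow.Table(Dict("source" => "file.arrow"))

no_meta = ToyDataAPI.getmetadata(42)    # hits the fallback
tbl_meta = ToyDataAPI.getmetadata(tbl)  # hits the ToyArrow method
```

Since both modules extend the same function object, callers only need to know about the owning module.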

nalimilan · 2022-05-08T17:07:01Z

Sounds good. I would store metadata as a Union{Dict, Nothing} field to avoid the overhead of allocating a new dict even when metadata isn't used. Another solution mentioned by @quinnj in another issue would be to return the empty NamedTuple(), which is essentially as fast but allows the user to call keys, isempty, etc. without checking whether there are actually metadata or not. This could also work for the fallback definition. Not sure which approach is best.
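A minimal sketch of the Union{Dict, Nothing} idea, with the dictionary allocated only on first write. LazyMetaTable and setmeta! are hypothetical names used for illustration, not an existing API.

```julia
# Hypothetical table-like type whose metadata field starts out as nothing.
mutable struct LazyMetaTable
    metadata::Union{Dict{String, String}, Nothing}
end

LazyMetaTable() = LazyMetaTable(nothing)

# Returns nothing when no metadata has been set, so callers must check.
getmetadata(t::LazyMetaTable) = t.metadata

# Hypothetical setter: allocate the dictionary lazily on the first write.
function setmeta!(t::LazyMetaTable, key::String, value::String)
    meta = t.metadata
    if meta === nothing
        meta = Dict{String, String}()
        t.metadata = meta
    end
    meta[key] = value
    return t
end

lazy = LazyMetaTable()
before = getmetadata(lazy)      # no dictionary allocated yet
setmeta!(lazy, "source", "sensor A")
after = getmetadata(lazy)

# The alternative mentioned above: an empty NamedTuple() as the
# "no metadata" value still answers keys and isempty.
empty_meta = NamedTuple()
```

The trade-off is visible in the sketch: the Union field avoids the allocation, but code reading the metadata has to handle nothing, whereas isempty(empty_meta) and keys(empty_meta) work without a check.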

Also, let's keep in mind that given that storing metadata that applies to the whole data frame (as opposed to column-level metadata) can be more generally useful for any kind of object (as discussed at JuliaData/DataAPI.jl#22 (comment) and below), the API could be extended to any type in the future via a generic mechanism such as Metadata.jl. DataFrame would implement via an internal field an API that other types would implement using a fallback method that uses a global dict.

pdeffebach · 2022-05-09T06:34:04Z

> NamedTuple(), which is essentially as fast but allows the user to call keys, isempty, etc. without checking whether there are actually metadata or not. This could also work for the fallback definition. Not sure which approach is best.

I feel like metadata only really gets useful when you have 1000+ columns. So I don't think a named tuple is a good idea.

nalimilan · 2022-05-09T07:27:51Z

A NamedTuple would only be returned when there's no metadata, instead of returning nothing. But that's indeed a bit weird.

(I don't think the number of columns is relevant here as we're talking about global metadata, there isn't one entry for each column.)

bkamins · 2022-05-09T10:41:15Z

I would go for a union with Nothing then.

> the API could be extended to any type in the future via a generic mechanism such as Metadata.jl. DataFrame would implement via an internal field an API that other types would implement using a fallback method that uses a global dict.

If I understand your comment correctly then we should discuss it in JuliaData/DataAPI.jl#48 as that is the place to define the general API we want to follow and DataFrames.jl would just implement this API. Right?

ronisbr · 2022-05-09T22:57:35Z

Hi @bkamins !

Sorry for the time I went missing :D I had some problems...

With respect to:

> I think we need to make PrettyTables.jl metadata aware, and allow users to switch default printing of data frame to pretty printing that uses column labels (and potentially row labels if we agree that we want it). Thus I am adding @ronisbr to the discussion (but probably the decision what and how to do it is for later as first we need to resolve the points about Arrow.jl and Tables.jl).

We can use two approaches:

- (The better one) Add the code inside DataFrames to support printing the metadata as soon as we decide what it is and how to print it.
- (The slower one) After a global metadata interface in Tables.jl is defined, I add support in PrettyTables.jl and then DataFrames will have it by default.

bkamins · 2022-09-20T07:45:33Z

Closed with #3055

bkamins added the feature label Dec 11, 2021

bkamins added this to the 1.4 milestone Dec 11, 2021

bkamins added the metadata label Jan 31, 2022

bkamins mentioned this issue Jan 31, 2022: Continue adding Metadata to dataframes #1458 (Closed)

chiraganand mentioned this issue May 2, 2022: Add consistency checks to expensive operations xKDR/TSFrames.jl#29 (Open)

bkamins mentioned this issue May 8, 2022: add metadata JuliaData/DataAPI.jl#48 (Merged)

bkamins mentioned this issue May 22, 2022: Metadata on data frame and column level #3055 (Merged)

bkamins closed this as completed Sep 20, 2022
