Add metadata #2961
Comments
Here is a gist with some Stata commands that show how labels propagate across a merge. I can use this gist to answer more questions about how metadata works in Stata, in order to guide behavior in DataFrames.jl.
Given the recent requests, we need to add at least minimal metadata in the 1.4 release, which I want to have out soon. Arrow supports This means that for now I propose that we have a The things to do are:
Please comment if you are OK with this plan, and I will then move forward with it.
I think this plan looks good. Yes. With respect to propagation rules, please see the gist above for some examples of propagation.
AFAICT the gist shows column-level metadata, whereas, at least for now, we are adding metadata at the level of the whole data frame. My thinking was that the following operations should preserve metadata:
This sounds like a good plan to me; I'm planning on doing another blitz on Arrow.jl in the next month or so, in case we need any additional support there.
@quinnj - the starting point is to move out
Sounds good. I would store metadata as a Also, let's keep in mind that storing metadata that applies to the whole data frame (as opposed to column-level metadata) can be more generally useful for any kind of object (as discussed at JuliaData/DataAPI.jl#22 (comment) and below), so the API could be extended to any type in the future via a generic mechanism such as Metadata.jl.
I feel like metadata only really gets useful when you have 1000+ columns. So I don't think a named tuple is a good idea.
A (I don't think the number of columns is relevant here as we're talking about global metadata, there isn't one entry for each column.)
I would go for a union with
If I understand your comment correctly, we should discuss it in JuliaData/DataAPI.jl#48, as that is the place to define the general API we want to follow, and DataFrames.jl would just implement this API. Right?
Hi @bkamins! Sorry for the time I went missing :D I had some problems... With respect to:
We can use two approaches:
Closed with #3055
This issue is opened to finally close the discussions in #35 and #2276 about metadata and to add this feature in the 1.4 release.
People need metadata on:
A single data frame or one of its columns can be annotated with metadata using https://github.com/Tokazama/Metadata.jl.
However, what we need are mechanisms in DataFrames.jl that define rules for how metadata is handled in transformations, which calls for a custom metadata implementation.
Finally, it seems that the most commonly needed metadata is column labels (verbose names) that could be used when pretty printing tables.
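To illustrate what column labels look like in practice, here is a minimal sketch using the metadata functions available in DataFrames.jl 1.4 and later (`metadata!`, `colmetadata!`, `metadata`, `colmetadata`). The data, the keys `"source"` and `"label"`, and the label text are made up for the example. It shows the API as later released, not a design proposed at this point in the discussion.

```julia
using DataFrames

# Hypothetical example data; the column names and values are made up.
df = DataFrame(id = 1:3, income = [10.0, 20.0, 30.0])

# Table-level metadata: one key/value pair attached to the whole data frame.
metadata!(df, "source", "example survey", style = :note)

# Column-level metadata: a verbose label attached to a single column.
colmetadata!(df, :income, "label", "Annual income (thousand USD)", style = :note)

# Read the values back by key.
source = metadata(df, "source")
label = colmetadata(df, :income, "label")
println(source)
println(label)
```

Metadata stored with `style = :note` is the kind that DataFrames.jl tries to carry over through transformations; the exact propagation rules are listed in the package documentation.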
Also, users occasionally ask for the ability to add row-level metadata (a.k.a. row names) to a data frame. I think we could add it as long as we clearly indicate that it will not be fast and that such row names should not be considered part of the data frame's data (this is speculative, and we could discuss whether we really agree here; it is just an idea).
To get the design right, I would like to raise the following issues:
serialize a data frame, but I think it is essential that we are consistent with Arrow.jl support for Apache Arrow metadata (so that if we have metadata in a data frame, then Arrow.jl can correctly save it to disk and read it back); since Arrow.jl also uses metadata to encode custom types, I would like to ask @quinnj to comment on what metadata format, if adopted in DataFrames.jl, would be compatible with the status and plans for Arrow.jl development.
In summary: I think that in order to resolve this issue consistently, we first need a decision at the Tables.jl/Arrow.jl level on what we want to support generically, and then we implement it in DataFrames.jl. The point is that unless metadata can be saved and loaded, its usefulness is limited.