A second attempt at DataFrames Metadata #1429

Closed

pdeffebach
Contributor

This is another attempt at adding metadata support to DataFrames, modeled after #1413. The ultimate API for metadata is the same in this approach as in that one. Users will be able to create arbitrary metadata fields, with a special case for a label that other packages, such as plotting packages, will hopefully be able to use.

However, the implementation of metadata differs substantially. A MetaData type is just a wrapper for a Dict from Symbol to Vector{String}. The goal is for df.metadata to behave exactly like df.columns. This means that the index.jl code does not need to be changed at all. getindex and setindex! will work the same for df.metadata as for df.columns, enabling code that looks like this:

function insert(df, col_ind, item, name)
    insert!(index(df), col_ind, name)
    insert!(columns(df), col_ind, item)
    insert!(metadata(df), col_ind) # the only change that had to be made to the insert method
end
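As a rough illustration of the idea above (not the PR's actual code), a MetaData wrapper that follows column selection and insertion could look like the sketch below. The struct layout, the `getindex` signature, and the default empty-string value are assumptions made for this example.

```julia
# Simplified, hypothetical stand-in for the MetaData type described above:
# one vector per metadata field, with one entry per data frame column.
struct MetaData
    columndata::Dict{Symbol, Vector{String}}
end

# Selecting columns selects the same positions in every field vector.
function Base.getindex(x::MetaData, col_inds::AbstractVector{Int})
    selected = Dict{Symbol, Vector{String}}()
    for (field, vals) in x.columndata
        selected[field] = vals[col_inds]
    end
    return MetaData(selected)
end

# Inserting a column inserts one entry at the same position in every field.
function Base.insert!(x::MetaData, col_ind::Int, value::String="")
    for vals in values(x.columndata)
        insert!(vals, col_ind, value)
    end
    return x
end

meta = MetaData(Dict(:label => ["id", "height", "weight"]))
sub = meta[[1, 3]]      # metadata for columns 1 and 3 only
insert!(meta, 2)        # a new column at position 2 gets an empty label
```

Because the wrapper only deals with integer positions, it needs no knowledge of column names, which is what lets it mirror `columns(df)`.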

I have deliberately touched only a small amount of code for this PR.

  • Added a new file src/other/metadata.jl which defines basic operations like getindex, setindex! and append for metadata.
  • Added only a single new constructor to dataframes, such that a new dataframe always creates a dataframe with empty metadata.
    • This is probably not desirable, but new constructors can be added later.
  • Implemented getindex, setindex! etc. such that any subsetting, adding, and merging will work and preserve metadata.
  • Added only three functions exposed to the user: addlabel!, showlabel and showlabels. While adding arbitrary metadata fields (:unit etc.) is feasible under the current system, I didn't want to complicate the API while we sort out how the interface in general might work.
  • Metadata is only strings, and an empty metadata entry is just an empty string "". When a new column gets added to the dataframe, I just push "" onto the end of each vector in the metadata dictionary.
  • Global metadata, i.e. metadata that is tied to a dataframe as a whole and not just to a column, is not supported, because this will presumably be easy to implement in the future.

This is just a stab at one implementation, and if people decide metadata should be implemented differently, that's fine and there can be another PR for another method.

Appreciate any feedback, thanks.

Member

nalimilan left a comment

Thanks! It's really nice to have metadata behave so close to index, this makes the code easier to follow. My main remark is that we'd better avoid exporting any new convenience functions for now, and instead rely on calling getindex and setindex! on the MetaData object itself (see inline comment).

@@ -292,7 +302,8 @@ end
function Base.getindex(df::DataFrame, row_inds::AbstractVector, col_inds::AbstractVector)
selected_columns = index(df)[col_inds]
new_columns = Any[dv[row_inds] for dv in columns(df)[selected_columns]]
return DataFrame(new_columns, Index(_names(df)[selected_columns]))
new_metadata = metadata(df)[selected_columns]
Member

new_metadata isn't used. BTW, you could do the same in the getindex method above, that's more similar to what we do for the index.

Contributor Author

I think I fixed this by calling a constructor with new_metadata

Member

I don't understand: new_metadata still isn't used (and that's fine, just remove it).

Contributor Author

fixed.

@@ -709,6 +724,7 @@ function Base.insert!(df::DataFrame, col_ind::Int, item::AbstractVector, name::S
end
insert!(index(df), col_ind, name)
insert!(columns(df), col_ind, item)
insert!(metadata(df), col_ind)
Member

Better insert nothing for clarity. Then you can remove that insert! method for MetaData and always require a value to be passed.

Contributor Author

Done. And I think I removed all references to making things "String".

Though thinking about the broader uses of labels, like with @df in statsplots, there is value in at least a more tightly controlled :label field, which must be a string. This way other packages can interface with :label and know what they are getting.

##############################################################################

"""
addlabel!(df::DataFrame, var::Symbol, label::String)
Member

I'd rather not add special functions like that. As noted on the other issue, better just export metadata(::DataFrame), and implement methods like getindex(::MetaData, field::Symbol[, column::ColumnIndex]) and setindex!(::MetaData, value::Any, field::Symbol[, column::ColumnIndex]).
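The indexing methods suggested here could be sketched as follows. This is only an illustration under assumptions: the field name `dict`, the `Vector{Any}` storage, and plain `Int` column indices (instead of `ColumnIndex`) are choices made for the example, not the PR's code.

```julia
# Hypothetical minimal MetaData type used only for this sketch.
struct MetaData
    dict::Dict{Symbol, Vector{Any}}
end

# meta[:label] returns the whole vector for a field;
# a missing field raises the Dict's own KeyError.
Base.getindex(x::MetaData, field::Symbol) = x.dict[field]

# meta[:label, 2] returns the entry for one column.
Base.getindex(x::MetaData, field::Symbol, col::Int) = x.dict[field][col]

# meta[:label, 2] = value sets the entry for one column.
function Base.setindex!(x::MetaData, value, field::Symbol, col::Int)
    x.dict[field][col] = value
    return x
end

meta = MetaData(Dict{Symbol, Vector{Any}}(:label => Any[nothing, nothing]))
meta[:label, 2] = "Height in cm"
```

With these methods no label-specific function is needed: a label is just the `:label` field accessed through ordinary indexing.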

Contributor Author

Is this just for now, or do you envision this in the future? Because an idiomatic way to set labels, particularly with chaining in mind, first argument is a dataframe, returns a dataframe, seems important.

Contributor Author

Another thing with this approach is that metadata doesn't know anything about the symbols and the colindex of the dataframe. So any function would have to include the dataframe in it, right?

Member

Is this just for now, or do you envision this in the future? Because an idiomatic way to set labels, particularly with chaining in mind, first argument is a dataframe, returns a dataframe, seems important.

I don't know yet, that's why I'd rather start with the strictly minimal API.

Contributor Author

I changed addlabel to something more generic. To be honest, I just had it there because I was lazy when testing it out.

Contributor Author

I also changed getmeta and showmeta to be more generic. showmeta now returns a dataframe, but it's probably a bad idea because dataframes aren't that readable for long strings. But it was easy to write and not a horrible idea.

Member

As I said, let's start with a minimal implementation. We can always add convenience method later if that's useful.

I'd also rather rename getmeta to metadata and addmeta! to setmetadata! or metadata!.

Contributor Author

I understand better now what you are saying about setindex!. It might be nice to have

metadata(df, :x1, :label) = "A variable label"

work.

abstract type AbstractMetaData end

mutable struct MetaData <: AbstractMetaData
columndata::Dict{Symbol, Vector{String}}
Member

Use four-space indent (here and elsewhere).

Contributor Author

fixed my sublime settings.

end

function newfield!(x::MetaData, ncol::Int, field::Symbol,)
x.columndata[field] = ["" for i in 1:ncol]
Member

Better use nothing than the empty string. It would also be nice to support any type, not just String. That shouldn't make the code really more complex.

Contributor Author

Added: Vector{Any}([nothing for i in 1:N]). This way the user can include arbitrary things. However, I am kind of worried that the ability to include arbitrary objects in metadata will cause people to abuse metadata to make it hold actual (non-meta) data.

@@ -749,6 +765,10 @@ merge!(df, df2) # column z is added, column id is overwritten
"""
function Base.merge!(df::DataFrame, others::AbstractDataFrame...)
for other in others
notinother = setdiff(names(other), names(df))
Member

It would probably be cleaner to define merge!(::MetaData, ::MetaData...).

Contributor Author

I was trying to make all the code in metadata.jl not know anything about the dataframe attached to it. Since this code is really about working with the distinct names of both dataframes, I put it here.

Member

OK. I'm a bit concerned about the fact that this allocates, even when one isn't using meta-data at all. We should really ensure there's a minimal or zero overhead in that case. Maybe we can handle that by checking whether metadata is empty first?

In terms of code organization, maybe this could be moved to a function taking Index objects for two data frames, and it could also be used for join operations?
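A sketch of the "check whether metadata is empty first" idea might look like the following. The helper name `append_columns!`, its arguments, and the `Vector{Any}` storage are all hypothetical; the caller is assumed to pass the number of columns already in the target and the positions of the columns taken from the other data frame.

```julia
# Hypothetical minimal MetaData type used only for this sketch.
struct MetaData
    dict::Dict{Symbol, Vector{Any}}
end

# Extend `x` (which covers `ncol` columns) with the metadata of the
# columns `other_cols` taken from `other`.
function append_columns!(x::MetaData, ncol::Int, other::MetaData,
                         other_cols::Vector{Int})
    # Fast path: no field on either side, so nothing is allocated.
    isempty(x.dict) && isempty(other.dict) && return x
    fields = union(collect(keys(x.dict)), collect(keys(other.dict)))
    for field in fields
        # A field known only to `other` starts out as `nothing` for
        # every column already present in `x`.
        vals = get!(x.dict, field) do
            Any[nothing for _ in 1:ncol]
        end
        for col in other_cols
            push!(vals, haskey(other.dict, field) ? other.dict[field][col] : nothing)
        end
    end
    return x
end
```

Only the first line of the function body runs when neither data frame carries metadata, so the common case adds two `isempty` checks and no allocation.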

Contributor Author

I added a function diff_indices in index.jl that returns just the indices of the columns in one dataframe that aren't in the other. This might be useful in joins.

Contributor Author

pdeffebach left a comment

I'm not sure I meant to start a review.

I have addressed your comments, but unfortunately couldn't change everything. Mostly because it's hard to think of adding metadata to select variables without using a wrapper function due to finding the right index to use for the column name.

A large issue to start thinking about is operations that call constructors. join should probably have metadata persist, but it doesn't currently because it calls a new constructor. Saving the metadata from both dataframes and tacking it on after the constructor is called seems inelegant.

@pdeffebach
Contributor Author

pdeffebach commented Jun 26, 2018

To clarify my above comments, if we had the user use

setfield!(metadata(df), :columndata, info...)

We would run into the problem where metadata objects don't know about the names of the columns in the dataframe. getfield(metadata(df), :columndata) will only return a Dict that is a bunch of arrays. So info in the above argument would have to be a vector of the right length, and with all the existing metadata just right.

Perhaps we should have a setup similar to a dataframerow, so that metadata can see the dataframe it is attached to? But this hurts us because it means metadata(df) behaves less like columns(df).

Member

nalimilan left a comment

We would run into the problem where metadata objects don't know about the names of the columns in the dataframe. getfield(metadata(df), :columndata) will only return a Dict that is a bunch of arrays. So info in the above argument would have to be a vector of the right length, and with all the existing metadata just right.

Perhaps we should have a setup similar to a dataframerow, so that metadata can see the dataframe it is attached to? But this hurts us because it means metadata(df) behaves less like columns(df).

Good points. Let's take the other approach then: make the MetaData type invisible to the user, and provide metadata and metadata! (or setmetadata!?) methods to set it (the internal metadata function can be renamed to e.g. meta).
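To make the proposal concrete, here is a small sketch of a getter and setter that hide the storage from the user. `ToyFrame` is a toy stand-in for DataFrame invented for this example, and the signatures of `metadata` and `metadata!` are assumptions, since the thread has not settled on them.

```julia
# Toy stand-in for a data frame: column names plus internal metadata storage.
struct ToyFrame
    names::Vector{Symbol}
    meta::Dict{Symbol, Vector{Any}}
end

ToyFrame(names::Vector{Symbol}) = ToyFrame(names, Dict{Symbol, Vector{Any}}())

# Translate a column name into its position, as the Index would.
function colindex(df::ToyFrame, col::Symbol)
    i = findfirst(isequal(col), df.names)
    i === nothing && throw(ArgumentError("column $col not found"))
    return i
end

# Setter: the data frame supplies the name lookup the storage lacks.
function metadata!(df::ToyFrame, col::Symbol, field::Symbol, value)
    i = colindex(df, col)  # resolve first, so a bad column name creates nothing
    vals = get!(df.meta, field) do
        Any[nothing for _ in df.names]
    end
    vals[i] = value
    return df              # returning df allows chaining
end

# Getter: errors on an unknown field instead of returning a message string.
function metadata(df::ToyFrame, col::Symbol, field::Symbol)
    haskey(df.meta, field) || throw(ArgumentError("metadata field $field does not exist"))
    return df.meta[field][colindex(df, col)]
end

df = ToyFrame([:x1, :x2])
metadata!(df, :x1, :label, "A variable label")
```

Since both functions take the data frame itself, they can resolve column names, which is exactly what a bare metadata object cannot do.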

end

function DataFrame(columns::Vector{Any}, colindex::Index, metadata::MetaData)
Member

Replace this with metadata::MetaData=MetaData() in the constructor above. Here you're bypassing all consistency checks done by the existing constructor.

Contributor Author

fixed.

Let me know if we should add metadata as an optional argument for all existing constructors. This would require a good deal of consistency checks though.

return "Field does not exist"
end
end

Member

Remove empty lines.

Contributor Author

fixed

if haskey(x.columndata, field)
return x.columndata[field][col_ind]
else
return "Field does not exist"
Member

Throw an error?

Contributor Author

Now I throw an error. I'll need help on error type though.

end

# For creating a new column
function addcolumn!(x::MetaData)
Member

Duplicated method. Also as noted above it should push nothing.

Contributor Author

fixed.

end

function newfield!(x::MetaData, ncol::Int, field::Symbol,)
x.columndata[field] = Vector{Any}([nothing for i in 1:ncol])
Member

Rather than using Any, I think Union{eltype(info), Nothing} would be more appropriate. That would be more efficient, and it's probably flexible enough for meta-data. We can always add API to choose a different type later if it turns out to be useful.

Contributor Author

Added.

Base.:(==)(x::MetaData, y::MetaData) = isequal(x, y)

Base.copy(x::MetaData) = MetaData(copy(x.columndata))
Base.deepcopy(x::MetaData) = MetaData(copy(x.columndata)) # field is immutable
Member

What's immutable?

Contributor Author

I was just copying index.jl. I'll delete it.


MetaData() = MetaData(Dict{Symbol,Vector}())

Base.isequal(x::MetaData, y::MetaData) = isequal(x.columndata, y.columndata)
Member

Do we really need these definitions?

Contributor Author

I thought they were standard boilerplate for new structs. they are deleted now.

@@ -279,7 +284,8 @@ end
function Base.getindex(df::DataFrame, row_ind::Real, col_inds::AbstractVector)
selected_columns = index(df)[col_inds]
new_columns = Any[dv[[row_ind]] for dv in columns(df)[selected_columns]]
return DataFrame(new_columns, Index(_names(df)[selected_columns]))
# no subsetting required for metadata cause rows dont matter
Member

This comment doesn't sound very useful, meta-data just works as the index.

Contributor Author

fixed

@@ -749,6 +759,20 @@ merge!(df, df2) # column z is added, column id is overwritten
"""
function Base.merge!(df::DataFrame, others::AbstractDataFrame...)
for other in others
d = diff_indices(index(other), index(df))
#=
Member

This should go to the docstring for append!. Also, better find another name since it doesn't follow the signature of the generic append!.

"""
Returns the indices of the columns in x that are not in y.
"""
function diff_indices(x::Index, y::Index)
Member

Sorry, when I suggested having a separate function for this, I was thinking about a MetaData-aware function which would avoid calling setdiff if the meta-data is empty. But it can't live in index.jl, which is only about Index. Since the function is just x[setdiff(names(x), names(y))], it's not very useful. Maybe just move all that stuff into append! and pass it the two Index objects. That way it can be a no-op when meta-data is empty. I guess it depends on what can be shared with the join code.

Contributor Author

I made a new function in metadata called merge! that takes in indices with DataFrames.

newfield!(x, ncol, field, info)
end
x.columndata[field][col_ind] = info
return nothing
Member

Could as well remove this and return info, which can be useful for chaining.

Contributor Author

done.

"""
function addmeta!(df::DataFrame, var::Symbol, field::Symbol, info)
ind = index(df)[var]
# pass the number of columns to the function so that it can create a new array of
Member

That sounds OK to me (and no need for a comment).

BTW, these functions would be turned into one-liners if you skip creating the ind variable.

@@ -147,7 +149,8 @@ function DataFrame(; kwargs...)
end

function DataFrame(columns::AbstractVector,
cnames::AbstractVector{Symbol}=gennames(length(columns));
cnames::AbstractVector{Symbol}=gennames(length(columns)),
metadata = MetaData();
Member

Unused argument. Anyway for now MetaData is purely internal, like Index, so it shouldn't appear here.

Contributor Author

okay this part is still a bit confusing for me. but i think this makes sense.



function newfield!(x::MetaData, ncol::Int, field::Symbol, info)
x.columndata[field] = Vector{Union{typeof(info), Nothing}}([nothing for i in 1:ncol])
Member

Union{typeof(info), Nothing}[nothing for i in 1:ncol] avoids a copy.
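The difference between the two spellings can be seen in a standalone snippet. The values of `info` and `ncol` below are made up for illustration.

```julia
info = "A variable label"   # example metadata value (assumed)
ncol = 3                    # example number of columns (assumed)
T = Union{typeof(info), Nothing}

# Builds a Vector{Nothing} first, then converts it into a second array
# with the wider element type.
converted = Vector{T}([nothing for i in 1:ncol])

# Typed comprehension: allocates the Vector{T} directly.
direct = T[nothing for i in 1:ncol]

# Either vector can hold a value of the field's type or `nothing`.
direct[2] = info
```

Both forms produce a vector of the same type; the typed comprehension simply skips the intermediate array.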

Contributor Author

changed.

abstract type AbstractMetaData end

mutable struct MetaData <: AbstractMetaData
columndata::Dict{Symbol, Vector}
Member

struct would be enough, right?

Also, columndata is a bit of a weird name for this, since it sounds like there are other fields with non-column data. Maybe just dict?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed.

@@ -263,7 +267,8 @@ end
function Base.getindex(df::DataFrame, col_inds::AbstractVector)
selected_columns = index(df)[col_inds]
new_columns = columns(df)[selected_columns]
return DataFrame(new_columns, Index(_names(df)[selected_columns]))
new_metadata = metadata(df)[selected_columns]
return DataFrame(new_columns, Index(_names(df)[selected_columns]), new_metadata)
Member

Better use the same pattern as elsewhere and drop the new_metadata variable.

Contributor Author

fixed

@pdeffebach
Contributor Author

I think I got my push and pulls confused for this, somehow making me add the new changes to master... let me know what to do to sort it out, because I'm not sure what to do in this situation. I think you reject these and I re-submit?

I responded to all the changes, but there are a few issues to work on.

  • Allocating for merge!. It seems intuitive that the new metadata should be a combination of the two in the merge. I'm not sure what a non-allocating version would look like, since the current implementation needs full vectors in the columndata Dict (now the dict Dict).
  • I like the idea of having metadata(df, :x1, :field) = "my information". This requires overloading setindex! and getindex, right?
  • I think that automatically making a new vector if you add metadata to a field that doesn't exist yet is dangerous, because then typos can allocate new fields silently. But this is small and can be addressed later.
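One possible way to guard against the typo problem in the last point, sketched under assumptions: fields must be created explicitly, and setting an unknown field throws. The names `newfield!` and `setmeta!` and the plain `Dict` storage are hypothetical here and do not match the PR's signatures.

```julia
# Explicitly register a field, with one `nothing` entry per column.
function newfield!(dict::Dict{Symbol, Vector{Any}}, field::Symbol, ncol::Int)
    dict[field] = Any[nothing for _ in 1:ncol]
    return dict
end

# Strict setter: refuses to create a field implicitly.
function setmeta!(dict::Dict{Symbol, Vector{Any}}, field::Symbol, col::Int, value)
    haskey(dict, field) || throw(KeyError(field))
    dict[field][col] = value
    return value
end

meta = Dict{Symbol, Vector{Any}}()
newfield!(meta, :label, 2)
setmeta!(meta, :label, 1, "Height")

# A misspelled field name now fails loudly instead of allocating a new field.
typo_rejected = try
    setmeta!(meta, :lable, 1, "Height")
    false
catch err
    err isa KeyError
end
```

The cost is one extra call the first time a field is used, in exchange for catching misspelled field names immediately.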

Major issue still to come is join.

@nalimilan
Member

I think I got my push and pulls confused for this, somehow making me add the new changes to master... let me know what to do to sort it out, because I'm not sure what to do in this situation. I think you reject these and I re-submit?

Better continue the conversation in this PR. You should be able to fix this with git fetch && git rebase -i origin/master, and removing lines which correspond to unrelated commits. Then if everything is OK, do a git push --force. (In general, better work in special branches and keep master in sync with origin.)

@pdeffebach
Contributor Author

I did a fetch, a rebase, and then I manually resolved changes and conflicts.

I hope this works. Last week I realized I was doing too much on master, but when I made new branches I didn't set their upstream right, I think.

In the future, I will keep my fork's master up to date with upstream commits, then make a branch and push and pull from that branch of the fork exclusively.

@pdeffebach
Contributor Author

@nalimilan I reorganized everything with git to try and fix this. In the process I guess I closed this branch. If it's okay, can I submit another PR? I still have all the code, and it is now up to date with current master. I can go through and add comments where you left off too, to make the transition easier.

@nalimilan
Member

You should have been able to push your branch to this PR even if it has a different name locally, but now that the PR has been closed, GitHub won't let us reopen it for strange reasons, so you'll have to file another one.
