Add Tables.jl interface for DataFrame(Rows|Columns) #2055

tkf · 2019-12-15T21:40:50Z

This PR adds the Tables.jl interface for DataFrameRows and DataFrameColumns. It is useful for defining data manipulation functions that expect iterators. Example:

julia> tablemap(f, xs) = Tables.materializer(xs)(map(f, xs))
tablemap (generic function with 1 method)

julia> tablemap(x -> (A=x.a, B=x.b), eachrow(DataFrame(a=[3], b=[4])))
1×2 DataFrame
│ Row │ A     │ B     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 3     │ 4     │

bkamins · 2019-12-15T21:49:44Z

In general, adding these definitions should not be problematic. Could you please elaborate on the cases in which you would find this useful? (The issue is that you can always call parent on an iterator and get the data frame to use with Tables.jl.)

tkf · 2019-12-15T21:55:46Z

Thanks for the review!

> Can you please elaborate in what cases you would find this useful (the issue is that you can always call parent on an iterator and get the data frame to use with Tables.jl).

I need this interface to make this test for Transducers.copy pass: https://github.com/tkf/Transducers.jl/pull/104/files#diff-da78dddbbd001af539780c4952b0e9fcR23-R27

I can't use parent in general as I want to process wrapper types as-is, e.g., SubArray-of-StructVector.

bkamins

Looks good. Thank you. Let us just wait for @quinnj to confirm that all is OK with the usage of the Tables.jl interface.

tkf · 2019-12-16T00:12:59Z

test/tables.jl

@@ -182,7 +182,7 @@ end

     df2 = DataFrame!(eachrow(df))
     @test df == df2
-     @test !any(((a,b),) -> a === b, zip(eachcol(df), eachcol(df2)))
+     @test all(((a,b),) -> a === b, zip(eachcol(df), eachcol(df2)))


With this PR, df2 = DataFrame!(eachrow(df)) does not copy columns any more. Is this change OK?

I would say that it is even better. @nalimilan, are you OK with this?
If someone writes DataFrame!, an explicit opt-out from copying is assumed where possible.

Yes, makes sense.

nalimilan · 2019-12-16T15:15:59Z

src/other/tables.jl

@@ -48,6 +48,18 @@ DataFrame!(x::Vector{<:NamedTuple}) =
                        "`$(typeof(x))` without allocating new columns: use " *
                        "`DataFrame(x)` instead"))

+for T in [DataFrameRows, DataFrameColumns]
+    @eval begin


Do you really need a loop? Isn't ::Union{DataFrameRows, DataFrameColumns} enough?

Ah, I hadn't seen @bkamins's comment above. I'd just repeat the Union without defining a custom type alias.

I'm OK with both approaches. I wrote it this way because @bkamins preferred this approach. Ref: #2055 (comment)

I am OK with both. The @eval approach is often used in Base, but using Union without defining the alias is also OK (I just prefer not to introduce the alias here).

Does 22e32db look good?
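
As an illustration of the two styles discussed in this thread, here is a minimal sketch in plain Julia. RowsView, ColsView, unwrap_eval, and unwrap_union are hypothetical names used only for this example; they are not DataFrames.jl definitions and do not reproduce the code in the PR.

```julia
# Hypothetical stand-ins for two wrapper types (not the real DataFrames.jl types).
struct RowsView
    parent::Vector{Int}
end
struct ColsView
    parent::Vector{Int}
end

# Style 1: generate one method per type with a loop and @eval.
for T in [RowsView, ColsView]
    @eval unwrap_eval(itr::$T) = itr.parent
end

# Style 2: a single method whose argument type is a Union (no alias needed).
unwrap_union(itr::Union{RowsView, ColsView}) = itr.parent

rows = RowsView([1, 2, 3])
cols = ColsView([4, 5])

# Both styles dispatch identically for the two wrapper types.
@assert unwrap_eval(rows) == unwrap_union(rows) == [1, 2, 3]
@assert unwrap_eval(cols) == unwrap_union(cols) == [4, 5]

# The difference is only in how many methods end up in the method table.
@assert length(methods(unwrap_eval)) == 2
@assert length(methods(unwrap_union)) == 1
```

Both forms behave the same for callers; the Union form simply keeps a single method definition in the source.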

nalimilan · 2019-12-16T15:16:26Z

test/tables.jl

Yes, makes sense.

nalimilan · 2019-12-16T15:19:41Z

Shouldn't materializer return a DataFrameRows/DataFrameColumns object rather than a DataFrame?

tkf · 2019-12-16T15:26:42Z

My understanding is that materializer(x)(input) isa typeof(x) does not have to hold. For example, there is a fallback definition materializer(x) = columntable which creates a NamedTuple-of-Vectors.
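
To make the fallback concrete, here is a minimal sketch. It assumes Tables.jl is installed; Opaque is a hypothetical type invented for this example that defines no materializer method of its own.

```julia
using Tables

# Hypothetical type with no Tables.materializer method of its own.
struct Opaque end

# With no specific method defined, Tables.jl falls back to columntable.
mat = Tables.materializer(Opaque())

# Feed the returned sink a Vector of NamedTuples (a row table).
result = mat([(a=1, b=2), (a=3, b=4)])

@assert mat === Tables.columntable
@assert result == (a=[1, 3], b=[2, 4])   # a NamedTuple of Vectors
@assert !(result isa Opaque)             # not the type of the input
```

So for a type without its own method, materializer(x)(input) isa typeof(x) is false.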

nalimilan · 2019-12-16T15:34:27Z

Maybe it doesn't have to, but should it? Or why shouldn't it? :-)

tkf · 2019-12-16T15:36:01Z

Actually, materializer(::DataFrameRows) = DataFrameRows ∘ DataFrame makes sense.

bkamins · 2019-12-16T15:37:47Z

Well, materializer is @quinnj's idea, so he can probably comment 😄. When I analyze it (it has no documentation), my understanding is that it should materialize x, and DataFrameRows and DataFrameColumns are views, while DataFrame is a materialized object. But maybe this is the wrong way to look at it. @quinnj, can you please comment on the design purpose of materializer?

bkamins · 2019-12-16T15:38:51Z

> Actually materializer(::DataFrameRows) = DataFrameRows ∘ DataFrame makes sense.

As I commented, I think whether it makes sense depends on the contract @quinnj wants materializer to support (I was writing in parallel).

tkf · 2019-12-16T15:54:33Z

@bkamins I just noticed your comment after writing a patch to do (something like) materializer(::DataFrameRows) = eachrow ∘ DataFrame. But I agree that waiting for @quinnj's comment is a good idea. If returning a DataFrame is the better option, I'll just remove that commit.

The reason I thought returning a DataFrameRows makes sense is that we can then have tablemap(f ∘ g, xs) == tablemap(f, tablemap(g, xs)), which is a nice property.
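
The composition property can be checked on a simpler table type. This is a minimal sketch, assuming Tables.jl is installed and assuming that its materializer for a Vector of NamedTuples returns another Vector of NamedTuples (so the output can be iterated row by row like the input). The functions f and g are hypothetical row transforms made up for this example; tablemap is the definition from the PR description.

```julia
using Tables

# Definition from the PR description.
tablemap(f, xs) = Tables.materializer(xs)(map(f, xs))

# A row table: a Vector of NamedTuples.
xs = [(a=1, b=2), (a=3, b=4)]

# Hypothetical row-wise transforms.
g(r) = (a=r.a + 1, b=r.b)
f(r) = (a=r.a * 2, b=r.b)

lhs = tablemap(f ∘ g, xs)
rhs = tablemap(f, tablemap(g, xs))

# Holds because the intermediate result iterates rows, just like xs.
@assert lhs == rhs
@assert lhs == [(a=4, b=2), (a=8, b=4)]
```

The property relies on the materialized result supporting the same row iteration as the input, which is the point under discussion for DataFrameRows.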

bkamins · 2019-12-16T16:25:44Z

src/other/tables.jl

+Tables.materializer(itr::DataFrameRows) =
+    eachrow ∘ prefer_singleton_callable(Tables.materializer(parent(itr)))
+Tables.materializer(itr::DataFrameColumns) =
+    eachcol ∘ prefer_singleton_callable(Tables.materializer(parent(itr)))


If @quinnj says that we should return DataFrameColumns here rather than DataFrame, then we should inherit from itr whether it was created with the names positional argument set to true or false.

quinnj · 2019-12-16T17:13:28Z

We don't currently have strict requirements on materializer; I think, as mentioned, it's preferable if it returns the same type as the input, but I don't think that's always feasible (immutable or view-like inputs, certain DB or file-based format tables, etc.).

These changes seem fine by me, though I will note that I recently tried to use DataFrameRows and wanted to open an issue about its usability: namely, the fact that it isn't <: AbstractDataFrame means things like names(df_rows) don't work, though it would be nice if they did.

Anyway, I'm good w/ this.

bkamins · 2019-12-16T20:18:20Z

> namely the fact that it doesn't inherit from <: AbstractDataFrame means things like names(df_rows) don't work, but would be nice if they did.

It cannot because it inherits from AbstractVector and we do not have multiple inheritance. I will open a PR adding common AbstractDataFrame methods for them.

tkf · 2019-12-17T07:39:26Z

Actually, can we go back to materializer(::DataFrameRows) = DataFrame? 😅

> When I analyze it (it does not have a documentation) I get the understanding that it should materialize x, and DataFrameRows and DataFrameColumns are views, while DataFrame is a materialized object

@bkamins I just realized that your point totally makes sense. I needed something I can push!/append! to. At the moment, DataFrameRows does not support this, and I'm guessing the DataFrames devs don't want to expand the API surface. If that's the case, can we go back to the original materializer definition?

Sorry, I should've tried to re-implement JuliaFolds/Transducers.jl#107 after the API was changed...

bkamins · 2019-12-17T10:13:22Z

> I'm guessing DataFrames devs don't want to expand the API surface

It is a simple rule: both objects are views, so they should not be allowed to mutate the schema of the parent (i.e., changing the number of rows or columns, or renaming them).

So, as I understand it, the change should make materializer produce a DataFrame, right?

Also, just to confirm: this new DataFrame should be a copy of the source (not reuse the columns), right?

nalimilan · 2019-12-17T10:53:13Z

I guess returning a DataFrame would be OK if materializer is supposed to return a mutable table (like similar for arrays), but we would have to make this clear. Otherwise it sounds weird not to return DataFrameRows.

tkf · 2019-12-17T23:35:28Z

> It is a simple rule: both objects are views so they should not be allowed to mutate the schema of the parent (i.e. changing number of rows or columns or renaming them).

The "rule" sounds arbitrary to me, in the sense that you made it, so you can change it. I can't find any explicit API contract defined for eachrow and eachcol. I agree that the function names suggest that the caller should not expect more than the iterator API. However, if DataFrameRows is also used for Tables.rows (#2051), it may make sense to add mutation support. For example, Tables.jl mentions that Tables.rows(table) === table is a reasonable choice if table is already a row iterator. This includes table types that support push!.

bkamins · 2019-12-18T09:11:00Z

I see your point, so let us wait for others to comment on their preference.

My thinking was that DataFrameRows for AbstractDataFrame is like (v for x in x) in Base.

Also note that DataFrameRows might wrap a SubDataFrame in which case it will not be resizable anyway, as SubDataFrame is not resizable.

nalimilan · 2019-12-19T08:51:36Z

@quinnj Do you think materializer is supposed to return a mutable table, i.e. to which rows or columns can be added?

quinnj · 2019-12-19T09:11:09Z

I think materializer should return the most well-supported, "standard" table type for a family of table types. The point is that some higher-order table processor somewhere wants to manipulate a table and spit back out something like what you put in; so while it definitely makes sense to return the exact same type as the input, I also think that, for general usability, it makes sense to return the most "standard" type, in this case a DataFrame.

nalimilan · 2019-12-19T09:33:20Z

Yeah it's a tough choice. It's not ideal that if you passed a table which iterates rows (DataFrameRows) you get a table that cannot be iterated over (DataFrame). Though I guess you should call Tables.rows on the result anyway if you want to iterate over rows, so returning a DataFrame should be OK.

tkf · 2019-12-19T09:44:26Z

@bkamins @nalimilan @quinnj Thanks for the discussion! I opened #2058

bkamins · 2019-12-27T13:02:39Z

I would prefer in the future to squash-merge PRs, as otherwise it is hard for me to write appropriate release notes. Thank you!

quinnj · 2019-12-31T22:17:43Z

Huh......I think I did a rebase-merge on another repo and then it must remember your latest preference regardless of repo? I almost always squash, but there was a case on another repo where I wanted to rebase-merge and then I must have just hit the default here. Sorry about that.

Add Tables.jl interface for DataFrame(Rows|Columns)

6ed00bd

bkamins reviewed Dec 15, 2019

bkamins reviewed Dec 15, 2019

tkf added 2 commits December 15, 2019 13:57

Use eval instead of Union

9e6b078

JuliaData#2055 (comment)

Use parent instead of getfield

b294935

JuliaData#2055 (comment)

bkamins approved these changes Dec 15, 2019

tkf mentioned this pull request Dec 15, 2019

Fix copy(::Transducer, ::Table) etc. JuliaFolds/Transducers.jl#104

Merged

Fix a test for DataFrame!(eachrow(df))

5325ce0

tkf commented Dec 16, 2019

tkf mentioned this pull request Dec 16, 2019

Test and document copy(xf, eachrow(df)) JuliaFolds/Transducers.jl#107

Closed

nalimilan reviewed Dec 16, 2019

Use Union (without alias) instead of eval

22e32db

Materialize as DataFrame(Rows|Columns)

fbeaff4

bkamins reviewed Dec 16, 2019

quinnj approved these changes Dec 16, 2019

Inherit names argument in materializer(::DataFrameColumns)

62ab4b6

quinnj merged commit cac52ca into JuliaData:master Dec 16, 2019

tkf deleted the tables branch December 16, 2019 19:44

quinnj mentioned this pull request Dec 16, 2019

Change Tables.rows implementation to use eachrow #2051

Merged

nalimilan reviewed Dec 16, 2019

bkamins mentioned this pull request Dec 16, 2019

Add names to data frame iterators #2056

Merged

tkf mentioned this pull request Dec 16, 2019

Simplify materializer for DataFrameRows and DataFrameColumns #2057

Closed

tkf mentioned this pull request Dec 19, 2019

Materialize DataFrame(Rows|Columns) as DataFrame #2058

Merged
