Add @byrow attempt 2 #250

Merged
pdeffebach merged 25 commits into JuliaData:master from byrow_2 on Jun 16, 2021

pdeffebach
Collaborator

This supersedes #239.

Here is a docstring indicating how @byrow works:


@transform uses the syntax @byrow to wrap transformations in
the ByRow function wrapper from DataFrames, enabling broadcasting
and more. For example, the call

@transform(df, y = @byrow :x == 1 ? "true" : "false")

becomes

transform(df, :x => ByRow(x -> x == 1 ? "true" : "false") => :y)

a transformation which cannot be conveniently expressed
using broadcasting.

To avoid writing @byrow multiple times when performing multiple
transformations by row, @transform allows @byrow at the
beginning of a block of transformations. All transformations
in the block will operate by row.

julia> using DataFramesMeta

julia> df = DataFrame(A = 1:3, B = [2, 1, 2]);

julia> @transform df z = @byrow :A * :B
3×3 DataFrame
 Row │ A      B      z
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      2
   2 │     2      1      2
   3 │     3      2      6

julia> @transform df @byrow begin
           x = :A * :B
           y = :A == 1 ? 100 : 200
       end

3×4 DataFrame
 Row │ A      B      x      y
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1      2      2    100
   2 │     2      1      2    200
   3 │     3      2      6    200
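
For readers less familiar with the underlying DataFrames.jl API, the block form above should correspond roughly to the plain `transform` call below. This is a sketch written directly against DataFrames.jl, not the code the macro actually generates, and the anonymous functions are illustrative.

```julia
using DataFrames

df = DataFrame(A = 1:3, B = [2, 1, 2])

# Each transformation is a `source => function => destination` pair.
# Wrapping the function in ByRow applies it to one row at a time.
out = transform(df,
    [:A, :B] => ByRow((a, b) -> a * b) => :x,
    :A => ByRow(a -> a == 1 ? 100 : 200) => :y)
```

The ternary in the second pair is the kind of expression that is awkward to write with broadcasting, which is the motivation given in the docstring.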

The implementation is a bit clunky, with create_args_vector returning a tuple of the vector of arguments and whether to wrap in ByRow. This should be less clunky after we clean up the @combine code, which I will do in a later PR. So I am accumulating a little technical debt that I promise to pay off in the future.

I have added tests for @transform, but not others, until we find all the edge cases we need. Overall I am pleased with this PR. I think the work in #245 made this really easy. It also preserves the snappy function parsing in #221, though I still have to add those tests.

All things considered, this is ready for a review @nalimilan and @bkamins.

@bkamins
Member

bkamins commented May 28, 2021

Thank you for working on it. Note that CI fails.

y = [:v, :w, :x, :y, :z],
c = [:g, :quote, :body, :transform, missing])

@test @transform(df, n = @byrow :i).n == df.i
Member

Maybe add some tests checking the whole output DataFrame not only its specific columns.
Also do we support GroupedDataFrame here?

Collaborator Author

I have updated tests to be more reflective of the types of cases we are concerned with.

I will update the GroupedDataFrame tests once we think these are good. They are in a different file.

@pdeffebach pdeffebach mentioned this pull request May 28, 2021
@pdeffebach
Collaborator Author

Okay I've added a very lengthy documentation section in index.md.

Could use an approval of the tests and any comments on the API as discussed in index.md and the @transform docstring. Then I will get to work adding tests everywhere and cleaning up the documentation.

@@ -268,6 +268,178 @@ df2 = @eachrow df begin
end
```

## Row-wise transformations with `@byrow`

DataFrames provides the function-wrapper `ByRow`. `ByRow(f)(x, y)`
Member

I am not sure what style you follow in DataFramesMeta.jl, but in DataFrames.jl we always add .jl to package names

Collaborator Author

added .jl but will make a point to do this before 1.0.

```

!!! note
Unlike `@.`, `@byrow` is not a "real" macro and cannot be used outside of
Member

the question is what are the parsing rules for @byrow? does it take exactly one expression that follows it?

Collaborator Author

I included a more detailed discussion in updated docs. Hopefully it's clear.

### Comparison with `@eachrow`

In previous versions of DataFramesMeta, `@eachrow` was named `@byrow`.
This version of `@byrow` is deprecated, but the syntax can be used
Member

"the syntax" - which syntax?

Collaborator Author

clarified.

```

The function `*` is applied by-row. But the result of those operations
is not stored in a new vector. Additionally, `@eachrow` and `@eachrow!`
Member

where is it stored then? (or not stored unless stored explicitly?)

Collaborator Author

it's not stored. It's literally just a for-loop. Hopefully the re-write is clear.
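
The distinction in this thread can be sketched without any macros. The snippet below is an illustration in plain DataFrames.jl, not what `@eachrow` expands to: a bare loop discards each per-row result unless you store it yourself, while `ByRow` returns a new vector.

```julia
using DataFrames

df = DataFrame(A = 1:3, B = [2, 1, 2])

# A bare loop, roughly what the comment above describes:
# each product is computed and then thrown away.
for i in 1:nrow(df)
    df.A[i] * df.B[i]
end

# To keep the results from a loop, you have to store them explicitly.
z_loop = similar(df.A)
for i in 1:nrow(df)
    z_loop[i] = df.A[i] * df.B[i]
end

# ByRow, by contrast, returns a new vector of per-row results.
z_byrow = ByRow(*)(df.A, df.B)
```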

and broadcasts that function. Consequently, it does not broadcast
objects that are referenced which are not columns.
```julia
@with df @byrow :x + [5, 6]
Member

will this always error? What if column :x would be a vector of two element vectors?

Collaborator Author

Yes, I have clarified in the example.

```
will error. On the other hand
```julia
@with df @. :x + [5, 6]
Member

It will only work if :x has 2 elements - right?

Collaborator Author

yes. I have added more detail.
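
Both review questions in this thread can be checked with `ByRow` and ordinary broadcasting directly. This is a sketch in plain DataFrames.jl rather than through `@with`; the helper name `add56` is made up for illustration.

```julia
using DataFrames

add56 = v -> v + [5, 6]

# Row-wise: each element of the column is passed to the function on its own.
# With scalar elements this calls 1 + [5, 6], which Julia does not define.
scalar_failed = try
    ByRow(add56)([1, 2])
    false
catch
    true
end

# If each element is itself a two-element vector, the row-wise call works.
nested = ByRow(add56)([[1, 2], [3, 4]])

# Broadcasting the whole expression pairs the column with [5, 6] elementwise,
# so it needs a column of length 2 (or 1).
broadcasted = [1, 2] .+ [5, 6]
```

So the row-wise form does not always error: it depends on what each element of `:x` is, which matches the clarification requested in the review.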

@pdeffebach
Collaborator Author

Thanks a ton for reading all this.

I re-wrote a lot of this documentation for clarity based on your comments. Hopefully it's more complete now.

@pdeffebach
Collaborator Author

It's worth noting that the special behavior related to NamedTuples and AsTable doesn't get referenced here because DataFramesMeta doesn't support AsTable yet.

Comment on lines 322 to 323
used, functions do not take advantage of the grouping, so the
behavior of `@transform(df, y = @byrow f(:x))` and
Member

I add this comment as with @combine the result would be different:

Suggested change
- used, functions do not take advantage of the grouping, so the
- behavior of `@transform(df, y = @byrow f(:x))` and
+ used, functions do not take advantage of the grouping, so
+ for example the result of `@transform(df, y = @byrow f(:x))` and

Collaborator Author

Ah good catch. Too niche to mention in the docs though.
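
The point about grouping can be illustrated in plain DataFrames.jl. The sketch below uses a made-up data frame and shows one way the `combine` results differ (the grouping column is returned), which may or may not be the exact difference the reviewer had in mind.

```julia
using DataFrames

df = DataFrame(g = [1, 1, 2], x = [1, 2, 3])
gd = groupby(df, :g)
plus_one = ByRow(x -> x + 1)

# transform keeps one output row per input row, so grouping changes nothing.
t_df = transform(df, :x => plus_one => :y)
t_gd = transform(gd, :x => plus_one => :y)

# combine on a GroupedDataFrame also returns the grouping column,
# so the two results no longer have the same columns.
c_df = combine(df, :x => plus_one => :y)
c_gd = combine(gd, :x => plus_one => :y)
```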


### Comparison with `@eachrow`

In previous versions of DataFramesMeta, `@eachrow` was named `@byrow`.
Member

Suggested change
- In previous versions of DataFramesMeta, `@eachrow` was named `@byrow`.
+ In previous versions of DataFramesMeta.jl, `@eachrow` was named `@byrow`.

@pdeffebach
Collaborator Author

Thanks for the detailed review!

@bkamins
Member

bkamins commented Jun 3, 2021

Let us wait for @nalimilan to have a look at it now.

Member

@nalimilan nalimilan left a comment

Cool! I've made comments mainly about the docs. Overall I think it would make sense to put less stress on the technical details to target "basic" users.

I also wonder whether @byrow z = :x * :y shouldn't be allowed as it seems a bit simpler not to put @byrow in the middle of the transformation. It's also more consistent with @byrow begin... end. What do you think?

@@ -268,6 +268,228 @@ df2 = @eachrow df begin
end
```

## Row-wise transformations with `@byrow`
Member

Mention @byrow at the top of the file?

Rather than starting the section with technical details, it would be more user-friendly to say what @byrow does first, then show examples, and only then mention ByRow and the fact that @byrow isn't a real macro.

Member

I think the second paragraph still applies: it would be nice to start with a sentence or two saying that @byrow allows writing code that is applied to each row instead of having to vectorize it.

@pdeffebach
Collaborator Author

pdeffebach commented Jun 5, 2021

> I also wonder whether @byrow z = :x * :y shouldn't be allowed as it seems a bit simpler not to put @byrow in the middle of the transformation. It's also more consistent with @byrow begin... end. What do you think?

No, I don't think we should do this. It breaks the mapping to src => fun => dest logic. The thing on the RHS becomes the fun, which is where the ByRow goes. Plus we have to support @byrow on the RHS because the RHS gets sent to the same processing that stuff in @where goes through (which has no LHS).

w.r.t. detailed technical docs vs introductory docs, if we were to make a docstring for @byrow, how would that work? We could make @byrow a macro the same as @with df @byrow begin ... end, but that seems a little redundant.

EDIT: Unfortunately @byrow doesn't error on the current release. So we have to have a few months where it errors before we can add @byrow back in as a real macro.

@nalimilan
Member

> No, I don't think we should do this. It breaks the mapping to src => fun => dest logic. The thing on the RHS becomes the fun, which is where the ByRow goes. Plus we have to support @byrow on the RHS because the RHS gets sent to the same processing that stuff in @where goes through (which has no LHS).

Well the mapping isn't 1-to-1 already: while the LHS corresponds to dest, the RHS completely mixes src and fun. I get that in terms of implementation, @byrow z = x + y may be more complex to support, but in terms of API I'm not sure it's really inferior. (That said, it could be added later.)

> w.r.t. detailed technical docs vs introductory docs, if we were to make a docstring for @byrow, how would that work? We could make @byrow a macro the same as @with df @byrow begin ... end, but that seems a little redundant.
>
> EDIT: Unfortunately @byrow doesn't error on the current release. So we have to have a few months where it errors before we can add @byrow back in as a real macro.

Can't you define a dummy macro that throws an error?

@pdeffebach
Collaborator Author

pdeffebach commented Jun 6, 2021

> @byrow z = x + y may be more complex to support, but in terms of API I'm not sure it's really inferior. (That said, it could be added later.)

This is a good point. I will think about it. Maybe the implementation is easy.

> Can't you define a dummy macro that throws an error?

Yeah that sounds good. I will do that and switch the docstrings around.
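
A dummy macro of the kind suggested here might look like the sketch below. The name `@byrow_stub` and the error message are placeholders chosen for this example; the PR's actual definition is not shown in this thread.

```julia
# Hypothetical stand-in name; the real macro would be called `@byrow`.
macro byrow_stub(args...)
    msg = "@byrow is only valid inside DataFramesMeta macros such as @transform."
    # Return an expression that throws when the expanded code runs.
    return :(throw(ArgumentError($msg)))
end

# Using the stub on its own raises the error at run time.
caught = try
    @byrow_stub :x * :y
catch err
    err
end
```

Having a real macro object, even one that only errors, also gives the docstring for `@byrow` somewhere to live.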

@pdeffebach
Collaborator Author

Okay I'm going to allow

@transform df begin 
    @byrow y = f(:x)
end

because with the @transform df @byrow begin ... end implementation we allow

@transform df @byrow y = f(:x)

so it would be weird to allow that and not the same for multiple arguments.

@pdeffebach
Collaborator Author

Okay two substantive changes to the API

  • @transform df y = @byrow :x now errors, in favor of @transform df @byrow y = :x
  • Because a lot of this is implemented with recursion, I add a few checks to stop people from doing @transform df @byrow @byrow @byrow y = :x
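
The nested `@byrow` check in the second bullet can be sketched at the expression level. The helper names `is_byrow` and `strip_byrow` are hypothetical and are not the functions in src/parsing.jl; the sketch only shows how a leading `@byrow` can be detected on a quoted expression and how a repeated one can be rejected.

```julia
# Hypothetical helpers, for illustration only.
# A quoted macro call has head :macrocall and the macro name as its first argument.
is_byrow(e) = e isa Expr && e.head == :macrocall && e.args[1] == Symbol("@byrow")

# Return the expression without a leading @byrow, plus a flag saying
# whether the transformation should be wrapped in ByRow.
function strip_byrow(e)
    is_byrow(e) || return (e, false)
    inner = e.args[end]  # args are: macro name, line number, then the expression
    is_byrow(inner) && throw(ArgumentError("Redundant @byrow calls"))
    return (inner, true)
end

plain = strip_byrow(:(y = :x))           # no @byrow present
flagged = strip_byrow(:(@byrow y = :x))  # one @byrow, stripped and flagged
```

Returning the flag alongside the expression mirrors the tuple described in the PR description for create_args_vector, though the real code may be organized differently.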

I added docstrings to all the relevant macros (except @combine and @by, since @byrow is useless there.) So in theory this PR is ready to be merged. All the work is done.

Could use a review to go over the docstrings, I guess. But I'm getting anxious to merge this into master.

Member

@bkamins bkamins left a comment

Fantastic work.

Collaborator Author

@pdeffebach pdeffebach left a comment

Thanks for the review!

I have incorporated the requested changes.

Member

@nalimilan nalimilan left a comment

Cool!

@pdeffebach pdeffebach merged commit 222c36c into JuliaData:master Jun 16, 2021
@pdeffebach pdeffebach deleted the byrow_2 branch June 16, 2021 15:44
3 participants