Add @byrow attempt 2 #250

Merged
merged 25 commits into from
Jun 16, 2021
Changes from 4 commits
3 changes: 1 addition & 2 deletions docs/Project.toml
@@ -1,6 +1,5 @@
[deps]
DataFramesMeta = "1313f7d8-7da2-5740-9ea0-a2ca25f37964"
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"

[compat]
Documenter = "0.25"
Documenter = "0.25"
pdeffebach marked this conversation as resolved.
1 change: 1 addition & 0 deletions docs/src/api/api.md
@@ -2,4 +2,5 @@

```@autodocs
Modules = [DataFramesMeta]
Private = false
```
182 changes: 10 additions & 172 deletions docs/src/index.md
@@ -18,6 +18,7 @@ In addition, DataFramesMeta provides
convenient syntax
* `@eachrow` and `@eachrow!` for looping through rows in a data frame, again with high performance and
convenient syntax.
* `@byrow` for applying functions by-row of a data frame.
pdeffebach marked this conversation as resolved.
* `@linq`, for piping the above macros together, similar to [magrittr](https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html)'s
`%>%` in R.

@@ -271,9 +272,9 @@ end
## Row-wise transformations with `@byrow`
Member

Mention @byrow at the top of the file?

Rather than starting the section with technical details, it would be more user-friendly to say what @byrow does first, then show examples, and only then mention ByRow and the fact that @byrow isn't a real macro.

Member

I think the second paragraph still applies: it would be nice to start with a sentence or two saying that @byrow allows writing code that is applied to each row instead of having to vectorize it.


DataFrames.jl provides the function wrapper `ByRow`. `ByRow(f)(x, y)`
is roughly equivalent to `f.(x, y)`, with a few exceptions discussed below.
DataFramesMeta.jl allows users to construct expressions using the `ByRow`
function wrapper with the syntax `@byrow`.
is roughly equivalent to `f.(x, y)`. DataFramesMeta.jl allows users
to construct expressions using the `ByRow` function wrapper with the
syntax `@byrow`.
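
As a minimal sketch of the equivalence described here, the snippet below calls a `ByRow`-wrapped function directly on two vectors and compares it with ordinary broadcasting. The function `f` and the vectors `x` and `y` are hypothetical names chosen for illustration; only `ByRow` comes from DataFrames.jl.

```julia
using DataFrames

# Hypothetical inputs and function, for illustration only
x = [1, 2]
y = [3, 4]
f(a, b) = a * b

# ByRow(f) is callable on vectors and applies f element by element
wrapped = ByRow(f)(x, y)

# Ordinary broadcasting of the same function
broadcasted = f.(x, y)

println(wrapped == broadcasted)
```

For simple element-wise functions like this one, the two results should agree.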

`@byrow` is not a "real" macro and cannot be used outside of
DataFramesMeta.jl macros. However its behavior within DataFramesMeta.jl
@@ -282,20 +283,20 @@ Thought of as a macro `@byrow` accepts a single argument and
creates an anonymous function wrapped in `ByRow`. For example,

```julia
@transform(df, y = @byrow :x == 1 ? "true" : "false")
@transform(df, @byrow y = :x == 1 ? true : false)
```

is equivalent to

```julia
transform(df, :x => ByRow(x -> x == 1 ? "true" : "false") => :y)
transform(df, :x => ByRow(x -> x == 1 ? true : false) => :y)
```

The following macros accept `@byrow`:

* `@transform` and `@transform!`, `@select`, `@select!`, and `@combine`.
`@byrow` can be used in the right hand side of expressions, e.g.
`@select(df, z = @byrow :x * :y)`.
`@byrow` can be used in the left hand side of expressions, e.g.
`@select(df, @byrow z = :x * :y)`.
* `@where` and `@orderby`, with syntax of the form `@where(df, @byrow :x > :y)`
* `@with`, where the anonymous function created by `@with` is wrapped in
`ByRow`, as in `@with(df, @byrow :x * :y)`.
@@ -319,171 +320,8 @@ julia> @where df @byrow begin
`@byrow` can be used inside macros which accept `GroupedDataFrame`s,
however, like with `ByRow` in DataFrames.jl, when `@byrow` is
used, functions do not take into account the grouping, so for
example the result of `@transform(df, y = @byrow f(:x))` and
`@transform(groupby(df, :g), y = @byrow f(:x))` is the same.
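
The grouping behavior can be checked directly with DataFrames.jl's `ByRow`, without going through the macro. This is a small sketch under assumptions: the data frame, the grouping column `g`, and the function `f` are hypothetical and chosen only for illustration.

```julia
using DataFrames

# Hypothetical data: `g` is the grouping column
df = DataFrame(g = [1, 1, 2], x = [1, 2, 3])
f(x) = x^2  # hypothetical row-wise function

# Same by-row transformation on the plain and the grouped data frame
ungrouped = transform(df, :x => ByRow(f) => :y)
grouped = transform(groupby(df, :g), :x => ByRow(f) => :y)

# The new column is the same because ByRow ignores the grouping
same = ungrouped.y == grouped.y
println(same)
```

Because `f` only ever sees one row's value at a time, it has no access to group-level information such as a group mean.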

### Comparison with `@eachrow`

To recap, `@eachrow` roughly transforms

```julia
@eachrow df begin
    :a * :b
end
```

to

```julia
begin
    function tempfun(a, b)
        for i in eachindex(a)
            a[i] * b[i]
        end
    end
    tempfun(df.a, df.b)
    df
end
```

The function `*` is applied by-row. But the result of those operations
is not stored anywhere, as with `for`-loops in Base Julia.
Rather, `@eachrow` and `@eachrow!` return data frames.

Now consider `@byrow`. `@byrow` transforms

```julia
@with df @byrow begin
    :a * :b
end
```

to

```julia
tempfun(a, b) = a * b
tempfun.(df.a, df.b)
```

In contrast to `@eachrow`, `@with` combined with `@byrow` returns a vector of the
broadcasted multiplication and not a data frame.

Additionally, `@eachrow` and `@eachrow!` allow modifying a
data frame. Just as with Base Julia broadcasting, `@byrow` will
not update columns.

```julia
julia> df = DataFrame(a = [1, 2], b = [3, 4]);

julia> @with df @byrow begin
           :a = 500
       end
2-element Vector{Int64}:
 500
 500

julia> df
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      3
   2 │     2      4
```

### Comparison with `@.` and Base broadcasting

Base Julia provides the broadcasting macro `@.` and in many cases `@.`
and `@byrow` will give equivalent results. But there are important
deviations in behavior. Consider the setup

```julia
df = DataFrame(a = [1, 2], b = [3, 4])
```

* Control flow. In all versions of Julia, expressions of the form
`if...else`, `a ? b : c` cannot be broadcasted. In versions below
1.7-dev, expressions of the form `a && b` and `a || b` cannot be
broadcasted. Consequently, the `@.` macro will fail when encountering such
control flow while `@byrow` will not.
```julia
julia> @with df @byrow begin
           if :a == 1
               5
           else
               10
           end
       end
2-element Vector{Int64}:
  5
 10

julia> @with df @. begin
           if :a == 1
               5
           else
               10
           end
       end # will error
```

* Broadcasting objects that are not columns. `@byrow` constructs an
anonymous function *which accepts only the columns of the input data frame*
and broadcasts that function. Consequently, it does not broadcast
referenced objects which are not columns.

```julia
julia> df = DataFrame(a = [1, 2], b = [3, 4]);
julia> @with df @byrow :a + [5, 6]
```

will error, because the `:a` in the above expression refers
to a scalar `Int`, and you cannot do `1 + [5, 6]`.

On the other hand

```julia
@with df @. :a + [5, 6]
```
will succeed, as `df.a` is a 2-element vector as is `[5, 6]`.

Because `ByRow` inside `transform` blocks does not internally
use broadcasting in all circumstances, in the rare instance
that a column in a data frame is a custom vector type that
implements custom broadcasting, this custom behavior will
not be called with `@byrow`.

* Broadcasting expensive calls. In Base Julia, broadcasting
evaluates calls first and then broadcasts the result. Because
`@byrow` constructs an anonymous function and evaluates
that function for every row in the data frame, expensive functions
will be evaluated many times.

```julia
julia> function expensive()
           sleep(.5)
           return 1
       end;

julia> @time @with df @byrow :a + expensive();
1.037073 seconds (51.67 k allocations: 3.035 MiB, 3.19% compilation time)

julia> @time @with df :a .+ expensive();
0.539900 seconds (110.67 k allocations: 6.525 MiB, 7.05% compilation time)

```

This problem comes up when using the `@.` macro as well, but can easily be fixed with `$`.

```julia
julia> @time @with df @. :a + expensive();
1.036888 seconds (97.55 k allocations: 5.617 MiB, 3.20% compilation time)

julia> @time @with df @. :a + $expensive();
0.537961 seconds (110.68 k allocations: 6.525 MiB, 6.73% compilation time)
```

No such solution currently exists with `@byrow`.
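
The difference in how often a call is evaluated can be made visible by counting calls instead of timing them. The sketch below uses DataFrames.jl's `ByRow` directly rather than the `@byrow` syntax; the counter `calls` and the function `counted` are hypothetical stand-ins for the `expensive` function above.

```julia
using DataFrames

df = DataFrame(a = [1, 2], b = [3, 4])

# Hypothetical stand-in for an expensive function: counts its own calls
calls = Ref(0)
function counted()
    calls[] += 1
    return 1
end

# By-row evaluation: the anonymous function runs once per row,
# so counted() is called once per row
byrow_result = ByRow(a -> a + counted())(df.a)
byrow_calls = calls[]

# Base broadcasting: counted() is evaluated once, then broadcast
calls[] = 0
bcast_result = df.a .+ counted()
bcast_calls = calls[]

println((byrow_calls, bcast_calls))
```

Both approaches give the same values. Only the number of evaluations differs, and with a two-row data frame that means two calls versus one.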
example the result of `@transform(df, @byrow y = f(:x))` and
`@transform(groupby(df, :g), @byrow y = f(:x))` is the same.

bkamins marked this conversation as resolved.
## Working with column names programmatically with `cols`

1 change: 1 addition & 0 deletions src/DataFramesMeta.jl
@@ -8,6 +8,7 @@ using Reexport
export @with, @where, @orderby, @transform, @by, @combine, @select,
@transform!, @select!,
@eachrow, @eachrow!,
@byrow,
@based_on # deprecated

