Add @byrow attempt 2 #250

Merged
merged 25 commits into from
Jun 16, 2021
Changes from 8 commits
172 changes: 172 additions & 0 deletions docs/src/index.md
Expand Up @@ -268,6 +268,178 @@ df2 = @eachrow df begin
end
```

## Row-wise transformations with `@byrow`
Member

Mention @byrow at the top of the file?

Rather than starting the section with technical details, it would be more user-friendly to say what @byrow does first, then show examples, and only then mention ByRow and the fact that @byrow isn't a real macro.

Member

I think the second paragraph still applies: it would be nice to start with a sentence or two saying that @byrow allows writing code that is applied to each row instead of having to vectorize it.


DataFrames provides the function-wrapper `ByRow`. `ByRow(f)(x, y)`
Member

I am not sure what style you follow in DataFramesMeta.jl, but in DataFrames.jl we always add .jl to package names

Collaborator Author

added .jl but will make a point to do this before 1.0.

is roughly equivalent to `f.(x, y)`, with a few exceptions discussed below.
DataFramesMeta lets users construct expressions that use the `ByRow`
function wrapper through the `@byrow` syntax.


```julia
@transform(df, y = @byrow :x == 1 ? "true" : "false")
```

becomes

```julia
transform(df, :x => ByRow(x -> x == 1 ? "true" : "false") => :y)
```
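
For reference, the relationship between `ByRow` and broadcasting described above can be checked with DataFrames.jl alone. The following is a minimal sketch, assuming a toy data frame with a single column `x`; the helper name `label` is hypothetical and only stands in for the expression written after `@byrow`.

```julia
using DataFrames

# Toy data; the column name `x` matches the example above.
df = DataFrame(x = [1, 2])

# Hypothetical helper standing in for the body written after `@byrow`.
label(x) = x == 1 ? "true" : "false"

# `ByRow(label)` applies `label` to each element of the column ...
byrow_result = ByRow(label)(df.x)

# ... which, for this function, matches ordinary broadcasting.
broadcast_result = label.(df.x)

# The same wrapper used in the `source => function => destination` form.
df2 = transform(df, :x => ByRow(label) => :y)
```

In this simple case both forms agree; the differences discussed in the rest of the section only show up with control flow, non-column objects, or repeated calls.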

!!! note
Unlike `@.`, `@byrow` is not a "real" macro and cannot be used outside of
Member

the question is what are the parsing rules for @byrow? does it take exactly one expression that follows it?

Collaborator Author

I included a more detailed discussion in updated docs. Hopefully it's clear.

DataFramesMeta macros. However, its behavior within DataFramesMeta
macros should be indistinguishable from that of externally defined macros.

### Comparison with `@eachrow`

In previous versions of DataFramesMeta, `@eachrow` was named `@byrow`.
Member

Suggested change
In previous versions of DataFramesMeta, `@eachrow` was named `@byrow`.
In previous versions of DataFramesMeta.jl, `@eachrow` was named `@byrow`.

This version of `@byrow` is deprecated, but the syntax can be used
Member

"the syntax" - which syntax?

Collaborator Author

clarified.

for similar, but not identical, behavior.

The syntax

```julia
@eachrow df begin
    :a * :b
end
```

is similar to

```julia
begin
    function tempfun(a, b)
        for i in eachindex(a)
            a[i] * b[i]
        end
    end
    tempfun(df.a, df.b)
    df
end
```

The function `*` is applied by-row. But the result of those operations
is not stored in a new vector. Additionally, `@eachrow` and `@eachrow!`
Member

where is it stored then? (or not stored unless stored explicitly?)

Collaborator Author

it's not stored. It's literally just a for-loop. Hopefully the re-write is clear.

return data frames.

By contrast,

```julia
@with df @byrow begin
    :a * :b
end
```

is similar to

```julia
tempfun(a, b) = a * b
tempfun.(df.a, df.b)
```

`@with` combined with `@byrow` will return a vector of the
broadcasted multiplication and not a data frame.
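
The contrast between the two forms can be reproduced without any macros. This is a rough sketch using only DataFrames.jl; the function name `loop_only` is made up for illustration, and the code only mirrors the "is similar to" expansions shown above rather than what the macros literally generate.

```julia
using DataFrames

df = DataFrame(a = [1, 2], b = [3, 4])

# Loop form (like the `@eachrow` sketch): each product is computed
# and then discarded, so nothing is collected.
function loop_only(a, b)
    for i in eachindex(a)
        a[i] * b[i]
    end
    return nothing
end
loop_result = loop_only(df.a, df.b)

# Row-wise form (like the `@byrow` sketch): the products are
# collected into a new vector, one entry per row.
rowwise_result = ByRow(*)(df.a, df.b)
```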

Additionally, `@eachrow` and `@eachrow!` allow modifying a
data frame. Just as with Base Julia broadcasting, `@byrow` will
not update columns.

```
julia> df = DataFrame(a = [1, 2], b = [3, 4]);

julia> @with df @byrow begin
           :a = 500
       end
2-element Vector{Int64}:
 500
 500

julia> df
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      3
   2 │     2      4

### Comparison with `@.` and Base broadcasting

Base Julia provides the broadcasting macro `@.`, and in many cases `@.`
and `@byrow` will give equivalent results. But there are important
differences in behavior. Consider the setup

```julia
df = DataFrame(a = [1, 2], b = [3, 4])
```

* Control flow. In all versions of Julia, expressions of the form
`if...else`, `a ? b : c` cannot be broadcasted. In versions below
1.7-dev, expressions of the form `a && b` and `a || b` cannot be
broadcasted. Consequently, the `@.` macro will fail when encountering such
control flow while `@byrow` will not.
```
julia> @with df @byrow begin
           if :a == 1
               5
           else
               10
           end
       end
2-element Vector{Int64}:
  5
 10

julia> @with df @. begin
           if :a == 1
               5
           else
               10
           end
       end # will error
```

* Broadcasting objects that are not columns. `@byrow` constructs an
anonymous function *which accepts only the columns of the data frame*
and broadcasts that function. Consequently, any other object referenced
in the expression is passed to each call as-is and is not broadcast.
```julia
@with df @byrow :x + [5, 6]
Member

will this always error? What if column :x would be a vector of two element vectors?

Collaborator Author

Yes, I have clarified in the example.

```
will error. On the other hand
```julia
@with df @. :x + [5, 6]
Member

It will only work if :x has 2 elements - right?

Collaborator Author

yes. I have added more detail.

```
will not.

* Broadcasting expensive calls. In Base Julia, broadcasting
evaluates calls first and then broadcasts the result. Because
`@byrow` constructs an anonymous function and evaluates
that function for every row in the DataFrame, expensive functions
will be evaluated many times.
```julia
julia> function expensive()
           sleep(.5)
           return 1
       end;

julia> @time @with df @byrow :a + expensive();
1.037073 seconds (51.67 k allocations: 3.035 MiB, 3.19% compilation time)

julia> @time @with df :a .+ expensive();
0.539900 seconds (110.67 k allocations: 6.525 MiB, 7.05% compilation time)

```
This problem comes up when using the `@.` macro as well, but can easily be fixed with `$`.
```julia
julia> @time @with df @. :a + expensive();
1.036888 seconds (97.55 k allocations: 5.617 MiB, 3.20% compilation time)

julia> @time @with df @. :a + $expensive();
0.537961 seconds (110.68 k allocations: 6.525 MiB, 6.73% compilation time)
```
No such solution currently exists with `@byrow`.
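
The three differences listed above can be illustrated with `ByRow` from DataFrames.jl directly, without the macros. This is only a sketch under stated assumptions: the names `branch`, `costly`, and `calls` are hypothetical, `costly` is a cheap stand-in for an expensive function, and hoisting the call into a local variable is a plain-Julia workaround assumed here, not something this PR provides.

```julia
using DataFrames

df = DataFrame(a = [1, 2], b = [3, 4])

# 1. Control flow: a branch inside the row function is fine, because
#    the function is called once per row on scalar values.
branch(a) = a == 1 ? 5 : 10
control_flow = ByRow(branch)(df.a)

# 2. Non-column objects are not broadcast: every call sees the whole
#    vector `[5, 6]`, so adding it to a scalar row value throws.
row_error = try
    ByRow(a -> a + [5, 6])(df.a)
    false
catch err
    err isa MethodError
end
# Ordinary broadcasting pairs the elements instead.
elementwise = df.a .+ [5, 6]

# 3. Expensive calls: count how often the stand-in function runs.
calls = Ref(0)
costly() = (calls[] += 1; 1)

ByRow(a -> a + costly())(df.a)   # runs `costly` once per row
calls_per_row = calls[]

calls[] = 0
c = costly()                     # evaluate once, outside the row function
hoisted = ByRow(a -> a + c)(df.a)
calls_hoisted = calls[]
```

Evaluating the call once and referring to the resulting variable keeps the per-row function cheap, at the cost of an extra line before the row-wise expression.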

## Working with column names programmatically with `cols`

DataFramesMeta provides the special syntax `cols` for referring to
Expand Down
2 changes: 1 addition & 1 deletion src/DataFramesMeta.jl
Expand Up @@ -8,7 +8,7 @@ using Reexport
export @with, @where, @orderby, @transform, @by, @combine, @select,
@transform!, @select!,
@eachrow, @eachrow!,
@byrow, @byrow!, @based_on # deprecated
@based_on # deprecated


global const DATAFRAMES_GEQ_22 = isdefined(DataFrames, :pretty_table) ? true : false
Expand Down
21 changes: 0 additions & 21 deletions src/eachrow.jl
Expand Up @@ -70,27 +70,6 @@ function eachrow_helper(df, body, deprecation_warning)
end
end

"""
@byrow!(d, expr)

Deprecated version of `@eachrow`, see: [`@eachrow`](@ref)

Acts the exact same way. It does not change the input argument `d` in-place.
"""
macro byrow!(df, body)
esc(eachrow_helper(df, body, true))
end

"""
@byrow(d, expr)

Deprecated version of `@eachrow`, see: [`@eachrow`](@ref)

Acts the exact same way.
"""
macro byrow(d, body)
esc(eachrow_helper(d, body, true))
end

"""
@eachrow(df, body)
Expand Down