Easier syntax for mixed computations by row or by column #186

jkrumbiegel · 2020-10-07T07:26:29Z

I came up with a syntax that I like quite a lot in my own little macro package I was trying out, and I wanted to see if there is interest in adding something like it here.

The problem

Some computations in @transform, @combine, etc. are easier to express when thinking about the inputs as whole column vectors. Others are easier to write in an elementwise fashion. Currently, computations are done on whole vectors. There is a @byrow macro, but I think it iterates over all rows, making it slower than what I have in mind. In any case, it cannot be mixed with the vector style.

The solution

We can pun on broadcasting assignment syntax to solve this issue. Here is an example:

df = DataFrame(val = rand(1:4, 100), tup = rand([(1, 2, 3), (4, 5, 6)], 100))

# vector syntax
@transform(df, z = :val .* getindex.(:tup, 2))

# proposed row wise syntax
@transform(df, z .= :val * :tup[2])

How it works

The default syntax creates this function:

(val, tup) -> val .* getindex.(tup, 2)

The rowwise version creates this:

(val, tup) -> map((v, t) -> v * t[2], val, tup)

This should still be as fast as possible for a row wise computation, compared to iterating all rows unnecessarily.
Another benefit is that the two syntaxes can be easily mixed, depending on which way of thinking is more appropriate for the current computation.
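
A minimal sketch in plain Julia (no DataFramesMeta involved) of the two generated functions described above, assuming the row-wise closure is meant to operate on its own arguments `v` and `t`. The names `colwise` and `rowwise` are made up for illustration.

```julia
# Two columns, represented as plain vectors
val = [1, 2, 3, 4]
tup = [(1, 2, 3), (4, 5, 6), (1, 2, 3), (4, 5, 6)]

# Column style: the function receives whole vectors and broadcasts explicitly
colwise = (val, tup) -> val .* getindex.(tup, 2)

# Row style: the expression is written for a single row and mapped over the columns
rowwise = (val, tup) -> map((v, t) -> v * t[2], val, tup)

z_col = colwise(val, tup)
z_row = rowwise(val, tup)
```

Both calls produce the same vector; only the way the expression is written differs.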


pdeffebach · 2020-10-07T16:11:47Z

Thanks for this! I appreciate the proposal.

I think this idea conflicts with other uses of the .= syntax, though.

y .= f(x)

means that y is updated in-place by the values returned by f(x), i.e.

t = f(x)
for i in eachindex(y)
    y[i] = t[i]
end

This is not what's going on with the .= operator in your proposal. In particular, the proposed operation is always going to allocate a new column, and z might not even exist yet to be assigned into.
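
To make the contrast concrete, here is a small plain-Julia sketch of the standard meaning of `.=`. The function `f` and the variable names are placeholders, not anything from DataFramesMeta.

```julia
f(x) = 2 .* x          # placeholder function returning a new vector
x = [1, 2, 3]

# Standard meaning: write into an array that already exists
y = zeros(Int, 3)
y_before = y
y .= f(x)
same_array = y === y_before   # y still refers to the same array object

# If the destination was never defined, there is nothing to write into
no_destination = try
    new_col .= f(x)           # `new_col` is deliberately undefined
    false
catch err
    err isa UndefVarError
end
```

In the proposal, `z .= ...` would instead create a column `z` that does not exist beforehand, which is where it departs from this behavior.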

Note that in DataFramesMeta, on both the release branch and master, you can use @. to broadcast the whole expression:

@transform(df, y = @. :val * getindex(:tup, 2))

You are right that the :tup[2] doesn't work, though. iirc this is something that might be allowed in Julia base in the future.

jkrumbiegel · 2020-10-07T19:08:20Z

I do know that the broadcasted assignment syntax is usually understood as mutation of an array, I just felt that for a macro which defines a nonstandard DSL for DataFrame manipulation, such a nonstandard functionality is ok.

I actually have the situation quite often that I have an expression that is not really suitable for broadcasting syntax. For example, keywords don't broadcast, so if you want to pass values of one column as keyword arguments to a function, that wouldn't work unless you manually created the map that I proposed here. Do you think that's too uncommon to make it easy with this dot assignment syntax?
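
A small plain-Julia sketch of the keyword-argument case, using `round` and its `digits` keyword purely as a stand-in for whatever function one would actually call. The vectors `xs` and `ds` play the role of two columns.

```julia
xs = [3.14159, 2.71828]
ds = [1, 3]

# Dot-call syntax broadcasts positional arguments only; the keyword receives
# the whole vector `ds`, which `round` does not accept
broadcast_fails = try
    round.(xs; digits = ds)
    false
catch
    true
end

# The manual map pairs each value with its own keyword value
rounded = map((x, d) -> round(x; digits = d), xs, ds)
```

This is the `map` that the proposed row-wise syntax would generate automatically.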

pdeffebach · 2020-10-07T19:32:25Z

a macro which defines a nonstandard DSL for DataFrame manipulation, such a nonstandard functionality is ok

I think that this is a slightly different mental model of DataFramesMeta than the one I imagine people having. I would like people to view DataFramesMeta as a way to construct an expression to pass to DataFrames.transform. I don't think it's very clear from y .= f(:x) that a user should read it as :x => ByRow(f) => :y. imo a simple flag such as @byrow y = f(:x) is more explicit (assuming we change the name of the existing @byrow macro).
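
For readers who have not seen it, the DataFrames.jl form that such a flag would stand for looks roughly like this. The function `f` and the data are placeholders chosen for illustration.

```julia
using DataFrames

df = DataFrame(x = [1, 2, 3])
f(x) = x + 1                     # placeholder scalar function

# ByRow wraps f so that it is applied to each element of :x
out = transform(df, :x => ByRow(f) => :y)
```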

jkrumbiegel · 2020-10-08T08:06:47Z

a simple flag such as @byrow y = f(:x) is more explicit

Maybe you're right, I didn't even know about the ByRow(f) wrapper, because I guess I would not use it anyway in its standard form because things get very verbose. But in this macro package I think it could be good. Something like this?

@transform(df,
	@byrow y = f(:x),
	z = g.(:x))

# or

@transform(df,
	y = @byrow f(:x),
	z = g.(:x))

this is a slightly different mental model

I understand where you're coming from. Personally, I value non-redundant syntax a lot, especially for things that I have to write over and over. So to me, a package like DataFramesMeta really makes DataFrames comfortable to use, because DataFrames' defaults rely on a lot of redundant typing. Which is fine, because it's understandable that the base package doesn't want to use macros. But I'd also say that this frees macro packages which try to implement the smoothest DSL workflow possible from staying too close to the original syntax.

In my mind it's not a problem to do a .= keyword syntax in a macro, people can easily learn what it means and move on. I don't know if you have used R's data.table before, which in my opinion is a good example of having a slightly steeper learning curve, but then offering a very concise syntax once you get the hang of it. It takes so much redundancy out of the split apply combine workflow. You could argue too that it doesn't do multidimensional indexing even though that's what it looks like for an outsider. They just determined that this is a really good way to express the typical transformations they need.

matthieugomez · 2020-10-14T06:26:31Z

Interesting. Another limitation of your proposal is that it cannot be extended to expressions that do not create a new variable (which may happen in @where, @orderby, etc.). But I like the idea: maybe a related one would be to use the syntax @transform., @select., etc.

nalimilan · 2020-10-15T12:29:21Z

Using .= is appealing, but in Julia one needs to use dots elsewhere in the expression to enable broadcasting, e.g. y .= x .+ 1. So that would essentially be a syntax pun.

But I like the idea: maybe a related one would be to use the syntax @transform., @select., etc.

Unfortunately @select. isn't a valid identifier name so it can't work.

matthieugomez · 2020-10-15T14:12:25Z

Ok, I thought it might be possible since `@.` exists, but I did not look into it.


pdeffebach · 2021-08-13T13:43:21Z

Closed with the addition of @rtransform etc. in #267
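
As a rough sketch of how the original example reads with @rtransform, assuming a DataFramesMeta release that includes #267 and uses the `:z = ...` form for new columns (the data here are made up):

```julia
using DataFrames, DataFramesMeta

df = DataFrame(val = [1, 2, 3, 4],
               tup = [(1, 2, 3), (4, 5, 6), (1, 2, 3), (4, 5, 6)])

# Row-wise: the expression is written for a single row
rowwise = @rtransform(df, :z = :val * :tup[2])

# Column-wise equivalent with explicit broadcasting
colwise = @transform(df, :z = :val .* getindex.(:tup, 2))
```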

pdeffebach added this to the 1.X milestone Mar 7, 2021

pdeffebach closed this as completed Aug 13, 2021
