Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion: Instead of :x allow x #168

Closed
xiaodaigh opened this issue Sep 16, 2020 · 44 comments
Closed

Discussion: Instead of :x allow x #168

xiaodaigh opened this issue Sep 16, 2020 · 44 comments
Milestone

Comments

@xiaodaigh
Copy link
Contributor

This is based on @pdeffebach's comment in #163

"Many people have expressed interest in not forcing people to write :x and instead write x for df.x, similar to dplyr and Stata (@Transform(z = x + y)). I support this, conditional on air-tight escaping rules. Should we start that transition now in the development of these macros?"

I like this, but I think it's also nice to use :x as well.

Consider this

x = 1

df = DataFrames(x = 2)

@transform(df, y = x + 1)

You need some to declare what you mean. So

@transform(df, y = x + 1, y = $x + 1)

will disambiguate.

Now consider

name = :x

df = DataFrames(x = 2)

@transform(df, y = sum(x))

but how do I use the variable name? we can do

name = :x

df = DataFrames(x = 2)

@transform(df, y = sum(cols(name))

Also, auto-broadcasting so user don't have to type the . would be nice too. We can open up a desperate issue for that if you wish.

@pdeffebach
Copy link
Collaborator

Thanks for this! It's good to have an issue for discussion.

I generally think this is the direction to go in. Here is a scenario to consider, though.

One thing I would really like to have in DataFramesMeta is to be able to use the source => fun => pair syntax in the same call as the y = f(:x) syntax. Something like

vars = names(df, Between(:x5, :x10)
@transform(df, y = :x1 .+ :x2, vars .=> normalize)

This would make it super easy have a mix of "global scope" code, i.e. working with literals, and "function scope" code working together. The Between(:x5, :x10) .=> normalize would be a real pain to write with the y = x syntax.

I'm worried that allowing both x instead of :x and arbitrary source => fun => dest syntax would make for a super complicated escaping system.

Maybe these are easily solvable, I haven't really played around with it yet, though.

@xiaodaigh
Copy link
Contributor Author

they should be easily solvable because each one is delimited by a comma so you can use a different function to transform it.

@pdeffebach
Copy link
Collaborator

Yes, after the new backend things are essentially

  1. Does the expression look like y = f(:x)? If so, call replace_syms! etc.
  2. If not, evaluate normally.

If we kept the same logic we would have

@transform(df, y = x + z, [:x, :y] => sum => :y2)

It might be confusing to users to mix literals and symbols in the expressions in different calls. But maybe not! Maybe I am worrying too much about this.

@pdeffebach
Copy link
Collaborator

Also, auto-broadcasting so user don't have to type the . would be nice too. We can open up a desperate issue for that if you wish.

Also no, I am very skeptical of this because it would break 1:1 comparison with the base transform functions. I know it's more typing and characters but the clarity and comparison overcomes those concerns I think. Please file a separate issue for discussion though.

@bkamins
Copy link
Member

bkamins commented Sep 16, 2020

My thinking is that it would be best to have one syntax that works (i.e. either :x or x) as having to many options to do the same things might be confusing for the users.

CC @vjd

@pdeffebach
Copy link
Collaborator

pdeffebach commented Sep 16, 2020

My thinking is that it would be best to have one syntax that works (i.e. either :x or x) as having to many options to do the same things might be confusing for the users.

you mean no source => fun => dest syntax in @transform? I think its important to be able to create multiple columns in a programmatic way in @transform. A "flag" might be a good solution,

@transform(y = f(x), @pair [:x, :y] .=> normalize)

@bkamins
Copy link
Member

bkamins commented Sep 16, 2020

No - I meant either use x or :x (not both to a denote column).
Also - can you please remind me if we are able to support column names like "my column" in DataFramesMeta.jl?

@pdeffebach
Copy link
Collaborator

Also - can you please remind me if we are able to support column names like "my column" in DataFramesMeta.jl?

You definitely can using cols. You currently cannot while using a string literal. Whether or not we support using string literals as column names depends on if we want to enforce new escaping rules for strings, i.e. require string-as-data to be prefaced by ^ as in @transform(y = "a string column" * ^("_suffix")

A major advantage of Symbols is that no one really uses Symbols as data so uses that require ^() to escape are rare.

@vjd
Copy link

vjd commented Sep 16, 2020

Are we talking about representing symbols across the dataframemeta package as just variable names? Will this behavior be the same in DataFrames.jl? If it is inconsistent between the two packages, I would go with the consistency of having :x. If DataFrames.jl will also change that behavior to x, then we can think about it?

@matthieugomez
Copy link

matthieugomez commented Sep 16, 2020

I have been experimenting with x instead of :x in DataFramesMacros, here are my two cents.

The main advantages of x are:

  1. It fits with the keyword argument syntax (i.e. lhs and rhs are similar, e.g. y = x, can be escaped similarly etc)
  2. I like the syntax $ to refer to a column name programmatically, e.g. v = :x; @transform(df, y = mean($v))
  3. it give a syntactically correct expression when considering the dataframe as an environment
  4. Less typing / visual noise

The main disadvantage is that there is typically more escaping to do. In particular, escaping is required when one refers to functions/types as arguments, e.g. map(^(f), x) or parse(^(Int), x). Dplyr avoids this issue by only replacing symbols that refer to columns in the DataFrame, but I am not sure a macro can do the same thing in Julia (since names(df) is not known at parsing time).

@pdeffebach
Copy link
Collaborator

Are we talking about representing symbols across the dataframemeta package as just variable names? Will this behavior be the same in DataFrames.jl? If it is inconsistent between the two packages, I would go with the consistency of having :x. If DataFrames.jl will also change that behavior to x, then we can think about it?

No, DataFrames would not change this. DataFrames will evaluate everything as normal Julia code evaluates things. This is just for DataFramesMeta.

@pdeffebach
Copy link
Collaborator

@matthieugomez Thanks for your input, it's helpful to have someone already exploring this.

@waynelapierre
Copy link

I guess it is important to maintain some kinds of consistency with DataFrames and Julia Base. I personally prefer :x to x.

@matthieugomez
Copy link

matthieugomez commented Sep 29, 2020

A few more observations from working with x:

  • x requires to special case missing to avoid interpreting it as a column name
  • x requires to special case calls and dot broadcasts f(x) and f.(x)
  • x cannot be used with non standard column names such as Symbol(“my variable“)
  • x creates potential ambiguities between the LHS of an expression and keyword arguments of transform
  • x is inconsistent with simple functions like groupby or sort that work with :x.

Tbh, while I started with a preference for x, I now think that :x is a better choice for DataFramesMeta. The only thing is that I would like x=:y to be deprecated in favor of :x=:y.

@pdeffebach
Copy link
Collaborator

Another proposal is #187 , to have $x refer to the column df.x instead of :x.

Pros:

  • Don't have to escape Symbols
  • Can work with strings with spaces easier

Cons:

  • Maybe conflicts with other uses of $ in the ecosystem
  • Less syntax highlighting

@xiaodaigh
Copy link
Contributor Author

how about .x? This is unambiguous.

@jkrumbiegel
Copy link
Contributor

As I suggested in the other thread, &x is also a good candidate as & is often used as a reference operator, so it would fit to reference columns with it.

Dplyr avoids this issue by only replacing symbols that refer to columns in the DataFrame, but I am not sure a macro can do the same thing in Julia (since names(df) is not known at parsing time)

One option could be not to replace the variable names with column references directly at parse time (which as you say wouldn't work) but to replace them with expressions that check if something exists in the dataframe, and if yes use that, but otherwise use the name normally. That sounds a bit ugly maybe, but the user wouldn't see it anyway..

Because of short-circuiting this works:

df = DataFrame(x = [1, 2, 3])

julia> ("x" in names(df) ? df.x : x)
3-element Array{Int64,1}:
 1
 2
 3

julia> ("sum" in names(df) ? df.sum : sum)
sum (generic function with 15 methods)

So an expression like sum(x) could be resolved just fine as:

julia> ("sum" in names(df) ? df.sum : sum)(("x" in names(df) ? df.x : x))
6

As I said, that's a bit ugly, but it's not visible to the user :)

@pdeffebach
Copy link
Collaborator

&x is another good contender, particularly for having use in other macros.

Unfortunately your suggestion about checking if a column is in the data frame is not implementable. If we could do that, then we would never have to escape anything and always get the names right on the first try.

Remember that DataFramesMeta has to define a function based off of the expression first. When all we have is the expression, and not the data frame column names, we can't write a function that takes in the right number of arguments. If we see

@transform(df, y = :a + :b + c)

then we need to make a function that takes in exactly two columns. If we had a and b instead of :a or :b we can't perform the logic that checks if a and b are really columns in the data frame at the time we construct the function.

If there is really a way to get around this, I would love it, but I'm not sure that's possible.

@jkrumbiegel
Copy link
Contributor

I think it's not technically impossible, but after playing around with it for a bit it was so ugly that I wouldn't recommend it.. Also, directly marking columns like &col is a really obvious thing for readers, who might not know all variables that could be in the dataframe or not

@pdeffebach
Copy link
Collaborator

Extra confirmation that re-writing the funciton based off of propertynames doesn't work here.

@pdeffebach
Copy link
Collaborator

Update for everyone. I think that :x instead of x is the best path forward.

  1. Using x instead of :x would make it way harder to inspect an expression and figure out what the columns are without some sort of escaping everywhere. As an extreme edge case, consider the scenario where we we have a @ByRow macro for broadcasting as well as elements of :x are iterable
@transform(df, y = @ByRow x(1)

If we had used x(1) then we would have to have some way to distinguish that call from

@transform(df, y = f(x)

It all sounds very difficult.

  1. It opens the door for many new metaprogramming tricks, such as creating new columns in @with blocks.
@with df begin 
    :z = :y + :b
end

This only works because :z = is disallowed.

  1. It becomes more consistent with @eachrow. i.e.
@eachrow df begin
    a = 500
    :x = :y .- a
end

In the above example, trying to re-create this with x instead of :x would be really hard, probably impossible. Since this has to stay, better have the rest of the ecosystem adapt.

  1. It's consistent with DataFrames. In the future, we will allow arbitrary expressions in transform calls
@transform(df, [:a, :b] => f => :c, :d = :a .+ :b)

Keeping :x everywhere is more consistent with people moving back and forth between the two.

  1. We want to allow keyword arguments in the macros, i.e.
@transform(groupby(df, :g), :y = f(:x); ungroup = false)

It would be odd to have ungroup be a keyword argument in this setting when you create new columns with y = f(:x)

I bring this up now because I want to work on adding keyword arguments to these macros, and I think as a necessary pre-condition for this change, we have to convert to using :x on the LHS and RHS of the =.

Please let me know your thoughts on this issue!

@bkamins
Copy link
Member

bkamins commented Dec 19, 2020

I think it is the way to go forward. Thank you for gathering the rationale in one post!

@nalimilan
Copy link
Member

nalimilan commented Dec 19, 2020

Thanks for the detailed proposal. I must say I'm still torn between the two options.

Marking columns with an explicit syntax like :x (or $x or &x) is indeed consistent and explicit. OTOH making the syntax simple is kind of the point of DataFramesMeta, even it it diverges from DataFrames. Since accessing and creating columns is by far the most common operation, it would make sense to have it use the simplest syntax (i.e. y = f(x)) and require explicit, more complex constructs for other operations (like accessing an external variable). For example, we could require using ; before keyword arguments, or disallow creating columns called ungroup unless some escaping is applied. And things like x(1) are too rare to influence the design IMO. Another consideration is the consistency with @formula, which was raised by a newcomer on Discourse recently.

Ideally we would experiment with both solutions in two packages and see what's the best choice in practice. But that would take quite some work...

@bkamins
Copy link
Member

bkamins commented Dec 19, 2020

The @formula consideration is relevant (I have not seen this thread, but maybe actually it would be better to switch to using : in @formula also - can you please cross post the link).

However, as a general rule I am in favour of :x as it is then consistent and visually explicit. Typing : is not super noisy in my opinion. But let us wait and hear what other people think.

@nalimilan
Copy link
Member

@bkamins
Copy link
Member

bkamins commented Dec 19, 2020

Thank you. However, it seems to me that the user actually felt :A is more intuitive rather than A?

@jkrumbiegel
Copy link
Contributor

jkrumbiegel commented Dec 19, 2020

I think it's a win in clarity vs R etc. if we don't mix normal variables with column variables and the user has to sort out what's what or use some escaping rules.

I think using external variables is quite common in any kind of expression, and that would probably become a common nuisance if that were to become a special case. Then there are things like reduce(some_func, variable), where some_func would also need to be escaped, and that is not the "external variable" case.

I personally still think &col is nice because there are no collisions, where :col will need symbol escaping. Lots of functions use symbols to qualify the type of algorithm run, etc, so that could also be a common nuisance having to escape normal symbols. Although the syntax highlighting will look nicer with symbols.

Another case would be where the column name comes from a variable. While :("column_" * some_var) collides with quoting syntax, &("column_" * some_var) doesn't collide with anything, I think?

@pdeffebach
Copy link
Collaborator

Thanks for the feedback. This is a tough decision and it's important to consider the long term consequences.

I agree that consistency with @formula is important, bu I think that the StatsModels teams is actually kind of struggling with their choice of x insead of :x. For instance, this PR aimed at working with custom functions has some slightly complicated protect and unprotect logic about what should be evaluated as a functional term and what shouldn't.

If we decide to allow "traditional" transform inputs, i.e. @transform(df, :x => f => :y), we would have to have a special syntax to allow

t = :x => f => :y 
@transform(df, @protect t)

This is okay, as you mentioned it is an edge case. But it's another layer of complication, adding more and more of these edge cases and I would worry things would get a bit out of hand.

If we were to implement x instead of :x, I would borrow code from Volcanito.jl. The logic is here. It's a bit more complicated than what we currently have.

I can try and open up a branch borrowing from that code and see how it goes.

@kleinschmidt
Copy link

Using :x instead of x in formula is orthogonal to the issue of protect/unprotect; that's more about whether the function calls themselves are special or not, whereas what we're discussing here is whether you're looking up a symbol binding in local scope or in the table scope.

I do agree you want a way to syntactically differentiate between the two scopes. I think it's unlikely that we'd change statsmodels now, given that naked symbols (x) is consistent with every other implementation out there.

Why not use x (table columns) and $x (local variables)? I've always read $x as "splice in teh value of this variable in the calling scope".

@kleinschmidt
Copy link

Note that if quoting all "special" calls explicitly is required in a formula than you'd have to write f :~ x :+ y :+ z which I think we can agree is bad.

@kleinschmidt
Copy link

I do see the trouble though with "naked" column symbols which is that functions are not columns in the table and should be evaluated wrt. the "local scope". That's not a big problem when you have an explicit call like f(x) because syntactically they're differentiated, but if you are passing in a function with the => DataFrames syntax it's awkward.

@nalimilan
Copy link
Member

nalimilan commented Jan 15, 2021

Yes the biggest annoyance with treating symbols as references to columns is functions. But I'm not sure it's a showstopper: the point of DataFramesMeta is that you typically wouldn't use the DataFrames => syntax (otherwise just use DataFrames). So most of the time you'd write f(x). The main cases where that doesn't work would be map(f, x), sum(f, x) and so on. But map($f, x) may be acceptable (and we could even warn for common cases like this).

@kleinschmidt
Copy link

Yeah, and it might be possible to introduce some indirection where before calling getpropoerty(df, sym) you check whether it has that property and if not throw an informative error message

@jkrumbiegel
Copy link
Contributor

jkrumbiegel commented Jan 17, 2021

I think I've also come around to like the "naked" column syntax more than :column, just because it achieves the highest simplicity in most cases. You're probably right that usually the function calls used are relatively simple, and it would be easiest to just use $ to mark any local-scope variables to interpolate. That would then include functions etc from global scope. I think it's better that more "difficult" cases like map(f, ...) cause a newcomer to look up a special case, and the simple cases like sum(col) don't.

Also, in the @transform(newcol = :col1 + :col2) I don't really like the asymmetry, then it should also be :newcol.

@pdeffebach pdeffebach added this to the Decision 1.X milestone Mar 7, 2021
@pdeffebach
Copy link
Collaborator

Marking this as decision 1.X.

I don't want to introduce this big of a breaking change at the moment, and I don't have the bandwidth for a giant PR finding every edge case for x vs :x.

I still prefer :x to x, but worry that making that change (@transform(df, :y = f(:x)), but worry that making that change now, then potentially reverting to x later, would cause too much variation and breakage for users.

So we should leave everything as-is for now and make this a breaking change for 2.0.

@pdeffebach
Copy link
Collaborator

I recently tested out a change to have :x on the LHS, i.e. transform(df, :y = f(:x)) and... I really liked it.

In particular, it allowed me to do

@tranform df AsTable = begin
    (a = mean(:x), b = mean(:y))
end

which maps perfectly to transform(df, [:x, :y] => ... => AsTable).

Unlike @jkrumbiegel I was also writing lots of more complicated expressions

		@groupby :firm_id :year
		@transform :closure_outmesh = begin
			(sum(:closure) .- :closure) .> 0
		end
		@groupby :mesh :year
		@transform :stays = mean(:closed .== false) .+ .1 .* :economic_shock
		@groupby :mesh :year
		@combine begin
			any_closure_outmesh = any(:closure_outmesh)
			stays = mean(:stays)
			any_closure = any(:closure)
			num_locations = sum(:closed .== false)
			(;any_closure_outmesh, stays, any_closure)
		end
		@groupby :mesh
		@transform AsTable = begin
			years_since_first_closure = begin
				ind = findfirst(:any_closure)
				if ind == nothing
					fill(-1000, length(:year))
				else
					:year .- :year[ind]
				end
			end

			years_since_first_closure_outmesh = begin
				ind = findfirst(:any_closure_outmesh)
				if ind == nothing
					fill(-1000, length(:year))
				else
					:year .- :year[ind]
				end
			end

			(;years_since_first_closure, years_since_first_closure_outmesh)
		end

Being able to look clearly at the blocks of code and know exactly which ones were columns was super convenient. Additionally, I create a lot of intermediate variables. Which would lead to the following with x

  1. Mark all local variables with $a. This is a pain
  2. Only mark "outer" variables with $a, local temporary don't need a $. This is also annoying because now we have 3 classes of variables to worry about: columns a, local temporary variables, a, and "outer" variables $a.

Side note, the begins and ends in the above block are pretty annoying and noisy, but I'm not sure what to do about that.

I also have a @groupby with a VarArg form. I like it a lot and would like to add it.

@nalimilan
Copy link
Member

Using symbols both on the RHS and LHS is indeed more consistent.

I also agree that AsTable should be supported in some way. But I'm not super convinced by the AsTable examples you give. For example, the last one would more simply be written as

@transform(years_since_first_closure = begin
               ...
           end,
           years_since_first_closure_outmesh = begin
               ...
           end)

i.e. without repeating the names of the variables twice. That's also more efficient since that allows running operations in parallel in two different threads. The only cases where AsTable is useful is when you compute several variables together, like with extrema or with custom complex computations.

Then there's the issue of local vs. outer variables, but I'm not sure it's a showstopper. Using outer variables isn't something super common so I'm not sure people would be annoyed if they had to be marked with $.

@bkamins
Copy link
Member

bkamins commented Mar 17, 2021

Using symbols both on the RHS and LHS is indeed more consistent.

I agree. It requires learning and is breaking, but if we want to go mainstream with DataFramesMeta.jl we should make all key breaking decisions now I think.

@pdeffebach
Copy link
Collaborator

pdeffebach commented Mar 17, 2021

That's also more efficient since that allows running operations in parallel in two different threads. The only cases where AsTable is useful is when you compute several variables together, like with extrema or with custom complex computations.

It's also useful if you have variables which rely on one another. Since you currently can't do

transform(df, :a => fun => :b, :b => fun2 => :c)

Since this isn't possible, it's easier to do with AsTable.

Then there's the issue of local vs. outer variables, but I'm not sure it's a showstopper. Using outer variables isn't something super common so I'm not sure people would be annoyed if they had to be marked with $.

My point above is that intermediate local variables are prominent, and having x instead of :x would cause confusion between local variables and columns as well.

@pdeffebach
Copy link
Collaborator

As another note, if we add :x instead of x on the LHS before 1.0, I would still want to disallow @transform(df, AsTable = ...).

It's too confusing with keyword arguments and wouldn't allow us to throw an error if someone expects old behavior. I would require either

@transform(df, :y = f(:x))

or

@transform(df, cols(AsTable) = f(:x))

where cols() can contain anything acceptable as dest in DataFrames.

@jkrumbiegel
Copy link
Contributor

jkrumbiegel commented Mar 18, 2021

cols(AsTable) looks convoluted, I liked AsTable = ... better. Having a whole expression as a keyword looking syntax is not less confusing I'd say.

Or maybe again a macro to show something else is going on? But at some point there are too many macros :)

@transform(df, @AsTable f(:x))

@pdeffebach
Copy link
Collaborator

Or maybe again a macro to show something else is going on? But at some point there are too many macros :)

@transform(df, @AsTable f(:x))

This was the original idea, but then I realized we could mimic source => fun => dest exactly by having AsTable on the LHS.

AsTable = ... would be nice, but it would create an asymmetry. We could "special case" as-table to be the only thing non-Symbol on the LHS. If we just let the LHS be anything, then you would have the unfortunate assymetry of

x = :newcol
@transform(df, x = f(:y))

work, but not

x = :col 
@transform(df, :y = f(x))

If we rely on cols(AsTable) then all behavior is the exact same on the LHS and RHS of the expression.

@xiaodaigh
Copy link
Contributor Author

I like the look of :new_col = :old_col .+ 1

@pdeffebach
Copy link
Collaborator

Closed via #264

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants