Discussion: Instead of :x allow x #168
Comments
Thanks for this! It's good to have an issue for discussion. I generally think this is the direction to go in. Here is a scenario to consider, though. One thing I would really like to have in DataFramesMeta is to be able to use the
This would make it super easy to have a mix of "global scope" code, i.e. working with literals, and "function scope" code working together. The I'm worried that allowing both Maybe these are easily solvable, I haven't really played around with it yet, though. |
they should be easily solvable because each one is delimited by a comma so you can use a different function to transform it. |
Yes, after the new backend things are essentially
If we kept the same logic we would have
It might be confusing to users to mix literals and symbols in the expressions in different calls. But maybe not! Maybe I am worrying too much about this. |
Also no, I am very skeptical of this because it would break 1:1 comparison with the base transform functions. I know it's more typing and characters but the clarity and comparison overcomes those concerns I think. Please file a separate issue for discussion though. |
My thinking is that it would be best to have one syntax that works (i.e. either CC @vjd |
you mean no
|
No - I meant either use |
You definitely can using A major advantage of |
Are we talking about representing symbols across the dataframemeta package as just variable names? Will this behavior be the same in DataFrames.jl? If it is inconsistent between the two packages, I would go with the consistency of having |
I have been experimenting with The main advantages of
The main disadvantage is that there is typically more escaping to do. In particular, escaping is required when one refers to functions/types as arguments, e.g. |
No, DataFrames would not change this. DataFrames will evaluate everything as normal Julia code evaluates things. This is just for DataFramesMeta. |
@matthieugomez Thanks for your input, it's helpful to have someone already exploring this. |
I guess it is important to maintain some kinds of consistency with DataFrames and Julia Base. I personally prefer :x to x. |
A few more observations from working with
Tbh, while I started with a preference for |
Another proposal is #187 , to have Pros:
Cons:
|
how about |
As I suggested in the other thread,
One option could be not to replace the variable names with column references directly at parse time (which as you say wouldn't work) but to replace them with expressions that check if something exists in the dataframe, and if yes use that, but otherwise use the name normally. That sounds a bit ugly maybe, but the user wouldn't see it anyway. Because of short-circuiting this works:
df = DataFrame(x = [1, 2, 3])
julia> ("x" in names(df) ? df.x : x)
3-element Array{Int64,1}:
 1
 2
 3
julia> ("sum" in names(df) ? df.sum : sum)
sum (generic function with 15 methods)
So an expression like
julia> ("sum" in names(df) ? df.sum : sum)(("x" in names(df) ? df.x : x))
6
As I said, that's a bit ugly, but it's not visible to the user :) |
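For illustration only, the manual expansion above could be generated mechanically. The sketch below is an assumption, not anything from DataFramesMeta: `runtime_lookup` is a hypothetical helper, and a `NamedTuple` stands in for the data frame so that the example runs with Base Julia alone.

```julia
# Hypothetical helper (not part of DataFramesMeta): wrap every bare name in a
# runtime check, "use the column if the table has one, otherwise the plain name".
function runtime_lookup(ex, tbl::Symbol)
    if ex isa Symbol
        name = QuoteNode(ex)
        return :(hasproperty($tbl, $name) ? getproperty($tbl, $name) : $ex)
    elseif ex isa Expr
        return Expr(ex.head, map(a -> runtime_lookup(a, tbl), ex.args)...)
    else
        return ex  # literals are left untouched
    end
end

# A NamedTuple stands in for a data frame (assumption made for self-containment).
tbl = (x = [1, 2, 3],)

# `sum` is not a column, so it falls through to Base.sum; `x` is a column.
checked = runtime_lookup(:(sum(x)), :tbl)
total = eval(checked)
```

Every name, including function names, pays for a lookup on each evaluation, which is part of why this reads as ugly once written out.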
Unfortunately your suggestion about checking if a column is in the data frame is not implementable. If we could do that, then we would never have to escape anything and always get the names right on the first try. Remember that DataFramesMeta has to define a function based off of the expression first. When all we have is the expression, and not the data frame column names, we can't write a function that takes in the right number of arguments. If we see
then we need to make a function that takes in exactly two columns. If we had If there is really a way to get around this, I would love it, but I'm not sure that's possible. |
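The constraint described here can be sketched with plain expression manipulation. This is a simplified illustration under stated assumptions, not DataFramesMeta's actual code: `collect_columns!` and `make_function` are hypothetical names, and column symbols are reused directly as argument names where a real implementation would more likely generate fresh ones.

```julia
# Hypothetical helpers: find every `:col` in an expression and build an
# anonymous function with exactly one argument per distinct column.
function collect_columns!(cols::Vector{Symbol}, ex)
    if ex isa QuoteNode && ex.value isa Symbol
        ex.value in cols || push!(cols, ex.value)
        return ex.value  # `:x` becomes the plain argument name `x`
    elseif ex isa Expr
        return Expr(ex.head, map(a -> collect_columns!(cols, a), ex.args)...)
    else
        return ex
    end
end

function make_function(ex)
    cols = Symbol[]
    body = collect_columns!(cols, ex)
    return cols, Expr(:->, Expr(:tuple, cols...), body)
end

# The column list is known from the expression alone, before any table exists.
cols, fun_expr = make_function(:(:x .+ :y .* 2))
f = eval(fun_expr)
result = Base.invokelatest(f, [1, 2, 3], [10, 20, 30])
```

With bare names instead of `:x`, the walker could not tell columns from locals or functions without seeing the table, so the argument list could not be fixed at this stage.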
I think it's not technically impossible, but after playing around with it for a bit it was so ugly that I wouldn't recommend it. Also, directly marking columns like |
Extra confirmation that re-writing the function based off of |
Update for everyone. I think that
If we had used
It all sounds very difficult.
This only works because
In the above example, trying to re-create this with
Keeping
It would be odd to have I bring this up now because I want to work on adding keyword arguments to these macros, and I think as a necessary pre-condition for this change, we have to convert to using Please let me know your thoughts on this issue! |
I think it is the way to go forward. Thank you for gathering the rationale in one post! |
Thanks for the detailed proposal. I must say I'm still torn between the two options. Marking columns with an explicit syntax like Ideally we would experiment with both solutions in two packages and see what's the best choice in practice. But that would take quite some work... |
The However, as a general rule I am in favour of |
Thank you. However, it seems to me that the user actually felt |
I think it's a win in clarity vs R etc. if we don't mix normal variables with column variables and the user has to sort out what's what or use some escaping rules. I think using external variables is quite common in any kind of expression, and that would probably become a common nuisance if that were to become a special case. Then there are things like I personally still think Another case would be where the column name comes from a variable. While |
Thanks for the feedback. This is a tough decision and it's important to consider the long term consequences. I agree that consistency with If we decide to allow "traditional"
This is okay, as you mentioned it is an edge case. But it's another layer of complication, adding more and more of these edge cases and I would worry things would get a bit out of hand. If we were to implement I can try and open up a branch borrowing from that code and see how it goes. |
Using I do agree you want a way to syntactically differentiate between the two scopes. I think it's unlikely that we'd change statsmodels now, given that naked symbols ( Why not use |
Note that if quoting all "special" calls explicitly is required in a formula then you'd have to write |
I do see the trouble though with "naked" column symbols which is that functions are not columns in the table and should be evaluated wrt. the "local scope". That's not a big problem when you have an explicit call like |
Yes the biggest annoyance with treating symbols as references to columns is functions. But I'm not sure it's a showstopper: the point of DataFramesMeta is that you typically wouldn't use the DataFrames |
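One way to picture the trade-off is a toy rewriter in which bare names are columns, the head of a call is left alone so that functions resolve normally, and `$name` marks a variable from the enclosing scope. Everything here is an assumption for illustration: `rewrite` is a hypothetical function, the `$` escape is only one of the options floated in this thread, and a `NamedTuple` stands in for a data frame.

```julia
# Hypothetical rewriter: bare names -> columns, `$name` -> outer variable,
# and the function position of a call is never treated as a column.
function rewrite(ex, tbl::Symbol)
    if ex isa Symbol
        return Expr(:call, :getproperty, tbl, QuoteNode(ex))
    elseif ex isa Expr && ex.head === :$
        return ex.args[1]  # escaped: keep as an ordinary variable
    elseif ex isa Expr && ex.head === :call
        args = [rewrite(a, tbl) for a in ex.args[2:end]]
        return Expr(:call, ex.args[1], args...)
    elseif ex isa Expr
        return Expr(ex.head, [rewrite(a, tbl) for a in ex.args]...)
    else
        return ex
    end
end

tbl = (x = [1, 2, 3], y = [10, 20, 30])  # stand-in for a data frame
offset = 100                              # an outer variable

ex = Meta.parse("sum(x .* y) + \$offset")
total = eval(rewrite(ex, :tbl))
```

The sketch handles functions only in plain call position. A function passed as an argument, as in `map(f, x)`, or broadcast as `f.(x)`, would still be mistaken for a column and need escaping, which is the annoyance discussed above.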
Yeah, and it might be possible to introduce some indirection where before calling |
I think I've also come around to like the "naked" Also, in the |
Marking this as decision 1.X. I don't want to introduce this big of a breaking change at the moment, and I don't have the bandwidth for a giant PR finding every edge case for I still prefer So we should leave everything as-is for now and make this a breaking change for |
I recently tested out a change to have In particular, it allowed me to do
which maps perfectly to Unlike @jkrumbiegel I was also writing lots of more complicated expressions
Being able to look clearly at the blocks of code and know exactly which ones were columns was super convenient. Additionally, I create a lot of intermediate variables. Which would lead to the following with
Side note, the I also have a |
Using symbols both on the RHS and LHS is indeed more consistent. I also agree that
@transform(years_since_first_closure = begin
    ...
end,
years_since_first_closure_outmesh = begin
    ...
end)
i.e. without repeating the names of the variables twice. That's also more efficient since that allows running operations in parallel in two different threads. The only cases where Then there's the issue of local vs. outer variables, but I'm not sure it's a showstopper. Using outer variables isn't something super common so I'm not sure people would be annoyed if they had to be marked with |
I agree. It requires learning and is breaking, but if we want to go mainstream with DataFramesMeta.jl we should make all key breaking decisions now I think. |
It's also useful if you have variables which rely on one another. Since you currently can't do
Since this isn't possible, it's easier to do with
My point above is that intermediate local variables are prominent, and having |
As another note, if we add It's too confusing with keyword arguments and wouldn't allow us to throw an error if someone expects old behavior. I would require either
or
where |
Or maybe again a macro to show something else is going on? But at some point there are too many macros :) @transform(df, @AsTable f(:x)) |
This was the original idea, but then I realized we could mimic
work, but not
If we rely on |
I like the look of |
Closed via #264 |
This is based on @pdeffebach's comment in #163
"Many people have expressed interest in not forcing people to write :x and instead write x for df.x, similar to dplyr and Stata (@transform(z = x + y)). I support this, conditional on air-tight escaping rules. Should we start that transition now in the development of these macros?"
I like this, but I think it's also nice to use :x as well. Consider this
You need some way to declare what you mean. So
will disambiguate.
Now consider
but how do I use the variable name? We can do
Also, auto-broadcasting so users don't have to type the . would be nice too. We can open up a separate issue for that if you wish.