Support non-vectorized syntax in @where #39
Comments
Here's one way:
macro wherex(df, ex)
    df = esc(df)
    ex = esc(ex)
    quote
        df2 = transform($df, _idx = true)
        $df[(@byrow! df2 :_idx = $ex)[:_idx],:]
    end
end
using DataFrames, DataFramesMeta, RDatasets
iris = dataset("datasets", "iris")
i2 = @wherex iris :Species in ("setosa", "virginica")
It's definitely worth thinking about expanding this idea. |
Interesting. Though it doesn't look like the most efficient way of doing this: it would be good to avoid creating a temporary data frame. Also, I wonder whether the non-vectorized form shouldn't be the recommended one (or even the only supported one): vectorized expressions require storing temporaries when combining operators, which is inefficient. |
Creating a temporary DataFrame is relatively inexpensive, but you could get around it with more effort. Another issue with byrow operations is that the following won't work:
@where(df, :colA .> mean(:colB))
Supporting that requires something like Devectorize.jl. Maybe embedding |
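To illustrate the aggregate problem raised above, here is a minimal base-Julia sketch that uses plain vectors instead of a DataFrame. The data and the names `colA`, `colB`, `m`, `mask_vectorized`, and `mask_rowwise` are made up for illustration and are not part of DataFramesMeta. It only shows why a predicate that sees one row at a time cannot compute `mean(:colB)` itself, and one possible workaround: compute the aggregate once, then compare per row.

```julia
using Statistics  # standard library; provides `mean`

# Hypothetical columns standing in for df[:colA] and df[:colB].
colA = [1.0, 5.0, 3.0]
colB = [2.0, 2.0, 5.0]

# Whole-column form: `mean(colB)` sees the entire column at once.
mask_vectorized = colA .> mean(colB)

# Row-wise form: each row only sees scalars, so the aggregate has to be
# computed up front and then compared against one value at a time.
m = mean(colB)
mask_rowwise = [a > m for a in colA]

println(mask_vectorized == mask_rowwise)
```

Whether a row-wise `@where` could hoist such aggregates automatically is exactly the open question in this thread; the sketch does not answer it.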
I came up with this:
function Base.in{T}(xs::PooledDataArray{T}, ys::AbstractArray{T})
    Bool[any(x in ys) for x in xs]
end
Of course it would need more methods for when
Some micro benchmarks (after JIT warm up):
julia> @time a = @where iris :Species in ["setosa", "virginica"];
0.004165 seconds (1.11 k allocations: 56.937 KB)
julia> @time b = @where iris (:Species .== "setosa") | (:Species .== "virginica");
0.004514 seconds (1.55 k allocations: 84.269 KB)
julia> @time c = @wherex iris :Species in ["setosa", "virginica"];
0.006636 seconds (2.34 k allocations: 117.952 KB)
julia> a == b == c
true |
@Ismael-VC The problem with this method for Also, with complex conditions, a non-vectorized form will always be faster because the vectorized form creates temporary arrays for each one. @tshort Operations relying on aggregate values would indeed no longer be possible with my proposal. Not sure what can be done about it (except having two different forms of |
@nalimilan what about using small in (
julia> function ∊{T}(xs::PooledDataArray{T}, ys::AbstractArray{T})
           Bool[any(x in ys) for x in xs]
       end
∊ (generic function with 1 method)
julia> @where iris :Species ∊ ["setosa", "virginica"];
|
@Ismael-VC I don't think it's a good idea. The two operators are easily confused, and nothing in the definition of "small in" implies it's vectorized. Anyway, the present issue is not about vectorizing |
@nalimilan I think you are contradicting yourself in those statements, also I thought that the idea would be for documentation to explain that and yes I just focused on |
@Ismael-VC Sorry, I don't see where I'm contradicting myself. Here I propose to work row-wise, and use only non-vectorized operators. You proposed to add a new vectorized operator which looks closely like the non-vectorized one. |
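The row-wise idea described in the comment above can be sketched in base Julia on a plain vector. This is only an illustration under assumptions: `species`, `wanted`, `keep_rowwise`, and `keep_vectorized` are hypothetical names, and no DataFramesMeta macro is involved. The row-wise version applies the ordinary scalar `in` to one value at a time, while the vectorized version combines two elementwise comparisons.

```julia
# Hypothetical column values standing in for iris[:Species].
species = ["setosa", "versicolor", "virginica", "setosa"]
wanted = ("setosa", "virginica")

# Row-wise: the scalar `in` is evaluated once per element.
keep_rowwise = [s in wanted for s in species]

# Vectorized: one elementwise comparison per candidate value, combined with `.|`.
keep_vectorized = (species .== "setosa") .| (species .== "virginica")

# Both masks select the same rows.
filtered = species[keep_rowwise]
println(filtered)
```

The row-wise form needs no vectorized counterpart of `in`, which is the point of the proposal.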
Oh yeah you are right. I missed the point of |
Integrating conditional deletion of rows into |
EDIT: Nevermind, posted this to the wrong issue. After JuliaLang/julia#22089, this can be "solved" with
|
Closed in favor of #165 |
Since @where operates by row, I would find it both natural and practical to allow using non-vectorized operators, like == instead of .==. That would be particularly useful for in, which is currently not vectorized and might never be (JuliaLang/julia#5212). For example, it would be great to be able to write: instead of as currently:
Do you think this would be technically possible? (If so, this idea could also be applied to other macros.)