Nested transform is broken on master branch #176

Closed
tk3369 opened this issue Sep 27, 2020 · 3 comments · Fixed by #180

Comments

@tk3369

tk3369 commented Sep 27, 2020

Given this data frame:

julia> d = DataFrame(rand(2,3))
2×3 DataFrame
│ Row │ x1        │ x2       │ x3       │
│     │ Float64   │ Float64  │ Float64  │
├─────┼───────────┼──────────┼──────────┤
│ 1   │ 0.876809  │ 0.385893 │ 0.111728 │
│ 2   │ 0.0609952 │ 0.15999  │ 0.100248 │

A plain transform works fine:

julia> @transform(d, y1 = :x3 .* 2)
2×4 DataFrame
│ Row │ x1        │ x2       │ x3       │ y1       │
│     │ Float64   │ Float64  │ Float64  │ Float64  │
├─────┼───────────┼──────────┼──────────┼──────────┤
│ 1   │ 0.876809  │ 0.385893 │ 0.111728 │ 0.223456 │
│ 2   │ 0.0609952 │ 0.15999  │ 0.100248 │ 0.200496 │

But if I nest that transform inside another one, the new column y2 is missing from the result:

julia> @transform(@transform(d, y1 = :x3 .* 2), y2 = :y1 .* 2)
2×4 DataFrame
│ Row │ x1        │ x2       │ x3       │ y1       │
│     │ Float64   │ Float64  │ Float64  │ Float64  │
├─────┼───────────┼──────────┼──────────┼──────────┤
│ 1   │ 0.876809  │ 0.385893 │ 0.111728 │ 0.223456 │
│ 2   │ 0.0609952 │ 0.15999  │ 0.100248 │ 0.200496 │

If I assign the intermediate result to a temporary variable first, it works properly, so the problem is probably in the macro.

julia> d_temp = @transform(d, y1 = :x3 .* 2)
2×4 DataFrame
│ Row │ x1        │ x2       │ x3       │ y1       │
│     │ Float64   │ Float64  │ Float64  │ Float64  │
├─────┼───────────┼──────────┼──────────┼──────────┤
│ 1   │ 0.876809  │ 0.385893 │ 0.111728 │ 0.223456 │
│ 2   │ 0.0609952 │ 0.15999  │ 0.100248 │ 0.200496 │

julia> @transform(d_temp, y2 = :y1 .* 2)
2×5 DataFrame
│ Row │ x1        │ x2       │ x3       │ y1       │ y2       │
│     │ Float64   │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼───────────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ 0.876809  │ 0.385893 │ 0.111728 │ 0.223456 │ 0.446911 │
│ 2   │ 0.0609952 │ 0.15999  │ 0.100248 │ 0.200496 │ 0.400992 │

My versions:

(DataFramesMetaTest) pkg> st
Status `~/Julia/DataFramesMetaTest/Project.toml`
  [a93c6f00] DataFrames v0.21.7
  [1313f7d8] DataFramesMeta v0.5.1 `https://github.com/JuliaData/DataFramesMeta.jl#master`

julia> versioninfo()
Julia Version 1.5.1
Commit 697e782ab8 (2020-08-25 20:08 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.5.0)
  CPU: Intel(R) Core(TM) i5-4258U CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, haswell)
Environment:
  JULIA_NUM_THREADS = 4

@pdeffebach
Collaborator

Something odd is going on. I added a print statement, and transform_helper is called four times for the nested case, where two calls would be expected.

julia> @transform(d, y1 = :x3 .* 2)
inside `transform_helper`
2×4 DataFrame
│ Row │ x1       │ x2       │ x3        │ y1       │
│     │ Float64  │ Float64  │ Float64   │ Float64  │
├─────┼──────────┼──────────┼───────────┼──────────┤
│ 1   │ 0.398607 │ 0.821873 │ 0.0643828 │ 0.128766 │
│ 2   │ 0.138358 │ 0.74625  │ 0.639852  │ 1.2797   │

julia> @transform(@transform(d, y1 = :x3 .* 2), y2 = :y1 .* 2)
inside `transform_helper`
inside `transform_helper`
inside `transform_helper`
inside `transform_helper`
2×4 DataFrame
│ Row │ x1       │ x2       │ x3        │ y1       │
│     │ Float64  │ Float64  │ Float64   │ Float64  │
├─────┼──────────┼──────────┼───────────┼──────────┤
│ 1   │ 0.398607 │ 0.821873 │ 0.0643828 │ 0.128766 │
│ 2   │ 0.138358 │ 0.74625  │ 0.639852  │ 1.2797   │

@pdeffebach
Collaborator

I have found the answer. The problem is here,

    quote
        out = $DataFrames.transform($x, $(t...))
        if $x isa GroupedDataFrame
            out = out[$x.idx, :]
        end
        out
    end

I have to re-sort the output to stay consistent with the previous @transform behavior on a grouped data frame. For a reason I don't understand yet, this causes transform to be called more times than necessary.
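
One plausible explanation, sketched under assumptions: the quoted block splices $x three times, so when $x is itself a macro call, that inner call is expanded and evaluated once per splice. If the macro escapes the whole block (an assumption about how the helper's result is returned), the inner and outer blocks also share the variable out, and the re-evaluation inside the isa check overwrites it with the inner result. The toy macro below, @leaky, is hypothetical and not DataFramesMeta code; it splices its argument twice to reproduce the effect with plain integers.

```julia
# Hypothetical macro for illustration only (not part of DataFramesMeta).
# It splices `x` twice and escapes the whole block, so `out` is not renamed
# by macro hygiene and is shared between nested expansions.
macro leaky(x)
    println("expanding @leaky")   # runs once per expansion
    esc(quote
        out = $x + 1
        if $x isa Int             # second splice: `x` is evaluated again here
            nothing
        end
        out
    end)
end

single = @leaky(1)            # 2, as expected
nested = @leaky(@leaky(1))    # 2, not 3: the second evaluation of the inner
                              # call resets `out` before it is returned
```

The nested call expands the macro three times (once for the outer call plus once per splice of the inner call), which parallels the four calls to transform_helper seen with three splices of $x.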

If I instead do

function transform2(x, args...)
    println("in transform2")
    out = DataFrames.transform(x, args...)
    if x isa GroupedDataFrame
        out = out[x.idx, :]
    end
    return out
end


function transform_helper(x, args...)

    t = (fun_to_vec(arg) for arg in args)

    quote
        $transform2($x, $(t...))
    end
end

Then everything works as expected.
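
A minimal sketch of why a function barrier helps, assuming the cause is repeated splicing of the macro argument: the macro now splices its argument exactly once, as a function argument, and the type check and any temporary variable live inside an ordinary function. The names add_one and @barrier are hypothetical stand-ins for transform2 and @transform.

```julia
# Hypothetical helper standing in for `transform2`: the branch on the
# argument's type happens at run time, on a value computed exactly once.
add_one(v) = v isa Int ? v + 1 : v

# Hypothetical macro standing in for `@transform`: `x` is spliced once, and
# no variable is assigned in the caller's scope.
macro barrier(x)
    println("expanding @barrier")   # runs once per expansion
    esc(:($add_one($x)))
end

nested = @barrier(@barrier(1))   # 3: inner and outer each applied once
```

Here the nested call expands the macro only twice, once per macro call, so the inner expression is evaluated a single time.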

I will spend more time on an MWE to fully understand why this version works and the original does not.

Options are

  1. Use this fix, with a function barrier
  2. Break @transform behavior by not re-sorting.

@pdeffebach
Collaborator

I have opted for option 2: break the previous @transform behavior by not re-sorting.
