Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Functions/structs to move out of DataFrames to DataAPI (DataFrames syntax for DTable) #3075

Open
krynju opened this issue Jun 12, 2022 · 10 comments
Labels
ecosystem Issues in DataFrames.jl ecosystem

Comments

@krynju
Copy link

krynju commented Jun 12, 2022

I'm wrapping up JuliaParallel/Dagger.jl#344 and I managed to narrow down the list of things I need to import from DataFrames to get the DataFrames-style select somewhat working

Here's the list:

Index
ColumnIndex
MultiColumnIndex
normalize_selection
ByRow
AsTable

With these in a common package I could drop the DataFrames dependency altogether to get the select syntax working

I think with some more work broadcast_pair could be moved out as well and then one could technically have DataFrames select syntax working just by implementing these two functions for their own structure:

function _manipulate(df::DataFrame, normalized_cs::Vector{Any}, copycols::Bool, keeprows::Bool)
function manipulate(dt::DataFrame, args::AbstractVector{Int}; copycols::Bool, keeprows::Bool, renamecols::Bool)
@bkamins bkamins added the ecosystem Issues in DataFrames.jl ecosystem label Jun 12, 2022
@bkamins bkamins added this to the 1.4 milestone Jun 12, 2022
@bkamins
Copy link
Member

bkamins commented Jun 12, 2022

We can do it relatively easily, but I think it should be a separate interface package, not DataAPI.jl.

@nalimilan - what do you think?

@quinnj
Copy link
Member

quinnj commented Jun 12, 2022

Why not DataAPI.jl? It seems like it would be fine there.

@bkamins
Copy link
Member

bkamins commented Jun 13, 2022

Because these are not only signatures, but also quite complex logic (probably in total it is ~1000LOC - I have not counted). I thought we want to keep DataAPI.jl lightweight so that it does not add compilation latency to packages that depend on it.

@nalimilan
Copy link
Member

Yes, DataAPI should remain really lightweight. We could include empty definitions there (or basic constructors) but not big implementations. These would sound more suited to something like TableTransforms, but I'm not sure they are generic enough for that. Does the Dagger implementation support any kind of Tables.jl object that may be wrapped by a DTable?

Otherwise we can create another package. We talked about having DataFramesBase in the past (#1764). Though the drawback is that it can make it more difficult to develop DataFrames. We would have to define more precisely what is the API and try to keep it stable.

@bkamins
Copy link
Member

bkamins commented Jun 16, 2022

We could include empty definitions there

There is no benefit from this. The point is to have a shared implementation so that both packages process operation specification language in the same way.

they are generic enough

They are not

Does the Dagger implementation support any kind of Tables.jl object

DTable is not DataFrames.jl specific, but this is irrelevant. The functions tht @krynju proposes to move are not related to DataFrames.jl implementation details of an AbstractDataFrame object - they only serve as pre-processing of operation specification language to cannonical form.

We would have to define more precisely what is the API and try to keep it stable.

As commented above - the objective of the functionalities that @krynju proposes to extract out is narrower than full implementation of common data frame operations. It is just pre-processing of operation specification language (and this is something essentially independent from AbstractDataFrame except for the fact that it accepts both Symbol and string as column name, but I think it is currently a common consensus that this is a desirable way to look at column names)

@krynju
Copy link
Author

krynju commented Jun 16, 2022

Yes, this is basically all about parsing the args... provided by the user into a format that the end structure (DataFrames or DTable) can take and manipulate the data with it.

Let's take this piece of code as an example:

select(df::DTable, @nospecialize(args...); copycols::Bool=true, renamecols::Bool=true) =
    manipulate(df, map(x -> broadcast_pair(df, x), args)...,
        copycols=copycols, keeprows=true, renamecols=renamecols)

I think the ideal setup for now would be to:

  1. move broadcast_pair to an external package
  2. move the Index related things mentioned in the original post to an external package
  3. pinpoint the bare minimum interface to use broadcast_pair (like names, index etc.)
  4. leave the structure to implement it's own manipulate set of functions to handle the output of map(x -> broadcast_pair(df, x), args)...
  5. put bare definitions of select, transform, combine etc. into DataAPI, so that other structures can extend them (like the *join methods)

Even if this args parsing part of DataFrames is moved out I think it will be pretty simple to maintain compatibility as this frontend part of DataFrames is pretty stable in general

@quinnj
Copy link
Member

quinnj commented Jun 16, 2022

This all makes sense to me; thanks for the explanations. Sounds like a great plan.

@nalimilan
Copy link
Member

OK. How should we call it then? If it's not specific to DataFrames the name should be relatively general, as this kind of syntax could be useful for other tables.

@bkamins bkamins modified the milestones: 1.4, 1.5 Jun 19, 2022
@bkamins
Copy link
Member

bkamins commented Jun 19, 2022

I mark this for 1.5 release milestone (maybe it will be OK to make it in 1.4.x patch release as it should be internal).
The reason is that 1.4 release will be made soon and I do not want this discussion to delay it.

@bkamins bkamins removed this from the 1.5 milestone Dec 25, 2022
@bkamins
Copy link
Member

bkamins commented Dec 25, 2022

@krynju - I am clearing the milestone from this PR. Given our earlier discussions - is this PR needed in the end?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ecosystem Issues in DataFrames.jl ecosystem
Projects
None yet
Development

No branches or pull requests

4 participants