Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port to NullableArrays and CategoricalArrays #1008

Merged
merged 43 commits into from
Sep 26, 2016
Merged

Port to NullableArrays and CategoricalArrays #1008

merged 43 commits into from
Sep 26, 2016

Conversation

nalimilan
Copy link
Member

@nalimilan nalimilan commented Jul 3, 2016

Completely remove support for DataArrays.

Still rough, but passes the tests on Julia 0.5.

@datnamer
Copy link

datnamer commented Jul 3, 2016

Does this improve speed and type stability?

@nalimilan
Copy link
Member Author

Type-stability: yes. Speed: haven't measured anything yet, but it should.

@ViralBShah
Copy link
Contributor

I did write a very simple sum benchmark using NullableArrays and DataArrays yesterday. NullableArrays was within 2x of Arrays, and 10x faster than DataArrays. I would expect this to carry over, but of course, much better and systematic benchmarking is necessary.

@quinnj
Copy link
Member

quinnj commented Jul 4, 2016

Word. I'm pumped for this. I'll try to find some time in the next few days to pull the branch and play around.

function DataArrays.PooledDataVecs(df1::AbstractDataFrame,
df2::AbstractDataFrame)
# FIXME: rename this, and possibly move to CategoricalArrays.jl
function PooledDataVecs{S,N}(v1::NullableCategoricalArray{S,N},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think it would be good to rename this as it's clearly a reference to PooledDataArrays. The operation here is a join!/union!/merge!, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm not sure about the best name. Maybe join!. Ideally, we would also avoid accessing the refs field from the PooledDataVecs(::AbstractDataFrame, ::AbstractDataFrame) method below. But I'm even less sure what the API should look like in CategoricalArrays.jl. Could still improve this in steps.

@dmbates
Copy link
Contributor

dmbates commented Jul 15, 2016

With the RDA stuff removed you can probably avoid using GZip in src/DataFrames.jl. I included that specifically for reading compressed .rda files.

@ViralBShah
Copy link
Contributor

What holds up the merging here? If a few things need to be moved out, that can be handled in a different PR. I say we merge this if basically this does the right things.

@ararslan
Copy link
Member

Milan is on vacation, which I assume is at least part of the holdup.

@ViralBShah
Copy link
Contributor

I missed that. Thanks.

@dmbates
Copy link
Contributor

dmbates commented Jul 28, 2016

I find myself stumbling over DataArrays in doing some analyses and was wondering how this branch is coming along. Are there things that others can do to help?

@nalimilan
Copy link
Member Author

This branch needs some cleanup before merging, but I think we should also wait until we provide convenience APIs for lifting semantics. Without these, the port will be a major regression in terms of usability. I would certainly be useful if people could experiment in that direction in particular in DataFramesMeta, jplyr, NullableArrays and NullableUtils, as @davidagold is already doing.

@dmbates
Copy link
Contributor

dmbates commented Jul 29, 2016

@nalimilan I have been working on an analysis in Julia of some benchmarking results, which is why I found myself tripping over DataArrays. The analysis is in TimingResults As you can see I used DataFramesMeta in addition to DataFrames. I started using Gadfly for plots but still find it difficult to implement the plot I am thinking of. I ended up reverting to lattice graphics accessed through RCall, because I know lattice graphics quite well.

My conclusion from this exercise is that the single most effective change we can make regarding DataFrames is not to create a NullableArray or DataArray unless it is necessary. It is fine to consider lifting semantics, etc. but, when there are no missing data in a column, creating a missing-data-enabled type then sweating over how to handle missing data values that are never present is what J. Edwards Deming called "burning the toast and then scraping it". Why go to all that trouble and, in the process, slow down the computation if you don't need to? Functions such as readtable should allow for missing data, of course, but should return Arrays when there are no missing data values in a column. Reading a feather file using the Feather package should do the same thing. RCall conversions of R data.frames should do this too. I would go one step further and suggest adding a function (I haven't decided on a good name for it) that takes a DataFrame and converts any NullableArray{T} columns to Array{T} if there are no nulls. Same for NullableNominalArray, etc. The completecases and completecases! functions should also convert nullable structures to plain arrays.

By the way, @nalimilan, I applaud your decision to create separate types NominalArray and NullableNominalArray, etc. in the CategoricalArrays package. This is exactly the distinction we should make.

As an exercise, consider the proportion of columns with missing data in the data sets you encounter. For me, at least, columns with missing data are the exception rather than the rule.

Regarding DataFramesMeta I took a while to learn it and now find it quite useful. Is there a reason to also mention jplyr? Anyone is free to create their own package - this is open source software - but the JuliaStats group should try to settle on one approach to a given problem, not multiple incompatible approaches. (The selection of verbs in DataFramesMeta follows the LINQ/SQL tradition whereas jplyr follows dplyr). Unless there is something wonderful in jplyr that cannot be implemented in DataFramesMeta the approach should be to improve DataFramesMeta, not create another incompatible package. As an aside, @davidagold , a name like jplyr, while informative to those who know dplyr, removes flexibility. You are bound by the name to emulate the approach of dplyr, whether it fits in well with Julia or not, and you are always playing catch up in the implementation.

So, what am I missing? Should I be creating my plots using the Plots package or ...?

@johnmyleswhite
Copy link
Contributor

@dmbates: I think the problem is that DataFramesMeta doesn't fully solve the problems you're describing as the moment, but somewhat exacerbates them: its existing commitment to non-lazy DAG accumulation means that we can't use it to generate SQL -- which is the main example we have of backend-independence. We could change those semantics in that package, but it would completely break all existing code using that package.

David and I have begun preparing a document we'll share soon about the semantics we hope all tables will agree upon. Assuming we can actually get everyone on board, we could fix the situation by ensuring that every downstream package works only with a standardized abstract interface that multiple backends could implement. Part of the purpose of this document is to point out that we can implement the dplyr semantics with very high performance and fully automatic lifting regardless of historical Julia backends: we could make the current DataFrames fast or make the new version with NullableArrays fast. The only thing we really need is to maintain the tradition of always keeping the values array and is-null mask array separate. In cases where we intend to keep a non-nullable column in a table, I think we're better off not using Array objects, but instead adding boolean flags to every NullableArray that specify allownulls and anynull. That would simplify implementations, which will need to have access to the internal representation of columns. We can always work around that, but I don't think it will buy us much.

As for Gadfly, I think the best thing we could do there is remove the canonical interface and focus on the heretical one, which would assume that inputs are specified as AbstractArray objects. Then the tabular data interface issue just becomes one of ensuring that a table can always provide access to its columns as AbstractArray objects.

@tshort
Copy link
Contributor

tshort commented Jul 29, 2016

I agree with @dmbates on preferring Arrays where possible. Towards that end, I included a PassThrough type with abbreviation P in DataFramesMeta to try to avoid creation of DataArrays where they aren't needed.

With regards to jplyr, it's quite new and promising. I wouldn't want to discourage development or exploration. It's approach has advantages over DataFramesMeta. DataFramesMeta was always meant to be a playground.

@dmbates
Copy link
Contributor

dmbates commented Jul 29, 2016

@nalimilan Thanks for the clarification on how DataFramesMeta can be improved and is being improved in jplyr. I appreciate the comments by @tshort too.

My comments on the name jplyr still hold. It is a name that will be recognized by those with dplyr experience but it is not informative to others. (plyr is not meaningful to those outside the HadleyVerse.) Furthermore, it seems to claim that jplyr is dplyr in Julia - nothing more, nothing less. That's a difficult expectation to fulfill and also an unnecessary constraint. Please consider renaming it.

I still think it is important to prefer Arrays when possible. I will read your document with interest when it is available but also, I'm sorry to say, with a bit of skepticism. There is no free lunch. Saying that all columns of a DataFrame must be of a missing-value-enabled type but perhaps with an extra tag that indicates they don't really have any missing data means that they are no longer arrays and you are back at the lifting semantics. To me that still seems like burning the toast and scraping it. And, it seems to me, it is unnecessary. Why take an Array and make it into something that is no longer an Array and must be massaged to get it back to being an Array so you can operate on it?

I am not infallible by any means and the whole Julia project has already pulled a very big rabbit out of the hat by creating a language with dynamic types where methods can be compiled to machine code. So I could be pleasantly surprised and I hope I will be. But if your objective is to create fully automatic lifting that is as fast as not doing any lifting at all, that seems to be a big ask.

@davidagold
Copy link

@dmbates The name alludes to Hadley's notion of dplyr as a "grammar of data manipulation", rather than a specific design spec. It's just a working project title -- I intend to change it. Does anybody have any suggestions? One thought I've had is Select.jl.

@dmbates
Copy link
Contributor

dmbates commented Jul 29, 2016

@davidagold Before there was dplyr there was plyr, which was a generalization of the various *apply functions in R and a play on the word "plyer". But functions like lapply and friends are called map in other languages, including Julia.

I regret that I don't have a good suggestion for a name at this point but I will certainly speak up if something comes to mind.

@dmbates
Copy link
Contributor

dmbates commented Jul 29, 2016

I just noticed that I replied to a comments by @johnmyleswhite as if they were by @nalimilan. Sorry for the confusion.

I must admit to being discouraged that, after over 4 years of development, DataFrames still has some pretty glaring problems. I will be giving a presentation at the Joint Statistical Meetings on Sunday and, for all the good aspects of Julia, I can't yet tell the audience "yes, you really should be using Julia for data analysis". Maybe the grand vision thing will work out. Maybe it will be another case of "the best is the enemy of the good". I don't think that potential users will wait forever to find out.

@davidanthoff
Copy link
Contributor

I now have a branch in Query.jl that works with the current DataFrames master, i.e. with the NullableArrays version. See queryverse/Query.jl#62.

@nalimilan
Copy link
Member Author

@davidanthoff How does Query.jl work as regards Nullable lifting? Can you e.g. compute a mean of a NullableArray conveniently? Select observations based on conditions on NullableArrays?

@nalimilan
Copy link
Member Author

BTW, there's now a blog post on StructuredQueries.jlg, but it's more about the design than a simple introduction for users. We also need to address the issue of lifting to offer a decent replacement for the current DataFrames features.

@davidagold
Copy link

StructuredQueries (SQ) occupies a strange position in that it provides a user-facing interface but no backend support. The former is intended to be extended by packages that define a particular semantics. I don't intend for users to do using StructuredQueries. I expect for them to do, say using X, where X reexports SQ and implements a backend support ("collection machinery" in my lingo). I'm still working out how this will look in practice. I think if users want a ready-to-go data manipulation framework right now, they're better off using DataFramesMeta or Query.

@johnmyleswhite
Copy link
Contributor

Can we plan out a timeline for tagging this release? Given the roughness of our API offering, can we hold off for, say, 2-3 months? I think this is clearly the future, but I think we want to make sure that we're ready to tell everyone to query Query.jl or DataFramesMeta.jl or StructuredQueries.jl.

@davidanthoff
Copy link
Contributor

@nalimilan I don't want Query to provide any special lifting support, it is meant to just pick up whatever there is in base. Right now I define a whole bunch of methods that provide lifted implementations for things like + etc., but I want to remove those in the future once we have that in base.

@tkelman
Copy link
Contributor

tkelman commented Oct 6, 2016

Which is known as type piracy and changes the behavior of unrelated code. Packages should not be doing that.

@davidanthoff
Copy link
Contributor

@tkelman Totally agreed, that stuff should all be moved to base. Both NullableArrays and Query are guilty of this right now, as far as I can tell.

@nalimilan
Copy link
Member Author

Yeah, we need to move these into Base, and have a package for 0.4 and 0.5 which only provides lifted operators, so that it's clear to the user what happens.

But without automatic lifting for other functions, I'm afraid working with Query.jl is going to be quite painful. Time will tell...

@davidanthoff
Copy link
Contributor

Yeah, we need to move these into Base, and have a package for 0.4 and 0.5 which only provides lifted operators, so that it's clear to the user what happens.

Could we just back port these things to 0.5, and just not support this stuff on 0.4?

@tkelman
Copy link
Contributor

tkelman commented Oct 7, 2016

We don't backport behavior changes, only bug fixes.

@johnmyleswhite
Copy link
Contributor

I entirely agree that we should not bother supporting 0.4.

@ararslan
Copy link
Member

ararslan commented Oct 7, 2016

DataFrames master doesn't, so we can't anyway. ¯_(ツ)_/¯

@davidanthoff
Copy link
Contributor

Who is going to create the package that hosts the nullable methods for julia 0.5? Is it going to be hosted under JuliaStats?

Probably easier if we start with adding these methods in the package, and then once that is done we can move them all into julia base master in one go, right?

Would be great if someone could create the package so that we can start contributing, I'd be eager to get this moving.

@ararslan
Copy link
Member

ararslan commented Oct 7, 2016

As Tony said, packages shouldn't be engaging in type piracy, so perhaps we should just starting making base PRs?

@davidanthoff
Copy link
Contributor

@ararslan I believe there was agreement (including from @tkelman) that in this case, for julia 0.5 compat only, there would be one package that would have lifted methods and engage in type piracy. That package would only be used on julia 0.5, and the same methods would then be included in julia base for 0.6.

Essentially that package would just get all the type piracy that is currently going on in NullableArrays and in Query.

If we don't do that, I really don't see how we can possibly tag the new DataFrames before julia 0.6 is released.

@ararslan
Copy link
Member

ararslan commented Oct 7, 2016

Oh sorry, I didn't realize that such an agreement had been reached.

Looks like this is what we're after, then: https://github.com/davidagold/NullableOps.jl

@nalimilan
Copy link
Member Author

nalimilan commented Oct 7, 2016

Probably easier if we start with adding these methods in the package, and then once that is done we can move them all into julia base master in one go, right?

AFAIK everything is done since I filed JuliaLang/julia#16988. I just need to open a new PR with non-boolean operators and see how the discussion goes.

@davidagold
Copy link

@nalimilan I don't think that's the PR you mean. Where's your PR that introduces lifted operators?

@nalimilan
Copy link
Member Author

Indeed, I've fixed the link!

@davidanthoff
Copy link
Contributor

I'm currently working on a macro that might make adding these lifted methods much easier. The general idea is this:

@lift function log(x)

would for example add a lifted method. The whole thing would work much more general:

@lift function +(x::Number, y::Number)
@lift function +{T<:Number}(x::T,y::T)

etc. should all work.

This macro could be used to then implement another macro called @lifted that could be used like the @generated macro to easily add lifted versions of a function at the function def site:

@lifted foo(x::Float64)
    return blabla
end

would automatically add lifted methods to foo.

It is not done, I'm just mentioning this here in case someone else is working on a similar thing, then we should merge effort.

@tkelman
Copy link
Contributor

tkelman commented Oct 12, 2016

Doing lifting through selective manual method extension seems like you're repeating the dot-operator mistakes (and @vectorize_1arg as well). It's also fragile due to the global side-effects of modifying method tables. The approach taken in JuliaLang/julia#18758 seems safer, more general, and will have more localized effects.

nalimilan pushed a commit that referenced this pull request Jul 8, 2017
Port to NullableArrays and CategoricalArrays
@Wilfred
Copy link

Wilfred commented Oct 19, 2017

This was merged in September 2016, but as far as I can see version 0.10.1 (August 2017) doesn't have this. Is there a plan for releasing?

@nalimilan
Copy link
Member Author

This PR has actually been merged, then moved to DataTables, and then replaced with a port to Nulls.jl before being "backported" to DataFrames. See this Discourse post for the full story. We're currently preparing a release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.