Change Tables.rows implementation to use eachrow #2051
Conversation
Instead of converting to a NamedTuple of Vectors (via `columntable`). The conversion to NamedTuple is problematic in the case of extremely wide DataFrames, and as such is the root cause of JuliaData/CSV.jl#538. The reason NamedTuple was used in the first place, however, was a desire in certain querying contexts to get a type-stable row iterator (cc: @davidanthoff).

While this change generally follows the DataFrames.jl approach of avoiding extreme compiler pressure, it does incur just over a 2x performance penalty on a simple manipulation like:

```julia
df |> @Map({x=_.id + 1, y=_.salary * 2.0}) |> DataFrame
```

Perhaps it's easy enough for users who need that extra speed to just call `columntable(df)` before manipulating, when they know their DataFrame has a reasonable number of columns, but it might also be possible to switch between typed/untyped representations automatically, depending on the number of columns and pre-selected thresholds.
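The tradeoff described above can be sketched as follows (column names are illustrative, not taken from any benchmark in this thread):

```julia
using DataFrames, Tables

df = DataFrame(id = 1:3, salary = [100.0, 200.0, 300.0])

# Type-unstable path (this PR): rows are DataFrameRow views, so the
# element type does not encode the concrete column types.
for row in eachrow(df)
    row.id + 1          # field access dispatches dynamically
end

# Type-stable path: materialize a NamedTuple of Vectors first;
# iterating its rows yields concretely typed NamedTuples.
ct = Tables.columntable(df)
for row in Tables.rows(ct)
    row.id + 1          # statically typed access
end
```

The second form is what `columntable(df)` buys you, at the cost of compiling a NamedTuple type whose width equals the number of columns.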
@nalimilan - do we need to make a similar decision for

Given that
Maybe one way to handle this is to replace

```julia
IteratorInterfaceExtensions.getiterator(df::AbstractDataFrame) = Tables.datavaluerows(df)
```

with

```julia
IteratorInterfaceExtensions.getiterator(df::AbstractDataFrame) =
    Tables.datavaluerows(columntable(df))
```

I'm not sure about the performance implications of that; I don't really understand the whole fallback story in Tables and what calls what, or whether this would introduce some additional indirection. But in general it should be possible to handle the queryverse/tabletraits story independently of this change, given that we have the separate interface in TableTraits for the Query.jl story. On a more general level, is a non-type-stable row iterator useful for anything but very small toy examples? Doesn't this change essentially mean that any client who currently uses the
This is the performance of the type-unstable API from a fresh Julia session:
Maybe 8GB is not huge, but I would say this is roughly the typical maximum a user can ask for. Of course it is roughly 200x slower than what would be possible if we ran the same operation in a type-stable way, but my point is that 0.03 seconds is not bad for a normal scenario (i.e. if you are not doing this operation inside a hot loop). Can you please try running the same example using your pipeline and report the first three runtimes as above, so that we can see the difference? My preference would be to have both options (type stable and type unstable). In particular for
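A minimal way to reproduce this kind of fresh-session measurement (sizes and column names are illustrative; this is not the exact benchmark referenced above):

```julia
using DataFrames

# Moderately large frame; the comment above describes data up to ~8GB.
df = DataFrame(rand(10^6, 10), :auto)

# Type-unstable row iteration: each field access dispatches dynamically.
function rowsum(df)
    s = 0.0
    for row in eachrow(df)
        s += row.x1
    end
    return s
end

@time rowsum(df)   # first run includes compilation
@time rowsum(df)   # subsequent runs show steady-state cost
@time rowsum(df)
```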
I'd say your proposal is fine @davidanthoff, at least if what you need is a type-stable iterator for Query.
My point is that with a type-unstable struct like
I guess I was just worried that currently people might have code that uses
Let's make sure we're using clear language here; there is absolutely nothing breaking here. A 2x performance slowdown is not breaking, and as @bkamins pointed out, probably wouldn't even be noticed in a wide variety of use-cases. And let's not forget the original issue here.

Now, I do think we should probably go with the proposal of defining:

```julia
IteratorInterfaceExtensions.getiterator(df::AbstractDataFrame) =
    Tables.datavaluerows(columntable(df))
```

** But it still makes me a little uncomfortable knowing there's a code path that could bite people who are using DataFrames as the stable, mature table format. I know that 90% or even 95% of use-cases will never run into compilation issues with very wide datasets, but it does raise the question of Query's suitability for these very wide scenarios; I mean, it effectively can't be used if you're dealing with a really wide dataset, right? Are there any plans to try to support that?

My thoughts and feelings have definitely evolved over the last year or two on how critical type stability really is, after running into lots of production issues, issues with compilation costs in general, etc. I just think there are smarter ways to leverage Julia's power and flexibility by picking and choosing your "type battles"; there are a lot of cases, after going back and reviewing, where I've realized that the "type stability overhead" was definitely not worth it compared with the cost of a simpler approach, with some extra code or API that introduces type stability more strategically or automatically when it would be really useful. Anyway, sorry for waxing a tad too theoretical, but I wanted to make sure my thoughts on the "dangers" of type stability were thoroughly on the record.

** As a separate thought, it might be nice if we could just define it as
Yes, "breaking" is probably not the right word. I was really just referring to @bkamins' statement that things would be 200x slower and a problem if this happened in a hot loop. For folks that use things in that way, this PR seems to have the potential to be a major performance regression. Not sure how important that is. Query.jl really doesn't work with wide data at all; at this point I think one can realistically only use it with tidy data or something like that. My plan to fix this is to create an alternative backend for sources that store things in columnar format, and add a full column-oriented processing and whole-query optimization story (think MonetDB/X100). The hooks to do something like that are all there (this was the plan from the beginning of the whole project), but implementing it is a major undertaking, so that is more of a multi-year project. I'll probably try to get some research funding down the road for this so that I can hire some folks to work on it; I think there are enough novel issues in this that it can count as research.
I'm not entirely sure I understand this. Do you mean that it would be nice if
As I stated in the original post, for the specific workflows of doing various sequences of Query.jl operations, in my benchmarks, the performance hit was 2x. I'm not exactly sure what @bkamins was referring to with the 200x statement as I didn't seem to see a comparison of type-stable vs. not. In my benchmarks, I was directly comparing current release code vs. this PR.
No; as you probably could guess, there isn't a way for
I remember discussing some future Query stuff like this, but I'm not sure I quite make the link with that enabling the processing of wide datasets any better. Currently, the interface requires iterating strongly typed NamedTuples, right? Which I think already puts really wide datasets into a bit of a trap. Would the alternative backend use some other kind of mechanism to avoid materializing the NamedTuples or even their types? I know there's some scaffolding in TableTraits for column processing, but I think I remember it still required returning a NamedTuple of vectors as columns, which is still problematic. Anyway, probably a little off-topic at this point, so I'll make the changes for Query.jl here and I think it should be ready to go.
Just to be clear, here is a simple comparison. x200 is the worst ratio I could get. Here is a "normal" situation:
My x200 statement was about an optimal way to do a certain operation, which in this case roughly is:
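The type-stable comparison point was elided above; a plausible stand-in (purely illustrative, not necessarily the exact operation that was timed) contrasts row iteration with a whole-column reduction:

```julia
using DataFrames

df = DataFrame(x = rand(10^7))

# Row-wise, type-unstable: each DataFrameRow field access
# dispatches dynamically.
unstable(df) = sum(row.x for row in eachrow(df))

# Column-wise, type-stable: operates directly on the concretely
# typed Vector{Float64}.
stable(df) = sum(df.x)

@time unstable(df)
@time stable(df)   # typically far faster after compilation
```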
src/other/tables.jl (outdated)

```julia
Tables.schema(df::AbstractDataFrame) = Tables.Schema(names(df), eltype.(eachcol(df)))
Tables.schema(df::DataFrameRows) = Tables.schema(getfield(df, :df))
```
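A test for these definitions might look roughly like the following sketch (hypothetical column set; field names follow the public `Tables.Schema` layout):

```julia
using DataFrames, Tables, Test

df = DataFrame(a = [1, 2], b = ["x", "y"])

# The DataFrameRows method should delegate to the parent frame's schema.
sch = Tables.schema(eachrow(df))
@test sch.names == (:a, :b)
@test sch.types == (Int, String)
@test sch == Tables.schema(df)
```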
Maybe this needs a new test?
Agreed - it is probably clear what it will produce, but a test will give us an error if in the future someone makes a breaking change in this part of the source code.
I went ahead and merged @tkf's PRs that already added tests for `Tables.schema(df::DataFrameRows)`, so we should be good to go here now.
You mean #2054? I don't see how it tests `schema`.
No, #2055
Ah, OK. TBH I'd have preferred if you had waited a bit more until we had a chance to comment on the final design...
Oops. I didn't realize there was more to discuss or pending questions/concerns. Is there something specific you're still wondering about?
Actually, I have: #2055 (comment). Sorry, it turned out I didn't understand what I wanted...
I am OK to merge it. If someone wants type stability it is easy enough to call `Tables.columntable` on a data frame and later work with that.

CI fails for the reason that
```diff
@@ -70,7 +70,7 @@ Base.propertynames(d::DuplicateNamesColumnTable) = (:a, :a, :b)
 @test @inferred(Tables.materializer(table)(Tables.columns(table))) isa typeof(table)

 row = first(Tables.rows(table))
-@test propertynames(row) == (:a, :b)
+@test collect(propertynames(row)) == [:a, :b]
```
E.g. this is the point to discuss that I have raised. Now, depending on whether you pass e.g. `df` or `eachrow(df)` to `Tables.rows`, you get a different type of row and, in consequence, a different type of `propertynames` return value. Is this OK to have?
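The divergence being discussed can be sketched like this (a small illustration; the exact return types depend on the DataFrames.jl version and on whether #2056 is applied):

```julia
using DataFrames, Tables

df = DataFrame(a = [1], b = [2])

# Rows of a column table are NamedTuples, where propertynames
# returns a Tuple of Symbols, e.g. (:a, :b).
nt_row = first(Tables.rows(Tables.columntable(df)))
propertynames(nt_row)

# Rows of eachrow(df) are DataFrameRow views, where propertynames
# has returned a Vector{Symbol}; #2056 proposed unifying this to a Tuple.
dfr = first(eachrow(df))
propertynames(dfr)
```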
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I have made #2056 to make sure we do not have to change this line, as `propertynames` now always returns a `Tuple`.