Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discuss: LazyFrame.plot? #13339

Open
MarcoGorelli opened this issue Dec 31, 2023 · 8 comments
Open

Discuss: LazyFrame.plot? #13339

MarcoGorelli opened this issue Dec 31, 2023 · 8 comments
Labels
A-api Area: changes to the public API enhancement New feature or an improvement of an existing feature needs decision Awaiting decision by a maintainer python Related to Python Polars

Comments

@MarcoGorelli
Copy link
Collaborator

I'm just breaking this off from #13238

Here's the comments made so far:

@MarcoGorelli :

Doing .collect within a plotting function sounds potentially like a footgun, and a risk that people re-trigger the same expensive IO function over and over again? I'd suggest, at least to start, to only have .plot on DataFrame, so people are expected to collect before plotting

@MarcSkovMadsen

But if plotting requires running some operations for example if using 'by' argument, making boxplot, histogram or using datashader, would you not want the plotting library to control when things are collected?

(I'm not the expert)

@MarcoGorelli

I'd suggest moving incrementally:

  1. start with DataFrame.plot (based on comments/reactions so far, fairly uncontroversial?)
  2. then, discuss LazyFrame.plot later, as I think that's more complicated (see semi/anti join #5508 (comment) for an example of how it's possible to collect-shoot yourself in the foot). I'd like to see at least some discussion of collect-footguns in the context of plotting before adding LazyFrame.plot. But let's save that for a future PR / discussion if it's OK

@ritchie46

then, discuss LazyFrame.plot later

Yes.. that's a big no-no for me.

@jbednar

I don't have experience with using Polars myself, but .collect sounds like it might be conceptually similar to Dask .compute()? For Dask we keep it lazy all the way through to plotting so that tools like Datashader can do the actual rendering in a fully distributed fashion, with chunks of each plot rendered to an image or array in parallel and only the final rendered chunk needing to cross the wire out of each remote worker. Datashader doesn't currently support Polars, but I'd hope it will some day, and so it would be good to choose an approach here that anticipates having that work smoothly.

@MarcoGorelli

Adding DataFrame.plot now doesn't preclude adding some LazyFrame.plot solution later on, possibly with some safeguards (e.g. a "I know what I'm doing" kind of flag)

@hoxbro

I would make one suggestion of adding a plot accessor to the LazyFrame, with an error or warning which says plotting is not supported for LazyFrame, please call .collect() beforehand.


I did discuss on API kind of like

(
    pl.scan_parquet(file)
    .plot.line(x='x', y='y', by='by')
    .collect()
)

with Ritchie, so that predicate pushdown can happen and you only read from the parquet file the columns you need to make the plot.

There is a risk though that people will repeatedly make cells like this and so re-trigger the scan_parquet part multiple times, and plots in particular tend to incentivize interactive development.

But, for running jobs which create reports, or for plotting larger-than-memory datasets, this could unlock value.

One to think about. This would require some precursor work in hvPlot before it could be added to Polars

@stinodego stinodego added enhancement New feature or an improvement of an existing feature python Related to Python Polars labels Dec 31, 2023
@mcrumiller
Copy link
Contributor

I think this is not worth the effort. If someone wants to create a plot, just materialize and then plot, and it seems like it would be a lot of work for a probably not-common case.

@MarcSkovMadsen
Copy link

MarcSkovMadsen commented Jan 1, 2024

Remember that the plots are not static. They are interactive if you select Bokeh or Plotly backend. They might even contain widgets to filter the data shown. This enables browsing through data that are larger than can be shown in a browser or even can be hold in memory.

When you use .plot the plots are not materialised, they are just configurations/ definitions. As a user, you might continue changing the color, the kind of plot, filter to a subsection of data.

The plot is not materialized before its the output of a cell in a notebook, .save is run, .show is run etc.

I dont understand why its risky not to materialize explicitly before or after .plot. When .plot is run you are no longer in Polars land. You are in the land of the plot framework. Its should have the chance to control things.

@MarcoGorelli
Copy link
Collaborator Author

MarcoGorelli commented Jan 1, 2024

Thanks for explaining

What I'm concerned about is the risk that someone does

df = (
    pl.scan_parquet(...)
    .group_by(...)
    .#more expensive operations
)

and then, in the next cell of their notebook,

df.plot.line(x='x', y='y')

The plot is shown, and then they realise that, actually, they want to colour by plant species, and so they do

df.plot.line(x='x', y='y', by='by')

in the next cell.

If within .plot a collect was called, then like this, the user would be re-triggering the whole DAG, including scan_parquet, group_by, and the other expensive operations

Does this make sense / seem like a valid concern?
Or would there be a way for hvplot to change the plot to be coloured by plant species once the first plot has already been drawn, without having to re-trigger the whole DAG?

@mkleinbort
Copy link

Just to reply to:

What I'm concerned about is the risk that someone does...

This is a common situation, and I think we should trust users.

It is possible that any given user will make this mistake once or twice, but if the computation truly is slow, they will soon figure out they can first collect the frame and then tinker with their plot.

That said, I have no idea how the plotting frameworks could be enabled to use Polars' processing capabilities - this seems like a strong entangling of the libraries that would require updates with every API change, but I remain open minded.

@MarcoGorelli
Copy link
Collaborator Author

MarcoGorelli commented Jan 1, 2024

Thanks for your input!

I'd like to think that there's a solution somewhere in between

  • disallow LazyFrame.plot
  • call .collect implicitly for users

but don't yet know exactly what it would be. Happy to discuss more and have a think, maybe we could have a group call. It's good to take our time with this one

What impresses me about hvplot is that they currently do

self._data.select(columns).collect()

for their hvplot accessor - this is exactly the kind of code which allows for the predicate projection pushdown to happen (as opposed to collecting upfront and then selecting columns)

@philippjfr
Copy link

philippjfr commented Jan 5, 2024

for their hvplot accessor - this is exactly the kind of code which allows for the predicate pushdown to happen (as opposed to collecting upfront and then selecting columns)

In the long run I'd like to push things even further down, i.e. HoloViews should implement a full Polars interface, that keeps things lazy until the last possible moment. This would allow things like histogram calculations, bar plot aggregations, and ideally datashader support to never materialize the full data in memory.

As for what to do about people accidentally generating plots and are infeasible either because they trigger huge amounts of computation or a silly amount of categories, that's a problem we have to find a more general solution for at the hvPlot level. The unfortunate problem that we've always had is that if the data is huge then even a len(df) or a df[col].unique() call can significantly slow things down and always paying that cost is annoying. In the end it's something that we have to do though and we'll just have to very carefully consider where to add these checks.

@stinodego
Copy link
Member

stinodego commented Feb 8, 2024

I've thought about this a bit, and it's similar to #13928

Materializing might be unnecessarily expensive if we're just trying to show some aggregations in a plot.

With that in mind, adding LazyFrame.plot is probably a good idea, if lazy computation is actually supported by the plotting backend. If the backend just does .collect().plot(...), then I would be against adding this.

@stinodego stinodego added A-api Area: changes to the public API needs decision Awaiting decision by a maintainer labels Feb 8, 2024
@MarcoGorelli
Copy link
Collaborator Author

I think holoviz/holoviews#5939 maybe should be considered a necessary precursor. Because with that, the collect could be done a lot further down (rather than currently where it's done just before converting to pandas, so at best you save reading a few columns. Better than nothing, and nice that hvplot does this, but before being available in polars.LazyFrame itself, the lazy support should probably be complete)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-api Area: changes to the public API enhancement New feature or an improvement of an existing feature needs decision Awaiting decision by a maintainer python Related to Python Polars
Projects
None yet
Development

No branches or pull requests

6 participants