
Creating Time-series Line Chart with high cardinality always times out #23464

Closed
Usiel opened this issue Mar 23, 2023 · 1 comment
Labels
#bug Bug report

Comments

@Usiel
Contributor

Usiel commented Mar 23, 2023

Creating Time-series Line Chart with high cardinality always times out due to inefficiencies in the pandas_postprocessing.pivot module.

The example below may seem slightly contrived, but I think it's likely that a Superset user will come across this at some point: they want to build a time-series chart and inadvertently create high-cardinality groupings without setting a series limit. Currently, they will be confronted with a timeout and be none the wiser. With a minor optimization we can instead show them the data they requested, and they can make a decision from there.

A better solution than a simple performance fix would, imo, be for Superset to make a decision and apply a series limit for the user, but I figure that would be more of a feature request :)
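To illustrate what "high cardinality" means here, the following is a small self-contained pandas sketch (not Superset code; column names and sizes are made up): grouping a time series by several dimensions means the pivot step produces one column per observed dimension combination, so the column count multiplies quickly.

```python
import pandas as pd

# Synthetic data: 1000 hourly timestamps, each observed for 3 of the
# possible (first, last) dimension combinations.
n = 3000
df = pd.DataFrame({
    "ts": pd.date_range("2023-01-01", periods=1000, freq="h").repeat(3),
    "first": [f"first_{i % 50}" for i in range(n)],
    "last": [f"last_{i % 60}" for i in range(n)],
    "metric": range(n),
})

# Pivoting on both dimensions yields one column per observed
# (first, last) pair -- the column count grows multiplicatively
# with each added dimension.
pivoted = df.pivot_table(
    index="ts", columns=["first", "last"], values="metric", aggfunc="sum"
)
print(pivoted.shape)
```

With real dimensions like `contact_first_name`, `contact_last_name`, and `phone`, the resulting frame can easily reach millions of columns, which is where the postprocessing cost explodes.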

How to reproduce the bug

  1. Explore the example dataset cleaned_sales_data
  2. Add multiple dimensions (e.g. contact_first_name, contact_last_name, phone)
  3. Add any metric
  4. Select the Time-series Line Chart
  5. Click on "Update Chart"

Expected results

Chart should load within a few seconds

Actual results

Chart will time out or take a very long time

Screenshots

(screenshot attached in the original issue)

Environment

(please complete the following information):

  • superset version: 2.0.1, 2.1.0rc3 and latest master@7ef06b0a6
  • python version: 3.8.13

Checklist

Make sure to follow these steps before submitting your issue - thank you!

  • I have reproduced the issue with at least the latest released version of superset.
  • I have checked the issue tracker for the same issue and I haven't found one similar.

Additional context

I will open a PR shortly and link to this issue.

Usiel added the #bug Bug report label Mar 23, 2023
Usiel added a commit to Usiel/superset that referenced this issue Mar 23, 2023
Executing a pivot with `drop_missing_columns=False` and lots of resulting columns can increase the postprocessing time by seconds or even minutes for large datasets.
The main culprit is the `df.drop(...)` operation in the for loop. We can refactor this slightly, without any change to the results, and push the postprocessing time
down to seconds instead of minutes for large datasets (millions of columns).

Fixes apache#23464
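The refactoring described above can be sketched generically (hypothetical helper names, not the actual Superset patch): dropping columns one at a time inside a loop copies the frame on every iteration, while a single `drop` call with the collected labels copies it once.

```python
import pandas as pd

def drop_unwanted_slow(df: pd.DataFrame, unwanted) -> pd.DataFrame:
    # Anti-pattern: each df.drop() returns a full copy of the frame,
    # so total cost is O(len(unwanted) * frame size).
    for col in unwanted:
        df = df.drop(col, axis=1)
    return df

def drop_unwanted_fast(df: pd.DataFrame, unwanted) -> pd.DataFrame:
    # One drop call, one copy, regardless of how many labels go.
    return df.drop(list(unwanted), axis=1)

# Both produce identical results on a small frame.
demo = pd.DataFrame({c: [0] for c in "abcdef"})
slow = drop_unwanted_slow(demo, ["b", "d"])
fast = drop_unwanted_fast(demo, ["b", "d"])
```

For a frame with millions of columns, the repeated-copy pattern dominates the postprocessing time, which matches the seconds-versus-minutes difference the commit message reports.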
@rusackas
Member

This is likely fixed by now, and is pretty out of date if not. If people are still encountering this in current versions (3.x) please open a new Issue or a PR to address the problem.
