This issue was moved to a discussion.

Boxplot - wrong logic #12005

Closed
li-ana opened this issue Dec 10, 2020 · 7 comments
Labels
enhancement:request Enhancement request submitted by anyone from the community viz:charts:boxplot Related to the Boxplot chart viz:charts:echarts Related to Echarts

Comments

@li-ana

li-ana commented Dec 10, 2020

The box plot assumes that there is only one observation per timestamp. In my case this is not true, so the box plot aggregates all observations per timestamp using the selected function, which skews the results. It needs to look at the raw data instead of the aggregated values.

Query produced by the Superset box plot:
image
Results of this query:
image

Box plot details:
image

What it should be:
image
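
To make the reported skew concrete, here is a minimal sketch in plain Python. The data is made up, and the average is only assumed as the selected aggregation function; it is not taken from the screenshots above. It compares quartiles computed on the raw observations with quartiles computed on one aggregated value per timestamp.

```python
from statistics import mean, quantiles

# Hypothetical raw observations: several values per timestamp.
raw = [
    ("2020-12-01", 1), ("2020-12-01", 2), ("2020-12-01", 90),
    ("2020-12-02", 3), ("2020-12-02", 4), ("2020-12-02", 5),
    ("2020-12-03", 6), ("2020-12-03", 7), ("2020-12-03", 8),
]


def quartiles(values):
    """Return (Q1, median, Q3) using the inclusive quantile method."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q2, q3


# Box plot statistics on the raw values (what the issue asks for).
raw_box = quartiles([value for _, value in raw])

# Box plot statistics on one averaged value per timestamp
# (the aggregate-first behaviour described in the issue).
by_timestamp = {}
for timestamp, value in raw:
    by_timestamp.setdefault(timestamp, []).append(value)
agg_box = quartiles([mean(values) for values in by_timestamp.values()])

print("raw:       ", raw_box)
print("aggregated:", agg_box)
```

In this toy data the single large value on the first day barely moves the raw quartiles, but it dominates that day's average and so pulls the upper quartile of the aggregated box far upward.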

@junlincc junlincc added the viz:charts:echarts Related to Echarts label Dec 10, 2020
@villebro
Member

villebro commented Dec 11, 2020

While we don't yet support calculating the boxplot on the full raw data, we recently migrated Boxplot to ECharts and added some features in this PR: #11199. In the example below I've created a Boxplot where the categories are continents and the distribution is calculated across countries (I'm using the average of total population, as the dataset contains data for multiple years):

image

If you have a row id, you can use that as the "Distribute Across" parameter. The plan is to add support for using the raw row data, but I probably won't have time to work on it any time soon.

@junlincc junlincc added the enhancement:request Enhancement request submitted by anyone from the community label Dec 11, 2020
@rumbin
Contributor

rumbin commented Aug 9, 2021

The problem with calculating the Boxplot metrics on the query result, even when distributing across a unique column, is that the row limit hits hard and silently:
If the number of data points across all series exceeds the row limit, the resulting boxplot non-deterministically excludes data points without notifying the user.
It is non-deterministic because no ORDER BY is applied, nor is one configurable.

So we have three issues that are caused by the current Boxplot logic:

  1. There is no way to include all records once the number of rows exceeds the row limit.
  2. Whether the row limit has been reached is not displayed anywhere.
  3. The row limit excludes records in a non-deterministic fashion, as no explicit ordering is present.

In my eyes, all of these issues can best be covered by calculating all Boxplot metrics per series directly within the SQL query. The only drawback that I can immediately see is that outliers cannot be returned by such a query...
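
The non-determinism in item 3 can be illustrated with a small sketch. It does not use Superset or a database: the seeded shuffle is only a stand-in, assumed here, for a database returning rows in unspecified order when a LIMIT is applied without ORDER BY, and `truncated_median` is a hypothetical helper.

```python
import random
from statistics import median


def truncated_median(rows, row_limit, seed):
    """Median of the first `row_limit` rows after an arbitrary reordering.

    The shuffle simulates an unspecified row order; the slice simulates
    the row limit cutting off everything after `row_limit` rows.
    """
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    return median(shuffled[:row_limit])


rows = list(range(1, 101))   # hypothetical data points of one series
full_median = median(rows)   # median over all 100 rows

# The same data and the same limit, but different row orders.
results = {truncated_median(rows, row_limit=10, seed=s) for s in range(20)}

print("median over all rows:", full_median)
print("medians after truncation:", sorted(results))
```

Each run of the truncated calculation sees a different subset of rows, so the reported median changes even though the underlying data does not.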

@rumbin
Contributor

rumbin commented Aug 9, 2021

Not sure to what extent it would be useful to create new issues for the three items above...

@rumbin
Contributor

rumbin commented Oct 8, 2021

I filed a separate bug for it: #17042

@junlincc junlincc added the viz:charts:boxplot Related to the Boxplot chart label Oct 12, 2021
@junlincc
Member

@rumbin please feel free to open separate issues for all.
for 2. it's happening on all the charts, no?

@rumbin
Contributor

rumbin commented Oct 12, 2021

@junlincc
For 2.: Yes and no.
It is true that hitting the row limit is not shown on charts on a dashboard.
However, at least in Explore, the row count indicator pill turns red when the threshold has been reached.
This is normally the case, but not for the Box Plot.
That's what I have reported in #17042.

@rumbin
Contributor

rumbin commented Oct 12, 2021

I am going to write up a separate issue for 1. soon, which will elaborate on the pros and cons of calculating the box plot metrics in a push-down fashion directly in the database...
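
As a rough illustration of the push-down idea, the sketch below computes one five-number summary per series in plain Python. This is only a stand-in for what a database query would return; `five_number_summary` and the sample series are hypothetical, and the quantile method is an assumption.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return min, Q1, median, Q3 and max for one series."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": med,
        "q3": q3,
        "max": max(values),
    }


# Hypothetical raw data, keyed by series name.
series = {
    "a": [1, 2, 3, 4, 5],
    "b": [10, 20, 30, 40, 500],
}

# One summary row per series, regardless of how many raw rows exist.
summary = {name: five_number_summary(vals) for name, vals in series.items()}

for name, stats in summary.items():
    print(name, stats)
```

The result has one row per series, so its size does not grow with the number of raw rows. The trade-off mentioned earlier in the thread is visible too: the summary for series "b" carries the maximum of 500, but the individual outlier points are no longer available for plotting.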

@apache apache locked and limited conversation to collaborators Feb 2, 2022
@geido geido converted this issue into discussion #18423 Feb 2, 2022
