Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add explicit option to control how densities are resolved, change how densities are resolved by default #9172

Merged
merged 19 commits into from
Mar 26, 2024

Conversation

joelostblom
Copy link
Contributor

Will add more description soon, just seeing if tests pass first because my local installation doesn't work.

@joelostblom
Copy link
Contributor Author

joelostblom commented Nov 11, 2023

I got stuck trying to fix the issues with density area marks, so I thought I would summarize what I'm trying to do so that others can help:

I believe the recent issues with density marks stem from PR #9018, which made stacking the new default for area marks. Unfortunately stacked area marks also perform an imputation to zero by default as described here #5611 (comment) by @domoritz which leads to issues with stacked density as previously noted in #6624 (this became more obvious after the PR mentioned above since it made stacked areas the default). Thanks @mattijn for pointing out that this imputation was happening in vega/vega#3815.

The reason the imputation leads to jagged densities is because grouped densities value are defined only for data values existing within their own group. So when putting two grouped densities together on the same axis, but without exactly the same extent over which the density is defined will lead to a lot of imputation to zero and spiky appearance. In #9106 @jonmmease fixed this by guaranteeing the same extent for all individual grouped densities. However this introduced a new issue, which is that densities are no longer cut off at the min/max values of the data within each group, but extends to the min/max values of the data within all groups leading to the appearance of values where there actually are none as shown in vega/vega#3815.


For density area marks to work properly, we need these two things:

  1. Grouped density transforms should be limited to the extent of the data within each group to not avoid suggesting non-existing values.
    • This can be fixed if we go back to setting "resolve": "indepedent"" in the density transform. If we do that, we need area marks used for densities to not be imputed to zero by default. I think the current zero imputation is not ideal for non-density areas either, sine it gives the impression of non-existing values as can be seen in . Maybe the suggestion in Don't add impute for violin plot #5611 (comment) from @jheer to use Vega's xc channel instead of the stack could help?
    • Another way of fixing this would be to keep "resolve": "shared", but we trim the densities after. Note that we can't simply trim values where the density is 0, but need to trim all values outside the original data range for each group (because we need access to each group's min/max values, I believe these needs to happen inside the KDE transform as suggested in The KDE transform creates values where there are none when used with {"resolve": "shared"} vega#3815).
  2. By default area marks used for densities should not be stacked since it makes it harder to compare the distributions with each other.

My hunch would be that limiting the range of each grouped density in the KDE transform in Vega and rolling back #9018, would be the most straightforward way to fix this, but I'm very open to hearing other suggestions.

@joelostblom
Copy link
Contributor Author

Here is another example of how the current behavior leads to unexpected outcomes. Take this spec that creates two densities which seem to cover the same data range:

image
Open the Chart in the Vega Editor

If we add another chart to the layer, such as a mean line (but really any chart), suddenly it becomes apparent that the densities don't at all cover the same region and the one to the right now appears cut off for no reason since it looked like the extent was the same in the previous chart:

image
Open the Chart in the Vega Editor

The reason seems to be that the layering is moving the transforms around, but this is not easy to understand and the inconsistent behavior is quite confusing.

@joelostblom
Copy link
Contributor Author

Gentle ping on this @jheer, @domoritz, and @kanitw. What do you think is the best way forward to fix the density specs? The decision we take here will also have an impact on the violin plots we have been talking about.

My hunch would be that limiting the range of each grouped density in the KDE transform in Vega as suggested in vega/vega#3815, and rolling back #9018 would be the most straightforward way to fix this, but I'm very open to hearing other suggestions.

@domoritz domoritz requested a review from a team December 12, 2023 22:13
src/transform.ts Outdated Show resolved Hide resolved
@domoritz
Copy link
Member

domoritz commented Dec 14, 2023

If densities should be handled differently from other area marks when it comes to stacking, I think it would make sense to introduce enough knobs to support handling them differently and then to introduce a higher-level mark type (which could even create the density transform as well). I know @jheer was advocating for that for a long time 😅 so I'd say let's use this opportunity to do that.

Imputation by default has issues but I still think it's a reasonable choice (as discussed in #5611). Would introducing xc as a channel as an alternative to centering a stacked area mark help as well? It would make for a more canonical representation of densities (https://vega.github.io/vega/examples/violin-plot/), I think.

@yhoonkim and @kanitw since you introduced #9018, what are your thoughts on the issue/reverting?

At this point, I am a bit lost between the discussion here and slack. Would it make sense to have a sync meeting where we can agree on a plan with @kanitw @yhoonkim @joelostblom @jheer and @domoritz?

@domoritz domoritz added this to the Density Charts milestone Dec 18, 2023
@domoritz
Copy link
Member

Outcome from meeting: let's get this option added but not change the default for now.

@joelostblom
Copy link
Contributor Author

@domoritz I updated this as per our discussion to change the default back to "shared", but now it's tunable which gives us more flexibility for designing the density mark

Copy link
Member

@domoritz domoritz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's consider the default and make sure we test different cases where resolve is set and not set in unit tests.

@@ -15,7 +15,8 @@
"type": "kde",
"field": "IMDB Rating",
"bandwidth": 0.3,
"as": ["value", "density"]
"as": ["value", "density"],
"resolve": "shared"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This used to be independent. https://vega.github.io/vega/docs/transforms/kde/

Why should this be shared for non-stacked transforms? Maybe a reasonable default is to have it only shared if there is stacking?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I remember correctly from our discussion, we said it would be difficult to set the default value of the transform based on whether the encoding channel is stacked or not. I believe the main reason was that it would be hard to detect which encoding channel the density values are plotted on.

Leaving "shared" as the default would work in most situations, including most of the common ones, and if there are situations where the "independent" is needed, it would be easy to switch. To avoid that densities stack by default, we said we would turn off stacking as part of a mark_density, but leave the default stacking behavior for the area mark as it is now to be consistent with other marks.

as: ['A', 'density']
});
});

it('should add resolve shared if we group', () => {
it('should only add resolve "shared" if we set it explicitly', () => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To test this, we would need a case that doesn't set resolve when it's not set in the transform.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't resolve always be in the transform sine the parameter has a default value? I think this test case should just be switched to "independent" instead of "shared" to reflect the updates in e3f4505. I did that in my latest commit.

Copy link
Contributor Author

@joelostblom joelostblom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, I'm really sorry, I missed that you had left new comments a while ago @domoritz. I just responded to the two comments you made and pushed a new commit for the test, let me know what you think.

@@ -15,7 +15,8 @@
"type": "kde",
"field": "IMDB Rating",
"bandwidth": 0.3,
"as": ["value", "density"]
"as": ["value", "density"],
"resolve": "shared"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I remember correctly from our discussion, we said it would be difficult to set the default value of the transform based on whether the encoding channel is stacked or not. I believe the main reason was that it would be hard to detect which encoding channel the density values are plotted on.

Leaving "shared" as the default would work in most situations, including most of the common ones, and if there are situations where the "independent" is needed, it would be easy to switch. To avoid that densities stack by default, we said we would turn off stacking as part of a mark_density, but leave the default stacking behavior for the area mark as it is now to be consistent with other marks.

as: ['A', 'density']
});
});

it('should add resolve shared if we group', () => {
it('should only add resolve "shared" if we set it explicitly', () => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't resolve always be in the transform sine the parameter has a default value? I think this test case should just be switched to "independent" instead of "shared" to reflect the updates in e3f4505. I did that in my latest commit.

@domoritz
Copy link
Member

Thank. I will take a look. Can you merge the latest main and make sure that the CI runs?

@joelostblom
Copy link
Contributor Author

@domoritz Thanks! Merge complete and CI ran successfully.

@domoritz domoritz changed the title fix: add explicit option to control how densities are resolved feat: add explicit option to control how densities are resolved, change how densities are resolved by default Mar 26, 2024
@domoritz domoritz enabled auto-merge (squash) March 26, 2024 21:04
@domoritz domoritz merged commit bf0b8d3 into main Mar 26, 2024
12 checks passed
@domoritz domoritz deleted the fix-shared-densities-and-extent branch March 26, 2024 21:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants