Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add optimized simultaneous ECDF bands #2368

Merged
merged 17 commits into from
Aug 28, 2024
Merged

Add optimized simultaneous ECDF bands #2368

merged 17 commits into from
Aug 28, 2024

Conversation

sethaxen
Copy link
Member

@sethaxen sethaxen commented Aug 19, 2024

Description

This pull request adds the optimization method from https://doi.org/10.1007/s11222-022-10090-6 for computing simultaneous ECDF bands and switches to using it as the default approach uses it as the default approach sometimes based on a heuristic.

Part of #2309. The changes are non-breaking, since the docstring of plot_ecdf intentionally does not document which method is the default.

Example

import numpy as np
import scipy.stats

import matplotlib.pyplot as plt
import arviz as az

np.random.seed(0)

dist = scipy.stats.expon()
x = dist.rvs(1000)

eval_points = np.linspace(0, 10, 100)

fig, axs = plt.subplots(2, 1, figsize=(8, 5))
%time az.plot_ecdf(x, ax=axs[0], eval_points=eval_points, confidence_bands=True, cdf=dist.cdf, difference=True)
%time az.plot_ecdf(x, ax=axs[1], eval_points=eval_points, confidence_bands='simulated', cdf=dist.cdf, difference=True)
axs[0].set_title('Optimized confidence bands')
axs[1].set_title('Simulated confidence bands')
plt.tight_layout()
CPU times: user 261 ms, sys: 116 μs, total: 261 ms
Wall time: 261 ms
CPU times: user 96.2 ms, sys: 32 μs, total: 96.3 ms
Wall time: 96.3 ms

tmp

Note that while optimized confidence bands are more stable than simulated, with this implementation they are not necessarily faster. Perhaps that can be improved using Numba.

Checklist

  • Does the PR follow official
    PR format?
  • Has included a sample plot to visually illustrate the changes? (only for plot-related functions)
  • Is the new feature properly documented with an example?
  • Does the PR include new or updated tests to cover the new feature (using pytest fixture pattern)?
  • Is the code style correct (follows pylint and black guidelines)?
  • Is the new feature listed in the New features
    section of the changelog?

📚 Documentation preview 📚: https://arviz--2368.org.readthedocs.build/en/2368/

Copy link

codecov bot commented Aug 19, 2024

Codecov Report

Attention: Patch coverage is 82.27848% with 14 lines in your changes missing coverage. Please review.

Project coverage is 86.99%. Comparing base (e6b5e2b) to head (054efa6).
Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
arviz/stats/ecdf_utils.py 80.28% 14 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2368      +/-   ##
==========================================
- Coverage   87.01%   86.99%   -0.03%     
==========================================
  Files         124      124              
  Lines       12788    12862      +74     
==========================================
+ Hits        11128    11189      +61     
- Misses       1660     1673      +13     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

return prob_right


def _ecdf_band_optimization_objective(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should ideally be jitted with numba if available for a likely significant speed-up. Currently the problem is that the functions it calls use scipy.stats.binom.interval and scipy.stats.binom.pmf, which are not JIT-able. If we reimplement these functions here, then we could probably get a significant speed-up.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interval calls ppf, which defers to Boost's implementation, which in turn uses TOMS Algorithm 748 to find a root in an interval: https://doi.org/10.1145/210089.210111, so reimplementing that to be JIT-able would be the bottleneck. Which seems like it could be a lot of work.

Copy link
Member Author

@sethaxen sethaxen Aug 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah wait, since interval is called outside the for loop and only pmf is called within, this could be simple.

@conditional_jit doesn't seem to like it if I call another conditionally-jitted function within the first conditionally-jitted function, e.g.

E           numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
E           Untyped global name '_update_ecdf_band_interior_probabilities': Cannot determine Numba type of <class 'arviz.utils.maybe_numba_fn'>

Is there a way to make this work?

@sethaxen sethaxen changed the title Add optimized simultaneous ECDF bands [WIP] Add optimized simultaneous ECDF bands Aug 22, 2024
@sethaxen sethaxen removed the request for review from OriolAbril August 22, 2024 10:35
@sethaxen
Copy link
Member Author

I worked out how to JIT compile with numba and definitely see large speed-ups for low ndraws and len(evaluation_points). For large ndraws the sizes of the prob_conditional matrices increases, and for large len(evaluation_points) the number of matmuls increase, so the paper mentions the following approach for efficiency:

The computation time required can be further reduced by using a grid of pre-computed values as the adjustment parameters, and interpolate for different values of N in log-log scale.

bayesplot defaults to interpolation for ndraws >= 100 & len(evaluation_points) > 200. I'll need to think more about whether this would work for arbitrary ECDFs or only for cases where the CDF at eval points is a uniform grid between 0 and 1 (as is the case for uniform eval points for uniform distribution).

@sethaxen
Copy link
Member Author

I'll need to think more about whether this would work for arbitrary ECDFs or only for cases where the CDF at eval points is a uniform grid between 0 and 1 (as is the case for uniform eval points for uniform distribution).

Yeah, that's the only context where this makes sense, which is the same context as PIT plots. Ergo, I'll leave interpolation for a separate PR, and in this PR I'll use a benchmark to heuristically identify when we should switch from optimized bands to simulated as the default in plot_ecdf.

@sethaxen sethaxen changed the title [WIP] Add optimized simultaneous ECDF bands Add optimized simultaneous ECDF bands Aug 27, 2024
Copy link
Member

@OriolAbril OriolAbril left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes look good, not sure if you still need help with numba, if so it wasn't clear to me what wasn't working. Do you have more changes planned for ecdf functionality? I wanted to test porting it to arviz-stats too

@sethaxen
Copy link
Member Author

changes look good, not sure if you still need help with numba, if so it wasn't clear to me what wasn't working.

No, it works now and is tested.

Do you have more changes planned for ecdf functionality?

Yes, mainly to make this efficient for multi-panel PIT plots. For those cases one wants to compute the band once and then add it to each subplot. And the band can be computed more efficiently in many cases by us fitting the band probability using optimization for a wide range of values and then interpolating.

Efficiency boosts shouldn't block porting it. It's just that as I implement the changes, it may motivate some code restructuring, so maybe it's better to hold off on porting until that is finished?

Later I will add the rank ECDF plots and corresponding bands, but that's strictly addition of new features and maybe should be added directly to arviz-stats.

@OriolAbril OriolAbril merged commit 5a928fb into main Aug 28, 2024
12 checks passed
@sethaxen sethaxen deleted the ecdf_bands_opt branch September 16, 2024 08:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants