Fix handling of small timesteps #423

Merged
merged 2 commits into master on Dec 30, 2023

Conversation

sjdaines
Contributor

The Sundials solver is allowed to warn and continue if it returns a timestep with no change in t.
If the solver doesn't recover, it will fail with e.g. SciMLBase.ReturnCode.MaxIters.

Reverts PR #416, fixes #420

For CVode and ARKode, it looks like this is intended behaviour: the Sundials solver emits a warning message, with API calls

CVodeSetMaxHnilWarns
ARKStepSetMaxHnilWarns

to set the maximum number of warning messages printed. This is exposed to Julia as

max_hnil_warns

See the Sundials driver code: https://github.com/LLNL/sundials/blob/e8a3e67e3883bc316c48bc534ee08319a5e8c620/src/cvode/cvode.c#L1339-L1347
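For illustration, here is a minimal sketch of setting this from Julia. The problem is a placeholder, and the keyword is assumed to follow the current Sundials.jl common interface:

```julia
using Sundials

# Placeholder scalar ODE; any ODEProblem will do.
prob = ODEProblem((du, u, p, t) -> (du .= -u), [1.0], (0.0, 1.0))

# max_hnil_warns caps how many "t + h = t" warnings CVODE prints
# (it maps to CVodeSetMaxHnilWarns in the Sundials C API).
sol = solve(prob, CVODE_BDF(max_hnil_warns = 10))
```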

For IDA, there is no API like this, so it's not so clear what should happen; however, the behaviour prior to PR #416 (and restored by this PR) for the f_noconverge test added in test/common_interface/ida.jl is to return SciMLBase.ReturnCode.MaxIters,
which seems reasonable?

(NB: @oscardssmith I can't reproduce the issue implied by the title of PR #416, so perhaps what is missing here is the original failure case? The call to solve in the f_noconverge test in test/common_interface/ida.jl returns SciMLBase.ReturnCode.MaxIters; it doesn't return with no error.

Also, issue SciML/SciMLBase.jl#458 shows IDA printing

[IDAS ERROR]  IDASolve
  At t = 0 and h = 1.49012e-18, the error test failed repeatedly or with |h| = hmin.

which suggests that IDA did flag an error in that case?)

@codecov

codecov bot commented Sep 23, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (883d752) 81.04% compared to head (849ef80) 81.09%.

@@            Coverage Diff             @@
##           master     #423      +/-   ##
==========================================
+ Coverage   81.04%   81.09%   +0.05%     
==========================================
  Files          11       11              
  Lines        1477     1476       -1     
==========================================
  Hits         1197     1197              
+ Misses        280      279       -1     


@oscardssmith
Contributor

> however the behaviour prior to PR #416 (and restored by this PR) for the f_noconverge test added in test/common_interface/ida.jl is to return SciMLBase.ReturnCode.MaxIters, which seems reasonable?

The problem with this is that it does return MaxIters, but only after spending maxiters iterations, which for non-trivial systems will be extremely slow. It looks like the behavior this PR re-enables is the better behavior for the non-IDA solvers. Would you be willing to adjust this PR so that it does fail with IDA but not with the other solvers?

@sjdaines
Contributor Author

sjdaines commented Sep 24, 2023

OK, so it looks like PR #416 is really a feature request for an early return on solver failure. So I'd suggest:

  1. Log an issue to that effect: 'enhancement request: early return from solver failure' or similar.
  2. Revert PR #416 (i.e. apply this PR). The problem as I see it is that this is really a non-trivial design change (as well as, currently, a breaking change and a regression in robustness) that interacts with solver robustness, and it needs more information and some thought. Currently we're doing this process backwards...

I did a little more digging into Sundials IDA to understand what is needed here:

  • The SciML default for maxiters (= 1e5) is much larger than the Sundials defaults (= 500 for CVode, ARKode and IDA), so that is possibly contributing to the issue seen here (I'm not suggesting changing this! Probably worth a note in the documentation though; see the sketch after this list).
  • In Sundials 6.2 there is a new API call IDASetMinStep (i.e. a way to make use of dtmin in the SciML interface), which might be the preferred fix, but that isn't available in the current Julia v5.2 jll. That, I think, is an incentive to update, i.e. [WIP] Refreshing the wrappers for Sundials 6.6 #415.
  • To understand the behaviour of the 'native' IDA main loop (IDASolve with option IDA_NORMAL to timestep until a supplied tout; cf. Julia executes single steps using IDA_ONE_STEP), I wrote a C test case that uses f_noconverge from test/common_interface/ida.jl. Testing with Sundials 6.2.1 shows that IDASolve() errors with flag -1 and message At t = 1, mxstep steps taken before reaching tout. So the native IDA main loop doesn't appear to implement any special-casing for the timestep going to zero, which suggests this is an intentional design decision.
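As a sketch of the first point (with a placeholder problem), the native Sundials default can be reproduced by passing maxiters explicitly through the common interface:

```julia
using Sundials

prob = ODEProblem((du, u, p, t) -> (du .= -u), [1.0], (0.0, 1.0))

# SciML's common-interface default is maxiters = 1e5; passing 500
# matches the native Sundials mxstep default, so failures surface sooner.
sol = solve(prob, CVODE_BDF(); maxiters = 500)
```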

So if we do want to implement an early return from failure, it looks to me like either (a) this would need to be clearly identified as an additional feature added by SciML (i.e. a behaviour change, not just an interface to Sundials), and it would need to do something like count the number of consecutive zero-length steps and error if some configurable threshold is exceeded; or (b) we need more information about the IDA design.

@oscardssmith
Contributor

Thanks for digging into this. I think the main tension here is whether we want the Sundials wrapper to be a light wrapper over the C code or a set of solvers that integrate as well as possible into the SciML stack. The rest of our solvers specify a dtmin, which is usually based on making sure we don't take size-zero steps. This not only prevents infinite loops, but also makes things like when callbacks trigger unambiguous. I think the thing I would need to see before I was comfortable merging this is an example where a solver actually completes after taking a zero-length step.

@sjdaines
Contributor Author

This CI log is an example of CVODE failure and then recovery (the original CI failure from #420, without the change from PR #416).

Here CVODE recovers, and the solution uses 3527 function evaluations for 2278 timesteps.

I just tried the same test case using some other solvers (this is a small stiff system of 19 equations), all at reltol=1e-4:

  • Sundials IDA 3986 evals, 2315 steps (no warnings logged)
  • LSODA.lsoda() evals not reported, 9155 steps (no warnings logged)
  • (also from memory a MATLAB version of this code gave similar results to CVODE using the ode15s solver)

and the recommended SciML native solvers for stiff systems (a sketch of such a sweep follows this list):

  • OrdinaryDiffEq.Rosenbrock23(autodiff=false) 131039 evals, 5429 steps (so ~30x the function evals and ~20x slower than CVODE)
  • OrdinaryDiffEq.TRBDF2(autodiff=false) 129302 evals, 23478 steps
  • OrdinaryDiffEq.FBDF(autodiff=false) fails with retcode=DtLessThanMin
  • OrdinaryDiffEq.QNDF(autodiff=false) fails with retcode=DtLessThanMin
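(For reference, a sweep like this could be scripted roughly as follows. This is a hypothetical sketch, not the actual CI code: `prob` stands in for the real 19-equation system, and the `sol.stats` fields assume a recent SciMLBase.)

```julia
using OrdinaryDiffEq, Sundials

# Placeholder; substitute the actual 19-equation stiff system here.
prob = ODEProblem((du, u, p, t) -> (du .= -u), ones(19), (0.0, 1.0))

for alg in (CVODE_BDF(), Rosenbrock23(autodiff = false),
            TRBDF2(autodiff = false), FBDF(autodiff = false),
            QNDF(autodiff = false))
    sol = solve(prob, alg; reltol = 1e-4)
    # nf counts f evaluations, naccept counts accepted steps
    println(nameof(typeof(alg)), ": retcode = ", sol.retcode,
            ", nf = ", sol.stats.nf, ", naccept = ", sol.stats.naccept)
end
```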

So in this case, consistency with the SciML solvers is definitely not what we want!

I'm sure there is something that could be changed (e.g. scaling of variables) to make this problem more solver-friendly, but looking at this from the point of view of solver robustness, it is consistent with my subjective impression that the Sundials CVODE and IDA solvers are a great deal more robust than the SciML solvers on "real world" problems. I suspect that robustness takes years of incremental improvement driven by user feedback to achieve, and isn't fully captured by small test cases.

From my point of view as a user, I personally would prefer a SciML design that maintains the strengths and the full performance and robustness of the native and wrapped solvers, even if this means that additional user configuration is required where the pieces don't exactly fit. Otherwise the risk is that the result is a lowest common denominator.

For the specific design question here about the handling of small timesteps: if it is a design decision of SciML to require that solver_step always returns a timestep > 0, then perhaps that could be handled by additional code within the Sundials-specific solver_step that makes multiple calls to CVode() etc. (up to some configurable limit < maxsteps, if you want to also provide an early error return)? But this already adds complexity, and possibly reduces robustness or makes issues harder to debug, so it's not obvious to me that this is a good idea...
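To sketch what I mean (every name below is hypothetical; this is not the actual Sundials.jl solver_step, just the shape of the retry idea):

```julia
# Hypothetical sketch of a retrying solver_step for the CVODE wrapper.
# `mem`, `CVode`, and `CV_ONE_STEP` only loosely mirror the real API;
# the point is the bounded retry loop with an early error return.
function solver_step_with_retry!(integrator, tstop; max_zero_steps = 100)
    tret = Ref(integrator.t)
    for _ in 1:max_zero_steps
        tprev = tret[]
        flag = CVode(integrator.mem, tstop, integrator.u, tret, CV_ONE_STEP)
        flag < 0 && return :Failure      # hard Sundials error
        if tret[] > tprev                # time advanced: the step is done
            integrator.t = tret[]
            return :Success
        end                              # otherwise retry the zero-length step
    end
    return :ZeroStepLimit                # give up before burning maxiters
end
```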

@oscardssmith
Contributor

Fair enough. I think I am now in favor of this PR.

@ChrisRackauckas do you have thoughts on what we should do here on the OrdinaryDiffEq side? Should the Julia stiff solvers use extended precision tracking of time?

@ChrisRackauckas
Member

> So in this case, consistency with the SciML solvers is definitely not what we want!

I mean, I could also make dumb comparisons which only include 2nd order methods with Sundials, restricting its order arbitrarily to make it slower. Obviously showing Rosenbrock23 and TRBDF2 in this kind of case is a straw man. Did you try the recommended list?

> it is consistent with my subjective impression that the Sundials CVODE and IDA solvers are a great deal more robust than the SciML solvers on "real world" problems

I wouldn't make such a statement from a very small set of academic cases. We have seen, more often than not, CVODE and IDA fail pretty spectacularly on many problems. Many of the benchmarks have had to have tolerances adjusted just to accommodate the run failures of Sundials. The one thing CVODE/IDA seems to work well on is reaction-advection-diffusion types of problems; anything with dominant complex eigenvalues is rather suspect. When you have one use case that works with a given solver, sure, that one tends to look good, but over the wide set of use cases that we have explored and benchmarked, industry and academic, this seems to be far from the truth. If you think otherwise, please start sharing benchmarks to the SciMLBenchmarks that we can explore more deeply. Currently there, Sundials tends to fail in a lot of cases with higher tolerances, and many cases had tolerances adjusted just so that Sundials would not diverge. If you only benchmark the problems of one domain, you will not see the full set of behaviors.

Note that one major complaint with the Sundials solvers is that, on average, they have a tendency to solve equations with a lot more error than the requested tolerance, a lot more than other solvers by an order of magnitude or two, which tends to make them look a bit better in "eye tests" simply because the solver quickly gives a "good enough" trajectory that is actually very inaccurate. You can see that as the right shift in the benchmarks. To account for this you really need to compute reference trajectories and explicitly calculate the solution error. Solvers at the same tolerances are not solving to the same accuracy, and CVODE is very clearly the one that relaxes the accuracy the most in almost every benchmark we've seen. On average, asking for 6 digits of accuracy gives only 3 with Sundials, while closer to 5 with a Rosenbrock method, for example. If your eyeball test only cares about a few digits of accuracy, that's fine, but we really need to care that equations are solved not just fast but also correctly.

> I personally would prefer a SciML design that maintains the strengths and the full performance and robustness of the native and wrapped solvers, even if this means that additional user configuration is required where the pieces don't exactly fit. Otherwise the risk is that the result is a lowest common denominator.

I mean, that's kind of an odd way to determine "lowest common denominator". Options which open up the direct possibility of silently incorrect and non-convergent behavior should not be taken lightly. I am also of the opinion that we should lighten the checks on minimum-dt throwing, but I do understand that the reason it is there stems from the justification in the Hairer Fortran codes that such behavior can lead to incorrect solutions in some cases. The Sundials implementation seems to just not worry about warning/erroring in those cases and continues forward, and by doing so it will give a correct answer in many cases, though in some cases it can lose convergence guarantees and simply give a wrong answer. Such a behavior isn't obviously the best idea right out of the gate and shouldn't be taken lightly: solving equations fast is different from solving them accurately, and accuracy seems to be what Sundials has a lot of trouble with.

That said, as stated above, I do think the Hairer approach is a bit too heavy-handed, since solvers do have a recovery behavior more often than not. The default error checking should probably be tied much more directly to the exact behavior it is meant to stop: infinite looping. The reason for this difference is mostly that the instability of BDF methods tends to make dt go very close to zero even when that's not required, simply because the method is unstable (only A(alpha)-stable, so dominant complex eigenvalues, for example, can lead to instability and dt -> 0 type behavior), while this is not the case for stiffly-stable methods like Rodas5P, which have A-B-L stability and thus should in general never take a small dt except when needed for very-low-tolerance convergence. Since the Hairer suite is all (I)RK methods and Rosenbrock methods with sufficient stability criteria, this diagnostic barely ever gets in the way, which is why Rosenbrock23 and TRBDF2, while bad choices for this case, will simply chug along and solve it. The issue, though, is that somewhere along the line we added A(alpha)-stable methods, which need small-dt handling to recover around things like regions of high complex eigenvalues or events, but the error-checking behavior was kept stricter, in a form that assumes L-stability. In most cases it does not hurt to relax this, so the best thing to do here is probably to just relax it so that BDF methods have an easier time recovering. We could make this solver-dependent, but keeping the rules simpler might be good.

In that case we could do the following. Inside of the solver loop, it should check whether t + dt == t and error out by default if that is the case. It can just use DtLessThanMin. We can have a default dtmin = nothing, which then only checks for dt > eps(t) at each step, which is equivalent to what Sundials does by default.
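In sketch form (hypothetical field names, not the actual OrdinaryDiffEq internals):

```julia
# Sketch of the proposed default check: error only when the step is too
# small to advance time in floating point. With dtmin = nothing this is
# all that remains, effectively requiring dt > eps(t) at each step,
# which matches Sundials' default minimum step of zero.
function step_advances_time!(integrator)
    t, dt = integrator.t, integrator.dt
    if t + dt == t
        integrator.retcode = SciMLBase.ReturnCode.DtLessThanMin
        return false   # abort the solve
    end
    return true
end
```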

> if it is a design decision of SciML to require that solver_step always returns a timestep > 0, then perhaps that could be handled by additional code within the Sundials-specific solver_step that makes multiple calls to CVode() etc. (up to some configurable limit < maxsteps, if you want to also provide an early error return)? But this already adds complexity, and possibly reduces robustness or makes issues harder to debug, so it's not obvious to me that this is a good idea...

Sundials also only allows a timestep > 0. It's not numerically robust to allow timesteps < 0, because reverse solves are not in general numerically stable; in fact, there are many cases where a reverse solve is unconditionally unstable for any |dt| > 0 (for example, upwinding schemes for advection). So if you let dt < 0 as a way to reverse, you do not generally get back to the same point, and in many cases you will introduce numerical error that has no reverse-stability property. Sundials' default minimum step is zero, so it effectively defaults to requiring dt > eps(t). Thus the scheme above would give OrdinaryDiffEq similar behavior.

We should give this a try and see if it's all that's required for a BDF to handle this case effectively. Did you try just setting dtmin=0?

@ChrisRackauckas
Member

Did you spawn the dtmin=0 run?

@ChrisRackauckas
Member

Do you have the MWE you can share? It can even include the packages. I just want to do some proper testing on this and don't have a case.

ChrisRackauckas added a commit to ChrisRackauckas/PALEOcopse.jl that referenced this pull request Dec 29, 2023
I'm running the test that was suggested in SciML/Sundials.jl#423.

> I just tried the same test case using some other solvers (this is a small stiff system of 19 equations), all at reltol=1e-4:

How did you do that? I didn't see a loop over the solvers in the CI here at all. I'd like to see if this option is all that's needed.
ChrisRackauckas added a commit to SciML/OrdinaryDiffEq.jl that referenced this pull request Dec 30, 2023
This has been a long time coming. Ever since the issue SciML/Sundials.jl#423 I've been like "could you actually do that?", since Sundials does seem to default to dtmin=0, so I've been testing it out. We started down this path because the Hairer codes use it, and there's a justification that I gave as to why it might make more sense for Runge-Kutta codes than it does for BDF codes (SciML/Sundials.jl#423).

But... I've come to the realization that it is a good idea to have dtmin=0 for a few reasons. It's hard to pinpoint the codes since I've seen a few thousand models since September, but a few things have stood out to me.

1. We've had a lot of false negatives with dtmin: #1830, fixes SciML/DifferentialEquations.jl#924, fixes #1616, fixes #1879. There are just a lot of these cases that can pop up, and it's hard to individually root them all out. There was a whole bunch of work that @oscardssmith did last year (#1762) to try to root out the false positives, and I think there are a lot fewer now, but it led to a separate concern I will mention next. This basically means that any dtmin > 0 has a downside: at some point you get false-negative exits. It's good to exit early in some cases, but in many cases it's "hyperactive".

2. Setting dtmin to a reasonable default to not get false positives ends up being too conservative for it to be useful in a lot of scenarios. To see what I mean, here's some stiff ODE solves:

```julia
using OrdinaryDiffEq, SciMLBase

function rober!(du, u, p, t)
    y₁, y₂, y₃ = u
    k₁, k₂, k₃ = p
    du[1] = -k₁ * y₁ + k₃ * y₂ * y₃
    du[2] = k₁ * y₁ - k₂ * y₂^2 - k₃ * y₂ * y₃
    du[3] = k₂ * y₂^2
    nothing
end
prob = ODEProblem(rober!, [1.0, 0.0, 0.0], (0.0, 1e5), [0.04, 3e7, 1e4])
sol = solve(prob, Tsit5())

const k1=.35e0
const k2=.266e2
const k3=.123e5
const k4=.86e-3
const k5=.82e-3
const k6=.15e5
const k7=.13e-3
const k8=.24e5
const k9=.165e5
const k10=.9e4
const k11=.22e-1
const k12=.12e5
const k13=.188e1
const k14=.163e5
const k15=.48e7
const k16=.35e-3
const k17=.175e-1
const k18=.1e9
const k19=.444e12
const k20=.124e4
const k21=.21e1
const k22=.578e1
const k23=.474e-1
const k24=.178e4
const k25=.312e1

function f2(dy,y,p,t)
 r1  = k1 *y[1]
 r2  = k2 *y[2]*y[4]
 r3  = k3 *y[5]*y[2]
 r4  = k4 *y[7]
 r5  = k5 *y[7]
 r6  = k6 *y[7]*y[6]
 r7  = k7 *y[9]
 r8  = k8 *y[9]*y[6]
 r9  = k9 *y[11]*y[2]
 r10 = k10*y[11]*y[1]
 r11 = k11*y[13]
 r12 = k12*y[10]*y[2]
 r13 = k13*y[14]
 r14 = k14*y[1]*y[6]
 r15 = k15*y[3]
 r16 = k16*y[4]
 r17 = k17*y[4]
 r18 = k18*y[16]
 r19 = k19*y[16]
 r20 = k20*y[17]*y[6]
 r21 = k21*y[19]
 r22 = k22*y[19]
 r23 = k23*y[1]*y[4]
 r24 = k24*y[19]*y[1]
 r25 = k25*y[20]

 dy[1]  = -r1-r10-r14-r23-r24+
          r2+r3+r9+r11+r12+r22+r25
 dy[2]  = -r2-r3-r9-r12+r1+r21
 dy[3]  = -r15+r1+r17+r19+r22
 dy[4]  = -r2-r16-r17-r23+r15
 dy[5]  = -r3+r4+r4+r6+r7+r13+r20
 dy[6]  = -r6-r8-r14-r20+r3+r18+r18
 dy[7]  = -r4-r5-r6+r13
 dy[8]  = r4+r5+r6+r7
 dy[9]  = -r7-r8
 dy[10] = -r12+r7+r9
 dy[11] = -r9-r10+r8+r11
 dy[12] = r9
 dy[13] = -r11+r10
 dy[14] = -r13+r12
 dy[15] = r14
 dy[16] = -r18-r19+r16
 dy[17] = -r20
 dy[18] = r20
 dy[19] = -r21-r22-r24+r23+r25
 dy[20] = -r25+r24
end

function fjac(J,y,p,t)
      J .= 0.0
      J[1,1]   = -k1-k10*y[11]-k14*y[6]-k23*y[4]-k24*y[19]
      J[1,11]  = -k10*y[1]+k9*y[2]
      J[1,6]   = -k14*y[1]
      J[1,4]   = -k23*y[1]+k2*y[2]
      J[1,19]  = -k24*y[1]+k22
      J[1,2]   = k2*y[4]+k9*y[11]+k3*y[5]+k12*y[10]
      J[1,13]  = k11
      J[1,20]  = k25
      J[1,5]   = k3*y[2]
      J[1,10]  = k12*y[2]

      J[2,4]   = -k2*y[2]
      J[2,5]   = -k3*y[2]
      J[2,11]  = -k9*y[2]
      J[2,10]  = -k12*y[2]
      J[2,19]  = k21
      J[2,1]   = k1
      J[2,2]   = -k2*y[4]-k3*y[5]-k9*y[11]-k12*y[10]

      J[3,1]   = k1
      J[3,4]   = k17
      J[3,16]  = k19
      J[3,19]  = k22
      J[3,3]   = -k15

      J[4,4]   = -k2*y[2]-k16-k17-k23*y[1]
      J[4,2]   = -k2*y[4]
      J[4,1]   = -k23*y[4]
      J[4,3]   = k15

      J[5,5]   = -k3*y[2]
      J[5,2]   = -k3*y[5]
      J[5,7]   = 2k4+k6*y[6]
      J[5,6]   = k6*y[7]+k20*y[17]
      J[5,9]   = k7
      J[5,14]  = k13
      J[5,17]  = k20*y[6]

      J[6,6]   = -k6*y[7]-k8*y[9]-k14*y[1]-k20*y[17]
      J[6,7]   = -k6*y[6]
      J[6,9]   = -k8*y[6]
      J[6,1]   = -k14*y[6]
      J[6,17]  = -k20*y[6]
      J[6,2]   = k3*y[5]
      J[6,5]   = k3*y[2]
      J[6,16]  = 2k18

      J[7,7]   = -k4-k5-k6*y[6]
      J[7,6]   = -k6*y[7]
      J[7,14]  = k13

      J[8,7]   = k4+k5+k6*y[6]
      J[8,6]   = k6*y[7]
      J[8,9]   = k7

      J[9,9]   = -k7-k8*y[6]
      J[9,6]   = -k8*y[9]

      J[10,10] = -k12*y[2]
      J[10,2]  = -k12*y[10]+k9*y[11]
      J[10,9]  = k7
      J[10,11] = k9*y[2]

      J[11,11] = -k9*y[2]-k10*y[1]
      J[11,2]  = -k9*y[11]
      J[11,1]  = -k10*y[11]
      J[11,9]  = k8*y[6]
      J[11,6]  = k8*y[9]
      J[11,13] = k11

      J[12,11] = k9*y[2]
      J[12,2]  = k9*y[11]

      J[13,13] = -k11
      J[13,11] = k10*y[1]
      J[13,1]  = k10*y[11]

      J[14,14] = -k13
      J[14,10] = k12*y[2]
      J[14,2]  = k12*y[10]

      J[15,1]  = k14*y[6]
      J[15,6]  = k14*y[1]

      J[16,16] = -k18-k19
      J[16,4]  = k16

      J[17,17] = -k20*y[6]
      J[17,6]  = -k20*y[17]

      J[18,17] = k20*y[6]
      J[18,6]  = k20*y[17]

      J[19,19] = -k21-k22-k24*y[1]
      J[19,1]  = -k24*y[19]+k23*y[4]
      J[19,4]  = k23*y[1]
      J[19,20] = k25

      J[20,20] = -k25
      J[20,1]  = k24*y[19]
      J[20,19] = k24*y[1]

      return
end

u0 = zeros(20)
u0[2]  = 0.2
u0[4]  = 0.04
u0[7]  = 0.1
u0[8]  = 0.3
u0[9]  = 0.01
u0[17] = 0.007
prob = ODEProblem(ODEFunction{true, SciMLBase.FullSpecialize}(f2, jac=fjac),u0,(0.0,60.0))

sol = solve(prob,Tsit5())

const N = 32

xyd_brusselator = range(0,stop=1,length=N)
brusselator_f(x, y, t) = (((x-0.3)^2 + (y-0.6)^2) <= 0.1^2) * (t >= 1.1) * 5.
limit(a, N) = a == N+1 ? 1 : a == 0 ? N : a
function brusselator_2d_loop(du, u, p, t)
  A, B, alpha, dx = p
  alpha = alpha/dx^2
  @inbounds for I in CartesianIndices((N, N))
    i, j = Tuple(I)
    x, y = xyd_brusselator[I[1]], xyd_brusselator[I[2]]
    ip1, im1, jp1, jm1 = limit(i+1, N), limit(i-1, N), limit(j+1, N), limit(j-1, N)
    du[i,j,1] = alpha*(u[im1,j,1] + u[ip1,j,1] + u[i,jp1,1] + u[i,jm1,1] - 4u[i,j,1]) +
                B + u[i,j,1]^2*u[i,j,2] - (A + 1)*u[i,j,1] + brusselator_f(x, y, t)
    du[i,j,2] = alpha*(u[im1,j,2] + u[ip1,j,2] + u[i,jp1,2] + u[i,jm1,2] - 4u[i,j,2]) +
                A*u[i,j,1] - u[i,j,1]^2*u[i,j,2]
    end
end
p = (3.4, 1., 10., step(xyd_brusselator))

input = rand(N,N,2)
output = similar(input)
f = ODEFunction{true, SciMLBase.FullSpecialize}(brusselator_2d_loop)
function init_brusselator_2d(xyd)
  N = length(xyd)
  u = zeros(N, N, 2)
  for I in CartesianIndices((N, N))
    x = xyd[I[1]]
    y = xyd[I[2]]
    u[I,1] = 22*(y*(1-y))^(3/2)
    u[I,2] = 27*(x*(1-x))^(3/2)
  end
  u
end

u0 = init_brusselator_2d(xyd_brusselator)
prob = ODEProblem(f,u0,(0.,11.5),p);

sol = solve(prob,Tsit5())
```

The first two stiff ODEs exit with maxiters. The last one actually completes, but it used to exit due to `dtmin`, so it was a false negative; now it's just a slow way to compute it, but it works. So these are the cases where in theory `dtmin` should cause earlier exits (a non-stiff ODE solver on a stiff ODE), and it's not even the thing causing the exits here, because it's small enough to not be "the" factor.

3. One of the arguments that I have seen for dtmin is that it can reduce the possibility of numerical error. But numerical error with dt being super small isn't compounding; the issue is that you get `x + dt*y == x`. If you look at what happens with very small steps on exponential growth (exponential growth is an ODE which has exponential error growth, and is thus a hard, not easy, test for ODE solver error behavior), then you can see what I mean:

```julia
using OrdinaryDiffEq, Plots
f = (u,p,t) -> -p.a*u
prob = ODEProblem(f, [10.0], (0.0, 0.1), (a=-100.0,))
sol1 = solve(prob, Vern7(), dtmin=0);
sol2 = solve(prob, Vern7(), tstops=[sol1.t[2] + eps()], dtmin=0)
sol3 = solve(prob, Vern7(), tstops=[sol1.t[2] + 0.01eps()], dtmin=0)
sol4 = solve(prob, Vern7(), tstops=[sol1.t[2] + 0.001eps()], dtmin=0)

plot(sol1)
plot!(sol2)
plot!(sol3)
plot!(sol4)

t = 0.0:1e-5:0.1
@show maximum((sol1(t) - sol2(t)) ./ sol1(t)) # 8.678693835607108e-7
@show maximum((sol1(t) - sol3(t)) ./ sol1(t)) # 2.680284044683277e-13
@show maximum((sol1(t) - sol4(t)) ./ sol1(t)) # 0.0
```

Small dt's forced by tstops are safe. I've been playing around with many of these examples that have come up, and it's safe. The worst that happens is that you take a 1e-18-sized step and don't update `u`, but that's because the `u` update is below floating-point resolution. You resolve the tstop just fine! So I don't think dtmin is enforcing a safety behavior.

And finally...

4. It can do good things to have dtmin=0 and dt < eps(t)! This one kind of shocked me, but now I kind of get it. For non-stiff ODEs it doesn't really matter, but for stiff ODEs and DAEs you can have some interesting adverse behaviors due to the way that nonlinear solvers work. Sometimes rejecting the step and shrinking dt can improve the next step's prediction and make the Jacobian less singular. Another interesting behavior is that sometimes a "real"-sized step has too much error because a DAE's nonlinear algebraic equations can be changing their branches, i.e. there is a discontinuity. Handling that discontinuity with dt=1e-18 can make it so that you are effectively doing something akin to a callback: time almost doesn't move forward and the differential variables don't change, but the algebraic equations can be bumped into a new state. With any larger step, the error estimate denies the change. Thus allowing extremely small steps improves robustness to discontinuities, especially on hard problems. This seems to be one of the reasons why CVODE and IDA do well on a few cases we didn't before. It's not anything to do with the solver; it's just that we gave up early with dtmin, while Sundials would just keep trying with smaller-than-reasonable dt's, resolve the issue, and move on.

With those 4 points, I think it's time to set dtmin=0.
@ChrisRackauckas
Member

@sjdaines I would still like to dig into those results you have on your CI there. I played with this a lot in other examples, and it led to the conclusion that we should not just change dtmin=0 here but also generally in DifferentialEquations.jl. Full justification is in SciML/OrdinaryDiffEq.jl#2098. With this, I can't find a case where CVODE succeeds but FBDF doesn't, so it would be good to test more. It turns out this little bit was a major blocker!

@ChrisRackauckas ChrisRackauckas merged commit 849ef80 into SciML:master Dec 30, 2023
4 of 5 checks passed
Successfully merging this pull request may close these issues.

v4.19.4 CVODE recoverable warnings now cause failures