
Asymptotic expansion for large orders. Besselk(nu, x) #8

Merged
merged 9 commits into master from asymptotic-expansion
Jan 22, 2022

Conversation

heltonmc
Member

This is a rough first draft of implementing a more efficient algorithm for large orders. It uses equation 10.41.4 from https://dlmf.nist.gov/10.41, which is also found in Abramowitz and Stegun's Handbook of Mathematical Functions as equation 9.7.8. This also appears to be the base algorithm of the Amos Fortran routines (https://dl.acm.org/doi/10.1145/7921.214331).
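For reference, the uniform asymptotic expansion being implemented (DLMF 10.41.4, with z = x/ν) is

K_\nu(\nu z) \sim \sqrt{\frac{\pi}{2\nu}} \, \frac{e^{-\nu\eta}}{(1+z^2)^{1/4}} \sum_{k=0}^{\infty} (-1)^k \frac{U_k(p)}{\nu^k},
\qquad \eta = \sqrt{1+z^2} + \log\frac{z}{1+\sqrt{1+z^2}},
\qquad p = (1+z^2)^{-1/2}.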

This expansion can also be used for besseli, followed by downward recurrence for all orders of besseli.
The key question is at what order it becomes appropriate to use the expansion. For larger orders you need fewer terms in the sum, but you lose more time to the recurrence. Finding the optimal switchover point will be key.

On the other hand, calculating the U polynomials is an absolute pain and they aren't tabulated anywhere. They are generated by a recurrence involving both integration and differentiation. These were mostly calculated by hand, and I am still double-checking them.
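For context, the recurrence in question (A&S 9.3.10) builds each polynomial from the previous one via

U_0(p) = 1, \qquad U_{k+1}(p) = \tfrac{1}{2} p^2 (1 - p^2)\, U_k'(p) + \tfrac{1}{8} \int_0^p (1 - 5t^2)\, U_k(t)\, dt,

which is why each new term needs both a derivative and an integral of the one before it.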

The current implementation is not at all optimized and suffers SEVERE rounding errors. I am having to push BigFloats through it to pass the tests.

Ideally the U polynomials can be rearranged to extract their coefficients so we can use Horner's scheme. I'm still working through that, but we have to be very careful about how things are rounded for this to work.

@heltonmc heltonmc marked this pull request as draft January 18, 2022 03:55
@codecov-commenter

codecov-commenter commented Jan 18, 2022

Codecov Report

Merging #8 (ef12079) into master (e5dfe6c) will increase coverage by 2.47%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master       #8      +/-   ##
==========================================
+ Coverage   95.78%   98.26%   +2.47%     
==========================================
  Files          11       11              
  Lines         475      575     +100     
==========================================
+ Hits          455      565     +110     
+ Misses         20       10      -10     
Impacted Files           Coverage Δ
src/Bessels.jl           100.00% <ø> (ø)
src/U_polynomials.jl     100.00% <100.00%> (ø)
src/besseli.jl           100.00% <100.00%> (ø)
src/besselk.jl           100.00% <100.00%> (ø)
src/math_constants.jl    100.00% <100.00%> (ø)
src/recurrence.jl        100.00% <100.00%> (ø)
src/hankel.jl

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e5dfe6c...ef12079. Read the comment docs.

@heltonmc
Member Author

heltonmc commented Jan 18, 2022

To simplify this problem let's look at a Float32 implementation. This needs about 5 U polynomials. A naive implementation would be something like...

 function bk32(v, x::T) where T <: AbstractFloat
    z = x / v
    zs = sqrt(1 + z^2)
    n = zs + log(z / (1 + zs))
    coef = exp(-v * n) * sqrt(pi / (2*v)) / sqrt(zs)
    p = inv(zs)
 
    u0 = one(x)
    u1 = p / 24 * (3 - 5*p^2) * -1 / v
    u2 = p^2 / 1152 * (81 - 462*p^2 + 385*p^4) / v^2
    u3 = p^3 / 414720 * (30375 - 369603 * p^2 + 765765*p^4 - 425425*p^6) * -1 / v^3
    u4 = p^4 / 39813120 * (4465125 - 94121676*p^2 + 349922430*p^4 - 446185740*p^6 + 185910725*p^8) / v^4
    u5 = p^5 / 6688604160 * (-188699385875*p^10 + 566098157625*p^8 - 614135872350*p^6 + 284499769554*p^4 - 49286948607*p^2 + 1519035525) * -1 / v^5

    out = u0 + u1 + u2 + u3 + u4 + u5
 
    return coef*out
 end

Here we simply add up the U polynomials in series. In the following example this actually provides good enough accuracy all the way up to Float64 precision.

## Compare relative error / eps to ArbNumerics.jl

# using BigFloats as accuracy check
julia> abs(1 - bk32(BigFloat(200), BigFloat(100.0)) / besselk(200, ArbFloat(100.0, digits=70))) / eps(Float64)
0.93578

# using Float64
julia> abs(1 - bk32(Float64(200), Float64(100.0)) / besselk(200, ArbFloat(100.0, digits=70))) / eps(Float64)
77.7788

# using Float32
julia> abs(1 - bk32(Float32(200), Float32(100.0)) / besselk(200, ArbFloat(100.0, digits=70))) / eps(Float32)
229.8961

Though, it requires high precision to get accurate results.
For speed, we can try to expand the sum u0 + u1 + u2 + ... as a single polynomial in p and find its coefficients. We can then use Horner's scheme like this (this block evaluates very fast, <80 ns, for any order)...

function bk32_poly(v, x)
    z = x / v
    zs = sqrt(1 + z^2)
    n = zs + log(z / (1 + zs))
    n = exp(-v * n) * sqrt(pi / (2*v)) / sqrt(zs)
    p = inv(zs)

    poly_coeffs = (
        1,
        -1/(8*v),
        9/(128*v^2),
        (640*v^2 - 225)/(3072*v^3), 
        (11025 - 39424*v^2)/(98304*v^4),
        (1168128*v^2 - 297675)/(1310720*v^5),
        (123200*v^2 - 871497)/(368640*v^4),
        (608480847 - 152472320*v^2)/(82575360*v^5),
        144001/(16384*v^4),
        (54454400*v^2 - 2257934679)/(53084160*v^5),
        -7436429/(663552*v^4),
        108313205/(1179648*v^5),
        37182145/(7962624*v^4),
        -5391411025/(63700992*v^5),
        5391411025/(191102976*v^5)
    )
    return evalpoly(p, poly_coeffs) * n
end

Though its results seem to suffer even more, and it is not even as accurate in high precision. This could be an error on my part in setting the problem up; I have to double check. However, it produces the same results for Float32.

## Compare relative error / eps to ArbNumerics.jl

# using BigFloats as check
julia> abs(1 - bk32_poly(BigFloat(200), BigFloat(100.0)) / besselk(200, ArbFloat(100.0, digits=70))) / eps(Float64)
8790.07158


# using Float64
julia> abs(1 - bk32_poly(Float64(200), Float64(100.0)) / besselk(200, ArbFloat(100.0, digits=70))) / eps(Float64)
8712.076

# using Float32
julia> abs(1 - bk32_poly(Float32(200), Float32(100.0)) / besselk(200, ArbFloat(100.0, digits=70))) / eps(Float32)
229.8961288

@oscardssmith Do you have thoughts on the best way to compute bk32 accurately and quickly? The good thing about this approach is that it will provide very accurate results (if we can get the rounding right) for Float64 with around 8-9 U polynomials, and for Float128 with around 16-18 terms, depending on the limits of the expansion. The coefficients have for now been left in their exact form.

@oscardssmith
Member

What about

 function bk32(v, x::T) where T <: AbstractFloat
    z = x / v
    zs = sqrt(1 + z^2)
    n = zs + log(z / (1 + zs))
    coef = exp(-v * n) * sqrt(pi / (2*v)) / sqrt(zs)
    p = inv(zs)
    p2  = p*p
    u0 = one(x)
    u1 = 1 / 24 * evalpoly(p2, (3, -5))
    u2 = 1 / 1152 * evalpoly(p2, (81, -462, 385))
    u3 = 1 / 414720 * evalpoly(p2, (30375, -369603, 765765, -425425))
    ...

    out = evalpoly(-p/v, (u0, u1, u2, u3, u4, u5))
 
    return coef*out
 end

@oscardssmith
Member

Also, for Float32, the best approach if accuracy is desired is generally to do the computation in Float64. It's almost free to do the transformation, and it avoids having to deal with rounding or over/underflow.
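A minimal sketch of that idea (the name _besselk_f64 below is a hypothetical stand-in for the Float64 kernel, not an existing routine):

# Sketch: compute the Float32 result by promoting to Float64, then demoting back.
besselk_asymptotic(v, x::Float32) = Float32(_besselk_f64(Float64(v), Float64(x)))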

@heltonmc
Member Author

heltonmc commented Jan 18, 2022

Yeah, that is really impressive!! I like how we can keep the coefficients all exact. And it's very fast all things considered, while providing similar errors!

julia> abs(1 - bk32(BigFloat(200), BigFloat(100.0)) / besselk(200, ArbFloat(100.0, digits=70)))
2.077863330447360112e-16

julia> abs(1 - bk32_new(BigFloat(200), BigFloat(100.0)) / besselk(200, ArbFloat(100.0, digits=70)))
2.077760317644734432e-16

So we do lose some accuracy when doing it in Float64, so I think we will have to add some compensation here. I will incorporate more U polynomials and test.

julia> abs(1 - bk32_new(Float64(200), Float64(100.0)) / besselk(200, ArbFloat(100.0, digits=70)))
1.69273029004913010678098354694302671752893517308314639783117236510783504027076e-14

And Float32

julia> abs(1 - bk32_new(Float32(200), Float32(100.0)) / besselk(200, ArbFloat(100.0, digits=70)))
2.74075466924646385826117391071280851161159713028415252243504344281333675837797e-5

Edit: And that sounds good regarding your previous comment! I think we can use even fewer U polynomials depending on where we want our expansion cutoff to be then.

@oscardssmith
Member

oscardssmith commented Jan 18, 2022

Also, for Float64 (and higher), you definitely want to spend more work computing p2 accurately. p2 is currently only accurate to 3-5 ULP (very roughly). The first obvious improvement is to get rid of the sqrt and ^2 that currently cancel each other out: p2 = inv((x/v)^2 + 1). This can be computed more accurately as (v/hypot(v,x))^2 (or possibly v^2/fma(max(v,x), max(v,x), min(v,x)^2)).
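A quick way to sanity-check that suggestion (a small sketch, not part of the PR) is to compare each formula against a correctly rounded BigFloat reference:

v, x = 100.0, 10.0
p2_naive = inv((x / v)^2 + 1)                            # sqrt and ^2 cancelled analytically
p2_hypot = (v / hypot(v, x))^2                           # hypot-based version
p2_fma   = v^2 / fma(max(v, x), max(v, x), min(v, x)^2)  # fma-based version
p2_ref   = Float64(big(v)^2 / (big(v)^2 + big(x)^2))     # correctly rounded reference

Whichever of the three matches p2_ref most often over the inputs of interest is the one to keep.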

@heltonmc
Member Author

Sounds great. I am quickly checking it over, and it looks like v^2/fma(max(v,x), max(v,x), min(v,x)^2) will provide the correctly rounded result more often.

It looks like the biggest loss is coming in the evalpoly for the higher U polynomials. For example.

#Float 64
julia> v = 100.0; x = 10.0
julia> p2  = v^2/fma(max(v,x), max(v,x), min(v,x)^2) # rounded correctly
0.9900990099009901
julia> u8 = 1 / 22191183337881600 * evalpoly(p2, (134790179652253125, -10960565081605263000, 157768535329832893644, -914113758588905038248, 2711772922412520971550, -4513690624987320777000, 4272845805510421639500, -2152114239059719935000, 448357133137441653125))
0.0021583037489126965

#BigFloat
julia> v = BigFloat(100.0); x = BigFloat(10.0)
julia> p2  = v^2/fma(max(v,x), max(v,x), min(v,x)^2) # rounded correctly
0.9900990099009900990099009900990099009900990099009900990099009900990099009901022

julia> u8 = 1 / 22191183337881600 * evalpoly(p2, (134790179652253125, -10960565081605263000, 157768535329832893644, -914113758588905038248, 2711772922412520971550, -4513690624987320777000, 4272845805510421639500, -2152114239059719935000, 448357133137441653125))
0.002158303737181252245696702046689380173846888351401502380264769750067125428000861

So p2 is correctly rounded now, but there is an increasingly large loss for the higher U polynomials. It also looks like the calculation of n = zs + log(z / (1 + zs)) is off by 2 decimal places.

The Float32 implementation is working great. We only need 4 U polynomials to get accurately rounded results. The cutoff where we want to switch over from forward recurrence can happen as low as nu < 15, but it is actually faster to just do recurrence there, so we can adjust the cutoff as necessary. For besselk this doesn't matter too much because it is always fast: low orders are generated very quickly with besselk0 and besselk1 followed by forward recurrence. For besseli forward recurrence isn't stable, so we need backward recurrence from this asymptotic expansion.
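For reference, the forward recurrence in question is the standard three-term relation K_{nu+1}(x) = K_{nu-1}(x) + (2 nu / x) K_nu(x); a minimal sketch for integer orders, with besselk0/besselk1 standing in for the order-0 and order-1 kernels, would look like:

function besselk_up_recurrence(nu::Integer, x)
    k0, k1 = besselk0(x), besselk1(x)       # K_0(x), K_1(x)
    nu == 0 && return k0
    for n in 1:nu-1
        k0, k1 = k1, k0 + (2n / x) * k1     # K_{n+1} = K_{n-1} + (2n/x) K_n
    end
    return k1
end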

So we will need to use this routine to generate besseli(nu+1) and besseli(nu) and then use backward recurrence. In this case we will want the cutoff point as low as we can get it. I'll have to test more thoroughly to find this optimal point, but it'll be essential to get these correctly rounded as they will generate most of the other orders. P.S. These routines also work for fractional orders and complex values, but we'll tackle that afterwards 😄

@heltonmc
Member Author

@oscardssmith I'm just seeing your discourse post on more accurate evalpoly routines https://discourse.julialang.org/t/more-accurate-evalpoly/45932. Did you pursue those routines any further?

Looks like there is another reference here: https://arxiv.org/pdf/cs/0610122.pdf. Do you think the compensated Horner scheme would be a good candidate here? I'm unsure if there's another way around this.

@oscardssmith
Member

oscardssmith commented Jan 19, 2022

Base.Math.ext_horner as of 1.7 has a partially compensated Horner scheme. I'll write you a fully compensated one tonight. The partially compensated one is about 3x slower than uncompensated, and the fully compensated one is roughly 5x slower. The difference is that the partial compensation is only fully accurate if the higher-order terms converge to 0.

@oscardssmith
Member

Oh, I already put the fully compensated version in the discourse issue you linked. Here it is for reference.

@inline function two_sum(a, b)
    # error-free transformation: hi + lo equals a + b exactly
    hi = a + b
    a1 = hi - b
    lo = (a - a1) + (b - (hi - a1))
    return hi, lo
end

@inline function exthorner(x, p::Tuple)
    # compensated Horner scheme: returns the evaluated polynomial and a correction term
    hi, lo = p[end], zero(x)
    for i in length(p)-1:-1:1
        pi = p[i]
        prod = hi*x
        err1 = fma(hi, x, -prod)       # rounding error of hi*x
        hi, err2 = two_sum(pi,prod)    # rounding error of the addition
        lo = fma(lo, x, err1 + err2)   # accumulate the error terms
    end
    return hi, lo
end

@heltonmc
Member Author

It looks like the issue is that when p2 is close to one we get really bad errors.

p2 = 0.98
julia> evalpoly(p2, (134790179652253125, -10960565081605263000, 157768535329832893644, -914113758588905038248, 2711772922412520971550, -4513690624987320777000, 4272845805510421639500, -2152114239059719935000, 448357133137441653125))
2.05403516225376e14
julia> evalpoly(BigFloat(p2), (134790179652253125, -10960565081605263000, 157768535329832893644, -914113758588905038248, 2711772922412520971550, -4513690624987320777000, 4272845805510421639500, -2152114239059719935000, 448357133137441653125))
2.054035159478429426618583143791227160776082859513700290192954935066092917184701e+14
julia> exthorner(p2, (134790179652253125, -10960565081605263000, 157768535329832893644, -914113758588905038248, 2711772922412520971550, -4513690624987320777000, 4272845805510421639500, -2152114239059719935000, 448357133137441653125))
(2.05403516225376e14, -45981.289898326024)

I might not be using exthorner right, but it looks like it is still returning an inaccurate result. For larger values of p2 we get good results...

julia> p2 = 5.0
5.0

julia> evalpoly(BigFloat(p2), (134790179652253125, -10960565081605263000, 157768535329832893644, -914113758588905038248, 2711772922412520971550, -4513690624987320777000, 4272845805510421639500, -2152114239059719935000, 448357133137441653125))
6.12479961153197164448576e+25

julia> evalpoly(p2, (134790179652253125, -10960565081605263000, 157768535329832893644, -914113758588905038248, 2711772922412520971550, -4513690624987320777000, 4272845805510421639500, -2152114239059719935000, 448357133137441653125))
6.124799611531971e25

So the issue is that p2 is almost always close to one, since v is usually comparable to x or much larger, which amplifies the rounding errors. In the case when x >> v we will get good results, though for that region another expansion might converge better anyway, since this one is for large values of v.

@oscardssmith
Member

oscardssmith commented Jan 19, 2022

Part of the problem is that the coefficients you are using are too large to be losslessly represented by Float64. If you compare exthorner vs evalpoly(BigFloat(p2), Float64.((134790179652253125, -10960565081605263000, 157768535329832893644, -914113758588905038248, 2711772922412520971550, -4513690624987320777000, 4272845805510421639500, -2152114239059719935000, 448357133137441653125))), I find an error of 10e-11
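That is easy to confirm: the leading coefficient of u8 is well above 2^53 ≈ 9.0e15, so it cannot be stored exactly in a Float64 (a small check, not part of the PR):

c = 134790179652253125
Float64(c) == c        # false: c is not exactly representable
big(Float64(c)) - c    # the rounding error introduced just by storing c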

@heltonmc
Member Author

heltonmc commented Jan 19, 2022

It looks like the largest errors are coming from calculating coef.

z = x / v
zs = sqrt(1 + z^2)
n = zs + log(z / (1 + zs))
coef = exp(-v * n) * sqrt(pi / (2*v)) / sqrt(zs)

In some cases this produces only 12-13 correct digits. When this block is done in BigFloats it looks like we can get the right answer.

Edit:

After some more digging, the calculation of exp(-v*n) can give quite poor accuracy, over 2 ULP. When I do that calculation in BigFloat I get the right answer.

julia> exp(28.509909247209745)
2.4082205921371577e12

julia> exp(big"28.5099092472097450000000000000000000000000000000")
2.408220592137158798062714571311086664973698363776289190685153859052305376026371e+12

Also, using a hard-coded value for sqrt(pi / 2) is best (the computed value isn't correctly rounded). New code snippet here:

z = x/v
zs = hypot(1, z)
n = zs + log(z) - log1p(zs)
coef = 1.2533141373155003 * sqrt(inv(v)) * exp(-v*n) / sqrt(zs)

@oscardssmith
Member

oscardssmith commented Jan 20, 2022

If you want, I could update exthorner to use double-double coefficients which would recover the rest of the accuracy... That said, if there was a way to make this work without doing so, it would be a lot easier.

@heltonmc
Member Author

heltonmc commented Jan 20, 2022

OK, so the good news is that I'm almost positive the evalpolys are not the main factor. It's the early part calculating the coefficients. If I do those calculations in BigFloat I get almost the exact answer.

function besselk_large_orders(v, x::T) where T <: AbstractFloat
    x = BigFloat(x); v = BigFloat(v)
    z = x/v
    zs = hypot(1, z)
    n = zs + log(z) - log1p(zs)
    coef = 1.2533141373155003 * sqrt(inv(v)) * exp(-v*n) / sqrt(zs)
    p = inv(zs)
    p2  = v^2/fma(max(v,x), max(v,x), min(v,x)^2)
    p2 = Float64(p2); p = Float64(p); coef=Float64(coef); v = Float64(v)

    u0 = one(Float64)
    u1 = 1 / 24 * evalpoly(p2, (3, -5))
    u2 = 1 / 1152 * evalpoly(p2, (81, -462, 385))
    u3 = 1 / 414720 * evalpoly(p2, (30375, -369603, 765765, -425425))
    u4 = 1 / 39813120 * evalpoly(p2, (4465125, -94121676, 349922430, -446185740, 185910725))
    u5 = 1 / 6688604160 * evalpoly(p2, (1519035525, -49286948607, 284499769554, -614135872350, 566098157625, -188699385875))
    u6 = 1 / 4815794995200 * evalpoly(p2, (2757049477875, -127577298354750, 1050760774457901, -3369032068261860,5104696716244125, -3685299006138750, 1023694168371875))
    u7 = 1 / 115579079884800 * evalpoly(p2, (199689155040375, -12493049053044375, 138799253740521843, -613221795981706275, 1347119637570231525, -1570320948552481125, 931766432052080625,  -221849150488590625))
    u8 = 1 / 22191183337881600 * evalpoly(p2, (134790179652253125, -10960565081605263000, 157768535329832893644, -914113758588905038248, 2711772922412520971550, -4513690624987320777000, 4272845805510421639500, -2152114239059719935000, 448357133137441653125))
    return coef*evalpoly(-p/v, (u0, u1, u2, u3, u4, u5, u6, u7, u8))
end

So essentially we just convert the numbers to BigFloats and then convert back to Float64 before the evalpoly calls. We get almost correctly rounded answers:

julia> besselk_large_orders(100.0, 100.0)
7.617129630494086e-25

julia> besselk(100, ArbFloat(100.0))
7.6171296304940854161829684946539327153e-25

#for reference this is it without converting to BigFloats
julia> besselk_large_orders(100.0, 100.0)
7.617129630493957e-25

So the main piece to try to compute as exactly as possible is coef. Do you have any tips on computing this block of code with correct rounding?

    z = x/v
    zs = hypot(1, z)
    n = zs + log(z) - log1p(zs)
    coef = 1.2533141373155003 * sqrt(inv(v)) * exp(-v*n) / sqrt(zs)

@oscardssmith
Member

I've looked a little bit for better ways to compute this, but I think that you pretty much have to use higher precision. That said, DoubleDouble would be a decent amount faster than BigFloat for this.

@heltonmc
Member Author

heltonmc commented Jan 20, 2022

Testing this, the relative errors using pure Float64 are around 1e-14, peaking at about 9e-14. This is actually the same level of accuracy SpecialFunctions.jl provides (it took a while to realize I should be comparing solely to ArbNumerics for this). If we do those first calculations in BigFloat we can get relative errors down to ~9e-16. At that point we are limited by the small loss of precision from doing all the evalpoly calls in Float64. Note that if we do the entire routine in BigFloat we get relative errors down to ~1e-20, so more than enough precision for Float64.

Do we think doing the earlier computations in DoubleFloats is the best move? I'll go with probably, though I am not sure about the performance decrease. Is there an efficient way of doing this? I have noticed a performance hit just from making the conversion from Float64 to Double64.

@oscardssmith
Member

Theoretically, I would expect DoubleFloats to be roughly 4x slower, but I'd want to test to make sure. If that's the case, I think it shouldn't be too big a hit overall since the evalpolys are doing a fairly considerable amount of work.

@heltonmc
Member Author

heltonmc commented Jan 20, 2022

Edit: and for Float64 we actually only need 4-5 polynomials to get to 1e-14 relative errors. This roughly halves the computation time for f64 while the d64 is unchanged. So the f64 is 128x faster...

Hoping there is a much better way to do this because it's a considerable performance hit (128x). Here are the code snippets I'm testing.

function d64(v, x::T) where T <: AbstractFloat
    z = Double64(x) / v
    zs = hypot(1, z)
    n = zs + log(z) - log1p(zs)
    coef = 1.2533141373155003 * sqrt(inv(v)) * exp(-v*n) / sqrt(zs)
    p = inv(zs)
    p2  = v^2/fma(max(v,x), max(v,x), min(v,x)^2)

    p2_64 = T(p2)
    p64 = T(p)
    coef64 = T(coef)

    u0 = one(Float64)
    u1 = 1 / 24 * evalpoly(p2_64, (3, -5))
    u2 = 1 / 1152 * evalpoly(p2_64, (81, -462, 385))
    u3 = 1 / 414720 * evalpoly(p2_64, (30375, -369603, 765765, -425425))
    u4 = 1 / 39813120 * evalpoly(p2_64, (4465125, -94121676, 349922430, -446185740, 185910725))
    return coef64*evalpoly(-p64 / v, (u0, u1, u2, u3, u4))
end

function f64(v, x::T) where T <: AbstractFloat
    z = x/v
    zs = hypot(1, z)
    n = zs + log(z) - log1p(zs)
    coef = 1.2533141373155003 * sqrt(inv(v)) * exp(-v*n) / sqrt(zs)
    p = inv(zs)
    p2  = v^2/fma(max(v,x), max(v,x), min(v,x)^2)

    u0 = one(Float64)
    u1 = 1 / 24 * evalpoly(p2, (3, -5))
    u2 = 1 / 1152 * evalpoly(p2, (81, -462, 385))
    u3 = 1 / 414720 * evalpoly(p2, (30375, -369603, 765765, -425425))
    u4 = 1 / 39813120 * evalpoly(p2, (4465125, -94121676, 349922430, -446185740, 185910725))
    return coef*evalpoly(-p/v, (u0, u1, u2, u3, u4))
end

And here are the benchmarks

julia> @benchmark d64(100, x) setup=(x=rand()*100)
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  5.817 μs … 35.292 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.717 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.761 μs ± 482.594 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                         ▂▄▆█▄▃                
  ▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▃▅██████▇▅▄▃▃▂▂▂▃▃▂▃▂▂ ▃
  5.82 μs         Histogram: frequency by time        8.45 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark f64(100, x) setup=(x=rand()*100)
BenchmarkTools.Trial: 10000 samples with 982 evaluations.
 Range (min … max):  58.851 ns … 155.974 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     61.695 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   62.423 ns ±   3.852 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▃▇▄ ▃▃▄█▇▂▂▂▂▂▅▃▃▂▁▁▁▂▁▁                                    ▂
  ▄███▇█████████████████████▇▆▆▆▆▆▇▆▆▆▆▇▇▇▆▇▆▅▆▅▅▅▅▆▅▄▄▅▅▅▅▄▅▅ █
  58.9 ns       Histogram: log(frequency) by time      78.2 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Edit 2: For reference, SpecialFunctions.jl takes 216 ns for that benchmark while producing similar relative errors to the 62 ns version. Matlab has relative errors of about 3e-14 for this situation.

@oscardssmith
Member

function d64(v, x::T) where T <: AbstractFloat
    z = Double64(x) / v
    zs = sqrt(muladd( z, z, 1))
    n = zs + log(z/(zs+1))
    coef = Double64(1.2533141373155003,-9.164289990229585e-17) / sqrt(zs*v) * exp(-v*n)
    p = inv(zs)
    p2  = v^2/fma(max(v,x), max(v,x), min(v,x)^2)

    p2_64 = T(p2)
    p64 = T(p)

    u0 = one(Float64)
    u1 = 1 / 24 * evalpoly(p2_64, (3, -5))
    u2 = 1 / 1152 * evalpoly(p2_64, (81, -462, 385))
    u3 = 1 / 414720 * evalpoly(p2_64, (30375, -369603, 765765, -425425))
    u4 = 1 / 39813120 * evalpoly(p2_64, (4465125, -94121676, 349922430, -446185740, 185910725))
    poly = evalpoly(-p64 / v, (u0, u1, u2, u3, u4))
    return muladd(coef.hi, poly, coef.lo*poly)
end

is about 2.2x faster than your Double64 version. This is still a work in progress; I think there is a decent amount more that can be taken off of it.

@heltonmc
Member Author

heltonmc commented Jan 20, 2022

Looks great!! It is also just as fast for me to do the whole routine in Double64 (so the evalpoly calls are also done in the same precision). It might be best at that point to just offer the Float64 version with around 1e-14 relative error (with the potential for a future, more accurate exp calculation reducing that to 1e-15), and either a Double64 or Float128 routine with ~1e-33 relative error. The one thing I like about the Float128 routines is that these functions exponentially decay/increase, so having the larger exponent range would be nice. The opposite is true for besselj and bessely, as they just oscillate around zero (so Double64 would be great for those).

I'm not too keen on taking such a large performance hit for that added digit, though, considering this routine is such a base routine for generating all of besseli.

@oscardssmith
Member

It appears that all of the time is being spent in log(::Double64) and exp(::Double64), which are both 150x slower than the Float64 versions. This should be totally fixable via a PR to DoubleFloats.

@oscardssmith
Member

I'm less convinced an easy speedup exists now. MultiFloats' exp is 3x faster, but it also has about 5x more error on the random example I tested. ArbNumerics is also similar. There is, however, some hope.

@oscardssmith
Member

JuliaMath/DoubleFloats.jl#136 is 10x faster. I'm not sure there is as easy a path to making log faster.

@heltonmc heltonmc marked this pull request as ready for review January 21, 2022 05:29
@heltonmc
Member Author

That sounds good and thanks for that PR! I'm thinking I will need to do some more testing for higher precision (either Double64 or Float128) to see how many terms etc. we will need to include. It could be that other methods are better in this domain but I will benchmark and test. For now, I think this is ready to go as this PR is getting quite long already.

@heltonmc
Member Author

I've made a few changes after some testing. It appears the U polynomials are great and converge fairly rapidly for large orders. However, they are not easily computed with limited floating-point digits. Thus, they remain a fast way to compute large orders with about 1e-14 relative error. This is consistent with the Fortran routines provided by SpecialFunctions.jl.

Now I will note two concerns where the accuracy may suffer. The first is large orders with small arguments: for nu > 100 and x < 5 the expansion is less accurate, about 1e-12 relative error. I think this is unacceptable, and unfortunately we should have another branch to fall back on for small arguments. In that regime the function is very easily computed with the power series, which converges rapidly.

The second concern is what to do in the middle range of nu between 2 and 100. For besselk this is not a problem: we are using besselk0 and besselk1, which have errors of about 1 ULP, followed by forward recurrence, so all of the higher orders have similar accuracy to wherever we start until we switch to the more efficient asymptotic expansion.

For besseli this is a concern, though. We have to start at higher orders and use downward recurrence. Currently we are using the asymptotic expansion and recursing down, but we are then limited by the accuracy of the asymptotic expansion, 1e-14, for all orders. This is also pretty slow and not very accurate.
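For reference, the downward recurrence in question is the three-term relation I_{nu-1}(x) = I_{nu+1}(x) + (2 nu / x) I_nu(x); a minimal sketch, assuming the asymptotic expansion supplies the two starting values at orders nu_top and nu_top + 1, looks like:

function besseli_down_recurrence(nu::Integer, x, nu_top::Integer, i_top, i_top_plus1)
    i_high, i_low = i_top_plus1, i_top     # I_{nu_top+1}(x), I_{nu_top}(x)
    for n in nu_top:-1:nu+1
        i_high, i_low = i_low, i_high + (2n / x) * i_low   # I_{n-1} = I_{n+1} + (2n/x) I_n
    end
    return i_low
end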

There are two ways forward that I see: for besseli we use either Miller's algorithm or a continued-fraction approach for the middle orders. I currently have the continued fraction coded up and producing good errors, so I will break this out and implement the continued-fraction approach in a separate PR for easier review.

I think we should get the algorithms as accurate as we can first, and then I can benchmark the different algorithms for accuracy and speed as we go through iterations. Though testing all algorithms is definitely time-prohibitive.

@oscardssmith
Member

Wow, this gets incredibly complex. Have you seen https://fredrikj.net/blog/2021/12/computing-special-functions-using-numerical-integration/ by the way? It suggests that at least for arbitrary precision, numerical integration can be a fast approach. I'm not sure if it is worth it for the type of precision we are looking at here.

@heltonmc
Member Author

heltonmc commented Jan 22, 2022

Yes, I think I'm going to make schematics for all these branches to show where we are switching over. I guess they are called special functions for a reason. Creating SIMDable functions might be tricky, but if you need these for a vector of nus you can fall back on just using recurrence, which is exceptionally fast. The challenge is when you want a single random x or nu: how do you best accommodate that? I'm also not super pleased with having the relative errors be potentially different over the argument values (thinking towards derivatives/auto-diff).

But yes, I think for Float32 and Float64 the current approaches will be the fastest. For Float128 or higher we may need to look towards numerical integration. I actually have a lot of experience with the Laplace-type integrals these functions can be expressed as. There is also the exp-arc method (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.120.6055) that I've been wanting to try. Edit: it has the potential for no branches.

I think for now getting good methods for Float32 and Float64 for all of besselk, besseli, besselj, and bessely will be the focus. At that point it could be a good time for a 0.1.0 release?

@oscardssmith
Member

That plan sounds good.

@heltonmc heltonmc merged commit 8ebb0f2 into master Jan 22, 2022
@heltonmc heltonmc deleted the asymptotic-expansion branch January 22, 2022 22:09