
Asymptotic expansion for large arg and small-ish order #48

Merged
merged 7 commits into JuliaMath:master on Sep 14, 2022

Conversation

cgeoga
Contributor

@cgeoga cgeoga commented Aug 19, 2022

Okay, here's a start to an asymptotic expansion. This one is probably again going to be a bit controversial to you guys, because once the SIMD-powered uniform asymptotic expansion goes live this will probably only be faster for nu = 30 or smaller, say. But then, that is the entirety of the spatial statistics applications, so in my biased opinion it's probably worth a branch. I made the current cutoff 20, though, because for the largest args the Float32 tests are failing. Not entirely sure what to make of that.

I haven't formatted things the way you like because I assume there will be plenty of iteration here. But even without the exponential improvement I'm getting pretty good accuracies here, and meaningfully better speeds for small enough v.

Curious to hear what you guys think!

I should also say, I recognize that there are a lot of redundant checks at this point about the size of x and nu and stuff. So I know that will need a little work.

@codecov

codecov bot commented Aug 19, 2022

Codecov Report

Merging #48 (f5728a8) into master (86da54f) will increase coverage by 0.03%.
The diff coverage is 95.83%.

@@            Coverage Diff             @@
##           master      #48      +/-   ##
==========================================
+ Coverage   93.76%   93.79%   +0.03%     
==========================================
  Files          18       18              
  Lines        1491     1515      +24     
==========================================
+ Hits         1398     1421      +23     
- Misses         93       94       +1     
Impacted Files      Coverage Δ
src/besselk.jl      98.42% <95.45%> (-0.39%) ⬇️
src/constants.jl    80.00% <100.00%> (+0.68%) ⬆️


@heltonmc
Member

Awesome!! Thanks for contributing.

Just some general comments. How does the computational performance depend on the order? It seemed like there really wasn't much difference between 10 and 12 for the order? I think the real question is going to be when we want to employ this. I checked the accuracies against Arb, and here are the plots of the relative errors for a couple of different orders.
[Screenshots: relative error plots against Arb for several expansion orders]

Anything below order 10 didn't give great results, so that is definitely a good cutoff. I'm a bit surprised how good they look even for largish orders... How did you determine the cutoff you used?

@heltonmc
Member

And here are the comparisons to the master branch of Bessels.jl and to SpecialFunctions.jl, with order 10. (SpecialFunctions.jl can't give us higher orders because it errors.)
[Screenshots: relative error comparisons against Bessels.jl master and SpecialFunctions.jl at order 10]

@heltonmc
Member

Oh, sorry. I commented without looking through the source code carefully. The higher orders are filled using recurrence (I was baffled by that accuracy for a second).

Ok - can I think about where to make the cutoffs for a bit? Are you proposing that this method should be used for lowish orders (v < 20) and large arguments? It seems like it can give almost an extra digit of accuracy in that domain, so I am definitely in favor of including it there. But I think up to what order definitely matters. Recurrence can be slow. We also need to be careful about avoiding underflow for very large arguments, because the lower orders might underflow to zero while the higher orders are not zero.
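For reference, the recurrence in question is the standard upward relation K_{v+1}(x) = K_{v-1}(x) + (2v/x) K_v(x), which is stable in the forward direction for K. A minimal sketch of filling higher orders that way (the function name and interface are just for illustration, not the PR's code):

    # Illustrative only: climb from (K_{nu-1}(x), K_nu(x)) up to (K_{nu_end-1}(x), K_{nu_end}(x))
    # using K_{nu+1}(x) = K_{nu-1}(x) + (2nu/x) * K_nu(x).
    function k_up_recurrence(kvm1, kv, nu, nu_end, x)
        while nu < nu_end
            kvm1, kv = kv, muladd(2 * nu / x, kv, kvm1)
            nu += 1
        end
        return kvm1, kv
    end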

@cgeoga
Contributor Author

cgeoga commented Aug 19, 2022

Wow, that's a detailed investigation! Thanks for taking such a thoughtful look into it.

Absolutely take whatever time you need to think about cutoffs and stuff. I would say that v < 20 is my initial proposal for the cutoff, for precisely the reason you say. In my fiddling around, the timings for Float64 were better even with v < 30, but the underflow happened for Float32 when v == 20 in the tests, so that's what led me to that value.

As another note on underflow: if it continues to be an issue, then maybe there should actually be a check of low < x < high instead of just low < x. The uniform asymptotic expansion also has the same nice theory per the NIST book, and if x is so large that we're worried about underflow then I would have to think both expansions will be accurate enough, so it seems sensible to me to prefer the one that is less likely to underflow.

EDIT: oh, and I just picked 10 out of a hat with the expectation that we'd revisit a bit based on where you wanted the lower cutoff for x to be. Or maybe even use 16 if 35 < x < 50 and then 10 if 50 <= x or something, if you don't think that's too many branches. I think you will have better judgment on that kind of decision than me.
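To make that window idea concrete, something like this (the constant values below are untested placeholders, not proposed cutoffs):

    # Hypothetical only: use this expansion on a window of x and let the uniform
    # expansion take over above the upper bound, where underflow is the bigger risk.
    const ASEXP_X_LOW  = 35.0
    const ASEXP_X_HIGH = 1.0e4
    besselk_asexp_cutoff(nu, x) = (nu < 20.0) && (ASEXP_X_LOW < x < ASEXP_X_HIGH)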

src/besselk.jl Outdated
    function besselk_asymptoticexp(v, x::T) where T
        _besselk_asymptoticexp(v, float(x), 10)
    end
    besselk_asexp_cutoff(nu, x) = (nu < 20.0) && (x > 25.0)
Member

Should we have different cutoffs for Float64 and Float32?

Contributor Author

I thought about that, but I don't actually have much of an intuition for the ways in which they should be different. I'll play around with it a bit.

Contributor Author

I added a const ASEXP_CUTOFF in the consts file. For now they're both 35, which is what I thought I had typed in this function actually. But I'm sort of still thinking of that value as a placeholder.
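For what it's worth, if the values do end up differing, per-type cutoffs could simply dispatch on the argument type, e.g. (names and values below are placeholders, not settled choices):

    const ASEXP_CUTOFF_F64 = 35.0
    const ASEXP_CUTOFF_F32 = 35.0f0   # placeholder; may end up different from the Float64 value
    besselk_asexp_cutoff(nu, x::Float64) = (nu < 20.0)  && (x > ASEXP_CUTOFF_F64)
    besselk_asexp_cutoff(nu, x::Float32) = (nu < 20.0f0) && (x > ASEXP_CUTOFF_F32)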

@heltonmc
Member

heltonmc commented Aug 19, 2022

I'm wondering if the order could be determined by the ratio x/nu... so that it could be chosen at runtime with no branches? This would obviously require some care, but I don't think it will affect performance too much (I doubt that loop is unrolling). This would have several advantages if such a relation could work out...

The underflow we can probably worry about at the end here. I usually do the calculation in Float64 anyway to the desired tolerance, because with these asymptotic expansions we are usually dealing with really large or small numbers that are easy to underflow... but yes, I think for very large arguments they are not being filled with recurrence....

I think the first step here is to determine where the expansion can be applied directly, without the use of recurrence. What is the maximum order we can use here? I see some notes on this, but can we use high orders here?
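Just to sketch the runtime-order idea (the nu^2/x scaling and the constants below are illustrative guesses, not something that has been validated):

    # Hypothetical: pick the truncation order from nu and x at runtime instead of
    # hardcoding it; the scaling and the clamp bounds are placeholders.
    asexp_order(nu, x) = clamp(ceil(Int, 10 + 2 * nu^2 / x), 10, 60)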

Edit: here it is using an order of 60 in the asymptotic expansion without recurrence. I have widened the range a bit. You can see my favorite bug in ArbNumerics popping up but we should be able to extract the cutoffs pretty easily from these....

[Screenshot: relative errors using a 60-term expansion without recurrence, over a widened range]

@cgeoga
Contributor Author

cgeoga commented Aug 19, 2022

When you say order 60, do you mean the asymptotic expansion is order 60? Or that for \nu < 60 it is not doing recurrence?

@heltonmc
Member

Asymptotic expansion order 60 (nu is on the y axis).

@cgeoga
Contributor Author

cgeoga commented Aug 19, 2022

I thought so, but wow. I've never pushed the expansion order anywhere near that high. There is an 8^j term in there, and 8^60 is a big number. I totally would have thought that would be trouble. Is it even faster than the uniform expansion at that value?

@heltonmc
Member

Quickly looking at it: you are using a float value, so that is only 1.5....e54, which is not unreasonable when we could be calculating numbers much larger than that. And computationally it is fine because you aren't computing 8^60 all at once but through the loop. Overall, using an order of 60 took about 90ns on my computer, which seemed very reasonable as a nice upper cutoff. The uniform asymptotic expansion will be faster (how much faster will depend on how much we can get it to vectorize in the other PR), but this approach can produce an extra digit of accuracy, getting nearly full precision, so I think it being just a little bit slower in some domains warrants its use!
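(For reference: 8^60 = 2^180 ≈ 1.53e54, indeed far below the Float64 overflow threshold of roughly 1.8e308.)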

src/besselk.jl Outdated
    for _ in 1:order
        # add to the series:
        term_v = ak_numv/(factj*_z*eightj)
        ser_v += term_v
Member

Can we use a simple convergence check here? We may not want to calculate the relative tolerance because that will require a division, but we could just break here for term_v < tol? We need to verify that term_v is always greater than one for this to work, though; if v < 1 it looks like we start somewhere close to one.

This would avoid having to dynamically determine the order. I don't think it will add too much performance cost, and we only need to check one variable.

Contributor Author

I wouldn't have thought this would be very helpful considering that the expansion diverges for all finite x, but with some print statements and stuff it looks plausible. What would the tol be? Presumably it does need to depend on v and x somehow, though, right?

Member

Yes that is true. However, we can confine the range of inputs where we employ this expansion and test it pretty rigorously. It may diverge after a point but ideally our tolerance breaks before that and we can control for this I think.

Well - we could just check for relative tolerance, but that requires a division. I think tol can just be eps, but it really depends on what term_v starts out at; if it's larger than 1 then a simple absolute error check will work. Let's think through it, but this would make it easier than determining the number of terms from x and v and then just running for that amount. Though, if there is a straightforward relation that will be easy too.
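To make the break criterion concrete, here is a minimal sketch of the series loop with the absolute-tolerance exit (the function name and tol = eps(T) are assumptions for illustration, not settled choices):

    # Unscaled asymptotic series for K_v at large x, stopping once the newest term
    # can no longer move the sum at working precision.
    function _besselk_asexp_series_sketch(v, x::T, maxorder) where T
        fv2  = 4 * v * v
        term = one(T)
        ser  = term
        for k in 1:maxorder
            offset = 2k - 1
            term  *= (fv2 - offset * offset) / (8 * x * k)   # term is now a_k(v) / x^k
            ser   += term
            abs(term) <= eps(T) && break                     # the convergence check discussed above
        end
        return ser
    end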

@cgeoga
Contributor Author

cgeoga commented Aug 19, 2022

I think if you really like the crushing accuracy, maybe we should take the exponential improvement part into consideration. I didn't include it in this because it calls the upper incomplete gamma function, and my hand-implementation of that requires the exponential integral, which would thus require rewriting it or lifting it from SpecialFunctions.jl. But I wonder if one or two terms of the correction wouldn't be worth it. What's awkward is that it uses reverse indices from the a_k coefficients (see NIST 10.41.13), so it's really annoying to do quickly.

EDIT: I think I need to call it a day, but in case you are interested in tinkering in the meantime, here is my upper incomplete gamma for when the first arg is a negative integer. I think what might actually be easy is to use the last a_k and z^k and add a few terms from the top of the correction sum, and maybe iterate backwards to just add a few more terms. But I'll get to looking at it this weekend, so no pressure.

Co-authored-by: Michael Helton <[email protected]>
@cgeoga
Contributor Author

cgeoga commented Aug 26, 2022

Sorry for radio silence here. Still thinking about the exponential improvement thing. It might be worthwhile, but it would take a pretty careful implementation. The expression just uses powers of x and the a_k vectors, which we already have, so I think what would ultimately make the most sense is trying to evaluate all of it at once. But doing that just requires a bit more time and focus than I have had this week. I'll also be away next week for a conference, but the week after I should be able to get to this.

EDIT: I clicked "commit suggestion" to your suggested speedup and I guess something is broken now. I haven't looked into it a ton yet, but I will at some point.

@heltonmc
Member

Not a problem! I think the exponentially improved version is definitely worthwhile. Often, including the remainder piece improves the convergence properties of the later terms, but I'm not sure it actually improves convergence for the earlier terms. I would imagine this is great for higher precision calculations (Double64), which would allow the expansions to be used in a larger range since these continue the convergence of the later terms... for the single and double precision calculations I'm not sure what the tradeoffs will be, as they usually are already pretty quickly converging. Unless we want to apply the expansion in a larger window....

Yeah, it looks like my suggestion didn't actually delete the old code that I was trying to replace, so you now have a bunch of hanging code with my suggestion alongside the old implementation. I'll be able to look more carefully at this next week. I would be willing to merge just the asymptotic expansions, and then if we want to get to the exponentially improved version we could do that separately in the future. Because I think the regular expansions are definitely worth it for now... (once we fix cutoffs and stuff)

@heltonmc
Member

heltonmc commented Sep 9, 2022

I'm kind of wondering if we should make this code as similar as possible to the asymptotic expansion for besseli, which is here

Bessels.jl/src/besseli.jl

Lines 300 to 325 in 8eb7c77

    function besseli_large_argument(v, x::T) where T
        a = exp(x / 2)
        coef = a / sqrt(2 * (π * x))
        return T(_besseli_large_argument(v, x) * coef * a)
    end

    besseli_large_argument_scaled(v, x::T) where T = T(_besseli_large_argument(v, x) / sqrt(2 * (π * x)))

    function _besseli_large_argument(v, x::T) where T
        MaxIter = 5000
        S = promote_type(T, Float64)
        v, x = S(v), S(x)
        fv2 = 4 * v^2
        term = one(S)
        res = term
        s = -term
        for i in 1:MaxIter
            offset = muladd(2, i, -1)
            term *= muladd(offset, -offset, fv2) / (8 * x * i)
            res = muladd(term, s, res)
            s = -s
            abs(term) <= eps(T) && break
        end
        return res
    end

This is just the basic NIST form http://dlmf.nist.gov/10.40.E2 right? So keeping it as similar to the besseli code (which I think could use a few edits) would look like...

function besselk_large_argument(v, x::T) where T
    a = exp(-x / 2)
    coef = a * sqrt(pi / 2x)
    return T(_besseli_large_argument(v, x) * coef * a)
end

besselk_large_argument_scaled(v, x::T) where T =  T(_besselk_large_argument(v, x) * sqrt(pi / 2x))

function _besselk_large_argument(v, x::T) where T
    MaxIter = 5000
    S = promote_type(T, Float64)
    v, x = S(v), S(x)

    fv2 = 4 * v^2
    term = one(S)
    out = term
    for i in 1:1000
        offset = muladd(2, i, -1)
        term *= muladd(offset, -offset, fv2) / (8 * x * i)
        out = muladd(term, term, out)
        abs(term) <= eps(T) && break
    end
    return out
end
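
For reference, the target expansion here is DLMF 10.40.2 with the coefficients from DLMF 10.17.1:

    K_\nu(z) \sim \sqrt{\frac{\pi}{2z}}\, e^{-z} \sum_{k=0}^{\infty} \frac{a_k(\nu)}{z^k},
    \qquad
    a_k(\nu) = \frac{(4\nu^2 - 1^2)(4\nu^2 - 3^2)\cdots(4\nu^2 - (2k-1)^2)}{8^k \, k!},
    \quad a_0(\nu) = 1.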

I'm not sure if that's even faster, because obviously it has that convergence check we discussed, but it does also take into account that we need the scaled version as well.

@heltonmc
Member

heltonmc commented Sep 9, 2022

But again, we do need to decide how we want to handle possibly using recurrence, or where/if we want to employ that. That possibility isn't available in the besseli scheme...

Edit: We should probably benchmark a little more to see which version is faster but quickly looking at

function _besselk_large_argument(v, x::T) where T
    MaxIter = 5000
    S = promote_type(T, Float64)
    v, x = S(v), S(x)

    fv2 = 4 * v^2
    term = one(S)
    out = term
    for i in 1:1000
        offset = muladd(2, i, -1)
        term *= muladd(offset, -offset, fv2) / (8 * x * i)
        out = muladd(term, term, out)
        #abs(term) <= eps(T) && break
    end
    return out
end
    function _besselk_as_pair(v, x::T) where T
        fv = 4*v*v
        z = 8*x
        knu = one(T)
        ak_nu = fv - 1
        for i in 1:1000
            term_v = ak_nu / z
            knu += term_v
            a = muladd(2, i, one(T))^2
            z *= 8*x*(i + one(T))
            ak_nu *= fv - a
        end
        return knu
    end

It looks like _besselk_large_argument is a bit faster. Both should give similar results.
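For anyone who wants to reproduce the timing comparison, a hypothetical harness (the argument values below are arbitrary picks, not the ones used for the numbers above):

    using BenchmarkTools
    @btime _besselk_large_argument(10.0, 50.0)
    @btime _besselk_as_pair(10.0, 50.0)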

@cgeoga
Contributor Author

cgeoga commented Sep 9, 2022

A good point---it should presumably be possible to re-write at least the series part in one function that gets re-used by both besseli and besselk considering that the difference is just an alternating sign or a fixed one. This function where you pass in a 1 or -1 as a signswap variable seems to have no speed cost and could be re-used:

function _besselik_large_argument_series(v, x::T, signswap) where T
    MaxIter = 5000
    S = promote_type(T, Float64)
    v, x = S(v), S(x)
    fv2 = 4 * v^2
    term = one(S)
    res = term
    s = signswap * term   # +1 for besselk (all terms added), -1 for besseli (alternating signs)
    for i in 1:MaxIter
        offset = muladd(2, i, -1)
        term *= muladd(offset, -offset, fv2) / (8 * x * i)
        res = muladd(term, s, res)
        s *= signswap
        abs(term) <= eps(T) && break
    end
    return res
end

Assuming I put that into this PR, where would you like this function that serves both besseli and besselk to go?

It would be nice to ensure that the compiler somehow knows that all signswap will do is swap the sign, and presumably then it won't use a multiplication, but that's sort of lower level than my pay grade.
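One hedged way to do that (not something in this PR) would be to bake the sign in at compile time with Val dispatch, so each wrapper gets its own specialized method and the multiply by ±1 can be constant-folded:

    # Sketch only: same series kernel as above, but the sign is a compile-time constant.
    function _besselik_large_argument_series(v, x::T, ::Val{signswap}) where {T, signswap}
        MaxIter = 5000
        S = promote_type(T, Float64)
        v, x = S(v), S(x)
        fv2 = 4 * v^2
        term = one(S)
        res = term
        s = signswap * term   # +1 for besselk, -1 for besseli
        for i in 1:MaxIter
            offset = muladd(2, i, -1)
            term *= muladd(offset, -offset, fv2) / (8 * x * i)
            res = muladd(term, s, res)
            s *= signswap
            abs(term) <= eps(T) && break
        end
        return res
    end

    _besselk_series(v, x) = _besselik_large_argument_series(v, x, Val(1))
    _besseli_series(v, x) = _besselik_large_argument_series(v, x, Val(-1))

That said, with one-line wrappers passing a literal 1 or -1, constant propagation will likely handle this anyway, so the Val version may be unnecessary.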

With regards to recurrence, I originally put that in there because it seemed to help at the (approximation) orders that I expected to use, like 5-10. If you're cranking it all the way up to 60, I wouldn't be surprised if the recurrence didn't actually help all that much with accuracy and could just be removed.

I also think I'd vote for merging something like this without the exponential improvement for now. I am pretty concerned that even with the fastest possible expintx and stuff it would barely be faster, and the accuracy here already seems good enough.

@heltonmc
Member

heltonmc commented Sep 9, 2022

I agree on just merging something of this form for sure! This function can just go into the besselk file, and then we can call it in the appropriate file with the scaled versions etc. But yes, let's try to keep it as simple as possible. We will need to define a cutoff where we can employ this, though, because including the higher terms leaves the possibility that the series won't ever converge and the terms just start growing. The signswap was about a 3% difference for me, which I am fine with, but that code is just slightly different than the original with the extra assignment.

Yes, honestly it might be best to not rely on recurrence here, because I think this will be able to be employed in a larger range given the higher orders.

@cgeoga
Contributor Author

cgeoga commented Sep 9, 2022

Makes sense. Tinkering around here, and there's actually something I don't get if you don't mind. In your _besselk_large_argument, your last line in the loop is out = muladd(term, term, out). I don't get that---shouldn't it just be out += term?

As written, your version and my version for the series don't give the same answer, and when I sub in your version then all the tests fail. But they don't fail for besseli, so I assume I'm just somehow not scaling things the same?

If I swap in a modified form of your code that looks like this:

function _besselk_as_new(v, x::T) where T
    fv   = 4*v*v
    term = one(v)
    out  = term
    for i in 1:[some iter]
      offset = muladd(2, i, -1)
      term  *= muladd(offset, -offset, fv)/(8*x*i)
      out   += term
    end
    return out
end

I get the same output now. Am I missing something?

@heltonmc
Member

heltonmc commented Sep 10, 2022

Looking at my original code... I didn't convert the kernel right, so it was still using the besseli version when I tested it real quick.

function besselk_large_argument(v, x::T) where T
    a = exp(-x / 2)
    coef = a * sqrt(pi / 2x)
    return T(_besselk_large_argument(v, x) * coef * a)
end

besselk_large_argument_scaled(v, x::T) where T =  T(_besselk_large_argument(v, x) * sqrt(pi / 2x))

function _besselk_large_argument(v, x::T) where T
    MaxIter = 5000 
    S = promote_type(T, Float64) 
    v, x = S(v), S(x) 
 
    fv2 = 4 * v^2 
    term = one(S) 
    res = term 
    s = term 
    for i in 1:MaxIter 
        offset = muladd(2, i, -1) 
        term *= muladd(offset, -offset, fv2) / (8 * x * i) 
        res = muladd(term, s, res) 
        abs(term) <= eps(T) && break 
    end 
    return res 
end

This is the appropriate form. So compared to the original besseli implementation, it is really just the sign flipping of s that needs to change.

@cgeoga
Contributor Author

cgeoga commented Sep 13, 2022

Ah, I see. I really need to get more into this muladd/fma game. It's crazy to me how much of an improvement that makes. I always thought that if I just wrote the most obvious low-level code possible, the compiler could do that kind of fusion better than I could with manual code transformations or whatever it's called. But wow. It's surprising to me that you have s \equiv 1 and res = muladd(term, s, res) and prefer that to res += term (the one from my fix in the previous comment), because I would think that the extra multiplication has to cost something. But I guess that's the magic of FMA.
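As a tiny aside on that point: muladd(a, b, c) computes a*b + c and is allowed to fuse into a single fused multiply-add instruction where the hardware supports it, so with s == 1 the multiply is effectively folded into the add. A toy REPL check (just an illustration):

    term, s, res = 0.125, 1.0, 2.0
    muladd(term, s, res) == res + term   # true: with s == 1 both forms round identically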

Anyways, I pushed up that version here. I was thinking about the code duplication question a bit and started doubting myself on how useful it actually was to merge the series for besselk and besseli behind a sign-flipping indicator argument. It's not really much code to duplicate, and it is perhaps a bit more readable this way, so for the moment I've committed without merging the two functions. Let me know if you'd like me to change that.

From some earlier skims I gather that the last thing you'd like to discuss is cutoffs to use this. The current one that I put in here of x > 35 does pass all of the tests. I know you are interested in some function of nu and x as a cutoff, but maybe it is worthwhile to merge this as-is and work on that as a separate PR? NIST 10.40.10 gives an error bound that might ostensibly be helpful in coming up with a smarter cutoff, but it does not seem like a trivial project to me.

@heltonmc
Member

Yes - I agree. Keeping the functions separate is fine with me. And yes, the cutoff could be something to discuss further, but no matter what it has a quadratic dependence. I will look further into it in the future, but it does also keep the total computational time bounded by putting an upper bound on nu.

I think it's definitely time to get this merged.

@heltonmc heltonmc merged commit e6f0ac3 into JuliaMath:master Sep 14, 2022