Variable order in large-order expansions (at least in besselk) #67

Open
cgeoga opened this issue Nov 14, 2022 · 11 comments

@cgeoga
Contributor

cgeoga commented Nov 14, 2022

I finally got around to cleaning up the uniform asymptotic expansion code in BesselK.jl and I ended up with a solution that is within a few ns of Bessels.besselk_large_orders while still allowing a user to pick the order of the expansion. I'm wondering if any part of this new work would be interesting as a PR here. There are a few parts to it:

  1. Auto-generating the U_k polynomials, which I do with this file. The result is an automatically generated file that looks like this, with a struct for each polynomial that trims out the zero coefficients and has a slightly specialized eval method (a rough illustrative sketch of the zero-trimming idea is at the end of this comment). The generation uses Polynomials.jl, but once the file is generated you don't need it, so it would not be a new dependency.

  2. The actual equivalent to Bessels.besselk_large_orders now looks like this (file ref):

@generated function _besselk_vz_asv(v, z, ::Val{N}) where{N}
  quote
    ez  = sqrt(1+z*z)+log(z/(1+sqrt(1+z*z)))
    pz  = inv(sqrt(1+z*z))
    out = sqrt(pi/(2*v))/sqrt(sqrt(1+z*z))
    (ser, sgn, _v) = (zero(z), one(z), one(v))
    evaled_polys = tuple($([:($(Symbol(:uk_, j, :_poly))(pz)) for j in 0:(N-1)]...))
    Base.Cartesian.@nexprs $N j -> begin
      ser += sgn*evaled_polys[j]/_v
      sgn *= -one(z)
      _v  *= v
    end
    exp(-v*ez)*out*ser
  end
end

_besselk_asv(v, z, maxorder) = _besselk_vz_asv(v, z/v, maxorder)

This may be objectionable because it is a generated function that programmatically builds and uses function names. But it does work, and the @nexprs is a way to be completely sure the loop unrolls. I have a feeling that with some smarter inner-loop tweaks this could at least match the current performance.

There are a few potential advantages to this:

  1. There may be some performance on the table from more gradually lowering the order of the approximation.
  2. The function that generates the U_k file uses Polynomials.jl, so one could generate exact rational representations of all the U_k polynomial coefficients and potentially use those to get more accurate coefficients for BigFloat or whatever.

I haven't really explored either of those options. But if they're potentially interesting, I'm happy to prepare a PR that puts in the generated U_k polynomials. I'm sure other people here will have thoughts about the custom struct and methods, so I'm happy to discuss all aspects of it if there is any interest in the first place.
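For concreteness, here is roughly what the zero-trimming idea in point 1 looks like. This is only an illustrative sketch with made-up names, not the actual generated file: each U_k(p) is p^k times an even polynomial in p, so only the nonzero coefficients and the leading power need to be stored.

struct TrimmedUkPoly{N,T}
  coefs::NTuple{N,T}   # nonzero coefficients of the even part, ascending in p^2
  leading_power::Int   # U_k(p) = p^leading_power * (even polynomial in p)
end

# specialized eval: the trimmed-out zero coefficients are never touched
(u::TrimmedUkPoly)(p) = p^u.leading_power * evalpoly(p*p, u.coefs)

For example, U_1(p) = p/8 - 5p^3/24 would be stored as TrimmedUkPoly((1/8, -5/24), 1).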

@heltonmc
Member

heltonmc commented Nov 15, 2022

Thank you for writing this up, Chris! Very interesting, and I quite enjoyed diving into this code. I'll have to admit my first reaction was to wonder how many users would want to use this feature or even know how to... as in, would they know how many terms to consider in this expansion, and how would this feature play with the other algorithms for the different branches? I'd assume that _besselk_asv could have an order argument, but probably not besselk?

It might help to give some more context here, as I don't think I've really documented this. With the SIMD PR, the U-polynomials are becoming a small fraction of the overall routine. I'll try to show how much with these benchmarks:

julia> @btime besselk(50.0, x) setup=(x=rand()+50.0)
  75.063 ns (0 allocations: 0 bytes)
1.9277371284156387e-13

julia> @btime Bessels.besselk_large_orders(50.0, x) setup=(x=rand()+50.0)
  57.477 ns (0 allocations: 0 bytes)
2.1893935965831894e-13

julia> @btime Bessels.Uk_poly_Kn(1.2,50.0, x,Float64) setup=(x=rand()*5.0)
  14.679 ns (0 allocations: 0 bytes)
1.0172531392362487

So there are a couple of things going on there. Just getting to the besselk_large_orders method takes a ridiculously long time: 18 ns of essentially no work, just moving through the different branches, is less than ideal.

The U-polynomial function, though, only takes about 20% of the total routine and about 25% of the besselk_large_orders method. That is of course a significant fraction of the computation time, but even if we drastically reduce it we are still limited by the rest of the function.

This effect is apparent in the Float32 routine, which uses only 5 terms in the U-polynomials instead of the 10 terms in the Float64 version shown above.

julia> @btime Bessels.Uk_poly_Kn(1.2,50.0, x,Float32) setup=(x=Float32(rand()*5.0))
  7.913 ns (0 allocations: 0 bytes)

So using 5 terms instead of 10 cuts the U-polynomial time roughly in half, but in absolute terms that is only about 7 ns. We can benchmark the top-level function for Float32 to see the difference:

julia> @btime besselk(50.0f0, x) setup=(x=Float32(rand()+50.0))
  65.242 ns (0 allocations: 0 bytes)

That is about 10 ns (~13%) faster than the Float64 implementation, mainly due to the 7 ns faster U-polynomial evaluation plus doing some of the computations in lower precision. So it is definitely noticeable.

So I think I made the decision to just use a fixed number of U-polynomials for both Float32 and Float64 primarily because it was easier, but also because at the time I didn't see that significant a jump in computation time from using 5 terms instead of 10. But we could absolutely think about adding a check in the Float64 implementation to use a variable number of terms. I do that for the regular Bessel functions, but that's because that expansion is slower to converge, so there was a large difference going from 20 terms to 10 terms.
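Just to sketch what such a check could look like (the cutoff and the function name here are made-up placeholders, and it reuses the Val-based _besselk_asv entry point from the first comment, not the current Bessels.jl internals):

# sketch only: pick a term count from the order, then dispatch to a fixed-N kernel
function besselk_large_orders_variable(v, x)
    # placeholder cutoff; larger orders need fewer terms, but the value would need tuning
    if v > 100
        return _besselk_asv(v, x, Val(5))
    else
        return _besselk_asv(v, x, Val(10))
    end
end

Writing the branch out explicitly like this keeps each call type stable, since each Val(N) body is compiled separately.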

Steps forward could definitely be improving the other parts of the routine so that any performance increase in the U-polynomials would be reflected at the top of the routine. But the big thing for me is definitely having a higher-precision implementation! I did try to tackle that by deriving the exact coefficients. Ideally we could have a routine that works in Double64, Float128, or BigFloat precision. I think this would be a huge advantage!

@cgeoga
Contributor Author

cgeoga commented Nov 15, 2022

Wow. It blows my mind that the branching is so much of the cost of that function call. How does one even go about debugging or trying to improve something like that? I certainly don't have the impression that all branches cost 18 ns, so what is going on with these ones?

Given that, I totally see your point that just fixing the number of polynomials is probably the way to go. If extra branches really are that expensive, it seems clearly preferable to spend those CPU cycles getting a more accurate answer rather than spending them deciding which slightly cheaper path to take.

With regard to computing the polynomials in higher precision, the function to get them is pretty simple with NIST 10.41.9:

using Polynomials

function Uk_polynomials(T, max_order)
  P0   = Polynomial([one(T)])
  out  = [P0]
  mul_int  = Polynomial([one(T), zero(T), T(-5)])
  mul_frnt = Polynomial([zero(T), zero(T), one(T), zero(T), -one(T)])/T(2)
  for j in 1:max_order
    Pjm1 = out[end]
    Pjm1_int = integrate(mul_int*Pjm1)/T(8)
    Pjm1_drv = derivative(Pjm1)
    newP     = mul_frnt*Pjm1_drv + Pjm1_int - Pjm1_int(zero(T))/T(8)
    push!(out, newP)
  end
  out
end

and then with that function in scope you could do, say,

julia> Uk_polynomials(Rational{BigInt}, 3)
4-element Vector{Polynomial{Rational{BigInt}, :x}}:
 Polynomial(1//1)
 Polynomial(1//8*x - 5//24*x^3)
 Polynomial(9//128*x^2 - 77//192*x^4 + 385//1152*x^6)
 Polynomial(75//1024*x^3 - 4563//5120*x^5 + 17017//9216*x^7 - 85085//82944*x^9)

(I originally did the first 10, and that required going to BigInt.) I would think this is the best way to get things in higher precision, because there is no rounding at all at this stage. From there it would probably be best to convert the numerators and denominators to BigFloat, divide to get another BigFloat, and then round to whatever lower-precision type is needed. Does this get at what you mean by having a routine for getting them in higher precision, or am I missing the point?
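Concretely, that conversion step could look something like this (run offline; the helper name is made up, and coeffs is the accessor from Polynomials.jl):

using Polynomials

# offline helper (name invented): exact Rational{BigInt} polynomials in, one tuple
# of T coefficients out per polynomial, with the division done in BigFloat before
# the final rounding to T
function uk_coefficient_tuples(::Type{T}, max_order) where {T}
    polys = Uk_polynomials(Rational{BigInt}, max_order)
    return [Tuple(T(BigFloat(numerator(c)) / BigFloat(denominator(c))) for c in coeffs(p))
            for p in polys]
end

Then uk_coefficient_tuples(Float64, 10) gives tuples ready to paste into evalpoly calls, and the same call with Double64 or Float128 in scope would give coefficients for those types.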

I assume that bringing Polynomials.jl into the dependency tree is not very desirable, though. So this would probably be something to run offline or put into a build step or something.

@heltonmc
Member

Yes, the branching aspect is definitely expensive. Some of the checks are more expensive than others, since some require computing square roots or cube roots, which adds to the check overhead. But they are expensive, and I do feel like some of these examples (along with examples from some of the other threads 😅) deserve a bigger write-up or something. I am in crunch time for thesis defense stuff, but hopefully I can write something up about branches and prediction etc.

So any improvements to figure out more quickly which branch to take would be very welcome, as that will improve the overall function significantly. There is also the significant overhead of just computing the coefficients before the U-polynomials, which is something like 75% of the time in the whole besselk_large_orders. That is a large portion, but we need to be careful, as we almost need to provide correctly rounded inputs to get full precision in the answer. Any speed improvements there while keeping the accuracy would be very beneficial! It is just that with the SIMD instructions now in place the U-polynomials are very fast, unless of course you are measuring the difference between, say, 10 and 25 terms. It probably isn't worth fine-tuning the approximation from, say, 9 to 6 terms or from 6 to 4 terms. Now, perhaps if there was a way to have some formula that could pick the number of terms to use without any branches, that might be beneficial. Maybe not exactly in this application, because this asymptotic expansion converges very rapidly, but perhaps in some of the other expansions where we need, say, 5 terms or 30 terms over a small range of inputs.

I would definitely prefer hard-coding these into the source file rather than using Polynomials.jl. We already have them up to n=21 stored as exact integers:

function _Uk_poly_Jn(p2)
u21 = evalpoly(p2, (1990919576929585163761247434580873723897677550573731281207275390625, -1094071724893410272201314846735800345180930752601602224440504638671875, 103359845691236916270005420886293061281368263471845448279855946394531250, -3993558675585831919973605001813276532326292885934464373884268722794718750, 83832389176247515253189196480455786065824463879054932596114704927571390625, -1101977288759423115916484463959767795703317649005495798822550381533967842875, 9865413224996821913093061266347613112205666995211786544733850490740903131000, -63509799534443193918054738326913581450607131662586232327271654588312820660616, 305080179205137176095342856930260393560550505786421683990077008259370104309410, -1122099959965657526344757894103766710826629020929459670948834078441128579791750, 3217185258543282976637373169047731813093578126603884759856059695249999306047500, -7276835259402589085458442196686990645669654847254510877738804991391058411412500, 13076651508858985557227319801010895818335803028865299751194038965212274849406250, -18719023753854332443081892087572754119280558462994056787647908433841920235468750, 21307327225910629020600045595173860611314592374884901403059587980149276271875000, -19156728115567236234371593788102013666866274144599851347429312225076676478125000, 13430107219321888006291381503763318985372222835071804983658005207838642173828125, -7186150593017474773099947110903785965879266917528290917207712073663895771484375, 2834031566467842832851805860262261718160136985649155528263587796394390332031250, -776308737033643856044315139790308583914612827783322139063211433790569824218750, 131897976398031751060811874321878385924211075364673046295410087596954345703125, -10468093364923154846096180501736379835254847251164527483762705364837646484375)) / 5456052820246562178600318839637653652351064473600000000
u20 = evalpoly(p2, (24030618487110150352755402740028995969072485932314476318359375, -11984379509393886990049682627416290126645799829432886859195312500, 1028816773596376675865176501621547688509187479681533744365175781250, -36136687211104276266628386141898079997731413843884113675698994212500, 689368394712060288513879526213143279387995331221181709158682715671875, -8226054591925395046897310265711439061232478740113589342810359827971600, 66729727980314206345594148553023187689855688854650386958448499394886808, -388235990422199036481157075090292917209702371811112714609151030385239376, 1679621555358289731717827244171539003686077073065087747585589973854567950, -5539024239213671573375139503069183935283937579486175606155069574939719000, 14158993985687610069291256086138044030538819617545349322200892501730615500, -28351273454978996507857547213496220444080209057761228571819950144231575000, 44700502703654649774362294361156754154712937287563245669685024068473468750, -55504285315413734114196925709059066734450069228526494661323483726671250000, 53996017865021964397812742861916826936440170527545458081631955240159375000, -40677946447845849750014263157052100427921007881029648608310681741346250000, 23250401373415767366970579801426233116241475047289980901600427561701171875, -9744288621182737996606001891415854305391392962559389056448157541132812500, 2823768330714179467082676027577350697536971031741905897356560592675781250, -505537818270094147077012813306995849751437826286925078626556809570312500, 42128151522507845589751067775582987479286485523910423218879734130859375)) / 658943577324464031231922565173629668158341120000000
u19 = evalpoly(p2, (3761719809509744584195141470215239968860161850335693359375, -1694136104847790923543013061207680485991618397125797873046875, 131518243789528012257323287684549590712219590887856743595703125, -4179129511260217209133623104486559148268490002722848244991240625, 72088266652871136165181276891337618088740053757385007292240377500, -776773977214820033621339638022401758862209813648991378355261599500, 5677394779818071523358076045292335690018332546439714641415150037636, -29667822591847140970922062603154078948366431649274897672067206014580, 114800233558787974267088388638654159779991991716667850063043161177890, -336788945225975997668966486515468748933754942064781256509404101275850, 760599837839891632010677514872412397427959022966733699428711208643750, -1333776105497652608917805362422659621440699261908512326717363285968750, 1821026790520428617034899967974569000414933301802005578083458082187500, -1929493390007550512825649545883182667978851358936168551128507440937500, 1570570304135828069768211937927131121377451398878663909061910457812500, -963355744456233323886326422779275308259891167457840027230510476562500, 430744376085805585076333464506126217301082462279781595185335615234375, -132496442776420617057548516091451843431549350710841121872822607421875, 25067118709566753991500713640672585065184296412786618544560205078125, -2198870062242697718552694179006367110981078632700580574084228515625)) / 980570799589976236952265721984567958568960000000
u18 = evalpoly(p2, (17402648254366970318896442155853313710334325579833984375, -7038996381042630440706320955207218277836259136214144531250, 491515845627995608630308828025937223574193727753376100859375, -14053430579296992819276840860026711644143962037898462198750000, 217977088167107844476671248240850549184342689357081793983627500, -2108435312318793739666233505677933977608124223061200130648299000, 13796125542556918366536045481173150324297945459901440053845493532, -64294535084989436534160300189785402934009324116808204634140090960, 220746706223415685134843293656748091845034254294736425040548604450, -570806748959722117700578156666367879819767325936645047354739587500, 1126546324172034608288726239622146620694308510621969350884171781250, -1707362295067866555400074337741150183369276187369879432556935250000, 1985632470023144029098600377248581775333320983961309350778555937500, -1757491631661175749558620292180560395062012011157079033996696875000, 1163006497421870129179321717633624986894646606476886508372367187500, -557267390747417776045984733477369082240579802825315540376718750000, 182639522233726398821209338567430722027981720466294556325849609375, -36632957438678377189819992330283878474735514770655121685644531250, 3391940503581331221279628919470729488401436552838437193115234375)) / 40857116649582343206344405082690331607040000000
u17 = evalpoly(p2, (86098445290621992919710288959076381992996044921875, -31086866964344776434901340294726759585844523939453125, 1940900724642360326780559417863159500982538936987375000, -49633551391469586967911551642521760042791355027620125000, 687906693873155071735691272079937403263685738018470814500, -5933198793919698447982330692598158428032189878414161012124, 34502031994800767344892422746634773296018107824906632419848, -142225164683992765990030747899409462088308925365682017857240, 429212352515026773475471295511415883382060846328770993005250, -967487903877461958325237917115228310512630590714451916640750, 1646551275524985252738847924398650627672209050734790851265000, -2121288587929318041669932345514745412624685501425210767875000, 2056954829464921151146089484749565775448230537475447606562500, -1477558757838664620053280135342775196503230238657956773437500, 762596613990574496129519527341431797422236075424587596875000, -267445311988545069854943494211913506927580321987738265625000, 57077468859498937494621367436714109067308943816271513671875, -5595830280343033087707977199677853830128327825124658203125)) / 1719575616564913434610454759372488704000000
u16 = evalpoly(p2, (23579874823845695868333654121829112397998046875, -7548208883093506123919073318645848440986281250000, 418571121445853600940987397062487467273966779625000, -9508794888602532435079114127493385198079570710470000, 116929016866180520574682795493722856433879981336112500, -892416493340003758691506680694321990355675722520838096, 4572999438129700505777642155788222350797540050279284760, -16514434054042671665093745143345645271999901089177788400, 43316362356947864694367448897560019067412728259897505250, -83973458255376217113431722753277764119484494850353270000, 121193409013842312774251134029225218007034205572281375000, -129900460304939941764670297010274798325693206430601250000, 102023293586271009821923530738271106285703628596295312500, -57054562339825052419659481088225652020508564894843750000, 21516499723877657209702638528530839860502345904015625000, -4906117886528008036369575428500807148681084440781250000, 511053946513334170455164107135500744654279629248046875)) / 3770999159133582093443979735465984000000
u15 = evalpoly(p2, (8178936810213560828419581728001773291015625, -2303431987527333128955769182299845911305390625, 112597271053778779753048514469995937998172890625, -2254933495791765108580529087615802250458013685625, 24403480234538299231733883413666768614198435948125, -163359140754958502896104062604202934925448173291477, 730367145705123976114617970888707594104468177381925, -2284251621937242886581917667066122422330060024456125, 5136561256208409671660362298619778869859994724706875, -8420533422834140468835467666391400380550043688871875, 10085018700249896522602267572484630409640764997271875, -8735135969643867540297524795790262235822823374296875, 5329871927856528282113994744458999865006055974609375, -2173722139119126857509156976742601742357422740234375, 532039967451707060045861997017872377315039521484375, -59115551939078562227317999668652486368337724609375)) / 9820310310243703368343697227776000000
u14 = evalpoly(p2, (5448320367052402487647812713291015625, -1338184074771428116079233795614103631250, 57170953417612444837142230812990944671875, -1000503839668383458999731491914516564625300, 9440449669103391509091075981237243128469201, -54857705817658080981995319669299096598482382, 211477117385619365164298957821904115470535555, -564850830044980230366140582063618983657685400, 1070439683260179398514645952824098811025619475, -1451823699927947602004297385351260623500372750, 1401302601668131482630653233972052729323190625, -940627071986145750405920450097257805227812500, 417630985812040040477569028872405152769921875, -110320224449895843354117801955418504167031250, 13133360053559028970728309756597440972265625)) / 45846453362482275295722209280000
u13 = evalpoly(p2, (17438611142828905996129258798828125, -3698121486504259988094897605296209375, 136735019134677724428035696765082813750, -2069933923586966756183324291232117362250, 16843538631795795357786827345307534156063, -83924867223075156862785921508524155665245, 274983827478138958623041508409195988431140, -616410216242554698436702353237166008656700, 962926533925253374359704288384340809260875, -1049095945162229046324321461816274931715625, 782463969315283937856703223178540650343750, -381190503845282445953419057314665534156250, 109361210755577700442544717509565392265625, -14020668045586884672121117629431460546875)) / 955134445051714068660879360000
u12 = evalpoly(p2, (120907703923613748239829527671875, -21882222767154197351962677311437500, 692277766674325563109617997687563750, -8958590476947726766450604043798559500, 62055079517573388459132793029571876461, -261201165596865827608687437905740780920, 714528665351965363868467348116538170900, -1314368459332124683504418467809129387000, 1642838631056253395899341314188134178125, -1378260730939829908037976894025628887500, 743739612850105971846901081692858843750, -233469939346545526651936774615688437500, 32426380464797989812768996474401171875)) / 39797268543821419527536640000
u11 = evalpoly(p2, (15237265774872558064250728125, -2321657500166464779660536015625, 62011003282542082472252466220875, -676389476843440422173605288596087, 3926191452593448964331218647028194, -13704902022868787041097596217578170, 30589806122850866110941936529264950, -44801790321820682384740638703320750, 42936745153513007436411401865860625, -25963913760458280822131901737878125, 8997860461116953237934638500359375, -1363312191078326248171914924296875)) / 27636992044320430227456000
u10 = evalpoly(p2, (9745329584487361980740625, -1230031256571145165088463750, 27299183373230345667273718125, -246750339886026017414509498824, 1177120360439828012193658602930, -3327704366990695147540934069220, 5876803711285273203043452095250, -6564241639632418015173104205000, 4513386761946134740461797128125, -1745632061522350031610173343750, 290938676920391671935028890625)) / 88580102706155225088000
u9 = evalpoly(p2, (6427469716717690265625, -659033454841709672064375, 11921080954211358275362500, -87432034049652400520788332, 334380732677827878090447630, -741743213039573443221773250, 992115946599792610768672500, -790370708270219620781737500, 345821892003106984030190625, -64041091111686478524109375)) / 263631258054033408000
u8 = evalpoly(p2, (134790179652253125, -10960565081605263000, 157768535329832893644, -914113758588905038248, 2711772922412520971550, -4513690624987320777000, 4272845805510421639500, -2152114239059719935000, 448357133137441653125)) / 22191183337881600
u7 = evalpoly(p2, (199689155040375, -12493049053044375, 138799253740521843, -613221795981706275, 1347119637570231525, -1570320948552481125, 931766432052080625, -221849150488590625)) / 115579079884800
u6 = evalpoly(p2, (2757049477875, -127577298354750, 1050760774457901, -3369032068261860,5104696716244125, -3685299006138750, 1023694168371875)) / 4815794995200
u5 = evalpoly(p2, (1519035525, -49286948607, 284499769554, -614135872350, 566098157625, -188699385875)) / 6688604160
u4 = evalpoly(p2, (4465125, -94121676, 349922430, -446185740, 185910725)) / 39813120
u3 = evalpoly(p2, (30375, -369603, 765765, -425425)) / 414720
u2 = evalpoly(p2, (81, -462, 385)) / 1152
u1 = evalpoly(p2, (3, -5)) / 24
u0 = one(p2)
I would prefer, though, that we avoid BigFloat and BigInt because they are so slow and they allocate. It would be nice to have dedicated routines for Double64 and Float128 that did not allocate and were fairly fast. This should be possible, but we need a way to store these coefficients so they can be used efficiently without allocating. Adding besselk methods for Double64 and Float128 would honestly be a really great addition here.

@cgeoga
Contributor Author

cgeoga commented Nov 15, 2022

I hear that. I also hadn't noticed that the integer form was already in there. If you didn't use an automated polynomial tool, I shudder to think how much grinding went into that; I admire your dedication. Also, I forgot you're graduating now! I'm happy to leave this on hold and come back when you have more time and bandwidth.

In terms of concrete things that I think I understand how to do, I could certainly provide analogs of _Uk_poly_Jn and the besselk equivalents as hard-coded functions that match the style and design you have here for types like Double64 and Float128. The one remaining question, though, is how to write those methods without pulling those packages in as deps. DoubleFloats has a lot of deps and would really pollute the tree. Quadmath doesn't, so maybe that wouldn't be crazy. But maybe there is some way to store the BigInt version locally and then have a @generated function or something that does those conversions and inlines things at compile time (so no Big* types get used to actually evaluate the function). After my games with that at the start of the thread, it doesn't seem impossible to me. Maybe we just add methods like bessel[*](::Val{T}, v, x), use the compile-time function to generate them when things are in scope, and go from there?

Or is there some easy way to hard-code those coefficients without bringing the Double64 or Float128 types into scope?

Using just Int64, by the way, my function above to generate them has integer overflow at Uk_polynomials(Rational{Int64}, 9). But maybe there's a smarter way to get around that by managing gcds better or something.

EDIT: Actually, I just remembered that Requires.jl exists. So that would probably be easiest, if you're willing to bring that into the dependencies.

@heltonmc
Member

heltonmc commented Nov 15, 2022

😄 the amount of time I've spent deriving coefficients for all these expansions definitely surpasses any amount of time I've spent on the actual library, haha! I've gotten a little smarter about it than I was a year ago, but it is painful. And no worries about the timing; I can definitely think and play around with it right now, I just won't be able to flesh it out completely.

The one remaining question, though, is how to write those methods without pulling those packages in as deps?

Yes - this was a question I grappled with earlier. Oscar might have a better suggestion for this... However, an advantage of having the exact integers is that we really don't need much extra effort here (there is a caveat I'll get to later). Take a test function using the exact integer representation of the first 10 coefficients:

function test(x::T, p2) where T
    u0 = one(T)
    u1 = evalpoly(p2, (3, -5)) / 24
    u2 = evalpoly(p2, (81, -462, 385)) / 1152
    u3 = evalpoly(p2, (30375, -369603, 765765, -425425)) / 414720
    u4 = evalpoly(p2, (4465125, -94121676, 349922430, -446185740, 185910725)) / 39813120
    u5 = evalpoly(p2, (1519035525, -49286948607, 284499769554, -614135872350, 566098157625, -188699385875)) / 6688604160
    u6 = evalpoly(p2, (2757049477875, -127577298354750, 1050760774457901, -3369032068261860,5104696716244125, -3685299006138750, 1023694168371875)) / 4815794995200
    u7 = evalpoly(p2, (199689155040375, -12493049053044375, 138799253740521843, -613221795981706275, 1347119637570231525, -1570320948552481125, 931766432052080625,  -221849150488590625)) / 115579079884800
    u8 = evalpoly(p2, (134790179652253125, -10960565081605263000, 157768535329832893644, -914113758588905038248, 2711772922412520971550, -4513690624987320777000, 4272845805510421639500, -2152114239059719935000, 448357133137441653125)) / 22191183337881600
    u9 = evalpoly(p2, (6427469716717690265625, -659033454841709672064375, 11921080954211358275362500, -87432034049652400520788332, 334380732677827878090447630, -741743213039573443221773250, 992115946599792610768672500, -790370708270219620781737500, 345821892003106984030190625, -64041091111686478524109375)) / 263631258054033408000
    u10 = evalpoly(p2, (9745329584487361980740625, -1230031256571145165088463750, 27299183373230345667273718125, -246750339886026017414509498824, 1177120360439828012193658602930, -3327704366990695147540934069220, 5876803711285273203043452095250, -6564241639632418015173104205000, 4513386761946134740461797128125, -1745632061522350031610173343750, 290938676920391671935028890625)) / 88580102706155225088000

    return evalpoly(x, (u0, u1, u2, u3, u4, u5, u6, u7, u8, u9, u10))
end

We can neatly use this function in any precision we want without loading dependencies into the package. The user simply has to load them...

julia> using Quadmath

julia> using DoubleFloats

julia> test(big"1.2", big"1.4")
656724.3266940676809279037485970819304152637485970819304152637485970819304153963

julia> test(Double64(1.2), Double64(1.4))
656724.3266942296

julia> @benchmark test(Double64(1.2), Double64(x)) setup=(x=rand())
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.864 μs …  2.908 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.867 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.870 μs ± 29.317 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆█                                                          
  ██▂▂▁▁▂▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▂▂▁▁▁▁▁▂▁▁▂▁▁▁▁▁▂▂▁▂▁▂▂▁▂▁▂▂▂▁▂▁▂▂ ▂
  1.86 μs        Histogram: frequency by time        2.04 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark test(BigFloat(1.2), BigFloat(x)) setup=(x=rand())
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  11.761 μs …   5.188 ms  ┊ GC (min … max):  0.00% … 68.27%
 Time  (median):     12.728 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   16.669 μs ± 111.479 μs  ┊ GC (mean ± σ):  10.28% ±  1.53%

  ▃▇█▇▅▃▁       ▂▃▃▂▁▁   ▂▁▁▁▂▁▁▁                              ▂
  ███████▇▅▆███████████▇███████████▇▆▅▆▇▆▆▅▆▅▅▄▄▅▃▄▄▄▅▂▃▂▄▄▃▄▃ █
  11.8 μs       Histogram: log(frequency) by time      30.6 μs <

 Memory estimate: 22.77 KiB, allocs estimate: 513.

julia> @benchmark test(Float128(1.2), Float128(x)) setup=(x=rand())
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.977 μs …  5.537 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.068 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.076 μs ± 65.653 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                 ▁▁▁▄▅▇█▇▇▆▆▆▆▄▃▃▂▁▁ ▁                        
  ▁▁▁▁▁▁▂▂▂▃▄▅▅▆▇███████████████████████▇▇▇▇▆▅▅▅▄▃▃▂▂▂▂▂▂▁▁▁ ▄
  2.98 μs        Histogram: frequency by time        3.18 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Clearly we can see the performance advantage of Float128 and Double64 over BigFloat. Yes, BigFloat has higher precision, but our cutoffs need to be fixed for a specific precision (Double64, Float128, Float256, etc.). I don't really envision this library as an arbitrary-precision routine, because the algorithms are just different.

And so this just works out of the box with any precision! It is definitely an advantage of Julia just being composable like that. It's also why I spent some time just getting the exact coefficients down.

julia> typeof(test(Float32(1.2), Float32(1.4)))
Float32

julia> typeof(test(Float64(1.2), Float64(1.4)))
Float64

julia> typeof(test(Double64(1.2), Double64(1.4)))
Double64 (alias for Double64)

julia> typeof(test(Float128(1.2), Float128(1.4)))
Float128

julia> typeof(test(BigFloat(1.2), BigFloat(1.4)))
BigFloat

So there is a caveat to this: all of those integers can be represented by Int128, but for higher-order terms they can't be and need to be represented by BigInt. The trouble with this is promotion rules. If we look at the promotion rules for Int128:

julia> promote_type(Float32, Int128)
Float32

julia> promote_type(Float64, Int128)
Float64

julia> promote_type(Double64, Int128)
Double64 (alias for Double64)

julia> promote_type(Float128, Int128)
Float128

This will preserve the input types very well... but that is not the case for BigInt...

julia> promote_type(Float128, BigInt)
BigFloat

Which will promote them all to BigFloat, which is bad. Just to exemplify this, I've changed one coefficient to be a BigInt instead of an Int128:

function test2(x::T, p2) where T
    u0 = one(T)
    u10 = evalpoly(p2, (9745329584487361980740625, -1230031256571145165088463750, 27299183373230345667273718125, -246750339886026017414509498824, 1177120360439828012193658602930, -3327704366990695147540934069220, 5876803711285273203043452095250, -6564241639632418015173104205000, 4513386761946134740461797128125, -1745632061522350031610173343750, 29093867692039167193502889062512334455677)) / 88580102706155225088000

    return evalpoly(x, (u0, u10))
end

Now we can see that no matter what we put into the function we will get BigFloat

julia> typeof(test2(Float128(1.2), Float128(1.4)))
BigFloat

julia> typeof(test2(Float64(1.2), Float64(1.4)))
BigFloat

Which of course is bad.... The simplest thing to do is to just convert the BigInt to whatever your input type is, like this:

function test2(x::T, p2) where T
    u0 = one(T)
    u10 = evalpoly(p2, (9745329584487361980740625, -1230031256571145165088463750, 27299183373230345667273718125, -246750339886026017414509498824, 1177120360439828012193658602930, -3327704366990695147540934069220, 5876803711285273203043452095250, -6564241639632418015173104205000, 4513386761946134740461797128125, -1745632061522350031610173343750, T(29093867692039167193502889062512334455677))) / 88580102706155225088000

    return evalpoly(x, (u0, u10))
end

So I've just converted the BigInt there, which solves the issue of preserving the right type.

julia> typeof(test2(Float64(1.2), Float64(1.4)))
Float64

julia> typeof(test2(Float128(1.2), Float128(1.4)))
Float128

But it doesn't make this function completely non-allocating.

julia> @benchmark test2(Float128(1.2), Float128(x)) setup=(x=rand())
BenchmarkTools.Trial: 10000 samples with 183 evaluations.
 Range (min … max):  572.754 ns … 10.086 μs  ┊ GC (min … max): 0.00% … 93.20%
 Time  (median):     595.407 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   605.754 ns ± 235.024 ns  ┊ GC (mean ± σ):  1.02% ±  2.47%

    ▁▃▄▄▃▃▂▃▄▃▆██▇▅▂   ▁▁        ▁▁  ▁▁               ▁▁        ▂
  ▄█████████████████▇▆█████▇▇▇█▇██████████▇▅▆▆▇▆▆█▇█████████▇▇▇ █
  573 ns        Histogram: log(frequency) by time        669 ns <

 Memory estimate: 192 bytes, allocs estimate: 4.

Which is also bad 😄 So if we could fix that issue I think this could work. Perhaps we need some bigger concrete integer type like Int256? There are a couple of options for this, I think. But ideally we want a method that works with these integer types and also with BigFloat coefficients you might get from some polynomial minimization. Perhaps Oscar has some tips too.

Edit: It looks like all of the integers would fit into an Int256 up to at least n=21. It would be nice if that type were just in Base, though.

@heltonmc
Member

heltonmc commented Nov 18, 2022

I looked a bit more into this, and all the methods we use for besselk lend themselves well to higher-precision arguments. However, we will need gamma in higher precision for the whole function. I can hopefully work on porting a higher-precision gamma function so we can get this done. I think there is a lot of benefit to having such a function in higher precision (and for complex values). That is really my focus for this package right now and over the next few months: getting complex values working and higher precision.

I think the main hurdle is how we want to store the higher-precision coefficients. Kind of a personal gripe, but I do wish there was just native support for Float128 or Int256 in base Julia. We will need a way to store high-precision coefficients that can be used for DoubleFloats.jl, Quadmath.jl, MultiFloats.jl, and whatever other higher-precision packages exist in the Julia ecosystem. We can't just add all of these dependencies, so it would be nice if the coefficients could be converted or parsed to these precisions at compile time depending on the input argument. My initial try at this actually resulted in allocations unless they were stored as constants in the exact precision.

@cgeoga
Contributor Author

cgeoga commented Nov 18, 2022

Sorry to have lost the thread here. Do you have an opinion about using Requires.jl? It seems like with that very light dep it would be possible to just store all the coefficients natively in their own types in a file that sits behind a @require barrier, unless I'm mistaken. That seems pretty easy.

@heltonmc
Member

Honestly, I just have a purely subjective inclination not to take on any dependency unless there is a very good reason 🤷‍♂️ very un-Julian, but I could be convinced. Again, just an unjustified preference.

So are you imagining a file that has all the coefficients stored as BigFloats then converted to the necessary precision depending on what argument the function is called with? Surely there is a way to do that.....

Or do you mean have all the coefficients listed in BigFloats, Float128, Double64? That way seems a little less attractive just due to the amount of duplication......

@heltonmc
Member

heltonmc commented Nov 18, 2022

Like can we somehow get test2 to just work at compile time?

julia> using DoubleFloats

julia> const P = (Double64(big"1.23"), Double64(big"1.8"), Double64(big"2.3"))
(1.23, 1.8, 2.3)

julia> function test(x)
           return evalpoly(x, P)
       end
test (generic function with 1 method)

julia> function test2(x)
           PP = (Double64(big"1.23"), Double64(big"1.8"), Double64(big"2.3"))
           return evalpoly(x, PP)
       end
test2 (generic function with 1 method)

julia> using BenchmarkTools

julia> @btime test(Double64(rand()))
  34.045 ns (0 allocations: 0 bytes)
2.797159851038373

julia> @btime test2(Double64(rand()))
  322.688 ns (6 allocations: 312 bytes)
3.375839473234667

EDIT: Or even better get this to work for any type...

function test3(x::T) where T
    PP = (T(big"1.23"), T(big"1.8"), T(big"2.3"))
    return evalpoly(x, PP)
end

# Maybe better with parse?
function test3(x::T) where T
    PP = (parse(T, "1.23"), parse(T, "1.8"), parse(T, "2.3"))
    return evalpoly(x, PP)
end

@cgeoga
Contributor Author

cgeoga commented Nov 18, 2022

Hm. Well, for your test functions, they should presumably at least be @generated so that we can be absolutely sure that the conversion happens at compile time (right?). Maybe it is as simple as this:

@generated function test4(v::V, x::X) where {V,X}
  T = promote_type(V, X)
  coefs = generate_U_coefs(T) # fixed number
  quote 
    u20 = evalpoly(...)
    u19 = evalpoly(...)
    [...]
  end
end
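As a self-contained toy of the same idea (not the real U_k table or the proposed API; only U_0 and U_1 are included and all the names are made up), the conversion happens while the body is being generated:

const UK_RATIONAL = (
    (1//1,),                     # U_0(p) = 1
    (0//1, 1//8, 0//1, -5//24),  # U_1(p) = p/8 - 5p^3/24, ascending powers of p
)

@generated function uk_sum_toy(v::V, p::P) where {V, P}
    T = promote_type(V, P)
    # the Rational -> T conversion happens here, at generation time, so no
    # Rational or BigInt arithmetic is left in the runtime body
    c = map(t -> T.(t), UK_RATIONAL)
    quote
        u0 = evalpoly(p, $(c[1]))
        u1 = evalpoly(p, $(c[2]))
        return u0 + u1 / v
    end
end

Calling uk_sum_toy(Double64(10.0), Double64(0.3)) should then stay in Double64 the whole way through, since the constants get baked in at the promoted type the first time that method signature is compiled.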

But another thought that just occurred to me is that maybe it wouldn't be very hard to write our own little polynomial type that does the few operations my earlier generator function needs to produce the coefficients. That is again completely doable at compile time and would cut down a lot on the amount of giant integer coefficient lists hard-coded inline. SpecialFunctions actually seems to do this precise thing in its expint implementation.
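The handful of operations that generator actually needs is pretty small. A rough sketch (all names invented, and not meant to replace Polynomials.jl in general):

# tiny stand-in for Polynomials.jl, just enough for the U_k recurrence:
# construction, +, *, scalar / and -, derivative, integrate, and evaluation
struct MiniPoly{T}
    c::Vector{T}    # ascending coefficients: c[1] + c[2]*x + c[3]*x^2 + ...
end

(p::MiniPoly)(x) = evalpoly(x, p.c)

function Base.:+(p::MiniPoly{T}, q::MiniPoly{T}) where {T}
    c = zeros(T, max(length(p.c), length(q.c)))
    c[1:length(p.c)] .+= p.c
    c[1:length(q.c)] .+= q.c
    return MiniPoly(c)
end

function Base.:*(p::MiniPoly{T}, q::MiniPoly{T}) where {T}
    c = zeros(T, length(p.c) + length(q.c) - 1)
    for (i, a) in enumerate(p.c), (j, b) in enumerate(q.c)
        c[i+j-1] += a * b
    end
    return MiniPoly(c)
end

Base.:/(p::MiniPoly, s::Number) = MiniPoly(p.c ./ s)
Base.:-(p::MiniPoly, s::Number) = MiniPoly([p.c[1] - s; p.c[2:end]])

# derivative of a constant is the zero polynomial, kept non-empty on purpose
derivative(p::MiniPoly{T}) where {T} =
    length(p.c) == 1 ? MiniPoly([zero(T)]) :
                       MiniPoly(T[(k - 1) * p.c[k] for k in 2:length(p.c)])

# antiderivative with zero constant term, like integrate in Polynomials.jl
integrate(p::MiniPoly{T}) where {T} =
    MiniPoly([zero(T); T[p.c[k] / k for k in 1:length(p.c)]])

With something like that in a build script, the Uk_polynomials function earlier in the thread would only need MiniPoly([one(T)]) in place of Polynomial([one(T)]) and so on, and none of it would ever ship as a dependency.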

@heltonmc
Member

Yes - I think the generated functions could be a good solution! Ideally, we just need to generate the constants in the right type in a way that works for a bunch of input types. I think your approach for the U-polynomials is nice, but I also want an approach that is general enough for the other methods, so that for any minimax polynomials we could apply a similar strategy instead of regenerating all the polynomials. I'm just hoping we can find a good solution that works for the other expansions besides the U-polynomials, and also for the minimax polynomials! It could be that generated functions work nicely for all of these cases.
