sin/cos/atan2/sqrt implementation analysis #191
Thanks! Your LUT depth test should give at least the results of the migen cossin (which is bit-accurately modeled and analyzed in the notebook), because that uses very coarse pre-scaled derivatives to save two multipliers. To get 1.4 LSB max and 0.6 LSB rms error we only need a 10-bit-deep LUT (for one octant), i.e. 4 kB. I think the relevant newlib implementation of sin/cos/sincos is not the one you link to; you should look at this one (.h). Comparing the current (double) libm impl with a (float) newlib is probably not meaningful. That's also the reason one has 7 terms and the other 5. As mentioned, the phase will be fixed point anyway at its source. Converting it to f32 and back (for LUT lookup) is likely a bad idea; we don't need the wrapping of a float implementation. I don't know off-hand how well an expansion with fixed point works. Let's do the following (comments welcome).
This implementation uses double-precision intermediate values. If I'm going to adapt this to an all single-precision version, does it also make sense to do the same for the current rust libm version and compare them?
I'm still not entirely clear on this. For instance, if we get a reference period of 100 (in units of the internal counter period) and the LUT has size 1024, then to index the LUT we first need to scale input timestamps so that the reference period becomes some power of 2. The scaling factor on ADC phases in this example could be 128/100. This requires conversion to f32, rounding and conversion back to integer. Or am I missing something here?
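As a concrete sketch of the float round-trip being described (using the 128/100 example; the function name and signature are hypothetical):

```rust
/// Map a phase counter with an arbitrary (non-power-of-2) reference
/// period onto a power-of-2 phase range, e.g. period = 100 counts onto
/// a 2^7 range. Sketch only: convert to f32, scale, round, convert
/// back, and wrap into the target range.
fn rescale_phase(counts: u32, period: u32, target_bits: u32) -> u32 {
    let scale = (1u64 << target_bits) as f32 / period as f32;
    let scaled = (counts as f32 * scale).round() as u32;
    // Wrap into the power-of-2 target range.
    scaled & ((1u32 << target_bits) - 1)
}
```

For example, half a reference period (50 of 100 counts) maps to half of the 2^7 range (64), and a full period wraps back to 0.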
I believe this also requires intermediate floating-point math (for the pi/4 scaling factor), although I guess I could approximate this with integer math (e.g., 3/4).
Worth noting that the current libm implementation already uses f32 exclusively. It'll be worth evaluating the performance, of course, but there may not be much work to do here. Still, I'll see what I can find about i32 implementations.
Ok.
That item is after the LUT checkpoint (not now). My assumption is that those ARM people tuned and optimized the algo for a representative and relevant mix of embedded ARM cores, including ours. We do have a double-precision FPU (but AFAICT we aren't really well equipped to use it due to LLVM). Iff we go past the checkpoint, I'd just stick to that impl more or less verbatim first (assuming the concept is optimal). And then, if it's noticeably bad, maybe try again with single precision.
Right. I had updated #170 describing the math a couple days ago. We need to agree on the math in that issue first. Yes, there is a division in there. But I'm not sure whether it's better to do that in (a) i32 with some numerator/denominator scaling or clever division algorithm, (b) i64, or (c) f32.
In the worst case the linear interpolation by
I wonder whether it is actually important for us to be accurate here. It may be fine to accept quite some deviation from
Ok.
Ok.
Yes, true. So, getting the interpolation for a sin value could be (cos * LSBs * 804) >> (depth + #(LSBs) + 10), adjusted for half-up rounding. 804 and 10 are just one pair of values we could use to approximate pi/4 (804 / 2^10 ≈ 0.785). LSBs is the value of the LSBs giving the phase offset between successive LUT values, and #(LSBs) is the number of LSB bits.
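A minimal sketch of this interpolation term, assuming an i64 intermediate for the product (the bit-distribution discussion below is about avoiding exactly that widening by splitting into two i32 multiplies):

```rust
/// Derivative-based interpolation term: delta_sin ~= cos * delta_phi * pi/4,
/// with pi/4 approximated as 804 / 2^10. `cos` is the fixed-point cosine
/// at the LUT entry, `lsbs` the residual phase bits below the LUT index,
/// `depth` the LUT address width, `n_lsbs` the residual bit count.
fn sin_delta(cos: i32, lsbs: i32, depth: u32, n_lsbs: u32) -> i32 {
    let shift = depth + n_lsbs + 10;
    // Widen to i64 here for the sketch; cos * lsbs * 804 can exceed i32.
    let p = cos as i64 * lsbs as i64 * 804;
    // Add half an LSB before shifting for half-up rounding.
    ((p + (1i64 << (shift - 1))) >> shift) as i32
}
```

For a full-scale cosine (2^16) and a full interpolation step (2^7) with depth 10, this gives roughly (pi/4) / 2^10 * 2^16 ≈ 50 LSBs, which matches the expected sin increment per LUT step.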
Ok. We can investigate this after the first checkpoint.
Use the full i32 range for the intermediate calculations. The i32 phase will be split into: 3 octant bits, N LUT bits, M interpolation bits. Having Z = 20 total phase bits going into the computation is sufficient for the accuracy/quantization we need. So it'll be something like N = 10 and M >= 7 bits going into the interpolation. Then use a K-bit fixed-point representation for the pi/4 constant to max out the i32 multiplication range before bias addition and right shifting. It might be most accurate to distribute the 32 bits evenly among the three factors (the 17- or 18-bit LUT value, M, and K; like 11, 11, 10), or to do two separate i32 multiplications with two bias additions and two shifts. Watch out where you actually need the sign and where it is known throughout the octant. I think only the M delta-phi bits are signed; the rest is fixed sign. The misoc core should have all that figured out. Note that the LUT data is effectively 17/18 bit with 2x16 bit storage because of the known sign. It looks like with that we should be able to exceed the accuracy target.
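The proposed bit split could look like the following sketch (constants and names are illustrative, using N = 10 with the remaining low bits for interpolation):

```rust
/// Split a 32-bit phase word (full turn = 2^32) into 3 octant bits,
/// LUT_BITS index bits and the remaining low bits for interpolation.
const OCTANT_BITS: u32 = 3;
const LUT_BITS: u32 = 10;
const INTERP_BITS: u32 = 32 - OCTANT_BITS - LUT_BITS; // 19

fn split_phase(phase: u32) -> (u32, u32, u32) {
    let octant = phase >> (32 - OCTANT_BITS);
    let index = (phase >> INTERP_BITS) & ((1u32 << LUT_BITS) - 1);
    let delta = phase & ((1u32 << INTERP_BITS) - 1);
    (octant, index, delta)
}
```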
With all that we may or may not be limited by the quadratic term. It's not worth optimizing beyond that. p.s. Max error is
Closing since #197 has been merged. atan2 analysis will be opened in a new issue.
cos/sin
Conclusions presented at the end of this post.
The main options for a cos/sin implementation are:
fixed-point LUT
A fixed-point LUT implementation stores a 16-bit fixed-point sine value and 16-bit fixed-point cosine value (as a combined 32-bit integer) for some chosen number of phase values, where the phase values are equally spaced between 0 and pi/4. The number of phase values stored is a power of 2.
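A sketch of building such a LUT (scale and packing order are assumptions; the real design may use a different amplitude or the known-sign storage trick discussed in this thread):

```rust
/// Build a packed sin/cos LUT for one octant: 2^DEPTH equally spaced
/// phases in [0, pi/4), 16-bit sin in the high half and 16-bit cos in
/// the low half of each u32 entry. Scaled to the i16 range (assumption).
const DEPTH: u32 = 10;

fn make_lut() -> Vec<u32> {
    let scale = i16::MAX as f64;
    (0..(1u32 << DEPTH))
        .map(|i| {
            let phase = i as f64 / (1u32 << DEPTH) as f64 * std::f64::consts::FRAC_PI_4;
            let s = (phase.sin() * scale).round() as u16;
            let c = (phase.cos() * scale).round() as u16;
            ((s as u32) << 16) | c as u32
        })
        .collect()
}
```

At DEPTH = 10 this is 1024 entries of 4 bytes, i.e. the 4 kB figure mentioned earlier in the thread.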
algorithm description
There is a Jupyter notebook for a migen implementation that is mostly the same as the algorithm described below.
There are a few different approaches to a fixed-point LUT. We could:
Option (1) is possibly the fastest because it does not require mapping inputs, but requires 8x the memory of (2) and (3). This greatly increased memory usage may actually make it slower than the other methods. Options (2) and (3) use the same size LUT, but (3) is preferred in cases where sin and cos are computed simultaneously. The reason for this is that the same phase value can be used for the sin and cos. This allows the phase mapping logic to be performed only once.
The mapping of all phase values to the 0 to pi/4 range is shown in the table below.
The table also displays the values of the 3 MSBs corresponding to each range when the LUT size is a power of 2. We could use these to select the sin/cos expressions to compute. Alternatively, the Migen implementation uses some clever bit manipulations (described below) to arrive at the phase. Our implementation should probably use the same procedure, but it's worth benchmarking to see which is faster. The process used by Migen is:
The final step is to perform interpolation. Two options for this are linear interpolation between the bounding values in the LUT or a linear interpolation based on the derivative value at each point. The Migen implementation devotes some of the LUT bits to the derivative. However, since the linear interpolation method using bounding values does not require any additional storage and is cheap to compute, I believe it makes more sense in this application.
If the integer phase range is 8x that of the LUT (remember, the LUT only stores sin/cos for phases between 0 and pi/4), we can use it directly for indexing the LUT (mask off the 3 MSBs). If the integer phase range is less than 8x the LUT index range but a power of 2 we can left shift the input (after masking off the 3 MSBs) and use the resulting phase to directly index the LUT.
If the range is greater than 8x the LUT index range but a power of 2 we can store the LSBs corresponding to the difference in size. We then right-shift the input phase by this amount and use that value to index the LUT. In the case of linear interpolation, we also retrieve the next higher index value from the LUT. These correspond to the lower and upper bounds of a linear interpolation. We take the difference of the bounds, multiply by the value in the LSBs and right shift by the number of LSBs. This is added to the lower bound. By using fixed-point values, we can do the interpolation using 2 memory accesses, an integer subtraction, an integer multiplication, an integer bit shift and an integer addition. Everything except for the memory accesses must be done twice: once each for the sin and cos.
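The indexing-plus-interpolation procedure just described could be sketched as follows, for the sin component only (names, the i32 entry type, and the end-of-table guard are simplifications for illustration):

```rust
/// Linear interpolation between bounding LUT values. `lut` holds
/// fixed-point sine values for one octant; `phase` has already had its
/// 3 octant MSBs masked off and is (index bits + frac_bits) wide.
fn sin_interp(lut: &[i32], phase: u32, frac_bits: u32) -> i32 {
    let index = (phase >> frac_bits) as usize;
    let frac = (phase & ((1u32 << frac_bits) - 1)) as i32;
    let lo = lut[index];
    // Next-higher entry is the upper interpolation bound. The last
    // interval really needs sin(pi/4) from outside this octant's table;
    // clamping here is just a placeholder for that boundary handling.
    let hi = *lut.get(index + 1).unwrap_or(lut.last().unwrap());
    // Difference of bounds, scaled by the fractional phase.
    lo + (((hi - lo) * frac) >> frac_bits)
}
```

This is exactly the operation count given above: two memory accesses, then one subtraction, one multiplication, one shift and one addition per output (repeated for cos).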
When the integer phase is not a power of 2 we must map it to a power of two and then perform the steps above. The basic way to do this is to convert the phase to an f32
, multiply it by a scaling factor, round and then convert back to an integer. This could be problematic for very slow reference frequencies. In that case, we could right shift down to a value that doesn't incur significant floating point errors and then perform the same process (i.e., scale, round, convert to int). This conversion to float and back again is not ideal. However, I don't see a way around it if it's possible to have arbitrary reference periods.

accuracy
We can assess the accuracy of an implementation by testing whether it introduces errors relative to an "ideal" implementation in the following signal processing flow:
A value is sampled by the 16-bit ADC. That value is then multiplied by the sine function implementation using an arbitrary phase value between 0 and pi/4. The product is then quantized by the 16-bit DAC.
We approximate the ideal implementation using a double-precision floating-point implementation. I've currently used simple truncation for simplicity. But, if we'd like to avoid biases introduced by intermediate truncation, we could use a non-biasing rounding method such as convergent rounding. Additionally, I've ignored the accuracy of the cos implementation, assuming that the sin analysis also reflects the cos accuracy.
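The measurement above can be sketched roughly as follows (all names hypothetical; `approx_sin` is a stand-in for whatever implementation is under test, taking a phase in radians and returning a value scaled by 2^15):

```rust
/// Error of one DAC output in LSBs: multiply a full-scale 16-bit sample
/// by the approximate sine, quantize by truncation (as in the analysis),
/// and compare against a truncated f64 reference.
fn dac_error(approx_sin: impl Fn(f64) -> i32, phase: f64) -> i32 {
    let adc = i16::MAX as i32; // worst-case full-scale ADC sample
    let ideal = (adc as f64 * phase.sin()) as i32; // truncated reference
    let out = (adc * approx_sin(phase)) >> 15; // fixed-point path
    out - ideal
}
```

Sweeping this over phases in [0, pi/4) would give the max and rms error figures used to pick the LUT depth.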
Output:
A reasonable choice of LUT depth is probably in the range of 10-13 bits (4-32KiB). Although a standard deviation less than 1 (which would mean most results do not suffer any accuracy loss) would be great, this would require a 256KiB LUT, which is probably excessive.
alternative implementations
CMSIS uses a floating-point output LUT. It stores values for the full range 0 to 2pi using 512 values (depth of 9) and then performs linear interpolation on phases in-between. Nothing about this implementation is particularly compelling. Based on the analysis performed earlier for the Migen implementation, this implementation should have poor accuracy and high memory usage.
We could perform a variant of this implementation with an integer input and fixed-point output. Additionally, we could omit the logic used to confine the phase to the 2pi range (actually its integer equivalent). The resulting implementation would be similar to the Migen implementation except that it would omit the phase mapping stage and require 8x the storage for the same accuracy. Since the phase mapping only requires a few operations, this is probably a bad tradeoff.
floating-point polynomial approximation
libm/musl
The libm implementation first stores the sign bit, masks it off the phase and then branches on the now-positive phase value. x represents the phase. The ranges are:
It then performs an argument reduction if needed. Since this part is not relevant to us it won't be discussed in detail. However, it's worth noting that the argument reduction isn't attempted until after we check to see if we're in a phase range less than 2pi. Therefore, it shouldn't incur a performance penalty when not needed. It is also worth noting that this argument reduction uses 64-bit floating-point intermediate values. An upstream change to libm sincosf would also need to remedy this. This would be development effort not directly applicable to stabilizer.
If the phase is in range (1) we return sin and cos values directly. When in range (2), the polynomial approximation is performed on the phase directly. In all ranges, each polynomial approximation (sin and cos) has to work for phase values between -pi/4 and pi/4. The rest of the phase mapping is thematically similar to the Migen implementation. The first obvious difference is that branches are used to select the appropriate angle and trig function to compute, rather than the clever bit manipulations Migen can use because it uses power-of-2 fixed point. Additionally, libm must deal with negative arguments, which is another downside to this implementation from our perspective. Moreover, the implications are worse than just unneeded phase value checks, since the polynomial approximation is designed to be accurate between -pi/4 and pi/4, whereas ours only needs to be accurate between 0 and pi/4. We should be able to get improved accuracy/performance by centering our expansion at pi/8 instead of 0.
The polynomial approximation is just a Taylor series using the first 5 terms. In other words, sin is computed as
x - x**3/3! + x**5/5! - x**7/7! + x**9/9!
and cos is computed as
1 - x**2/2! + x**4/4! - x**6/6! + x**8/8!
The powers are computed and the coefficients are stored as constants. An upstream change to libm should only need to change these coefficients to be single-precision floating point and change the function signature to take a single-precision phase value. To understand why libm uses 5 terms, it's helpful to plot the error and the 32-bit floating point precision between 0 and pi/4. This is shown in the plot below.
For 5 terms, the error is almost always below the floating-point precision of the corresponding value. Moreover, adding extra terms does not improve the accuracy. Therefore, 5 terms provides the cheapest computation without incurring loss of accuracy.
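The 5-term sin expansion above can be sketched in f32 with precomputed coefficients evaluated in Horner form. Note this uses plain 1/n! Taylor coefficients as described in this post; libm's actual constants are tuned, so this is an illustration of the scheme rather than a copy of libm:

```rust
/// 5-term Taylor approximation of sin(x) for x in [-pi/4, pi/4]:
/// x - x^3/3! + x^5/5! - x^7/7! + x^9/9!, evaluated in Horner form
/// over x^2 with precomputed coefficient constants.
fn sin_poly(x: f32) -> f32 {
    let x2 = x * x;
    let p = 1.0 / 362880.0; // 1/9!
    let p = p * x2 - 1.0 / 5040.0; // -1/7!
    let p = p * x2 + 1.0 / 120.0; // 1/5!
    let p = p * x2 - 1.0 / 6.0; // -1/3!
    x + x * x2 * p
}
```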
newlib
Newlib has a sincos implementation but it is just separate calls to sin and cos. This means that all argument reduction code is duplicated. The sin and cos implementations are reasonably similar to those of libm. The general idea is to restrict the argument to the range -pi/4 to pi/4 and then perform a Taylor expansion. Newlib's argument reduction differs from that of libm. Whereas libm performs a branch directly to an appropriate sin or cos Taylor expansion for any absolute phase less than 9pi/4 before performing a more general argument reduction, newlib performs this reduction for anything outside of -pi/4 to pi/4. For our purposes, libm's strategy is probably superior to that of newlib. To assess the performance implications of this decision for more general purposes would require benchmarking.
Newlib uses the first 7 terms of the Taylor series as an approximation. This choice doesn't make sense to me, since (based on the analysis above) any terms above 5 do not provide additional accuracy.
My initial impression is that the libm implementation is better than that of newlib, though benchmarks would be needed to verify this if we do want to use a floating-point implementation.
custom solution
I expect that one of the major downsides to implementing a 32-bit sincos function in libm is the need to implement 32-bit rem_pio2f and rem_pio2_large implementations. These account for a large amount of the code used in sincos and are not needed for our case. The logic in sincos apart from rem_pio2f is not particularly complicated, and therefore a custom solution is a reasonable possibility. Still, for the sake of maintenance and because the full implementation should not incur a performance penalty, it would still probably be preferable to implement a full solution in libm and use that if we go the floating-point route.

additional thoughts
There are a number of important considerations for deciding between a fixed-point implementation and a floating-point implementation:
A fixed-point LUT implementation (which is similar to the one used in misoc) seems to be the best solution at this time.
Analyses of atan2 and sqrt implementations will follow.
TODO atan2
TODO sqrt