
Utilities for decimal <--> floating conversion #15359

Merged (21 commits) on May 29, 2024

Conversation

@pmattione-nvidia pmattione-nvidia commented Mar 21, 2024

These are some utilities used by the upcoming decimal <--> floating conversion PR. This has been submitted separately from that PR in order to spread out the complexity for review. These functions are not called by any code in this PR.

One function extracts the components of a floating-point number. Another sets a floating-point number's sign bit and adds some additional powers of two. These are done using integer and bit operations, which is much faster than using the built-in functions and bottlenecking on the FP64 pipeline. The final function counts the number of significant bits in a number.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@pmattione-nvidia pmattione-nvidia added libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change labels Mar 21, 2024
@pmattione-nvidia pmattione-nvidia self-assigned this Mar 21, 2024
@pmattione-nvidia pmattione-nvidia requested review from a team as code owners March 21, 2024 04:08
@github-actions github-actions bot added Python Affects Python cuDF API. CMake CMake build issue conda Java Affects Java cuDF API. labels Mar 21, 2024
@pmattione-nvidia pmattione-nvidia changed the base branch from branch-24.04 to branch-24.06 March 21, 2024 04:08
@pmattione-nvidia pmattione-nvidia added improvement Improvement / enhancement to an existing function Python Affects Python cuDF API. and removed Python Affects Python cuDF API. labels Mar 21, 2024
@ttnghia ttnghia removed request for a team March 21, 2024 06:18
@ttnghia ttnghia removed Python Affects Python cuDF API. CMake CMake build issue conda labels Mar 21, 2024
@harrism harrism (Member) left a comment:
Thanks for cooperating on reviews and revisions!

@pmattione-nvidia (Contributor Author), replying to a reviewer:

> @pmattione-nvidia I started a review on this but there are at least 3 conversations that were resolved without the corresponding changes being applied. Can you please verify that all previous review requests were applied?
All of these changes have been applied and committed.

cpp/include/cudf/fixed_point/floating_conversion.hpp (outdated diff, resolved)
IntegralType integer_rep = bit_cast_to_integer(floating);

// Set the sign bit
integer_rep |= (IntegralType(is_negative) << sign_bit_index);
Contributor:
Is this right? If the sign bit is already 1, it's not possible to change it with an OR operator, regardless of is_negative. Are we assuming the sign bit is always 0 when entering this function? If so, that is not documented.

Are there test cases for this code?

Contributor Author:
The sign is always zero on input, I'll add that comment. It's not part of this PR, but when I switch from the old decimal/floating conversion code to this, the fixed_point tests test all of this functionality, and all of those tests pass.

Contributor Author:
Comment added

Contributor:
Might I suggest a reminder in the comment added that 0 means positive value, so the or will do what we want. Simply saying it must be positive implies the reader knows this.

Contributor Author:
Done

@bdice bdice (Contributor), May 28, 2024:
I don't like the design of this function.

  1. If is_negative is true, this should be a no-op -- we shouldn't even cast back and forth.
  2. Where is this being called? It doesn't appear to be used in this PR, and it doesn't seem like it's designed for public consumption.
  3. Can we avoid bitwise manipulation entirely and just negate the value? i.e. return is_negative ? -floating : floating;

Contributor Author:
This PR contains utilities for the upcoming primary PR for the decimal <--> floating conversion. I broke the code into several PRs as it is quite large. So nothing is calling this yet.

This specific function is used at the end of decimal --> floating. In the main algorithm we do a lot of bit-shifting, so the sign is initially cleared, and here we set it at the end. Doing it this way (cast + OR) requires no branching, so you don't pay a performance penalty for the branch. A couple of months ago I benchmarked several different methods, and this one was the fastest.

cpp/include/cudf/fixed_point/floating_conversion.hpp (outdated diff, resolved)
* @param value The integer whose bits are being counted
* @return The number of significant bits: the # of bits - # of leading zeroes
*/
template <typename T, typename cuda::std::enable_if_t<(cuda::std::is_unsigned_v<T>)>* = nullptr>
Contributor:
I believe the trait is_unsigned_v will return true for types like uint16_t. However, only 32/64/128 bit values appear to be handled below. Should we be more specific in this trait, and only accept 32/64/128-bit unsigned integers? (Will we get a compile error currently for unsupported types?)

@pmattione-nvidia pmattione-nvidia (Contributor Author), May 20, 2024:
There's a static assert at the start of the function, so unsupported types won't compile. I'll add a note to the doxygen template parameter that only 32/64/128-bit values are allowed.

Contributor Author:
Done.

Contributor:
Using both SFINAE and a static_assert seems unnecessary. Can you just make this plain SFINAE?

Contributor Author:
Done

@hyperbolic2346 hyperbolic2346 (Contributor) left a comment:
I can see how this is forming into a solution and I like it. I also appreciate you breaking this up into multiple parts.

cpp/include/cudf/fixed_point/floating_conversion.hpp (outdated diff, resolved)
* @param floating The floating value to extract the significand from
* @return The integral significand, bit-shifted to a (large) whole number
*/
CUDF_HOST_DEVICE inline static IntegralType get_base2_value(FloatingType floating)
Contributor:
It would be undesirable to call these in succession. I know that this isn't the plan, but in a year or two it may not be all that obvious that

auto b = get_base2_value(f);
auto s = get_is_negative(f);
auto e = get_exp2(f);

isn't ideal. Would it make sense to change these to take the integer representation, so that implementers later will default to good behavior when they see the need for something they already have?

auto const bits = bit_cast_to_integer(f);
auto b = get_base2_value(bits);
auto s = get_is_negative(bits); // using bits because they already have that! yay!
...

Contributor Author:
Well, the compiler is able to optimize this so that there's no difference at runtime. However, it's probably a good idea to factor it out anyway; will do.

Contributor Author:
Done

@pmattione-nvidia pmattione-nvidia changed the base branch from branch-24.06 to branch-24.08 May 21, 2024 14:14
@bdice bdice (Contributor) left a comment:
I left another round of comments, mostly in reply to existing threads.

@bdice bdice (Contributor) left a comment:
I have no further comments aside from the discussion on overflow. https://github.com/rapidsai/cudf/pull/15359/files#r1617755579

I will approve this to unblock further work, but I feel uncertain about the correctness and performance of this PR (and the decimal/float work in general). I hope an extensive set of tests and benchmarks can be added to support this functionality.

@pmattione-nvidia (Contributor Author):
/merge

@rapids-bot rapids-bot bot merged commit 12336da into rapidsai:branch-24.08 May 29, 2024
70 checks passed
rapids-bot bot pushed a commit that referenced this pull request Jul 11, 2024
This PR contains the main algorithm for the new decimal <--> floating conversion code. This algorithm was written to address the precision issues described [here](#14169). 

### Summary
* The new algorithm is more accurate than the previous code, but it is also far more complex.  
* It can perform conversions that were not even possible in the old code due to overflow (decimal32/64/128 conversions only worked for scale factors up to 10^9/18/38, respectively). Now the entire floating-point range is convertible, including denormals. 
* The new algorithm is significantly faster in some parts of the conversion phase space, and slightly slower in others.

### Previous PRs
These contain the supporting parts of this work:  
* [Explicit conversion PR](#15438) 
* [Benchmarking PR](#15334)
* [Powers-of-10 PR](#15353)
* [Utilities PR](#15359). These utilities are updated here to support denormals. 

### Algorithm Outline
We convert floating -> (integer) decimal by:
* Extract the floating-point mantissa (converted to integer) and power-of-2
* For float we use a uint64_t to hold the data during the shifting/scaling below; for double we use a uint128_t.
* In this integer we alternately apply the extracted powers of 2 (bit-shifts, until they're all used) and scale-factor powers of 10 (multiply/divide) as needed to reach the desired scale factor.

Decimal -> floating is just the reverse operation.  

### Supplemental Changes
* Testing: Add decimal128, add precise-conversion tests. Remove kludges due to inaccurate conversions. Add test for zeroes. 
* Benchmarking: Enable regions of conversion phase-space for benchmarking that were not possible in the old algorithm. 
* Unary: Cleanup by using CUDF_ENABLE_IF.  Call new conversion code for base-10 fixed-point. 

### Performance for various conversions/input-ranges
* Note: F32/F64 denote float/double

New algorithm is **FASTER** by: 
* F64 --> decimal64: 60% for E8 --> E15
* F64 --> decimal128: 13% for E-8 --> E-15
* F64 --> decimal128: 22% for E8 --> E15
* F64 --> decimal128: 27% for E31 --> E38
* decimal32 --> F64: 18% for E-3 --> E4
* decimal64 --> F64: 27% for E-14 --> E-7
* decimal64 --> F64: 17% for E-3 --> E4
* decimal128 --> F64: 21% for E-14 --> E-7
* decimal128 --> F64: 11% for E-3 --> E4
* decimal128 --> F64: 13% for E31 --> E38

New algorithm is **SLOWER** by: 
* F32 --> decimal32: 3% for E-3 --> E4
* F32 --> decimal64: 2% for E-14 --> E14
* F64 --> decimal32: 3% for E-3 --> E4
* decimal32 --> F32: 5% for E-3 --> E4
* decimal128 --> F64: 36% for E-37 --> E-30

Other kernels:
* The PYMOD binary-op benchmark is 7% slower.  

### Performance discussion
* Many conversions run at identical speed, indicating these algorithms are often fast enough that we are instead bottlenecked on overheads such as getting the input to the GPU in the first place. 
* F64 conversions are often much faster than the old algorithm because the new algorithm completely avoids the FP64 pipeline. Other than the cast to double itself, all of the operations are on integers, so threads aren't competing with each other and taking turns for access to the floating-point cores. 
* The conversions are slightly slower for floats with powers-of-10 near zero, presumably due to code overhead for, e.g., handling a large range of inputs, UB checks for bit shifts, and branches for denormals. 
* The conversion is slower for decimal128 conversions with very small exponents, which require several large divisions (128-bit divided by 64-bit). 
* The PYMOD kernel is slower due to register pressure from the new division routines introduced in the earlier PR. Even though this benchmark performs no decimal <--> floating conversions, it is affected because inlined template code in the kernel increases code size and register pressure.

Authors:
  - Paul Mattione (https://github.com/pmattione-nvidia)

Approvers:
  - Jason Lowe (https://github.com/jlowe)
  - Bradley Dice (https://github.com/bdice)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #15905
Labels
improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
7 participants