Skip to content
This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Use CUB's ScanByKey implementation #1555

Merged
merged 5 commits into from
May 5, 2022

Conversation

alliepiper
Copy link
Collaborator

I will update thrust_benchmark to test this for perf regressions before merging. (Hence the blocked tag)

cc: @zasdfgbnm for awareness (don't want to duplicate effort)

run tests

@alliepiper alliepiper added type: enhancement New feature or request. blocked Cannot make progress. P2: nice to have Desired, but not necessary. testing: gpuCI in progress Started gpuCI testing. labels Oct 26, 2021
@alliepiper alliepiper added this to the 1.16.0 milestone Oct 26, 2021
@alliepiper alliepiper self-assigned this Oct 26, 2021
@alliepiper
Copy link
Collaborator Author

run tests

@alliepiper
Copy link
Collaborator Author

DVS CL: 30580794

@alliepiper alliepiper added the testing: internal ci in progress Currently testing on internal NVIDIA CI (DVS). label Oct 26, 2021
@zasdfgbnm
Copy link
Contributor

zasdfgbnm commented Oct 26, 2021

There is one thing that might impact the performance on thrust and libcxx pointers. In cub, https://github.com/NVIDIA/cub/blob/main/cub/agent/agent_scan_by_key.cuh#L124-L129

    using WrappedKeysInputIteratorT = typename If<IsPointer<KeysInputIteratorT>::VALUE,
        CacheModifiedInputIterator<AgentScanByKeyPolicyT::LOAD_MODIFIER, KeyT, OffsetT>,   // Wrap the native input pointer with CacheModifiedInputIterator
        KeysInputIteratorT>::Type;
    using WrappedValuesInputIteratorT = typename If<IsPointer<ValuesInputIteratorT>::VALUE,
        CacheModifiedInputIterator<AgentScanByKeyPolicyT::LOAD_MODIFIER, InputT, OffsetT>,   // Wrap the native input pointer with CacheModifiedInputIterator
        ValuesInputIteratorT>::Type;

These two lines are copied from agent_scan.cuh. But for the implementation of scan-by-key in thrust, the condition for using CacheModifiedInputIterator is is_contiguous_iterator, which is

template <typename Iterator>
struct is_contiguous_iterator_impl
  : integral_constant<
      bool
    ,    is_pointer<Iterator>::value
      || is_thrust_pointer<Iterator>::value
      || is_libcxx_wrap_iter<Iterator>::value
      || is_libstdcxx_normal_iterator<Iterator>::value
      || is_msvc_contiguous_iterator<Iterator>::value
      || proclaim_contiguous_iterator<Iterator>::value
    >
{};

@alliepiper alliepiper modified the milestones: 1.16.0, 1.17.0 Feb 7, 2022
Copy link
Collaborator

@gevtushenko gevtushenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth addressing @zasdfgbnm comment by implementing a make_input_iterator version without caching policy:

  template <class It>
  auto __device__ __forceinline__
  make_load_iterator_impl(It it, thrust::detail::true_type /* is_trivial */)
  {
    return raw_pointer_cast(&*it);
  }

  template <class It>
  It __device__ __forceinline__
  make_load_iterator_impl(It it, thrust::detail::false_type /* is_trivial */)
  {
    return it;
  }

  template <class It>
  typename LoadIterator<It>::type __device__ __forceinline__
  make_load_iterator(It it)
  {
    return make_load_iterator_impl(
        it, typename is_contiguous_iterator<It>::type());
  }

This version would be used before dispatching into CUB code. I think will need this kind of facility later in any case.

thrust/system/cuda/detail/scan_by_key.h Show resolved Hide resolved
@alliepiper alliepiper removed testing: internal ci in progress Currently testing on internal NVIDIA CI (DVS). testing: gpuCI in progress Started gpuCI testing. labels Apr 13, 2022
@alliepiper
Copy link
Collaborator Author

It might be worth addressing @zasdfgbnm comment by implementing a make_input_iterator version without caching policy:
[snip]
This version would be used before dispatching into CUB code. I think will need this kind of facility later in any case.

Agreed -- I'll add something like this to the contiguous iterator implementation.

@alliepiper
Copy link
Collaborator Author

run tests

@alliepiper
Copy link
Collaborator Author

  • Added thrust::detail::try_unwrap_contiguous_iterator(Iterator it) that returns a raw pointer for contiguous iterators and passes-through other iterators.
  • Use above in new scan_by_key implementation.
  • Restored the short-circuit num_items==0 that avoids temp mem allocs.

Still need to write/verify benchmarks, and if we're happy with the new contiguous iterator unwrapping API I'll update the other CUB-backed algorithms in a followup.

@alliepiper
Copy link
Collaborator Author

run tests

@alliepiper
Copy link
Collaborator Author

run tests

@alliepiper
Copy link
Collaborator Author

run tests

@alliepiper alliepiper removed the blocked Cannot make progress. label Apr 27, 2022
@alliepiper
Copy link
Collaborator Author

run tests

@alliepiper
Copy link
Collaborator Author

run tests

All off these are internal implementation details in the
`thrust::detail` namespace:

Contiguous iterators only:
- `contiguous_iterator_traits`
- `contiguous_iterator_raw_pointer_t`:
- `contiguous_iterator_raw_pointer_cast`

These work on all iterators, but convert to a
raw pointer if given a contiguous iterator.
- `try_unwrap_contiguous_iterator_return_t`
- `try_unwrap_contiguous_iterator`
@alliepiper
Copy link
Collaborator Author

Split the scan_by_key test into separate exclusive and inclusive variants to reduce memory usage during compilation.

run tests

This test was consuming excessive memory during nvc++ compilation.
Splitting into two TUs should remedy this.

Ran clang-format on the new test files, but the contents are the same.
@alliepiper
Copy link
Collaborator Author

run tests

@alliepiper alliepiper added the testing: gpuCI passed Passed gpuCI testing. label May 4, 2022
Copy link
Collaborator

@gevtushenko gevtushenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments, but nothing critical. Thank you for separating the test!

true>::value));
THRUST_STATIC_ASSERT((check_unwrapped_iterator<T *,
T *,
false>::value));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to test for check_unwrapped_iterator<T, T, true> + check_unwrapped_iterator<T, T, false> that I'm missing?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the unwrapped raw pointer is the same type as the pass-through, both the true and false cases should pass here. Testing both just ensures that everything is working as expected.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the unwrapped raw pointer is the same type as the pass-through, both the true and false cases should pass here. Testing both just ensures that everything is working as expected.

testing/is_contiguous_iterator.cu Show resolved Hide resolved
typename ScanOpT,
typename SizeT>
__host__ __device__
ValuesOutIt exclusive_scan_by_key_n(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: this function is a duplicate on inclusive version. We might always have InitValueT and pass cub::NullType{} to indicate exclusiveness. The error messages will get back to scan_by_key, but it was like that before anyway.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I'll clean this up to reduce duplication.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, let's keep these separate -- otherwise we'll just have to re-split them when we address NVIDIA/cub#384 and add InitialValue support to inclusive scans.

thrust/type_traits/is_contiguous_iterator.h Show resolved Hide resolved
@alliepiper
Copy link
Collaborator Author

run tests

@alliepiper alliepiper added testing: gpuCI in progress Started gpuCI testing. and removed testing: gpuCI passed Passed gpuCI testing. labels May 5, 2022
@alliepiper alliepiper merged commit 55d2b87 into NVIDIA:main May 5, 2022
@alliepiper alliepiper deleted the use_cub_scan_by_key branch May 5, 2022 23:08
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
P2: nice to have Desired, but not necessary. testing: gpuCI in progress Started gpuCI testing. type: enhancement New feature or request.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants