Skip to content

Patterns and behaviors

Sean Baxter edited this page May 15, 2016 · 2 revisions

moderngpu exposes functionality through templated pattern functions. The user specializes these patterns with concrete behaviors, typically written as C++ lambdas. The pattern launches a kernel, computes a context for each work-item, and invokes the behavior on each work-item, providing the corresponding context.

moderngpu provides patterns useful for many operations:

  1. cta_launch - Launch a grid of cooperative thread arrays (CTAs) and pass the behavior function the index of the thread (threadIdx.x) and CTA (blockIdx.x).
  2. cta_transform - Launch a grid of CTAs, but size them according to the number of work-items that can be processed given the architecture's specified launch box.
  3. transform - A non-cooperative method. Invoke the behavior once for each work-item.
  4. transform_reduce - Call a behavior once for each work-item and recursively reduce the return values with a user-provided reducer. This pattern enables array-wide sum, max, and min operations.
  5. transform_scan - Like transform_reduce, but computes a reduction for each interval from the start of the array to each element. For an addition operator this is the prefix sum operation.
  6. transform_lbs - A vectorized and load-balanced transform implemented using load-balancing search. The caller specifies the geometry of the problem with a segments descriptor array and the behavior is invoked for each work-item, providing both the ID of the segment the work-item belongs to and its rank within the segment. This is the signature pattern of moderngpu.
  7. lbs_segreduce - Fold together all values in each segment and return one reduction value per segment. The behavior is invoked with the segment ID and rank of each work-item. This pattern makes for consistent performance for simultaneous processing of many irregularly-shaped problems.
  8. transform_compact - An efficient two-pass pattern for conditionally selecting and compacting elements of an array.
  9. lbs_workcreate - Dynamic work-creation accelerated with load-balancing search. This is a two-pass pattern. On the upsweep the pattern returns the number of output work-items to stream for each input work-item. On the downsweep, the pattern encodes parameters for each work-creating segment. This pattern solves many problems that CUDA Dynamic Parallelism was intended to solve, but with exact load-balancing and requiring no special hardware mechanisms.

moderngpu also includes traditional bulk-synchronous parallel general-purpose functions. Most of these can be adapted to the pattern-behavior model with the use of lambda iterators, which wrap lambda behavior functions in the interface of pointers.

  • reduce and scan are the iterator-oriented equivalents of transform_reduce and transform_scan.
  • merge, bulk_remove and bulk_insert are general-purpose array construction functions.
  • mergesort is a basic array-wide sort. segmented_sort is an advanced mergesort that sorts keys and values within segments defined by a segments descriptor array. Thanks to a novel short-circuiting feature, segmented_sort actually improves in performance as the number of segments to sort increases.
  • sorted_search is a vectorized sorted search. A binary search looks for a needle in a sorted haystack. The vectorized sorted search looks for an array of sorted needles in a sorted haystack. A problem with n needles and m haystack items requires _O(n log m) operations to binary search but only _O(n + m) operations to sorted search. This is a critical function for implementing database operations.
  • inner_join is a relational join operator. It's a demonstration of the power of combining vectorized sorted search with the load-balancing search and useful in its own right.