
[ENHANCEMENT]: Add a callback to the bulk_insert functions #376

Open
esoha-nvidia opened this issue Sep 28, 2023 · 5 comments
Labels
type: feature request New feature request

Comments

@esoha-nvidia

Is your feature request related to a problem? Please describe.

I would like to be able to run a function on every key that is inserted by the bulk insert functions.

Describe the solution you'd like

struct MyFunctor {
  // Called once per input key: `slot` points to the slot the key occupies and
  // `is_unique` is true if this call inserted the key, false if it was already present.
  __device__ void operator()(slot_type* slot, bool is_unique) {
    ...
  }
};

my_set.insert(input.begin(), input.end(), MyFunctor());

Each insert would pass the callback the slot and a flag indicating whether the key was newly inserted or was already present (in which case the call was effectively a lookup); I can then do whatever I want with that information. For example, I could use an atomic counter to fill an array with the unique elements, letting me perform a "retrieve_all" during the insert without reading the table twice.
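
A minimal sketch of the kind of callback I have in mind, assuming the proposed signature above (CollectUnique, d_unique, and d_count are hypothetical names; for a set, the slot holds the key itself):

#include <cuda/atomic>

template <typename Key, typename SlotType>
struct CollectUnique {
  Key* d_unique;                                          // output buffer for unique keys
  cuda::atomic<int, cuda::thread_scope_device>* d_count;  // running count of unique keys

  __device__ void operator()(SlotType* slot, bool is_unique) const {
    if (is_unique) {
      // Claim the next position in the output buffer and record the newly
      // inserted key, effectively performing retrieve_all during the insert.
      auto const idx = d_count->fetch_add(1, cuda::std::memory_order_relaxed);
      d_unique[idx] = *slot;
    }
  }
};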

Describe alternatives you've considered

I can copy and paste all of cuco's bulk insert code into my own project and use that. The problem is that I trust cuco to get the grid shape and other launch details right, and if I maintain my own copy I might get them wrong. I would also miss out on any future improvements cuco makes to the bulk insert.

Additional context

No response

@jrhemstad
Collaborator

Hm, I'm not against this, but this kind of custom operation seems like what the device-side API is intended for.

@esoha-nvidia
Author

I'd like to avoid duplicating the work that cuco has already done. So if I can use the bulk operations, all the better!

@sleeepyjack
Collaborator

sleeepyjack commented Sep 29, 2023

I totally get your point about not wanting to write a kernel that already exists, 90% complete, in cuco.

It's really about API flexibility:

  • Our host bulk APIs aren't very flexible, but they solve a simple job fast with a single line of code.
  • The device APIs give you full control (they even allow kernel fusion, like we did in our GROUPBY work), but they require a considerable amount of effort on the developer's side to get right.

The solution you are proposing is somewhere in between:
You get the benefit of the easy-to-use host API, plus some additional bells and whistles which otherwise only the device API could give you.

So, when a developer plans to use cuco for a particular complex task, they have the following thought process:

  1. Does the host API suffice to solve my problem? -> No.
  2. Does a custom functor work? -> Maybe. Requires some investigation.
  3. Does the device API solve my problem? -> Most likely yes, but also takes some time to get it right.

I'm not convinced step 2 is helpful when we can just move to step 3 directly and optimize the hell out of the custom kernel.

However, we could solve your particular problem with a different approach:
Let's say we provide a bulk API for insert_and_find, returning the tuples {slot_reference, is_new}.
The output tuples will be written to some output iterator.
Instead of naively writing the output to some memory location directly, you could employ one of Thrust's fancy iterators, namely thrust::transform_[input_]output_iterator, which takes your custom functor and applies it to the output pairs.

So you can achieve the same thing that you asked for by using the bulk API with Thrust fancy iterators.
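
A sketch of that composition, assuming a bulk insert_and_find exists as proposed (ApplyCallback stands in for the user's functor, and the exact result-tuple type is an assumption):

#include <thrust/iterator/discard_iterator.h>
#include <thrust/iterator/transform_output_iterator.h>

struct ApplyCallback {
  template <typename ResultTuple>  // e.g. a {slot_reference, is_new} pair
  __device__ int operator()(ResultTuple const& result) const {
    // ... arbitrary per-result user code runs here, on the device ...
    return 0;  // dummy value, swallowed by the discard iterator below
  }
};

// The functor fires as each result is assigned through the iterator, so no
// intermediate buffer of result tuples is ever materialized:
auto out = thrust::make_transform_output_iterator(thrust::make_discard_iterator(),
                                                  ApplyCallback{});
// my_set.insert_and_find(input.begin(), input.end(), out);  // hypothetical bulk API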

@PointKernel @jrhemstad I have one minor concern about my proposal:
We use a shmem bounce buffer to stage the result tuples and then flush them to the output iterator using memcpy_async. Does this work with Thrust fancy iterators?

@sleeepyjack sleeepyjack added the type: feature request New feature request label Sep 29, 2023
@esoha-nvidia
Author

The output iterator is a nice idea. There's no way it works with memcpy_async, though, right? Surely memcpy_async requires the destination to be a real pointer. Worse still, it might compile and work just fine on A100 and earlier but fail on Hopper.
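
For reference, here is roughly what one of the cuda::memcpy_async overloads from libcu++'s <cuda/pipeline> looks like (abbreviated sketch, not the full overload set):

// Abbreviated: the destination and source parameters are raw pointers.
template <typename Group, cuda::thread_scope Scope>
void cuda::memcpy_async(Group const& group,
                        void* destination, void const* source,
                        std::size_t size, cuda::pipeline<Scope>& pipeline);

A fancy iterator is not convertible to void*, so it could not serve as the destination here.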

@sleeepyjack
Collaborator

Correction: We use the memcpy_async approach only in multimap. For static_map/static_set we store the output directly:

*(output_begin + idx) = found == ref.end() ? ref.empty_value_sentinel() : (*found).second;

So using a fancy iterator should work in this scenario.
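
For illustration, a sketch of composing the existing bulk find with a fancy iterator (MyFunctor, Value, keys, num_keys, and map are placeholders; the functor must be __device__-callable since the assignment happens inside the kernel):

#include <thrust/device_vector.h>
#include <thrust/iterator/transform_output_iterator.h>

// Each in-kernel assignment through `out` runs MyFunctor on the found value
// before storing its result in d_results.
thrust::device_vector<Value> d_results(num_keys);
auto out = thrust::make_transform_output_iterator(d_results.begin(), MyFunctor{});
map.find(keys.begin(), keys.end(), out);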
