
Add bitwise reduce operator #240

Open
wants to merge 1 commit into master

Conversation

@wangxicoding commented Jul 13, 2019

nccl-tests for bit redop NVIDIA/nccl-tests#25

@sjeaugey (Member)

Thanks for the PR.

Can we get a bit of rationale behind this PR, as well as some performance numbers and the library size increase?

On performance, getting full bandwidth should not be hard since bitwise operations should be fast, but it's still good to confirm that.

Regarding library size, I assume the idea here is to only implement bitwise operations for int8 (even if we accept any datatype, we end up implementing it only as int8).

Finally, what are the main use cases behind and/or/xor? If the library size increase is limited, maybe it is fine to add all 3, but otherwise we might want to add them only as they are needed.

@wangxicoding (Author)
Thank you for your reply.

On my machine (CUDA 10, gcc 4.8.2, CentOS 6.3), the library sizes are:
master: .so=108M, .a=112M
this PR: .so=116M, .a=121M

As you say, the bitwise operations are implemented only for int8: since they operate bit by bit, the data is effectively untyped and the result is the same for any datatype.

Performance is as follows, with P40 cards and 100Gb RDMA (RoCEv2, no GDR). `band` performs the same as `sum`, almost reaching full bandwidth:

[screenshot: 1 node, 8 cards, int32 allreduce, sum]
[screenshot: 1 node, 8 cards, int32 allreduce, band]
[screenshot: 2 nodes, 16 cards, int32 allreduce, sum]
[screenshot: 2 nodes, 16 cards, int32 allreduce, band]

There may also be room to specialize MULTI&lt;BitOp, T&gt; to use 64-bit operations.

As for use cases: bit operations can be used to exchange some metadata, and sparse operations may be an even better fit for them.

For example, suppose we have two cards holding sparse data:

gpu 0: data int(3 5 0 0 3 0 0 0), bitmap bit(11001000)
gpu 1: data int(0 1 0 0 2 4 0 0), bitmap bit(01001100)

For a sum allreduce:

1. BitOr-allreduce the bitmaps: both GPUs get bit(11001100).
2. Gather the data at the set bit positions: gpu 0 has int(3 5 3 0), gpu 1 has int(0 1 2 4).
3. Sum-allreduce the gathered data: both GPUs get int(3 6 5 4).
4. Scatter the result back to the original positions: both GPUs get int(3 6 0 0 5 4 0 0).

For a prod allreduce, we instead BitAnd the bitmaps, so both GPUs get bit(01001000). Following the same gather/reduce/scatter steps, the final result on both GPUs is int(0 5 0 0 6 0 0 0).

Of course, there are many methods for sparse data, such as allgather, but none of them suits every scenario; the method above may be better in some cases.
By the way, BitAnd/BitOr are not strictly necessary: MAX/MIN can achieve the same function, but they need extra bytes, so they may not be cost-effective.

Finally, if we can't add all three, I think xor can be excluded.

@rongou (Contributor) commented Jun 28, 2023

@sjeaugey can we consider merging this? I do have a use case for XGBoost (see dmlc/xgboost#9300).
