
Add bitwise reduce operator #240

Open
wants to merge 1 commit into master

Conversation

@wangxicoding commented Jul 13, 2019

nccl-tests for bit redop NVIDIA/nccl-tests#25

@sjeaugey (Member)

Thanks for the PR.

Can we get a bit of rationale behind this PR, as well as some performance numbers and the library size increase?

On performance, getting full bandwidth should not be hard since bitwise operations should be fast, but it's still good to confirm that.

Regarding library size, I assume the idea here is to only implement bitwise operations for int8 (even if we accept any datatype, we end up implementing it only as int8).

Finally, what are the main use cases behind and/or/xor? If the library size increase is limited, maybe it is fine to add all 3, but otherwise we might want to add them only as they are needed.

@wangxicoding (Author)
Thank you for your reply.

On my machine (CUDA 10, gcc 4.8.2, CentOS 6.3), the library sizes are:
master: .so=108M, .a=112M
this PR: .so=116M, .a=121M

As you say, the bitwise operations are implemented only for int8: since they operate bit by bit, the data is effectively untyped and the result is the same for any datatype.

Performance is as follows, with P40 cards and 100Gb RDMA (RoCEv2, no GDR). `band` performs the same as `sum`, almost reaching full bandwidth:

[screenshot: 1 node, 8 cards, int32 allreduce, sum]
[screenshot: 1 node, 8 cards, int32 allreduce, band]
[screenshot: 2 nodes, 16 cards, int32 allreduce, sum]
[screenshot: 2 nodes, 16 cards, int32 allreduce, band]

There may also be room to specialize MULTI&lt;BitOp, T&gt; to use 64-bit operations.

As for use cases: bit operations can be used to exchange some metadata, and sparse operations may be an even better fit for them.

For example, suppose we have two cards holding sparse data:

gpu 0: data int(3 5 0 0 3 0 0 0), bitmap bit(11001000)
gpu 1: data int(0 1 0 0 2 4 0 0), bitmap bit(01001100)

For a sum allreduce:

1. BitOr-allreduce the bitmaps: both GPUs get bit(11001100).
2. Gather the data at the set bit positions: gpu 0 has int(3 5 3 0), gpu 1 has int(0 1 2 4).
3. Sum-allreduce the gathered data: both GPUs get int(3 6 5 4).
4. Scatter the result back to the original positions: both GPUs get int(3 6 0 0 5 4 0 0).

For a prod allreduce, we instead BitAnd the bitmaps, so both GPUs get bit(01001000). Following the same gather/reduce/scatter steps, the final result on both GPUs is int(0 5 0 0 6 0 0 0).

Of course, there are many methods for sparse data, such as allgather, but none of them suits every scenario; the method above may be better in some cases.
By the way, BitAnd/BitOr are not strictly necessary: MAX/MIN can achieve the same function, but they need extra bytes, so they may not be cost-effective.

Finally, if we can't add all three, I think xor can be excluded.

@rongou (Contributor) commented Jun 28, 2023

@sjeaugey can we consider merging this? I do have a use case for XGBoost (see dmlc/xgboost#9300).
