Communication of each TPU core sending its input to the next core with pjit? #15052

AranKomat · 2023-03-17T10:48:54Z

I need to implement a form of sequence parallelism in which each core works on one segment of a long sequence. For local attention, each core must send its input to the core that handles the next segment, with all cores doing so at the same time. The process looks like this:

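import jax.numpy as jnp
from jax.lax import with_sharding_constraint
from jax.sharding import PartitionSpec as P

num_segs, seg_len, dim = x.shape
shard = lambda x: with_sharding_constraint(x, P("num_segs", None, "embed"))
x = shard(x)
# Row i of x_rot is segment i - 1; row 0 wraps around to the last segment.
rot = jnp.array([num_segs - 1] + list(range(num_segs - 1)))
x_rot = x[rot]
x_rot = shard(x_rot)
y = jnp.concatenate([x_rot, x], axis=1)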

In principle, communication only has to happen from core_n to core_n+1, which is cheap. But does that hold in practice? I haven't seen an example of this pattern yet, so I'm concerned that it may not be supported.
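For reference, the intended rotation can be written without any sharding as a roll by one along the segment axis. This is only a single-device sketch of the semantics (the shapes and values are made up for illustration) and says nothing about how the compiler would partition it:

```python
import jax.numpy as jnp

# Hypothetical small shapes, chosen only for illustration.
num_segs, seg_len, dim = 4, 2, 3
x = jnp.arange(num_segs * seg_len * dim, dtype=jnp.float32)
x = x.reshape(num_segs, seg_len, dim)

# Segment i receives segment i - 1; segment 0 wraps around to the last one.
x_rot = jnp.roll(x, shift=1, axis=0)

# Each row now holds [previous segment, own segment] along the sequence axis.
y = jnp.concatenate([x_rot, x], axis=1)  # shape (num_segs, 2 * seg_len, dim)
```

If the wrap-around from the last segment to the first is not wanted (for causal local attention, say), the first row of x_rot would have to be masked or zeroed separately.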

andyehrenberg · 2023-03-18T22:48:55Z

I'd probably check the output of pjit(fn).lower(*args).compile().as_text() and see whether it contains what you're looking for (perhaps a collective-permute, which is what ppermute lowers to). If it isn't behaving as expected, you could try shard_map, which gives you more manual control over parallelism inside a function you want to jit/pjit.
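A minimal sketch of both suggestions, under some assumptions: a 1-D mesh over all available devices with a single axis named "num_segs" (the "embed" axis from the question is left unsharded here), and a hypothetical helper called concat_with_previous. The neighbor exchange is written explicitly with jax.lax.ppermute inside shard_map, and the compiled text is then searched for a collective-permute. The import fallback is there because the location of shard_map differs between JAX releases.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, PartitionSpec as P

try:  # newer JAX releases expose shard_map at the top level
    from jax import shard_map
except ImportError:  # older releases keep it under jax.experimental
    from jax.experimental.shard_map import shard_map

devices = np.array(jax.devices())
n = devices.size
mesh = Mesh(devices, ("num_segs",))  # assumed 1-D mesh over the segment axis

# Device i sends to device i + 1; the last device wraps around to device 0.
perm = [(i, (i + 1) % n) for i in range(n)]

def concat_with_previous(x_local):
    # x_local: (segments_per_device, seg_len, dim), the block held by one device.
    # Only the last local segment has to cross a device boundary.
    from_prev_device = jax.lax.ppermute(x_local[-1:], "num_segs", perm)
    prev = jnp.concatenate([from_prev_device, x_local[:-1]], axis=0)
    return jnp.concatenate([prev, x_local], axis=1)

spec = P("num_segs", None, None)
fn = jax.jit(
    shard_map(concat_with_previous, mesh=mesh, in_specs=(spec,), out_specs=spec)
)

# Two segments per device, so the leading axis always divides evenly.
x = jnp.arange(2 * n * 4 * 8, dtype=jnp.float32).reshape(2 * n, 4, 8)
y = fn(x)

# Inspect the compiled program for the expected collective.
hlo_text = fn.lower(x).compile().as_text()
has_permute = "collective-permute" in hlo_text
print("collective-permute in compiled text:", has_permute)
```

On a single device the compiler may simplify the permute away, so the printed flag is only informative on a multi-device mesh. The same lower(...).compile().as_text() check can be applied to the original pjit version to see whether it compiles to a collective-permute or to something more expensive such as an all-gather.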
