Hello!
As per jax-ml/jax#23427, I'm noticing that XLA on CPU doesn't fuse the reduction sum in a very simple function when the input tensor has more than 32 elements:
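(A minimal sketch of the repro, assuming the function is an elementwise cosine followed by a full reduction sum, as described below; `f` and `n_elem` are illustrative names.)

```python
import jax
import jax.numpy as jnp

n_elem = 33  # 32 reportedly fuses; 33 does not

def f(x):
    # Elementwise cosine followed by a full reduction sum.
    return jnp.sum(jnp.cos(x))

x = jnp.ones((n_elem,), dtype=jnp.float32)
print(jax.jit(f)(x))
```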
If I run this with n_elem = 32 and then again with n_elem = 33, I get the same lowered StableHLO but different compiled output. For the tensor of length 32, I see loop fusion:
For the tensor of length 33, the fusion goes away and I see two passes over the data (one for the cosine, another for the reduction):
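(For anyone wanting to compare the two stages themselves, a sketch using JAX's ahead-of-time lowering API, continuing from the repro above; the comments reflect the behavior described here, not verified output.)

```python
lowered = jax.jit(f).lower(x)
# Lowered StableHLO: reported identical for n_elem = 32 and 33.
print(lowered.as_text())
# Compiled HLO: where the fusion difference reportedly shows up.
print(lowered.compile().as_text())
```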
I'd expect both cases to be fused. Am I missing something here?
Thanks.