XLA not fusing reduce for CPU tensor with > 32 elements #16792

Open
JeffGreen opened this issue Sep 4, 2024 · 1 comment
@JeffGreen

Hello!

As per jax-ml/jax#23427, I'm noticing that XLA on CPU doesn't fuse the reduction sum for a very simple function when the input tensor has more than 32 elements:

import jax
import jax.numpy as jnp

@jax.jit
def func_a(input_tensor):
    # Elementwise cosine followed by a full reduction to a scalar.
    return jnp.sum(jnp.cos(input_tensor))

n_elem = 32

print("************LOWERED**********")
print(jax.jit(func_a).lower(jnp.ones(n_elem, dtype=jnp.float32)).as_text())

print("************COMPILED**********")
print(jax.jit(func_a).lower(jnp.ones(n_elem, dtype=jnp.float32)).compile().as_text())

If I run this with n_elem = 32 and then again with n_elem = 33, I get the same lowered StableHLO but different compiled output. For the tensor of length 32, I see loop fusion:

HloModule jit_func_a, is_scheduled=true, entry_computation_layout={(f32[32]{0})->f32[]}, allow_spmd_sharding_propagation_to_parameters={true}, allow_spmd_sharding_propagation_to_output={true}

%region_0.2 (Arg_0.3: f32[], Arg_1.4: f32[]) -> f32[] {
  %Arg_0.3 = f32[] parameter(0), metadata={op_name="jit(func_a)/jit(main)/jit(func_a)/reduce_sum"}
  %Arg_1.4 = f32[] parameter(1), metadata={op_name="jit(func_a)/jit(main)/jit(func_a)/reduce_sum"}
  ROOT %add.5 = f32[] add(f32[] %Arg_0.3, f32[] %Arg_1.4), metadata={op_name="jit(func_a)/jit(main)/jit(func_a)/reduce_sum" source_file="/home/jgreen/gits/learn/python/scratch/jax_reduce_bug.py" source_line=7}
}

%fused_computation (param_0.2: f32[32]) -> f32[] {
  %param_0.2 = f32[32]{0} parameter(0)
  %cosine.1 = f32[32]{0} cosine(f32[32]{0} %param_0.2), metadata={op_name="jit(func_a)/jit(main)/jit(func_a)/cos" source_file="/home/jgreen/gits/learn/python/scratch/jax_reduce_bug.py" source_line=7}
  %constant.1 = f32[] constant(0)
  ROOT %reduce.1 = f32[] reduce(f32[32]{0} %cosine.1, f32[] %constant.1), dimensions={0}, to_apply=%region_0.2, metadata={op_name="jit(func_a)/jit(main)/jit(func_a)/reduce_sum" source_file="/home/jgreen/gits/learn/python/scratch/jax_reduce_bug.py" source_line=7}
}

ENTRY %main.12 (Arg_0.1: f32[32]) -> f32[] {
  %Arg_0.1 = f32[32]{0} parameter(0), metadata={op_name="input_tensor"}
  ROOT %fusion = f32[] fusion(f32[32]{0} %Arg_0.1), kind=kLoop, calls=%fused_computation, metadata={op_name="jit(func_a)/jit(main)/jit(func_a)/reduce_sum" source_file="/home/jgreen/gits/learn/python/scratch/jax_reduce_bug.py" source_line=7}
}

For the tensor of length 33, the fusion goes away and I see two passes over the data (one for the cosine, then another for the reduction):

HloModule jit_func_a, is_scheduled=true, entry_computation_layout={(f32[33]{0})->f32[]}, allow_spmd_sharding_propagation_to_parameters={true}, allow_spmd_sharding_propagation_to_output={true}

%region_0.2 (Arg_0.3: f32[], Arg_1.4: f32[]) -> f32[] {
  %Arg_0.3 = f32[] parameter(0), metadata={op_name="jit(func_a)/jit(main)/jit(func_a)/reduce_sum"}
  %Arg_1.4 = f32[] parameter(1), metadata={op_name="jit(func_a)/jit(main)/jit(func_a)/reduce_sum"}
  ROOT %add.5 = f32[] add(f32[] %Arg_0.3, f32[] %Arg_1.4), metadata={op_name="jit(func_a)/jit(main)/jit(func_a)/reduce_sum" source_file="/home/jgreen/gits/learn/python/scratch/jax_reduce_bug.py" source_line=7}
}

ENTRY %main.12 (Arg_0.1: f32[33]) -> f32[] {
  %Arg_0.1 = f32[33]{0} parameter(0), metadata={op_name="input_tensor"}
  %cosine.0 = f32[33]{0} cosine(f32[33]{0} %Arg_0.1), metadata={op_name="jit(func_a)/jit(main)/jit(func_a)/cos" source_file="/home/jgreen/gits/learn/python/scratch/jax_reduce_bug.py" source_line=7}
  %constant.0 = f32[] constant(0)
  %reduce-window = f32[2]{0} reduce-window(f32[33]{0} %cosine.0, f32[] %constant.0), window={size=32 stride=32 pad=15_16}, to_apply=%region_0.2
  ROOT %reduce.0 = f32[] reduce(f32[2]{0} %reduce-window, f32[] %constant.0), dimensions={0}, to_apply=%region_0.2, metadata={op_name="jit(func_a)/jit(main)/jit(func_a)/reduce_sum" source_file="/home/jgreen/gits/learn/python/scratch/jax_reduce_bug.py" source_line=7}
}
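For readers following the length-33 dump, here is a minimal pure-Python sketch of what the `reduce-window` plus `reduce` pair computes. It assumes the usual reduce-window behaviour of padding with the init value (0 here); the helper names `reduce_window_output_len` and `two_stage_sum` are made up for illustration and are not XLA APIs. The window parameters are copied from the dump (`size=32 stride=32 pad=15_16`).

```python
import math

def reduce_window_output_len(n, size, stride, pad_lo, pad_hi):
    """Number of windows over a 1-D input of length n after padding."""
    padded = n + pad_lo + pad_hi
    return (padded - size) // stride + 1

def two_stage_sum(values, size=32, stride=32, pad_lo=15, pad_hi=16):
    """Emulate reduce-window (partial sums) followed by a final reduce."""
    # Assumption: padding elements equal the reduction's init value (0 for a sum).
    padded = [0.0] * pad_lo + list(values) + [0.0] * pad_hi
    n_windows = reduce_window_output_len(len(values), size, stride, pad_lo, pad_hi)
    partials = [sum(padded[i * stride : i * stride + size]) for i in range(n_windows)]
    return partials, sum(partials)

# 33 inputs of 1.0, matching jnp.ones(33) passed through cosine.
cosines = [math.cos(1.0)] * 33
partials, total = two_stage_sum(cosines)
print(len(partials))                        # 2, matching the f32[2] reduce-window result
print(math.isclose(total, sum(cosines)))    # True
```

With 33 elements and 15 + 16 padding, the padded length is 64, which gives exactly two windows of 32. That is where the intermediate `f32[2]` in the dump comes from. The sketch only illustrates the arithmetic; it says nothing about why the fusion pass skips this form.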

I'd expect both cases to be fused - am I missing something here?

Thanks.
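To compare several sizes without reading each dump by hand, something like the following sketch could scan the compiled text for the opcodes of interest. The function name `summarize_hlo` and the regular expression are assumptions made for illustration, not part of JAX or XLA, and the pattern only handles simple non-tuple result types like the ones in the dumps above. The two sample strings are abbreviated lines from those dumps with the metadata dropped.

```python
import re

def summarize_hlo(hlo_text):
    """Report whether fusion / reduce-window instructions appear in an HLO dump."""
    # Instruction lines look like `%name = <type> <opcode>(operands), ...`.
    # Requiring whitespace after `=` skips attributes such as `kind=kLoop`.
    opcodes = re.findall(r"=\s+\S+\s+([a-z\-]+)\(", hlo_text)
    return {
        "fusion": "fusion" in opcodes,
        "reduce_window": "reduce-window" in opcodes,
    }

# Abbreviated from the length-32 dump.
fused_sample = (
    "ROOT %fusion = f32[] fusion(f32[32]{0} %Arg_0.1), "
    "kind=kLoop, calls=%fused_computation"
)
# Abbreviated from the length-33 dump.
unfused_sample = (
    "%reduce-window = f32[2]{0} reduce-window(f32[33]{0} %cosine.0, "
    "f32[] %constant.0), window={size=32 stride=32 pad=15_16}, to_apply=%region_0.2"
)

print(summarize_hlo(fused_sample))    # {'fusion': True, 'reduce_window': False}
print(summarize_hlo(unfused_sample))  # {'fusion': False, 'reduce_window': True}
```

In practice the input would be the string returned by `jax.jit(func_a).lower(...).compile().as_text()` from the reproduction above, called once per value of `n_elem`.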

NaiyerRizz self-assigned this Oct 24, 2024
@NaiyerRizz

I am able to reproduce the issue. Here is a link to a Colab gist that reproduces it:
https://colab.sandbox.google.com/gist/NaiyerRizz/b847c55ac24d613e226f20e70cc13540/issue.ipynb#scrollTo=2a-qQosoemSv
