[Bug] Inconsistency caused by 65535f16*0f16 after using compute_inline #12377

Closed
cxx122 opened this issue Aug 11, 2022 · 5 comments
cxx122 commented Aug 11, 2022

TENSOR_0 = te.compute([14], lambda rck: te.max_value("float16") * te.min_value("uint16"), name="TENSOR_1")
TENSOR_1 = te.compute([11], lambda oco: te.max_value("uint16") * TENSOR_0[oco], name="TENSOR_2")
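For context, a worked sketch of the arithmetic involved (assuming te.max_value and te.min_value return the dtype's extreme values, i.e. 65504 for float16, and 0 and 65535 for uint16): TENSOR_0 evaluates to 65504 * 0 = 0, and TENSOR_1 multiplies it by 65535, which has no float16 representation and overflows to inf, so the runtime product is inf * 0 = nan:

import numpy as np

print(np.float16(65535))                  # inf: 65535 exceeds fp16's largest finite value, 65504
print(np.float16(65535) * np.float16(0))  # nan: IEEE 754 defines inf * 0 as nan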

The TIR program before compute_inline:

@main = primfn(TENSOR_1_1: handle, TENSOR_2_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {TENSOR_1: Buffer(TENSOR_1_2: Pointer(float16), float16, [14], []),
             TENSOR_2: Buffer(TENSOR_2_2: Pointer(float16), float16, [11], [])}
  buffer_map = {TENSOR_1_1: TENSOR_1, TENSOR_2_1: TENSOR_2}
  preflattened_buffer_map = {TENSOR_1_1: TENSOR_1_3: Buffer(TENSOR_1_2, float16, [14], []), TENSOR_2_1: TENSOR_2_3: Buffer(TENSOR_2_2, float16, [11], [])} {
  for (rck: int32, 0, 11) {
    TENSOR_1[rck] = 0f16
  }
  for (oco: int32, 0, 11) {
    TENSOR_2[oco] = (65535f16*TENSOR_1[oco])
  }
}

The TIR program after compute_inline (note that the multiplication 65535f16*TENSOR_1[oco] has been constant-folded away, leaving a plain 0f16 store):

@main = primfn(TENSOR_1_1: handle, TENSOR_2_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {TENSOR_1: Buffer(TENSOR_1_2: Pointer(float16), float16, [14], []),
             TENSOR_2: Buffer(TENSOR_2_2: Pointer(float16), float16, [11], [])}
  buffer_map = {TENSOR_1_1: TENSOR_1, TENSOR_2_1: TENSOR_2}
  preflattened_buffer_map = {TENSOR_1_1: TENSOR_1_3: Buffer(TENSOR_1_2, float16, [14], []), TENSOR_2_1: TENSOR_2_3: Buffer(TENSOR_2_2, float16, [11], [])} {
  for (oco: int32, 0, 11) {
    TENSOR_2[oco] = 0f16
  }
}

Actual behavior

AssertionError: 
Not equal to tolerance rtol=1e-05, atol=1e-07

x and y nan location mismatch:
 x: array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
      dtype=float16)
 y: array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float16)

Environment

Operating System: Ubuntu 18.04; TVM version: tag 0.9.0 [d361585]

Steps to reproduce

import numpy as np
import tvm
from tvm import te
import tvm.testing

TENSOR_0 = te.compute([14], lambda rck: te.max_value("float16") * te.min_value("uint16"), name="TENSOR_1")
TENSOR_1 = te.compute([11], lambda oco: te.max_value("uint16") * TENSOR_0[oco], name="TENSOR_2")
s = te.create_schedule(TENSOR_1.op)
tensor_list = [TENSOR_0, TENSOR_1]

dev = tvm.cpu(0)
pre_list = []
after_list = []
for tensor in tensor_list:
    # Materialize each tensor's static shape; fall back to 1 for symbolic dims.
    shape = [x.value if 'value' in dir(x) and isinstance(x.value, int) else 1 for x in tensor.shape]
    params = (5 * np.random.uniform(size=shape)).astype(tensor.dtype)
    pre_list.append(tvm.nd.array(params.copy(), dev))
    after_list.append(tvm.nd.array(params.copy(), dev))

# Build and run before compute_inline.
pre_mod = tvm.lower(s, tensor_list, simple_mode=True)
with tvm.transform.PassContext(opt_level=4):
    f = tvm.build(pre_mod)
f(*pre_list)

s[TENSOR_0].compute_inline()

# Build and run after compute_inline.
now_mod = tvm.lower(s, tensor_list, simple_mode=True)
with tvm.transform.PassContext(opt_level=4):
    f = tvm.build(now_mod)
f(*after_list)

tvm.testing.assert_allclose(pre_list[1].numpy(), after_list[1].numpy(), rtol=1e-5)
cxx122 changed the title [Bug] Inconsistent caused by 65535f16*0f16 after using compute_inline [Bug] Inconsistency caused by 65535f16*0f16 after using compute_inline Aug 11, 2022

ganler commented Aug 17, 2022

@cxx122 Thanks for the report. It seems you are trying to compute "65535f16 * 0f16", which returns "nan" as an undefined result.


Since its output is "nan", and according to IEEE 754 "nan" is not comparable, I don't think it is suitable to regard this as an inconsistency bug, since the computation itself is ill-formed and undefined. From a fuzzing perspective, IMO, these should be regarded as false alarms, and the algorithm should try to avoid synthesizing programs with undefined behaviors (as CSmith does).
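To make the incomparability point concrete, here is a minimal NumPy sketch (not TVM-specific):

import numpy as np

a = np.float16(65535) * np.float16(0)  # inf * 0 -> nan
print(a == a)       # False: nan compares unequal to everything, itself included
print(np.isnan(a))  # True: the only reliable way to test for nan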


ganler commented Aug 17, 2022

Similarly, in many of the remaining bug reports, since opt_level=4 is specified, which enables fast-math optimizations, such numerical "inconsistencies" are highly likely when the computation is not well-formed.
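As a sketch of the opt_level point, the build step from the reproduction script could be compared across levels (pre_mod is the module produced by tvm.lower() above; the exact set of passes enabled per level depends on the TVM version):

import tvm

# Hypothetical comparison: the same lowered module built at two levels.
with tvm.transform.PassContext(opt_level=2):
    f_default = tvm.build(pre_mod)  # fewer aggressive rewrites
with tvm.transform.PassContext(opt_level=4):
    f_fast = tvm.build(pre_mod)     # may fold/reorder float arithmetic more freely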


cxx122 commented Aug 18, 2022

Thanks. When I submitted this bug, I also considered that it might be due to this problem. This may not be a bug in the strict sense.

wrongtest-intellif self-assigned this Aug 20, 2022
wrongtest-intellif commented

Actually there is no "65535f16"; it should be nan because it exceeds the maximum of fp16. This seems to be an issue in literal construction and constant folding.
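A small sketch of the literal-construction point (using tvm.tir.const to build the typed constant): 65535 has no float16 representation, so the printed literal already denotes a value the type cannot hold:

import numpy as np
import tvm

c = tvm.tir.const(65535, "float16")
print(c)                  # printed as a float16 literal, though fp16 cannot represent 65535
print(np.float16(65535))  # inf: rounding 65535 to the nearest fp16 overflows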


ganler commented Aug 20, 2022

@wrongtest-intellif Good point. "65535f16" is actually inf. But inf * 0 gets us a nan. :-)

cxx122 closed this as completed Sep 5, 2022