Test failure on Windows #4623

xkszltl · 2020-07-26T22:27:48Z

Describe the bug

[ RUN      ] TensorOpTest.CastFromString
D:\roaster-scratch\onnxruntime\onnxruntime\test\providers\provider_test_utils.cc(153): error: The difference between expected[i] and output[i] is nan, which exceeds threshold, where
expected[i] evaluates to -nan(ind),
output[i] evaluates to nan, and
threshold evaluates to 9.9999997473787516e-05.
i:6, provider_type: CPUExecutionProvider
D:\roaster-scratch\onnxruntime\onnxruntime\test\providers\provider_test_utils.cc(153): error: The difference between expected[i] and output[i] is nan, which exceeds threshold, where
expected[i] evaluates to -nan(ind),
output[i] evaluates to nan, and
threshold evaluates to 9.9999997473787516e-05.
i:7, provider_type: CPUExecutionProvider
[  FAILED  ] TensorOpTest.CastFromString (134 ms)

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Win Server 2019
ONNX Runtime installed from (source or binary): source
ONNX Runtime version: 1.4.0
Python version: 3.8.5
Visual Studio version (if applicable): 2019

Screenshots

The text was updated successfully, but these errors were encountered:

skottmckay · 2020-07-27T06:45:47Z

This is a little strange given these tests pass in CI builds (unit tests are run for all builds).

Looking at the test code the expected value should be NaN so not sure why it is -NaN in your build. Any chance something is redefining the NAN macro in your setup?

onnxruntime/onnxruntime/test/providers/cpu/tensor/tensor_op_test.cc

Lines 294 to 298 in 222fd08

    
           std::initializer_list<std::string> string_data = {"-inf", "+INF", "0.9767611f", "0.28280696f", 
        
                                                             "-0.12019656f", "5.0f", "NaN", "nan"}; 
        
           const std::initializer_list<float> float_output = {-(std::numeric_limits<float>::infinity()), std::numeric_limits<float>::infinity(), 
        
                                                              0.9767611f, 0.28280696f, 
        
                                                              -0.12019656f, 5.0f, NAN, NAN};

The Cast implementation uses std::stof in this case to convert the input so wouldn't be affected by the NAN macro

onnxruntime/onnxruntime/core/providers/cpu/tensor/cast_op.cc

Line 108 in 222fd08

mutable_data[i] = std::stof(in->Data<std::string>()[i]);

xkszltl · 2020-07-27T10:31:14Z

No I don't have any macro redefinition for NAN.

Also I noticed there's nan-specific check.
Why it is not captured before going through line 153?

onnxruntime/onnxruntime/test/providers/provider_test_utils.cc

Line 147 in 1f69a58

} else if (std::isnan(expected[i])) {

skottmckay · 2020-07-27T21:18:06Z

Not sure why the nan-specific check isn't being satisfied. It is on my machine, and expected is actually -nan(ind) . So the output from your run for expected and actual match what I see, but the test passes for me as the std::isnan check on line 147 is satisfied.

expected,8
0x000001ecf0c06920 {-inf, inf, 0.976761103, 0.282806963, -0.120196559, 5.00000000, -nan(ind), -nan(ind)}
    [0]: -inf
    [1]: inf
    [2]: 0.976761103
    [3]: 0.282806963
    [4]: -0.120196559
    [5]: 5.00000000
    [6]: -nan(ind)
    [7]: -nan(ind)
output,8
0x000001ecf2d9b080 {-inf, inf, 0.976761103, 0.282806963, -0.120196559, 5.00000000, nan, nan}
    [0]: -inf
    [1]: inf
    [2]: 0.976761103
    [3]: 0.282806963
    [4]: -0.120196559
    [5]: 5.00000000
    [6]: nan
    [7]: nan

Are you able to step into that call in a debugger? On Windows 10 it is defined in corecrt_math.h so wondering if there's something different in your setup.

    template <class _Ty>
    _Check_return_ inline bool isnan(_In_ _Ty _X) throw()
    {
        return fpclassify(_X) == FP_NAN;
    }

xkszltl · 2020-07-28T09:53:44Z

Maybe it is some strange compiler optimization?
This is how we compile it
https://github.com/xkszltl/Roaster/blob/8f7339d9d9cad1d60735d982035e95f4d867f74a/win/pkgs/ort.ps1#L150

We do have extra /GL and /LTCG but I think the negative sign is already strange in the first place.

skottmckay · 2020-07-28T11:22:24Z

Any chance you're using /fp:fast?

https://docs.microsoft.com/en-us/cpp/build/reference/fp-specify-floating-point-behavior?view=vs-2019

xkszltl · 2020-07-28T13:09:25Z

I don't add that explicitly.
As you can see in the script, here's all I add:

-DCMAKE_C_FLAGS="/GL /MP /Zi /arch:AVX2"
-DCMAKE_CXX_FLAGS="/EHsc /GL /MP /Zi /arch:AVX2"
-DCMAKE_EXE_LINKER_FLAGS="/DEBUG:FASTLINK /LTCG:incremental"
-DCMAKE_SHARED_LINKER_FLAGS="/DEBUG:FASTLINK /LTCG:incremental"
-DCMAKE_STATIC_LINKER_FLAGS="/LTCG:incremental"

I can try adding an extra /fp:precise at the end if you want.
Maybe /arch:AVX2 is more suspicious, but ort has official option for that as well.

xkszltl · 2020-07-28T14:14:12Z

Same error when I append /fp:precise:

PS D:\roaster-scratch\onnxruntime\build> .\Release\onnxruntime_test_all.exe --gtest_filter='*.CastFromString'
Note: Google Test filter = *.CastFromString
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from TensorOpTest
[ RUN      ] TensorOpTest.CastFromString
D:\roaster-scratch\onnxruntime\onnxruntime\test\providers\provider_test_utils.cc(153): error: The difference between expected[i] and output[i] is nan, which exceeds threshold, where
expected[i] evaluates to -nan(ind),
output[i] evaluates to nan, and
threshold evaluates to 0.004999999888241291.
i:6, provider_type: CPUExecutionProvider
D:\roaster-scratch\onnxruntime\onnxruntime\test\providers\provider_test_utils.cc(153): error: The difference between expected[i] and output[i] is nan, which exceeds threshold, where
expected[i] evaluates to -nan(ind),
output[i] evaluates to nan, and
threshold evaluates to 0.004999999888241291.
i:7, provider_type: CPUExecutionProvider
D:\roaster-scratch\onnxruntime\onnxruntime\core\providers\cuda\cuda_call.cc:123 onnxruntime::CudaCall D:\roaster-scratch\onnxruntime\onnxruntime\core\providers\cuda\cuda_call.cc:117 onnxruntime::CudaCall CUDA failure 35: CUDA driver version is insufficient for CUDA runtime version ; GPU=1129250816 ; hostname=Roaster-Win ; expr=cudaSetDevice(device_id_);


Provider:CUDAExecutionProvider
unknown file: error: C++ exception with description "D:\roaster-scratch\onnxruntime\onnxruntime\core\providers\cuda\cuda_call.cc:123 onnxruntime::CudaCall D:\roaster-scratch\onnxruntime\onnxruntime\core\providers\cuda\cuda_call.cc:117 onnxruntime::CudaCall CUDA failure 35: CUDA driver version is insufficient for CUDA runtime version ; GPU=1129250816 ; hostname=Roaster-Win ; expr=cudaSetDevice(device_id_);

" thrown in the test body.
[  FAILED  ] TensorOpTest.CastFromString (26 ms)
[----------] 1 test from TensorOpTest (28 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (43 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] TensorOpTest.CastFromString

 1 FAILED TEST

skottmckay · 2020-07-28T22:54:23Z

I'm not sure why std::isnan is failing to return true. Both -nan(ind) and nan should be considered to be a nan. https://stackoverflow.com/questions/61253965/nan-differences-stdnan-vs-quiet-nan-vs-macro-nan has some more details if that's helpful.

What happens if you build without any of the extra flags?

xkszltl · 2020-07-29T09:57:45Z

So far:

Removing /arch:AVX2 does not help.
Adding /fp:precise does not help.

Trying to remove /GL and /LTCG:incremental.

xkszltl · 2020-07-29T10:04:59Z

Quote from the stackoverflow answer:

By the way, it's like min/max macros, even if one had the bad idea to define them, don't use them

Love that.

Seems people also found their msvc has negative NAN.

xkszltl · 2020-07-29T10:23:32Z

Removing /GL and /LTCG:incremental works.
That usually indicates the existence of UB.

xkszltl · 2020-08-03T00:39:10Z

@skottmckay
Have you found the cause when you close the ticket?
LTO should not change the behavior of that if (isnan()), unless there's bug in code making compiler think it's always false.

skottmckay · 2020-08-03T03:38:10Z

Multiple of our CI builds enable LTO (all the release package builds do) and we don't see any issues there. Given that, what would be actionable from an onnxruntime perspective?

xkszltl · 2020-08-03T14:48:56Z

This is a strong (meaning I don't have other clue) indication about either msvc or ort was wrong.
Were you able to repro with the following on your side?

-DCMAKE_C_FLAGS="/GL /MP /Zi"
-DCMAKE_CXX_FLAGS="/EHsc /GL /MP /Zi"
-DCMAKE_EXE_LINKER_FLAGS="/DEBUG:FASTLINK /LTCG:incremental"
-DCMAKE_SHARED_LINKER_FLAGS="/DEBUG:FASTLINK /LTCG:incremental"
-DCMAKE_STATIC_LINKER_FLAGS="/LTCG:incremental"

We had trouble enabling ort's LTO flag due to conflict with protobuf used elsewhere in our system.
It used to be #902, and now it's banned.

++ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=gcc-8 -DCMAKE_CXX_COMPILER=g++-8 -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DCMAKE_CUDA_COMPILER_LAUNCHER=ccache '-DCMAKE_C_FLAGS=-fdebug-prefix-map='\''/tmp/scratch'\''
='\''/usr/local/src'\'' -g -march=haswell -mtune=generic' '-DCMAKE_CXX_FLAGS=-fdebug-prefix-map='\''/tmp/scratch'\''='\''/usr/local/src'\'' -g -march=haswell -mtune=generic' -DCMAKE_INSTALL_PREFIX=/tmp/scratch/onnxruntime/install.yUpv0xXdzi/root/usr/loca
l -DCMAKE_POLICY_DEFAULT_CMP0060=NEW -DCMAKE_VERBOSE_MAKEFILE=ON -DPython_ADDITIONAL_VERSIONS=3.6 -Deigen_SOURCE_PATH=/usr/local/include/eigen3 -Donnxruntime_BUILD_FOR_NATIVE_MACHINE=OFF -Donnxruntime_BUILD_SHARED_LIB=ON -Donnxruntime_CUDNN_HOME=/usr/lib
64 -Donnxruntime_ENABLE_LANGUAGE_INTEROP_OPS=ON -Donnxruntime_ENABLE_LTO=ON -Donnxruntime_ENABLE_PYTHON=ON -Donnxruntime_PREFER_SYSTEM_LIB=ON -Donnxruntime_RUN_ONNX_TESTS=ON -Donnxruntime_USE_CUDA=ON -Donnxruntime_USE_DNNL=ON -Donnxruntime_USE_EIGEN_FOR_
BLAS=ON -Donnxruntime_USE_FULL_PROTOBUF=ON -Donnxruntime_USE_JEMALLOC=OFF -Donnxruntime_USE_LLVM=ON -Donnxruntime_USE_MKLML=OFF -Donnxruntime_USE_NGRAPH=OFF -Donnxruntime_USE_NUPHAR=OFF -Donnxruntime_USE_OPENBLAS=OFF -Donnxruntime_USE_OPENMP=OFF -Donnxru
ntime_USE_PREINSTALLED_EIGEN=OFF -Donnxruntime_USE_TENSORRT=OFF -Donnxruntime_USE_TVM=OFF -GNinja ../cmake
-- The C compiler identification is GNU 8.4.0
-- The CXX compiler identification is GNU 8.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/gcc-8 - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/g++-8 - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found PythonInterp: /usr/bin/python3 (found suitable version "3.6.9", minimum required is "3.5") 
-- Found PythonLibs: /usr/lib/x86_64-linux-gnu/libpython3.6m.so (found suitable version "3.6.9", minimum required is "3.5") 
-- Found Protobuf: /usr/local/lib/libprotobuf.so;-lpthread (found version "3.12.4") 
Use protobuf from preinstalled system lib
CMake Warning at CMakeLists.txt:421 (message):
  Please enable Protobuf_USE_STATIC_LIBS


CMake Error at CMakeLists.txt:452 (target_link_options):
  Cannot specify link options for target "protoc" which is not built by this
  project.


# date: USE_SYSTEM_TZ_DB ON
# date: USE_TZ_DB_IN_DOT OFF
# date: BUILD_SHARED_LIBS OFF
# date: ENABLE_DATE_TESTING OFF
CMake Warning at CMakeLists.txt:497 (find_package):                                                                                                                                                                                                   [0/1878]
  By not providing "Findre2.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "re2", but
  CMake did not find one.

  Could not find a package configuration file provided by "re2" with any of
  the following names:

    re2Config.cmake
    re2-config.cmake

  Add the installation prefix of "re2" to CMAKE_PREFIX_PATH or set "re2_DIR"
  to a directory containing one of the above files.  If "re2" provides a
  separate development package or SDK, be sure it has been installed.


-- Performing Test HAS_UNUSED_BUT_SET_VARIABLE
-- Performing Test HAS_UNUSED_BUT_SET_VARIABLE - Success
-- Performing Test HAS_UNUSED_PARAMETER
-- Performing Test HAS_UNUSED_PARAMETER - Success
-- Performing Test HAS_UNUSED_VARIABLE
-- Performing Test HAS_UNUSED_VARIABLE - Success
-- Performing Test HAS_CAST_FUNCTION_TYPE
-- Performing Test HAS_CAST_FUNCTION_TYPE - Success
-- Performing Test HAS_PARENTHESES
-- Performing Test HAS_PARENTHESES - Success
-- Performing Test HAS_USELESS_CAST
-- Performing Test HAS_USELESS_CAST - Success
-- Performing Test HAS_NONNULL_COMPARE
-- Performing Test HAS_NONNULL_COMPARE - Success
-- Performing Test HAS_TAUTOLOGICAL_POINTER_COMPARE
-- Performing Test HAS_TAUTOLOGICAL_POINTER_COMPARE - Failed
-- Performing Test HAS_CATCH_VALUE
-- Performing Test HAS_CATCH_VALUE - Success
-- Performing Test HAS_MISSING_BRACES
-- Performing Test HAS_MISSING_BRACES - Success
-- Performing Test HAS_IGNORED_ATTRIBUTES
-- Performing Test HAS_IGNORED_ATTRIBUTES - Success
-- Performing Test HAS_DEPRECATED_COPY
-- Performing Test HAS_DEPRECATED_COPY - Failed
-- Performing Test HAS_DEPRECATED_DECLARATIONS
-- Performing Test HAS_DEPRECATED_DECLARATIONS - Success
-- Performing Test HAS_CLASS_MEMACCESS
-- Performing Test HAS_CLASS_MEMACCESS - Success
-- Performing Test HAS_MAYBE_UNINITIALIZED
-- Performing Test HAS_MAYBE_UNINITIALIZED - Success
-- The CUDA compiler identification is NVIDIA 11.0.194
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- CMAKE_CUDA_COMPILER_VERSION: 11.0.194
-- Looking for clock_gettime in rt
-- Looking for clock_gettime in rt - found
-- The ASM compiler identification is GNU
-- Found assembler: /usr/bin/gcc-8
-- Performing Test HAS_AVX512F
-- Performing Test HAS_AVX512F - Success
-- Performing Test COMPILES_AVX512F
-- Performing Test COMPILES_AVX512F - Success
-- Performing Test HAS_AVX512CORE
-- Performing Test HAS_AVX512CORE - Success
-- Performing Test COMPILES_AVX512CORE
-- Performing Test COMPILES_AVX512CORE - Success
-- Found PythonInterp: /usr/bin/python3 (found version "3.6.9") 
-- Found PythonLibs: /usr/lib/x86_64-linux-gnu/libpython3.6m.so
-- Found NumPy: /usr/local/lib/python3.6/dist-packages/numpy/core/include (found version "1.19.1") 
-- NumPy ver. 1.19.1 found (include: /usr/local/lib/python3.6/dist-packages/numpy/core/include)
-- Configuring incomplete, errors occurred!
See also "/tmp/scratch/onnxruntime/build/CMakeFiles/CMakeOutput.log".
See also "/tmp/scratch/onnxruntime/build/CMakeFiles/CMakeError.log".

skottmckay · 2020-08-04T05:12:24Z

I was able to repro.

You could try this change to workaround it: https://github.com/microsoft/onnxruntime/compare/skottmckay/WorkaroundLinkerBug

Or alternatively just using /LTCG instead of /LTCG:incremental works.

xkszltl · 2020-08-04T05:21:55Z

Great!
Do you find this VS bug somewhere or have an ongoing discussion with VS team?

skottmckay · 2020-08-04T05:25:43Z

No - I looked at the generated assembly and tried a few things to make it behave. If you have time to create a minimal repro of the issue to file a VS bug please do.

If the change works for you I'll create a PR for it.

xkszltl · 2020-08-04T09:33:55Z

Will test and that you know.
You don't need to wait for the result before merging as the PR will fix an issue you repro regardless of my result.

skottmckay · 2020-08-05T10:07:53Z

It fixes it on my machine, but as I don't know the internal details of what the LTCG issue is I don't know if re-ordering the std::isinf and std::isnan will fix it with all MSVC toolchains. As such, additional data points showing it works would be very helpful.

xkszltl · 2020-08-05T10:21:49Z

Of course. Saying just because I'm working on tight schedule so maybe there are some delay.

xkszltl · 2020-08-05T15:50:50Z

Works for me

skottmckay added platform:windows issues related to the Windows platform type:support labels Jul 27, 2020

skottmckay closed this as completed Aug 2, 2020

skottmckay mentioned this issue Aug 5, 2020

Workaround for MSVC linker bug #4713

Merged

This was referenced May 14, 2022

[Snyk] Security upgrade mocha from 8.3.2 to 9.1.2 ekmixon/onnxruntime#75

Open

[Snyk] Fix for 1 vulnerabilities ekmixon/onnxruntime#76

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Test failure on Windows #4623

Test failure on Windows #4623

xkszltl commented Jul 26, 2020

skottmckay commented Jul 27, 2020

xkszltl commented Jul 27, 2020

skottmckay commented Jul 27, 2020

xkszltl commented Jul 28, 2020

skottmckay commented Jul 28, 2020

xkszltl commented Jul 28, 2020

xkszltl commented Jul 28, 2020

skottmckay commented Jul 28, 2020

xkszltl commented Jul 29, 2020

xkszltl commented Jul 29, 2020 •

edited

Loading

xkszltl commented Jul 29, 2020

xkszltl commented Aug 3, 2020

skottmckay commented Aug 3, 2020

xkszltl commented Aug 3, 2020

skottmckay commented Aug 4, 2020

xkszltl commented Aug 4, 2020

skottmckay commented Aug 4, 2020

xkszltl commented Aug 4, 2020

skottmckay commented Aug 5, 2020

xkszltl commented Aug 5, 2020

xkszltl commented Aug 5, 2020

Test failure on Windows #4623

Test failure on Windows #4623

Comments

xkszltl commented Jul 26, 2020

skottmckay commented Jul 27, 2020

xkszltl commented Jul 27, 2020

skottmckay commented Jul 27, 2020

xkszltl commented Jul 28, 2020

skottmckay commented Jul 28, 2020

xkszltl commented Jul 28, 2020

xkszltl commented Jul 28, 2020

skottmckay commented Jul 28, 2020

xkszltl commented Jul 29, 2020

xkszltl commented Jul 29, 2020 • edited Loading

xkszltl commented Jul 29, 2020

xkszltl commented Aug 3, 2020

skottmckay commented Aug 3, 2020

xkszltl commented Aug 3, 2020

skottmckay commented Aug 4, 2020

xkszltl commented Aug 4, 2020

skottmckay commented Aug 4, 2020

xkszltl commented Aug 4, 2020

skottmckay commented Aug 5, 2020

xkszltl commented Aug 5, 2020

xkszltl commented Aug 5, 2020

xkszltl commented Jul 29, 2020 •

edited

Loading