Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test failure on Windows #4623

Closed
xkszltl opened this issue Jul 26, 2020 · 21 comments
Closed

Test failure on Windows #4623

xkszltl opened this issue Jul 26, 2020 · 21 comments
Labels
platform:windows issues related to the Windows platform

Comments

@xkszltl
Copy link
Contributor

xkszltl commented Jul 26, 2020

Describe the bug

[ RUN      ] TensorOpTest.CastFromString
D:\roaster-scratch\onnxruntime\onnxruntime\test\providers\provider_test_utils.cc(153): error: The difference between expected[i] and output[i] is nan, which exceeds threshold, where
expected[i] evaluates to -nan(ind),
output[i] evaluates to nan, and
threshold evaluates to 9.9999997473787516e-05.
i:6, provider_type: CPUExecutionProvider
D:\roaster-scratch\onnxruntime\onnxruntime\test\providers\provider_test_utils.cc(153): error: The difference between expected[i] and output[i] is nan, which exceeds threshold, where
expected[i] evaluates to -nan(ind),
output[i] evaluates to nan, and
threshold evaluates to 9.9999997473787516e-05.
i:7, provider_type: CPUExecutionProvider
[  FAILED  ] TensorOpTest.CastFromString (134 ms)

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Win Server 2019
  • ONNX Runtime installed from (source or binary): source
  • ONNX Runtime version: 1.4.0
  • Python version: 3.8.5
  • Visual Studio version (if applicable): 2019

Screenshots

image

image

@skottmckay
Copy link
Contributor

This is a little strange given these tests pass in CI builds (unit tests are run for all builds).

Looking at the test code the expected value should be NaN so not sure why it is -NaN in your build. Any chance something is redefining the NAN macro in your setup?

std::initializer_list<std::string> string_data = {"-inf", "+INF", "0.9767611f", "0.28280696f",
"-0.12019656f", "5.0f", "NaN", "nan"};
const std::initializer_list<float> float_output = {-(std::numeric_limits<float>::infinity()), std::numeric_limits<float>::infinity(),
0.9767611f, 0.28280696f,
-0.12019656f, 5.0f, NAN, NAN};

The Cast implementation uses std::stof in this case to convert the input so wouldn't be affected by the NAN macro

mutable_data[i] = std::stof(in->Data<std::string>()[i]);

@skottmckay skottmckay added platform:windows issues related to the Windows platform type:support labels Jul 27, 2020
@xkszltl
Copy link
Contributor Author

xkszltl commented Jul 27, 2020

No I don't have any macro redefinition for NAN.

Also I noticed there's nan-specific check.
Why it is not captured before going through line 153?

} else if (std::isnan(expected[i])) {

@skottmckay
Copy link
Contributor

Not sure why the nan-specific check isn't being satisfied. It is on my machine, and expected is actually -nan(ind) . So the output from your run for expected and actual match what I see, but the test passes for me as the std::isnan check on line 147 is satisfied.

expected,8
0x000001ecf0c06920 {-inf, inf, 0.976761103, 0.282806963, -0.120196559, 5.00000000, -nan(ind), -nan(ind)}
    [0]: -inf
    [1]: inf
    [2]: 0.976761103
    [3]: 0.282806963
    [4]: -0.120196559
    [5]: 5.00000000
    [6]: -nan(ind)
    [7]: -nan(ind)
output,8
0x000001ecf2d9b080 {-inf, inf, 0.976761103, 0.282806963, -0.120196559, 5.00000000, nan, nan}
    [0]: -inf
    [1]: inf
    [2]: 0.976761103
    [3]: 0.282806963
    [4]: -0.120196559
    [5]: 5.00000000
    [6]: nan
    [7]: nan

Are you able to step into that call in a debugger? On Windows 10 it is defined in corecrt_math.h so wondering if there's something different in your setup.

    template <class _Ty>
    _Check_return_ inline bool isnan(_In_ _Ty _X) throw()
    {
        return fpclassify(_X) == FP_NAN;
    }

@xkszltl
Copy link
Contributor Author

xkszltl commented Jul 28, 2020

Maybe it is some strange compiler optimization?
This is how we compile it
https://github.com/xkszltl/Roaster/blob/8f7339d9d9cad1d60735d982035e95f4d867f74a/win/pkgs/ort.ps1#L150

We do have extra /GL and /LTCG but I think the negative sign is already strange in the first place.

@skottmckay
Copy link
Contributor

@xkszltl
Copy link
Contributor Author

xkszltl commented Jul 28, 2020

I don't add that explicitly.
As you can see in the script, here's all I add:

-DCMAKE_C_FLAGS="/GL /MP /Zi /arch:AVX2"
-DCMAKE_CXX_FLAGS="/EHsc /GL /MP /Zi /arch:AVX2"
-DCMAKE_EXE_LINKER_FLAGS="/DEBUG:FASTLINK /LTCG:incremental"
-DCMAKE_SHARED_LINKER_FLAGS="/DEBUG:FASTLINK /LTCG:incremental"
-DCMAKE_STATIC_LINKER_FLAGS="/LTCG:incremental"

I can try adding an extra /fp:precise at the end if you want.
Maybe /arch:AVX2 is more suspicious, but ort has official option for that as well.

@xkszltl
Copy link
Contributor Author

xkszltl commented Jul 28, 2020

Same error when I append /fp:precise:

PS D:\roaster-scratch\onnxruntime\build> .\Release\onnxruntime_test_all.exe --gtest_filter='*.CastFromString'
Note: Google Test filter = *.CastFromString
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from TensorOpTest
[ RUN      ] TensorOpTest.CastFromString
D:\roaster-scratch\onnxruntime\onnxruntime\test\providers\provider_test_utils.cc(153): error: The difference between expected[i] and output[i] is nan, which exceeds threshold, where
expected[i] evaluates to -nan(ind),
output[i] evaluates to nan, and
threshold evaluates to 0.004999999888241291.
i:6, provider_type: CPUExecutionProvider
D:\roaster-scratch\onnxruntime\onnxruntime\test\providers\provider_test_utils.cc(153): error: The difference between expected[i] and output[i] is nan, which exceeds threshold, where
expected[i] evaluates to -nan(ind),
output[i] evaluates to nan, and
threshold evaluates to 0.004999999888241291.
i:7, provider_type: CPUExecutionProvider
D:\roaster-scratch\onnxruntime\onnxruntime\core\providers\cuda\cuda_call.cc:123 onnxruntime::CudaCall D:\roaster-scratch\onnxruntime\onnxruntime\core\providers\cuda\cuda_call.cc:117 onnxruntime::CudaCall CUDA failure 35: CUDA driver version is insufficient for CUDA runtime version ; GPU=1129250816 ; hostname=Roaster-Win ; expr=cudaSetDevice(device_id_);


Provider:CUDAExecutionProvider
unknown file: error: C++ exception with description "D:\roaster-scratch\onnxruntime\onnxruntime\core\providers\cuda\cuda_call.cc:123 onnxruntime::CudaCall D:\roaster-scratch\onnxruntime\onnxruntime\core\providers\cuda\cuda_call.cc:117 onnxruntime::CudaCall CUDA failure 35: CUDA driver version is insufficient for CUDA runtime version ; GPU=1129250816 ; hostname=Roaster-Win ; expr=cudaSetDevice(device_id_);

" thrown in the test body.
[  FAILED  ] TensorOpTest.CastFromString (26 ms)
[----------] 1 test from TensorOpTest (28 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (43 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] TensorOpTest.CastFromString

 1 FAILED TEST

@skottmckay
Copy link
Contributor

I'm not sure why std::isnan is failing to return true. Both -nan(ind) and nan should be considered to be a nan. https://stackoverflow.com/questions/61253965/nan-differences-stdnan-vs-quiet-nan-vs-macro-nan has some more details if that's helpful.

What happens if you build without any of the extra flags?

@xkszltl
Copy link
Contributor Author

xkszltl commented Jul 29, 2020

So far:

  1. Removing /arch:AVX2 does not help.
  2. Adding /fp:precise does not help.

Trying to remove /GL and /LTCG:incremental.

@xkszltl
Copy link
Contributor Author

xkszltl commented Jul 29, 2020

Quote from the stackoverflow answer:

By the way, it's like min/max macros, even if one had the bad idea to define them, don't use them

Love that.

Seems people also found their msvc has negative NAN.

@xkszltl
Copy link
Contributor Author

xkszltl commented Jul 29, 2020

Removing /GL and /LTCG:incremental works.
That usually indicates the existence of UB.

@xkszltl
Copy link
Contributor Author

xkszltl commented Aug 3, 2020

@skottmckay
Have you found the cause when you close the ticket?
LTO should not change the behavior of that if (isnan()), unless there's bug in code making compiler think it's always false.

@skottmckay
Copy link
Contributor

Multiple of our CI builds enable LTO (all the release package builds do) and we don't see any issues there. Given that, what would be actionable from an onnxruntime perspective?

@xkszltl
Copy link
Contributor Author

xkszltl commented Aug 3, 2020

This is a strong (meaning I don't have other clue) indication about either msvc or ort was wrong.
Were you able to repro with the following on your side?

-DCMAKE_C_FLAGS="/GL /MP /Zi"
-DCMAKE_CXX_FLAGS="/EHsc /GL /MP /Zi"
-DCMAKE_EXE_LINKER_FLAGS="/DEBUG:FASTLINK /LTCG:incremental"
-DCMAKE_SHARED_LINKER_FLAGS="/DEBUG:FASTLINK /LTCG:incremental"
-DCMAKE_STATIC_LINKER_FLAGS="/LTCG:incremental"

We had trouble enabling ort's LTO flag due to conflict with protobuf used elsewhere in our system.
It used to be #902, and now it's banned.

++ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=gcc-8 -DCMAKE_CXX_COMPILER=g++-8 -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DCMAKE_CUDA_COMPILER_LAUNCHER=ccache '-DCMAKE_C_FLAGS=-fdebug-prefix-map='\''/tmp/scratch'\''
='\''/usr/local/src'\'' -g -march=haswell -mtune=generic' '-DCMAKE_CXX_FLAGS=-fdebug-prefix-map='\''/tmp/scratch'\''='\''/usr/local/src'\'' -g -march=haswell -mtune=generic' -DCMAKE_INSTALL_PREFIX=/tmp/scratch/onnxruntime/install.yUpv0xXdzi/root/usr/loca
l -DCMAKE_POLICY_DEFAULT_CMP0060=NEW -DCMAKE_VERBOSE_MAKEFILE=ON -DPython_ADDITIONAL_VERSIONS=3.6 -Deigen_SOURCE_PATH=/usr/local/include/eigen3 -Donnxruntime_BUILD_FOR_NATIVE_MACHINE=OFF -Donnxruntime_BUILD_SHARED_LIB=ON -Donnxruntime_CUDNN_HOME=/usr/lib
64 -Donnxruntime_ENABLE_LANGUAGE_INTEROP_OPS=ON -Donnxruntime_ENABLE_LTO=ON -Donnxruntime_ENABLE_PYTHON=ON -Donnxruntime_PREFER_SYSTEM_LIB=ON -Donnxruntime_RUN_ONNX_TESTS=ON -Donnxruntime_USE_CUDA=ON -Donnxruntime_USE_DNNL=ON -Donnxruntime_USE_EIGEN_FOR_
BLAS=ON -Donnxruntime_USE_FULL_PROTOBUF=ON -Donnxruntime_USE_JEMALLOC=OFF -Donnxruntime_USE_LLVM=ON -Donnxruntime_USE_MKLML=OFF -Donnxruntime_USE_NGRAPH=OFF -Donnxruntime_USE_NUPHAR=OFF -Donnxruntime_USE_OPENBLAS=OFF -Donnxruntime_USE_OPENMP=OFF -Donnxru
ntime_USE_PREINSTALLED_EIGEN=OFF -Donnxruntime_USE_TENSORRT=OFF -Donnxruntime_USE_TVM=OFF -GNinja ../cmake
-- The C compiler identification is GNU 8.4.0
-- The CXX compiler identification is GNU 8.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/gcc-8 - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/g++-8 - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found PythonInterp: /usr/bin/python3 (found suitable version "3.6.9", minimum required is "3.5") 
-- Found PythonLibs: /usr/lib/x86_64-linux-gnu/libpython3.6m.so (found suitable version "3.6.9", minimum required is "3.5") 
-- Found Protobuf: /usr/local/lib/libprotobuf.so;-lpthread (found version "3.12.4") 
Use protobuf from preinstalled system lib
CMake Warning at CMakeLists.txt:421 (message):
  Please enable Protobuf_USE_STATIC_LIBS


CMake Error at CMakeLists.txt:452 (target_link_options):
  Cannot specify link options for target "protoc" which is not built by this
  project.


# date: USE_SYSTEM_TZ_DB ON
# date: USE_TZ_DB_IN_DOT OFF
# date: BUILD_SHARED_LIBS OFF
# date: ENABLE_DATE_TESTING OFF
CMake Warning at CMakeLists.txt:497 (find_package):                                                                                                                                                                                                   [0/1878]
  By not providing "Findre2.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "re2", but
  CMake did not find one.

  Could not find a package configuration file provided by "re2" with any of
  the following names:

    re2Config.cmake
    re2-config.cmake

  Add the installation prefix of "re2" to CMAKE_PREFIX_PATH or set "re2_DIR"
  to a directory containing one of the above files.  If "re2" provides a
  separate development package or SDK, be sure it has been installed.


-- Performing Test HAS_UNUSED_BUT_SET_VARIABLE
-- Performing Test HAS_UNUSED_BUT_SET_VARIABLE - Success
-- Performing Test HAS_UNUSED_PARAMETER
-- Performing Test HAS_UNUSED_PARAMETER - Success
-- Performing Test HAS_UNUSED_VARIABLE
-- Performing Test HAS_UNUSED_VARIABLE - Success
-- Performing Test HAS_CAST_FUNCTION_TYPE
-- Performing Test HAS_CAST_FUNCTION_TYPE - Success
-- Performing Test HAS_PARENTHESES
-- Performing Test HAS_PARENTHESES - Success
-- Performing Test HAS_USELESS_CAST
-- Performing Test HAS_USELESS_CAST - Success
-- Performing Test HAS_NONNULL_COMPARE
-- Performing Test HAS_NONNULL_COMPARE - Success
-- Performing Test HAS_TAUTOLOGICAL_POINTER_COMPARE
-- Performing Test HAS_TAUTOLOGICAL_POINTER_COMPARE - Failed
-- Performing Test HAS_CATCH_VALUE
-- Performing Test HAS_CATCH_VALUE - Success
-- Performing Test HAS_MISSING_BRACES
-- Performing Test HAS_MISSING_BRACES - Success
-- Performing Test HAS_IGNORED_ATTRIBUTES
-- Performing Test HAS_IGNORED_ATTRIBUTES - Success
-- Performing Test HAS_DEPRECATED_COPY
-- Performing Test HAS_DEPRECATED_COPY - Failed
-- Performing Test HAS_DEPRECATED_DECLARATIONS
-- Performing Test HAS_DEPRECATED_DECLARATIONS - Success
-- Performing Test HAS_CLASS_MEMACCESS
-- Performing Test HAS_CLASS_MEMACCESS - Success
-- Performing Test HAS_MAYBE_UNINITIALIZED
-- Performing Test HAS_MAYBE_UNINITIALIZED - Success
-- The CUDA compiler identification is NVIDIA 11.0.194
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- CMAKE_CUDA_COMPILER_VERSION: 11.0.194
-- Looking for clock_gettime in rt
-- Looking for clock_gettime in rt - found
-- The ASM compiler identification is GNU
-- Found assembler: /usr/bin/gcc-8
-- Performing Test HAS_AVX512F
-- Performing Test HAS_AVX512F - Success
-- Performing Test COMPILES_AVX512F
-- Performing Test COMPILES_AVX512F - Success
-- Performing Test HAS_AVX512CORE
-- Performing Test HAS_AVX512CORE - Success
-- Performing Test COMPILES_AVX512CORE
-- Performing Test COMPILES_AVX512CORE - Success
-- Found PythonInterp: /usr/bin/python3 (found version "3.6.9") 
-- Found PythonLibs: /usr/lib/x86_64-linux-gnu/libpython3.6m.so
-- Found NumPy: /usr/local/lib/python3.6/dist-packages/numpy/core/include (found version "1.19.1") 
-- NumPy ver. 1.19.1 found (include: /usr/local/lib/python3.6/dist-packages/numpy/core/include)
-- Configuring incomplete, errors occurred!
See also "/tmp/scratch/onnxruntime/build/CMakeFiles/CMakeOutput.log".
See also "/tmp/scratch/onnxruntime/build/CMakeFiles/CMakeError.log".

@skottmckay
Copy link
Contributor

I was able to repro.

You could try this change to workaround it: https://github.com/microsoft/onnxruntime/compare/skottmckay/WorkaroundLinkerBug

Or alternatively just using /LTCG instead of /LTCG:incremental works.

@xkszltl
Copy link
Contributor Author

xkszltl commented Aug 4, 2020

Great!
Do you find this VS bug somewhere or have an ongoing discussion with VS team?

@skottmckay
Copy link
Contributor

No - I looked at the generated assembly and tried a few things to make it behave. If you have time to create a minimal repro of the issue to file a VS bug please do.

If the change works for you I'll create a PR for it.

@xkszltl
Copy link
Contributor Author

xkszltl commented Aug 4, 2020

Will test and that you know.
You don't need to wait for the result before merging as the PR will fix an issue you repro regardless of my result.

@skottmckay
Copy link
Contributor

It fixes it on my machine, but as I don't know the internal details of what the LTCG issue is I don't know if re-ordering the std::isinf and std::isnan will fix it with all MSVC toolchains. As such, additional data points showing it works would be very helpful.

@xkszltl
Copy link
Contributor Author

xkszltl commented Aug 5, 2020

Of course. Saying just because I'm working on tight schedule so maybe there are some delay.

@xkszltl
Copy link
Contributor Author

xkszltl commented Aug 5, 2020

Works for me

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
platform:windows issues related to the Windows platform
Projects
None yet
Development

No branches or pull requests

2 participants