This repository has been archived by the owner on Aug 11, 2020. It is now read-only.

[WIP] Tensor shape overflow checking in Blas Engine #372

Closed
wants to merge 12 commits

Conversation

@larroy (Contributor) commented Apr 2, 2019

Fixes apache/mxnet#14522 (mx.nd.Custom conflicts with memory management).

With this change, multiplying matrices that are too large for the BLAS engine raises the following exception in the Python code instead of crashing inside BLAS:

Error in CustomOp.forward: Traceback (most recent call last):
  File "/Users/pllarroy/devel/mxnet/python/mxnet/operator.py", line 987, in forward_entry
    aux=tensors[4])
  File "repro.py", line 13, in forward
    c = mx.nd.batch_dot(a, b)
  File "<string>", line 59, in batch_dot
  File "/Users/pllarroy/devel/mxnet/python/mxnet/_ctypes/ndarray.py", line 92, in _imperative_invoke
    ctypes.byref(out_stypes)))
  File "/Users/pllarroy/devel/mxnet/python/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [15:47:54] /Users/pllarroy/devel/mxnet/include/mshadow/./dot_engine-inl.h:352: Check failed: mult_not_overflow<int>(batch_count, m_n, &b_m_n) Result Tensor shape (100x7000x6000) is too big, will overflow gemm signed 32 bit index

Stack trace returned 10 entries:
[bt] (0) 0   libmxnet.dylib                      0x0000000112bf1b7d dmlc::StackTrace() + 877
[bt] (1) 1   libmxnet.dylib                      0x0000000112bf16d5 dmlc::LogMessageFatal::~LogMessageFatal() + 53
[bt] (2) 2   libmxnet.dylib                      0x0000000112bcdf35 dmlc::LogMessageFatal::~LogMessageFatal() + 21
[bt] (3) 3   libmxnet.dylib                      0x0000000114f2c91b mshadow::expr::BLASEngine<mshadow::cpu, float>::batched_gemm(mshadow::Stream<mshadow::cpu>*, bool, bool, long long, long long, long long, float, float const*, long long, float const*, long long, float, float*, long long, long long, float**) + 2139
[bt] (4) 4   libmxnet.dylib                      0x0000000114f21640 void mshadow::BatchGEMM<false, false, mshadow::cpu, float>(mshadow::Tensor<mshadow::cpu, 3, float>, mshadow::Tensor<mshadow::cpu, 3, float> const&, mshadow::Tensor<mshadow::cpu, 3, float> const&, float, float, mshadow::Tensor<mshadow::cpu, 1, float*>) + 4992
[bt] (5) 5   libmxnet.dylib                      0x0000000114eec939 void mxnet::op::BatchDotForward_<mshadow::cpu>(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocator<mxnet::OpReqType> > const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&) + 3049
[bt] (6) 6   libmxnet.dylib                      0x0000000113100a55 void std::__1::__invoke_void_return_wrapper<void>::__call<void (*&)(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocator<mxnet::OpReqType> > const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&), nnvm::NodeAttrs const&, mxnet::OpContext const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocator<mxnet::OpReqType> > const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&>(void (*&&&)(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocator<mxnet::OpReqType> > const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&), nnvm::NodeAttrs const&&&, mxnet::OpContext const&&&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&&&, std::__1::vector<mxnet::OpReqType, std::__1::allocator<mxnet::OpReqType> > const&&&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&&&) + 277
[bt] (7) 7   libmxnet.dylib                      0x0000000113100869 std::__1::__function::__func<void (*)(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocator<mxnet::OpReqType> > const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&), std::__1::allocator<void (*)(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocator<mxnet::OpReqType> > const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&)>, void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocator<mxnet::OpReqType> > const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&)>::operator()(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocator<mxnet::OpReqType> > const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&) + 121
[bt] (8) 8   libmxnet.dylib                      0x0000000112ebf939 std::__1::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocator<mxnet::OpReqType> > const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&)>::operator()(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocator<mxnet::OpReqType> > const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&) const + 217
[bt] (9) 9   libmxnet.dylib                      0x000000011306030f mxnet::imperative::PushFCompute(std::__1::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocator<mxnet::OpReqType> > const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::__1::vector<mxnet::engine::Var*, std::__1::allocator<mxnet::engine::Var*> > const&, std::__1::vector<mxnet::engine::Var*, std::__1::allocator<mxnet::engine::Var*> > const&, std::__1::vector<mxnet::Resource, std::__1::allocator<mxnet::Resource> > const&, std::__1::vector<mxnet::NDArray*, std::__1::allocator<mxnet::NDArray*> > const&, std::__1::vector<mxnet::NDArray*, std::__1::allocator<mxnet::NDArray*> > const&, std::__1::vector<unsigned int, std::__1::allocator<unsigned int> > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocator<mxnet::OpReqType> > const&)::'lambda'(mxnet::RunContext)::operator()(mxnet::RunContext) const + 2639
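
For reference, the failing shape really does overflow a signed 32-bit index: 100 × 7000 × 6000 = 4,200,000,000, which exceeds 2^31 - 1 = 2,147,483,647. As a rough illustration, a guard of this kind inside batched_gemm, where m, n and batch_count are the gemm dimensions, could look as follows (a sketch only; the names mult_not_overflow, m_n and b_m_n are taken from the error message above, and the exact diff may differ):

// Sketch only: verify that the flattened result size fits in the signed
// 32-bit indices used by the BLAS gemm interface before calling into it.
int m_n = 0;
CHECK(mult_not_overflow<int>(m, n, &m_n))
    << "Result Tensor shape is too big, will overflow gemm signed 32 bit index";
int b_m_n = 0;
CHECK(mult_not_overflow<int>(batch_count, m_n, &b_m_n))
    << "Result Tensor shape is too big, will overflow gemm signed 32 bit index";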

pp_C.push_back(C + i * m_n);
}
int m_k = 0;
CHECK(mult_not_overflow(m, k, &m_k));
Contributor commented:

Would a series of calls to this function cause a runtime regression? Could we do this check only in debug mode?

@larroy (Author) replied Apr 2, 2019:

It's going to have some performance cost, which should be measured, but a fixed set of divisions independent of the tensor shapes shouldn't be too bad; it's basically an O(1) performance penalty. Compared to the cost of the gemm call itself I would guess it's minor, but it should be measured.
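
For concreteness, a minimal sketch of a division-based check along these lines (hypothetical code, not the exact diff; assumes the dimensions are non-negative):

#include <limits>

// Returns true iff a * b fits in T; optionally stores the product in *result.
// Division method: for a > 0, a * b <= max holds exactly when b <= max / a.
template<typename T>
inline bool mult_not_overflow_sketch(T a, T b, T *result = nullptr) {
  if (a != 0 && b > std::numeric_limits<T>::max() / a)
    return false;  // a * b would overflow T
  if (result)
    *result = a * b;
  return true;
}

The cost is one comparison and at most one integer division per call, independent of the tensor contents, which is why it is O(1) next to the gemm itself.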

larroy added a commit to larroy/mxnet that referenced this pull request Apr 2, 2019
larroy added a commit to larroy/mxnet that referenced this pull request Apr 2, 2019
* Uses division method
*/
template<typename T>
inline bool mult_not_overflow_binary(T a, T b, T *result = nullptr) {
Contributor commented:

Could we use a GCC built-in to check for multiplication overflow? https://gcc.gnu.org/onlinedocs/gcc-5.2.0/gcc/Integer-Overflow-Builtins.html

@larroy (Author) replied:

Good point; that could be an optimization when building under GCC once the PR is working. There are some GPU problems at the moment, and I'm not sure whether they are related to the CI instability we've had on GPU these weeks or to a deeper problem.
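
For reference, a sketch of the builtin-based variant (hypothetical; __builtin_mul_overflow is available in GCC >= 5 and Clang):

// Same contract as the division-based check, but lets the compiler detect
// the overflow directly; typically compiles to a multiply plus a flag test.
template<typename T>
inline bool mult_not_overflow_builtin(T a, T b, T *result = nullptr) {
  T tmp;
  if (__builtin_mul_overflow(a, b, &tmp))
    return false;  // multiplication overflowed
  if (result)
    *result = tmp;
  return true;
}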

larroy added a commit to larroy/mxnet that referenced this pull request Apr 5, 2019
larroy added a commit to larroy/mxnet that referenced this pull request May 21, 2019
@szha (Member) commented Aug 4, 2019

This code base has been donated to the Apache MXNet project per #373, and this repo is deprecated. Future development should continue in Apache MXNet.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

@szha closed this Jul 26, 2020