RPC error for large arrays #5514

Closed
giuseros opened this issue May 5, 2020 · 3 comments

@giuseros
Contributor

giuseros commented May 5, 2020

Hi all,
I am running TVM on an Ubuntu 16.04 machine, with the tracker running on the same machine.

An aarch64 machine is connected to the tracker.

When running from the master branch, the following Python code:

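import numpy as np
import tvm
from tvm import autotvm
# device_key, device_tracker and device_port are defined elsewhere (not shown in this report)
remote = autotvm.measure.request_remote(device_key, device_tracker, device_port, timeout=10000)
ctx = remote.cpu()
a = tvm.nd.array(np.ones((5041, 720)).astype('float32'), ctx)
b = tvm.nd.array(np.ones((720, 192)).astype('float32'), ctx)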

produces the following error on the server: free(): invalid next size (normal)

On the host side, I get this error instead:

Traceback (most recent call last):
  File "tvm/python/tvm/runtime/ndarray.py", line 503, in array
    return empty(arr.shape, arr.dtype, ctx).copyfrom(arr)

  File "tvm/python/tvm/runtime/ndarray.py", line 145, in copyfrom
    check_call(_LIB.TVMArrayCopyFromBytes(self.handle, data, nbytes))

  File "tvm/python/tvm/_ffi/base.py", line 330, in check_call
    raise get_last_ffi_error()

tvm._ffi.base.TVMError: Traceback (most recent call last):
  [bt] (7) tvm/build/libtvm.so(TVMArrayCopyFromBytes+0xa) [0x7f808df5397a]
  [bt] (6) tvm/build/libtvm.so(tvm::runtime::ArrayCopyFromBytes(DLTensor*, void const*, unsigned long)+0x7c4) [0x7f808df537c4]
  [bt] (5) tvm/build/libtvm.so(tvm::runtime::RPCDeviceAPI::CopyDataFromTo(void const*, unsigned long, void*, unsigned long, unsigned long, DLContext, DLContext, DLDataType, void*)+0x42f) [0x7f808df97e7f]
  [bt] (4) tvm/build/libtvm.so(tvm::runtime::RPCSession::CopyToRemote(void*, unsigned long, void*, unsigned long, unsigned long, DLContext, DLDataType)+0x28f) [0x7f808df8400f]
  [bt] (3) tvm/build/libtvm.so(tvm::runtime::RPCSession::HandleUntilReturnEvent(tvm::runtime::TVMRetValue*, bool, tvm::runtime::PackedFunc const*)+0x13f) [0x7f808df835ef]
  [bt] (2) tvm/build/libtvm.so(+0xd3824c) [0x7f808df9024c]
  [bt] (1) tvm/build/libtvm.so(tvm::support::Socket::Error(char const*)+0x90) [0x7f808df85220]
  [bt] (0) tvm/build/libtvm.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x32) [0x7f808d63dff2]
  File "/workspace/src/runtime/rpc/../../support/socket.h", line 362
TVMError: Socket SockChannel::Recv Error:Connection reset by peer
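
For a sense of scale, the sizes of the two arrays in the snippet above can be worked out from their shapes. This is only arithmetic on the numbers given in the report; `payload_bytes` is a hypothetical helper (not part of TVM), and it ignores any framing overhead the RPC layer adds.

```python
import math

FLOAT32_BYTES = 4  # one float32 element occupies 4 bytes


def payload_bytes(shape, itemsize=FLOAT32_BYTES):
    """Return the raw data size in bytes of a dense array with this shape."""
    return math.prod(shape) * itemsize


a_bytes = payload_bytes((5041, 720))  # first array in the report
b_bytes = payload_bytes((720, 192))   # second array in the report
print(a_bytes, b_bytes)  # 14518080 552960
```

The first array is about 13.8 MiB and the second is 540 KiB. The report does not say which of the two copies triggers the failure.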

I investigated and traced the regression to this commit: afcf939

The commit immediately before it (9a8ed5b) works fine.

Any thoughts on what could be causing the issue?

I am cc'ing @jmorrill, who is the author of the aforementioned PR.

Thanks,
Giuseppe

P.S. I also started a discuss post here: https://discuss.tvm.ai/t/rpc-error-for-large-arrays/6591

@tqchen
Member

tqchen commented May 5, 2020

Looking at the diff you pointed out, the most relevant change is probably the one to the ring buffer.
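
For readers unfamiliar with the data structure: a ring buffer keeps a backing array plus a head index and a count of unread bytes, and lets reads and writes wrap around the end of the array. The sketch below is a generic illustration under that assumption, not TVM's implementation and not a reconstruction of the bug; every name in it is hypothetical. It shows one step that needs care in any growable ring buffer: when the backing store is enlarged, data that currently wraps around the end has to be copied out in logical order first.

```python
class RingBuffer:
    """Generic growable byte ring buffer (illustrative only, not TVM's code)."""

    def __init__(self, capacity=8):
        self._buf = bytearray(capacity)
        self._head = 0  # index of the oldest unread byte
        self._size = 0  # number of unread bytes

    def peek_all(self):
        """Return all unread bytes in logical order without consuming them."""
        cap = len(self._buf)
        first = min(self._size, cap - self._head)
        wrapped = self._size - first
        return bytes(self._buf[self._head:self._head + first]) + bytes(self._buf[:wrapped])

    def _reserve(self, needed):
        """Grow the backing store so it can hold at least `needed` bytes."""
        if needed <= len(self._buf):
            return
        new_cap = max(needed, 2 * len(self._buf))
        # Linearize first: unread data may wrap around the end of the old store.
        data = self.peek_all()
        self._buf = bytearray(new_cap)
        self._buf[:len(data)] = data
        self._head = 0

    def write(self, data):
        """Append bytes, growing the buffer if there is not enough free space."""
        self._reserve(self._size + len(data))
        cap = len(self._buf)
        tail = (self._head + self._size) % cap
        first = min(len(data), cap - tail)
        self._buf[tail:tail + first] = data[:first]
        self._buf[:len(data) - first] = data[first:]  # wrapped remainder, if any
        self._size += len(data)

    def read(self, n):
        """Consume and return up to `n` unread bytes."""
        n = min(n, self._size)
        cap = len(self._buf)
        first = min(n, cap - self._head)
        out = bytes(self._buf[self._head:self._head + first]) + bytes(self._buf[:n - first])
        self._head = (self._head + n) % cap
        self._size -= n
        return out


rb = RingBuffer(8)
rb.write(b"abcdef")
rb.read(4)            # consume "abcd"; head moves to index 4
rb.write(b"ghij")     # wraps around the end of the 8-byte store
rb.write(b"klmnop")   # forces a grow while the data is wrapped
print(rb.read(12))    # b'efghijklmnop'
```

If the grow step copied the old store byte-for-byte instead of linearizing it, the wrapped bytes would end up in the wrong place, which is the kind of mistake that only shows up once a payload is large enough to force a resize.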

@tqchen tqchen self-assigned this May 5, 2020
@tqchen
Member

tqchen commented May 5, 2020

Please see if #5516 fixes the problem.

@giuseros
Contributor Author

giuseros commented May 6, 2020

Hi @tqchen,
Thanks for the prompt fix! It is now working fine (it was also nice to dig around the RPC part of the codebase a bit).

I will close the issue now.

@giuseros giuseros closed this as completed May 6, 2020