
[RFC][μTVM] Bringing TVM to Bare-Metal Devices #2563

Closed
2 of 6 tasks
Mutinifni opened this issue Feb 4, 2019 · 10 comments

Comments

@Mutinifni
Contributor

Mutinifni commented Feb 4, 2019

There has been a proliferation of resource-constrained and embedded devices that do not have operating systems or a mature software stack. This trend is likely to continue with the shift towards hardware specialization and the growing interest in open-source hardware, for which software support takes time to develop. Running ML and DL applications on such devices will lead to opportunities for faster and broader impact. This entails the following challenges:

  • Bare-metal devices usually do not have on-device memory management.
  • They typically do not support LLVM and it may be hard to develop custom IR passes for them.
  • They are hard to debug because of a rigid programming and cross-compilation interface.
  • Some of these devices may not have RPC or network support.
  • They are hard to optimize for and compare against, because efficient operator libraries typically do not exist.

[figure: the TVM stack, with the layers extended by μTVM highlighted]

This RFC proposes a way to support TVM on such bare-metal devices by extending the stack at the layers highlighted in the figure above. I have already pushed C codegen support, and have developed and tested an initial μTVM implementation by emulating a device with an allocated region of memory on the host machine. Next, @weberlo and I will continue this work to support devices that expose a JTAG debugging interface (including RISC-V backends), and develop an optimization framework based on AutoTVM.

Proposed Design

We envision supporting bare-metal devices in two steps. The first step is to separate the control and data planes: the control plane lives on the host and drives execution on the board, which lets us reuse the existing interfaces for AutoTVM optimizations and enables easy Python-based device programming. The second step is to create a minimal TVM runtime that can run on the board independently, without a host driver, which will make μTVM suitable for deployment. This RFC focuses on the first step.

Because LLVM is not as ubiquitous as standard C, we started out by building a gcc / g++ code generation backend for TVM, which has already been upstreamed (#2161). However, μTVM’s design is independent of the code generator backend used, as long as it is supported for the target device.

Overview

μTVM will contain the following components:

  • Python frontend: Set of modules that let us program bare-metal devices using TVM’s Python DSL.
  • Code generator: TVM module that can generate executable code for the target device. Will typically be one of C, C++ or LLVM.
  • LowLevelDeviceAPI: A minimal read/write/execute interface that any bare-metal device must expose in order to work with μTVM.
  • MicroDeviceAPI: Child class of TVM’s DeviceAPI that performs on-device memory management and provides helper functions to copy memory to and from device by using a LowLevelDeviceAPI backend.
  • MicroModule: Child class of TVM’s ModuleNode which provides functionality to load compiled library on to the device and execute functions, also by using LowLevelDeviceAPI.

[figure: interaction of μTVM's components over an OpenOCD connection to a RISC-V board]

The above figure shows how μTVM's components interact while using an OpenOCD-based connection to a RISC-V board. The board is connected to the OpenOCD server through a JTAG debugger. The μTVM runtime attaches to the OpenOCD server over a Tcl socket connection, which enables the runtime to read, write, and execute from on-device memory. To actually program the board, the C file produced by the code generator is compiled into an object file using a vendor-specific GCC (in this case riscv-gcc). μTVM reads this binary and remaps it to on-device addresses by dumping the relevant program sections and relinking them with the ld linker. The ELF sections of this remapped binary are then written to the board along with device-side driver code (an "init stub" similar to a bootloader). On-device function calls are then made by invoking the init stub with the correct function pointer and the appropriate arguments.

Frontend

The Python frontend for μTVM will look the same as for any other device interface. For example, the below code segment performs vector add on the bare-metal device.

        ...
        # A, B, and C are the placeholder tensors of the vector-add
        # computation, defined in the elided setup code above.
        m = tvm.module.load("test.obj", "openocd")
        fadd = m['fadd']
        ctx = tvm.micro_dev(0)
        n = 1024
        a = tvm.nd.array(np.random.uniform(size=n).astype(A.dtype), ctx)
        b = tvm.nd.array(np.random.uniform(size=n).astype(B.dtype), ctx)
        c = tvm.nd.array(np.zeros(n, dtype=C.dtype), ctx)
        fadd(a, b, c)
        ...

Device-Side Driver Code

In order to invoke functions on the board, we implement a device-side driver (similar to an init stub or a bootloader) whose function body simply invokes a function pointer with type-erased arguments. These arguments can be set by the μTVM runtime whenever it needs to invoke a function on the board. The driver code is uploaded to the target device when it is started with μTVM and is never erased. For each function invocation, the runtime rewrites the driver function pointer value with the target function, copies function arguments to a dedicated memory region on the device, and then executes the function by calling the driver code.
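The control flow above can be sketched as follows. This is an illustrative Python model of the driver's behavior (the real stub is C code resident on the device); all names here are hypothetical.

```python
class InitStub:
    """Models the device-resident driver: the runtime overwrites `func`
    and `args`, then triggers `run` to execute the call on the device."""

    def __init__(self):
        self.func = None  # stands in for the rewritten function pointer
        self.args = ()    # stands in for the dedicated argument region

    def run(self):
        # The stub body does nothing but call through the pointer.
        self.func(*self.args)

def vector_add(a, b, out):
    # Example "device function": results flow back through the args region.
    for i in range(len(out)):
        out[i] = a[i] + b[i]

stub = InitStub()
out = [0, 0, 0]
stub.func = vector_add                       # runtime rewrites the pointer
stub.args = ([1, 2, 3], [10, 20, 30], out)   # runtime copies the arguments
stub.run()
# out is now [11, 22, 33]
```

Because results travel through the argument region rather than a return value, the host only needs read/write access to that region to retrieve outputs.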

Backend

The μTVM backend has multiple components, as described below.

LowLevelDeviceAPI

First, to enable TVM to read from and write to on-device memory, and to start execution on the device, we introduce a new interface in low_level_device_api.h. This is the only interface a micro device must implement to be supported in TVM. For example, the implementation could be an emulated region of memory on the host (HostLowLevelDeviceAPI) or an interface exposed through on-host JTAG controller software (OpenOCDLowLevelDeviceAPI).

The code below shows the LowLevelDeviceAPI interface.

class LowLevelDeviceAPI {
 public:
  virtual ~LowLevelDeviceAPI() {}

  virtual void Write(TVMContext ctx, void* offset, uint8_t* buf, size_t num_bytes) = 0;

  virtual void Read(TVMContext ctx, void* offset, uint8_t* buf, size_t num_bytes) = 0;

  virtual void Execute(TVMContext ctx, TVMArgs args, TVMRetValue* rv, void* offset) = 0;

  virtual void Reset(TVMContext ctx) = 0;
  ...
};
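As a concrete illustration of how small this contract is, here is a host-emulated implementation sketched in Python (the real HostLowLevelDeviceAPI is C++; the class and method names below are illustrative, not the actual API).

```python
class HostLowLevelDevice:
    """Emulates device memory with a host-side bytearray and implements
    the minimal read/write/execute contract."""

    def __init__(self, size=4096):
        self.mem = bytearray(size)  # emulated device memory
        self.funcs = {}             # offset -> callable, stands in for code

    def write(self, offset, buf):
        self.mem[offset:offset + len(buf)] = buf

    def read(self, offset, num_bytes):
        return bytes(self.mem[offset:offset + num_bytes])

    def execute(self, offset, args):
        # On a real device this would jump to `offset`; here we dispatch
        # to a registered Python callable instead.
        return self.funcs[offset](*args)

dev = HostLowLevelDevice()
dev.write(0x100, b"\xde\xad\xbe\xef")
assert dev.read(0x100, 4) == b"\xde\xad\xbe\xef"
```

Everything above the interface (memory management, module loading) is built purely out of these three operations, which is what makes new backends cheap to add.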

In order to communicate with a wide variety of target devices, we plan to use OpenOCD, a debugging tool for bare-metal devices. OpenOCD lets us program devices that support JTAG hardware debuggers by providing an interface to upload binaries, read and write memory, and so on. This is achieved by sending OpenOCD instructions over a Tcl socket connection. TVM will therefore implement the above interface in OpenOCDLowLevelDeviceAPI in order to communicate with OpenOCD. The only peculiarity of this setup is that OpenOCD must be running (separately from TVM) and connected to the target device, which the user must set up. One open question is how to configure the connection between TVM and OpenOCD (i.e., which port the OpenOCD process is listening on) without hardcoding it. Our current approach is for the user to pass the port as a parameter when initializing the device in the Python frontend, e.g., tvm.micro_dev(dev_id=0, openocd_port=6666).
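The host side of the Tcl socket protocol is simple. Per OpenOCD's Tcl RPC documentation, commands and replies are delimited by the byte 0x1a, and the default Tcl port is 6666; the function names in this sketch are illustrative.

```python
import socket

TERMINATOR = b"\x1a"  # OpenOCD's Tcl RPC command/reply delimiter

def frame_command(cmd: str) -> bytes:
    """Frame a Tcl command for OpenOCD's socket interface."""
    return cmd.encode("utf-8") + TERMINATOR

def send_command(sock: socket.socket, cmd: str) -> str:
    """Send one command and read back the delimited reply."""
    sock.sendall(frame_command(cmd))
    reply = b""
    while not reply.endswith(TERMINATOR):
        chunk = sock.recv(4096)
        if not chunk:
            raise ConnectionError("OpenOCD closed the connection")
        reply += chunk
    return reply[:-1].decode("utf-8")

# e.g., send_command(sock, "ocd_mdw 0x80000000") to read a word of
# device memory through a connected OpenOCD server
```

Reads, writes, and resume/halt commands issued this way are enough to implement the Read, Write, Execute, and Reset methods of the interface.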

MicroDeviceAPI and MicroModule

Next, we extend TVM's DeviceAPI and ModuleNode classes to support bare-metal devices. The MicroDeviceAPI class adds functions that perform memory management on the target low-level devices, and also allows data to be copied to and from the target device. The MicroModule class adds support to load and manage ELF sections into memory, find correct function pointers, obtain symbol addresses, etc. Both of these in tandem allow TVM to support any new bare-metal device backend. In the next few paragraphs, we describe how, using the LowLevelDeviceAPI, we can implement memory management, library loading, and function invocation.

Because our target bare-metal devices may have tight memory constraints, we manage memory on the host. Doing so avoids the space overhead on the device of memory-management bookkeeping structures and functions. For example, if we want to allocate and populate a DLTensor, we find a suitable memory region on the device (by calling into our on-host memory manager), then use the Write method to populate that memory region. The tensor on the host only stores metadata (e.g., shape, device data pointer, etc.) and queries for on-device data whenever the frontend needs it.
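The key point is that all allocator state lives on the host; the device stores only raw bytes. A minimal sketch of such host-side bookkeeping (the first-fit policy and names here are assumptions for illustration, not the actual implementation):

```python
class DeviceMemoryManager:
    """Host-side bookkeeping for a device memory region. The device
    itself holds no allocator metadata, only the allocated bytes."""

    def __init__(self, base, size):
        self.free = [(base, size)]  # list of (start, size) free regions

    def alloc(self, nbytes):
        # first-fit scan over the free list, done entirely on the host
        for i, (start, size) in enumerate(self.free):
            if size >= nbytes:
                self.free[i] = (start + nbytes, size - nbytes)
                return start
        raise MemoryError("out of device memory")

    def free_region(self, start, nbytes):
        self.free.append((start, nbytes))  # no coalescing in this sketch

mm = DeviceMemoryManager(base=0x1000, size=0x800)
a = mm.alloc(0x100)  # first allocation lands at the region base
b = mm.alloc(0x200)  # next allocation follows it
```

The address returned by alloc is then used as the offset argument to the low-level device's Write call when populating a tensor.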

To load library functions onto the device, we dump the ELF sections of the compiled binary and copy them into memory regions on the board. To remove a library, we can simply overwrite this region with null values. This allows us to support multiple libraries simultaneously.
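This load/unload scheme can be sketched against an emulated memory region as follows; the section contents and layout below are made up for illustration.

```python
mem = bytearray(0x400)  # emulated device memory

def load_sections(base, sections):
    """Copy dumped ELF sections into device memory contiguously,
    recording where each section landed."""
    addrs, offset = {}, base
    for name, data in sections.items():
        mem[offset:offset + len(data)] = data
        addrs[name] = offset
        offset += len(data)
    return addrs, offset - base

def unload(base, total_size):
    # Removing a library is just overwriting its region with zeros.
    mem[base:base + total_size] = bytes(total_size)

addrs, size = load_sections(0x100, {".text": b"\x13\x00\x00\x00",
                                    ".data": b"\x2a"})
unload(0x100, size)
```

Because each loaded library occupies its own region, several libraries can coexist on the device and be removed independently.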

To invoke a function, as described above, we write the target function pointer to the pointer used by the device driver and copy the function arguments to a dedicated section in the device's address space. Once this data has been written, we use the Execute method to call the target function through the device-side driver, which reads the arguments from the arguments section, executes its body, and writes any results back to the arguments section (because function arguments are passed as memory references). Whenever output data is read in the frontend, the runtime simply copies the relevant arguments back from the target device. Return values are not supported; they are also unnecessary, because data can be transferred by passing references.

Bare-metal devices are normally very difficult to debug and will crash without providing informative error messages. Because of the difficulty of this workflow, we have already implemented a HostMicroDevice for a smoother debugging experience and for faster iteration on experimental features. The HostMicroDevice emulates a bare-metal device by allocating an executable memory region on the host and treating it as the target device. All data copying to and from this region is done in the same way as it would be on a target device. The only difference lies in the way functions are invoked, because on the host, a function is simply called by casting the appropriate region in memory as a function pointer and calling it as usual. The next step will be to develop an OpenOCD-compatible MicroDeviceAPI and MicroModule.

Design Benefits

Our proposed plan will enable the following benefits for TVM:

  • Bare-metal backend support that depends only on read, write, and execute commands.
  • Remote driving of execution on the target device by separating out the control plane and running it on the host, which gives greater flexibility in device management.
  • Easier debugging through a Python-based programming interface.
  • RISC-V support.
  • Potential support for any JTAG-based device that OpenOCD can drive.
  • Reuse of TVM's existing infrastructure for optimizations.

Roadmap

  • add a gcc / g++ code generation backend ([BACKEND][CODEGEN] C codegen with tests #2161)
  • test idea on emulated host memory region (PR in a week)
  • implement a low-level OpenOCD device (1 week)
  • test on RISC-V Spike simulator (1 week)
  • test on actual hardware backends (1 week)
  • test AutoTVM-based model optimizations (2 weeks)

Comments welcome!

@Ravenwater

Love it.

Isn't this better floated on https://discuss.tvm.ai? There is another discussion there that has bearing on this same topic: https://discuss.tvm.ai/t/vta-define-a-complete-architecture-specification-for-vta/1614

One of our goals is to drive MACRO devices such as 1M core compute clusters and custom tensor processors.

To be able to communicate with TVM efficiently, we need an architecture definition of the VTA that supports all the features of executing the IR, from fully sequential, to massively parallel.

@Mutinifni
Contributor Author

I think our eventual goal might also be to target many-core systems with wimpy cores, although we were not thinking of as many as 1M cores. From TVM's perspective, that would still require a multithreaded code generator suitable for the target backend. The current gcc codegen, for example, does not have multithreaded support.

Thanks for pointing to the forums discussion -- I really like the aspects you've brought up there! I am not familiar with the VTA implementation / interface in TVM, but I think defining a debug protocol and a notification mechanism between devices would also be useful to make μTVM more generic.

@Ravenwater

@Mutinifni I am very glad you are working on μTVM. As we are both looking towards a hardware-accelerated VTA backend, we are going to face the same issues. The architecture spec is going to be the vehicle that allows us to collaborate and solve these issues synergistically. The communication of the IR and its attributes to a remote hardware accelerator is going to be one such issue. Let's work together, enumerate the requirements we have, and RFC the lists for the community to weigh in.

@sergei-mironov
Contributor

sergei-mironov commented Jul 4, 2019

Hi! Could you please share links to possible target devices which could be supported by this approach? I am interested in on-device training concept which may become important in the context of 'fine-tuning' use-cases of ML models like BERT.

@weberlo
Contributor

weberlo commented Jul 4, 2019

@grwlf Hi there! As mentioned above, there are only two requirements for a device to be supported by µTVM: a C cross-compiler toolchain and an implementation of a read/write/execute interface for the device.

As far as actual devices that we've tested on, we've primarily been using an emulated host device (implemented in #3227) and Spike, a functional RISC-V simulator. Since we're using OpenOCD to target Spike, any device that supports the JTAG protocol is also supported, and we'll be upstreaming the OpenOCD low-level device implementation in a few weeks.

In the coming months, we'll be testing our implementation against a HiFive1 board and an ARM Cortex-M board (I'll need to check exactly which model). Let us know if there are any specific boards you're interested in, and otherwise, keep an eye out for the upcoming OpenOCD PR.

@dylanzika

Hey @weberlo, great work on this effort. The value of this approach would really stand out if you selected a board similar to the one in this linked work by Arm; it would allow a competitive analysis against CMSIS-NN.

https://arxiv.org/pdf/1801.06601.pdf

NUCLEO-F746ZG Mbed board [14] with an Arm Cortex-M7 core running at 216 MHz

Nucleo-f746zg development board. http://www.st.com/en/evaluation-tools/nucleo-f746zg.html

@weberlo
Contributor

weberlo commented Jul 10, 2019

Thanks for the pointer! You're right. It would be very compelling if AutoTVM generated kernels for a Cortex-M board that outperformed the kernels from this paper.

@tqchen
Member

tqchen commented Jul 25, 2019

#3227 is the initial implementation. Let us open new RFCs to track further improvements.

@wang-y-z
Contributor

@weberlo Nice work! I think always-on tiny machine learning on IoT devices has arrived.

Recently, MCUNet: Tiny Deep Learning on IoT Devices, which implements a TinyNAS/TinyEngine (a memory-efficient inference library) co-design to find the best way to deploy a model onto a bare-metal device like the STM32F746 (320 KB SRAM / 1 MB Flash), is very interesting. What do you think about the differences between MicroTVM and MCUNet?

By the way, can you tell me the status of MicroTVM on RISC-V devices, and whether you plan to support user-defined extensions for RISC-V?

Looking forward to your brilliant work and your reply!

@weberlo
Contributor

weberlo commented Aug 7, 2020

What do you think about the differences between MicroTVM and MCUNet?

Hi @wang-y-z. I wasn't aware of this work until now. Thanks for the pointer! It's a bit embarrassing to see them compare against the old runtime that was designed purely for AutoTVM purposes (and it only happened to be able to run entire models). Because of that design goal, it makes no use of flash memory, so it runs out of memory very quickly 😅.

I'd say TinyNAS isn't comparable to µTVM, since µTVM doesn't currently do any architecture search. You could imagine using only TinyNAS to produce a model, then importing the result and running it with µTVM.

TinyEngine is an interesting point of comparison, since it uses a codegen-based approach, and this is the approach we want to move towards going forward. For the past few months, we've focused on strengthening support for autotuning and deployment with the C graph runtime. However, as we look at smaller devices, there are a lot of mechanisms in the graph runtime that cause unnecessarily high memory usage (e.g., runtime overhead and JSON parsing). With the prototype Relay AoT compiler being merged soon (#6219), we'll have a good starting point for an entirely codegen-based approach.

Though the codegen approach seems to give them the most benefit (Figure 4), the model-adaptive/memory-aware optimizations in TinyEngine look compelling as well, and it would certainly be interesting to see how they could be implemented in TVM.

By the way, can you tell me the status of MicroTVM on RISC-V devices, and whether you plan to support user-defined extensions for RISC-V?

We haven't prioritized RISC-V-specific features, since we're still building up all of the device-agnostic infrastructure. Is there a use case for user-defined extensions you have in mind?
