[uTVM][Runtime] Deprecate uTVM Standalone Runtime #5060

Closed
1 of 9 tasks
liangfu opened this issue Mar 13, 2020 · 17 comments
Labels
vert:micro MicroTVM: src/runtime/micro, src/runtime/crt, apps/microtvm

Comments

@liangfu
Member

liangfu commented Mar 13, 2020

Since the MISRA-C runtime has been merged in PR #3934 and discussed in RFC #3159, I think it's now time to migrate the uTVM standalone runtime (introduced in PR #3567) to the MISRA-C runtime.

Rationale

  • MISRA-C runtime has a smaller code size (45 KiB vs approx. 100 KiB)
  • MISRA-C runtime is more portable, since it's written entirely in pure C
  • MISRA-C runtime is designed to be more stable, since it tries to avoid typecasts and dynamic allocations
  • uTVM standalone runtime is currently not tested in the CI, see WIP PR [uTVM] Enable Testing Standalone uTVM Runtime in CI #4991

Actionable Items

  • Implement a memory container that returns addresses from a single stack, PR [uTVM][Runtime] Introduce Virtual Memory Allocator to CRT #5124
  • Implement arena based memory allocator for CRT
  • Remove the picojson library from the 3rdparty directory; it would be replaced by src/runtime/crt/load_json.h
  • Supersede uTVM standalone runtime with MISRA-C runtime
  • Enable testing new uTVM standalone runtime in CI
  • Demonstrate that TVM can run independently on micro-controllers, possibly with a demo on
    • STM32F746 board or
    • Arty-A7 with Freedom E300 or
    • Sparkfun Edge

Please leave your comment.

cc @areusch

@tqchen
Member

tqchen commented Mar 13, 2020

Cross-posting here. I think it is worth thinking about the memory allocation strategy. Specifically, we should design an API that contains a simple allocator (arena-like: it allocates memory from a stack and releases everything once done), and use that allocator for all memory in the program (including data structures and tensors). This would completely eliminate the use of system calls and allow the program to run on bare metal.

Example API

// Either use a system call to get the memory, or point directly to memory segments on the microcontroller
UTVMAllocator* arena = UTVMCreateArena(10000);
// Subsequent data structures are allocated from the allocator
// The free calls will recycle data into the allocator
// The simplest strategy is not to recycle at all
UTVMSetAllocator(arena);

// normal TVM API calls
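
To make the proposal concrete, here is a minimal sketch of what such an arena (bump) allocator could look like in plain C. The names `Arena`, `ArenaInit`, `ArenaAlloc` and `ArenaReset` are hypothetical and are not part of TVM or of the `UTVMCreateArena` / `UTVMSetAllocator` API proposed above. The sketch assumes the caller supplies a suitably aligned backing buffer (for example a static array or a dedicated memory segment) and that nothing is recycled until the whole arena is reset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical arena type; illustrative only, not a TVM API. */
typedef struct {
  uint8_t* base;   /* start of the caller-provided backing buffer */
  size_t capacity; /* total bytes available in the buffer */
  size_t offset;   /* bytes handed out so far (the "stack top") */
} Arena;

/* Bind the arena to a caller-provided buffer; no system call is involved. */
void ArenaInit(Arena* a, void* buffer, size_t capacity) {
  assert(a != NULL && buffer != NULL);
  a->base = (uint8_t*)buffer;
  a->capacity = capacity;
  a->offset = 0;
}

/* Allocate `size` bytes at an offset aligned to `align` (a power of two).
 * Offsets are aligned relative to the buffer start, so the buffer itself
 * is assumed to be suitably aligned. Returns NULL when the arena is full. */
void* ArenaAlloc(Arena* a, size_t size, size_t align) {
  assert(align != 0 && (align & (align - 1)) == 0);
  size_t start = (a->offset + (align - 1)) & ~(align - 1);
  if (start > a->capacity || size > a->capacity - start) return NULL;
  a->offset = start + size;
  return a->base + start;
}

/* Release everything at once; individual frees are intentionally absent. */
void ArenaReset(Arena* a) { a->offset = 0; }
```

Allocation is a pointer bump and a bounds check, so it runs in constant time and cannot fragment; the trade-off is that memory is only reclaimed in bulk.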

@tmoreau89
Contributor

@liangfu regarding "superseding uTVM standalone runtime", will MISRA-C runtime support running on bare-metal systems?

@tmoreau89
Contributor

@ajtulloch @weberlo @u99127 (this might be of interest to you)

@liangfu
Member Author

liangfu commented Mar 23, 2020

> @liangfu regarding "superseding uTVM standalone runtime", will MISRA-C runtime support running on bare-metal systems?

Yes, at least it is intended to, but how should we provide a proper demo of this? Any ideas?

@tmoreau89
Contributor

We can test it on the STM board that @weberlo implemented a demo on: #4274

@liangfu
Member Author

liangfu commented Mar 23, 2020

Excellent idea. Perhaps we can also test the bare-metal demo in CI, with a simple RISC-V processor like picorv32.

@KireinaHoro

KireinaHoro commented Mar 24, 2020

> Cross-posting here. I think it is worth thinking about the memory allocation strategy. Specifically, we should design an API that contains a simple allocator (arena-like: it allocates memory from a stack and releases everything once done), and use that allocator for all memory in the program (including data structures and tensors). This would completely eliminate the use of system calls and allow the program to run on bare metal.

@tqchen Removing all external allocator use and going with an embedded arena allocator sounds a little bit fishy. Bare-metal platforms do not necessarily lack a proper allocator; newlib, for example, provides a pretty usable dlmalloc implementation. Are there any other concerns?

@liangfu
Member Author

liangfu commented Mar 24, 2020

In PR #5124, we have a reference allocator, which implements vmalloc, vrealloc, and vfree. When necessary, I think we can redirect the function calls to different implementations, e.g. dlmalloc in newlib, jemalloc and many others.
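
As a rough illustration of redirecting allocator calls to different implementations, the sketch below routes every allocation through a table of function pointers. All names here (`AllocatorOps`, `SetAllocatorOps`, `RuntimeAlloc`, `kCountingOps`, and so on) are hypothetical; they are not the `vmalloc` / `vrealloc` / `vfree` signatures from PR #5124. The default backend is the C library allocator, and a second backend that counts live allocations shows how a swap could work.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical dispatch table; illustrative only, not the CRT API. */
typedef struct {
  void* (*alloc)(size_t size);
  void* (*resize)(void* ptr, size_t size);
  void (*release)(void* ptr);
} AllocatorOps;

/* Default backend: whatever malloc/realloc/free the C library provides. */
static AllocatorOps g_ops = {malloc, realloc, free};

/* Swap in a different backend (e.g. a platform allocator or an arena). */
void SetAllocatorOps(const AllocatorOps* ops) { g_ops = *ops; }

/* The runtime would call only these wrappers, never the backend directly. */
void* RuntimeAlloc(size_t size) { return g_ops.alloc(size); }
void* RuntimeRealloc(void* ptr, size_t size) { return g_ops.resize(ptr, size); }
void RuntimeFree(void* ptr) { g_ops.release(ptr); }

/* Example alternative backend: wraps libc and counts live allocations.
 * It ignores the realloc(ptr, 0) edge case for brevity. */
static size_t g_live = 0;

static void* CountingAlloc(size_t size) {
  void* p = malloc(size);
  if (p != NULL) g_live++;
  return p;
}

static void* CountingRealloc(void* ptr, size_t size) {
  void* p = realloc(ptr, size);
  if (ptr == NULL && p != NULL) g_live++; /* realloc(NULL, n) acts as malloc */
  return p;
}

static void CountingFree(void* ptr) {
  if (ptr != NULL) g_live--;
  free(ptr);
}

const AllocatorOps kCountingOps = {CountingAlloc, CountingRealloc, CountingFree};

size_t LiveAllocations(void) { return g_live; }
```

Keeping the indirection in one table means a port only has to supply three functions, at the cost of one extra pointer dereference per call.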

I would agree with @KireinaHoro to use implementations in newlib for bare-metal applications.

For an arena-like allocator, I am concerned about how we would reuse large blocks of memory between conv layers if we don't release allocated workspaces promptly.

@tqchen
Member

tqchen commented Mar 24, 2020

The workspace memory could have a different strategy. The way it works is that we create a different arena for workspace, along with a counter.

  • When memory is allocated, we allocate it from the arena and increment the counter
  • When memory is de-allocated, we decrement the counter
  • When the counter reaches zero, we free all the memory.

This will work because all workspace memory is temporary. It also guarantees constant-time allocation.
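
A minimal sketch of this counter-based workspace arena, assuming a caller-provided, suitably aligned buffer. The names `WorkspaceArena`, `WorkspaceInit`, `WorkspaceAlloc` and `WorkspaceFree` are hypothetical and do not come from the TVM code base.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical workspace arena with a live-allocation counter. */
typedef struct {
  uint8_t* base;   /* caller-provided backing buffer */
  size_t capacity; /* size of the backing buffer in bytes */
  size_t offset;   /* bump pointer: bytes handed out so far */
  size_t live;     /* number of allocations not yet freed */
} WorkspaceArena;

void WorkspaceInit(WorkspaceArena* ws, void* buffer, size_t capacity) {
  ws->base = (uint8_t*)buffer;
  ws->capacity = capacity;
  ws->offset = 0;
  ws->live = 0;
}

/* Constant-time allocation: round the size up to a multiple of 8 bytes,
 * bump the pointer and increment the counter. Returns NULL when full. */
void* WorkspaceAlloc(WorkspaceArena* ws, size_t size) {
  size_t aligned = (size + 7u) & ~(size_t)7u;
  if (aligned < size || aligned > ws->capacity - ws->offset) return NULL;
  void* p = ws->base + ws->offset;
  ws->offset += aligned;
  ws->live++;
  return p;
}

/* Freeing only decrements the counter; the space is reclaimed in bulk
 * once the last outstanding allocation has been released. */
void WorkspaceFree(WorkspaceArena* ws, void* ptr) {
  (void)ptr; /* individual blocks are not tracked */
  if (ws->live > 0 && --ws->live == 0) ws->offset = 0;
}
```

Note the limitation raised earlier in the thread: space freed while other allocations are still live is not reused until the counter reaches zero.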

As a generalization: if most memory allocation happens in a RAII-style lifecycle, e.g. everything de-allocates once we exit a scope, then the counter-based strategy (per scope) should work pretty well.

I am not fixated on the arena allocator, but I would like to challenge us to think about how much simpler we can make the allocation strategy, given what we know about the workload. Of course, we could certainly bring in sub-allocator strategies that are more complicated, or fall back to libraries when needed.

@u99127
Contributor

u99127 commented Mar 24, 2020

Thanks for pointing this to me @tmoreau89 and thank you for this work @liangfu . Very interesting and good questions to ask.

From a design-level point of view for micro-controllers, I'd like to take this one step further and challenge folks to think about whether this can be achieved with static allocation rather than any form of dynamic allocation. The hypothesis is that at compile time one would know how much temporary space is needed between layers, rather than having to face a run-time failure.

Dynamic allocation on micro-controllers suffers from fragmentation issues, which raises the question of whether we want dynamic allocation in the runtime on micro-controllers at all. Further, the model being executed will be part of a larger application: how can we allow our users to specify the amount of heap available to, or consumed by, executing their model? It would be better to provide that through diagnostics at link time or compilation time rather than at runtime. @mshawcroft might have more to add. And yes, in our opinion, one of the challenges for micro-controllers is the availability and usage of temporary storage for working-set calculations between layers.
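
One way to picture the static-allocation idea is the sketch below, which sizes a single workspace buffer at compile time and rejects the build if it exceeds an application-chosen budget. Everything in it is an assumption for illustration: the per-layer sizes, the macro names and the budget are made up, and it presumes layers run one after another so that they can share one buffer sized for the largest of them. `_Static_assert` requires C11.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-layer workspace sizes, as a compiler might emit them. */
#define LAYER0_WORKSPACE_BYTES 4096u
#define LAYER1_WORKSPACE_BYTES 6144u
#define LAYER2_WORKSPACE_BYTES 2048u

#define MAX2(a, b) ((a) > (b) ? (a) : (b))

/* Layers are assumed to run sequentially, so one buffer sized for the
 * largest layer is enough. */
#define MODEL_WORKSPACE_BYTES \
  MAX2(LAYER0_WORKSPACE_BYTES, \
       MAX2(LAYER1_WORKSPACE_BYTES, LAYER2_WORKSPACE_BYTES))

/* Hypothetical budget chosen by the surrounding application. */
#define APP_WORKSPACE_BUDGET_BYTES 8192u

/* Fail the build, not the device, when the model does not fit. */
_Static_assert(MODEL_WORKSPACE_BYTES <= APP_WORKSPACE_BUDGET_BYTES,
               "model workspace exceeds the application's budget");

/* The only workspace memory the model ever uses; no heap involved. */
static uint8_t g_workspace[MODEL_WORKSPACE_BYTES];

uint8_t* ModelWorkspace(void) { return g_workspace; }
size_t ModelWorkspaceSize(void) { return sizeof(g_workspace); }
```

Because the buffer is an ordinary static array, its size also shows up in the linker map, which gives the link-time visibility asked for above.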

2 further design questions.

  1. In the micro-controller world, supporting every new device with its own memory map and so on will be painful. Beyond one simple reference implementation, I don't think we have an efficient route to deployment other than integrating with other platforms in the microcontroller space. How would this runtime integrate with other platforms like Zephyr, mbedOS or FreeRTOS?

  2. I'd be interested in extending CI with QEMU or some such for Cortex-M as well, or indeed on the STM board that you are using, @tmoreau89.

Purely a nit but from a rationale point of view, I would say that uTVM runtime not being tested in a CI is technical debt :)

regards
Ramana

@tqchen
Member

tqchen commented Mar 25, 2020

re: fragmentation issue, thinking through the allocation strategies carefully and adopting an arena-style allocator (counter-based, as above) can likely resolve the fragmentation issue. In terms of the total memory cost, we can indeed find it out at compile time for simple graph programs.

@liangfu
Member Author

liangfu commented Mar 25, 2020

It's very interesting to see that TFLite uses an arena-like allocator for micro-controllers. See how Adafruit demonstrates its PyBadge board with TFLite here.

@tqchen
Member

tqchen commented Mar 25, 2020

@liangfu can you try an arena-based approach, given that it is simpler? We could adopt the counter-based approach to enable early freeing of sub-arenas (when the counter in a sub-arena decreases to zero, we can free its space).

@liangfu
Member Author

liangfu commented Mar 26, 2020

Sure, as this is definitely the direction we should follow, I can do that. And maybe we need a separate PR for the arena allocator feature.

@Robeast

Robeast commented May 6, 2020

Hi @liangfu is there any update on your current implementation efforts? We are really looking forward to it!!

@liangfu
Member Author

liangfu commented May 8, 2020

Hi @Robeast, thanks for your interest. I only have a draft version of the new allocator for now; I'd like to send a PR later this week.

@masahi
Member

masahi commented Jan 9, 2022

Can we close this?

@areusch areusch added the needs-triage PRs or issues that need to be investigated by maintainers to find the right assignees to address it label Oct 19, 2022
@areusch areusch added vert:micro MicroTVM: src/runtime/micro, src/runtime/crt, apps/microtvm and removed needs-triage PRs or issues that need to be investigated by maintainers to find the right assignees to address it labels Nov 16, 2022
@tqchen tqchen closed this as completed Sep 20, 2024
8 participants