Neural Engine Support #336

BrianSemiglia · 2023-03-20T17:11:52Z

BrianSemiglia
Mar 20, 2023

Would be cool to be able to lean on the neural engine. Even if it wasn't much faster, it'd still be more energy efficient I believe.

EwoutH · 2023-06-07T11:08:06Z

EwoutH
Jun 7, 2023

@ggerganov Congratuations on getting Metal inference working (#1642), and getting ggml funded!

Now this project is maturing, are their more plans for Apple Neural Engine (ANE) support?

Some resources:

Sample Code: Training a Neural Network with Metal Performance Shaders
Use an MPS neural network graph to train a simple neural network digit classifier.
Article: Deploying Transformers on the Apple Neural Engine
Code: Apple Neural Engine (ANE) Transformers
Framework: Natural Language
Analyze natural language text and deduce its language-specific metadata.

0 replies

Dampfinchen · 2023-06-07T12:13:40Z

Dampfinchen
Jun 7, 2023

In addition, better tensor core support and support for the upcoming Meteor Lake VPUs and Ryzen AI enabled CPUs could be very beneficial.

I believe work done on getting the neural engine to run could directly translate to an even better CUDA and new DirectML acceleration.

0 replies

Marcelo5444 · 2023-06-10T10:06:43Z

Marcelo5444
Jun 10, 2023

Running the cpp code directly on the ANE is not posible. The only solution will be to chop some parts of the network into coreml models and call them inside the cpp code. Maybe the feedforward could be converted to coreml and run in paralalel. AFAIK is not easy to do and will add a lot of complicated logic inside the code. We need to consider if its worth it given the speed up of the ANE.

0 replies

shouyiwang · 2023-06-11T10:56:53Z

shouyiwang
Jun 11, 2023

If they are not willing to run LLM inference on iPhone, adding Neural Engine support would not be worthwhile.
Despite Apple's description, the NPU is a relatively small component of the iPhone's SOC, taking up only about 2-3 GPU cores' area(iPhone has 5 GPU cores). The NPU is even less noticeable on Macs, with the M2 Max having the same 16-core NPU as the iPhone. In comparison to the M2 Max's powerful GPU, the NPU's area is less than 1/10th.
This suggests that even Apple does not expect Mac developers to utilize the NPU, as it is relatively weak and insignificant. I assume the reason it included in Macs is just to improve MacOS's compatibility with iOS.

0 replies

Marcelo5444 · 2023-06-11T20:25:14Z

Marcelo5444
Jun 11, 2023

Your statement is not right. Ive got Mac M2 and the NPU even though is small, can run conv 2x faster than GPU. Also M2 Max has a different Neural Engine compared with the IPhone. We could do some computations on the ANE in order to reduce the load of the GPU. I think is something interesting to explore, however, the integration and sincronization inside the code is not trivial.

0 replies

shouyiwang · 2023-06-12T01:12:46Z

shouyiwang
Jun 12, 2023

@Marcelo5444
M2 has 10 core GPU, it is possible for a highly optimized NPU code to run faster than GPU. But that is probably not the case for Pro and Max version of M2 as they have the same NPU but much powerful GPU.
And do you have the source for your claim that "M2 Max has a different Neural Engine compared with the iPhone"?

0 replies

okpatil4u · 2023-06-12T04:26:55Z

okpatil4u
Jun 12, 2023

NPU is faster for convolution, but it doesn’t have enough speed for transformers. RNNs are not supported at all.

…

On Mon, 12 Jun 2023 at 6:42 AM, Shouyi Wang ***@***.***> wrote: @Marcelo5444 <https://github.com/Marcelo5444> M2 has 10 core GPU, it is possible for a highly optimized NPU code to run faster than GPU. But that is probably not the case for Pro and Max version of M2 as they have the same NPU but much powerful GPU. And do you have the source for your claim that "M2 Max has a different Neural Engine compared with the iPhone"? — Reply to this email directly, view it on GitHub <#336 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAXGU4D7NLVSULLLO6WSQVTXKZUJVANCNFSM6AAAAAAY5YZW2U> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

0 replies

Marcelo5444 · 2023-06-12T08:34:44Z

Marcelo5444
Jun 12, 2023

As someone who have been using coreml and ANE a lot lately, I can tell that maltmul (a big chunk of transformer FLOPS) can run faster on ane vs GPU in some specific cases. Here there is a good example. Here I attach a coreml model that does 100 matmuls. I get 4x speed up in Mac M2 using ANE (217ms 1316ms) (compared with GPU execution). You can easily run that with xcode
matmul.mlmodel.zip

This comparison is not 100% fair as llama has custom GPU kernels that have been optimized for GPU, but this shows ANE has a potential.

Maybe if someone has time, we could benchmark this matmuls with llama custom kernels.

2 replies

x4080 Jun 13, 2023

can llama model converted to coreml -> ANE ?

Marcelo5444 Jun 15, 2023

The entire model problaly no, a subseccion of it, yes

Marcelo5444 · 2023-06-12T09:00:12Z

Marcelo5444
Jun 12, 2023

@okpatil4u Are supported by RNN, at least the inference.

0 replies

okpatil4u · 2023-06-12T09:01:59Z

okpatil4u
Jun 12, 2023

This was their official response. Maybe the API has changed.

0 replies

Marcelo5444 · 2023-06-12T09:51:10Z

Marcelo5444
Jun 12, 2023

Yes, RNN are basically loops with matrix mult, The .mlmodel I attached previously can be seen as a RNN. So ANE supports vanilla RNN

3 replies

Johnhersh Jul 19, 2023

I'm curious about this from a technical perspective. What does ANE provide? Is it essentially just another smaller GPU?

For a project like this, what parts would be good targets to use ANE? What are the limitations?

thibault-brunel Jul 19, 2023

Advantage: ANE provides excellent energy efficiency on matrix multiplications, especially on FP16. You usually divide the W needed by 3, albeit often at a noticeable speed expense.
Drawback: Can only be used through CoreMLtools, and Apple decides at runtime if you're running on ANE/CPU/GPU. (Sometimes very randomly). Kind of a very limited GPU with an extremely stubborn bouncer at the door.
Targets: Feed Forward phase.
Limitations: Will require a lot of custom logic and workarounds, as RNNs are officially not supported.

nullhook Nov 29, 2023

why feed forward phase is a good target here as opposed to others?

agonzalezm · 2023-12-11T14:13:23Z

agonzalezm
Dec 11, 2023

what about next Intel NPU and AMD XNDA2 that are coming in new processors, from 2024 all consumer pcs will have a powefull NPU capable of 50TOPS as dictated per Windows 12 will require this and will increase year over year. Can this type of NPU acceleration be supported and speed up inference with llama? and how this 50TOPS translate to tokens per second?

0 replies

EwoutH · 2023-12-11T15:18:15Z

EwoutH
Dec 11, 2023

Apple released MLX last week: https://github.com/ml-explore/mlx

MLX is an array framework for machine learning on Apple silicon, brought to you by Apple machine learning research.

It might be useful in Neural Engine support. There is an example to use it to infer Llama: https://github.com/ml-explore/mlx-examples/tree/main/llama

4 replies

PawelSzpyt Jan 30, 2024

Link to example doesn't work. Can you show any reference to neural engine? I think the whole mlx works on GPU and does not utilize NE at all.

EwoutH Feb 1, 2024

They moved it to https://github.com/ml-explore/mlx-examples/tree/main/llms/llama

PawelSzpyt Feb 1, 2024

Yeah, I saw that. Still, I can't find anything related to Neural Engine in this whole repo, so either I'm bad at searching or I misunderstand your comment. Can you show me how this is related to Neural Engine? MLX framework states in its readme that it supports only GPU and CPU, neural engine is nowhere to be found. I'd love to run local inference on NE, being energy efficient and leaving my CPU&GPU available for other tasks, but I think its quite hard to do at this point?

easp Mar 14, 2024

@PawelSzpyt The lead developer confirms that MLX won't support ANE unless something changes: ml-explore/mlx#18 (comment)

agonzalezm · 2024-04-24T16:08:44Z

agonzalezm
Apr 24, 2024

What about new Snapdragon elite X announced today it has NPU capable of 40TOPS, they said windows 24H2 new local generative AI features will require this processor and NPU, so there must be some way to leverage this NPU for AI inferencing. Apple M3 hast 18 TOPS NPU this snapdragon is more than double.

Windows announced DirectML that supports NPU acceleration for machine learning models
https://devblogs.microsoft.com/directx/introducing-neural-processor-unit-npu-support-in-directml-developer-preview/

This must boost the usage of generative AI models locally without using cloud resources.

Here they say Copilot AI will require 40TOPS NPU to run locally
https://www.tomshardware.com/pc-components/cpus/intel-confirms-microsoft-copilot-will-soon-run-locally-on-pcs-next-gen-ai-pcs-require-40-tops-of-npu-performance

0 replies

DigitLib · 2024-06-23T10:10:20Z

DigitLib
Jun 23, 2024

Snapdragon and llama.cpp page
https://www.qualcomm.com/developer/blog/2024/04/big-performance-boost-for-llama-cpp-and-chatglm-cpp-with-windows

0 replies

ciekawy · 2024-07-22T23:43:56Z

ciekawy
Jul 22, 2024

Lot of cool stuff here.

https://huggingface.co/blog/mistral-coreml

1 reply

aleksandermajos Oct 30, 2024

Znakomicie!

TuzelKO · 2024-08-04T21:01:52Z

TuzelKO
Aug 4, 2024

Is there any point in supporting such a solution? https://mythic.ai/

1 reply

edhemphill Aug 26, 2024

I would say if llama.cpp wants to continue to be relevant in the next few years it will need to be very good at supporting all kinds of dedicated ML hardware, whether it be mythic, or these guys https://hailo.ai/products/generative-ai-accelerators/hailo-10h-m-2-generative-ai-acceleration-module/ or these guys https://www.syntiant.com/hardware or ...

There are dozens of them. Not to mention all the neural accelerators from better known people like Qualcomm, Huawei, etc. The future of running LLMs is most assuredly not GPUs. If llama.cpp can be the defacto standard on how you run LLMs on [blank] hardware, it might become one of the most critical pieces of open-source software in existence.

Otherwise we will all be stuck using all these guys dev kits. And, having developed embedded systems for long enough, that will be horribly painful. Silicon vendors are generally terrible maintainers of software.

hg0428 · 2024-10-10T12:01:40Z

hg0428
Oct 10, 2024

Apple has a reference implementation for transformers on the ANE.
https://github.com/apple/ml-ane-transformers

0 replies

giladgd · 2024-11-08T02:03:40Z

giladgd
Nov 8, 2024

I've done some research on what would be required to utilize the Neural Engine on Apple devices as a ggml backend.
It turns out that there are new CoreML APIs that are available since the latest OS releases (macOS 15+, iOS 18+, etc.) that allow allocating tensors directly (without having to use a model in the coreml format like before) and applying operations on them efficiently using the Neural Engine.
It's only available for usage from Swift and not Objective-C/C++, but we can expose the functions we need from Swift and use them from a cpp wrapper.

Here's an example of a matmul operation using CoreML in Swift from the Apple documentation:

let v1 = MLTensor([1.0, 2.0, 3.0, 4.0])
let v2 = MLTensor([5.0, 6.0, 7.0, 8.0])
let v3 = v1.matmul(v2)
v3.shape // is []
await v3.shapedArray(of: Float.self) // is 70.0


let m1 = MLTensor(shape: [2, 3], scalars: [
    1, 2, 3,
    4, 5, 6
], scalarType: Float.self)
let m2 = MLTensor(shape: [3, 2], scalars: [
     7,  8,
     9, 10,
    11, 12
], scalarType: Float.self)
let m3 = t1.matmul(r2)
m3.shape // is [2, 2]
await m3.shapedArray(of: Float.self) // is [[58, 64], [139, 154]]


// Supports broadcasting
let m4 = MLTensor(randomNormal: [3, 1, 1, 4], scalarType: Float.self)
let m5 = MLTensor(randomNormal: [4, 2], scalarType: Float.self)
let m6 = t4.matmul(t5)
m6.shape // is [3, 1, 1, 2]

To use the Neural Engine, the tensor operation needs to be wrapped with withMLTensorComputePolicy with a MLComputePolicy initialized with MLComputeUnits.cpuAndNeuralEngine (it can also be initialized with MLComputeUnits.all to let the OS spread the load between the Neural Engine, GPU and CPU).

This is a new API that was not available previously, so using this would mean that the Neural Engine support (using CoreML) will only be supported on recent OS versions.

The main benefit of this would be that we would utilize more of the compute available on Apple chips, which will allow performing more parallel operations that are also optimized on the chip level, which should lead to faster inference.

I'm not sure I'll have enough time to implement this myself soon, though.
I'll update if I do get to it to ensure we don't do duplicate work on this.

0 replies

Neural Engine Support #336

Replies: 19 comments · 11 replies

Replies: 19 comments 11 replies