After upgrading to version 1.8.0, the async function loadModelFromUrl is not completing when using large models #31

Closed
felladrin opened this issue May 13, 2024 · 8 comments
Labels
bug Something isn't working


@felladrin
Contributor

Something interesting occurred while upgrading to version 1.8.0. Previously, it had been throwing an "Out of Memory" error, but that issue has now been resolved. However, a new problem has surfaced: the async function loadModelFromUrl does not complete. It appears to be stuck in a state where it neither resolves nor rejects. It's possible that the error is being caught somewhere in the middle of the process and not passed up to the caller.

This issue can be reproduced with models that are too large to fit into the device's memory. It works perfectly fine with smaller models.

It's possible that this problem is related to the changes made in this pull request:

However, as I only encountered this issue on the iOS browser, it's also possible that it's related to this change:

If anyone would like to test this problem, you can use this 10-part split-gguf of TinyLlama on a device with less than 6GB of RAM: https://huggingface.co/Felladrin/gguf-sharded-TinyLlama-1.1B-1T-OpenOrca/resolve/main/tinyllama-1.1b-1t-openorca.Q3_K_S.shard-00001-of-00010.gguf.
(If an even larger model is needed, there are also Q4_K_M and Q8_0 versions available in this repository.)
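
For anyone hitting the same hang, a minimal sketch of one way to turn a promise that never settles into a catchable error. `withTimeout` is a hypothetical helper, not part of wllama, and the time limit is something the caller has to pick:

```typescript
// Hypothetical helper (not part of wllama): rejects with a "TimeoutError"
// if `task` has neither resolved nor rejected within `ms` milliseconds.
function withTimeout<T>(task: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      const error = new Error(`Operation did not settle within ${ms} ms`);
      error.name = "TimeoutError";
      reject(error);
    }, ms);
    task.then(
      (value) => {
        clearTimeout(timer); // settled in time: cancel the pending rejection
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error); // pass the original error through unchanged
      },
    );
  });
}
```

Wrapping the call as `withTimeout(wllama.loadModelFromUrl(url), someLimitMs)` inside a try/catch would then surface the hang as a rejection. Note that this only stops the caller from waiting; it does not cancel the underlying load or release any memory the worker is holding.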

@ngxson
Owner

ngxson commented May 14, 2024

Probably because the out-of-memory error is now thrown internally by the cpp code (and not by the worker js code). Can you confirm whether you see an error from llama_new_context_with_model? (ref. #12 (comment))

@flatsiedatsie
Contributor

flatsiedatsie commented May 14, 2024

Sounds like the same issue I came across here?

With version 1.8 Wllama doesn't seem to raise an error though? It just states the issue in the console. But my code thinks the model has loaded OK, even though it hasn't. Is there a way to get the failed state?

// Doh, you already figured that out :-)

@felladrin
Contributor Author

Thanks for the reference. There is a lot of good info in that thread!

I've just noticed a pattern regarding this issue:

The loadModelFromUrl function is only hanging when running multi-threaded. It doesn't even print the warnings on the console. When I connect the mobile to Safari DevTools, I see the following:
image
From the screenshot, we can see that the device was using n_threads == 2.

When I force it to use n_threads = 1 with the same model, it then prints the warnings and also triggers the error, allowing me to catch it with the try/catch.

This indicates that loadModelFromUrl fails to complete only when loading a too-large model with multi-threading.
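
The pattern above suggests a possible workaround, sketched here under several assumptions: `ModelLoader` and `LoadConfig` are hypothetical stand-ins for the parts of the loader this sketch needs, passing `n_threads` as a load option is assumed, and `createLoader` is a hypothetical factory (a fresh instance is used for the retry because the first worker may be left in a bad state). If the default load hangs past a chosen limit, it retries single-threaded, which either loads or fails with an error that try/catch can see:

```typescript
// Hypothetical minimal shapes; only what this sketch needs.
interface LoadConfig {
  n_threads?: number;
}
interface ModelLoader {
  loadModelFromUrl(url: string, config?: LoadConfig): Promise<unknown>;
}

async function loadOrFallBackToSingleThread(
  createLoader: () => ModelLoader, // hypothetical factory for a fresh instance
  url: string,
  hangTimeoutMs: number,
): Promise<{ loader: ModelLoader; threads: "default" | "single" }> {
  const first = createLoader();

  // A timer that resolves (not rejects) so a hang can be told apart from a real error.
  let cancelTimer = () => {};
  const hung = new Promise<"hung">((resolve) => {
    const timer = setTimeout(() => resolve("hung"), hangTimeoutMs);
    cancelTimer = () => clearTimeout(timer);
  });
  const loaded = first.loadModelFromUrl(url).then(() => "loaded" as const);

  try {
    // If the default load rejects, the error propagates to the caller here.
    const outcome = await Promise.race([loaded, hung]);
    if (outcome === "loaded") {
      return { loader: first, threads: "default" };
    }
  } finally {
    cancelTimer();
  }

  // The default load hung: retry on a fresh instance with a single thread.
  const second = createLoader();
  await second.loadModelFromUrl(url, { n_threads: 1 });
  return { loader: second, threads: "single" };
}
```

The retry is only attempted after a hang, not after a rejection, since a load that already failed with a real error would most likely fail the same way single-threaded.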

PS: I haven't tested your changes from #34.

@felladrin
Contributor Author

felladrin commented May 19, 2024

ℹ️ This issue (loadModelFromUrl hanging when used with multi-threading and loading a too-large model) is still present in v1.9.0.
I tried adjusting the stepBytes and maxBytes from getWasmMemory() to see if any combination could resolve the issue, but unfortunately, I couldn't find a solution. I've run out of ideas. Since it's running fine with small models, I've decided not to use large models (> 1 billion parameters) on mobile anymore.

Note: iOS browsers don't clear the memory of web workers properly when reloading the page. For instance, if the page is reloaded before calling wllama.exit(), trying to use wllama.loadModelFromUrl() will run with even lower memory than usual. So this hanging was more evident after reloading the page and re-running the inference.
Found these related issues that, unfortunately, don't have a solution:

image

@ngxson
Owner

ngxson commented May 21, 2024

@felladrin Sorry for the late response. Yeah seems like there are a lot of problems with Safari on iOS.

This issue (loadModelFromUrl hanging when used with multi-threading and loading a too-large model) is still present in v1.9.0.

Do you get the same error as last time (i.e. Aborted()) ?

iOS browsers don't clear the memory of web workers properly when reloading the page.

We could probably make the web worker exit by itself when the page reloads. But I'm still hesitant to do this, since it should be the browser's responsibility. I'll have a look at this when I have more time.
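
Until something like that exists in the library, an application-side version of the idea could be sketched as follows. `Exitable` and `PageEvents` are hypothetical interfaces for illustration; the only call taken from this thread is `exit()`, and whether running it on `pagehide` actually frees the memory on iOS is untested:

```typescript
// Hypothetical minimal shapes; only what this sketch needs.
interface Exitable {
  exit(): Promise<unknown> | void;
}
interface PageEvents {
  addEventListener(type: string, listener: () => void): void;
}

// Ask the instance to shut down when the page is being hidden or unloaded.
function exitOnPageHide(page: PageEvents, instance: Exitable): void {
  page.addEventListener("pagehide", () => {
    try {
      // Fire and forget: the page may be torn down before an async exit settles.
      const result = instance.exit();
      Promise.resolve(result).catch(() => {
        // Ignore failures during teardown; there is nothing left to recover.
      });
    } catch (_error) {
      // A synchronous failure during teardown is ignored for the same reason.
    }
  });
}
```

In a browser this would be wired up once after creating the instance, along the lines of `exitOnPageHide(window, wllama)`.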

@felladrin
Contributor Author

Ah, no worries @ngxson!
My intention was just to document it, so other devs facing this issue can get some clue. But I'm not waiting for it to be fixed, as it's working pretty well with models under 500M params.

Not sure when I'll try larger models on iOS again, but if I find anything new, I'll share here!

ngxson added the "bug" (Something isn't working) label on Jun 25, 2024
@felladrin
Contributor Author

felladrin commented Oct 4, 2024

After the launch of iOS 18, most of those out-of-memory issues seem to be gone! 🎉

I noticed that Apple now forces Safari to hard-reload the page when it finds the page running with too little memory. After the reload, with more memory available, the models usually run fine. Wllama can easily run 1B models (e.g. Llama 3.2 1B Q4_K_M) on an iPhone with less than 6GB of memory.

@flatsiedatsie
Contributor

Even the next iPhone SE is rumored to have 8GB of memory, so Apple is quickly making 8GB the new baseline. (The latest iPhone also comes with at least 8GB).

3 participants