
llama 3.1 405B fp8 support #383

Open
endomorphosis opened this issue Jul 31, 2024 · 6 comments
Labels
DEV features

Comments

@endomorphosis

I have been staging some updates testing the tgi-gaudi software with Llama 3.1 405B fp8. I am waiting for Optimum Habana to approve the PR; after that I will submit a PR for huggingface/tgi-gaudi, and then a PR for TGI in the microservices.

I got it running on Xeon with llama.cpp (which is what Ollama is based on) at 1 tok/s on Sapphire Rapids. Next I am going to test speculative decoding with Llama 3.1 8B as the draft model, which should improve performance 10-20x depending on how many draft tokens the target model accepts. However, Ollama is broken, and that will need to be investigated further.
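For readers unfamiliar with the technique: a small draft model cheaply proposes a run of tokens, and the large target model verifies them in one pass, keeping the longest agreeing prefix. The sketch below is a toy greedy-acceptance version for illustration only; it is not the llama.cpp implementation, and the function names are made up:

```python
def speculative_step(target_next, draft_next, prefix, k=4):
    """One speculative-decoding step (toy, greedy version).

    target_next / draft_next: callables mapping a token sequence to that
    model's greedy next token. Returns the tokens accepted this step.
    """
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2. The target model verifies the proposals; accept the longest
    #    matching prefix, then substitute the target's own token at the
    #    first mismatch.
    accepted = []
    ctx = list(prefix)
    for t in proposed:
        expected = target_next(ctx)
        if expected == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(expected)  # target's correction
            break
    else:
        accepted.append(target_next(ctx))  # bonus token when all k match
    return accepted
```

When the draft model agrees with the target, one step emits k+1 tokens for a single "pass" of the expensive model, which is where the 10-20x figure would come from on long agreeing runs.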

@kevinintel kevinintel added the DEV features label Jul 31, 2024
@endomorphosis
Author

(image attachment)

lkk12014402 pushed a commit that referenced this issue Aug 8, 2024
@endomorphosis
Author

I have thoroughly gone through all of the examples and interfaces with regard to:

optimum-habana
Intel Neural Compressor (3.0 and 2.4)
tgi-gaudi

Currently these libraries are in a state of disrepair for everything but bf16, as a result of a lack of unit testing, integration testing, and regression testing. The examples do not work because they were written for earlier library versions that are no longer compatible with the current versions of the other libraries. The only quantization path that does work is compile-time quantization, which improves inference speed but does not actually reduce the number of devices needed to run a model. As a result it is currently impossible to run Llama 3.1 405B on a single node, but only because the software packages are not being maintained in a functioning state.
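The device-count point can be made concrete with back-of-envelope weight-memory math. The numbers below are assumptions for illustration (96 GB of HBM per Gaudi2 card, 8 cards per node) and ignore KV-cache and activation memory:

```python
import math

PARAMS = 405e9           # Llama 3.1 405B parameter count
HBM_PER_CARD = 96e9      # assumed HBM per accelerator, in bytes
CARDS_PER_NODE = 8

def min_cards(bytes_per_param):
    """Minimum cards needed just to hold the weights."""
    weight_bytes = PARAMS * bytes_per_param
    return math.ceil(weight_bytes / HBM_PER_CARD)

# bf16: 2 bytes/param -> 810 GB of weights alone -> more than one node
# fp8:  1 byte/param  -> 405 GB of weights       -> fits on one node
for name, bpp in [("bf16", 2), ("fp8", 1)]:
    print(name, min_cards(bpp), "cards")
```

This is why load-time fp8 quantization matters here: compile-time quantization still loads bf16 weights first, so the bf16 row is what determines how many devices you need.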

I have spent 3 days so far on this endeavor, and I am unwilling to take the time needed to become a maintainer of those libraries, even though I do want to reduce hallucinations in my language modeling tasks. @jaanli has been asking me to finish my AGPL edge-oriented MLOps infrastructure package more quickly so that he can migrate away from Google Cloud TPUs.

endomorphosis/ipfs_transformers_py#1 (comment)

@jaanli

jaanli commented Aug 12, 2024

Thanks so much @endomorphosis and on behalf of @onefact! Giving a talk on Thursday if it's possible to demo any edge models at https://duckdb.org/2024/08/15/duckcon5

Even just an encoder-only small transformer like what I did before: https://arxiv.org/abs/1904.05342 (let me know if you need HF links :)

@endomorphosis
Author

> Thanks so much @endomorphosis and on behalf of @onefact! Giving a talk on Thursday if it's possible to demo any edge models at https://duckdb.org/2024/08/15/duckcon5
>
> Even just an encoder-only small transformer like what I did before: https://arxiv.org/abs/1904.05342 (let me know if you need HF links :)

I have no idea what hardware you are running it on.

@jaanli

jaanli commented Aug 12, 2024

> Thanks so much @endomorphosis and on behalf of @onefact! Giving a talk on Thursday if it's possible to demo any edge models at https://duckdb.org/2024/08/15/duckcon5
>
> Even just an encoder-only small transformer like what I did before: https://arxiv.org/abs/1904.05342 (let me know if you need HF links :)

> I have no idea what hardware you are running it on.

Ah yes, sorry - an iPhone 15 Pro Max with the latest firmware.

@endomorphosis
Author

HabanaAI/vllm-fork#144
I am going to continue attempting to get Llama 405B working with speculative decoding using Llama 8B as the draft model, and to process some Wikipedia datasets and embeddings.
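For the Wikipedia-embeddings step, the usual first stage is splitting articles into overlapping windows before they are fed to an embedding model. A minimal sketch, with arbitrary window and overlap sizes not tied to any particular model:

```python
def chunk_text(text, size=512, overlap=64):
    """Split text into overlapping character windows for embedding.

    Each chunk shares its last `overlap` characters with the start of
    the next chunk, so sentences cut at a boundary still appear whole
    in at least one window.
    """
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```

In practice one would chunk by tokens rather than characters, but the sliding-window-with-overlap structure is the same.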
