
fix: repack for marlin when single scale is provided #2414

Closed
wants to merge 2 commits

Conversation

drbh (Collaborator) commented on Aug 13, 2024:

This PR adjusts the conditional for repacking FP8 weights for Marlin so that it runs when a single scale is provided. This avoids an IndexError when scales contains only a single value.

Not related to: #2388

@@ -39,7 +39,8 @@ def __init__(
         log_once(logger.info, "GPU does not support FP8, using Marlin FP8 kernel")
 
         scales = scales.unsqueeze(0)
-        if scales.shape[1] == 1:
+        # repack weights for Marlin if a single scale is provided
+        if scales.size(0) == 1:
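A minimal sketch of the failure mode, assuming scales arrives as a 0-dim per-tensor scale (the exact shapes in TGI may differ):

import torch

# Per-tensor FP8 quantization produces a single scalar scale.
scales = torch.tensor(0.5)

scales = scales.unsqueeze(0)  # 0-dim tensor -> shape (1,)

try:
    scales.shape[1]  # old check: dimension 1 does not exist for a 1-D tensor
except IndexError as e:
    print(e)  # tuple index out of range

print(scales.size(0) == 1)  # new check: True, no exception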
A collaborator commented on the changed line:

Suggested change:
-        if scales.size(0) == 1:
+        if scales.shape[0] == 1:

Can you explain where this change is coming from?
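(For context: the suggested change is purely stylistic, since in PyTorch Tensor.size(dim) and tensor.shape[dim] return the same value. A quick check:)

import torch

t = torch.ones(1, 4)
assert t.size(0) == t.shape[0] == 1  # size(dim) and shape[dim] agree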

drbh (author) commented on Aug 14, 2024:

Apologies for not including an example above. Currently, if you attempt to quantize an unquantized model with Marlin FP8, the line above throws when Marlin is used to repack:

text-generation-launcher --model-id meta-llama/Meta-Llama-3-8B --quantize fp8

With the change above, the model loads and generates as expected:

curl 127.0.0.1:3000/generate -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' -H 'Content-Type: application/json'
# {"generated_text":" Deep learning is a subset of machine learning that is inspired by the structure and function of the human brain"}

Narsil (Collaborator) commented on Aug 14, 2024:

This change doesn't seem to fix neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 for me.

Narsil commented on Aug 29, 2024:

I'm confused: this still doesn't fix the neuralmagic model.

text-generation-launcher --model-id meta-llama/Meta-Llama-3-8B --quantize fp8

is currently working on main. This might have been fixed by something else?

Can we introduce a failing test before fixing this?
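A sketch of what such a failing test might look like, with the conditional pulled out into a hypothetical helper (needs_repack is illustrative, not TGI's actual API):

import torch

def needs_repack(scales: torch.Tensor) -> bool:
    # Hypothetical extraction of the conditional under discussion,
    # as it stood before this PR.
    scales = scales.unsqueeze(0)
    return scales.shape[1] == 1  # raises IndexError for a 0-dim scale

def test_single_scale_triggers_repack():
    scales = torch.tensor(0.5)  # per-tensor FP8 scale
    # Fails today with an IndexError; should pass once the check
    # inspects an existing dimension (e.g. scales.size(0) == 1).
    assert needs_repack(scales)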

Narsil commented on Oct 1, 2024:

Closing as stale, feel free to reopen.

Narsil closed this on Oct 1, 2024.