
Add Developer Applications Demo using Transformers Library #10

Draft · wants to merge 18 commits into base: main
Conversation

@hupreti (Contributor) commented on May 7, 2024

Developer Applications on Cloud AI 100 using Transformers Library

[screenshot: app_ui]

Signed-off-by: Himanshu Upreti <[email protected]>
Removed app_config.json; instructions to create the .json are added in Readme.md

Signed-off-by: Himanshu Upreti <[email protected]>
app/cert.pem (outdated; resolved)
Remove cert.pem and key.pem and update Readme.md with instructions to generate them

Signed-off-by: Himanshu Upreti <[email protected]>
app/Readme.md (3 review threads; resolved)
QEfficient/generation/LLMGenerator.py (6 review threads; outdated, resolved)
app/utils.py (outdated; resolved)
@anujgupt-github added the "good first issue" label on May 9, 2024
@ochougul (Contributor) left a comment:

The review is not yet complete. Will review the rest later.

QEfficient/generation/llm_generator.py (2 review threads; outdated, resolved)
from typing import Dict, List, Optional, Union
from threading import Thread

from QEfficient.generation.cloud_infer import QAICInferenceSession

This import should be placed below the transformers import; read more here: https://peps.python.org/pep-0008/#imports
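For reference, a minimal sketch of the grouping PEP 8 describes: standard library first, then third-party packages such as transformers, then local imports, each group separated by a blank line (the exact transformers names are assumptions, not taken from the diff):

# Standard-library imports
from threading import Thread
from typing import Dict, List, Optional, Union

# Third-party imports
from transformers import AutoTokenizer, TextIteratorStreamer

# Local application imports
from QEfficient.generation.cloud_infer import QAICInferenceSession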

self.session = None
self.tokenizer = None
self.is_first_prompt = False
self.model_name = ""

Can this be replaced with line 66? Is there a purpose to assigning an empty string here?

self.tokenizer = None
self.is_first_prompt = False
self.model_name = ""
self.qpc_path = ""

Can this be replaced with lines 61-63?
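As a rough sketch of the consolidation being suggested (the constructor signature is an assumption; the attribute names come from the snippet above), the defaults could be assigned exactly once:

def __init__(self, model_name: str, qpc_path: str):
    # Set the empty defaults a single time ...
    self.session = None
    self.tokenizer = None
    self.is_first_prompt = False
    # ... and bind the real values directly, instead of assigning ""
    # first and overwriting it a few lines later.
    self.model_name = model_name
    self.qpc_path = qpc_path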

app/app.py (outdated)
def infer_prompt(msg, chat_history, task, model):
    global last_prompt, previous_current_ctx_len, last_state_generation_ids

    qeff_generator_model.curr_cache_index = 0

Is this required? The default is 0 anyway.

qeff_generator_model.generated_ids = []

if qeff_generator_model.curr_cache_index >= qeff_generator_model.ctx_len - 1:
    qeff_generator_model.curr_cache_index = 0

Why is this required?

app/app.py (outdated)
if qeff_generator_model.curr_cache_index >= qeff_generator_model.ctx_len - 1:
    qeff_generator_model.curr_cache_index = 0

qeff_generator_model.curr_cache_index = 0

Why reset curr_cache_index again here? This looks redundant with the reset just above.

except Exception as err:
    raise RuntimeError(f"Unable to load tokenizer, {err}")

if streamer:

Handle the else case and raise an error.
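A minimal sketch of the suggested handling (the exception type and message are assumptions):

if streamer:
    ...  # existing streaming path
else:
    # Fail loudly instead of silently skipping generation.
    raise ValueError("A streamer instance is required but was not provided.")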

hf_token = None
if os.getenv("HF_TOKEN") is not None:
    hf_token = os.getenv('HF_TOKEN')
tokenizer = AutoTokenizer.from_pretrained(

Have you tested this on the models present in the tests? I have seen it fail with a config-file-not-found error from Hugging Face; the work-around is to use hf_download (see the code in QEfficient/cloud/infer.py::80-87). You can allow only the tokenizer file pattern, since you don't need the other model files for the tokenizer.
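A hedged sketch of the tokenizer-only download the reviewer describes, using huggingface_hub.snapshot_download with an allow-list so the model weights are never fetched (the file patterns and the model_name variable are assumptions; hf_download in QEfficient may wrap something similar):

from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

# Fetch only the files the tokenizer needs, not the ONNX/weight files.
local_path = snapshot_download(
    repo_id=model_name,
    allow_patterns=["*.json", "*.txt", "*.model", "tokenizer*"],
)
tokenizer = AutoTokenizer.from_pretrained(local_path, padding_side="left")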

@hupreti (author) replied:

Yes, hf_download is used unnecessarily; even in places where we only need the tokenizer, we are still downloading all the files. @ochougul please raise an issue.

A reviewer (Contributor) replied:

Here, we download all the files while only needing the tokenizer, on the assumption that the user already has all the files in cache_dir, in which case the files won't be downloaded again; that is how this API is supposed to be used.

If you want to change it to download only the tokenizer files when only the tokenizer is needed, you are welcome to update it. That is definitely a better way to do this.

You can raise the issue internally yourself; no need to ask anybody. Thanks.

Signed-off-by: Himanshu Upreti <[email protected]>
@@ -60,6 +60,7 @@ def main(
    device_group: List[int] = [
        0,
    ],
    execute : bool = True

This functionality already exists in QEfficient/cloud/execute.py::main; it is not required here.

time.sleep(0.07)

# calling infer api directly to get qpc_path
app_config[task][model_name]['qpc_path'] = infer_api(
@ochougul (Contributor) commented on May 13, 2024:

Please use the QEfficient/cloud/execute.py::main function, or directly use the latency_kv_stats function. As I understand it, you want to compile and then execute; you can use "from QEfficient.cloud.compile import main as compile" and use it similar to its use in infer.py. Let's not update infer.py for this.

There is no way a customer will use an execute option inserted into the infer API via the command line, and infer.py is supposed to be a CLI API.

The reviewer added:

If you want, you can create a utils function called compile_and_execute, put it inside QEfficient/cloud/utils.py, and then change the infer API to use the same.

It's your choice.
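A hypothetical sketch of such a helper; the main entry points in QEfficient.cloud.compile and QEfficient.cloud.execute are referenced in the discussion above, but every keyword argument shown here is an assumption, not the repository's actual signature:

# Hypothetical helper for QEfficient/cloud/utils.py; argument names are illustrative only.
def compile_and_execute(model_name, qpc_dir, device_group, prompt):
    from QEfficient.cloud.compile import main as compile_model
    from QEfficient.cloud.execute import main as execute_model

    # Compile the exported model into a QPC, then run it on the given devices.
    qpc_path = compile_model(model_name=model_name, qpc_path=qpc_dir,
                             device_group=device_group)
    execute_model(model_name=model_name, qpc_path=qpc_path,
                  device_group=device_group, prompt=prompt)
    return qpc_path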

@hupreti (author) commented on May 13, 2024:

@anujgupt-github @vbaddi I believe we need to discuss the requirements for high-level APIs and low-level APIs and their use cases. Here the use case is not compile-and-execute. However, we can take https://github.com/daquexian/onnx-simplifier as an example to understand a top-level CLI API versus a top-level programmable API.

@ochougul replied on May 13, 2024:

Please come up with a design if you have a better solution; happy to discuss it. Thanks for sharing the onnx-simplifier link.
I agree that we need an API that goes end-to-end from the HF model to execution, but that doesn't mean we should destroy code scalability for it.

ignore_patterns=["*.txt", "*.onnx", "*.ot", "*.md", "*.tflite", "*.pdf"],
)
tokenizer = AutoTokenizer.from_pretrained(model_hf_path, use_cache=True, padding_side="left")
if hf_token is not None:

This code (lines 83-90) can also be moved under the main function: pass model_hf_path as an input to infer_api and return qpc_path as its output.

@hupreti (author) replied:

Why should the user download the hf_model files from the model card and provide model_hf_path? Please keep in mind that this is a high-level API for the user.

@hupreti marked this pull request as a draft on May 14, 2024 at 03:52
Labels: good first issue (Good for newcomers)
Projects: none yet
5 participants