text_generator_main.cc inference with the TinyLlama model can show garbled characters #109

Open
nigelzzz opened this issue Jul 26, 2024 · 18 comments

Comments

@nigelzzz

Description of the bug:

  • Using generative/examples/tiny_llama/convert_to_tflite.py to convert the model to *.tflite (no quantization)
  • Using text_generator_main.cc to load tiny_llama_seq512_kv1024.tflite, the output is:
Prompt:
how are you?
Output text:
betbetbetesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdesterdü lipesterdü lipesterdü lipesterdü lipesterd lipesterd lipesterd lipesterd lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lip lipLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLoginLogin

Actual vs expected behavior:

No response

Any other information you'd like to share?

No response

@nigelzzz nigelzzz added the type:bug label Jul 26, 2024
@pkgoogle pkgoogle self-assigned this Jul 26, 2024
@pkgoogle pkgoogle added the status:awaiting user response and status:more data needed labels Jul 26, 2024
@pkgoogle
Contributor

Hi @nigelzzz, can you please provide more information so that we can reproduce this? For example, which version of Python are you using? Which branch are you using?

Please also provide reproduction steps, like:

python convert_to_tflite.py
<whatever commands you used to run the model>

Thanks!

@nigelzzz
Author

nigelzzz commented Jul 28, 2024

Hi @pkgoogle,
Python version: 3.9.5
ai-edge-torch branch: v0.1.1

/ai-edge-torch/ai_edge_torch/generative/examples/tiny_llama/convert_to_tflite.py
python3 convert_to_tflite.py

Then we can see tiny_llama_seq512_kv1024.tflite in the current path.

I built /mnt/data/nigel_wang/ai-edge-torch/ai_edge_torch/generative/examples/c++/text_generator_main.cc with a modification; you can reference it below.

 // Prepare helpers
 std::unique_ptr<tflite::FlatBufferModel> LoadModel() {
   std::unique_ptr<tflite::FlatBufferModel> model =
@@ -85,7 +93,13 @@ std::unique_ptr<tflite::Interpreter> BuildInterpreter(
   tflite::ops::builtin::BuiltinOpResolver resolver;
   // NOTE: We need to manually register optimized OPs for KV-cache and
   // Scaled Dot Product Attention (SDPA).
-  tflite::ops::custom::GenAIOpsRegisterer(&resolver);
+  resolver.AddCustom("odml.update_kv_cache",
+                     tflite::ops::custom::Register_KV_CACHE());
+  resolver.AddCustom("odml.scaled_dot_product_attention",
+                     tflite::ops::custom::Register_SDPA());
+
+  // tflite::ops::custom::GenAIOpsRegisterer(&resolver);

Parameters:

  • model path: tiny_llama_seq512_kv1024.tflite
  • sentencepiece_model: TinyLlama-1.1B-Chat-v1.0/tokenizer.model
  • start_token :
  • stop_token :
  • num_threads: 4

@nigelzzz
Author

@pkgoogle,
btw, I have a small question:
can I know the source of /ai-edge-torch/tree/main/ai_edge_torch/generative/examples/tiny_llama/tiny_llama_lm_logits.pt?

I can't find the file in the Llama Hugging Face repo.

@haozha111
Contributor

> @pkgoogle, btw, I have a small question: can I know the source of /ai-edge-torch/tree/main/ai_edge_torch/generative/examples/tiny_llama/tiny_llama_lm_logits.pt?
>
> I can't find the file in the Llama Hugging Face repo.

The .pt file is used as a golden test set for our development; it is not available on HF. @talumbau can confirm as well.
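
For context, a sketch of how such a golden-logits file is typically produced: run the reference PyTorch model once and save the logits to compare against. This is an illustration, not the repo's actual generation script; the HF model id is real, but the prompt and output filename are made up:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

ids = tok("hello", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits  # shape [1, seq_len, vocab]
# Save the last-token logits as a local golden file.
torch.save(logits[0, -1, :], "my_tiny_llama_goldens.pt")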

@nigelzzz
Author

@haozha111 thanks very much!!!

@pkgoogle
Contributor

Hi @nigelzzz, which checkpoint data are you using from the original tiny_llama model? Thanks for your help.

@pkgoogle pkgoogle removed the status:awaiting user response and status:more data needed labels Jul 30, 2024
@nigelzzz
Author

nigelzzz commented Aug 5, 2024

@pkgoogle,
hi, were you able to reproduce it, or do you have any suggestions for debugging it? I can help solve it.

Thanks!!

@pkgoogle
Contributor

pkgoogle commented Aug 5, 2024

Hi @nigelzzz, @hheydary is currently assigned to this case. I would first check whether you still get the same result after removing your modifications. If not, then you know it has something to do with your update. If so: you said "can show", so does this happen often or just once in a while? If it happens only in particular instances, that will be good data to share with us. If it happens all the time, it should show up in the loss when validating on a known dataset. Those would be good places to start. Hope that helps.

@hheydary
Contributor

hheydary commented Aug 5, 2024

Hi @nigelzzz,
Instruction-tuned models (and language models in general) are trained to recognize specialized tokens and take actions when they see them. First, I noticed that you are not including the BOS and EOS tokens when running the model. Those tokens for the model you mentioned can be found here. Additionally, for best results, you need to manually add the "chat template" that was used to train the model to your input prompt. From the model's page on HF, the template looks like this:

# <|user|>
# How many helicopters can a human eat in one sitting?</s>
# <|assistant|>
# ...

i.e., <|user|> \n PROMPT \n <|assistant|>.
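
A minimal sketch of applying that template in Python (build_prompt is a hypothetical helper; the exact whitespace and </s> placement are an assumption, so check the model card for the authoritative template):

# Hypothetical helper, not part of the repo: wraps a raw prompt in the
# TinyLlama chat template quoted above.
def build_prompt(user_message: str) -> str:
    return f"<|user|>\n{user_message}</s>\n<|assistant|>\n"

print(build_prompt("How many helicopters can a human eat in one sitting?"))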

@nigelzzz
Author

nigelzzz commented Aug 6, 2024

Hi @hheydary and @pkgoogle,
my output still shows garbled characters.
https://github.com/google-ai-edge/ai-edge-torch/blob/main/ai_edge_torch/generative/examples/tiny_llama/tiny_llama.py
Can I use the above file to test text generation?

Prompt:
<|user|>
 Write an email:
 <|assistant|>
Output text:
agyagyagyagyagyagyagyagyagyagyagyagyagyagyagyagyagyścingtonścścścścingtonścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścirościrościrościrościrościrościrościrościrościrościrościrościroiroirościrościroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroirooczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczocz

@hheydary
Contributor

hheydary commented Aug 6, 2024

Unfortunately, I am not able to reproduce the issue that you are seeing. Using the following command:

bazel run -c opt //ai_edge_torch/generative/examples/c++:text_generator_main -- --tflite_model=model.tflite --sentencepiece_model=tokenizer.model --prompt="<|user|> \n Write and email:\n <|assistant|>" --start_token="<s>" --stop_token="</s>" --num_threads=16

The model generates reasonable outputs.

A few things:

  • Make sure that you have the correct tokenizer file (shipped as part of the raw checkpoint).
  • Please make sure the correct set of arguments is passed, including the start and stop tokens; a quick sanity check for both is sketched below.
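
A sketch of that sanity check, assuming the Python sentencepiece package and the tokenizer.model shipped with the TinyLlama checkpoint:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
ids = sp.encode("<|user|>\nWrite an email:\n<|assistant|>")
print(sp.decode(ids))          # should round-trip to readable text
print(sp.piece_to_id("<s>"))   # BOS id; should differ from the unk id
print(sp.piece_to_id("</s>"))  # EOS id

If decode() does not round-trip to readable text, the tokenizer file does not match the checkpoint.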

@nigelzzz
Author

nigelzzz commented Aug 7, 2024

@hheydary,
thanks for your response!!

Also, this assertion in tiny_llama.py fails for me:

assert torch.allclose(
      tiny_llama_goldens, lm_logits[0, idx.shape[1] - 1, :], atol=1e-05
  )

  • Which TensorFlow library links with text_generator_main (libtensorflow.so or libtensorflowlite.so)?
    Because my target machine is not Android; it's Yocto Linux, e.g., RPi 4/5.
  • Do you have any suggestions on how to configure it without the Android flag?
  • Or can you share your TinyLlama model (tflite format)?
  • Which version did you use (v0.2.0)?

@nigelzzz
Author

nigelzzz commented Aug 7, 2024

@hheydary,
when I use 0.2.0 and then run python3 tiny_llama.py, the output shows the following.

git branch
* (HEAD detached at origin/release/0.2.0)
2024-08-07 11:09:48.229016: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1723028988.241253  364737 cuda_dnn.cc:8439] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1723028988.245210  364737 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-08-07 11:09:48.254251: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-08-07 11:09:48.938564: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/mnt/data/nigel_wang/ai-edge-torch/ai_edge_torch/generative/examples/tiny_llama/tiny_llama.py:153: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  tiny_llama_goldens = torch.load(current_dir / "tiny_llama_lm_logits.pt")
Traceback (most recent call last):
  File "/mnt/data/nigel_wang/ai-edge-torch/ai_edge_torch/generative/examples/tiny_llama/tiny_llama.py", line 168, in <module>
    define_and_run()
  File "/mnt/data/nigel_wang/ai-edge-torch/ai_edge_torch/generative/examples/tiny_llama/tiny_llama.py", line 162, in define_and_run
    assert torch.allclose(
AssertionError
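
A small sketch for narrowing this down: report how far the computed logits actually are from the goldens, instead of only asserting. report_logit_drift is a hypothetical helper; the variable names mirror those in tiny_llama.py:

import torch

def report_logit_drift(goldens: torch.Tensor, logits: torch.Tensor) -> None:
    # Print the worst and average deviation from the golden logits.
    diff = (goldens - logits).abs()
    print(f"max abs diff:  {diff.max().item():.6e}")
    print(f"mean abs diff: {diff.mean().item():.6e}")

# Inside define_and_run(), before the assert:
#   report_logit_drift(tiny_llama_goldens, lm_logits[0, idx.shape[1] - 1, :])

A max diff only slightly above the 1e-05 tolerance would suggest numerics or version drift; a large one would suggest a wrong checkpoint or tokenization.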

@nigelzzz
Author

nigelzzz commented Aug 7, 2024

  • I am using the v0.2.0 branch
  • Build command:
CC=/usr/bin/clang-18 bazel run -c opt //ai_edge_torch/generative/examples/c++:text_generator_main -- --tflite_model=/mnt/data/nigel_wang/ai-edge-torch/ai_edge_torch/generative/examples/tiny_llama/ttiny_llama_seq512_kv1024.tflite --sentencepiece_model=/mnt/data/nigel_wang/ai-edge-torch/ai_edge_torch/generative/examples/tiny_llama/TinyLlama-1.1B-Chat-v1.0/tokenizer.model --prompt="<|user|> \n Write and email:\n <|assistant|>" --start_token="<s>" --stop_token="</s>" --num_threads=1
  • Output:
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
DEBUG: /mnt/data/nigel_wang/tensorflow_cache/153a550227f3ff2fa4e4811633058a05/external/org_tensorflow/third_party/repo.bzl:132:14:
Warning: skipping import of repository 'com_google_absl' because it already exists.
DEBUG: /mnt/data/nigel_wang/tensorflow_cache/153a550227f3ff2fa4e4811633058a05/external/org_tensorflow/third_party/repo.bzl:132:14:
Warning: skipping import of repository 'XNNPACK' because it already exists.
INFO: Analyzed target //ai_edge_torch/generative/examples/c++:text_generator_main (147 packages loaded, 3826 targets configured).
INFO: From Compiling src/google/protobuf/generated_message_tctable_lite.cc [for tool]:
external/protobuf~/src/google/protobuf/generated_message_tctable_lite.cc:347:14: warning: unused function 'Offset' [-Wunused-function]
  347 | inline void* Offset(void* base, uint32_t offset) {
      |              ^~~~~~
1 warning generated.
INFO: From Compiling src/google/protobuf/compiler/cpp/helpers.cc [for tool]:
external/protobuf~/src/google/protobuf/compiler/cpp/helpers.cc:197:25: warning: unused function 'VerifyInt32TypeToVerifyCustom' [-Wunused-function]
  197 | inline VerifySimpleType VerifyInt32TypeToVerifyCustom(VerifyInt32Type t) {
      |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 warning generated.
INFO: From Executing genrule @@org_tensorflow//tensorflow/lite/acceleration/configuration:configuration_schema:
When you use --proto, that you should check for conformity yourself, using the existing --conform
INFO: Found 1 target...
Target //ai_edge_torch/generative/examples/c++:text_generator_main up-to-date:
  bazel-bin/ai_edge_torch/generative/examples/c++/text_generator_main
INFO: Elapsed time: 276.290s, Critical Path: 109.56s
INFO: 1493 processes: 601 internal, 892 linux-sandbox.
INFO: Build completed successfully, 1493 total actions
INFO: Running command line: bazel-bin/ai_edge_torch/generative/examples/c++/text_generator_main '--tflite_model=/mnt/data/nigel_wang/ai-edge-torch/ai_edge_torch/generative/examples/tiny_llama/ttiny_llama_seq512_kv1024.tflite' '--sentencepiece_model=/mnt/data/nigel_wang/ai-edge-torch/ai_edge_torch/generative/examples/tiny_llama/TinyLlama-1.1B-Chat-v1.0/tokenizer.model' '--prompt=<|user|> \n Write and email:\n <|assistant|>' '--start_token=<s>' '--stop_token=</s>' '--num_threads=1'
ERROR: Didn't find op for builtin opcode 'STABLEHLO_COMPOSITE' version '1'. An older version of this builtin might be supported. Are you using an old TFLite binary with a newer model?

ERROR: Registration failed.

Error at ai_edge_torch/generative/examples/c++/text_generator_main.cc:93

@nigelzzz
Author

nigelzzz commented Aug 7, 2024

In 0.2.0:

  • Command:
CC=/usr/bin/clang-18 bazel run -c opt //ai_edge_torch/generative/examples/c++:text_generator_main -- --tflite_model=/mnt/data/nigel_wang/ai-edge-torch/ai_edge_torch/generative/examples/tiny_llama/ttiny_llama_seq512_kv1024.tflite --sentencepiece_model=/mnt/data/nigel_wang/ai-edge-torch/ai_edge_torch/generative/examples/tiny_llama/TinyLlama-1.1B-Chat-v1.0/tokenizer.model --prompt="<|user|> \n Write and email:\n <|assistant|>" --start_token="<s>" --stop_token="</s>" --num_threads=1
  • Output:
INFO: Running command line: bazel-bin/ai_edge_torch/generative/examples/c++/text_generator_main '--tflite_model=/mnt/data/nigel_wang/ai-edge-torch/ai_edge_torch/generative/examples/tiny_llama/ttiny_llama_seq512_kv1024.tflite' '--sentencepiece_model=/mnt/data/nigel_wang/ai-edge-torch/ai_edge_torch/generative/examples/tiny_llama/TinyLlama-1.1B-Chat-v1.0/tokenizer.model' '--prompt=<|user|> \n Write and email:\n <|assistant|>' '--start_token=<s>' '--stop_token=</s>' '--num_threads=1'
normalizer.cc(52) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Prompt:
<|user|> \n Write and email:\n <|assistant|>
Output text:

[screenshot: the generated output text is all "?" characters]

@nigelzzz
Author

nigelzzz commented Aug 7, 2024

@hheydary,
I think I found something useful:

  • quantize = True: decodes successfully.
  • quantize = False: decoding fails; e.g., in the log above, the output is all "?".
def convert_tiny_llama_to_tflite(
    checkpoint_path: str,
    prefill_seq_len: int = 512,
    kv_cache_max_len: int = 1024,
    quantize: bool = True,
):
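
A minimal sketch to reproduce that comparison, assuming convert_to_tflite.py is run from its own directory so the function is importable (the checkpoint path is illustrative):

from convert_to_tflite import convert_tiny_llama_to_tflite

ckpt = "TinyLlama-1.1B-Chat-v1.0"  # local checkpoint directory
convert_tiny_llama_to_tflite(ckpt, quantize=True)   # observed: decodes fine
convert_tiny_llama_to_tflite(ckpt, quantize=False)  # observed: garbled output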

@nigelzzz
Author

@pkgoogle @hheydary @haozha111,
I think I found something useful. Can you reproduce it on your side?
