Exporting to ONNX with kv_cache #1240
Comments
Any update on this?
Yes, kind of. Just replace the OnnxConversion pass in the script with OptimumConversion.
Curious whether that generates an ONNX file with the kv-cache as part of the model, or whether the kv-cache should be implemented separately. Additionally, can the Llama 2 7B ONNX model be generated via Olive, similar to how you generated the Mistral ONNX file?
The kv-cache is already part of the original model in PyTorch format; the `kv_cache` flag only controls whether it is preserved when converting to ONNX. I have not tested Llama, but I assume it works similarly.
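Following the suggestion above, swapping the conversion pass means changing its `type` in the `passes` section of the Olive config. A minimal sketch of what that fragment could look like (the pass key name `conversion` is illustrative, and all other fields of the original config are omitted here):

```json
{
  "passes": {
    "conversion": {
      "type": "OptimumConversion"
    }
  }
}
```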
Can you please share the steps on how you used Olive config file to generate the ONNX model? |
you call
Disclaimer
This is not a bug report but rather a question.
To Reproduce
The conversion succeeds with `kv_cache` set to `false`, but the resultant model does not have the kv-cache. With `kv_cache` set to `true`, the conversion fails.
Olive config
{
}
Olive logs
[2024-07-17 09:27:38,243] [INFO] [config.py:237:validate_evaluate_input_model] No evaluator is specified, skip to evaluate model
[2024-07-17 09:27:38,244] [INFO] [run.py:138:run_engine] Running workflow default_workflow
[2024-07-17 09:27:38,253] [INFO] [engine.py:986:save_olive_config] Saved Olive config to /mnt/genai/users/ilya_druker/models/cache/default_workflow/olive_config.json
[2024-07-17 09:27:38,258] [INFO] [accelerator_creator.py:224:create_accelerators] Running workflow on accelerator specs: gpu-cuda
[2024-07-17 09:27:38,258] [INFO] [engine.py:109:initialize] Using cache directory: /mnt/genai/users/ilya_druker/models/cache/default_workflow
[2024-07-17 09:27:38,260] [INFO] [engine.py:265:run] Running Olive on accelerator: gpu-cuda
[2024-07-17 09:27:38,260] [INFO] [engine.py:1085:_create_system] Creating target system ...
[2024-07-17 09:27:38,260] [INFO] [engine.py:1088:_create_system] Target system created in 0.000138 seconds
[2024-07-17 09:27:38,261] [INFO] [engine.py:1097:_create_system] Creating host system ...
[2024-07-17 09:27:38,261] [INFO] [engine.py:1100:_create_system] Host system created in 0.000198 seconds
[2024-07-17 09:27:38,275] [INFO] [engine.py:867:_run_pass] Running pass onnx_conversion:OnnxConversion
[2024-07-17 09:27:38,347] [INFO] [hf_config.py:112:load_hf_model] Loading Huggingface model from mistralai/Mistral-7B-v0.1
/home/ilya_druker/.local/lib/python3.8/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/mnt/homedirs/ilya_druker/.local/lib/python3.8/site-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev'. If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
  warn(
/home/ilya_druker/.local/lib/python3.8/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
  warnings.warn(
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 50%|█████ | 1/2 [00:08<00:08, 8.61s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:22<00:00, 11.56s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:22<00:00, 11.12s/it]
[2024-07-17 09:28:08,970] [ERROR] [engine.py:949:_run_pass] Pass run failed.
Traceback (most recent call last):
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/engine/engine.py", line 937, in _run_pass
output_model_config = host.run_pass(p, input_model_config, data_root, output_model_path, pass_search_point)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/systems/local.py", line 32, in run_pass
output_model = the_pass.run(model, data_root, output_model_path, point)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/passes/olive_pass.py", line 224, in run
output_model = self._run_for_config(model, data_root, config, output_model_path)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/passes/onnx/conversion.py", line 132, in _run_for_config
output_model = self._run_for_config_internal(model, data_root, config, output_model_path)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/passes/onnx/conversion.py", line 182, in _run_for_config_internal
return self._convert_model_on_device(model, data_root, config, output_model_path, device, torch_dtype)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/passes/onnx/conversion.py", line 439, in _convert_model_on_device
converted_onnx_model = OnnxConversion._export_pytorch_model(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/passes/onnx/conversion.py", line 285, in _export_pytorch_model
torch.onnx.export(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/onnx/utils.py", line 516, in export
_export(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/onnx/utils.py", line 1612, in _export
graph, params_dict, torch_out = _model_to_graph(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/onnx/utils.py", line 1134, in _model_to_graph
graph, params, torch_out, module = _create_jit_graph(model, args)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/onnx/utils.py", line 1010, in _create_jit_graph
graph, torch_out = _trace_and_get_graph_from_model(model, args)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/onnx/utils.py", line 914, in _trace_and_get_graph_from_model
trace_graph, torch_out, inputs_states = torch.jit._get_trace_graph(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/jit/_trace.py", line 1310, in _get_trace_graph
outs = ONNXTracedModule(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/jit/_trace.py", line 138, in forward
graph, out = torch._C._create_graph_by_tracing(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/jit/_trace.py", line 129, in wrapper
outs.append(self.inner(*trace_inputs))
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1522, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/models/mistral/modeling_mistral.py", line 1139, in forward
outputs = self.model(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1522, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/models/mistral/modeling_mistral.py", line 1024, in forward
layer_outputs = decoder_layer(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1522, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/models/mistral/modeling_mistral.py", line 738, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1522, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/models/mistral/modeling_mistral.py", line 656, in forward
key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/cache_utils.py", line 155, in update
self.key_cache[layer_idx] = torch.cat([self.key_cache[layer_idx], key_states], dim=-2)
RuntimeError: Sizes of tensors must match except in dimension 2. Expected size 32 but got size 8 for tensor number 1 in the list.
[2024-07-17 09:28:08,984] [WARNING] [engine.py:360:run_accelerator] Failed to run Olive on gpu-cuda.
Traceback (most recent call last):
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/engine/engine.py", line 339, in run_accelerator
output_footprint = self.run_no_search(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/engine/engine.py", line 431, in run_no_search
should_prune, signal, model_ids = self._run_passes(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/engine/engine.py", line 829, in _run_passes
model_config, model_id = self._run_pass(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/engine/engine.py", line 937, in _run_pass
output_model_config = host.run_pass(p, input_model_config, data_root, output_model_path, pass_search_point)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/systems/local.py", line 32, in run_pass
output_model = the_pass.run(model, data_root, output_model_path, point)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/passes/olive_pass.py", line 224, in run
output_model = self._run_for_config(model, data_root, config, output_model_path)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/passes/onnx/conversion.py", line 132, in _run_for_config
output_model = self._run_for_config_internal(model, data_root, config, output_model_path)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/passes/onnx/conversion.py", line 182, in _run_for_config_internal
return self._convert_model_on_device(model, data_root, config, output_model_path, device, torch_dtype)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/passes/onnx/conversion.py", line 439, in _convert_model_on_device
converted_onnx_model = OnnxConversion._export_pytorch_model(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/olive/passes/onnx/conversion.py", line 285, in _export_pytorch_model
torch.onnx.export(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/onnx/utils.py", line 516, in export
_export(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/onnx/utils.py", line 1612, in _export
graph, params_dict, torch_out = _model_to_graph(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/onnx/utils.py", line 1134, in _model_to_graph
graph, params, torch_out, module = _create_jit_graph(model, args)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/onnx/utils.py", line 1010, in _create_jit_graph
graph, torch_out = _trace_and_get_graph_from_model(model, args)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/onnx/utils.py", line 914, in _trace_and_get_graph_from_model
trace_graph, torch_out, inputs_states = torch.jit._get_trace_graph(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/jit/_trace.py", line 1310, in _get_trace_graph
outs = ONNXTracedModule(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/jit/_trace.py", line 138, in forward
graph, out = torch._C._create_graph_by_tracing(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/jit/_trace.py", line 129, in wrapper
outs.append(self.inner(*trace_inputs))
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1522, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/models/mistral/modeling_mistral.py", line 1139, in forward
outputs = self.model(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1522, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/models/mistral/modeling_mistral.py", line 1024, in forward
layer_outputs = decoder_layer(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1522, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/models/mistral/modeling_mistral.py", line 738, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1522, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/models/mistral/modeling_mistral.py", line 656, in forward
key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
File "/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/cache_utils.py", line 155, in update
self.key_cache[layer_idx] = torch.cat([self.key_cache[layer_idx], key_states], dim=-2)
RuntimeError: Sizes of tensors must match except in dimension 2. Expected size 32 but got size 8 for tensor number 1 in the list.
[2024-07-17 09:28:08,987] [INFO] [engine.py:282:run] Run history for gpu-cuda:
[2024-07-17 09:28:08,998] [INFO] [engine.py:570:dump_run_history] run history:
+------------+-------------------+-------------+----------------+-----------+
| model_id | parent_model_id | from_pass | duration_sec | metrics |
+============+===================+=============+================+===========+
| 4c8cc2fe | | | | |
+------------+-------------------+-------------+----------------+-----------+
[2024-07-17 09:28:09,000] [INFO] [engine.py:297:run] No packaging config provided, skip packaging artifacts
/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/modeling_attn_mask_utils.py:276: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
elif sliding_window is None or key_value_length < sliding_window:
/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/modeling_attn_mask_utils.py:114: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if (input_shape[-1] > 1 or self.sliding_window is not None) and self.is_causal:
/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/modeling_attn_mask_utils.py:162: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if past_key_values_length > 0:
/home/ilya_druker/.local/lib/python3.8/site-packages/transformers/models/mistral/modeling_mistral.py:119: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if seq_len > self.max_seq_len_cached:
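The "Expected size 32 but got size 8" mismatch in the traceback is consistent with Mistral's grouped-query attention: Mistral-7B-v0.1 has 32 attention heads but only 8 key/value heads, so dummy `past_key_values` built with 32 heads cannot be concatenated with the 8-head key states inside `cache_utils.update`. This diagnosis is an assumption, not something confirmed by the maintainers; the shape check below sketches why the `torch.cat` call would fail (KV-cache tensors have shape `[batch, heads, seq, head_dim]` and are concatenated along the sequence dimension):

```python
# Sketch of the shape rule behind the RuntimeError: torch.cat succeeds only
# when all dimensions except the concatenation dimension match. KV caches
# are [batch, heads, seq, head_dim] and are concatenated along seq (dim -2,
# i.e. "dimension 2" in the error message).

def kv_cache_update_ok(cache_shape, new_shape, concat_dim=-2):
    """Return True if cache_shape and new_shape agree on every dim
    except concat_dim, mirroring torch.cat's precondition."""
    concat_dim %= len(cache_shape)
    return all(c == n
               for i, (c, n) in enumerate(zip(cache_shape, new_shape))
               if i != concat_dim)

batch, seq_len, past_len, head_dim = 1, 8, 0, 128

# Dummy cache built with 32 heads vs. new key states with 8 KV heads:
# heads dim (1) differs, so the concat fails -> the observed RuntimeError.
assert not kv_cache_update_ok((batch, 32, past_len, head_dim),
                              (batch, 8, seq_len, head_dim))

# Cache built with Mistral's num_key_value_heads (8) matches.
assert kv_cache_update_ok((batch, 8, past_len, head_dim),
                          (batch, 8, seq_len, head_dim))
```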
Other information