
Phi3 conversion OOM on A100 #44

Open
a8nova opened this issue Jun 10, 2024 · 12 comments

@a8nova

a8nova commented Jun 10, 2024

Description of the bug:

I wanted to convert Phi-3, so I made the necessary changes in my own fork (main...a8nova:ai-edge-torch:phi3), but the OOM killer is terminating my process.

Full error attached:
phi3_conversion_error.txt

Actual vs expected behavior:

The OOM killer terminates the conversion script on a Colab A100 instance.

Any other information you'd like to share?

  1. Is there anything wrong in the phi3 re-authoring? All changes can be viewed here: main...a8nova:ai-edge-torch:phi3
  2. Is there anything I can do to get it to convert (e.g., changing parameters to make it more memory-efficient)? See the sketch after this list.
  3. Any debugging tips?
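
A hedged illustration for question 2: the generative examples in ai-edge-torch follow roughly the pattern below, and shrinking the prefill sequence length and KV-cache size is one knob that may lower peak memory during conversion. The build_model helper, the phi3 module path, the sequence lengths, and the output path are assumptions about the fork, not confirmed details.

import torch
import ai_edge_torch

# Hypothetical builder from the phi3 branch; name and signature are assumptions.
from ai_edge_torch.generative.examples.phi3 import phi3


def convert_phi3_to_tflite(checkpoint_path: str):
  # Smaller prefill / KV-cache sizes may reduce peak memory during conversion.
  kv_cache_max_len = 1024
  prefill_seq_len = 512

  pytorch_model = phi3.build_model(
      checkpoint_path, kv_cache_max_len=kv_cache_max_len
  ).eval()  # eval() also addresses the "converted in training mode" warning.

  prefill_tokens = torch.full((1, prefill_seq_len), 0, dtype=torch.long)
  prefill_input_pos = torch.arange(0, prefill_seq_len)
  decode_token = torch.tensor([[0]], dtype=torch.long)
  decode_input_pos = torch.tensor([0], dtype=torch.int64)

  edge_model = (
      ai_edge_torch.signature(
          'prefill', pytorch_model, (prefill_tokens, prefill_input_pos)
      )
      .signature('decode', pytorch_model, (decode_token, decode_input_pos))
      .convert()
  )
  edge_model.export(
      f'/tmp/phi3_seq{prefill_seq_len}_kv{kv_cache_max_len}.tflite'
  )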
@a8nova a8nova changed the title Phi3 conversion fails on A100 Phi3 conversion OOM on A100 Jun 10, 2024
@haozha111
Contributor

Hi @a8nova, thanks for reporting the issue!

There is a known issue with high memory usage during the conversion process, which can get the conversion script killed. Which phi-3 version are you converting, and what is the size of the checkpoint you are using? A free Colab instance may only have 12 GB of RAM, which isn't enough. Do you happen to have:

  1. A Colab Pro subscription, or
  2. A Linux workstation (or cloud machine) with over 50 GB of memory?

We are still actively working on fixing the memory issue, and sorry for the inconvenience!

@haozha111 haozha111 self-assigned this Jun 10, 2024
@haozha111
Contributor

Also, from the conversion log it seems the memory consumption is coming from CUDA. Are you able to try CUDA_VISIBLE_DEVICES=-1 to disable GPU memory allocation? The conversion only needs to consume CPU memory.
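
A minimal sketch of doing this from a notebook cell, assuming the variable is set before any CUDA-aware library (torch, tensorflow, jax) is imported:

import os

# Hide all GPUs so the frameworks fall back to CPU-only allocation.
# This must run before torch / tensorflow / jax are imported.
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

import torch  # imported after the variable is set

print(torch.cuda.is_available())  # expected: False

From a shell, the equivalent is prefixing the command, e.g. CUDA_VISIBLE_DEVICES=-1 python convert_to_tflite.py (script name illustrative).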

@a8nova
Author

a8nova commented Jun 10, 2024

Hi @haozha111 - Thank you for the quick response.

Let me try setting CUDA_VISIBLE_DEVICES.

@a8nova
Author

a8nova commented Jun 10, 2024

I am also getting an OOM when running with CUDA_VISIBLE_DEVICES=-1 on a box with 53 GB of system RAM:

env: CUDA_VISIBLE_DEVICES=-1
/content/ai-edge-torch/ai_edge_torch/generative/examples/phi3
2024-06-10 20:14:26.133314: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-06-10 20:14:26.577974: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-06-10 20:14:28.834951: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
WARNING:root:Defaulting to PJRT_DEVICE=CPU
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1718050475.549140   17434 cpu_client.cc:424] TfrtCpuClient created.
WARNING:root:Your model "prefill" is converted in training mode. Please set the module in evaluation mode with `module.eval()` for better on-device performance and compatibility.
WARNING:root:Your model "decode" is converted in training mode. Please set the module in evaluation mode with `module.eval()` for better on-device performance and compatibility.
2024-06-10 20:18:36.252751: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:266] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2024-06-10 20:18:36.252876: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:135] retrieving CUDA diagnostic information for host: 055cf236c060
2024-06-10 20:18:36.252891: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:142] hostname: 055cf236c060
2024-06-10 20:18:36.253133: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:166] libcuda reported version is: 535.104.5
2024-06-10 20:18:36.253165: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:170] kernel reported version is: 535.104.5
2024-06-10 20:18:36.253176: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:249] kernel version seems to match DSO: 535.104.5
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
W0000 00:00:1718051375.007065   17434 tf_tfl_flatbuffer_helpers.cc:392] Ignored output_format.
W0000 00:00:1718051375.010046   17434 tf_tfl_flatbuffer_helpers.cc:395] Ignored drop_control_dependency.
2024-06-10 20:29:35.016643: I tensorflow/cc/saved_model/reader.cc:83] Reading SavedModel from: /tmp/tmpil1idwz1
2024-06-10 20:29:35.028233: I tensorflow/cc/saved_model/reader.cc:52] Reading meta graph with tags { serve }
2024-06-10 20:29:35.028277: I tensorflow/cc/saved_model/reader.cc:147] Reading SavedModel debug info (if present) from: /tmp/tmpil1idwz1
2024-06-10 20:29:35.126021: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
2024-06-10 20:29:35.139828: I tensorflow/cc/saved_model/loader.cc:236] Restoring SavedModel bundle.
^C
Colab resources at the time the process was killed (Colab Pro, Python 3 Google Compute Engine backend, GPU runtime):

System RAM: 1.5 / 53.0 GB
GPU RAM: 0.0 / 22.5 GB
Disk: 68.1 / 201.2 GB
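
As a debugging aid for question 3 above, one hedged option is to watch the conversion process's resident memory while it runs, to see at which stage usage peaks before the kill. The sketch below assumes psutil is installed; it is not part of ai-edge-torch.

import os
import threading
import time

import psutil


def log_rss_every(seconds: float = 5.0) -> threading.Thread:
  """Print this process's resident memory periodically from a daemon thread."""
  proc = psutil.Process(os.getpid())

  def _loop():
    while True:
      rss_gib = proc.memory_info().rss / 1024**3
      print(f'[mem-watch] RSS = {rss_gib:.1f} GiB', flush=True)
      time.sleep(seconds)

  watcher = threading.Thread(target=_loop, daemon=True)
  watcher.start()
  return watcher


# Start the watcher at the top of the conversion script; the last value printed
# before the OOM kill shows roughly where memory peaked.
log_rss_every(5.0)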

@haozha111
Contributor

Got it. Do you mind updating your branch with phi-3? Then we can fork it and try converting it ourselves. Thanks!

@a8nova
Author

a8nova commented Jun 11, 2024

The changes in the phi3 branch are up to date; you should be able to check out the branch and run the conversion script. Note that I also had to make changes to loader.py and feed_forward.py. Please let me know if you run into any issues. Thank you!

@haozha111 haozha111 assigned vamsimanchala and unassigned haozha111 Jun 11, 2024
@a8nova
Author

a8nova commented Jun 18, 2024

Hi @haozha111 @vamsimanchala - Any updates on this? Thanks!

@haozha111
Contributor

Hi @a8nova, we are making good progress on this issue; it requires some fixes in our converter stack. We plan to post an update here in the coming weeks. Thanks for your patience!

@mitsunami

Hi, I am also encountering the same issue. Although I cannot share the model details, it appears to be getting killed at the same point as seen in the logs above. I am looking forward to a fix for this issue. Thanks!

@haozha111
Contributor

Hi @mitsunami,

Are you converting on a Colab Pro instance or on a local Linux workstation, and how much memory do you have?

We are making great progress on reducing the converter's memory usage, and we will post an update on this issue soon. Thanks for your patience!

@mitsunami

Hi @haozha111,
I'm trying that on a local desktop with 64 GB RAM. Looking forward to an update. Thanks!

@vamsimanchala
Contributor

Hi @mitsunami, we recently landed some changes. Can you please try the conversion to TFLite again and let us know if things look good?

Thank you for your patience,
Vamsi Manchala
