Memory leak. #69

Open
chao-camect opened this issue Apr 8, 2024 · 29 comments
Labels: aitce, bug (Something isn't working)

@chao-camect

I have been training with TensorFlow + Keras on an NVIDIA GPU for a while.
Recently I experimented with an Arc A770. With some effort, I finally got it working, except that there is a memory leak.
The same code works fine on an NVIDIA 3090: it uses about 8 GB of memory, very stably.
With the A770, it starts at 8 GB and grows very quickly until the process is killed because of OOM.
I used tracemalloc to see where the leak is, with no luck, so it's not in the Python code.
I haven't had time to dig into more details yet...
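
For reference, the tracemalloc check was roughly the sketch below (not my exact code; it just compares Python allocation snapshots taken a few training steps apart to rule out a Python-level leak):

import tracemalloc

tracemalloc.start()
baseline = tracemalloc.take_snapshot()

# ... run a few training epochs here ...

snapshot = tracemalloc.take_snapshot()
# Show the biggest Python-level allocation growth since the baseline.
for stat in snapshot.compare_to(baseline, 'lineno')[:10]:
    print(stat)

Nothing significant showed up there, which is why I think the leak is below the Python layer.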

@chao-camect
Author

More details: I am using TensorFlow 2.15.1 on Ubuntu 22.04. I can reproduce it on both kernel 6.5 and 6.7.

@huiyan2021
Contributor

Hi @chao-camect, thanks for reporting this issue. Could you share a small reproducer so that we can investigate?

@chao-camect
Author

chao-camect commented Apr 9, 2024

I think you should be able to reproduce it by training any model... It must be leaking in some common operation.
Anyway, here is a simple program to reproduce it:

import os

from absl import app
from absl import flags
import tensorflow as tf

from tensorflow.keras import applications
from tensorflow.keras import callbacks
from tensorflow.keras import layers
from tensorflow.keras import optimizers
from tensorflow.keras.preprocessing.image import ImageDataGenerator

FLAGS = flags.FLAGS
flags.DEFINE_string('data_dir', '', '')
flags.DEFINE_string('output_dir', 'output/dogs_vs_cats_mobilenet_v2', '')
flags.DEFINE_integer('batch_size', 20, '')
flags.DEFINE_integer('width', 300, '')
flags.DEFINE_integer('height', 300, '')
flags.DEFINE_integer('epochs', 10000, '')
flags.DEFINE_integer('max_chpt_to_keep', 10, '')
flags.DEFINE_boolean('use_data_augmentation', True, '')
flags.DEFINE_boolean('debug', False, '')
flags.DEFINE_string('test_image_dir', '', '')

def count_jpgs(dirStr: str) -> int:
    # Recursively count .jpg/.jpeg files under dirStr.
    n = 0
    for name in os.listdir(dirStr):
        full_name = os.path.join(dirStr, name)
        if os.path.isfile(full_name):
            ext = os.path.splitext(name)[1]
            if ext == '.jpg' or ext == '.jpeg':
                n += 1
        elif os.path.isdir(full_name):
            n += count_jpgs(full_name)
    return n

def main(argv):
    i = layers.Input([FLAGS.height, FLAGS.width, 3], dtype=tf.uint8)
    x = tf.cast(i, tf.float32)
    x = applications.mobilenet_v2.preprocess_input(x)
    base = applications.MobileNetV2(classes=2, include_top=False)
    base.trainable = False
    x = base(x)
    x = layers.GlobalAveragePooling2D()(x)
    o = layers.Dense(2, activation='softmax', name='output')(x)
    model = tf.keras.Model(inputs=[i], outputs=[o], name='dogs_vs_cats_mobilenet_v2')
    model.compile(
        loss='categorical_crossentropy',
        optimizer=optimizers.RMSprop(learning_rate=2e-5, rho=0.618),
        metrics=['acc'])
    if FLAGS.debug:
        model.summary()

    if FLAGS.use_data_augmentation:
        train_datagen = ImageDataGenerator(
            rotation_range=40,
            width_shift_range=0.2,
            height_shift_range=0.2,
            shear_range=0.2,
            zoom_range=0.2,
            horizontal_flip=True)
    else:
        train_datagen = ImageDataGenerator()
    train_generator = train_datagen.flow_from_directory(
        os.path.join(FLAGS.data_dir, 'train'),
        target_size=(FLAGS.height, FLAGS.width),
        batch_size=FLAGS.batch_size,
        class_mode='categorical')

    num_train_jpgs = count_jpgs(os.path.join(FLAGS.data_dir, 'train'))
    print('Total %d training images.' % num_train_jpgs)

    tb_callback = callbacks.TensorBoard(log_dir=os.path.join(FLAGS.output_dir, 'tb_logs'))
    model.fit(
        x=train_generator,
        steps_per_epoch=num_train_jpgs // FLAGS.batch_size,  # must be an integer
        epochs=FLAGS.epochs,
        callbacks=[tb_callback])
    model.save(os.path.join(FLAGS.output_dir, 'saved_model'))

if __name__ == '__main__':
    app.run(main)

To run it, download the data from Kaggle (https://www.kaggle.com/c/dogs-vs-cats/data).

unzip train.zip
cd train/
mkdir dogs cats
mv dog.* dogs/
mv cat.* cats/
python3 train.py --data_dir=xxxx
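
To watch the host-side memory growth while this trains, a small Keras callback along the lines below can log the process RSS after every epoch (a rough sketch, not part of the repro; it assumes psutil is installed):

import psutil
from tensorflow.keras import callbacks

class RssLogger(callbacks.Callback):
    # Prints the host (CPU) resident set size after each epoch.
    def on_epoch_end(self, epoch, logs=None):
        rss_mb = psutil.Process().memory_info().rss / (1024 * 1024)
        print('epoch %d: host RSS = %.0f MiB' % (epoch, rss_mb))

Passing callbacks=[tb_callback, RssLogger()] to model.fit makes the growth easy to see epoch by epoch.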

@chao-camect
Author

More context: I run it inside Docker. I installed the dependencies inside Docker using the following script:

wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | sudo gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg    
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy/lts/2350 unified" | sudo tee /etc/apt/sources.list.d/intel-gpu-jammy.list
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | sudo gpg --dearmor --output /usr/share/keyrings/oneapi-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt update
sudo apt install -y intel-opencl-icd intel-level-zero-gpu level-zero \
    intel-media-va-driver-non-free libmfx1 libmfxgen1 libvpl2 libegl-mesa0 libegl1-mesa \
    libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri libglapi-mesa libgles2-mesa-dev \
    libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers mesa-vdpau-drivers mesa-vulkan-drivers \
    va-driver-all vainfo hwinfo clinfo libigc-dev intel-igc-cm libigdfcl-dev libigfxcmrt-dev \
    level-zero-dev intel-oneapi-runtime-dpcpp-cpp intel-oneapi-runtime-mkl xpu-smi

@huiyan2021
Contributor

Could you also run this environment check script and upload the result here? Thanks!

https://github.com/intel/intel-extension-for-tensorflow/blob/main/tools/python/env_check.py

@chao-camect
Author

The tool doesn't support TensorFlow 2.15:

Check Environment for Intel(R) Extension for TensorFlow*...

100% [................................................................................] 6116 / 6116
Check Python
Traceback (most recent call last):
  File "/home/chao/projects/intel-extension-for-tensorflow/tools/python/env_check.py", line 138, in <module>
    itex_version = check_python()
  File "/home/chao/projects/intel-extension-for-tensorflow/tools/python/env_check.py", line 39, in check_python
    elif python_minor_version < config['python_version']['min_python_version'][itex_version]:
KeyError: '2.15.0.0'

@Disty0

Disty0 commented Apr 9, 2024

I suspect this is an issue with a common Intel library, because the same issue happens with IPEX as well:

intel/intel-extension-for-pytorch#476

@chao-camect
Author

@Disty0 Thanks for linking to the other issue. I agree with you. For me, it leaks 3-4 MB every second, so it must be some common operation. I don't get how it could have evaded Intel's own engineers...
I tried LD_PRELOAD=libtcmalloc_minimal.so.4 from intel/intel-extension-for-pytorch#476, but it didn't help.

@huiyan2021
Contributor

huiyan2021 commented Apr 10, 2024

Hi @chao-camect, I am running the training script on an Arc A770 with the Docker image that we published: https://intel.github.io/intel-extension-for-tensorflow/latest/docs/install/install_for_xpu.html#get-docker-container-from-dockerhub

Total 25000 training images.
Epoch 1/10000
2024-04-10 04:00:56.653123: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type XPU is enabled.
1250/1250 [==============================] - 158s 125ms/step - loss: 0.3988 - acc: 0.8320
Epoch 2/10000
1250/1250 [==============================] - 155s 124ms/step - loss: 0.1587 - acc: 0.9555
Epoch 3/10000
1250/1250 [==============================] - 154s 123ms/step - loss: 0.1136 - acc: 0.9612
Epoch 4/10000
1250/1250 [==============================] - 154s 123ms/step - loss: 0.0992 - acc: 0.9624
Epoch 5/10000
1250/1250 [==============================] - 154s 123ms/step - loss: 0.0925 - acc: 0.96426
Epoch 6/10000
1250/1250 [==============================] - 154s 123ms/step - loss: 0.0883 - acc: 0.9677
Epoch 7/10000
1250/1250 [==============================] - 154s 123ms/step - loss: 0.0856 - acc: 0.9670
Epoch 8/10000
1250/1250 [==============================] - 154s 123ms/step - loss: 0.0821 - acc: 0.9693
Epoch 9/10000
1250/1250 [==============================] - 154s 123ms/step - loss: 0.0794 - acc: 0.9702
Epoch 10/10000
1250/1250 [==============================] - 156s 125ms/step - loss: 0.0765 - acc: 0.9698
Epoch 11/10000
1250/1250 [==============================] - 154s 123ms/step - loss: 0.0793 - acc: 0.9694
Epoch 12/10000
1250/1250 [==============================] - 154s 123ms/step - loss: 0.0746 - acc: 0.9712
Epoch 13/10000
1250/1250 [==============================] - 154s 123ms/step - loss: 0.0758 - acc: 0.9712
Epoch 14/10000
1250/1250 [==============================] - 154s 123ms/step - loss: 0.0756 - acc: 0.9711
Epoch 15/10000
1250/1250 [==============================] - 154s 124ms/step - loss: 0.0703 - acc: 0.9729
Epoch 16/10000
1250/1250 [==============================] - 155s 124ms/step - loss: 0.0739 - acc: 0.9710
Epoch 17/10000
1250/1250 [==============================] - 155s 124ms/step - loss: 0.0725 - acc: 0.9708
Epoch 18/10000
535/1250 [===========>..................] - ETA: 1:28 - loss: 0.0699 - acc: 0.9733

GPU memory used (MiB) stays stable at 4222.
[screenshot: GPU memory usage]

You should be able to run https://github.com/intel/intel-extension-for-tensorflow/blob/main/tools/python/env_check.py now; you also need to download https://github.com/intel/intel-extension-for-tensorflow/blob/main/tools/python/config.json alongside it. Could you try again and upload the result here?

@chao-camect
Author

It's CPU memory, not GPU.

@huiyan2021
Contributor

More context: I run it inside Docker. I installed the dependencies inside Docker using the following script: […]

Did you install by following the steps here: https://intel.github.io/intel-extension-for-tensorflow/latest/docs/install/experimental/install_for_arc_gpu.html

@huiyan2021
Contributor

It's CPU memory, not GPU.

How many iterations had you trained when you hit the OOM?

@huiyan2021
Contributor

@chao-camect, we observed memory usage increasing on the host during training. The developer team is looking into it; we will post updates here. Thanks!

@chao-camect
Author

Thanks for the prompt response!

@huiyan2021
Contributor

huiyan2021 commented Apr 11, 2024

Hi @chao-camect

  1. Upgrade the driver to the latest version, both on the host and inside Docker, as follows:

wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | \
  sudo gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg
echo "deb [arch=amd64,i386 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy client" | \
  sudo tee /etc/apt/sources.list.d/intel-gpu-jammy.list
sudo apt update
sudo apt upgrade

  2. Check the driver version:
    [screenshot of installed driver versions]

  3. We also submitted a PR fixing the issue. Build the main branch of Intel Extension for TensorFlow from source, or wait for our weekly build: https://intel.github.io/intel-extension-for-tensorflow/latest/get_started.html#install-for-xpu-weekly

This is the memory usage trend I measured on an Arc A770 using the latest build:
[chart: host memory usage on Arc A770]

This is the one on an A100:
[chart: host memory usage on A100]

Let us know whether you can reproduce the result or not, thanks!

@chao-camect
Author

Thanks. When will the weekly build be ready? I see that the latest version is 20240329.

@huiyan2021
Contributor

It should be out this week; we will let you know when it's ready.

@xiguiw
Contributor

xiguiw commented Apr 17, 2024

intel_extension_for_tensorflow_lib_weekly 2.15.0.0.2.dev20240415
intel_extension_for_tensorflow_weekly 2.15.0.0.dev20240415
tensorflow 2.15.1
tensorflow-estimator 2.15.0
tensorflow-io-gcs-filesystem 0.36.0

@xiguiw
Contributor

xiguiw commented Apr 17, 2024

@chao-camect
The weekly build is ready. Please give it a try, thanks!

Please uninstall your previous intel-extension-for-tensorflow package first; the package names are different.

The install command is:
pip install --upgrade intel-extension-for-tensorflow-weekly[xpu] -f https://developer.intel.com/itex-whl-weekly

A couple of things to note:

  1. Upgrade your driver as @huiyan2021 mentioned.
    Both the host (KMD driver) and Docker (UMD driver) need to be upgraded if you use Docker.
ii  intel-level-zero-gpu                       1.3.28202.39-821~22.04                  amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii  level-zero                                 1.15.8-820~22.04                        amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii  level-zero-dev                             1.15.8-820~22.04                        amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
  2. The oneAPI version is 2024.1.

FYI, here is the memory usage with the weekly build, running the training script you provided for 5 epochs:

[chart: host memory usage with the 20240415 weekly build]

Good luck!

xiguiw added the bug (Something isn't working) and aitce labels on Apr 17, 2024
@chao-camect
Author

No, it's still leaking, just more slowly, as you can see from your own graph...
I have a bigger program that leaks more obviously. If necessary, I can see whether I can separate that part out for your testing, though this really isn't my job.
I suggest you spend some time on testing; you need to compare training on Arc with NVIDIA.

@xiguiw
Contributor

xiguiw commented Apr 17, 2024

@chao-camect
Thanks for trying the weekly build.

Yes, we compared training on NVIDIA; this is the result on an A100, running the script you provided us:
[chart: host memory usage on A100]

From the pattern, memory usage increases between epochs but stays stable within an epoch. The A100 behaves similarly.

We suspect the memory leak is related to a specific workload (or operator/kernel).
Could you help check with tcmalloc by:

  1. disabling TensorBoard / disabling the callback,
  2. putting the training command in run.sh,
  3. running the following commands and reporting the result:

LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 HEAPCHECK=normal ./run.sh

pprof ./run.sh "xxxx.heap" --stack --lines --text

We checked your example for 1 epoch and found a 72-byte leak from ITEX; this weekly build fixes it.

Note that the oneDNN primitive cache and the kernel cache from queue.submit consume memory. That behavior looks like a memory leak, but it is NOT. Ignore such 'leaks' on Python objects and TensorFlow...

@xiguiw
Contributor

xiguiw commented Apr 17, 2024

@chao-camect

It would be nice if you could separate that part out from your 'bigger program' for us to test.

BTW, did you update the driver? Which driver version are you using now?

@chao-camect
Author

It'll take some time before I can get a minimized version for you to test with. Do you test the extension with a set of common models regularly? I don't think there is anything special about my model.
Anyway, the weekly build does help a bit: it was leaking about 4 MB per second, and it's about 1 MB per second now.
Level Zero versions:

ii  intel-level-zero-gpu                            1.3.28202.51-821~22.04                  amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii  level-zero                                      1.16.14-821~22.04                       amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii  level-zero-dev                                  1.16.14-821~22.04                       amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.

@xiguiw
Contributor

xiguiw commented Apr 19, 2024

The driver version is OK.

Of course, we check a set of models regularly. So far we have not found a memory leak for this case; we will add more models to check for this issue.

Meanwhile, if you can get a minimized version for us, that would be ideal. Thanks!

@huiyan2021
Contributor

Hi @chao-camect, could you try the environment variable below on your side and let us know if the memory leak still exists? Thanks!

export ITEX_LAYOUT_OPT=0
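
If it is more convenient, the same setting should also work from inside the training script (a sketch; note the variable has to be set before TensorFlow and the ITEX plugin are imported, otherwise it will not take effect):

import os

# Must be set before TensorFlow / the ITEX plugin is loaded.
os.environ['ITEX_LAYOUT_OPT'] = '0'

import tensorflow as tf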

@chao-camect
Author

Looks like it did the trick.
I'll let it run for longer and report back.

@chao-camect
Author

chao-camect commented Apr 28, 2024

I believe the memory leak is gone with ITEX_LAYOUT_OPT=0.
I assume setting ITEX_LAYOUT_OPT=0 has a performance impact?

@huiyan2021
Contributor

I believe the memory leak is gone with ITEX_LAYOUT_OPT=0.
Thanks for the confirmation!

I assume setting ITEX_LAYOUT_OPT=0 has a performance impact?
It depends on the model; you can measure the performance for your model.

Also, our fix is WIP... we will let you know as soon as it works...

@huiyan2021
Contributor

@chao-camect, please try our latest weekly build to see if it works for your case, thanks!

pip install --upgrade intel-extension-for-tensorflow-weekly[xpu] -f https://developer.intel.com/itex-whl-weekly
