Low accuracy on Arc A380 #5

Open
ip2016 opened this issue Oct 30, 2022 · 11 comments
Labels: bug (Something isn't working)

ip2016 commented Oct 30, 2022

It seems like XPU calculation accuracy deteriorates in the 4th-5th digit after the decimal point on common math operations.
Here is the sample code:

import tensorflow as tf

a = tf.random.normal(shape=[10000, 10000], dtype=tf.float32)
b = tf.random.normal(shape=[10000, 10000], dtype=tf.float32)

@tf.function
def run(a, b):
    x1 = tf.nn.relu(a)
    y1 = tf.nn.relu(b)  
    x2 = tf.math.square(x1)
    y2 = tf.math.square(y1)
    x3 = tf.math.scalar_mul(33e-5, x2)
    y3 = tf.math.scalar_mul(33e-5, y2)
    return tf.tensordot(x3, y3, 2)

with tf.device("/XPU:0"):
    print(f"XPU Result: {run(a, b)}")

with tf.device("/CPU:0"):
    print(f"CPU Result: {run(a, b)}")

Which yields the following results:

XPU Result: 2.721888542175293
CPU Result: 2.5815889835357666
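
A minimal sketch to quantify the gap, using the two numbers printed above (plain Python, nothing ITEX-specific):

xpu = 2.721888542175293
cpu = 2.5815889835357666

abs_diff = abs(xpu - cpu)
rel_diff = abs_diff / abs(cpu)
print(f"absolute difference: {abs_diff:.4f}")  # ~0.1403
print(f"relative difference: {rel_diff:.2%}")  # ~5.43%, far beyond float32 rounding noise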

System:
ASRock Arc A380
Ubuntu 22.04 (kernel 5.17.0-1019-oem)

@yiqianglee

@ip2016 Thanks for reporting this issue. We will take a look. We can't reproduce it on other Intel GPUs in a first try; we will also try on an A380.

Tengfei09 (Contributor) commented Nov 1, 2022

@ip2016 May I ask what kind of CPU you're using? We have tested your example on our HW platforms, and your CPU results look a little odd.

ip2016 (Author) commented Nov 1, 2022

> @ip2016 May I ask what kind of CPU you're using? We have tested your example on our HW platforms, and your CPU results look a little odd.

Hello @Tengfei09

Thanks for your response.
Here is my system info:

>oneapi-cli version
v0.2.0-4-g9fef7bf786
>glxinfo -B
name of display: :0
hwconfig key 77 (UNKNOWN_INTEL_HWCONFIG) unhandled!
hwconfig key 78 (UNKNOWN_INTEL_HWCONFIG) unhandled!
hwconfig key 79 (UNKNOWN_INTEL_HWCONFIG) unhandled!
hwconfig key 80 (UNKNOWN_INTEL_HWCONFIG) unhandled!
display: :0  screen: 0
direct rendering: Yes
Extended renderer info (GLX_MESA_query_renderer):
    Vendor: Intel (0x8086)
    Device: Mesa Intel(R) Graphics (DG2) (0x56a5)
    Version: 22.2.0
    Accelerated: yes
    Video memory: 6088MB
    Unified memory: yes
    Preferred profile: core (0x1)
    Max core profile version: 4.6
    Max compat profile version: 4.6
    Max GLES1 profile version: 1.1
    Max GLES[23] profile version: 3.2
OpenGL vendor string: Intel
OpenGL renderer string: Mesa Intel(R) Graphics (DG2)
OpenGL core profile version string: 4.6 (Core Profile) Mesa 22.2.0-devel (git-44289c46d9)
OpenGL core profile shading language version string: 4.60
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile

OpenGL version string: 4.6 (Compatibility Profile) Mesa 22.2.0-devel (git-44289c46d9)
OpenGL shading language version string: 4.60
OpenGL context flags: (none)
OpenGL profile mask: compatibility profile

OpenGL ES profile version string: OpenGL ES 3.2 Mesa 22.2.0-devel (git-44289c46d9)
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20

>lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  12
  On-line CPU(s) list:   0-11
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Core(TM) i5-10400 CPU @ 2.90GHz
    CPU family:          6
    Model:               165
    Thread(s) per core:  2
    Core(s) per socket:  6
    Socket(s):           1
    Stepping:            5
    CPU max MHz:         4300.0000
    CPU min MHz:         800.0000
    BogoMIPS:            5799.77
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp pku ospke md_clear flush_l1d arch_capabilities
Virtualization features: 
  Virtualization:        VT-x
Caches (sum of all):     
  L1d:                   192 KiB (6 instances)
  L1i:                   192 KiB (6 instances)
  L2:                    1.5 MiB (6 instances)
  L3:                    12 MiB (1 instance)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-11
Vulnerabilities:         
  Itlb multihit:         KVM: Mitigation: VMX disabled
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT vulnerable
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
  Srbds:                 Mitigation; Microcode
  Tsx async abort:       Not affected
>lspci -v |grep -A8 VGA
03:00.0 VGA compatible controller: Intel Corporation Device 56a5 (rev 05) (prog-if 00 [VGA controller])
	Subsystem: ASRock Incorporation Device 6004
	Flags: bus master, fast devsel, latency 0, IRQ 144, IOMMU group 1
	Memory at a1000000 (64-bit, non-prefetchable) [size=16M]
	Memory at 4000000000 (64-bit, prefetchable) [size=8G]
	Expansion ROM at a2000000 [disabled] [size=2M]
	Capabilities: <access denied>
	Kernel driver in use: i915
	Kernel modules: i915

ip2016 (Author) commented Nov 1, 2022

I did some additional testing and am still sometimes getting inconsistent results. Here is one that is reproducible:

import tensorflow as tf

tf.config.list_physical_devices()
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
 PhysicalDevice(name='/physical_device:XPU:0', device_type='XPU')]
tf.random.set_seed(11)
c = tf.random.normal([2, 2], 0, 1, tf.float32) 
d = tf.random.normal([2, 2], 0, 1, tf.float32)
print(f"{c}, {d}")
[[-1.5229468   0.66954553]
 [-0.64246905  1.4300431 ]], [[ 0.35981855  1.018044  ]
 [-2.029798   -0.7807023 ]]
with tf.device("/XPU:0"):
    print(f"XPU Result: {tf.tensordot(c,d,2)}")

with tf.device("/CPU:0"):
    print(f"CPU Result: {tf.tensordot(c,d,2)}")
XPU Result: 0.32128676772117615
CPU Result: 0.321286678314209

Running the same code with tensorflow-cpu, I'm getting:

CPU Result: 0.32128584384918213

Virtual environment version:

>python --version
Python 3.10.6
>pip freeze
absl-py==1.3.0
anyio==3.6.2
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
asttokens==2.1.0
astunparse==1.6.3
attrs==22.1.0
backcall==0.2.0
beautifulsoup4==4.11.1
bleach==5.0.1
cachetools==5.2.0
certifi==2022.9.24
cffi==1.15.1
charset-normalizer==2.1.1
colorama==0.4.6
contourpy==1.0.5
cycler==0.11.0
Cython==0.29.32
debugpy==1.6.3
decorator==5.1.1
defusedxml==0.7.1
dill==0.3.6
dm-tree==0.1.7
entrypoints==0.4
etils==0.9.0
executing==1.2.0
fastjsonschema==2.16.2
filelock==3.8.0
flatbuffers==22.10.26
fonttools==4.38.0
gast==0.4.0
gin-config==0.5.0
google-api-core==2.10.2
google-api-python-client==2.65.0
google-auth==2.13.0
google-auth-httplib2==0.1.0
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
googleapis-common-protos==1.56.4
grpcio==1.50.0
h5py==3.7.0
httplib2==0.21.0
huggingface-hub==0.10.1
idna==3.4
importlib-resources==5.10.0
intel-extension-for-tensorflow==1.0.0
intel-extension-for-tensorflow-lib==1.0.0.1
ipykernel==6.17.0
ipython==8.6.0
ipython-genutils==0.2.0
ipywidgets==8.0.2
jedi==0.18.1
Jinja2==3.1.2
joblib==1.2.0
jsonschema==4.16.0
jupyter==1.0.0
jupyter-console==6.4.4
jupyter-server==1.21.0
jupyter_client==7.4.4
jupyter_core==4.11.2
jupyterlab-pygments==0.2.2
jupyterlab-widgets==3.0.3
kaggle==1.5.12
keras==2.10.0
Keras-Preprocessing==1.1.2
kiwisolver==1.4.4
libclang==14.0.6
lxml==4.9.1
Markdown==3.4.1
MarkupSafe==2.1.1
matplotlib==3.6.1
matplotlib-inline==0.1.6
mistune==2.0.4
nbclassic==0.4.5
nbclient==0.7.0
nbconvert==7.2.3
nbformat==5.7.0
nest-asyncio==1.5.6
notebook==6.5.1
notebook_shim==0.2.0
numpy==1.23.4
oauth2client==4.1.3
oauthlib==3.2.2
opencv-python-headless==4.6.0.66
opt-einsum==3.3.0
packaging==21.3
panda==0.3.1
pandas==1.5.1
pandocfilters==1.5.0
parso==0.8.3
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.3.0
portalocker==2.6.0
prometheus-client==0.15.0
promise==2.3
prompt-toolkit==3.0.31
protobuf==3.19.6
psutil==5.9.3
ptyprocess==0.7.0
pure-eval==0.2.2
py-cpuinfo==9.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycocotools==2.0.5
pycparser==2.21
Pygments==2.13.0
pyparsing==3.0.9
pyrsistent==0.18.1
python-dateutil==2.8.2
python-slugify==6.1.2
pytz==2022.5
PyYAML==6.0
pyzmq==24.0.1
qtconsole==5.3.2
QtPy==2.2.1
regex==2022.9.13
requests==2.28.1
requests-oauthlib==1.3.1
rsa==4.9
sacrebleu==2.3.1
scikit-learn==1.1.3
scipy==1.9.3
Send2Trash==1.8.0
sentencepiece==0.1.97
seqeval==1.2.2
six==1.16.0
sniffio==1.3.0
soupsieve==2.3.2.post1
stack-data==0.6.0
tabulate==0.9.0
tensorboard==2.10.1
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow==2.10.0
tensorflow-addons==0.18.0
tensorflow-datasets==4.7.0
tensorflow-estimator==2.10.0
tensorflow-hub==0.12.0
tensorflow-io-gcs-filesystem==0.27.0
tensorflow-metadata==1.10.0
tensorflow-model-optimization==0.7.3
tensorflow-text==2.10.0
termcolor==2.0.1
terminado==0.17.0
text-unidecode==1.3
tf-models-official==2.7.0
tf-slim==1.1.0
threadpoolctl==3.1.0
tinycss2==1.2.1
tokenizers==0.13.1
toml==0.10.2
tornado==6.2
tqdm==4.64.1
traitlets==5.5.0
transformers==4.23.1
typeguard==2.13.3
typing_extensions==4.4.0
uritemplate==4.1.1
urllib3==1.26.12
wcwidth==0.2.5
webencodings==0.5.1
websocket-client==1.4.1
Werkzeug==2.2.2
widgetsnbextension==4.0.3
wrapt==1.14.1
zipp==3.10.0

ip2016 (Author) commented Nov 1, 2022

I changed the code a bit, and now the CPU result from the Intel TensorFlow plugin matches the result from tensorflow-cpu, but not the GPU result.

with tf.device("/CPU:0"):
    tf.random.set_seed(11)
    c = tf.random.normal([2, 2], 0, 1, tf.float32) 
    d = tf.random.normal([2, 2], 0, 1, tf.float32)
print(f"{c}, {d}")
[[-1.5229472   0.66954476]
 [-0.6424697   1.4300429 ]], [[ 0.35981855  1.0180439 ]
 [-2.0297976  -0.7807032 ]]
with tf.device("/XPU:0"):
    print(f"XPU Result: {tf.tensordot(c,d,2)}")

with tf.device("/CPU:0"):
    print(f"CPU Result: {tf.tensordot(c,d,2)}")
XPU Result: 0.3212856948375702
CPU Result: 0.32128584384918213

@yiqianglee

@ip2016, good to see your latest result. For floating point, I think this is reasonable: we can't expect bit-for-bit identical results from floating-point arithmetic. Normally we use a relative tolerance and an absolute tolerance to compare floating-point values, and here the difference is less than 1e-6, which seems reasonable to me.

> XPU Result: 0.3212856948375702
> CPU Result: 0.32128584384918213
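
A minimal sketch of that tolerance check, using the two quoted values (the rtol/atol below are just NumPy's defaults, not an official recommendation):

import numpy as np

xpu = 0.3212856948375702
cpu = 0.32128584384918213

# np.isclose checks |a - b| <= atol + rtol * |b|; the default rtol=1e-5, atol=1e-8
# easily covers the ~1.5e-7 gap between these two float32 results.
print(abs(xpu - cpu))                              # ~1.5e-07
print(np.isclose(xpu, cpu, rtol=1e-5, atol=1e-8))  # True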

ip2016 (Author) commented Nov 1, 2022

@yiqianglee Thanks for your input.
The issue I'm facing is NaN in the loss function when I try a simple BERT fine-tuning project. I suspect this is caused by an "exploding gradient" problem due to accumulated/amplified accuracy error on the Arc A380.

This is how it runs on the CPU (in progress):

Epoch 1/2
 30/459 [>.............................] - ETA: 7:28 - loss: 0.6868 - accuracy: 0.6208

And this is an XPU run (in progress):

Epoch 1/2
2022-11-01 10:23:42.250718: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type XPU is enabled.
 30/459 [>.............................] - ETA: 20:13 - loss: nan - accuracy: 0.7000

I also noticed that it runs much slower on the GPU.

The "train" code is below (a simple example from huggingface):

# %%
import tensorflow as tf

print(tf.config.list_physical_devices())

# %%
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding
import numpy as np

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)

tf_validation_dataset = tokenized_datasets["validation"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=8,
)

# %%
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# %%
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

model.compile(
    optimizer="adam",
    loss=loss,
    metrics=["accuracy"],
)
model.fit(
    tf_train_dataset,
    validation_data=tf_validation_dataset,
    epochs=2
)
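
A small diagnostic sketch along the same lines (assuming the variables from the script above; this is a generic TensorFlow debugging aid, not something prescribed in this thread). It can help tell an exploding-gradient NaN apart from a kernel-level bug by flagging the first op that produces NaN/Inf:

import tensorflow as tf

# Raise an error at the first op that emits NaN or Inf (noticeably slows training).
tf.debugging.enable_check_numerics()

# Or simply stop fit() as soon as the reported loss becomes NaN.
model.fit(
    tf_train_dataset,
    validation_data=tf_validation_dataset,
    epochs=2,
    callbacks=[tf.keras.callbacks.TerminateOnNaN()],
)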

yiqianglee added the "bug" label Nov 2, 2022
@yiqianglee

@ip2016 Thanks for reporting; we can reproduce it now and are working on a fix.

Wanzizhu (Contributor) commented Nov 10, 2022

Hi @ip2016, the PR fixing this NaN issue has been merged; please follow the doc here to rebuild and give it a try.

ip2016 (Author) commented Nov 13, 2022

Thanks for the fast fix.
After some unsuccessful attempts, I was able to build the package.
I'm not getting NaN in the loss function anymore. However, the loss value seems a bit optimistic most of the time: for the example above I'm getting a loss of ~0.38, while on CPU and GPU (Google Colab) I get around 0.65.

I have two more issues, and I'm not sure whether they are bugs or limitations:

  1. For certain datasets I'm getting an "Out of Memory" (OOM) error. I tried to use tf.config.experimental.set_memory_growth, but it doesn't seem to work for XPU. Are there any options to overcome this limitation?
  2. When fine-tuning (training) BERT for NLP with non-padded token vectors, training takes much longer. The example above runs for about 10 minutes on the Arc A380 (slower than on the i5-10400), and intel_gpu_top shows the GPU idle at least half of the time. But if I change the line:
     return tokenizer(example["sentence1"], example["sentence2"], truncation=True)
     to:
     return tokenizer(example["sentence1"], example["sentence2"], truncation=True, padding=True)
     it runs in just around 4 minutes.
     I haven't noticed a difference when running on CPU; on a Colab GPU the non-padded version is even a bit faster, but not by much.

@yiqianglee

@ip2016
For 1: set_memory_growth doesn't remove the HW limitation. Currently ITEX's allocator allocates almost all of the device memory, so if you still see "OOM" I believe you have hit the HW upper bound. Have you tried lowering the batch size?
For 2: un-padded sequences cause dynamic shapes for MatMul, and it is a known issue that oneDNN primitives need to be re-created whenever a new shape arrives. You can confirm this by running with DNNL_VERBOSE=2: if you see many "cache miss" lines from the second iteration onward, that is the overhead (primitive creation). If that is the case on your side, then it is this known issue; we are evaluating possible solutions internally, but for now it is a known limitation.
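
A sketch of the two workarounds discussed above, applied to the earlier training script (max_length=128 and batch_size=4 are arbitrary example values, not numbers from this thread):

# 1. If OOM persists, reduce the batch size passed to to_tf_dataset (e.g. batch_size=4).
# 2. Pad every example to a fixed length so MatMul shapes stay static and
#    oneDNN primitives can be reused instead of re-created per shape.
def tokenize_function(example):
    return tokenizer(
        example["sentence1"],
        example["sentence2"],
        truncation=True,
        padding="max_length",   # fixed-length padding instead of padding=True
        max_length=128,         # assumed upper bound for MRPC sentence pairs
    )

# To confirm the primitive re-creation overhead, run the script with
#   DNNL_VERBOSE=2 python <your training script>
# and look for repeated "cache miss" lines after the first few iterations.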
