This repository has been archived by the owner on Jan 3, 2023. It is now read-only.

Problems getting ngraph-tf to run under manjaro #535

Open
SleepProgger opened this issue May 28, 2019 · 2 comments

SleepProgger commented May 28, 2019

I have been trying for several days to get ngraph-tf to run under Manjaro and have run into multiple problems.
The goal is to use ngraph-tf with the PlaidML backend.

I am testing with the following code:

import tensorflow as tf
import os
import sys
if os.environ.get("USE_TF_KERAS", "1") == "1":
    import tensorflow.keras as keras
    print("Using tensorflow keras version")
else:
    import keras
    print("Using keras with backend %s" % keras.backend.backend())


if len(sys.argv) < 2:
    backend = "CPU"
else:
    backend = sys.argv[1]
if backend == "NONE":
    print("NOT using ngraph")
else:
    import ngraph_bridge
    print("Supported ngraph backend:\n  %s" % "\n  ".join(ngraph_bridge.list_backends()))
    ngraph_bridge.set_backend(backend)
    print("Using ngraph backend %s" % ngraph_bridge.get_currently_set_backend_name())

mnist = keras.datasets.mnist
(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
model = keras.models.Sequential([
  keras.layers.Flatten(input_shape=(28, 28)),
  keras.layers.Dense(512, activation="relu"),
  keras.layers.Dropout(0.2),
  keras.layers.Dense(10, activation="softmax")
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

print("Predict:", model.predict(x_train[:1]))

model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)

When I run it with tensorflow.keras and the ngraph backend set to PLAIDML (USE_TF_KERAS=1 KERAS_BACKEND="tensorflow" python test_ngrapg_tf.py PLAIDML), I get either a segfault or the following stack trace (sometimes one, sometimes the other):

Traceback (most recent call last):
  File "test_ngrapg_tf.py", line 39, in <module>
    model.fit(x_train, y_train, epochs=5)
  File "/run/media/nope/data/home/nope/workspace/test/fs/ngraph-tf_master/build_cmake/venv-tf-py3/lib/python3.5/site-packages/tensorflow/python/keras/engine/training.py", line 880, in fit
    validation_steps=validation_steps)
  File "/run/media/nope/data/home/nope/workspace/test/fs/ngraph-tf_master/build_cmake/venv-tf-py3/lib/python3.5/site-packages/tensorflow/python/keras/engine/training_arrays.py", line 329, in model_iteration
    batch_outs = f(ins_batch)
  File "/run/media/nope/data/home/nope/workspace/test/fs/ngraph-tf_master/build_cmake/venv-tf-py3/lib/python3.5/site-packages/tensorflow/python/keras/backend.py", line 3076, in __call__
    run_metadata=self.run_metadata)
  File "/run/media/nope/data/home/nope/workspace/test/fs/ngraph-tf_master/build_cmake/venv-tf-py3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1439, in __call__
    run_metadata_ptr)
  File "/run/media/nope/data/home/nope/workspace/test/fs/ngraph-tf_master/build_cmake/venv-tf-py3/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: Caught exception while compiling op_backend: get_shape() must be called on a node with exactly one output ()

	 [[{{node ngraph_cluster_44}}]]
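
For the segfault case no Python traceback is printed at all. A minimal sketch, assuming the standard-library faulthandler module is acceptable in the test script, that writes the Python-level stack to a log file when the process dies from a fatal signal such as SIGSEGV or SIGILL. The helper name and the log path are hypothetical and not part of the original script:

```python
import faulthandler


def enable_crash_traces(log_path):
    """Write a Python traceback to log_path if the process gets a fatal signal.

    faulthandler installs handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS and
    SIGILL. The returned file object must stay open for as long as the
    handler is enabled, so the caller should keep a reference to it.
    """
    log = open(log_path, "w")
    faulthandler.enable(file=log, all_threads=True)
    return log
```

Calling something like crash_log = enable_crash_traces("ngraph_crash.log") before import ngraph_bridge would at least show which Python call was active when the crash happened. It does not show the native stack inside ngraph or plaidml.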

When I run it with keras and the keras backend set to tensorflow (USE_TF_KERAS=0 KERAS_BACKEND="tensorflow" python test_ngrapg_tf.py PLAIDML), I reliably get invalid OpenCL kernels generated by plaidml (see plaidml/plaidml#322).

Both variants execute the prediction step just fine, although keras with the tensorflow backend seems to produce wrong values.

With only tensorflow or plaidml via keras (or, in the case of tf, also tf.keras) and without ngraph-tf, it runs without problems (USE_TF_KERAS=1/0 KERAS_BACKEND="tensorflow" python test_ngrapg_tf.py NONE).
These tests were made with a self-built version of ngraph-tf, both with and without the --use_prebuilt_tensorflow parameter.
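
The combinations described above (USE_TF_KERAS set to 1 or 0, crossed with the backend argument NONE, CPU or PLAIDML) can be run in one go. This is only a sketch: the helper names build_runs and run_all are hypothetical, and it assumes the script sits in the current directory under the name used in the commands above.

```python
import itertools
import os
import subprocess
import sys

SCRIPT = "test_ngrapg_tf.py"  # script name as used in the commands above
BACKENDS = ["NONE", "CPU", "PLAIDML"]


def build_runs(script=SCRIPT, backends=BACKENDS):
    """Return (extra_env, argv) pairs for every USE_TF_KERAS/backend combination."""
    runs = []
    for use_tf_keras, backend in itertools.product(["1", "0"], backends):
        env = {"USE_TF_KERAS": use_tf_keras, "KERAS_BACKEND": "tensorflow"}
        runs.append((env, [sys.executable, script, backend]))
    return runs


def run_all(runs):
    """Run each combination and collect the exit codes.

    On POSIX a negative return code means the child was killed by that
    signal number, e.g. -11 for SIGSEGV or -4 for SIGILL.
    """
    results = {}
    for env, argv in runs:
        proc = subprocess.run(argv, env={**os.environ, **env})
        results[(env["USE_TF_KERAS"], argv[-1])] = proc.returncode
    return results
```

Printing run_all(build_runs()) gives one exit code per combination, which makes it easier to tell a segfault from an ordinary Python exception.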

With the ngraph CPU backend it runs with both keras (tensorflow as keras backend) and tf.keras, although in both cases it is much slower than plain tensorflow-cpu without ngraph.
Additionally, when using keras with the backend set to tensorflow, the results seem to be wrong.
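
To put a number on "slower", a small timing wrapper around the fit call would be enough. This is a sketch with a hypothetical helper name (timed) that uses only the standard library:

```python
import time


def timed(label, fn, *args, **kwargs):
    """Call fn(*args, **kwargs), print the wall-clock time, return (result, seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print("%s took %.2f s" % (label, elapsed))
    return result, elapsed
```

In the test script this would be used as timed("fit", model.fit, x_train, y_train, epochs=5). The first epoch may include one-time compilation work in ngraph (an assumption, not something measured here), so comparing per-epoch times may be more telling than the total.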

When I run it with the ngraph CPU backend using the PyPI version of ngraph-tf installed via pip, I get an Illegal instruction crash with both keras->tensorflow and tf.keras.

Additional info

I am using python 3.5.5 installed via pyenv.

# uname -a 
Linux seima-pc 5.0.15-1-MANJARO #1 SMP PREEMPT Fri May 10 19:51:04 UTC 2019 x86_64 GNU/Linux

GPU: Radeon RX 580

When compiling ngraph-tf, I need to create a link from lib64 to lib in the artifact dir; otherwise the ngraph-tf build fails, because it expects the lib dir but creates the lib64 dir (not sure if this is relevant).
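
The link workaround can be scripted. This is a sketch with a hypothetical helper name (ensure_lib_symlink); the path of the artifact dir depends on the build setup and is not given here, so it is passed in as an argument:

```python
import os


def ensure_lib_symlink(artifact_dir):
    """Create artifact_dir/lib as a symlink to lib64 if only lib64 exists.

    Returns True if a link was created, False if nothing was done.
    """
    lib = os.path.join(artifact_dir, "lib")
    lib64 = os.path.join(artifact_dir, "lib64")
    if os.path.isdir(lib64) and not os.path.lexists(lib):
        # Relative target, so the link survives moving the artifact dir.
        os.symlink("lib64", lib)
        return True
    return False
```

The function is safe to call repeatedly: it leaves an existing lib directory or link untouched.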

Sorry for the wall of text, but I really don't know where it goes wrong.
Please let me know if additional information is required.

SleepProgger commented May 31, 2019

After working around (although in a very crude way) the invalid OpenCL kernel generated by plaidml (plaidml/plaidml#322), I now get the same error with both tensorflow.keras and keras with the keras backend set to tensorflow, i.e.:

Traceback (most recent call last):
  File "test_ngrapg_tf.py", line 39, in <module>
    model.fit(x_train, y_train, epochs=5)
  File "/run/media/nope/data/home/nope/workspace/test/fs/ngraph-tf_master/build_cmake/venv-tf-py3/lib/python3.5/site-packages/tensorflow/python/keras/engine/training.py", line 880, in fit
    validation_steps=validation_steps)
  File "/run/media/nope/data/home/nope/workspace/test/fs/ngraph-tf_master/build_cmake/venv-tf-py3/lib/python3.5/site-packages/tensorflow/python/keras/engine/training_arrays.py", line 329, in model_iteration
    batch_outs = f(ins_batch)
  File "/run/media/nope/data/home/nope/workspace/test/fs/ngraph-tf_master/build_cmake/venv-tf-py3/lib/python3.5/site-packages/tensorflow/python/keras/backend.py", line 3076, in __call__
    run_metadata=self.run_metadata)
  File "/run/media/nope/data/home/nope/workspace/test/fs/ngraph-tf_master/build_cmake/venv-tf-py3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1439, in __call__
    run_metadata_ptr)
  File "/run/media/nope/data/home/nope/workspace/test/fs/ngraph-tf_master/build_cmake/venv-tf-py3/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: Caught exception while compiling op_backend: get_shape() must be called on a node with exactly one output ()

	 [[{{node ngraph_cluster_44}}]]

or a segfault (sometimes one, sometimes the other).

I plan to try an Ubuntu-based distro tomorrow to see if it is indeed Manjaro-related.

SleepProgger commented Jun 2, 2019

Sadly, basically the same behavior under Mint (Ubuntu LTS based).

  • PlaidML as ngraph backend segfaults, and ngraph's CPU backend is much slower than pure tensorflow-cpu.
  • The pip version of ngraph-tf still fails with "Illegal instruction" (probably related to it using CPU features not supported by my AMD FX CPU).
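
The CPU-feature guess can be checked against /proc/cpuinfo. This is a sketch with hypothetical helper names; the list of wanted flags is an assumption, since I don't know which instruction sets the pip wheel was actually built for:

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_flags(cpuinfo_text, wanted=("avx", "avx2", "fma", "sse4_1", "sse4_2")):
    """Return the wanted flags (an assumed list) that the CPU does not report."""
    have = cpu_flags(cpuinfo_text)
    return [flag for flag in wanted if flag not in have]
```

On Linux this would be used as missing_flags(open("/proc/cpuinfo").read()). Any flag it reports is only a candidate for the instruction that triggers the crash, not a confirmed cause.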
