enable MKL-DNN/oneDNN usage for aarch64 CPUs in TensorFlow 2.5+ #2574

Flamefire
Contributor

(created using eb --new-pr)

@boegel boegel changed the title from "TensorFlow: Enable oneDNN usage for AARCH64 in 2.5+" to "TensorFlow: Enable MKL-DNN/oneDNN usage for AARCH64 in 2.5+" Sep 14, 2021
@boegel boegel added this to the next release (4.4.3?) milestone Sep 14, 2021
easybuild/easyblocks/t/tensorflow.py (review comment, outdated and resolved)
@boegel boegel changed the title from "TensorFlow: Enable MKL-DNN/oneDNN usage for AARCH64 in 2.5+" to "enable MKL-DNN/oneDNN usage for aarch64 CPUs in TensorFlow 2.5+" Sep 14, 2021
add logging for auto-enabling use of MKL-DNN for TensorFlow
@boegel
Member

boegel commented Sep 14, 2021

Test report by @boegel

Overview of tested easyconfigs (in order)

Build succeeded for 0 out of 2 (2 easyconfigs in total)
select-pika-c6gd-2xlarge-0001 - Linux centos linux 8.3.2011, AArch64, ARM UNKNOWN (graviton2), Python 3.6.8
See https://gist.github.com/5a6ba346212cb80fda239c0f1a73198f for a full test report.

@Flamefire
Contributor Author

@boegel

gcc: fatal error: Killed signal terminated program cc1plus

Any cgroup stuff maybe?

@boegel
Member

boegel commented Sep 15, 2021

> @boegel
>
> gcc: fatal error: Killed signal terminated program cc1plus
>
> Any cgroup stuff maybe?

Or perhaps lack of memory...

This was on a c6g.2xlarge AWS instance (8 cores, 16GB of RAM), which is perhaps a bit on the light side to build TensorFlow from source (using 8 cores)?

I'll try again on a fatter node (or using fewer cores).
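
The cores-versus-RAM trade-off above can be made explicit. This is a minimal sketch, not part of the easyblock: the function name is hypothetical, and the 4 GB per-job budget is an illustrative guess rather than a measured figure for TensorFlow builds.

```python
def max_parallel_jobs(cores, ram_gb, gb_per_job=4.0):
    """Cap build parallelism so that all compiler jobs fit in RAM.

    gb_per_job is an ASSUMED peak memory per compiler process
    (e.g. cc1plus); it is a guess for illustration, not a measurement.
    """
    if cores < 1 or ram_gb <= 0 or gb_per_job <= 0:
        raise ValueError("cores, ram_gb and gb_per_job must be positive")
    by_memory = int(ram_gb // gb_per_job)
    # Never go below one job, never above the number of cores.
    return max(1, min(cores, by_memory))


# The instance from the failed test report: 8 cores, 16GB of RAM.
print(max_parallel_jobs(8, 16))
```

Under that assumed budget the 8-core, 16GB instance would be limited to fewer jobs than it has cores, which matches the suspicion that building with all 8 cores ran out of memory.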

@boegel
Member

boegel commented Sep 15, 2021

Test report by @boegel

Overview of tested easyconfigs (in order)

Build succeeded for 0 out of 2 (2 easyconfigs in total)
select-pika-c6gd-4xlarge-0001 - Linux centos linux 8.3.2011, AArch64, ARM UNKNOWN (graviton2), Python 3.6.8
See https://gist.github.com/1b7f95f9ca3960d7e3d2a2de5403ce0a for a full test report.

@Flamefire
Contributor Author

@boegel This is a bug in TF: excluding based on both the arch and a config flag leads to ambiguous, yet identical, select cases: https://github.com/tensorflow/tensorflow/blob/72fd2bfa42a8ad909baf8d2b7b674563d256514d/tensorflow/core/kernels/BUILD#L3028-L3034 (both arm_any and no_mkldnn_contraction_kernel trigger)
Can you open a TF bug?
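
To illustrate the failure mode, here is a toy Python model of a `select()` with two conditions that can match at the same time. It is a sketch only: it assumes the strict rule that more than one matching condition is an error even when the branches carry the same value, and the branch values are placeholders rather than the contents of the linked BUILD file. It is not Bazel's implementation.

```python
DEFAULT = "//conditions:default"


def resolve_select(branches, active_conditions):
    """Toy model of select(): return the value of the single matching branch.

    ASSUMPTION: two or more matching conditions are rejected as ambiguous,
    even if their values are identical (the case described in this thread).
    """
    matches = [cond for cond in branches
               if cond != DEFAULT and cond in active_conditions]
    if len(matches) > 1:
        raise ValueError("ambiguous select, conditions match: "
                         + ", ".join(sorted(matches)))
    if matches:
        return branches[matches[0]]
    return branches[DEFAULT]


# Condition names are taken from the thread; the values are placeholders.
kernel_deps = {
    ":no_mkldnn_contraction_kernel": [],
    ":arm_any": [],
    DEFAULT: ["placeholder_dep"],
}

# Only the arch condition is active: resolves without a problem.
print(resolve_select(kernel_deps, {":arm_any"}))

# Arch condition plus config flag: both branches match, so this fails.
try:
    resolve_select(kernel_deps, {":arm_any", ":no_mkldnn_contraction_kernel"})
except ValueError as err:
    print(err)
```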

@boegel
Member

boegel commented Sep 15, 2021

@Flamefire It's not clear to me what's going wrong here, so I'm not sure what to put in the issue. Also, I have very little bandwidth to follow up on it...

Does it basically mean that --config=mkl_aarch64 doesn't work at all, or is it more subtle than that?

@Flamefire
Contributor Author

Did that: tensorflow/tensorflow#52027

More or less: yes. However, for ARM we might want to use --define build_with_mkl_aarch64=true anyway, see #2577 (comment)
Or none at all, i.e. change with_mkl_dnn to False, especially as the name is likely misleading...
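
The three options just mentioned can be laid out side by side. The helper below is hypothetical and is not the easyblock's code; it only uses the flags and the `with_mkl_dnn` name quoted in this thread, and whether either flag actually enables oneDNN on aarch64 is exactly what is in question here.

```python
def aarch64_mkl_flags(arch, tf_version, with_mkl_dnn=True, use_define=True):
    """Return the Bazel flags discussed in this thread (hypothetical helper).

    ASSUMPTION: only aarch64 with TensorFlow 2.5+ is handled; every other
    case returns no flags because it is out of scope for this sketch.
    """
    if not with_mkl_dnn:
        # Option 3: pass nothing at all.
        return []
    major_minor = tuple(int(part) for part in tf_version.split(".")[:2])
    if arch != "aarch64" or major_minor < (2, 5):
        return []
    if use_define:
        # Option 2: the --define form suggested for ARM.
        return ["--define", "build_with_mkl_aarch64=true"]
    # Option 1: the --config form that hit the ambiguous select.
    return ["--config=mkl_aarch64"]


print(aarch64_mkl_flags("aarch64", "2.6.0"))
print(aarch64_mkl_flags("aarch64", "2.6.0", use_define=False))
print(aarch64_mkl_flags("aarch64", "2.6.0", with_mkl_dnn=False))
```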

@boegel boegel added the aarch64 Related to Arm 64-bit (aarch64) label Sep 17, 2021
@Flamefire
Contributor Author

Closing, as this doesn't actually enable MKL-DNN/oneDNN. See #2577 and https://groups.google.com/a/tensorflow.org/g/build/c/RZhgZst-fgQ

@Flamefire Flamefire closed this Sep 24, 2021
@Flamefire Flamefire deleted the 20210914103157_new_pr_tensorflow branch June 27, 2024 12:27
Labels
aarch64 Related to Arm 64-bit (aarch64) enhancement