{lib}[fosscuda/2019b] TensorFlow v2.3.0 w/ Python 3.7.4 #11040

Merged

Conversation

@boegel boegel added the update label Aug 3, 2020
@boegel boegel added this to the next release (4.2.3?) milestone Aug 3, 2020
@boegel
Member

boegel commented Aug 3, 2020

@Flamefire Can you take care of the missing easyconfig for Bazel?

@Flamefire
Contributor Author

Sure. The easyblock change is still missing: I'm still having trouble on POWER, but I have a solution in the pipeline that I'll be able to test next week.
In the meantime you could merge the current easyblock PR, which allows us to remove a patch from all semi-current easyconfigs.

@Flamefire Flamefire marked this pull request as draft August 17, 2020 09:33
@Flamefire Flamefire force-pushed the 20200730100641_new_pr_TensorFlow230 branch 2 times, most recently from 64ad619 to 7fcc9bc Compare September 1, 2020 15:09
@easybuilders easybuilders deleted a comment from boegelbot Sep 2, 2020
@easybuilders easybuilders deleted a comment from boegelbot Sep 2, 2020
@Flamefire Flamefire force-pushed the 20200730100641_new_pr_TensorFlow230 branch 6 times, most recently from 5d7bd69 to d14da88 Compare September 4, 2020 08:01
@Flamefire Flamefire force-pushed the 20200730100641_new_pr_TensorFlow230 branch from d14da88 to 84a532f Compare September 4, 2020 14:56
@easybuilders easybuilders deleted a comment from boegelbot Sep 4, 2020
@easybuilders easybuilders deleted a comment from boegelbot Sep 4, 2020
@easybuilders easybuilders deleted a comment from boegelbot Sep 4, 2020
@easybuilders easybuilders deleted a comment from boegelbot Sep 4, 2020
@easybuilders easybuilders deleted a comment from boegelbot Sep 4, 2020
@Flamefire Flamefire force-pushed the 20200730100641_new_pr_TensorFlow230 branch from 84a532f to 4ae5f74 Compare September 4, 2020 16:12
@easybuilders easybuilders deleted a comment from boegelbot Sep 4, 2020
@easybuilders easybuilders deleted a comment from boegelbot Sep 6, 2020
@Flamefire Flamefire force-pushed the 20200730100641_new_pr_TensorFlow230 branch 2 times, most recently from 5fdef37 to eb96238 Compare September 8, 2020 08:09
…nd patches: TensorFlow-2.3.0-fix_protoc_build.patch
@boegel
Member

boegel commented Sep 14, 2020

@boegelbot please test @ generoso

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on generoso

PR test command 'EB_PR=11040 EB_ARGS= /apps/slurm/default/bin/sbatch --job-name test_PR_11040 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 6317

Test results coming soon (I hope)...

- notification for comment with ID 692064399 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegel
Member

boegel commented Sep 14, 2020

Test report by @boegel
FAILED
Build succeeded for 0 out of 2 (2 easyconfigs in this PR)
gligar04.gastly.os - Linux centos linux 7.8.2003, x86_64, Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz (skylake_avx512), Python 2.7.5
See https://gist.github.com/4db005f085c9d2dddcf65cba73129cc1 for a full test report.

@Flamefire
Contributor Author

@boegel You are likely using the TF EasyBlock without the JSON fix

@boegel
Member

boegel commented Sep 14, 2020

@Flamefire Never mind that test report, I started it in the wrong session...

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 3 out of 3 (2 easyconfigs in this PR)
generoso-x-4 - Linux centos linux 8.2.2004, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/9b3a6e2c233a33e60ac054a186316bf3 for a full test report.

@boegel
Member

boegel commented Sep 14, 2020

Test report by @boegel
FAILED
Build succeeded for 0 out of 2 (2 easyconfigs in this PR)
node3413.kirlia.os - Linux centos linux 7.8.2003, x86_64, Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz (cascadelake), Python 2.7.5
See https://gist.github.com/af860be930c49fd5766c0b2da80d7c76 for a full test report.

@boegel
Member

boegel commented Sep 14, 2020

$ python /tmp/easybuild_build/TensorFlow/2.3.0/foss-2019b-Python-3.7.4/TensorFlow-2.x_mnist-test.py
...
TypeError: Parameter to MergeFrom() must be instance of same class: expected tensorflow.TensorShapeProto got tensorflow.TensorShapeProto.

How does this only happen on some systems and not on others?!
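
A message of the form "expected X got X" is less paradoxical than it looks: Python type checks compare class objects, not names, so two separately created classes that share a name fail the check. The sketch below only illustrates that general mechanism. The `make_shape_class` factory and the `merge_from` helper are made up for illustration and do not reproduce what protobuf does internally.

```python
def make_shape_class():
    """Create a fresh class object; every call yields a distinct class."""
    class TensorShapeProto:
        pass
    return TensorShapeProto


def merge_from(expected_cls, other):
    """Mimic a strict type check whose error message prints only class names."""
    if not isinstance(other, expected_cls):
        raise TypeError(
            "expected %s got %s" % (expected_cls.__name__, type(other).__name__)
        )
    return other


first = make_shape_class()   # stands in for the class one module registered
second = make_shape_class()  # a same-named class created somewhere else

try:
    merge_from(first, second())
except TypeError as err:
    # Both names are identical even though the classes are different objects
    message = str(err)
```

Which of the two classes gets loaded can depend on what is installed on a given machine, which would be consistent with the failure showing up on some systems only.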

@Flamefire
Contributor Author

Flamefire commented Sep 14, 2020

This is the protobuf bug I fixed at the last minute. Reinstall protobuf-python to fix it. In short: the C++ extension for protobuf is much faster, but wrong.

edit (@boegel): see #11260
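
To check which protobuf backend a given Python environment actually picks up, something like the following may help. It is a minimal sketch that assumes the `google.protobuf.internal.api_implementation` module shipped with protobuf-python (an internal module, so it may change between releases) and the `PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION` environment variable. The helper name `protobuf_backend` is hypothetical.

```python
import importlib
import os


def protobuf_backend():
    """Return the active protobuf backend name, or None if protobuf is missing."""
    try:
        api = importlib.import_module("google.protobuf.internal.api_implementation")
    except ImportError:
        return None
    # Type() reports the implementation in use, e.g. 'python' or 'cpp'
    return api.Type()


if __name__ == "__main__":
    # The variable is only honoured if it is set before protobuf is first imported
    requested = os.environ.get("PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION", "(not set)")
    print("requested backend:", requested)
    print("active backend:", protobuf_backend())
```

Running the script with `PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python` in the environment should select the pure-Python backend, which is one way to see whether the C++ extension is involved in a failure.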

@terjekv
Collaborator

terjekv commented Sep 15, 2020

Test report by @terjekv
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in this PR)
ninhursaga.uio.no - Linux RHEL 8.2, x86_64, Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz (cascadelake), Python 3.6.8
See https://gist.github.com/ced6c60e2ab6b1bd3bf4e426d4e22e02 for a full test report.

@terjekv
Collaborator

terjekv commented Sep 15, 2020

I had to rebuild ICU and double-conversion. After that, all good. The TF EasyBlock is complicated enough, but I guess this will bite others as well. Should there be a way to detect such shortcomings and state that "a rebuild of X, Y, and Z in this toolchain is required for this EasyConfig"? Or, alternatively, state this without attempting to detect the problem?

I realise that this is a situation we try very very hard to avoid, but it crops up now and then, especially for "modern" applications with a billion dependencies that have extremely specific expectations of their environment.

@Flamefire
Contributor Author

I was wondering whether EB could do that, e.g. by running the sanity check for the dependencies again when a build fails, to see whether those were updated.
But anyway, the error is usually clear enough in pointing to the dependency. Just telling users to rebuild that and try again should be enough.
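
The idea of re-checking dependencies after a failed build could be sketched roughly as below. This is not EasyBuild code: `find_broken_dependencies`, `rebuild_hint` and the `expected_files` mapping are all hypothetical, and a real sanity check in EB does more than test whether files exist.

```python
import os


def find_broken_dependencies(expected_files):
    """Return the names of dependencies with at least one expected file missing.

    expected_files maps a dependency name to the paths its sanity check
    would look for (a hypothetical stand-in for a real sanity check).
    """
    broken = []
    for name, paths in expected_files.items():
        if not all(os.path.exists(path) for path in paths):
            broken.append(name)
    return sorted(broken)


def rebuild_hint(expected_files):
    """Build the message to show to the user after a failed build."""
    broken = find_broken_dependencies(expected_files)
    if not broken:
        return "All dependencies pass their sanity check."
    return "Rebuild these dependencies and try again: " + ", ".join(broken)
```

Called only on the failure path, a check like this would cost nothing for successful builds and would turn a confusing build error into a concrete rebuild suggestion.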

@boegel
Member

boegel commented Sep 15, 2020

Test report by @boegel
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in this PR)
node3404.kirlia.os - Linux centos linux 7.8.2003, x86_64, Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz (cascadelake), Python 2.7.5
See https://gist.github.com/476ee9c9a5d9f7f914171dcdc83ce028 for a full test report.

@boegel
Member

boegel commented Sep 15, 2020

Test report by @boegel
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in this PR)
node3308.joltik.os - Linux centos linux 7.8.2003, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), Python 3.6.8
See https://gist.github.com/88d6498cc0921f2f071f655b3a543263 for a full test report.

@boegel
Member

boegel commented Sep 15, 2020

Test report by @boegel
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in this PR)
node2407.golett.os - Linux centos linux 7.8.2003, x86_64, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (haswell), Python 2.7.5
See https://gist.github.com/8aefff37191b0f82f3a7ff2e49cc243a for a full test report.

@boegel
Member

boegel commented Sep 15, 2020

@boegelbot please test @ generoso

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on generoso

PR test command 'EB_PR=11040 EB_ARGS= /apps/slurm/default/bin/sbatch --job-name test_PR_11040 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 7813

Test results coming soon (I hope)...

- notification for comment with ID 692874551 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in this PR)
generoso-x-1 - Linux centos linux 8.2.2004, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/efaa0ae5e257a6680711c352c6c62555 for a full test report.

@boegel
Member

boegel commented Sep 24, 2020

Test report by @boegel
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in this PR)
node3404.kirlia.os - Linux centos linux 7.8.2003, x86_64, Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz (cascadelake), Python 2.7.5
See https://gist.github.com/0004bc760b26bc4802e229707c2842da for a full test report.

@boegel
Member

boegel commented Sep 24, 2020

Tested on top of easybuilders/easybuild-easyblocks#2166, good to go now imho, thanks @Flamefire!

@boegel boegel merged commit 3427b27 into easybuilders:develop Sep 24, 2020
@Flamefire Flamefire deleted the 20200730100641_new_pr_TensorFlow230 branch September 24, 2020 14:44
@Flamefire
Contributor Author

Seriously? This gets merged and hours later they release a .1? https://github.com/tensorflow/tensorflow/releases/tag/v2.3.1

@boegel I'd just update this EC if you are ok with that.

@akesandgren
Contributor

What did you expect? It's Friday, there's always an update of something on a Friday, usually kernels though.

@terjekv
Collaborator

terjekv commented Sep 25, 2020

@Flamefire, You have to understand, PyTorch were waiting for you to get this approved, then they could release a new version.

@boegel
Member

boegel commented Sep 25, 2020

@Flamefire Updating makes sense, still only in develop...

@Flamefire
Contributor Author

Updated in #11375

6 participants