fix tpu bug #2350
Thanks @infwinston! Could we add a quick workdir test to an existing TPU smoke test, to guard against future regressions?
The code looks good to me. 🙂 Just to make sure I haven't missed anything, how is this related to the port PR?
It should have been introduced by #2096 instead. Sorry for not identifying that issue during the review. : )
oh yes #2096. Lines 1355 to 1369 in ca2a092
Some of the TPU tests were failing, and I don't fully remember the reason for the failure, but I remember it looked unrelated to this (IIRC it was related to availability). The same tests had passed during a previous round of testing, so I assumed it was a transient error. I should have double checked - my bad 🙏
no worries! just wanted to make sure our smoke test can catch this. thanks
Fix a bug for TPU pods where the actual number of node IPs != `launched_nodes` (from port PR #2210). The bug caused failures in setup and workdir sync on worker nodes.
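For readers unfamiliar with this failure mode, here is a minimal sketch of how a mismatch between the IP list and the node count can leave worker hosts without setup or workdir sync. It is not the code changed in this PR: the function names, the IP addresses, and the one-node-to-four-hosts ratio are all hypothetical and chosen only for illustration.

```python
from typing import List


def targets_by_node_count(worker_ips: List[str], launched_nodes: int) -> List[str]:
    """Buggy pattern (hypothetical): assumes exactly one IP per launched node."""
    return [worker_ips[i] for i in range(launched_nodes)]


def targets_by_ip_list(worker_ips: List[str]) -> List[str]:
    """Safer pattern (hypothetical): visit every IP that was actually reported."""
    return list(worker_ips)


# Assumed values: one launched TPU pod "node" backed by four worker hosts.
launched_nodes = 1
worker_ips = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]

# Hosts that a count-based loop would never reach for setup or workdir sync.
skipped = set(targets_by_ip_list(worker_ips)) - set(
    targets_by_node_count(worker_ips, launched_nodes))
print(sorted(skipped))
```

With these assumed values, the count-based helper reaches only the first host, so the other three would be skipped. Iterating over the reported IP list avoids depending on the two numbers being equal.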
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_comaptibility_tests.sh
tpuvm_mnist.yaml on TPU pod v2-32