the sample allreduce-test.py from README runs forever #10

Open
nitinkamble opened this issue Jun 29, 2017 · 0 comments

nitinkamble commented Jun 29, 2017

First, some context for my test.

I built the repo along with its dependencies, configured Slurm, and have both Intel MPI and OpenMPI installed. I used the sample lines from the README to create train.txt and vocab.txt. The CUDA 8.0 libraries are built and installed on the system, and I have configured the GPUs as Slurm gres resources. I also see messages showing the GPU libraries being loaded by TensorFlow.

I also set --max-interations to 10 to reduce the runtime. For such a small dataset the run should finish very quickly, but it has been running for days. I tried with 2 tasks and also with 50 tasks. I see many CPU cores running at almost 100%, but nothing is running on the GPUs.

First question: why does it run forever for such a small test?

Second question: why are the GPUs not being used?
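
For the second question, a per-task check along the following lines might help narrow things down. It is only a sketch: it assumes Slurm exports `CUDA_VISIBLE_DEVICES` to tasks that were granted a gres GPU, and `visible_gpus` is a hypothetical helper written for this check, not something from the repo.

```python
import os


def visible_gpus(environ=None):
    """Return the GPU ids listed in CUDA_VISIBLE_DEVICES, or None if it is unset.

    Hypothetical helper: an empty list means the variable is set but hides
    every GPU; None means the variable is not set at all.
    """
    env = os.environ if environ is None else environ
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    # Split the comma-separated list and drop empty entries.
    return [tok.strip() for tok in raw.split(",") if tok.strip()]


if __name__ == "__main__":
    # SLURM_PROCID is the task rank set by Slurm; "?" if run outside Slurm.
    rank = os.environ.get("SLURM_PROCID", "?")
    print("task", rank, "CUDA_VISIBLE_DEVICES ->", visible_gpus())
```

Launching this script the same way as allreduce-test.py would show, per task, whether any GPU is exposed to the process. An empty list would suggest the tasks cannot see any GPU at all, which would fit the CPU-only load described above.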

Thanks in advance,
Nitin
