Parallelization of fast_csr_mv #67

Open · GeoffNN wants to merge 2 commits into master

Conversation

GeoffNN (Collaborator) commented Jun 18, 2020

Parallelization for the sparse matrix multiplication, row-wise.
Cf. this thread, which reports nice speed-ups.
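
For reference, the change boils down to a prange over the rows; roughly along these lines (a minimal sketch, the actual fast_csr_mv in the repo may have a different signature):

```python
# Minimal sketch of a row-parallel CSR matrix-vector product with numba.
# Illustrative only: the real fast_csr_mv in the code base may differ.
import numpy as np
from numba import njit, prange


@njit(parallel=True)
def fast_csr_mv_sketch(data, indptr, indices, x, out):
    # Each row's dot product is independent, so rows can be split across threads.
    for i in prange(indptr.shape[0] - 1):
        acc = 0.0
        for jj in range(indptr[i], indptr[i + 1]):
            acc += data[jj] * x[indices[jj]]
        out[i] = acc


# Usage with a scipy.sparse CSR matrix A and a dense vector x (illustrative):
# out = np.empty(A.shape[0]); fast_csr_mv_sketch(A.data, A.indptr, A.indices, x, out)
```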

fabianp (Member) commented Jun 18, 2020

I'm running the example code with and without this patch; the 12 CPUs I have are all constantly at 100%, and I don't see any speed improvement from this patch. It seems that numba (or LLVM) is by default already deciding to parallelize some parts of the code.

fabianp (Member) commented Jun 18, 2020

I'm going to run a couple more benchmarks, but if the performance is the same I would be more inclined to let numba decide which parts to parallelize, as it will likely do a better job than us at that (thinking for example of nested parallelizable loops) ;-)

fabianp (Member) commented Jun 18, 2020

And the parallelization I observe definitely comes from numba: I can get it down to using just one CPU with the environment variable NUMBA_NUM_THREADS=1.
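
For reference, a quick way to check that the threads come from numba's pool (a sketch; NUMBA_NUM_THREADS has to be set before numba is first imported, and set_num_threads/get_num_threads require numba >= 0.49):

```python
# Sketch: inspect / shrink numba's thread pool to confirm where the CPU usage comes from.
# NUMBA_NUM_THREADS must be set in the environment before numba is first imported, e.g.:
#   NUMBA_NUM_THREADS=1 python run_example.py   # script name illustrative
import numba

print(numba.config.NUMBA_NUM_THREADS)  # thread budget numba started with
numba.set_num_threads(1)               # runtime override, numba >= 0.49
print(numba.get_num_threads())         # -> 1
```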

GeoffNN (Collaborator, Author) commented Jun 18, 2020

For now, the only parallelization that's done in this part of the code base is for sampling the batches, once per epoch. Is it possible that the 100% on the CPUs (that I also observe on my machines) is because there's no deallocation between two calls to that function?

What happens if we remove parallel=True for sampling the batches, and use it for the matrix multiplication instead?

fabianp (Member) commented Jun 18, 2020

OK, I think I found why I was seeing all CPUs being used. It's because the bottleneck is not in the algorithm itself but in computing the fw_gap used for reporting. This uses the full gradient, and hence NumPy's matrix-vector routines, which fire up all CPUs.
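
One way to confirm that is to cap the BLAS thread pool and see whether the CPU usage drops (a sketch using threadpoolctl, which is not a dependency of this repo, just a diagnostic here):

```python
# Sketch: check whether the extra CPUs are used by NumPy's BLAS (the dense mat-vec in
# the fw_gap / full-gradient computation) rather than by numba's parallel kernels.
# threadpoolctl is used purely as a diagnostic; it is not part of this project.
import numpy as np
from threadpoolctl import threadpool_limits

A = np.random.rand(5000, 5000)  # stand-in for the dense data used in the gradient
x = np.random.rand(5000)

with threadpool_limits(limits=1, user_api="blas"):
    y = A @ x  # inside the block BLAS is pinned to one core; outside, it uses them all
```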

GeoffNN (Collaborator, Author) commented Jul 20, 2020

Does that mean it would be beneficial in practice? For applications, you won't be computing the gap very often.
