Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Enable test parallelism #5027

Closed
1 task
csadorf opened this issue Nov 24, 2022 · 6 comments
Closed
1 task

[FEA] Enable test parallelism #5027

csadorf opened this issue Nov 24, 2022 · 6 comments
Labels
? - Needs Triage Need team to review and classify feature request New feature or request tests Unit testing for project

Comments

@csadorf
Copy link
Contributor

csadorf commented Nov 24, 2022

As reported by @vyasr there are currently some issues with running tests in paralle using pytest-xdist. We should resolve those to significantly reduce cuml's test runtime when multiple workers are available.

@csadorf csadorf added feature request New feature or request ? - Needs Triage Need team to review and classify tests Unit testing for project labels Nov 24, 2022
@csadorf
Copy link
Contributor Author

csadorf commented Nov 24, 2022

@vyasr Maybe you can provide a little bit more context?

@vyasr
Copy link
Contributor

vyasr commented Nov 28, 2022

Yup, the current issue that I run into is that the different workers collect different tests. Offhand I would guess that the problem is one of the following:

  1. Some unordered container is being used to parametrize tests and is resulting in different orders on different workers.
  2. Some sort of randomization is being used in parametrization to select which tests run when
  3. There is something nondeterministic in the selection of stress/quality/unit tests that changes their ordering.

I've attached the output that I see.

pytest_output_two_workers.txt

@wphicks
Copy link
Contributor

wphicks commented Dec 5, 2022

The non-determinism in test collection should be fixed by #5059, but it looks like we've got issues with some of the tests themselves unrelated to the question of parallelism.

@vyasr
Copy link
Contributor

vyasr commented Dec 6, 2022

Could you clarify what issues you're talking about? Are these issues that prevent the usage of parallelism (perhaps because of GPU overutilization or memory requirements), unrelated issues that lead to long test times (aka #5053), or something else?

@wphicks
Copy link
Contributor

wphicks commented Dec 8, 2022

Sorry, I was actually referring to unrelated failures that were present across PRs. That has been resolved, but now there are indeed failures which look like they're definitely caused by the parallelism itself. Specifically, we're seeing:

  • Hypothesis tests exceeding their deadlines (probably benign resource contention and easily resolvable)
  • Running out of GPU memory
  • Some more troubling failures in the Dask tests, including what appears to be actual numerical errors

@dantegd
Copy link
Member

dantegd commented Dec 27, 2022

xdist is now being used in the GHA based CI

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
? - Needs Triage Need team to review and classify feature request New feature or request tests Unit testing for project
Projects
None yet
Development

No branches or pull requests

4 participants