Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix:Improve log warnings in REST API /health endpoint #5381

Merged
merged 5 commits into from
Jul 25, 2023
Merged

Conversation

vblagoje
Copy link
Member

What?

In this PR, we make an improvement to the log warning in the REST API /health endpoint. Specifically, we change the warning message from "No NVIDIA GPU found." to "Couldn't collect GPU stats: ".

Why?

The current warning message is misleading and may cause confusion among users, potentially leading to unnecessary new issues being created. By giving a more descriptive and accurate warning, we aim to provide a better user experience and minimize confusion.

How can it be used?

This change is internal and does not affect the way users interact with the REST API /health endpoint. The improvement will only affect the content of the log messages generated when a request is made to this endpoint.

How did you test it?

The testing for this change is still ongoing. Manual testing is currently being conducted to ensure the updated log warning is accurate and provides the desired clarity.

Notes for the reviewer

Do not integrate yet. This PR is issued to ease communication between the stakeholders involved.

@vblagoje vblagoje requested a review from a team as a code owner July 19, 2023 08:30
Copy link
Member

@ArzelaAscoIi ArzelaAscoIi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validated against CUDA driver version 470.XX that is has the known pynvml issue described here.

Can confirm that the error is not logged 🚀 Thank you for taking care of it.

@github-actions github-actions bot added the type:documentation Improvements on the docs label Jul 24, 2023
@vblagoje
Copy link
Member Author

@ArzelaAscoIi would you please test this version?

Copy link
Member

@ArzelaAscoIi ArzelaAscoIi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works as expected 👍 Thank you!

@vblagoje vblagoje added this to the 1.19.0 milestone Jul 25, 2023
@vblagoje
Copy link
Member Author

vblagoje commented Jul 25, 2023

@anakin87 the existing unit test related to /health checkpoint rest_api/test/test_rest_api.py::test_get_health_check is decent already. As there are no regressions and @ArzelaAscoIi has confirmed this branch works in his env, I'll merge this PR into the main, and then we can cherry-pick it to 1.19

@anakin87
Copy link
Member

@vblagoje please go ahead and merge!

@vblagoje vblagoje merged commit 22897c1 into main Jul 25, 2023
8 checks passed
@vblagoje vblagoje deleted the nvml_warning branch July 25, 2023 15:06
anakin87 pushed a commit that referenced this pull request Jul 26, 2023
* Improve warning in REST APIs get_health_status method

* Convert log message

* A better solution and documentation

* Add another nested try/except block

* Simplify
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic:rest_api type:documentation Improvements on the docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants