Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(system_monitor): incorrect counter increment in CPU Usage monitor #1783

Conversation

v-nakayama7440-esol
Copy link
Contributor

@v-nakayama7440-esol v-nakayama7440-esol commented Sep 5, 2022

Signed-off-by: v-nakayama7440-esol [email protected]

Description

There is a bug that the cpu_monitor does not generate a warning or error unless the CPU usage becomes more than the counter threshold.
The default parameters for monitoring CPU usage are as follows, and cpu_monitor is supposed to generate a warning after the CPU usage becomes 96% or more and it appears 2 consecutive times (monitor interval is about 1sec). As regards to error, cpu_monitor shall generate an error after the usage becomes 100% and it appears 2 consecutive times.

    usage_warn: 0.96
    usage_error: 1.00
    usage_warn_count: 2
    usage_error_count: 2

However, the current implementation of cpu_monitor generates a warning/error after the appearance of 3 times, it's counting one extra. This has been introduced from this fix #557.

Changed parameters as follows to make testing easy.

    usage_error: 0.90
    usage_error_count: 2

And Observed an error was generated after count of 3. (This is a dump of CPU Usage)

1662360194,170445574,...,b'\x00',cpu_monitor: CPU Usage,OK,se-osaka-dev32,CPU all: status,OK,CPU all: total,16.34%,...
1662360195,526444173,...,b'\x00',cpu_monitor: CPU Usage,OK,se-osaka-dev32,CPU all: status,OK,CPU all: total,100.00%,...
1662360196,862445342,...,b'\x00',cpu_monitor: CPU Usage,OK,se-osaka-dev32,CPU all: status,OK,CPU all: total,99.94%,...
1662360198,246452080,...,b'\x02',cpu_monitor: CPU Usage,very high load,se-osaka-dev32,CPU all: status,very high load,CPU all: total,99.94%,...

This PR is to fix counting one extra.

Related links

#557

Tests performed

  1. Change usage_warn and usage_error in cpu_monitor.param.yaml to make testing easy.

        usage_warn: 0.40
        usage_error: 0.90
        usage_warn_count: 2
        usage_error_count: 2
  2. Simulate CPU high load by using stress command.

  3. Verify cpu_monitor generates an error after count of 2. b'\x02' indicates an error.

    1662358072,778439042,...,b'\x00',cpu_monitor: CPU Usage,OK,se-osaka-dev32,CPU all: status,OK,CPU all: total,39.30%,...
    1662358074,154435411,...,b'\x00',cpu_monitor: CPU Usage,OK,se-osaka-dev32,CPU all: status,OK,CPU all: total,100.00%,...
    1662358075,486447870,...,b'\x02',cpu_monitor: CPU Usage,very high load,se-osaka-dev32,CPU all: status,very high load,CPU all: total,100.00%,...
  4. Also verify cpu_monitor generates an warning after count of 2. b'\x01' indicates a warning.

    1662357625,186469428,...,b'\x00',cpu_monitor: CPU Usage,OK,se-osaka-dev32,CPU all: status,OK,CPU all: total,32.37%,...
    1662357626,514459628,...,b'\x00',cpu_monitor: CPU Usage,OK,se-osaka-dev32,CPU all: status,OK,CPU all: total,54.64%,...
    1662357627,850450383,...,b'\x01',cpu_monitor: CPU Usage,high load,se-osaka-dev32,CPU all: status,high load,CPU all: total,53.82%,...

Notes for reviewers

Pre-review checklist for the PR author

The PR author must check the checkboxes below when creating the PR.

In-review checklist for the PR reviewers

The PR reviewers must check the checkboxes below before approval.

  • The PR follows the pull request guidelines.
  • The PR has been properly tested.
  • The PR has been reviewed by the code owners.

Post-review checklist for the PR author

The PR author must check the checkboxes below before merging.

  • There are no open discussions or they are tracked via tickets.
  • The PR is ready for merge.

After all checkboxes are checked, anyone who has write access can merge the PR.

@ito-san
Copy link
Contributor

ito-san commented Sep 5, 2022

@v-nakayama7440-esol Please allow me to change the contents above because it is difficult to understand.

@codecov
Copy link

codecov bot commented Sep 5, 2022

Codecov Report

Merging #1783 (0594199) into main (8c198af) will decrease coverage by 0.38%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##             main    #1783      +/-   ##
==========================================
- Coverage   10.74%   10.36%   -0.39%     
==========================================
  Files        1148     1212      +64     
  Lines       84761    87628    +2867     
  Branches    20451    20282     -169     
==========================================
- Hits         9106     9080      -26     
- Misses      66301    69067    +2766     
- Partials     9354     9481     +127     
Flag Coverage Δ *Carryforward flag
differential 0.00% <0.00%> (?)
total 10.37% <0.00%> (-0.36%) ⬇️ Carriedforward from 30560e6

*This pull request uses carry forward flags. Click here to find out more.

Impacted Files Coverage Δ
...n_capture_rviz_plugin/src/screen_capture_panel.cpp 0.00% <0.00%> (ø)
...lanner/scene_module/avoidance/avoidance_module.hpp 0.00% <0.00%> (ø)
...er/scene_module/lane_change/lane_change_module.hpp 0.00% <0.00%> (ø)
...lanner/scene_module/pull_over/pull_over_module.hpp 0.00% <ø> (ø)
...th_planner/scene_module/scene_module_interface.hpp 0.00% <ø> (ø)
...er/src/scene_module/avoidance/avoidance_module.cpp 0.00% <0.00%> (ø)
...rc/scene_module/lane_change/lane_change_module.cpp 0.00% <0.00%> (ø)
...nner/src/scene_module/pull_out/pull_out_module.cpp 0.00% <0.00%> (ø)
...er/src/scene_module/pull_over/pull_over_module.cpp 0.00% <0.00%> (ø)
...er/include/scene_module/scene_module_interface.hpp 0.00% <0.00%> (ø)
... and 208 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@ito-san
Copy link
Contributor

ito-san commented Sep 6, 2022

@v-nakayama7440-esol Please allow me to change the contents above because it is difficult to understand.

Thank you for the PR. Completed to fix the contents above.

@ito-san
Copy link
Contributor

ito-san commented Sep 6, 2022

  • Parameters

      usage_warn: 0.40
      usage_error: 0.90
      usage_warn_count: 2
      usage_error_count: 2
  • Verified a warning was generated after count of 2.

    stress-ng -k -c 12 -l 40

    ros2 topic echo /diagnostics | grep -e "cpu_monitor: CPU Usage" -B 1 -e "cpu_monitor: CPU Usage" -A 7
    image

  • Verified an error was generated after count of 2.

    stress-ng -k -c 12 -l 90

    ros2 topic echo /diagnostics | grep -e "cpu_monitor: CPU Usage" -B 1 -e "cpu_monitor: CPU Usage" -A 7
    image

Copy link
Contributor

@ito-san ito-san left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ito-san ito-san merged commit 574be20 into autowarefoundation:main Sep 6, 2022
@andoh104
Copy link

andoh104 commented Sep 7, 2022

@ito-san @v-nakayama7440-esol
This code something strange.

    if (usage_error_check_cnt_[idx] < usage_error_count_) {
      usage_error_check_cnt_[idx]++;
    }
    if (usage_error_check_cnt_[idx] >= usage_error_count_) {
      level = DiagStatus::ERROR;
    }

Same logic

    if (++usage_error_check_cnt_[idx] >= usage_error_count_) {
      level = DiagStatus::ERROR;
      usage_error_check_cnt_[idx] = usage_error_count;  // <-- do you need this code?
    }

@ito-san
Copy link
Contributor

ito-san commented Sep 7, 2022

@andoh104
Thank you for pointing out. It seems to be good to fix but I think you're missing suffix underscore(_) to usage_error_count.

usage_error_check_cnt_[idx] = usage_error_count; // <-- do you need this code?

usage_error_check_cnt_[idx] = usage_error_count_; is needed.
Because without this logic, it counts forever and ever, and eventually it may cause an overflow and exit error condition.
@v-nakayama7440-esol
Could you please consider to fix this?

@ito-san
Copy link
Contributor

ito-san commented Sep 7, 2022

@v-nakayama7440-esol

Could you please consider to fix this?

Sorry, I withdraw my request after discussing internally in TIER IV. Please leave it there!

naokimatsunawa pushed a commit to tier4/autoware.universe that referenced this pull request Sep 9, 2022
boyali pushed a commit to boyali/autoware.universe that referenced this pull request Sep 28, 2022
boyali pushed a commit to boyali/autoware.universe that referenced this pull request Oct 3, 2022
boyali pushed a commit to boyali/autoware.universe that referenced this pull request Oct 3, 2022
boyali pushed a commit to boyali/autoware.universe that referenced this pull request Oct 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants