Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug Fix: SystemCPUfreq fails when any core is offline #497

Merged
merged 1 commit into from
Mar 20, 2023

Conversation

taherkk
Copy link
Contributor

@taherkk taherkk commented Mar 5, 2023

SystemCpufreq function does not check if a core is online before collecting metrics. In the case when a core is offline it fails and does not collect a couple of metrics.

Refer prometheus/node_exporter#2577 for more details

@taherkk
Copy link
Contributor Author

taherkk commented Mar 5, 2023

I have read the readme and steps for contribution and have tried my best to follow it. This my first open source contribution so please guide me if I have missed something.

@taherkk
Copy link
Contributor Author

taherkk commented Mar 5, 2023

Tests are failing because each directory under .../cpu[0-9] has a file named online in Linux except for cpu0. This contains int spicifying if the core is online or not.
I added the file in my local setup manually and the tests are passing.

Please help me in configuring that file using fixtures.

Please review @pgier

Copy link
Collaborator

@pgier pgier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me in general. I'm just wondering if there is a way to just check /sys/devices/system/cpu/offline which should be empty most of the time (no CPUs offline)?

Also to fix the tests, looks like you'll have to add the relevant online files to the test data.

system_cpu_test.go:103: stat testdata/fixtures/sys/devices/system/cpu/cpu1/online: no such file or directory

@discordianfish
Copy link
Member

You also need to sign off your commit, see the failing DCO check above

@taherkk taherkk force-pushed the syscpufreq_core_offline branch 2 times, most recently from 99d5b49 to 00bca37 Compare March 7, 2023 16:31
@taherkk
Copy link
Contributor Author

taherkk commented Mar 7, 2023

Signed off my commit.
Not sure why the lint step is failing.

@discordianfish
Copy link
Member

Here is the lint error:

#!/bin/bash -eo pipefail
git diff --exit-code
diff --git a/testdata/fixtures.ttar b/testdata/fixtures.ttar
index cb49947..b80316a 100644
--- a/testdata/fixtures.ttar
+++ b/testdata/fixtures.ttar
@@ -12543,11 +12543,6 @@ Mode: 444
 Directory: fixtures/sys/devices/system/cpu/cpu1
 Mode: 775
 # ttar - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-Path: fixtures/sys/devices/system/cpu/cpu1/online
-Lines: 1
-1
-Mode: 544
-# ttar - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 Directory: fixtures/sys/devices/system/cpu/cpu1/cpufreq
 Mode: 775
 # ttar - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
@@ -12606,6 +12601,11 @@ Lines: 1
 <unsupported>
 Mode: 664
 # ttar - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
+Path: fixtures/sys/devices/system/cpu/cpu1/online
+Lines: 1
+1
+Mode: 544
+# ttar - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 Directory: fixtures/sys/devices/system/cpu/cpu1/thermal_throttle
 Mode: 755
 # ttar - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Exited with code exit status 1
CircleCI received exit code 1

You probably didn't run make fixtures to create the ttar file

@taherkk
Copy link
Contributor Author

taherkk commented Mar 8, 2023

I'm exploring the idea of checking for offline cpu's in /sys/devices/system/cpu/offline

Sharing the approach here:
Get cpu's from the file.
Expand ranges.
Filter cpu's which not physically present (Though I have a hexacore CPU the offline file shows a range of 12-15 which are not physically present)

@taherkk
Copy link
Contributor Author

taherkk commented Mar 12, 2023

@pgier, I have updated the code according to the approach mentioned earlier.

Copy link
Member

@discordianfish discordianfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@AdarshdeepCheema
Copy link

@taherkk I think it is causing #530. Can u please confirm
FYI @pgier @discordianfish

@taherkk
Copy link
Contributor Author

taherkk commented Jun 6, 2023

raised PR #534 for it @pgier

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants