
Add perf exporter #1274

Merged — 1 commit merged into prometheus:master on May 7, 2019
Conversation

@hodgesds (Contributor) commented Mar 2, 2019

This implements #1238 by adding perf-based profiling metrics. It's still a work in progress, but so far it seems to be working locally. I'd like to instrument more of the already-available metrics, but figured I'd put this out there for others to see.

@hodgesds force-pushed the perf-exporter branch 2 times, most recently from f17994f to 0468a0c — March 2, 2019 03:38
@hodgesds (Contributor, Author) commented Mar 2, 2019

Example of L1 data cache hit rate: [screenshot omitted]

@SuperQ (Member) commented Mar 2, 2019

I'm going to have to dig deeper into the newly included libraries, but so far it looks like everything is being returned as gauge values.

We prefer to expose raw underlying counters in Prometheus, rather than try to use pre-calculated rates. Is this possible with perf?

@hodgesds (Contributor, Author) commented Mar 2, 2019

Yeah, most of the underlying metrics are counters; I just copy-pasted some of the other exporter code to get this started. I think most of the metrics will be mapped to counters. The one thing I need to figure out is how overflows are normally handled in other exporters.

@SuperQ (Member) commented Mar 2, 2019

For counters, they should all be named `_total` and use `prometheus.CounterValue`.

If you have concerns about uint64 counters exceeding float64's 2^53 exact-integer range, take a look at how we handle it in the snmp_exporter.
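As an illustration of that convention, here is a minimal, hypothetical client_golang sketch — not the PR's actual code; the collector type and the counter source are assumptions:

```go
package collector

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// perfCollector is a hypothetical collector holding raw per-CPU counters.
type perfCollector struct {
	pageFaults []uint64 // hypothetical per-CPU counter values read from perf
}

// Descriptor for a raw, monotonically increasing counter: note the _total
// suffix in the name and the CounterValue type used below.
var pageFaultsDesc = prometheus.NewDesc(
	"node_perf_page_faults_total",
	"Number of page faults.",
	[]string{"cpu"}, nil,
)

// Collect emits the raw counter values; rate calculation is left to the
// Prometheus server (e.g. rate(node_perf_page_faults_total[5m])).
func (c *perfCollector) Collect(ch chan<- prometheus.Metric) {
	for cpu, v := range c.pageFaults {
		ch <- prometheus.MustNewConstMetric(
			pageFaultsDesc, prometheus.CounterValue, float64(v), fmt.Sprint(cpu),
		)
	}
}
```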

@hodgesds changed the title from "[WIP] Add perf exporter" to "Add perf exporter" — Mar 5, 2019
@hodgesds (Contributor, Author) commented Mar 5, 2019

Pushed up changes; I think this should be good if you want to test it locally.

@SuperQ (Member) commented Mar 5, 2019

I started with `echo 2 | sudo tee /proc/sys/kernel/perf_event_paranoid`.

Then I ran a couple of collections, and it seems to be resetting counters after each scrape:

$ curl -s 'http://localhost:9100/metrics?collect\[\]=perf' | grep node_perf_page_faults_total
# HELP node_perf_page_faults_total Number of page faults
# TYPE node_perf_page_faults_total counter
node_perf_page_faults_total{cpu="0"} 16
node_perf_page_faults_total{cpu="1"} 0
node_perf_page_faults_total{cpu="2"} 60
node_perf_page_faults_total{cpu="3"} 1796
node_perf_page_faults_total{cpu="4"} 31
node_perf_page_faults_total{cpu="5"} 0
node_perf_page_faults_total{cpu="6"} 42
node_perf_page_faults_total{cpu="7"} 5
$ curl -s 'http://localhost:9100/metrics?collect\[\]=perf' | grep node_perf_page_faults_total
# HELP node_perf_page_faults_total Number of page faults
# TYPE node_perf_page_faults_total counter
node_perf_page_faults_total{cpu="0"} 2
node_perf_page_faults_total{cpu="1"} 3
node_perf_page_faults_total{cpu="2"} 10
node_perf_page_faults_total{cpu="3"} 0
node_perf_page_faults_total{cpu="4"} 0
node_perf_page_faults_total{cpu="5"} 10
node_perf_page_faults_total{cpu="6"} 4
node_perf_page_faults_total{cpu="7"} 62

Sadly, this seems to persist across restarts of the exporter as well.

@hodgesds (Contributor, Author) commented Mar 5, 2019

I'm not sure what capabilities you are running with, so that may be a factor. One thing to note from the perf_event_open man page:

> pid == -1 and cpu >= 0
> This measures all processes/threads on the specified CPU. This requires the CAP_SYS_ADMIN capability or a /proc/sys/kernel/perf_event_paranoid value of less than 1.

Since the collector is currently configured to trace all processes on a specific CPU, I doubt a paranoid value of 2 (allow only user-space measurements) would work.
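For context, a minimal sketch of the syscall being discussed, using golang.org/x/sys/unix (an assumption for illustration — the PR itself goes through a perf helper library). With pid == -1 and cpu == 0 it measures all processes on CPU 0, which is exactly the case the man page excerpt covers:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unsafe"

	"golang.org/x/sys/unix"
)

func main() {
	attr := unix.PerfEventAttr{
		Type:   unix.PERF_TYPE_HARDWARE,
		Config: unix.PERF_COUNT_HW_CPU_CYCLES,
	}
	attr.Size = uint32(unsafe.Sizeof(attr))

	// pid == -1, cpu == 0: measure every process on CPU 0. Per the man
	// page, this needs CAP_SYS_ADMIN or kernel.perf_event_paranoid < 1.
	fd, err := unix.PerfEventOpen(&attr, -1, 0, -1, unix.PERF_FLAG_FD_CLOEXEC)
	if err != nil {
		fmt.Println("perf_event_open:", err) // typically EACCES when paranoid >= 1
		return
	}
	defer unix.Close(fd)

	// The counter is a plain uint64 readable from the fd.
	buf := make([]byte, 8)
	if _, err := unix.Read(fd, buf); err != nil {
		fmt.Println("read:", err)
		return
	}
	fmt.Println("cycles:", binary.LittleEndian.Uint64(buf))
}
```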

@SuperQ (Member) commented Mar 5, 2019

Even with perf_event_paranoid set to -1, I still get reset counters.

I'm also seeing this error after a few scrapes:

2019/03/05 17:40:03 http: Accept error: accept tcp [::]:9100: accept4: too many open files; retrying in 5ms

Looks like there may be a file descriptor leak.

@hodgesds (Contributor, Author) commented Mar 6, 2019

Real dumb mistake: I was assuming that NewPerfCollector was only being called once, and that is certainly not the case. I was able to replicate your results and changed the code so that profiler initialization is wrapped in a `sync.Once`. From there I was able to see things incrementing properly without leaking FDs:

~ daniel@p50 ✔ lsof -p $(pgrep -f node_exporter) | wc -l
305
~ daniel@p50 ✔ curl -s 'http://localhost:9100/metrics?collect\[\]=perf' | grep node_perf_page_faults_total
# HELP node_perf_page_faults_total Number of page faults
# TYPE node_perf_page_faults_total counter
node_perf_page_faults_total{cpu="0"} 27447
node_perf_page_faults_total{cpu="1"} 27768
node_perf_page_faults_total{cpu="2"} 30826
node_perf_page_faults_total{cpu="3"} 27146
node_perf_page_faults_total{cpu="4"} 20667
node_perf_page_faults_total{cpu="5"} 17671
node_perf_page_faults_total{cpu="6"} 21991
node_perf_page_faults_total{cpu="7"} 27716
~ daniel@p50 ✔ lsof -p $(pgrep -f node_exporter) | wc -l
305
~ daniel@p50 ✔ curl -s 'http://localhost:9100/metrics?collect\[\]=perf' | grep node_perf_page_faults_total
# HELP node_perf_page_faults_total Number of page faults
# TYPE node_perf_page_faults_total counter
node_perf_page_faults_total{cpu="0"} 27953
node_perf_page_faults_total{cpu="1"} 28097
node_perf_page_faults_total{cpu="2"} 31501
node_perf_page_faults_total{cpu="3"} 28217
node_perf_page_faults_total{cpu="4"} 21789
node_perf_page_faults_total{cpu="5"} 18465
node_perf_page_faults_total{cpu="6"} 22425
node_perf_page_faults_total{cpu="7"} 28083
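A minimal sketch of the fix described above — all identifiers here are illustrative, not the PR's actual code — guarding profiler initialization with a `sync.Once` so that repeated collector construction reuses one set of perf file descriptors:

```go
package collector

import "sync"

// perfProfiler stands in for a real perf handle (illustrative only).
type perfProfiler struct{ fd int }

var (
	profilersOnce sync.Once
	profilers     []perfProfiler
	profilersErr  error
)

// The expensive perf_event_open calls happen only on the first invocation;
// later scrapes and repeated collector construction reuse the same file
// descriptors, so counters keep increasing and no FDs leak.
func newPerfProfilers() ([]perfProfiler, error) {
	profilersOnce.Do(func() {
		profilers, profilersErr = openPerCPUProfilers()
	})
	return profilers, profilersErr
}

// openPerCPUProfilers is a placeholder for the real per-CPU setup.
func openPerCPUProfilers() ([]perfProfiler, error) {
	return []perfProfiler{}, nil
}
```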

@SuperQ (Member) commented Mar 8, 2019

A couple of documentation items:

  • Please add a [FEATURE] to the CHANGELOG.md.
  • We should document the correct minimum permissions needed to enable this. I'm doing some testing to see what that is.

I would add something like this to the README.md:

The perf collector may not work by default on all Linux systems due to kernel security settings. To allow access, set the following kernel sysctl.

sysctl -w kernel.perf_event_paranoid=X

See the [upstream docs](link here).

@hodgesds (Contributor, Author) commented Mar 8, 2019

👍 I pushed up some doc changes and cleaned up the commit history.

@hodgesds force-pushed the perf-exporter branch 3 times, most recently from 6893f47 to 24c8c5b — March 14, 2019 15:07
Inline review on collector/perf_linux.go:
continue
}

if hwProfile.CPUCycles != nil {
Reviewer (Member):

Would be nice to refactor this a bit, maybe by using a map and looping over it like we do in similar cases in other collectors.

@hodgesds (Contributor, Author):

I made a couple of attempts at this, but creating the map becomes rather difficult without using reflection, because not all struct fields may be present. I can give it another attempt; it might save a few of the redundant checks, but it would probably be slower. What do you think?

@hodgesds (Contributor, Author):

I think I got it working pretty well now.
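For reference, a hedged sketch of the map-and-loop refactor discussed in this thread — field, descriptor, and function names are illustrative, with the nil checks mirroring the *uint64 fields visible in the diff above:

```go
package collector

import "github.com/prometheus/client_golang/prometheus"

// HardwareProfile mirrors the shape visible in the diff above: optional
// counters are *uint64 and nil when unavailable (hypothetical subset).
type HardwareProfile struct {
	CPUCycles    *uint64
	Instructions *uint64
	CacheRefs    *uint64
}

// emitHardwareMetrics replaces one nil check per field with a single table
// and loop, skipping counters the kernel did not provide.
func emitHardwareMetrics(ch chan<- prometheus.Metric, descs map[string]*prometheus.Desc, p HardwareProfile, cpu string) {
	counters := map[string]*uint64{
		"cpucycles_total":    p.CPUCycles,
		"instructions_total": p.Instructions,
		"cache_refs_total":   p.CacheRefs,
	}
	for name, v := range counters {
		if v == nil {
			continue // counter unsupported or not collected on this CPU
		}
		ch <- prometheus.MustNewConstMetric(
			descs[name], prometheus.CounterValue, float64(*v), cpu,
		)
	}
}
```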

@discordianfish (Member) commented:
@hodgesds Can you address the remaining comments?

@hodgesds force-pushed the perf-exporter branch 3 times, most recently from c37234b to 9b368d5 — April 4, 2019 02:09
@hodgesds (Contributor, Author) commented Apr 4, 2019
hodgesds commented Apr 4, 2019

Updated, let me know what you think. In the future I'd like to add support for kprobes, but that requires more thought about configuration, if you have any ideas (does it make sense to have a config file?).

@discordianfish (Member) left a review:

Some changes requested. Besides that, I still think we could do better to avoid repetition, but I think it's fine for now.

@SuperQ (Member) commented Apr 15, 2019

Ping, please rebase this.

@hodgesds force-pushed the perf-exporter branch 10 times, most recently from 906ba8d to f121e09 — April 22, 2019 12:34
@hodgesds (Contributor, Author) commented:
Added a test that will skip if perf_event_paranoid is not properly set.

Signed-off-by: Daniel Hodges <[email protected]>
@discordianfish (Member) commented:
Okay, looks good to me. We have worse cases of repetition and I know it's tricky.

@discordianfish requested a review from SuperQ — May 3, 2019 11:28
@SuperQ (Member) left a review:

LGTM

@SuperQ merged commit 7882009 into prometheus:master — May 7, 2019
@agolomoodysaada commented:
Are these new metrics enabled by default? Are there docs written somewhere for this new feature?

@hodgesds (Contributor, Author) commented May 7, 2019

> Are these new metrics enabled by default? Are there docs written somewhere for this new feature?

See the README.

oblitorum pushed a commit to shatteredsilicon/node_exporter that referenced this pull request — Apr 9, 2024