Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[hostmetricsreceiver] Add important per-process counters #12482

Closed
dgcom opened this issue Jul 15, 2022 · 23 comments
Closed

[hostmetricsreceiver] Add important per-process counters #12482

dgcom opened this issue Jul 15, 2022 · 23 comments
Labels
good first issue Good for newcomers help wanted Extra attention is needed priority:p2 Medium receiver/hostmetrics

Comments

@dgcom
Copy link

dgcom commented Jul 15, 2022

Is your feature request related to a problem? Please describe.
There are several very important per-process metrics which are not yet collected by host metrics receiver, for example:

  • process thread count
  • process open handles count
  • process open file descriptor count

These can be considered process golden metrics and are needed for most troubleshooting and trend analysis to make sure there are no threads/handles leaks in the process.

Describe the solution you'd like
Collect and emit at least metrics mentioned above.
The ideal solution - collect more per-process metrics (optionally) - include those which are being collected by leading infrastructure monitoring tools on the market.

Describe alternatives you've considered
I have analyzed per-process metrics collected by such competitor tools like New Relic and Data Dog and their infrastructure agents are able to collect these metrics, however I would like to use OTEL collector as unified agent instead.

Additional context
There is a (great) trend to switch to OTEL host metrics receiver for infrastructure monitoring (ex. Signoz, Splunk Observability, New Relic etc.) and if such tools utilize same host metrics receiver, they will all miss very important and useful metrics making troubleshooting and observability much harder.

@github-actions
Copy link
Contributor

Pinging code owners: @dmitryax

@TylerHelmuth TylerHelmuth added help wanted Extra attention is needed good first issue Good for newcomers labels Jul 15, 2022
@TylerHelmuth
Copy link
Member

@dgcom is this something you plan to work on? If so I will assign the issue to you.

@dgcom
Copy link
Author

dgcom commented Jul 15, 2022

@dgcom is this something you plan to work on? If so I will assign the issue to you.

I would love to, but I don't have enough time and skills in Go currently to contribute...

@evan-bradley
Copy link
Contributor

@TylerHelmuth I can take this one.

@TylerHelmuth
Copy link
Member

@evan-bradley it's yours.

@evan-bradley
Copy link
Contributor

evan-bradley commented Aug 3, 2022

process open handles count

@dgcom Just to clarify, are you talking about the Windows concept of a process handle? If so, I do not believe the library that the hostmetricsreceiver uses to gather process data currently supports getting this information.

The other metrics can be easily scraped. I will be adding voluntary and involuntary context switch counts and a open file descriptor count.

@dgcom
Copy link
Author

dgcom commented Aug 3, 2022

For Windows, handles count is "\Process(*)\Handle Count" perfmon counter.
In PowerShell this is available with this example:

# All processes
get-counter "\Process(*)\Handle Count"
# Specific process
get-counter "\Process(explorer)\Handle Count"
# List all available counters for processes
(Get-Counter -ListSet Process).Paths

For thread count, it is "\Process(*)\Thread Count"
Windows does not have file descriptors counter, so this should be available only for Linux.

Looking at the library used by the receiver - leoluk/perflib_exporter: perflib-based Prometheus exporter for Windows and low-level Go perflib library - I don't see a reason why it wouldn't be able to retrieve available counters...

@evan-bradley
Copy link
Contributor

Thank you for the clarification. Most process metrics are generated using data obtained from gopsutil, which is the library I was referring to that doesn't yet support getting a process handle count.

It does look like perflib_exporter should be able to retrieve this information. I have limited working knowledge around Windows and do not have a Windows environment readily available to test with, so someone else will have to implement that metric within the hostmetricsreceiver.

@dgcom
Copy link
Author

dgcom commented Aug 4, 2022

I looked at gopsutil and it does not use performance counters at all, which explains why it only supports cpu, memory and limited number of IO counters.
The best option would be to change process scraper implementation to use perflib_exporter, which provides more per-process data and that data is compatible with many other Windows monitoring implementations.
And I know it is hard to write such low-level cross-platform implementations...

@github-actions
Copy link
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Nov 10, 2022
@dgcom
Copy link
Author

dgcom commented Nov 10, 2022

What a coincidence - I was actually checking changes in hostmetrics receiver when the bot posted 60 day notice...

I see that process.open_file_descriptors and process.threads are now available:
opentelemetry-collector-contrib/documentation.md at main · open-telemetry/opentelemetry-collector-contrib

But process handles seems to be missing...

@dmitryax dmitryax removed the Stale label Nov 10, 2022
@evan-bradley
Copy link
Contributor

@dgcom I wasn't able to add process handles as part of my work, as I don't have a Windows environment to test with. I will leave this issue available for someone else to pick that up.

@evan-bradley evan-bradley added the help wanted Extra attention is needed label Nov 10, 2022
@evan-bradley evan-bradley removed their assignment Nov 10, 2022
@dgcom
Copy link
Author

dgcom commented Nov 10, 2022

@evan-bradley Ok, that's fine, thank you for covering Linux side of things! I'll see if I finally get some time to dig into this myself by the end of the year... Unless someone else will be kind enough to pick this up before that.

@github-actions
Copy link
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Jan 10, 2023
@dgcom
Copy link
Author

dgcom commented Jan 11, 2023

This issue has been inactive for 60 days.

I strongly believe that we should keep this open until it is fully resolved.

@mx-psi mx-psi removed the Stale label Feb 7, 2023
@github-actions
Copy link
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Apr 10, 2023
@dgcom
Copy link
Author

dgcom commented Apr 10, 2023

This issue is still relevant and should be kept open until it is resolved.

@braydonk
Copy link
Contributor

braydonk commented May 26, 2023

I would like handle metrics for Windows as well. I have issue #21379 open for this along with PR #22813 that adds support for a Windows exclusive process.handles metric. It doesn't use the performance counter but instead uses NtQuerySystemInformation. This solution does result in only one new syscall per-scrape which is why I chose that, but perhaps the performance counter would be preferred for simplicity.

@braydonk
Copy link
Contributor

I ended up changing the PR to use a WMI query instead and it ended up being the simplest way to do it. The PR is still waiting on a review at this stage.

@dgcom
Copy link
Author

dgcom commented Jun 21, 2023

This is great news! Hope we'll close this out once PR is merged...

dmitryax pushed a commit that referenced this issue Jun 27, 2023
**Description:** <Describe what has changed.>
Adds a new Windows-exclusive metric called process.handles, which
represents the handle count of the given process. When enabled, the
receiver will make a WMI Query at the beginning of each scrape to update
the handle count for all processes on the system. If the metric is
enabled on a platform other than Windows, an error will be produced when
attempting to refresh handle counts. This matches the rough behaviour of
the Linux exclusive `process.open_file_descriptors`.

**Link to tracking Issue:** <Issue number if applicable>
#21379 
#12482

**Testing:** <Describe what testing was performed and which tests were
added.>
Ran the binary with the following configuration:
```
receivers:
  hostmetrics:
    collection_interval: 2s
    scrapers:
      cpu: {}
      disk: {}
      filesystem: {}
      load: {}
      memory: {}
      network: {}
      paging: {}
      process:
        mute_process_name_error: true
        metrics:
          process.handles:
            enabled: true
      processes: {}

exporters:
  file:
    path: x.json

service:
  pipelines:
    metrics:
      receivers: [hostmetrics]
      exporters: [file]
```

The following is an example result of a scrape with this configuration.
https://gist.github.com/braydonk/c97996272574319e03111dc79076a1bd
@braydonk
Copy link
Contributor

braydonk commented Jul 5, 2023

The new process.handles metric is in v0.81!

@dgcom
Copy link
Author

dgcom commented Jul 5, 2023

The new process.handles metric is in v0.81!

Great, now need to test it out!

@dgcom
Copy link
Author

dgcom commented Jul 7, 2023

I tested 0.8.1 and I can see threads and handles counts in Windows - this is great!

This issue can be closed now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers help wanted Extra attention is needed priority:p2 Medium receiver/hostmetrics
Projects
None yet
Development

No branches or pull requests

6 participants