-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
system.processes.count wildly inaccurate #31565
Comments
Pinging code owners: See Adding Labels via Comments if you do not have permissions to add labels yourself. |
This report might also join with the work on deprecating the processesscraper: #30895 |
Some update on this. Discussed it with the gopsutil maintainer and we both think counting the numeric directories in I had initially dismissed this as a strategy for the fix in gopsutil because this project's code says that's not accurate due to being multiple system calls. But for just a simple count (not status) it is still one syscall. Regarding what this project should do is an open question. It could just update to the latest version of gopsutil (once my fix is merged) and that library call will now have accurate numbers. But it seems weird to me that we are counting the directories twice. Once in this project and once by that library call. I'm not sure there is value. The above warning indicates it is necessary to get an accurate count. But the count could still be inaccurate. If I had a running process while the status was being collected but it exited before the metric was finalized then the number would be too high. I think it's better without the extra count. That way it's returning the processes at the time we started the metric. We may not know the status of some of those since they exited and the count may be off from what it is by the time the metric is actually emitted but right now it's this weird mix of the count at two different times. |
Just to follow up. My fix did get merged. I did a test build of the otel collector with my fix to gopsutil and it now returns much more reasonable number. Once the next release of gopsutil is made we could potentially just update and have much better numbers. It's still an open question if counting the numeric directories in |
I'm going to go ahead and close this. Earlier this month a bot updated the contrib repo to the latest version of gopsutil which contains my fix in that library. I still think it's not right that we are basically iterating the |
Component(s)
receiver/hostmetrics
What happened?
Description
The total process returned by the processes scraper in hostmetrics is wildly inaccurate.
This is due to gopsutil returning both process count and thread count when otel is just expecting process counts. I have filed a bug report with gopsutil. Depending on how gopsutil fixes this struct will determine how otel should proceed:
/proc
filesystem and otel remains unchanged. AFAIK there is no way so I don't think this is likely./proc
). This makes it no longer a single syscall and no better than what otel is already doing. We could simplify the otel code by just delegating it all or keep our version and just remove the call to getMiscStat
.We might also prefer to for now take that last option so the metric isn't so wildly inaccurate. I.E. it might be a little short but not several times larger. Then once gopsutil decided we can reconsider.
Steps to Reproduce
ps -A | wc -l
Expected Result
I expect the number returned by hostmetrics to be similar to the number returned by the
ps
command.Actual Result
The process count I see from hostmetrics is often 2-3 times what
ps
reports.Collector version
0.94.1
Environment information
Pop!_OS 22.04 - Tried running in docker container and on host
OpenTelemetry Collector configuration
Log output
Additional context
No response
The text was updated successfully, but these errors were encountered: