inputs.vsphere, reports duplicate points #11168
Comments
Hi, Would you be willing to share some of the example duplicate points from a telegraf log? It would be helpful to use the Thanks! |
Hi @powersj
|
What output are you pushing this to? And where does the output in your comment above come from? When looking at the metrics.out file, I first looked at the metrics at the timestamp you first referenced. For example:
- vsphere_vm_power,clustername=WF_BGLR_INT_Cluster01,dcname=WF_BGLR_INT_DC,esxhostname=gty-a003.test.com,guest=winLonghorn64,guesthostname=WIN-PR69FI29KLE,host=telegraf_rhel_nossl,moid=vm-129,source=skajagar-win2008,uuid=420b5c3c-6f1b-d231-314b-23b88901cd98,vcenter=int-vc.test.com,vmname=skajagar-win2008 power_average=244i 1653477360000000000
+ vsphere_vm_power,clustername=WF_BGLR_INT_Cluster01,dcname=WF_BGLR_INT_DC,esxhostname=gty-a003.test.com,guest=winLonghorn64,guesthostname=WIN-PR69FI29KLE,host=telegraf_rhel_nossl,moid=vm-129,source=skajagar-win2008,uuid=420b5c3c-6f1b-d231-314b-23b88901cd98,vcenter=int-vc.test.com,vmname=skajagar-win2008 power_average=247i 1653477360000000000
In the above, the tags are all the same, but the field value reported for power_average is different between the two. A similar example looking at the memory output:
- vsphere_vm_mem,clustername=WF_BGLR_INT_Cluster01,dcname=WF_BGLR_INT_DC,esxhostname=gty-a003.test.com,guest=winLonghorn64,guesthostname=WIN-FCCHP2L47JJ,host=telegraf_rhel_nossl,moid=vm-131,source=skajagar-win2016-Activated,uuid=420bc4e3-d966-403e-f6f7-7bb6223b6fae,vcenter=int-vc.test.com,vmname=skajagar-win2016-Activated swapinRate_average=0i,active_average=419429i,swapout_average=0i,swapin_average=0i,swapoutRate_average=0i,latency_average=0,vmmemctl_average=0i,usage_average=9.99,granted_average=4194304i 1653477360000000000
+ vsphere_vm_mem,clustername=WF_BGLR_INT_Cluster01,dcname=WF_BGLR_INT_DC,esxhostname=gty-a003.test.com,guest=winLonghorn64,guesthostname=WIN-FCCHP2L47JJ,host=telegraf_rhel_nossl,moid=vm-131,source=skajagar-win2016-Activated,uuid=420bc4e3-d966-403e-f6f7-7bb6223b6fae,vcenter=int-vc.test.com,vmname=skajagar-win2016-Activated active_average=398458i,granted_average=4194304i,swapout_average=0i,swapin_average=0i,swapoutRate_average=0i,swapinRate_average=0i,latency_average=0,vmmemctl_average=0i,usage_average=9.49 1653477360000000000
Here the So I would not say that these are duplicates. What I do wonder about is why the metric is showing up twice with the same timestamp. Can you please provide the rest of your configuration? |
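The distinction drawn above (same tags and timestamp, but different field values) can be checked mechanically on a captured metrics file. The following is only a rough sketch, not part of Telegraf: the function names `parse_line`, `find_collisions` and `classify` are made up for illustration, the parser assumes no escaped spaces or commas in the line protocol, and the sample lines use a shortened tag set rather than the full lines from the thread.

```python
from collections import defaultdict


def parse_line(line):
    """Split one line-protocol record into (series_key, fields, timestamp).

    Simplification (assumption): no escaped spaces or commas, which holds
    for the sample metrics quoted in this thread.
    """
    head, timestamp = line.strip().rsplit(" ", 1)
    series_key, field_part = head.split(" ", 1)
    # A dict makes the comparison independent of field order.
    fields = dict(pair.split("=", 1) for pair in field_part.split(","))
    return series_key, fields, int(timestamp)


def find_collisions(lines):
    """Return every (series_key, timestamp) that appears more than once."""
    groups = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        series_key, fields, timestamp = parse_line(line)
        groups[(series_key, timestamp)].append(fields)
    return {key: points for key, points in groups.items() if len(points) > 1}


def classify(points):
    """Tell true duplicates apart from points whose field values differ."""
    first = points[0]
    if all(point == first for point in points[1:]):
        return "exact duplicate"
    return "conflicting values"


# Shortened stand-ins for the vsphere_vm_power lines quoted above.
sample = [
    "vsphere_vm_power,moid=vm-129,vcenter=int-vc.test.com power_average=244i 1653477360000000000",
    "vsphere_vm_power,moid=vm-129,vcenter=int-vc.test.com power_average=247i 1653477360000000000",
    "vsphere_vm_power,moid=vm-131,vcenter=int-vc.test.com power_average=10i 1653477360000000000",
]

collisions = find_collisions(sample)
for (series_key, timestamp), points in collisions.items():
    print(timestamp, classify(points), series_key)
```

The series key is compared as a plain string, so this relies on the tags being written in the same order on every line, as they are in the output quoted above.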
@powersj Thanks for the analysis.
- vsphere_vm_power,clustername=WF_BGLR_INT_Cluster01,dcname=WF_BGLR_INT_DC,esxhostname=gty-a003.test.com,guest=winLonghorn64,guesthostname=WIN-PR69FI29KLE,host=telegraf_rhel_nossl,moid=vm-129,source=skajagar-win2008,uuid=420b5c3c-6f1b-d231-314b-23b88901cd98,vcenter=int-vc.test.com,vmname=skajagar-win2008 power_average=244i 1653477360000000000
+ vsphere_vm_power,clustername=WF_BGLR_INT_Cluster01,dcname=WF_BGLR_INT_DC,esxhostname=gty-a003.test.com,guest=winLonghorn64,guesthostname=WIN-PR69FI29KLE,host=telegraf_rhel_nossl,moid=vm-129,source=skajagar-win2008,uuid=420b5c3c-6f1b-d231-314b-23b88901cd98,vcenter=int-vc.test.com,vmname=skajagar-win2008 power_average=247i 1653477360000000000
|
As I don't think the plugin would suddenly start generating duplicate metrics, it looks like you have two plugins running at the same time. Some follow up questions:
If you add the following to your config, does the output change?
[[inputs.vsphere]]
name_override = "vsphere_local"
vcenters = [ "https://vcenter.local/sdk" ]
I would expect to see a single metric called "vsphere_local". |
What version of vsphere are you running?
If you exclude all the vm_metrics with vm_metric_exclude = [ "*" ] and delete the vm_metric_include array, do you still get duplicates?
I am suspecting these two places as we are adjusting the time here: |
Does that mean you are still getting duplicates when excluding VM metrics? |
I disabled the VM metrics as you suggested, captured the results for 7 minutes, and found that nearly 33% of the points were duplicates.
|
Can you do one more thing: run with --debug and get me the full logs? The plugin appears to have quite a few debug statements and I'd like to follow along with what is happening. |
Please find the log file with debug enabled |
@powersj is anyone looking into this? |
It's on my list, but not something I've gotten back around to. I have only briefly looked at the log you provided, and I do believe the next step is to add a bit more logging to see where duplicates are getting created. |
If you need any help, please let me know. I can add extra logging wherever you suggest and provide you with the output. |
@powersj any idea when this will be prioritized? |
Thanks again for the logs. It would be really nice to isolate it down to a metric that we know is duplicated. For example, we identified the
[agent]
interval = "60s"
debug = true
[[inputs.vsphere]]
vcenters = [ "https://vcenter.local/sdk" ]
username = "[email protected]"
password = "secret"
vm_metric_include = ["power.power.average"]
host_metric_exclude = ["*"]
cluster_metric_exclude = ["*"]
datastore_metric_exclude = ["*"]
datacenter_metric_exclude = ["*"]
resourcepool_metric_exclude = ["*"]
Then look at the data and let me know if you still see the duplicates. If so, please include the debug log. If that does not produce duplicates, then I would make the following change to the config:
--- host_metric_exclude = ["*"]
+++ host_metric_exclude = ["power.power.average"]
Based on the inventory path example, it does look like there are multiple ways to reference a VM. Either via the host folder path or the VM path. I have not looked deeper into this, but that seems like an obvious place where duplicate metrics could be showing up. |
@powersj as we know, real-time metrics are available at 20-second granularity. Say we have a one-minute refresh rate, and the last refresh, at 10:30, reported points with timestamps {10:30:00, 10:31:00}. The next refresh happens at 10:31, and for real-time metrics we get 3 points because of the 20-second granularity. Say those 3 points have timestamps {10:30:20, 10:30:40, 10:31:00}. After Truncate, they are converted to two points with timestamps {10:30:00, 10:31:00}, and in this case the first point becomes a duplicate because we already reported a point with that timestamp in the last refresh.
I did some experiments: with the alignSamples function removed I found no duplicates, and with the alignSamples function in place about 25% of the points were duplicates. |
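The arithmetic in the comment above can be reproduced in isolation. This is a minimal sketch of plain timestamp truncation only; it is not the plugin's alignSamples code, and the helper names `truncate` and `reported_timestamps`, the date, and the assumed samples of the previous window (10:29:20, 10:29:40, 10:30:00) are illustrative assumptions.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def truncate(ts, interval):
    """Round ts down to a whole multiple of interval since the epoch."""
    return ts - (ts - EPOCH) % interval


def reported_timestamps(samples, interval):
    """Timestamps that remain once every sample is truncated to the interval."""
    return sorted({truncate(sample, interval) for sample in samples})


interval = timedelta(seconds=60)  # collection interval
step = timedelta(seconds=20)      # real-time sample granularity

# Assumed samples for two consecutive collection windows.
previous_window = [datetime(2022, 5, 25, 10, 29, 20) + i * step for i in range(3)]
next_window = [datetime(2022, 5, 25, 10, 30, 20) + i * step for i in range(3)]

previous_reported = reported_timestamps(previous_window, interval)
next_reported = reported_timestamps(next_window, interval)

# Any timestamp present in both windows would be emitted twice.
repeated = sorted(set(previous_reported) & set(next_reported))
print([t.strftime("%H:%M:%S") for t in repeated])  # prints ['10:30:00']
```

Under these assumptions, the sample that lands exactly on the window boundary (10:30:00) and the two samples that follow it (10:30:20, 10:30:40) truncate to the same timestamp, even though they were collected in different refreshes.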
@powersj does this analysis make any sense or am I missing something here? |
We do not all know this, at least I did not :) Telegraf has hundreds of plugins connecting to various services and software, but none of the maintainers currently know all of the plugins on a deep level. I do see the discussion about this in realtime vs historical in the README, but it was not clear to me that it could be a culprit yet. As such, I was still really hoping to at least see the logs with a single metric included to hopefully learn more about how a single metric makes its way through the plugin and see how the vsphere interval was set and if any padding was occurring.
Said another way, your hypothesis is that at Telegraf's flush interval N+1 we produce a metric that is identical to a metric that was in flush interval N, due to the alignSamples function? It is not clear to me why this function was even added in #5113 and what problem it solves. @prydin do you have details or help you could provide on why a user might be seeing duplicate metrics come out of the alignSamples function? |
Sorry for the delay. Let me have a look at alignSamples. The idea was to avoid duplicates, not create them, so something is clearly amiss there. |
Hi @prydin did you get a chance to look into it? |
Hi, Wanted to check in and see if you both were able to resolve the issue? |
@powersj I'm trying to reproduce this in my lab right now. I'll get back to you once I have an idea what's going on. |
Thanks! |
Hello! I am closing this issue due to inactivity. I hope you were able to resolve your problem, if not please try posting this question in our Community Slack or Community Page. Thank you! |
@yogeshprasad can you try the above suggestion? |
@yogeshprasad not sure if you have already, but can you try the PR in #12259? |
Relevant telegraf.conf
Logs from Telegraf
System info
Telegraf v1.22.4
Docker
No response
Steps to reproduce
Expected behavior
There should not be any duplicate points
Actual behavior
Nearly 10% of points are duplicates
Additional info
This is the function which creates duplicate points as we are adjusting the time here.
https://github.com/influxdata/telegraf/blob/master/plugins/inputs/vsphere/endpoint.go#L1104