
inputs-vsphere did not complete within its interval #5296

Closed
aksharbarot opened this issue Jan 16, 2019 · 19 comments
Labels: area/vsphere, bug (unexpected problem or unintended behavior)

@aksharbarot

aksharbarot commented Jan 16, 2019

Relevant telegraf.conf:

System info:

CentOS Linux release 7.6.1810 (Core)

Telegraf 1.9.2 (git: HEAD dda8079)
influxdb-1.7.2-1

Steps to reproduce:

  1. Installed Telegraf
  2. Enabled the vSphere plugin

Expected behavior:

Graph should be continuous, with no gaps

Actual behavior:

[screenshot]

Additional info:

2019-01-16T13:41:10Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-16T13:41:20Z D! [outputs.influxdb] wrote batch of 28 metrics in 7.071246ms
2019-01-16T13:41:20Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics.
2019-01-16T13:41:20Z W! [agent] input "inputs.vsphere" did not complete within its interval


If I restart Telegraf, it works fine for a few minutes, then the same error starts again and no graph is produced.

@aksharbarot aksharbarot changed the title inputs-sphere did not complete within its interval inputs-vsphere did not complete within its interval Jan 16, 2019
@danielnelson danielnelson added bug unexpected problem or unintended behavior area/vsphere labels Jan 16, 2019
@danielnelson
Contributor

Could you check if this still occurs in the nightly builds?

@aksharbarot
Author

aksharbarot commented Jan 17, 2019

@danielnelson Thank you! I moved to the version below as per your suggestion.

Telegraf version:

Telegraf unknown (git: master e95b88e)

All graphs are now fairly continuous, but there is an issue with CPU collection only, which is missing for all ESXi hosts.

CPU graph with missing data in between:
[screenshot]

It should look like this:
[screenshot]

@danielnelson
Contributor

It may be that you need to increase your collection interval, depending on how many values you are collecting and the amount of time it takes. One way to see how much time the plugin is taking is to enable the internal plugin and look at the internal_gather,plugin=vsphere and internal_vsphere metrics.
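For reference, enabling the internal plugin only takes a couple of lines in the config (a minimal sketch, not tailored to your setup):

[[inputs.internal]]
  ## also collect Go runtime memstats from the agent itself (optional)
  collect_memstats = true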

Can you also attach your configuration for this plugin?

@prydin
Contributor

prydin commented Jan 20, 2019

Please share your configuration file with us! A collection interval shorter than 60s is generally not recommended for vSphere. You may also want to increase the collection concurrency.
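For illustration only (the vCenter URL and credentials are placeholders), those settings go directly in the plugin section:

[[inputs.vsphere]]
  ## the per-plugin interval overrides the agent interval; 60s or higher is recommended
  interval = "60s"
  vcenters = [ "https://vcenter.example.com/sdk" ]
  username = "user"
  password = "pass"
  ## number of goroutines used for collection and discovery of objects and metrics
  collect_concurrency = 3
  discover_concurrency = 3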

@aksharbarot
Author

aksharbarot commented Jan 21, 2019

FYI: I have 4 vCenters configured, with 500+ hosts altogether.

I have also observed this error many times: [inputs.vsphere]: Error in plugin: While collecting host: Post https://VC/sdk: context deadline exceeded

Metrics are not collected from the vCenters with a larger number of hosts. (The 2 vCenters with only a few hosts show continuous graphs.) The only issue is with the larger inventory.

Okay, here is the config file:
[agent]
interval = "10s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "10s"
flush_jitter = "0s"
precision = ""
debug = true
quiet = false
logfile = "/var/log/telegraf/telegraf.log"
hostname = ""
omit_hostname = false

[[outputs.influxdb]]

urls = ["http://127.0.0.1:8086"]
database = "telegraf"
timeout = "5s"
username = "user"
password = "pass"

[[inputs.vsphere]]
vcenters = [ "VC1","VC2","VC3", VC4]
username = user
password = pass
vm_metric_exclude = [ "*" ]
host_metric_include = [
"cpu.coreUtilization.average",
"cpu.costop.summation",
"cpu.demand.average",
"cpu.idle.summation",
"cpu.latency.average",
"cpu.readiness.average",
"cpu.ready.summation",
"cpu.swapwait.summation",
"cpu.usage.average",
"cpu.usagemhz.average",
"cpu.used.summation",
"cpu.utilization.average",
"cpu.wait.summation",
"mem.active.average",
"mem.latency.average",
"mem.state.latest",
"mem.swapin.average",
"mem.swapinRate.average",
"mem.swapout.average",
"mem.swapoutRate.average",
"mem.totalCapacity.average",
"mem.usage.average",
"mem.vmmemctl.average",
"net.bytesRx.average",
"net.bytesTx.average",
"net.droppedRx.summation",
"net.droppedTx.summation",
"net.errorsRx.summation",
"net.errorsTx.summation",
"net.usage.average",
"power.power.average",
"storageAdapter.numberReadAveraged.average",
"power.power.average",
"storageAdapter.numberReadAveraged.average",
"storageAdapter.numberWriteAveraged.average",
"storageAdapter.read.average",
"storageAdapter.write.average",
"sys.uptime.latest",
]

cluster_metric_include = []

datastore_metric_exclude = [ "*" ]
insecure_skip_verify = true

@zhangyf0820

I have the same issue with versions 1.9.0.1 and 1.9.1.1; there is no problem with version 1.8.3-1.
FYI: I have 2 vCenters configured, with fewer than 10 hosts altogether.
Here is the config file:

[global_tags]
[agent]
interval = "15s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "10s"
precision = ""
debug = false
quiet = false
logfile = "/var/log/telegraf/telegraf.log"
hostname = ""
omit_hostname = false

[[outputs.prometheus_client]]
listen = ":9273"

[[inputs.vsphere]]
vcenters = [ "VC1","VC2" ]
username = "*"
password = ""
insecure_skip_verify = true

Here is the log:
2019-01-21T03:24:17Z I! [agent] Hang on, flushing any cached metrics before shutdown
2019-01-21T03:24:18Z I! Loaded inputs: inputs.mem inputs.system inputs.cpu inputs.diskio inputs.kernel inputs.processes inputs.swap inputs.vsphere inputs.disk
2019-01-21T03:24:18Z I! Loaded aggregators:
2019-01-21T03:24:18Z I! Loaded processors:
2019-01-21T03:24:18Z I! Loaded outputs: prometheus_client
2019-01-21T03:24:18Z I! Tags enabled: host=bocloud
2019-01-21T03:24:18Z I! [agent] Config: Interval:15s, Quiet:false, Hostname:"localhost", Flush Interval:10s
2019-01-21T03:24:18Z W! [input.vsphere] Configured max_query_metrics is 256, but server limits it to 64. Reducing.
2019-01-21T03:24:45Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-21T03:25:00Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-21T03:25:15Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-21T03:25:30Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-21T03:25:45Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-21T03:27:30Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-21T03:27:45Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-21T03:28:00Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-21T03:28:15Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-21T03:28:30Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-21T03:28:50Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-21T03:29:05Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-21T03:29:20Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-21T03:29:35Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-21T03:29:50Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-21T03:30:05Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-21T03:30:15Z E! [inputs.vsphere]: Error in plugin: While collecting host: ServerFaultCode: A specified parameter was not correct: querySpec.startTime, querySpec.endTime
2019-01-21T03:30:15Z E! [inputs.vsphere]: Error in plugin: While collecting vm: ServerFaultCode: A specified parameter was not correct: querySpec.startTime, querySpec.endTime
2019-01-21T03:32:30Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-21T03:32:45Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-21T03:33:00Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-21T03:33:15Z W! [agent] input "inputs.vsphere" did not complete within its interval

@prydin
Contributor

prydin commented Jan 21, 2019

It looks like your interval is set to 15s. You need to increase that to at least 20s!
Keep in mind that vSphere data is only available every 20s; you should never specify an interval lower than that. We generally recommend keeping the collection interval at 60s due to the load that shorter intervals may put on the vCenter server. However, we have been able to successfully use a 20s interval on a vCenter managing 7000 VMs, so it is possible, albeit not recommended.

If you truly need 20s granularity on your data, I recommend you do two things:

  1. Move collection of clusters, datacenters and datastores to a separate instance of the plugin. These metrics are only available on a 300s interval, so it's not useful to collect them more often than that. Also, since they're stored on disk and not in memory, they take considerably longer to fetch. Here's a writeup I did on this. You might find this helpful: http://docs-dev.wavefront.com/integrations_vsphere.html (Note to self: Add something similar to the README)

  2. Once you've made the changes above, you should increase both discover_concurrency and collect_concurrency to 3. This should give you an extra performance boost! A sketch of this setup is shown below.
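A rough sketch of that split (the vCenter URL and credentials are placeholders; trim the metric filters to your own needs):

## Instance 1: real-time metrics (hosts and VMs)
[[inputs.vsphere]]
  interval = "60s"
  vcenters = [ "https://vcenter.example.com/sdk" ]
  username = "user"
  password = "pass"
  datastore_metric_exclude = [ "*" ]
  cluster_metric_exclude = [ "*" ]
  datacenter_metric_exclude = [ "*" ]
  collect_concurrency = 3
  discover_concurrency = 3
  insecure_skip_verify = true

## Instance 2: non-real-time metrics (clusters, datastores, datacenters), only every 300s
[[inputs.vsphere]]
  interval = "300s"
  vcenters = [ "https://vcenter.example.com/sdk" ]
  username = "user"
  password = "pass"
  vm_metric_exclude = [ "*" ]
  host_metric_exclude = [ "*" ]
  ## datacenters are not collected by default; include them explicitly if you want them
  datacenter_metric_include = [ "*" ]
  collect_concurrency = 3
  insecure_skip_verify = true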

@zhangyf0820

It works, thanks a lot

@aksharbarot
Author

No, it is not working, but only when datastore metrics are included.
If datastore metrics are excluded, no missing values are observed (i.e. the graph is stable).

@prydin
Contributor

prydin commented Jan 23, 2019

As stated above, you need to declare a separate instance of the plugin and specify data stores, clusters and data centers in that instance with a 300s interval.

@zhangyf0820

Following your suggestion, the real-time metrics are good, but the non-real-time metrics are not continuous.
How can I always get the non-real-time data? Is this a bug? See #5322
[screenshot]

@prydin
Contributor

prydin commented Jan 25, 2019

What's the collection interval on the non-real-time metrics? It should be 300s or higher. Also, can you run with the -debug flag and send us the logs?

@prydin
Contributor

prydin commented Jan 25, 2019

And yes, it might be because of a bug that's scheduled to be fixed in 1.10. You may want to try the latest nightly build from master and see if it fixes the issue.

@aksharbarot
Author

Yes!
object_discovery_interval = "300s"
collect_concurrency = 4
discover_concurrency = 4

Data-collection interval
interval = "60s"

But the error below is observed in the logs:
2019-01-29T06:36:03Z E! [inputs.vsphere]: Error in plugin: While collecting host: Post https://example.com/sdk: context deadline exceeded
2019-01-29T06:37:03Z E! [inputs.vsphere]: Error in plugin: While collecting host: Post https://example.com/sdk: context deadline exceeded

and during this time, no data is shown.

@ghost

ghost commented Feb 6, 2019

Same problem here!
I've used a release that @prydin made for another issue and it worked like a charm.

Functional release

Telegraf unknown (git: prydin-scale-improvement 646c5960)

Non-functional release

telegraf-1.9.4-1.x86_64

I only get inputs-vsphere did not complete within its interval

PROBLEM

No graph is generated!

CONFIG

#[global_tags]

[agent]

interval = '300s'
round_interval = true
metric_batch_size = 10000
metric_buffer_limit = 100000
collection_jitter = '0s'
flush_interval = '10s'
flush_jitter = '0s'
precision = ''
debug = true
quiet = false
logfile = '/var/log/telegraf/telegraf.log'
hostname = ''
omit_hostname = false

[[outputs.influxdb]]

urls = ['http://foo.bar.com:8086']
database = 'telegraf'

[[inputs.vsphere]]
vcenters = [ 25 x vcenter ]
username = 'foobar'
password = 'barfoo'
vm_metric_include = [
  'sys.osUptime.latest',
  'cpu.usage.average',
  'disk.read.average',
  'cpu.usage.average',
  'cpu.demand.average',
  'cpu.idle.summation',
  'cpu.latency.average',
  'cpu.readiness.average',
  'cpu.ready.summation',
  'cpu.run.summation',
  'cpu.usagemhz.average',
  'cpu.used.summation',
  'cpu.wait.summation',
  'mem.active.average',
  'mem.granted.average',
  'mem.latency.average',
  'mem.swapin.average',
  'mem.swapinRate.average',
  'mem.swapout.average',
  'mem.swapoutRate.average',
  'mem.usage.average',
  'mem.vmmemctl.average',
  'net.bytesRx.average',
  'net.bytesTx.average',
  'net.droppedRx.summation',
  'net.droppedTx.summation',
  'net.usage.average',
  'power.power.average',
  'virtualDisk.numberReadAveraged.average',
  'virtualDisk.numberWriteAveraged.average',
  'virtualDisk.read.average',
  'virtualDisk.readOIO.latest',
  'virtualDisk.throughput.usage.average',
  'virtualDisk.totalReadLatency.average',
  'virtualDisk.totalWriteLatency.average',
  'virtualDisk.write.average',
  'virtualDisk.writeOIO.latest',
  'sys.uptime.latest',
]
host_metric_include = [
    'cpu.coreUtilization.average',
    'cpu.costop.summation',
    'cpu.demand.average',
    'cpu.idle.summation',
    'cpu.latency.average',
    'cpu.readiness.average',
    'cpu.ready.summation',
    'cpu.swapwait.summation',
    'cpu.usage.average',
    'cpu.usagemhz.average',
    'cpu.used.summation',
    'cpu.utilization.average',
    'cpu.wait.summation',
    'disk.deviceReadLatency.average',
    'disk.deviceWriteLatency.average',
    'disk.kernelReadLatency.average',
    'disk.kernelWriteLatency.average',
    'disk.numberReadAveraged.average',
    'disk.numberWriteAveraged.average',
    'disk.read.average',
    'disk.totalReadLatency.average',
    'disk.totalWriteLatency.average',
    'disk.write.average',
    'mem.active.average',
    'mem.latency.average',
    'mem.state.latest',
    'mem.swapin.average',
    'mem.swapinRate.average',
    'mem.swapout.average',
    'mem.swapoutRate.average',
    'mem.totalCapacity.average',
    'mem.usage.average',
    'mem.vmmemctl.average',
    'net.bytesRx.average',
    'net.bytesTx.average',
    'net.droppedRx.summation',
    'net.droppedTx.summation',
    'net.errorsRx.summation',
    'net.errorsTx.summation',
    'net.usage.average',
    'power.power.average',
    'storageAdapter.numberReadAveraged.average',
    'storageAdapter.numberWriteAveraged.average',
    'storageAdapter.read.average',
    'storageAdapter.write.average',
    'sys.uptime.latest',
]
datastore_metric_include = [
  'datastore.numberReadAveraged.average',
  'datastore.throughput.contention.average',
  'datastore.throughput.usage.average',
  'datastore.write.average',
  'datastore.read.average',
  'datastore.numberWriteAveraged.average',
  'disk.used.latest',
  'disk.provisioned.latest',
  'disk.capacity.latest',
  'disk.capacity.contention.average',
  'disk.capacity.provisioned.average',
  'disk.capacity.usage.average'
]
cluster_metric_include = []
datacenter_metric_exclude = [ '*' ]
collect_concurrency = 2
discover_concurrency = 1
object_discovery_interval = '1200s'
insecure_skip_verify = true

@sunnybhatnagar

sunnybhatnagar commented Feb 8, 2019

I am having the same issue using this build: 1.9.4-1.x86_64.
Please help me fix this or recommend the right Telegraf build. I have 200+ hosts with 2000+ VMs.

[agent] input "inputs.vsphere" did not complete within its interval

# Read metrics from one or many vCenters
[[inputs.vsphere]]
  ## List of vCenter URLs to be monitored. These three lines must be uncommented
  ## and edited for the plugin to work.
  vcenters = [ "https://xhd/sdk" ]
  username = "sdf"
  password = "password"

  ## VMs
  ## Typical VM metrics (if omitted or empty, all metrics are collected)
  vm_metric_include = [

    "mem.usage.average",
    "net.usage.average",
  ]
  vm_metric_exclude = [
    "power.power.average",    
    "virtualDisk.numberReadAveraged.average",
    "virtualDisk.numberWriteAveraged.average",
    "virtualDisk.read.average",
    "virtualDisk.readOIO.latest",
    "virtualDisk.throughput.usage.average",
    "virtualDisk.totalReadLatency.average",
    "virtualDisk.totalWriteLatency.average",
    "virtualDisk.write.average",
    "virtualDisk.writeOIO.latest",
    "sys.uptime.latest",
	"mem.vmmemctl.average",
    "net.bytesRx.average",
    "net.bytesTx.average",
    "net.droppedRx.summation",
    "net.droppedTx.summation",
	"cpu.wait.summation",
    "mem.active.average",
    "mem.granted.average",
    "mem.latency.average",
    "mem.swapin.average",
    "mem.swapinRate.average",
    "mem.swapout.average",
    "mem.swapoutRate.average",
    "cpu.run.summation",  
    "cpu.demand.average",
    "cpu.idle.summation",
    "cpu.latency.average", 
	"cpu.readiness.average",
    "cpu.ready.summation",
    "cpu.usagemhz.average",
    "cpu.used.summation",
 ]

  ## Nothing is excluded by default
  # vm_instances = true ## true by default

  ## Hosts 
  ## Typical host metrics (if omitted or empty, all metrics are collected)
  host_metric_include = [
    
    "cpu.usage.average",
    "cpu.usagemhz.average",
    "cpu.used.summation",
    "cpu.utilization.average",
    "cpu.wait.summation",
    "mem.usage.average",
    "net.usage.average",

  ]
   host_metric_exclude = [
    "power.power.average",
    "storageAdapter.numberReadAveraged.average",
    "storageAdapter.numberWriteAveraged.average",
    "storageAdapter.read.average",
    "storageAdapter.write.average",
    "sys.uptime.latest",
    "mem.vmmemctl.average",
    "net.bytesRx.average",
    "net.bytesTx.average",
    "net.droppedRx.summation",
    "net.droppedTx.summation",
    "net.errorsRx.summation",
    "net.errorsTx.summation",
    "disk.deviceReadLatency.average",
    "disk.deviceWriteLatency.average",
    "disk.kernelReadLatency.average",
    "disk.kernelWriteLatency.average",
    "disk.numberReadAveraged.average",
    "disk.numberWriteAveraged.average",
    "disk.read.average",
    "disk.totalReadLatency.average",
    "disk.totalWriteLatency.average",
    "disk.write.average",
    "mem.active.average",
    "mem.latency.average",
    "mem.state.latest",
    "mem.swapin.average",
    "mem.swapinRate.average",
    "mem.swapout.average",
    "mem.swapoutRate.average",
    "mem.totalCapacity.average",
    "cpu.coreUtilization.average",
    "cpu.costop.summation",
    "cpu.demand.average",
    "cpu.idle.summation",
    "cpu.latency.average",
    "cpu.readiness.average",
    "cpu.ready.summation",
    "cpu.swapwait.summation",
   ] 
  ## Nothing excluded by default
  # host_instances = true ## true by default

  ## Clusters 
  # cluster_metric_include = [] ## if omitted or empty, all metrics are collected
  # cluster_metric_exclude = [] ## Nothing excluded by default
  # cluster_instances = true ## true by default

  ## Datastores 
  # datastore_metric_include = [] ## if omitted or empty, all metrics are collected
  # datastore_metric_exclude = [] ## Nothing excluded by default
  # datastore_instances = false ## false by default for Datastores only

  ## Datacenters
  datacenter_metric_include = [] ## if omitted or empty, all metrics are collected
  datacenter_metric_exclude = [ "*" ] ## Datacenters are not collected by default.
  # datacenter_instances = false ## false by default for Datastores only

  ## Plugin Settings  
  ## separator character to use for measurement and field names (default: "_")
  # separator = "_"

  ## number of objects to retrieve per query for realtime resources (vms and hosts)
  ## set to 64 for vCenter 5.5 and 6.0 (default: 256)
  # max_query_objects = 256

  ## number of metrics to retrieve per query for non-realtime resources (clusters and datastores)
  ## set to 64 for vCenter 5.5 and 6.0 (default: 256)
  # max_query_metrics = 256

  ## number of go routines to use for collection and discovery of objects and metrics
   collect_concurrency = 5
   discover_concurrency = 3

  ## whether or not to force discovery of new objects on initial gather call before collecting metrics
  ## when true for large environments this may cause errors for time elapsed while collecting metrics
  ## when false (default) the first collection cycle may result in no or limited metrics while objects are discovered
  # force_discover_on_init = false

  ## the interval before (re)discovering objects subject to metrics collection (default: 300s)
  # object_discovery_interval = "300s"

  ## timeout applies to any of the api request made to vcenter
  timeout = "180s"

  ## Optional SSL Config
  # ssl_ca = "/path/to/cafile"
  # ssl_cert = "/path/to/certfile"
  # ssl_key = "/path/to/keyfile"
  ## Use SSL but skip chain & host verification
  insecure_skip_verify = true

@danielnelson danielnelson added this to the 1.10.0 milestone Feb 14, 2019
@danielnelson
Contributor

@sunnybhatnagar Can you try with the latest nightly build?

@MartVisser

@danielnelson I have the same issue; just tried the nightly build and that fixes the issue!

But now I am getting weird failed events in my vCenter, see:

[screenshots]

@danielnelson
Contributor

@MartVisser Can you open a new issue for that side effect?

I'm going to close this issue. If anyone is having issues with the plugin not completing by the end of the interval, please read the hints above and try again with the nightly builds. If you still have problems after that, please open a new issue.
