Incorrect aerospike metrics export to prometheus with more than 1 node in aerospike cluster #2917

Closed
popov-ilya opened this issue Jun 13, 2017 · 6 comments · Fixed by #2918
Labels: breaking change, bug

@popov-ilya

Bug report

Relevant telegraf.conf:

[agent]
  interval = "10s"
  round_interval = true
  flush_interval = "10s"
  flush_jitter = "0s"
  debug = false
  hostname = ""


[[outputs.prometheus_client]]
  listen = "192.168.1.18:9126"

[[inputs.aerospike]]
  servers = ["localhost:3000"]

System info:

Telegraf v1.3.1 (git: release-1.3 f93615672b02a41d9bc867bd92bf31c1d777989b)
Linux dev01 4.4.0-72-generic #93-Ubuntu SMP Fri Mar 31 14:07:41 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Steps to reproduce:

  1. Generate a config: telegraf -input-filter aerospike -output-filter prometheus_client config
  2. Get metrics from telegraf: telegraf --config telegraf.conf -test
  3. Get metrics from the prometheus_client output: curl 192.168.1.18:9126/metrics. Do it a few times.
  4. Compare metrics that didn't change (device_used_bytes, for example) with what you got from curl.
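
The comparison in step 4 can be scripted. The following is a minimal sketch, assuming each scrape has already been captured as text (for example with curl). The helper names sample_values and is_flapping are hypothetical and not part of Telegraf or Prometheus.

```python
import re

# Matches one sample line of the Prometheus text exposition format, e.g.
#   aerospike_node_objects{aerospike_host="localhost:3000",host="dev01"} 7144
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?P<labels>\{.*\})?\s+(?P<value>\S+)'
)


def sample_values(scrape_text, metric):
    """Return the values of every sample of `metric` found in one scrape."""
    values = []
    for line in scrape_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        m = SAMPLE_RE.match(line)
        if m and m.group("name") == metric:
            values.append(float(m.group("value")))
    return values


def is_flapping(scrapes, metric):
    """True if `metric` takes more than one distinct value across the scrapes."""
    seen = set()
    for text in scrapes:
        seen.update(sample_values(text, metric))
    return len(seen) > 1
```

A metric that is expected to stay constant between scrapes but for which is_flapping returns True is a candidate for the behavior described in this report.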

Expected behavior:

prometheus_client should provide aerospike metrics from the host where telegraf is running.

Actual behavior:

prometheus_client may provide metrics from any host in your aerospike cluster.

Additional info:

If you run telegraf -test, it prints aerospike_node and aerospike_namespace X times, where X is the number of nodes in your aerospike cluster. If you then query the prometheus_client output with curl on :9126/metrics, you get one set of metrics, which should come from the host where telegraf is running. In practice it returns metrics from any of your aerospike hosts, and the host may change every collection interval. This produces a lot of spikes on dashboards, even for metrics that didn't actually change.
In the previous version, 1.2.1, it always returned metrics from the same host, but not necessarily the right one.
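
The effect can be illustrated with a simplified model. It assumes the output keeps one value per metric name and label set, and that a later point overwrites an earlier one. The names series_key and expose are hypothetical; this is a sketch of the collision, not Telegraf's actual implementation.

```python
def series_key(measurement, field, tags):
    """Identity of a series: metric name plus the sorted label set."""
    return (f"{measurement}_{field}", tuple(sorted(tags.items())))


def expose(points):
    """Keep one value per series key; a later point overwrites an earlier one."""
    series = {}
    for measurement, tags, fields in points:
        for field, value in fields.items():
            series[series_key(measurement, field, tags)] = value
    return series


# All four nodes report the same tag set, because node_name is a field
# in this version and therefore not part of the series identity.
tags = {"aerospike_host": "localhost:3000", "host": "dev01"}
points = [
    ("aerospike_node", tags, {"objects": 7532}),
    ("aerospike_node", tags, {"objects": 7144}),
    ("aerospike_node", tags, {"objects": 7367}),
    ("aerospike_node", tags, {"objects": 7919}),
]
series = expose(points)
```

In this model the four points collapse into a single aerospike_node_objects series, and reordering the points changes which node's value is exposed.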

@danielnelson
Contributor

Can you add the output of telegraf -test?

@popov-ilya
Author

popov-ilya commented Jun 13, 2017

Sure:

ilya-popov@dev01:~$ telegraf -test -config /etc/telegraf/telegraf.conf -input-filter aerospike -output-filter prometheus_client | grep aerospike_node
> aerospike_node,aerospike_host=localhost:3000,host=dev01 sub_objects=0i,sindex_gc_locktimedout=0i,query_long_running=0i,sindex_gc_activity_dur=0i,heartbeat_connections=3i,objects=7532i,batch_index_initiate=0i,early_tsvc_batch_sub_error=0i,demarshal_error=0i,batch_index_unused_buffers=0i,heap_active_kbytes=2273664i,heap_mapped_kbytes=2494464i,reaped_fds=0i,sindex_gc_list_creation_time=0i,fabric_msgs_sent=1334497i,sindex_ucgarbage_found=0i,sindex_gc_garbage_cleaned=0i,batch_index_huge_buffers=0i,cluster_integrity=true,batch_index_error=0i,uptime=3548464i,batch_index_complete=0i,heartbeat_received_self=0i,rw_in_progress=0i,batch_index_timeout=0i,scans_active=0i,proxy_in_progress=0i,batch_timeout=0i,batch_initiate=0i,paxos_principal="BB9F1EA1A005452",sindex_gc_garbage_found=0i,system_swapping=false,heap_allocated_kbytes=2258711i,info_complete=253232338i,migrate_allowed=true,sindex_gc_list_deletion_time=0i,tsvc_queue=0i,tombstones=0i,early_tsvc_client_error=2i,tree_gc_queue=0i,query_short_running=0i,batch_index_queue="0:0,0:0,0:0,0:0",migrate_partitions_remaining=0i,batch_index_destroyed_buffers=0i,batch_index_created_buffers=0i,early_tsvc_udf_sub_error=0i,heartbeat_received_foreign=70919919i,sindex_gc_inactivity_dur=0i,sindex_gc_objects_validated=0i,proxy_retry=0i,info_queue=0i,cluster_key="",fabric_connections=45i,client_connections=51i,batch_error=0i,fabric_msgs_rcvd=1334478i,batch_queue=0i,cluster_size=4i,node_name="BB947F50D005452",delete_queue=0i,system_free_mem_pct=55i,heap_efficiency_pct=91i 1497379475000000000
> aerospike_node,aerospike_host=localhost:3000,host=dev01 sindex_gc_garbage_found=0i,node_name="BB9F1EA1A005452",fabric_connections=44i,batch_index_error=0i,tree_gc_queue=0i,batch_initiate=0i,reaped_fds=0i,early_tsvc_client_error=2i,batch_timeout=0i,info_queue=0i,sindex_gc_list_creation_time=0i,heap_allocated_kbytes=2255352i,fabric_msgs_rcvd=1000440i,delete_queue=0i,batch_index_created_buffers=0i,batch_index_queue="0:0,0:0,0:0,0:0",sindex_gc_list_deletion_time=0i,rw_in_progress=0i,heap_efficiency_pct=90i,heartbeat_received_foreign=50540900i,fabric_msgs_sent=1000564i,batch_queue=0i,proxy_retry=0i,paxos_principal="BB9F1EA1A005452",heartbeat_connections=3i,heap_mapped_kbytes=2494464i,uptime=2528875i,batch_index_initiate=0i,sindex_gc_locktimedout=0i,migrate_partitions_remaining=0i,batch_index_destroyed_buffers=0i,system_free_mem_pct=66i,batch_index_timeout=0i,demarshal_error=0i,heap_active_kbytes=2272212i,cluster_key="",tombstones=0i,sindex_gc_inactivity_dur=0i,cluster_integrity=true,cluster_size=4i,batch_index_huge_buffers=0i,batch_index_complete=0i,sindex_gc_activity_dur=0i,client_connections=47i,query_long_running=0i,scans_active=0i,sindex_ucgarbage_found=0i,sindex_gc_garbage_cleaned=0i,batch_index_unused_buffers=0i,sindex_gc_objects_validated=0i,objects=7144i,system_swapping=false,query_short_running=0i,early_tsvc_udf_sub_error=0i,info_complete=180476564i,sub_objects=0i,tsvc_queue=0i,migrate_allowed=true,proxy_in_progress=0i,batch_error=0i,heartbeat_received_self=0i,early_tsvc_batch_sub_error=0i 1497379475000000000
> aerospike_node,aerospike_host=localhost:3000,host=dev01 heartbeat_received_self=0i,info_queue=0i,uptime=2512830i,migrate_partitions_remaining=0i,system_free_mem_pct=73i,heap_mapped_kbytes=2359296i,sindex_gc_list_creation_time=0i,fabric_connections=37i,tombstones=0i,rw_in_progress=0i,heartbeat_connections=3i,sindex_ucgarbage_found=0i,early_tsvc_udf_sub_error=0i,batch_index_destroyed_buffers=0i,sindex_gc_garbage_cleaned=0i,migrate_allowed=true,heartbeat_received_foreign=50223116i,batch_timeout=0i,sindex_gc_list_deletion_time=0i,query_long_running=0i,sindex_gc_objects_validated=0i,batch_initiate=0i,paxos_principal="BB9F1EA1A005452",cluster_key="",batch_index_huge_buffers=0i,batch_index_unused_buffers=0i,node_name="BB9B84A3B005452",cluster_size=4i,proxy_in_progress=0i,batch_index_timeout=0i,fabric_msgs_sent=793005i,batch_index_error=0i,early_tsvc_client_error=0i,batch_index_initiate=0i,sub_objects=0i,fabric_msgs_rcvd=792996i,heap_allocated_kbytes=2229334i,query_short_running=0i,sindex_gc_inactivity_dur=0i,batch_index_complete=0i,tree_gc_queue=0i,scans_active=0i,batch_index_queue="0:0,0:0,0:0,0:0",sindex_gc_garbage_found=0i,system_swapping=false,batch_index_created_buffers=0i,objects=7367i,cluster_integrity=true,client_connections=47i,heap_efficiency_pct=94i,batch_queue=0i,info_complete=179331347i,batch_error=0i,sindex_gc_locktimedout=0i,heap_active_kbytes=2238048i,delete_queue=0i,reaped_fds=0i,sindex_gc_activity_dur=0i,tsvc_queue=0i,proxy_retry=0i,early_tsvc_batch_sub_error=0i,demarshal_error=0i 1497379475000000000
> aerospike_node,aerospike_host=localhost:3000,host=dev01 sindex_gc_locktimedout=0i,fabric_connections=44i,fabric_msgs_rcvd=1966119i,rw_in_progress=0i,tree_gc_queue=0i,reaped_fds=0i,fabric_msgs_sent=1966133i,objects=7919i,sindex_gc_garbage_cleaned=0i,sindex_gc_activity_dur=0i,batch_initiate=0i,heap_mapped_kbytes=2476032i,cluster_key="",sindex_gc_list_deletion_time=0i,batch_error=0i,system_free_mem_pct=68i,migrate_partitions_remaining=0i,scans_active=0i,batch_index_timeout=0i,info_complete=180328064i,proxy_retry=0i,batch_index_destroyed_buffers=0i,batch_index_error=0i,node_name="BB92F7E6C005452",early_tsvc_client_error=2i,info_queue=0i,query_long_running=0i,demarshal_error=0i,batch_index_initiate=0i,proxy_in_progress=0i,early_tsvc_batch_sub_error=0i,tombstones=0i,heap_allocated_kbytes=2250586i,uptime=2526799i,batch_queue=0i,batch_index_created_buffers=0i,delete_queue=0i,batch_index_huge_buffers=0i,batch_timeout=0i,query_short_running=0i,batch_index_unused_buffers=0i,heartbeat_received_self=0i,heap_active_kbytes=2265080i,sindex_gc_garbage_found=0i,sindex_ucgarbage_found=0i,heartbeat_received_foreign=50500518i,heap_efficiency_pct=91i,batch_index_queue="0:0,0:0,0:0,0:0",sindex_gc_list_creation_time=0i,cluster_size=4i,heartbeat_connections=3i,migrate_allowed=true,client_connections=50i,paxos_principal="BB9F1EA1A005452",cluster_integrity=true,sindex_gc_objects_validated=0i,sub_objects=0i,system_swapping=false,sindex_gc_inactivity_dur=0i,batch_index_complete=0i,tsvc_queue=0i,early_tsvc_udf_sub_error=0i 1497379475000000000
ilya-popov@dev01:~$ curl 192.168.1.18:9126/metrics | grep aerospike_node_objects
# HELP aerospike_node_objects Telegraf collected metric
# TYPE aerospike_node_objects untyped
aerospike_node_objects{aerospike_host="localhost:3000",host="dev01"} 7144

ilya-popov@dev01:~$ curl 192.168.1.18:9126/metrics | grep aerospike_node_objects
# HELP aerospike_node_objects Telegraf collected metric
# TYPE aerospike_node_objects untyped
aerospike_node_objects{aerospike_host="localhost:3000",host="dev01"} 7367

ilya-popov@dev01:~$ curl 192.168.1.18:9126/metrics | grep aerospike_node_objects
# HELP aerospike_node_objects Telegraf collected metric
# TYPE aerospike_node_objects untyped
aerospike_node_objects{aerospike_host="localhost:3000",host="dev01"} 7919

ilya-popov@dev01:~$ curl 192.168.1.18:9126/metrics | grep aerospike_node_objects
# HELP aerospike_node_objects Telegraf collected metric
# TYPE aerospike_node_objects untyped
aerospike_node_objects{aerospike_host="localhost:3000",host="dev01"} 7919

ilya-popov@dev01:~$ curl 192.168.1.18:9126/metrics | grep aerospike_node_objects
# HELP aerospike_node_objects Telegraf collected metric
# TYPE aerospike_node_objects untyped
aerospike_node_objects{aerospike_host="localhost:3000",host="dev01"} 7367

As you can see, on each request the prometheus_client output returns the aerospike_node_objects metric from a seemingly random aerospike host.
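
To see which node each scraped value belongs to, the node_name and objects fields can be pulled out of the telegraf -test lines. This is a minimal sketch using regular expressions rather than a full line protocol parser. The sample lines are abbreviated to a few fields from the output above, and objects_by_node is a hypothetical helper name.

```python
import re

# The lookbehind keeps "objects" from matching inside "sub_objects".
NODE_RE = re.compile(r'(?<![A-Za-z0-9_])node_name="([^"]*)"')
OBJECTS_RE = re.compile(r'(?<![A-Za-z0-9_])objects=(-?\d+)i')


def objects_by_node(lines):
    """Map node_name to its objects count for each line that carries both."""
    result = {}
    for line in lines:
        node = NODE_RE.search(line)
        objects = OBJECTS_RE.search(line)
        if node and objects:
            result[node.group(1)] = int(objects.group(1))
    return result


# Abbreviated versions of the four aerospike_node lines shown above.
prefix = "> aerospike_node,aerospike_host=localhost:3000,host=dev01 "
lines = [
    prefix + 'sub_objects=0i,objects=7532i,node_name="BB947F50D005452" 1497379475000000000',
    prefix + 'node_name="BB9F1EA1A005452",objects=7144i,sub_objects=0i 1497379475000000000',
    prefix + 'node_name="BB9B84A3B005452",sub_objects=0i,objects=7367i 1497379475000000000',
    prefix + 'objects=7919i,node_name="BB92F7E6C005452",sub_objects=0i 1497379475000000000',
]
by_node = objects_by_node(lines)
node_for_value = {value: node for node, value in by_node.items()}
```

Looking up the scraped values 7144, 7367, and 7919 in node_for_value yields three different node names, which matches the report that the exposed value comes from a different node on different scrapes.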

@danielnelson
Contributor

I think we need to make the node_name field into a tag.
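
A minimal sketch of what that change means for series identity, assuming a point is modeled as a (measurement, tags, fields) tuple. The helpers promote_field_to_tag and tag_set are hypothetical names for illustration, not the plugin's code.

```python
def promote_field_to_tag(point, key):
    """Return a copy of the point with field `key` moved into the tag set."""
    measurement, tags, fields = point
    new_tags = dict(tags)
    new_fields = dict(fields)
    new_tags[key] = str(new_fields.pop(key))
    return measurement, new_tags, new_fields


def tag_set(point):
    """The sorted tag set, which is what distinguishes one series from another."""
    return tuple(sorted(point[1].items()))


tags = {"aerospike_host": "localhost:3000", "host": "dev01"}
points = [
    ("aerospike_node", tags, {"node_name": "BB947F50D005452", "objects": 7532}),
    ("aerospike_node", tags, {"node_name": "BB9F1EA1A005452", "objects": 7144}),
    ("aerospike_node", tags, {"node_name": "BB9B84A3B005452", "objects": 7367}),
    ("aerospike_node", tags, {"node_name": "BB92F7E6C005452", "objects": 7919}),
]
fixed = [promote_field_to_tag(p, "node_name") for p in points]
```

Before the change all four points share one tag set; after it each point has its own, so an output keyed on tags can keep the nodes apart.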

danielnelson added the bug label and removed the need more info label on Jun 13, 2017
danielnelson added this to the 1.4.0 milestone on Jun 13, 2017
@danielnelson
Contributor

This needs to be fixed for both the aerospike_node and the aerospike_namespace measurements.

@popov-ilya Are you interested in doing this?

@popov-ilya
Author

Unfortunately, I don't have the necessary skills for this.

@danielnelson
Contributor

I'll copy my comment over from the PR:

I didn't see a way to determine which node is the "local" one using aerospike-client-go, though I didn't look too deeply. If we did have a way we could potentially add an option to allow switching between full cluster and local node only mode, similar to what we have for elasticsearch.

So the fix doesn't provide the requested functionality of scraping a single node. If this is a problem, you can reopen this issue.

danielnelson added the breaking change label on Jun 14, 2017