Add a new input plugin for InfiniBand card/port statistics #6631
Well, I've added a test, cleaned up the code, and added a README, although the packaging seems to be failing for unrelated reasons.
The packaging failure is related, but it only happens on Windows. Presumably this plugin wouldn't work on Windows, so I think the best way to handle this is to mirror this change from the ethtool PR: https://github.com/influxdata/telegraf/pull/5865/files#diff-980231a87c3954198c19881c18fd0126
It looks like the import I am using, rdmamap, is causing the packaging failure on Windows because it imports netns. I'm not sure how to prevent this, though, as I'm not importing it on non-Linux builds. Running go vet on netns on a Windows box produces this error :(
I guess we could fork rdmamap and remove the Docker-related bits, since they aren't used in this code anyway, and thereby prevent this issue. Or is there a way to only vet dependencies on the systems they will run on? EDIT: Although I can see that the docker library uses netns, so I have no idea why this hasn't been picked up before. And other packages use netlink, which uses netns...
I would love to use this plugin; however, it's Linux-specific, and InfiniBand cards are used extensively on Windows as well as on Linux. We use them mainly on Windows. Windows exposes numerous counters for Mellanox cards via perfmon:
> Get-Counter -ListSet *mellanox* | select countersetname
CounterSetName
--------------
Mellanox IB Adapter Traffic Counters
Mellanox IB Adapter Diagnostic Counters
Mellanox Adapter Diagnostic Counters
Mellanox Adapter Traffic Counters
Mellanox Adapter QoS Counters
Mellanox WinOF Bus Counters
These counters can be monitored using …
On the other hand, is this a plugin for InfiniBand (IB) stats in general (covering multiple vendors) or for IB stats from Mellanox cards specifically? The name suggests the former. Of course, Mellanox is currently the main vendor of InfiniBand equipment since QLogic sold its InfiniBand business to Intel. Oracle tried to produce their own IB chipset, but I am not sure how successful they were or whether this equipment is on the market. Maybe the plugin could be designed so that it can be extended to other vendors (in the future?) and platforms (Windows as well as Linux) through the same interface?
List of all counters available on Windows in perfmon:
> (Get-Counter -ListSet *mellanox*).counter
\Mellanox IB Adapter Traffic Counters(*)\Packets Received Discarded
\Mellanox IB Adapter Traffic Counters(*)\Packets Received Bad CRC Error
\Mellanox IB Adapter Traffic Counters(*)\Packets Received Symbol Error
\Mellanox IB Adapter Traffic Counters(*)\Packets Received Frame Length Error
\Mellanox IB Adapter Traffic Counters(*)\Packets Received Errors
\Mellanox IB Adapter Traffic Counters(*)\Packets Outbound Discarded
\Mellanox IB Adapter Traffic Counters(*)\Packets Outbound Errors
\Mellanox IB Adapter Traffic Counters(*)\Control Packets
\Mellanox IB Adapter Traffic Counters(*)\Packets Total/Sec
\Mellanox IB Adapter Traffic Counters(*)\Packets Total
\Mellanox IB Adapter Traffic Counters(*)\KBytes Total/Sec
\Mellanox IB Adapter Traffic Counters(*)\Bytes Total
\Mellanox IB Adapter Traffic Counters(*)\Packets Sent/Sec
\Mellanox IB Adapter Traffic Counters(*)\Packets Sent
\Mellanox IB Adapter Traffic Counters(*)\KBytes Sent/Sec
\Mellanox IB Adapter Traffic Counters(*)\Bytes Sent
\Mellanox IB Adapter Traffic Counters(*)\Packets Received/Sec
\Mellanox IB Adapter Traffic Counters(*)\Packets Received
\Mellanox IB Adapter Traffic Counters(*)\KBytes Received/Sec
\Mellanox IB Adapter Traffic Counters(*)\Bytes Received
\Mellanox IB Adapter Diagnostic Counters(*)\TX Ring Is Full Packets
\Mellanox IB Adapter Diagnostic Counters(*)\Requester Timeout Received
\Mellanox IB Adapter Diagnostic Counters(*)\Responder Duplicate Request Received
\Mellanox IB Adapter Diagnostic Counters(*)\CQ Overflows
\Mellanox IB Adapter Diagnostic Counters(*)\Requester RNR NAK Retries Exceeded Errors
\Mellanox IB Adapter Diagnostic Counters(*)\Requester Transport Retries Exceeded Errors
\Mellanox IB Adapter Diagnostic Counters(*)\Requester Remote Operation Errors
\Mellanox IB Adapter Diagnostic Counters(*)\Responder Out-of-order Sequence Received
\Mellanox IB Adapter Diagnostic Counters(*)\Requester Out-of-order Sequence NAK
\Mellanox IB Adapter Diagnostic Counters(*)\Responder RNR NAK
\Mellanox IB Adapter Diagnostic Counters(*)\Requester RNR NAK
\Mellanox IB Adapter Diagnostic Counters(*)\Responder Remote Access Errors
\Mellanox IB Adapter Diagnostic Counters(*)\Requester Remote Access Errors
\Mellanox IB Adapter Diagnostic Counters(*)\Responder Invalid Request Errors
\Mellanox IB Adapter Diagnostic Counters(*)\Requester Invalid Request Errors
\Mellanox IB Adapter Diagnostic Counters(*)\Responder CQE Errors
\Mellanox IB Adapter Diagnostic Counters(*)\Requester CQE Errors
\Mellanox IB Adapter Diagnostic Counters(*)\Responder Protection Errors
\Mellanox IB Adapter Diagnostic Counters(*)\Requester Protection Errors
\Mellanox IB Adapter Diagnostic Counters(*)\Responder QP Operation Errors
\Mellanox IB Adapter Diagnostic Counters(*)\Requester QP Operation Errors
\Mellanox IB Adapter Diagnostic Counters(*)\Responder Length Errors
\Mellanox IB Adapter Diagnostic Counters(*)\Requester Length Errors
\Mellanox Adapter Diagnostic Counters(*)\Device detected stalled state
\Mellanox Adapter Diagnostic Counters(*)\Packet detected as stalled
\Mellanox Adapter Diagnostic Counters(*)\Packets discarded due to TC in stalled state
\Mellanox Adapter Diagnostic Counters(*)\Packets discarded due to Head-Of-Queue lifetime limit
\Mellanox Adapter Diagnostic Counters(*)\Dropless Mode Entries
\Mellanox Adapter Diagnostic Counters(*)\Dropless Mode Exits
\Mellanox Adapter Diagnostic Counters(*)\TX Ring Is Full Packets
\Mellanox Adapter Diagnostic Counters(*)\Requester Timeout Received
\Mellanox Adapter Diagnostic Counters(*)\Responder Duplicate Request Received
\Mellanox Adapter Diagnostic Counters(*)\CQ Overflows
\Mellanox Adapter Diagnostic Counters(*)\Requester RNR NAK Retries Exceeded Errors
\Mellanox Adapter Diagnostic Counters(*)\Requester Transport Retries Exceeded Errors
\Mellanox Adapter Diagnostic Counters(*)\Requester Remote Operation Errors
\Mellanox Adapter Diagnostic Counters(*)\Responder Out-of-order Sequence Received
\Mellanox Adapter Diagnostic Counters(*)\Requester Out-of-order Sequence NAK
\Mellanox Adapter Diagnostic Counters(*)\Responder RNR NAK
\Mellanox Adapter Diagnostic Counters(*)\Requester RNR NAK
\Mellanox Adapter Diagnostic Counters(*)\Responder Remote Access Errors
\Mellanox Adapter Diagnostic Counters(*)\Requester Remote Access Errors
\Mellanox Adapter Diagnostic Counters(*)\Responder Invalid Request Errors
\Mellanox Adapter Diagnostic Counters(*)\Requester Invalid Request Errors
\Mellanox Adapter Diagnostic Counters(*)\Responder CQE Errors
\Mellanox Adapter Diagnostic Counters(*)\Requester CQE Errors
\Mellanox Adapter Diagnostic Counters(*)\Responder Protection Errors
\Mellanox Adapter Diagnostic Counters(*)\Requester Protection Errors
\Mellanox Adapter Diagnostic Counters(*)\Responder QP Operation Errors
\Mellanox Adapter Diagnostic Counters(*)\Requester QP Operation Errors
\Mellanox Adapter Diagnostic Counters(*)\Responder Length Errors
\Mellanox Adapter Diagnostic Counters(*)\Requester Length Errors
\Mellanox Adapter Traffic Counters(*)\Packets Received Discarded
\Mellanox Adapter Traffic Counters(*)\Packets Received Bad CRC Error
\Mellanox Adapter Traffic Counters(*)\Packets Received Symbol Error
\Mellanox Adapter Traffic Counters(*)\Packets Received Frame Length Error
\Mellanox Adapter Traffic Counters(*)\Packets Received Errors
\Mellanox Adapter Traffic Counters(*)\Packets Outbound Discarded
\Mellanox Adapter Traffic Counters(*)\Packets Outbound Errors
\Mellanox Adapter Traffic Counters(*)\Control Packets
\Mellanox Adapter Traffic Counters(*)\Packets Total/Sec
\Mellanox Adapter Traffic Counters(*)\Packets Total
\Mellanox Adapter Traffic Counters(*)\KBytes Total/Sec
\Mellanox Adapter Traffic Counters(*)\Bytes Total
\Mellanox Adapter Traffic Counters(*)\Packets Sent/Sec
\Mellanox Adapter Traffic Counters(*)\Packets Sent
\Mellanox Adapter Traffic Counters(*)\KBytes Sent/Sec
\Mellanox Adapter Traffic Counters(*)\Bytes Sent
\Mellanox Adapter Traffic Counters(*)\Packets Received/Sec
\Mellanox Adapter Traffic Counters(*)\Packets Received
\Mellanox Adapter Traffic Counters(*)\KBytes Received/Sec
\Mellanox Adapter Traffic Counters(*)\Bytes Received
\Mellanox Adapter QoS Counters(*)\Responder Ignored ECN due CNP coalesce
\Mellanox Adapter QoS Counters(*)\Sent Discard Frames
\Mellanox Adapter QoS Counters(*)\Requester Traffic Rate Low Peak
\Mellanox Adapter QoS Counters(*)\Requester Traffic Rate High Peak
\Mellanox Adapter QoS Counters(*)\Responder CNP Sent Successfully
\Mellanox Adapter QoS Counters(*)\Responder ECN Handled Successfully
\Mellanox Adapter QoS Counters(*)\Responder Ignored ECN
\Mellanox Adapter QoS Counters(*)\Responder Active CNP
\Mellanox Adapter QoS Counters(*)\Requester Successfully Handled Limitation Request
\Mellanox Adapter QoS Counters(*)\Requester Ignored Limitation Request
\Mellanox Adapter QoS Counters(*)\Requester Allocated Rate Limiters
\Mellanox Adapter QoS Counters(*)\Requester Total Allocated Rate Limiters
\Mellanox Adapter QoS Counters(*)\Requester Current Total Rate
\Mellanox Adapter QoS Counters(*)\Requester Average Total Rate
\Mellanox Adapter QoS Counters(*)\Rcv Pause Duration
\Mellanox Adapter QoS Counters(*)\Rcv Pause Frames
\Mellanox Adapter QoS Counters(*)\Sent Pause Duration
\Mellanox Adapter QoS Counters(*)\Sent Pause Frames
\Mellanox Adapter QoS Counters(*)\Packets Total/Sec
\Mellanox Adapter QoS Counters(*)\Packets Total
\Mellanox Adapter QoS Counters(*)\KBytes Total/Sec
\Mellanox Adapter QoS Counters(*)\Bytes Total
\Mellanox Adapter QoS Counters(*)\Packets Sent/Sec
\Mellanox Adapter QoS Counters(*)\Packets Sent
\Mellanox Adapter QoS Counters(*)\KBytes Sent/Sec
\Mellanox Adapter QoS Counters(*)\Bytes Sent
\Mellanox Adapter QoS Counters(*)\Packets Received/Sec
\Mellanox Adapter QoS Counters(*)\Packets Received
\Mellanox Adapter QoS Counters(*)\KBytes Received/Sec
\Mellanox Adapter QoS Counters(*)\Bytes Received
\Mellanox WinOF Bus Counters(*)\Arrived RDMA CNPs
\Mellanox WinOF Bus Counters(*)\CPU MEM-pages (4K) mapped by TPT for MR
\Mellanox WinOF Bus Counters(*)\CPU MEM-pages (4K) mapped by TPT for EQ
\Mellanox WinOF Bus Counters(*)\CPU MEM-pages (4K) mapped by TPT for CQ
\Mellanox WinOF Bus Counters(*)\CPU MEM-pages (4K) mapped by TPT for QP
\Mellanox WinOF Bus Counters(*)\MTT entries used for MR
\Mellanox WinOF Bus Counters(*)\MTT entries used for EQ
\Mellanox WinOF Bus Counters(*)\MTT entries used for CQ
\Mellanox WinOF Bus Counters(*)\MTT entries used for QP
\Mellanox WinOF Bus Counters(*)\MPT entries used for MR
\Mellanox WinOF Bus Counters(*)\MPT entries used for EQ
\Mellanox WinOF Bus Counters(*)\MPT entries used for CQ
\Mellanox WinOF Bus Counters(*)\MPT entries used for QP
\Mellanox WinOF Bus Counters(*)\External Doorbell Drop/sec
\Mellanox WinOF Bus Counters(*)\External Doorbell Push/sec
\Mellanox WinOF Bus Counters(*)\External Blueflame Replace/sec
\Mellanox WinOF Bus Counters(*)\External Blueflame hit/sec
\Mellanox WinOF Bus Counters(*)\MPT Miss/sec
\Mellanox WinOF Bus Counters(*)\MTT Miss/sec
\Mellanox WinOF Bus Counters(*)\EQ Miss/sec
\Mellanox WinOF Bus Counters(*)\CQ Miss/sec
\Mellanox WinOF Bus Counters(*)\RQ Miss/sec
\Mellanox WinOF Bus Counters(*)\SQ Miss/sec
\Mellanox WinOF Bus Counters(*)\Receive WQE cache lookup/sec
\Mellanox WinOF Bus Counters(*)\Receive WQE cache hit/sec
\Mellanox WinOF Bus Counters(*)\Steering/QPC Back-pressure/sec
\Mellanox WinOF Bus Counters(*)\WQE fetch/Atomic Back-pressure/sec
\Mellanox WinOF Bus Counters(*)\Scatter Back-pressure/sec
\Mellanox WinOF Bus Counters(*)\No-WQE Drops/sec
\Mellanox WinOF Bus Counters(*)\PCI Back-pressure/sec
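The suggestion of one interface with pluggable vendor/platform backends could look roughly like the sketch below. This is only an illustration under assumptions: `counterSource`, `staticSource` and `gather` are hypothetical names, not part of the submitted plugin (which reads Linux sysfs via rdmamap), and `staticSource` stands in for a real sysfs or perfmon backend.

```go
package main

import (
	"fmt"
	"sort"
)

// counterSource is a hypothetical abstraction over where per-port counters
// come from (Linux sysfs, Windows perfmon, another vendor's tooling, ...).
type counterSource interface {
	// Counters returns counter values keyed by port identifier, then counter name.
	Counters() (map[string]map[string]uint64, error)
}

// staticSource is a stand-in backend returning fixed values; a real backend
// would read /sys/class/infiniband on Linux or perfmon counters on Windows.
type staticSource struct {
	data map[string]map[string]uint64
}

func (s staticSource) Counters() (map[string]map[string]uint64, error) { return s.data, nil }

// Compile-time check that staticSource satisfies the interface.
var _ counterSource = staticSource{}

// gather flattens any backend's output into "port.counter" fields, so the
// emitting code does not care which platform or vendor produced the numbers.
func gather(src counterSource) (map[string]uint64, error) {
	ports, err := src.Counters()
	if err != nil {
		return nil, err
	}
	fields := make(map[string]uint64)
	for port, counters := range ports {
		for name, v := range counters {
			fields[port+"."+name] = v
		}
	}
	return fields, nil
}

func main() {
	src := staticSource{data: map[string]map[string]uint64{
		"mlx5_0/1": {"port_rcv_data": 100, "symbol_error": 0},
	}}
	fields, err := gather(src)
	if err != nil {
		panic(err)
	}
	// Print in sorted order so the output is deterministic.
	keys := make([]string, 0, len(fields))
	for k := range fields {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Println(k, fields[k])
	}
}
```

Running it prints `mlx5_0/1.port_rcv_data 100` followed by `mlx5_0/1.symbol_error 0`. Counter names would still differ between platforms (perfmon uses names like "Packets Received Symbol Error"), so a real design would also need a per-backend name mapping, which is not shown here.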
Hi @gregorybrzeski, that looks interesting. It looks to me like the Windows drivers support more counters than their Linux counterparts (the KBytes/Sec counters), and they are named differently too. I am unable to develop for Windows; I only have access to InfiniBand on RHEL and CentOS, so someone else would have to contribute this. I also only have Mellanox hardware to test with, but according to the kernel documentation (https://www.kernel.org/doc/Documentation/ABI/stable/sysfs-class-infiniband), all InfiniBand devices are enumerated under /sys/class/infiniband. So stats should be consistent between different vendors on Linux, I think? At the moment the plugin is split into infiniband_linux.go, with support for Linux, and infiniband_nonlinux.go for all other platforms, so an infiniband_windows.go file could be created and Windows support added there.
I've finally got it building on Windows by making sure that any references to the rdmamap library are only built on Linux, a much simpler solution than the one I suggested earlier :)
Is it possible to request a review of this, please? Thanks!
Looks good, just a few minor changes.
func init() {
	inputs.Add("infiniband", func() telegraf.Input {
		log.Print("W! [inputs.infiniband] Current platform is not supported")
Move this into an Init() function, as we noticed that this placement causes the warning to be printed at every startup. See how the ethtool plugin does it:
https://github.com/influxdata/telegraf/blob/master/plugins/inputs/ethtool/ethtool_notlinux.go
Also, a bit of a nitpick, but can you call this file infiniband_notlinux.go for consistency?
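A simplified, self-contained sketch of the pattern being requested: the creator passed to the registry stays side-effect free, and the warning moves into Init(). `Input`, `registry` and `add` are hypothetical stand-ins for telegraf.Input and inputs.Add, and the `!linux` build constraint that the real infiniband_notlinux.go would carry is omitted so the sketch builds on any platform. It assumes, as the review comment implies, that the host may call creators without actually running the plugin.

```go
package main

import (
	"fmt"
	"log"
)

// Input is a trimmed-down, hypothetical stand-in for telegraf.Input.
type Input interface {
	Init() error
	Gather() error
}

// registry is a hypothetical stand-in for the inputs.Add registry.
var registry = map[string]func() Input{}

func add(name string, creator func() Input) { registry[name] = creator }

// warnings counts how often the unsupported-platform warning has fired.
var warnings int

// Infiniband is the non-Linux stub: it compiles everywhere but collects nothing.
type Infiniband struct{}

// Init runs only for configured plugin instances, so the warning is not
// emitted merely because the plugin is registered or constructed.
func (i *Infiniband) Init() error {
	log.Print("W! [inputs.infiniband] Current platform is not supported")
	warnings++
	return nil
}

// Gather is a no-op on unsupported platforms.
func (i *Infiniband) Gather() error { return nil }

func init() {
	// The creator itself logs nothing.
	add("infiniband", func() Input { return &Infiniband{} })
}

func main() {
	// Calling the creator alone (as a host might when listing plugins) is silent.
	_ = registry["infiniband"]()
	fmt.Println("warnings after create:", warnings)

	// Only initialising a configured instance emits the warning.
	p := registry["infiniband"]()
	if err := p.Init(); err != nil {
		log.Fatal(err)
	}
	fmt.Println("warnings after Init:", warnings)
}
```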
Sure, done! Don't worry about nitpicking, I'm always happy to learn a better way to do things :)
plugins/inputs/infiniband/LICENSE
@@ -0,0 +1,8 @@
Copyright 2019 United Kingdom Research and Innovation
Can you remove this file? We don't include any additional LICENSE or copyright notices outside of the top level LICENSE. You will still maintain copyright on this code, check the CLA for details.
Yep removed
// Sample configuration for plugin
var InfinibandConfig = `
  ## no config required
You can remove this comment line, Telegraf has magic to add something very similar when there is no config.
Removed
rdmaDevices := rdmamap.GetRdmaDeviceList()

if len(rdmaDevices) == 0 {
	return fmt.Errorf("No InfiniBand devices found on this system! Check /sys/class/infiniband/ exists")
https://github.com/golang/go/wiki/CodeReviewComments#error-strings
return fmt.Errorf("no InfiniBand devices found in /sys/class/infiniband/")
replaced based on your comment
I've implemented all changes as per your request :)
This PR adds a new input plugin for InfiniBand card/port statistics.
Implements #5686
Required for all PRs: