From f8ca456f3406ed3fcc1c51f6f44613ca73b4b645 Mon Sep 17 00:00:00 2001
From: Arun Saravanan Balachandran <52521751+ArunSaravananBalachandran@users.noreply.github.com>
Date: Thu, 8 Oct 2020 05:47:07 +0000
Subject: [PATCH] Add PCIe AER monitoring support (#678)

---
 doc/pcie-mon/pcie-monitoring-services-hld.md | 244 ++++++++++++++++++-
 1 file changed, 238 insertions(+), 6 deletions(-)

diff --git a/doc/pcie-mon/pcie-monitoring-services-hld.md b/doc/pcie-mon/pcie-monitoring-services-hld.md
index 90d3a978b7..83c3b9deba 100644
--- a/doc/pcie-mon/pcie-monitoring-services-hld.md
+++ b/doc/pcie-mon/pcie-monitoring-services-hld.md
@@ -1,13 +1,15 @@
 # SONiC PCIe Monitoring services HLD #
 
-### Rev 0.1 ###
+### Rev 0.3 ###
 
 ### Revision
- | Rev | Date | Author | Change Description |
- |:---:|:-----------:|:------------------:|------------------------------------------------|
- | 0.1 | | Sujin Kang | Initial version |
- | 0.2 | | Sujin Kang | Add rescan for pcie device missing during boot |
- | | | | Add pcied to PMON for runtime monitoring |
+ | Rev | Date | Author | Change Description |
+ |:---:|:-----------:|:----------------------------:|------------------------------------------------|
+ | 0.1 | | Sujin Kang | Initial version |
+ | 0.2 | | Sujin Kang | Add rescan for pcie device missing during boot |
+ | | | | Add pcied to PMON for runtime monitoring |
+ | 0.3 | | Arun Saravanan Balachandran | Add AER stats update support in pcied |
+ | | | | Add command to display AER stats |
 
 ## About This Manual ##
 
@@ -113,6 +115,236 @@ user@server:~$ redis-cli -n 6 SET "PCIE_STATUS|PCIE_DEVICES" "PASSED"
 
 ```
 
+## 2. PCIe AER stats collection design ##
+
+The PCIe AER stats for the supported PCIe devices will be collected by `pcied` and updated in STATE_DB during runtime.
+
+New sub-command group `pcie-aer` will be implemented in `pcieutil` to retrieve and tabulate the PCIe AER stats from STATE_DB.
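
The collection step described above can be pictured with a small sketch. It assumes the kernel's `aer_dev_correctable`, `aer_dev_fatal` and `aer_dev_nonfatal` sysfs files hold one `<name> <count>` pair per line. The helper names `parse_aer_stats`, `build_aer_fields` and `read_device_aer` are hypothetical and are not pcied's actual code. The sketch only shows how per-severity counters could be flattened into `severity|name` fields of the kind this design stores in STATE_DB.

```python
import os
from typing import Dict

# Severity name -> sysfs file name (assumed to live in the device's sysfs directory).
AER_FILES = {
    "correctable": "aer_dev_correctable",
    "fatal": "aer_dev_fatal",
    "non_fatal": "aer_dev_nonfatal",
}


def parse_aer_stats(text: str) -> Dict[str, str]:
    """Turn '<name> <count>' lines into {name: count}, skipping malformed lines."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = parts[1]
    return stats


def build_aer_fields(raw_by_severity: Dict[str, str]) -> Dict[str, str]:
    """Flatten per-severity file contents into 'severity|name' -> count fields."""
    fields = {}
    for severity, text in raw_by_severity.items():
        for name, count in parse_aer_stats(text).items():
            fields["{}|{}".format(severity, name)] = count
    return fields


def read_device_aer(device_dir: str) -> Dict[str, str]:
    """Read whichever AER files exist under device_dir; missing files are skipped."""
    raw = {}
    for severity, fname in AER_FILES.items():
        path = os.path.join(device_dir, fname)
        if os.path.isfile(path):
            with open(path) as f:
                raw[severity] = f.read()
    return build_aer_fields(raw)


if __name__ == "__main__":
    # Hypothetical file content, used only to exercise the parser.
    sample = "RxErr 0\nBadTLP 2\nTOTAL_ERR_COR 2\n"
    print(build_aer_fields({"correctable": sample}))
```

A device without AER support simply yields an empty dict here, which mirrors the "if available" wording of the design; writing the fields to STATE_DB is left out because the database API is outside the scope of this sketch.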
+
+### 2.1 Access the PCIe devices' AER stats from platform ###
+
+For an AER-supported PCIe device, the AER stats belonging to the severities `correctable`, `fatal` and `non_fatal` can be accessed via sysfs files (e.g. `/sys/bus/pci/devices/0000:01:00.1/aer_dev_correctable`, `/sys/bus/pci/devices/0000:01:00.1/aer_dev_fatal`, `/sys/bus/pci/devices/0000:01:00.1/aer_dev_nonfatal`) respectively.
+
+### 2.2 PCIe AER stats collection in pcied ###
+
+For PCIe devices that pass PcieUtil `get_pcie_check`, the AER stats, if available, will be retrieved and updated in STATE_DB every minute by pcied.
+
+### 2.3 STATE_DB keys and value ###
+
+The key used to represent a PCIe device for storing its AER stats in STATE_DB is of the format `PCIE_DEVICE|<Id>|<Bus>:<Dev>.<Fn>`.
+For every device, AER stats will be stored as key-value pairs where the key is of the format `<severity>|<AER error type>`.
+
+Example: for a PCIe device with Bus: 1, Dev: 0, Fn: 1 and Id: b960, the STATE_DB entry will be as below:
+
+```
+"PCIE_DEVICE|0xb960|01:00.1": {
+    "expireat": 1600170923.518816,
+    "ttl": -0.001,
+    "type": "hash",
+    "value": {
+        "correctable|BadDLLP": "0",
+        "correctable|BadTLP": "2",
+        "correctable|CorrIntErr": "0",
+        "correctable|HeaderOF": "0",
+        "correctable|NonFatalErr": "0",
+        "correctable|Rollover": "0",
+        "correctable|RxErr": "0",
+        "correctable|TOTAL_ERR_COR": "2",
+        "correctable|Timeout": "0",
+        "fatal|ACSViol": "0",
+        "fatal|AtomicOpBlocked": "0",
+        "fatal|BlockedTLP": "0",
+        "fatal|CmpltAbrt": "0",
+        "fatal|CmpltTO": "0",
+        "fatal|DLP": "0",
+        "fatal|ECRC": "0",
+        "fatal|FCP": "0",
+        "fatal|MalfTLP": "0",
+        "fatal|RxOF": "0",
+        "fatal|SDES": "0",
+        "fatal|TLP": "0",
+        "fatal|TLPBlockedErr": "0",
+        "fatal|TOTAL_ERR_FATAL": "0",
+        "fatal|UncorrIntErr": "0",
+        "fatal|Undefined": "0",
+        "fatal|UnsupReq": "0",
+        "fatal|UnxCmplt": "0",
+        "non_fatal|ACSViol": "0",
+        "non_fatal|AtomicOpBlocked": "0",
+        "non_fatal|BlockedTLP": "0",
+        "non_fatal|CmpltAbrt": "0",
+        "non_fatal|CmpltTO": "0",
+        "non_fatal|DLP": "0",
+        "non_fatal|ECRC": "0",
+        "non_fatal|FCP": "0",
+        "non_fatal|MalfTLP": "0",
+        "non_fatal|RxOF": "0",
+        "non_fatal|SDES": "0",
+        "non_fatal|TLP": "0",
+        "non_fatal|TLPBlockedErr": "0",
+        "non_fatal|TOTAL_ERR_NONFATAL": "3",
+        "non_fatal|UncorrIntErr": "0",
+        "non_fatal|Undefined": "0",
+        "non_fatal|UnsupReq": "3",
+        "non_fatal|UnxCmplt": "0"
+    }
+}
+```
+
+### 2.4 PCIe AER stats CLI ###
+
+A new `pcieutil pcie-aer` command is added to display the AER stats.
+
+```
+root@sonic:/home/admin# pcieutil
+Usage: pcieutil [OPTIONS] COMMAND [ARGS]...
+
+  pcieutil - Command line utility for checking pci device
+
+Options:
+  --help  Show this message and exit.
+
+Commands:
+  pcie-aer       Display PCIe AER status
+  pcie-check     Check PCIe Device
+  pcie-generate  Generate config file with current pci device
+  pcie-show      Display PCIe Device
+  version        Display version info
+root@sonic:/home/admin#
+```
+
+`pcieutil pcie-aer` has four sub-commands: `all`, `correctable`, `fatal` and `non-fatal`.
+The `all` command displays the AER stats for all severities; the `correctable`, `fatal` and `non-fatal` commands display the AER stats of the respective severity.
+
+```
+root@sonic:/home/admin# pcieutil pcie-aer
+Usage: pcieutil pcie-aer [OPTIONS] COMMAND [ARGS]...
+
+  Display PCIe AER status
+
+Options:
+  --help  Show this message and exit.
+ +Commands: + all Show all PCIe AER attributes + correctable Show PCIe AER correctable attributes + fatal Show PCIe AER fatal attributes + non-fatal Show PCIe AER non-fatal attributes +root@sonic:/home/admin# +``` + +Sample output: + +``` +root@sonic:/home/admin# pcieutil pcie-aer all ++---------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| AER - CORRECTABLE | 00:01.0 | 00:02.0 | 00:03.0 | 00:04.0 | 00:0f.0 | 00:13.0 | 00:14.0 | 00:14.1 | 00:14.2 | 01:00.0 | 01:00.1 | +| | 0x1f10 | 0x1f11 | 0x1f12 | 0x1f13 | 0x1f16 | 0x1f15 | 0x1f41 | 0x1f41 | 0x1f41 | 0xb960 | 0xb960 | ++=====================+===========+===========+===========+===========+===========+===========+===========+===========+===========+===========+===========+ +| RxErr | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++---------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| BadTLP | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | ++---------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| BadDLLP | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | ++---------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| Rollover | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++---------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| Timeout | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++---------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| NonFatalErr | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 
++---------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| CorrIntErr | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++---------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| HeaderOF | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++---------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| TOTAL_ERR_COR | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 2 | ++---------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ + ++-----------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| AER - FATAL | 00:01.0 | 00:02.0 | 00:03.0 | 00:04.0 | 00:0f.0 | 00:13.0 | 00:14.0 | 00:14.1 | 00:14.2 | 01:00.0 | 01:00.1 | +| | 0x1f10 | 0x1f11 | 0x1f12 | 0x1f13 | 0x1f16 | 0x1f15 | 0x1f41 | 0x1f41 | 0x1f41 | 0xb960 | 0xb960 | ++=================+===========+===========+===========+===========+===========+===========+===========+===========+===========+===========+===========+ +| Undefined | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++-----------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| DLP | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++-----------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| SDES | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++-----------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| TLP | 0 | 0 | 0 | 0 | 0 
| 0 | 0 | 0 | 0 | 0 | 0 | ++-----------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| FCP | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++-----------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| CmpltTO | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++-----------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| CmpltAbrt | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++-----------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| UnxCmplt | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++-----------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| RxOF | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++-----------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| MalfTLP | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++-----------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| ECRC | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++-----------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| UnsupReq | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++-----------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| ACSViol | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 
++-----------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| UncorrIntErr | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++-----------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| BlockedTLP | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++-----------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| AtomicOpBlocked | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++-----------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| TLPBlockedErr | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++-----------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| TOTAL_ERR_FATAL | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++-----------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ + ++--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| AER - NONFATAL | 00:01.0 | 00:02.0 | 00:03.0 | 00:04.0 | 00:0f.0 | 00:13.0 | 00:14.0 | 00:14.1 | 00:14.2 | 01:00.0 | 01:00.1 | +| | 0x1f10 | 0x1f11 | 0x1f12 | 0x1f13 | 0x1f16 | 0x1f15 | 0x1f41 | 0x1f41 | 0x1f41 | 0xb960 | 0xb960 | ++====================+===========+===========+===========+===========+===========+===========+===========+===========+===========+===========+===========+ +| Undefined | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| 
DLP | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| SDES | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| TLP | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| FCP | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| CmpltTO | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| CmpltAbrt | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| UnxCmplt | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| RxOF | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| MalfTLP | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| ECRC | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 
++--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| UnsupReq | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | ++--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| ACSViol | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| UncorrIntErr | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| BlockedTLP | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| AtomicOpBlocked | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| TLPBlockedErr | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ +| TOTAL_ERR_NONFATAL | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | ++--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ + +root@sonic:/home/admin# +``` + < TBA > ## Open Questions ##