feat: add HDD monitoring items to hdd_monitor #721

Merged
Changes from 12 commits
@@ -5,10 +5,19 @@
disks: # Until multi type lists are allowed, name N the disks as disk0...disk{N-1}
  disk0:
    name: /dev/sda3
    temp_attribute_id: 0xC2
    temp_warn: 55.0
    temp_error: 70.0
    power_on_hours_attribute_id: 0x09
    power_on_hours_warn: 3000000
    total_data_written_attribute_id: 0xF1
    total_data_written_warn: 4915200 # =150TB (1unit=32MB)
    total_data_written_safety_factor: 0.05
    recovered_error_attribute_id: 0xC3
    recovered_error_warn: 1
    free_warn: 5120 # MB(8hour)
    free_error: 100 # MB(last 1 minute)
    read_data_rate_warn: 360.0 # MB/s
    write_data_rate_warn: 103.5 # MB/s
    read_iops_warn: 63360.0 # IOPS
    write_iops_warn: 24120.0 # IOPS
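The comment `=150TB (1unit=32MB)` on `total_data_written_warn` can be checked with a little arithmetic. The sketch below is illustrative only: `unitsToTB` is a hypothetical helper, not part of hdd_monitor, and it assumes binary prefixes (1 TB = 1024 × 1024 MB), which is the reading under which 4915200 units come out to exactly 150 TB. The unit size itself is device dependent, so 32 MB is only the value used in this config.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of hdd_monitor): convert a raw
// "total data written" count into terabytes.
// Assumption: binary prefixes, i.e. 1 TB = 1024 * 1024 MB.
constexpr double unitsToTB(std::uint64_t units, std::uint64_t mb_per_unit)
{
  return static_cast<double>(units * mb_per_unit) / (1024.0 * 1024.0);
}

// 4915200 units * 32 MB/unit = 157286400 MB = 150 TB
static_assert(unitsToTB(4915200, 32) == 150.0, "threshold should equal 150 TB");
```

For a drive that reports in a different unit size, the same helper can be used to pick a `total_data_written_warn` value that corresponds to the intended endurance limit.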
@@ -138,6 +138,36 @@
  contains: [": HDD Temperature"]
  timeout: 3.0

recovered_error:
  type: diagnostic_aggregator/GenericAnalyzer
  path: recovered_error
  contains: [": HDD RecoveredError"]
  timeout: 3.0

read_data_rate:
  type: diagnostic_aggregator/GenericAnalyzer
  path: read_data_rate
  contains: [": HDD ReadDataRate"]
  timeout: 3.0

write_data_rate:
  type: diagnostic_aggregator/GenericAnalyzer
  path: write_data_rate
  contains: [": HDD WriteDataRate"]
  timeout: 3.0

read_iops:
  type: diagnostic_aggregator/GenericAnalyzer
  path: read_iops
  contains: [": HDD ReadIOPS"]
  timeout: 3.0

write_iops:
  type: diagnostic_aggregator/GenericAnalyzer
  path: write_iops
  contains: [": HDD WriteIOPS"]
  timeout: 3.0

usage:
  type: diagnostic_aggregator/GenericAnalyzer
  path: usage
@@ -156,6 +186,12 @@
  contains: [": HDD TotalDataWritten"]
  timeout: 3.0

connection:
  type: diagnostic_aggregator/GenericAnalyzer
  path: connection
  contains: [": HDD Connection"]
  timeout: 3.0

process:
  type: diagnostic_aggregator/AnalyzerGroup
  path: process
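Each new analyzer entry routes diagnostic statuses to a `path` based on the status name. As a rough illustration, the sketch below assumes that `contains` is a plain substring match against the status name, and that names look like `hdd_monitor: HDD ReadIOPS`. `AnalyzerRule`, `matchPath` and `hddRules` are hypothetical names for this example and are not diagnostic_aggregator APIs.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for one analyzer entry above.
struct AnalyzerRule
{
  std::string path;
  std::vector<std::string> contains;
};

// Returns the path of the first rule whose `contains` list holds a substring
// of `status_name`, or an empty string when nothing matches.
// Assumption: `contains` means substring matching on the status name.
std::string matchPath(const std::vector<AnalyzerRule> & rules, const std::string & status_name)
{
  for (const auto & rule : rules) {
    for (const auto & needle : rule.contains) {
      if (status_name.find(needle) != std::string::npos) {
        return rule.path;
      }
    }
  }
  return "";
}

// The six analyzers added in this change, written as rules.
std::vector<AnalyzerRule> hddRules()
{
  return {
    {"recovered_error", {": HDD RecoveredError"}},
    {"read_data_rate", {": HDD ReadDataRate"}},
    {"write_data_rate", {": HDD WriteDataRate"}},
    {"read_iops", {": HDD ReadIOPS"}},
    {"write_iops", {": HDD WriteIOPS"}},
    {"connection", {": HDD Connection"}},
  };
}
```

Under these assumptions, `matchPath(hddRules(), "hdd_monitor: HDD ReadIOPS")` yields `read_iops`, while a name that none of the six rules covers yields an empty string.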
6 changes: 6 additions & 0 deletions system/system_monitor/README.md
@@ -63,7 +63,13 @@ Every topic is published in 1 minute interval.
| HDD Monitor | HDD Temperature | ✓ | ✓ | ✓ | |
| | HDD PowerOnHours | ✓ | ✓ | ✓ | |
| | HDD TotalDataWritten | ✓ | ✓ | ✓ | |
| | HDD RecoveredError | ✓ | ✓ | ✓ | |
| | HDD Usage | ✓ | ✓ | ✓ | |
| | HDD ReadDataRate | ✓ | ✓ | ✓ | |
| | HDD WriteDataRate | ✓ | ✓ | ✓ | |
| | HDD ReadIOPS | ✓ | ✓ | ✓ | |
| | HDD WriteIOPS | ✓ | ✓ | ✓ | |
| | HDD Connection | ✓ | ✓ | ✓ | |
| Memory Monitor | Memory Usage | ✓ | ✓ | ✓ | |
| Net Monitor | Network Usage | ✓ | ✓ | ✓ | |
| | Network CRC Error | ✓ | ✓ | ✓ | Warning occurs when the number of CRC errors in the period reaches the threshold value. The number of CRC errors that occur is the same as the value that can be confirmed with the ip command. |
9 changes: 9 additions & 0 deletions system/system_monitor/config/hdd_monitor.param.yaml
@@ -5,10 +5,19 @@
disks: # Until multi type lists are allowed, name N the disks as disk0...disk{N-1}
  disk0:
    name: /
    temp_attribute_id: 0xC2
    temp_warn: 55.0
    temp_error: 70.0
    power_on_hours_attribute_id: 0x09
    power_on_hours_warn: 3000000
    total_data_written_attribute_id: 0xF1
    total_data_written_warn: 4915200 # =150TB (1unit=32MB)
    total_data_written_safety_factor: 0.05
    recovered_error_attribute_id: 0xC3
    recovered_error_warn: 1
    free_warn: 5120 # MB(8hour)
    free_error: 100 # MB(last 1 minute)
    read_data_rate_warn: 360.0 # MB/s
    write_data_rate_warn: 103.5 # MB/s
    read_iops_warn: 63360.0 # IOPS
    write_iops_warn: 24120.0 # IOPS
8 changes: 8 additions & 0 deletions system/system_monitor/docs/ros_parameters.md
@@ -25,12 +25,20 @@ hdd_monitor:
| Name | Type | Unit | Default | Notes |
| :------------------------------- | :----: | :---------------: | :-----: | :--------------------------------------------------------------------------------- |
| name | string | n/a | none | The disk name to monitor temperature. (e.g. /dev/sda) |
| temp_attribute_id | int | n/a | 0xC2 | S.M.A.R.T attribute ID of temperature. |
| temp_warn | float | DegC | 55.0 | Generates warning when HDD temperature reaches a specified value or higher. |
| temp_error | float | DegC | 70.0 | Generates error when HDD temperature reaches a specified value or higher. |
| power_on_hours_attribute_id | int | n/a | 0x09 | S.M.A.R.T attribute ID of power-on hours. |
| power_on_hours_warn | int | Hour | 3000000 | Generates warning when HDD power-on hours reaches a specified value or higher. |
| total_data_written_attribute_id | int | n/a | 0xF1 | S.M.A.R.T attribute ID of total data written. |
| total_data_written_warn | int | depends on device | 4915200 | Generates warning when HDD total data written reaches a specified value or higher. |
| total_data_written_safety_factor | float | %(1e-2) | 0.05 | Safety factor of HDD total data written. |
| recovered_error_attribute_id | int | n/a | 0xC3 | S.M.A.R.T attribute ID of recovered error. |
| recovered_error_warn | int | n/a | 1 | Generates warning when HDD recovered error reaches a specified value or higher. |
| read_data_rate_warn | float | MB/s | 360.0 | Generates warning when HDD read data rate reaches a specified value or higher. |
| write_data_rate_warn | float | MB/s | 103.5 | Generates warning when HDD write data rate reaches a specified value or higher. |
| read_iops_warn | float | IOPS | 63360.0 | Generates warning when HDD read IOPS reaches a specified value or higher. |
| write_iops_warn | float | IOPS | 24120.0 | Generates warning when HDD write IOPS reaches a specified value or higher. |
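The notes in this table all follow the same pattern: a status is raised when the measured value "reaches a specified value or higher". A minimal sketch of that rule is shown below, reading the phrase as `value >= threshold`. The `Level` enum and both function names are hypothetical and only illustrate the comparison; they are not taken from hdd_monitor.

```cpp
#include <cassert>

// Hypothetical level enum mirroring the OK / WARN / ERROR levels in the docs.
enum class Level { Ok, Warn, Error };

// Items with both thresholds (temperature): check the error threshold first,
// so that a value above both thresholds is reported as an error.
// "reaches a specified value or higher" is read here as value >= threshold.
Level classifyTemperature(float temp, float temp_warn, float temp_error)
{
  if (temp >= temp_error) {
    return Level::Error;
  }
  if (temp >= temp_warn) {
    return Level::Warn;
  }
  return Level::Ok;
}

// Warn-only items (e.g. read_data_rate_warn, write_iops_warn) have no error
// threshold in the table above.
Level classifyWarnOnly(double value, double warn)
{
  return value >= warn ? Level::Warn : Level::Ok;
}
```

With the defaults from the table, 37.0 DegC is OK, 55.0 DegC is already a warning, and 70.0 DegC is an error.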

hdd_monitor:

144 changes: 130 additions & 14 deletions system/system_monitor/docs/topics_hdd_monitor.md
@@ -14,13 +14,13 @@

<b>[values]</b>

| key | value (example) |
| ---------------------- | ---------------------------- |
| HDD [0-9]: status | OK / hot / critical hot |
| HDD [0-9]: name | /dev/nvme0 |
| HDD [0-9]: model | SAMSUNG MZVLB1T0HBLR-000L7 |
| HDD [0-9]: serial | S4EMNF0M820682 |
| HDD [0-9]: temperature | 37.0 DegC <br> not available |

## <u>HDD PowerOnHours</u>

@@ -35,13 +35,13 @@

<b>[values]</b>

| key | value (example) |
| ------------------------- | ----------------------------- |
| HDD [0-9]: status | OK / lifetime limit |
| HDD [0-9]: name | /dev/nvme0 |
| HDD [0-9]: model | PHISON PS5012-E12S-512G |
| HDD [0-9]: serial | FB590709182505050767 |
| HDD [0-9]: power on hours | 4834 Hours <br> not available |

## <u>HDD TotalDataWritten</u>

@@ -64,6 +64,27 @@
| HDD [0-9]: serial | FB590709182505050767 |
| HDD [0-9]: total data written | 146295330 <br> not available |

## <u>HDD RecoveredError</u>

/diagnostics/hdd_monitor: HDD RecoveredError

<b>[summary]</b>

| level | message |
| ----- | -------------------- |
| OK | OK |
| WARN | high soft error rate |

<b>[values]</b>

| key | value (example) |
| -------------------------- | ------------------------- |
| HDD [0-9]: status | OK / high soft error rate |
| HDD [0-9]: name | /dev/nvme0 |
| HDD [0-9]: model | PHISON PS5012-E12S-512G |
| HDD [0-9]: serial | FB590709182505050767 |
| HDD [0-9]: recovered error | 0 <br> not available |

## <u>HDD Usage</u>

/diagnostics/hdd_monitor: HDD Usage
@@ -87,3 +108,98 @@
| HDD [0-9]: avail | 749G |
| HDD [0-9]: use | 69% |
| HDD [0-9]: mounted on | / |

## <u>HDD ReadDataRate</u>

/diagnostics/hdd_monitor: HDD ReadDataRate

<b>[summary]</b>

| level | message |
| ----- | ---------------------- |
| OK | OK |
| WARN | high data rate of read |

<b>[values]</b>

| key | value (example) |
| ---------------------------- | --------------------------- |
| HDD [0-9]: status | OK / high data rate of read |
| HDD [0-9]: name | /dev/nvme0 |
| HDD [0-9]: data rate of read | 0.00 MB/s |

## <u>HDD WriteDataRate</u>

/diagnostics/hdd_monitor: HDD WriteDataRate

<b>[summary]</b>

| level | message |
| ----- | ----------------------- |
| OK | OK |
| WARN | high data rate of write |

<b>[values]</b>

| key | value (example) |
| ----------------------------- | ---------------------------- |
| HDD [0-9]: status | OK / high data rate of write |
| HDD [0-9]: name | /dev/nvme0 |
| HDD [0-9]: data rate of write | 0.00 MB/s |

## <u>HDD ReadIOPS</u>

/diagnostics/hdd_monitor: HDD ReadIOPS

<b>[summary]</b>

| level | message |
| ----- | ----------------- |
| OK | OK |
| WARN | high IOPS of read |

<b>[values]</b>

| key | value (example) |
| ----------------------- | ---------------------- |
| HDD [0-9]: status | OK / high IOPS of read |
| HDD [0-9]: name | /dev/nvme0 |
| HDD [0-9]: IOPS of read | 0.00 IOPS |

## <u>HDD WriteIOPS</u>

/diagnostics/hdd_monitor: HDD WriteIOPS

<b>[summary]</b>

| level | message |
| ----- | ------------------ |
| OK | OK |
| WARN | high IOPS of write |

<b>[values]</b>

| key | value (example) |
| ------------------------ | ----------------------- |
| HDD [0-9]: status | OK / high IOPS of write |
| HDD [0-9]: name | /dev/nvme0 |
| HDD [0-9]: IOPS of write | 0.00 IOPS |

## <u>HDD Connection</u>

/diagnostics/hdd_monitor: HDD Connection

<b>[summary]</b>

| level | message |
| ----- | ------------- |
| OK | OK |
| WARN | not connected |

<b>[values]</b>

| key | value (example) |
| ---------------------- | ------------------ |
| HDD [0-9]: status | OK / not connected |
| HDD [0-9]: name | /dev/nvme0 |
| HDD [0-9]: mount point | / |
28 changes: 21 additions & 7 deletions system/system_monitor/include/hdd_reader/hdd_reader.hpp
@@ -28,19 +28,20 @@
#include <map>
#include <string>

/**
* @brief ATA attribute IDs
*/
enum class ATAAttributeIDs : uint8_t { TEMPERATURE = 0, POWER_ON_HOURS = 1, SIZE };

/**
* @brief HDD device
*/
struct HDDDevice
{
std::string name_; //!< @brief Device name
uint8_t temp_attribute_id_; //!< @brief S.M.A.R.T attribute ID of temperature
uint8_t power_on_hours_attribute_id_; //!< @brief S.M.A.R.T attribute ID of power on hours
uint8_t
total_data_written_attribute_id_; //!< @brief S.M.A.R.T attribute ID of total data written
uint8_t recovered_error_attribute_id_; //!< @brief S.M.A.R.T attribute ID of recovered error

uint8_t unmount_request_flag_; //!< @brief unmount request flag
std::string part_device_; //!< @brief partition device

/**
* @brief Load or save data members.
@@ -53,7 +54,12 @@ struct HDDDevice
void serialize(archive & ar, const unsigned /*version*/) // NOLINT(runtime/references)
{
ar & name_;
ar & temp_attribute_id_;
ar & power_on_hours_attribute_id_;
ar & total_data_written_attribute_id_;
ar & recovered_error_attribute_id_;
ar & unmount_request_flag_;
ar & part_device_;
}
};
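The `serialize` member writes and reads the fields in one fixed order through `ar & member`, so the side that saves an `HDDDevice` and the side that loads it have to agree on the same field list. The sketch below illustrates that idea with toy archives written for this example; `ToyOut`, `ToyIn`, `Device` and `roundTrip` are hypothetical stand-ins and do not reproduce the real archive classes or their wire format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <ios>
#include <sstream>
#include <string>

// Toy output archive: supports only the two field types used below.
struct ToyOut
{
  std::ostringstream os;
  ToyOut & operator&(const std::string & s)
  {
    os << s.size() << ' ' << s << ' ';  // length-prefixed string
    return *this;
  }
  ToyOut & operator&(std::uint8_t v)
  {
    os << static_cast<unsigned>(v) << ' ';  // write as a number, not a char
    return *this;
  }
};

// Toy input archive: reads fields back in the order they were written.
struct ToyIn
{
  std::istringstream is;
  explicit ToyIn(const std::string & data) : is(data) {}
  ToyIn & operator&(std::string & s)
  {
    std::size_t n = 0;
    is >> n;
    is.get();  // skip the separator after the length
    s.resize(n);
    is.read(&s[0], static_cast<std::streamsize>(n));
    return *this;
  }
  ToyIn & operator&(std::uint8_t & v)
  {
    unsigned tmp = 0;
    is >> tmp;
    v = static_cast<std::uint8_t>(tmp);
    return *this;
  }
};

// Reduced, hypothetical version of HDDDevice with three of its fields.
struct Device
{
  std::string name_;
  std::uint8_t temp_attribute_id_;
  std::uint8_t recovered_error_attribute_id_;

  // One function serves both directions, as in the header above.
  template <typename Archive>
  void serialize(Archive & ar)
  {
    ar & name_;
    ar & temp_attribute_id_;
    ar & recovered_error_attribute_id_;
  }
};

// Save a device with ToyOut, then load it again with ToyIn.
Device roundTrip(const Device & in)
{
  ToyOut out;
  Device copy = in;
  copy.serialize(out);
  ToyIn reader(out.os.str());
  Device result{};
  result.serialize(reader);
  return result;
}
```

Because the same `serialize` body drives both saving and loading, adding a field such as `recovered_error_attribute_id_` changes both directions at once, provided that both programs are rebuilt from the same header.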

@@ -70,7 +76,11 @@ struct HDDInfo
// in S.M.A.R.T. information.
uint64_t power_on_hours_; //!< @brief power on hours count
uint64_t total_data_written_; //!< @brief total data written
uint32_t recovered_error_; //!< @brief recovered error count
bool is_valid_temp_; //!< @brief whether temp_ is valid value
bool is_valid_power_on_hours_; //!< @brief whether power_on_hours_ is valid value
bool is_valid_total_data_written_; //!< @brief whether total_data_written_ is valid value
bool is_valid_recovered_error_; //!< @brief whether recovered_error_ is valid value

/**
* @brief Load or save data members.
@@ -88,7 +98,11 @@ struct HDDInfo
ar & temp_;
ar & power_on_hours_;
ar & total_data_written_;
ar & recovered_error_;
ar & is_valid_temp_;
ar & is_valid_power_on_hours_;
ar & is_valid_total_data_written_;
ar & is_valid_recovered_error_;
}
};
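The new `is_valid_*` flags line up with the `not available` example values in topics_hdd_monitor.md. One plausible way to turn a reading plus its flag into the displayed text is sketched below. `HddInfoSubset` and both function names are hypothetical, the `uint8_t` type of `temp_` is an assumption (the field's declaration is not shown in this diff), and the exact formatting is inferred from the documented examples only.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical subset of HDDInfo, reduced to the fields used here.
// Assumption: temp_ is a small unsigned integer; its type is not shown above.
struct HddInfoSubset
{
  std::uint8_t temp_;
  std::uint32_t recovered_error_;
  bool is_valid_temp_;
  bool is_valid_recovered_error_;
};

// Formats the temperature like the documented example "37.0 DegC", or
// falls back to "not available" when the reading is flagged invalid.
std::string temperatureText(const HddInfoSubset & info)
{
  if (!info.is_valid_temp_) {
    return "not available";
  }
  char buf[32];
  std::snprintf(buf, sizeof(buf), "%.1f DegC", static_cast<double>(info.temp_));
  return buf;
}

// Same idea for the recovered error count, which is shown as a bare number.
std::string recoveredErrorText(const HddInfoSubset & info)
{
  return info.is_valid_recovered_error_ ? std::to_string(info.recovered_error_)
                                        : "not available";
}
```

Keeping a separate flag per value means one unsupported S.M.A.R.T. attribute does not hide the readings that the drive does report.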
