feat(intel_powerstat): add uncore frequency metrics
bkotlowski committed Jun 3, 2022
1 parent c6ed8bb commit 066602e
Showing 9 changed files with 379 additions and 16 deletions.
20 changes: 16 additions & 4 deletions plugins/inputs/intel_powerstat/README.md
Expand Up @@ -16,7 +16,7 @@ to take preventive/corrective actions based on platform busyness, CPU temperatur
## - Setting this value to an empty array means no package metrics will be collected
## - Finally, a user can specify individual metrics to capture from the supported options list
## Supported options:
## "current_power_consumption", "current_dram_power_consumption", "thermal_design_power", "max_turbo_frequency"
## "current_power_consumption", "current_dram_power_consumption", "thermal_design_power", "max_turbo_frequency", "uncore_frequency"
# package_metrics = ["current_power_consumption", "current_dram_power_consumption", "thermal_design_power"]

## The user can choose which per-CPU metrics are monitored by the plugin in cpu_metrics array.
Expand Down Expand Up @@ -69,7 +69,7 @@ This configuration allows getting all processor package specific metrics and all

```toml
[[inputs.intel_powerstat]]
package_metrics = ["current_power_consumption", "current_dram_power_consumption", "thermal_design_power", "max_turbo_frequency"]
package_metrics = ["current_power_consumption", "current_dram_power_consumption", "thermal_design_power", "max_turbo_frequency", "uncore_frequency"]
cpu_metrics = ["cpu_frequency", "cpu_busy_frequency", "cpu_temperature", "cpu_c0_state_residency", "cpu_c1_state_residency", "cpu_c6_state_residency"]
```

Expand All @@ -81,8 +81,9 @@ The following dependencies are expected by plugin:
- _intel-rapl_ module which exposes Intel Runtime Power Limiting metrics over `sysfs` (`/sys/devices/virtual/powercap/intel-rapl`),
- _msr_ kernel module that provides access to processor model specific registers over `devfs` (`/dev/cpu/cpu%d/msr`),
- _cpufreq_ kernel module - which exposes per-CPU Frequency over `sysfs` (`/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq`),
- _intel-uncore-frequency_ module which exposes Intel uncore frequency metrics over `sysfs` (`/sys/devices/system/cpu/intel_uncore_frequency`).

Minimum kernel version required is 3.13 to satisfy all requirements.
The minimum kernel version required is 3.13 to satisfy most requirements; `uncore_frequency` metrics additionally require the `intel-uncore-frequency` module (available since kernel 5.6).

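The dependency on `intel-uncore-frequency` can be checked before enabling `uncore_frequency`. The following is a minimal sketch, not the plugin's implementation: the `package_%02d_die_%02d` directory layout and the `*_freq_khz` file names are assumptions based on the kernel's uncore frequency scaling documentation, and `parseKHzToMHz` and `readUncoreLimitMHz` are hypothetical helpers introduced only for illustration.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// uncoreRoot is the sysfs directory named in the README above.
const uncoreRoot = "/sys/devices/system/cpu/intel_uncore_frequency"

// parseKHzToMHz converts the text content of a *_freq_khz file to MHz.
// Hypothetical helper; the kHz unit is assumed from the file names.
func parseKHzToMHz(raw string) (float64, error) {
	khz, err := strconv.ParseUint(strings.TrimSpace(raw), 10, 64)
	if err != nil {
		return 0, fmt.Errorf("parsing uncore frequency %q: %w", raw, err)
	}
	return float64(khz) / 1000, nil
}

// readUncoreLimitMHz reads one limit file for a package/die pair.
// The directory naming scheme is an assumption, not taken from the commit.
func readUncoreLimitMHz(root string, pkg, die int, file string) (float64, error) {
	dir := fmt.Sprintf("package_%02d_die_%02d", pkg, die)
	data, err := os.ReadFile(filepath.Join(root, dir, file))
	if err != nil {
		return 0, err
	}
	return parseKHzToMHz(string(data))
}

func main() {
	// A missing directory usually means the module is not loaded.
	if _, err := os.Stat(uncoreRoot); err != nil {
		fmt.Println("intel-uncore-frequency does not appear to be loaded:", err)
		return
	}
	files := []string{"initial_min_freq_khz", "initial_max_freq_khz", "min_freq_khz", "max_freq_khz"}
	for _, f := range files {
		mhz, err := readUncoreLimitMHz(uncoreRoot, 0, 0, f)
		if err != nil {
			fmt.Println(f, "unreadable:", err)
			continue
		}
		fmt.Printf("%s = %.0f MHz\n", f, mhz)
	}
}
```

On a host without the module, the program only reports that the directory is missing, which mirrors the situation in which the plugin logs its "module was not loaded" error.
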
Please make sure that kernel modules are loaded and running (cpufreq is integrated in kernel). Modules might have to be manually enabled by using `modprobe`.
Depending on the kernel version, run commands:
Expand All @@ -94,6 +95,9 @@ sudo modprobe msr
sudo modprobe intel_rapl_common
sudo modprobe intel_rapl_msr

# also for kernel >= 5.6.0
sudo modprobe intel-uncore-frequency

# kernel 4.x.x:
sudo modprobe msr
sudo modprobe intel_rapl
Expand All @@ -111,6 +115,7 @@ to retrieve data for calculation of most critical per-CPU specific metrics:
and to retrieve data for calculation of per-package specific metrics:

- `max_turbo_frequency_mhz`
- `uncore_frequency_mhz_cur`

To expose other Intel PowerStat metrics root access may or may not be required (depending on OS type or configuration).

Expand Down Expand Up @@ -225,17 +230,22 @@ When starting to measure metrics, plugin skips first iteration of metrics if the
|-----|-------------|
| `package_id` | ID of platform package/socket |
| `active_cores` | Specific tag for `max_turbo_frequency_mhz` metric. The maximum number of activated cores for reachable turbo frequency |
| `die` | Specific tag for all `uncore_frequency` metrics. ID of die |
| `type` | Specific tag for all `uncore_frequency` metrics. Type of uncore frequency (current or initial) |

Measurement powerstat_package metrics are collected per processor package; the _package_id_ tag indicates which package a metric refers to.

- Available metrics for powerstat_package measurement

| Metric name (field) | Description | Units |
|-----|-------------|-----|
| `thermal_design_power_watts` | Maximum Thermal Design Power (TDP) available for processor package | Watts |
| `thermal_design_power_watts` | Maximum Thermal Design Power (TDP) available for processor package | Watts |
| `current_power_consumption_watts` | Current power consumption of processor package | Watts |
| `current_dram_power_consumption_watts` | Current power consumption of processor package DRAM subsystem | Watts |
| `max_turbo_frequency_mhz` | Maximum reachable turbo frequency for number of cores active | MHz |
| `uncore_frequency_limit_mhz_min` | Minimum uncore frequency limit for die in processor package | MHz |
| `uncore_frequency_limit_mhz_max` | Maximum uncore frequency limit for die in processor package | MHz |
| `uncore_frequency_mhz_cur` | Current uncore frequency for die in processor package. Available only when the `type` tag is `current`. Since this value is not yet available from the `intel-uncore-frequency` module, it is read via MSR. If the msr module is not loaded, only `uncore_frequency_limit_mhz_min` and `uncore_frequency_limit_mhz_max` are collected | MHz |

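The current uncore frequency is derived from the raw `MSR_UNCORE_PERF_STATUS` value. The sketch below mirrors the arithmetic used in this commit's `readUncoreFreq` (mask the low bits with `0x3F`, then multiply the ratio by 100 to get MHz). The function and constant names are hypothetical, and reading the register itself is left out.

```go
package main

import "fmt"

const (
	// uncoreRatioMask matches the 0x3F mask applied in this commit.
	uncoreRatioMask = 0x3F
	// uncoreRatioStepMHz is the multiplier applied to the ratio in this commit.
	uncoreRatioStepMHz = 100
)

// decodeUncoreMHz converts a raw MSR_UNCORE_PERF_STATUS value to MHz.
// Hypothetical helper name; the arithmetic follows readUncoreFreq.
func decodeUncoreMHz(raw uint64) uint64 {
	return (raw & uncoreRatioMask) * uncoreRatioStepMHz
}

func main() {
	// Sample raw values chosen for illustration only.
	for _, raw := range []uint64{0x08, 0x0A, 0x18} {
		fmt.Printf("raw=%#x -> %d MHz\n", raw, decodeUncoreMHz(raw))
	}
}
```

A raw value of 10 decodes to 1000 MHz, which is the pairing used by `TestReadUncoreFreq` in this commit.
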
### Known issues

Expand All @@ -258,6 +268,8 @@ powerstat_package,host=ubuntu,package_id=0 current_power_consumption_watts=35 16
powerstat_package,host=ubuntu,package_id=0 current_dram_power_consumption_watts=13.94 1606494744000000000
powerstat_package,host=ubuntu,package_id=0,active_cores=0 max_turbo_frequency_mhz=3000i 1606494744000000000
powerstat_package,host=ubuntu,package_id=0,active_cores=1 max_turbo_frequency_mhz=2800i 1606494744000000000
powerstat_package,die=0,host=ubuntu,package_id=0,type=initial uncore_frequency_limit_mhz_min=800,uncore_frequency_limit_mhz_max=2400 1606494744000000000
powerstat_package,die=0,host=ubuntu,package_id=0,type=current uncore_frequency_mhz_cur=800i,uncore_frequency_limit_mhz_min=800,uncore_frequency_limit_mhz_max=2400 1606494744000000000
powerstat_core,core_id=0,cpu_id=0,host=ubuntu,package_id=0 cpu_frequency_mhz=1200.29 1606494744000000000
powerstat_core,core_id=0,cpu_id=0,host=ubuntu,package_id=0 cpu_temperature_celsius=34i 1606494744000000000
powerstat_core,core_id=0,cpu_id=0,host=ubuntu,package_id=0 cpu_c6_state_residency_percent=92.52 1606494744000000000
Expand Down
17 changes: 16 additions & 1 deletion plugins/inputs/intel_powerstat/file_mock_test.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

115 changes: 108 additions & 7 deletions plugins/inputs/intel_powerstat/intel_powerstat.go
Expand Up @@ -6,6 +6,7 @@ package intel_powerstat

import (
_ "embed"
"errors"
"fmt"
"math/big"
"strconv"
Expand Down Expand Up @@ -33,6 +34,7 @@ const (
packageCurrentDramPowerConsumption = "current_dram_power_consumption"
packageThermalDesignPower = "thermal_design_power"
packageTurboLimit = "max_turbo_frequency"
packageUncoreFrequency = "uncore_frequency"
percentageMultiplier = 100
)

Expand All @@ -57,6 +59,7 @@ type PowerStat struct {
packageCurrentPowerConsumption bool
packageCurrentDramPowerConsumption bool
packageThermalDesignPower bool
packageUncoreFrequency bool
cpuInfo map[string]*cpuInfo
skipFirstIteration bool
logOnce map[string]error
Expand All @@ -76,10 +79,10 @@ func (p *PowerStat) Init() error {
}
// Initialize MSR service only when there is at least one metric enabled
if p.cpuFrequency || p.cpuBusyFrequency || p.cpuTemperature || p.cpuC0StateResidency || p.cpuC1StateResidency ||
p.cpuC6StateResidency || p.cpuBusyCycles || p.packageTurboLimit {
p.cpuC6StateResidency || p.cpuBusyCycles || p.packageTurboLimit || p.packageUncoreFrequency {
p.msr = newMsrServiceWithFs(p.Log, p.fs)
}
if p.packageCurrentPowerConsumption || p.packageCurrentDramPowerConsumption || p.packageThermalDesignPower || p.packageTurboLimit {
if p.packageCurrentPowerConsumption || p.packageCurrentDramPowerConsumption || p.packageThermalDesignPower || p.packageTurboLimit || p.packageUncoreFrequency {
p.rapl = newRaplServiceWithFs(p.Log, p.fs)
}

Expand All @@ -97,7 +100,17 @@ func (p *PowerStat) Gather(acc telegraf.Accumulator) error {
}

if p.areCoreMetricsEnabled() {
p.addPerCoreMetrics(acc)
if p.msr.isMsrLoaded() {
p.logOnce["msr"] = nil
p.addPerCoreMetrics(acc)
} else {
err := errors.New("error while trying to read MSR (probably msr module was not loaded)")
if val := p.logOnce["msr"]; val == nil || val.Error() != err.Error() {
p.Log.Errorf("%v", err)
// Remember that specific error occurs to omit logging next time
p.logOnce["msr"] = err
}
}
}

// Gathering the first iteration of metrics was skipped for most of them because they are based on delta calculations
Expand All @@ -109,25 +122,31 @@ func (p *PowerStat) Gather(acc telegraf.Accumulator) error {
func (p *PowerStat) addGlobalMetrics(acc telegraf.Accumulator) {
// Prepare RAPL data each gather because there is a possibility to disable rapl kernel module
p.rapl.initializeRaplData()

for socketID := range p.rapl.getRaplData() {
if p.packageTurboLimit {
p.addTurboRatioLimit(socketID, acc)
}

if p.packageUncoreFrequency {
die := maxDiePerSocket(socketID)
for actualDie := 0; actualDie < die; actualDie++ {
p.addUncoreFreq(socketID, strconv.Itoa(actualDie), acc)
}
}

err := p.rapl.retrieveAndCalculateData(socketID)
if err != nil {
// In case of an error skip calculating metrics for this socket
if val := p.logOnce[socketID]; val == nil || val.Error() != err.Error() {
if val := p.logOnce[socketID+"rapl"]; val == nil || val.Error() != err.Error() {
p.Log.Errorf("error fetching rapl data for socket %s, err: %v", socketID, err)
// Remember that specific error occurs for socketID to omit logging next time
p.logOnce[socketID] = err
p.logOnce[socketID+"rapl"] = err
}
continue
}

// If error stops occurring, clear logOnce indicator
p.logOnce[socketID] = nil
p.logOnce[socketID+"rapl"] = nil
if p.packageThermalDesignPower {
p.addThermalDesignPowerMetric(socketID, acc)
}
Expand All @@ -143,6 +162,84 @@ func (p *PowerStat) addGlobalMetrics(acc telegraf.Accumulator) {
}
}
}
func maxDiePerSocket(_ string) int {
/*
TODO:
At the moment, Linux does not distinguish between multiple dies per socket.
This piece of code will need to be upgraded in the future.
https://github.com/torvalds/linux/blob/v5.17/arch/x86/include/asm/topology.h#L153
*/
return 1
}

func (p *PowerStat) addUncoreFreq(socketID string, die string, acc telegraf.Accumulator) {
err := checkFile("/sys/devices/system/cpu/intel_uncore_frequency")
if err != nil {
err := errors.New("error while checking existing intel_uncore_frequency (probably intel-uncore-frequency module was not loaded)")
if val := p.logOnce["intel_uncore_frequency"]; val == nil || val.Error() != err.Error() {
p.Log.Errorf("%v", err)
// Remember that specific error occurs to omit logging next time
p.logOnce["intel_uncore_frequency"] = err
}
return
}
p.logOnce["intel_uncore_frequency"] = nil
p.readUncoreFreq("initial", socketID, die, acc)
p.readUncoreFreq("current", socketID, die, acc)
}

func (p *PowerStat) readUncoreFreq(typeFreq string, socketID string, die string, acc telegraf.Accumulator) {
fields := map[string]interface{}{}
cpuID := ""
if typeFreq == "current" {
if p.areCoreMetricsEnabled() && p.msr.isMsrLoaded() {
p.logOnce[socketID+"msr"] = nil
for _, v := range p.cpuInfo {
if v.physicalID == socketID {
cpuID = v.cpuID
}
}
if cpuID == "" {
p.Log.Debugf("error while reading socket ID")
return
}
actualUncoreFreq, err := p.msr.readSingleMsr(cpuID, "MSR_UNCORE_PERF_STATUS")
if err != nil {
p.Log.Debugf("error while reading MSR_UNCORE_PERF_STATUS: %v", err)
return
}
actualUncoreFreq = (actualUncoreFreq & 0x3F) * 100
fields["uncore_frequency_mhz_cur"] = actualUncoreFreq
} else {
err := errors.New("error while trying to read MSR (probably msr module was not loaded), uncore_frequency_mhz_cur metric will not be collected")
if val := p.logOnce[socketID+"msr"]; val == nil || val.Error() != err.Error() {
p.Log.Errorf("%v", err)
// Remember that specific error occurs for socketID to omit logging next time
p.logOnce[socketID+"msr"] = err
}
}
}
initMinFreq, err := p.msr.retrieveUncoreFrequency(socketID, typeFreq, "min", die)
if err != nil {
p.Log.Errorf("error while retrieving minimum uncore frequency of the socket %s, err: %v", socketID, err)
return
}
initMaxFreq, err := p.msr.retrieveUncoreFrequency(socketID, typeFreq, "max", die)
if err != nil {
p.Log.Errorf("error while retrieving maximum uncore frequency of the socket %s, err: %v", socketID, err)
return
}

tags := map[string]string{
"package_id": socketID,
"type": typeFreq,
"die": die,
}
fields["uncore_frequency_limit_mhz_min"] = initMinFreq
fields["uncore_frequency_limit_mhz_max"] = initMaxFreq

acc.AddGauge("powerstat_package", fields, tags)
}

func (p *PowerStat) addThermalDesignPowerMetric(socketID string, acc telegraf.Accumulator) {
maxPower, err := p.rapl.getConstraintMaxPowerWatts(socketID)
Expand Down Expand Up @@ -579,6 +676,9 @@ func (p *PowerStat) parsePackageMetricsConfig() {
if contains(p.PackageMetrics, packageThermalDesignPower) {
p.packageThermalDesignPower = true
}
if contains(p.PackageMetrics, packageUncoreFrequency) {
p.packageUncoreFrequency = true
}
}

func (p *PowerStat) parseCPUMetricsConfig() {
Expand Down Expand Up @@ -693,6 +793,7 @@ func newPowerStat(fs fileService) *PowerStat {
cpuTemperature: false,
cpuBusyFrequency: false,
packageTurboLimit: false,
packageUncoreFrequency: false,
packageCurrentPowerConsumption: false,
packageCurrentDramPowerConsumption: false,
packageThermalDesignPower: false,
Expand Down
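The `logOnce` handling added to `Gather`, `addGlobalMetrics`, `addUncoreFreq`, and `readUncoreFreq` repeats one pattern: log an error only if it differs from the last one stored under the same key, and clear the key once the operation succeeds. A minimal sketch of that pattern as a standalone helper follows. The `onceLogger` and `staticErr` types are hypothetical and are not part of the plugin.

```go
package main

import "fmt"

// staticErr is a tiny error type used only to keep this sketch self-contained.
type staticErr string

func (e staticErr) Error() string { return string(e) }

// onceLogger remembers the last error logged per key (hypothetical type).
type onceLogger struct {
	last map[string]error
	logf func(format string, args ...interface{})
}

func newOnceLogger(logf func(format string, args ...interface{})) *onceLogger {
	return &onceLogger{last: map[string]error{}, logf: logf}
}

// report logs err unless an error with the same message was already logged
// for key. Passing a nil error clears the key so the next failure is logged
// again. It returns true when a line was actually logged.
func (o *onceLogger) report(key string, err error) bool {
	if err == nil {
		o.last[key] = nil
		return false
	}
	if prev := o.last[key]; prev != nil && prev.Error() == err.Error() {
		return false
	}
	o.logf("%v", err)
	o.last[key] = err
	return true
}

func main() {
	logger := newOnceLogger(func(format string, args ...interface{}) {
		fmt.Printf(format+"\n", args...)
	})
	err := staticErr("msr module not loaded")
	logger.report("msr", err) // logged
	logger.report("msr", err) // suppressed, same message
	logger.report("msr", nil) // success clears the key
	logger.report("msr", err) // logged again
}
```

Keying by a suffix such as `socketID+"rapl"` or `socketID+"msr"`, as the commit does, keeps independent failure sources from masking each other's log lines.
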
59 changes: 59 additions & 0 deletions plugins/inputs/intel_powerstat/intel_powerstat_test.go
Expand Up @@ -119,6 +119,7 @@ func TestGather(t *testing.T) {
On("retrieveAndCalculateData", mock.Anything).Return(nil).Times(len(raplDataMap)).
On("getConstraintMaxPowerWatts", mock.Anything).Return(546783852.3, nil)
mockServices.msr.On("getCPUCoresData").Return(preparedCPUData).
On("isMsrLoaded", mock.Anything).Return(true).
On("openAndReadMsr", mock.Anything).Return(nil).
On("retrieveCPUFrequencyForCore", mock.Anything).Return(1200000.2, nil)

Expand Down Expand Up @@ -227,6 +228,43 @@ func TestAddCPUFrequencyMetric(t *testing.T) {
acc.AssertContainsTaggedFields(t, "powerstat_core", expectedMetric.fields, expectedMetric.tags)
}

func TestReadUncoreFreq(t *testing.T) {
var acc testutil.Accumulator
cpuID := "0"
coreID := "0"
packageID := "0"
die := "0"
power, mockServices := getPowerWithMockedServices()
prepareCPUInfoForSingleCPU(power, cpuID, coreID, packageID)
preparedData := getPreparedCPUData([]string{cpuID})

mockServices.msr.On("getCPUCoresData").Return(preparedData)

mockServices.msr.On("isMsrLoaded").Return(true)

mockServices.msr.On("readSingleMsr", "0", "MSR_UNCORE_PERF_STATUS").Return(uint64(10), nil)

mockServices.msr.On("retrieveUncoreFrequency", "0", "initial", "min", "0").
Return(float64(500), nil)
mockServices.msr.On("retrieveUncoreFrequency", "0", "initial", "max", "0").
Return(float64(1200), nil)
mockServices.msr.On("retrieveUncoreFrequency", "0", "current", "min", "0").
Return(float64(600), nil)
mockServices.msr.On("retrieveUncoreFrequency", "0", "current", "max", "0").
Return(float64(1100), nil)

power.readUncoreFreq("current", packageID, die, &acc)
power.readUncoreFreq("initial", packageID, die, &acc)

require.Equal(t, 2, len(acc.GetTelegrafMetrics()))

expectedMetric := getPowerUncoreFreqMetric("initial", float64(500), float64(1200), nil, packageID, die)
acc.AssertContainsTaggedFields(t, "powerstat_package", expectedMetric.fields, expectedMetric.tags)

expectedMetric = getPowerUncoreFreqMetric("current", float64(600), float64(1100), uint64(1000), packageID, die)
acc.AssertContainsTaggedFields(t, "powerstat_package", expectedMetric.fields, expectedMetric.tags)
}

func TestAddCoreCPUTemperatureMetric(t *testing.T) {
var acc testutil.Accumulator
cpuID := "0"
Expand Down Expand Up @@ -496,6 +534,27 @@ func getPowerGlobalMetric(name string, value interface{}, socketID string) struc
return getPowerMetric(name, value, map[string]string{"package_id": socketID})
}

func getPowerUncoreFreqMetric(typeFreq string, limitMin interface{}, limitMax interface{}, current interface{}, socketID string, die string) struct {
fields map[string]interface{}
tags map[string]string
} {
var ret struct {
fields map[string]interface{}
tags map[string]string
}
ret.tags = make(map[string]string)
ret.fields = make(map[string]interface{})
ret.tags["package_id"] = socketID
ret.tags["die"] = die
ret.tags["type"] = typeFreq
ret.fields["uncore_frequency_limit_mhz_min"] = limitMin
ret.fields["uncore_frequency_limit_mhz_max"] = limitMax
if typeFreq == "current" {
ret.fields["uncore_frequency_mhz_cur"] = current
}
return ret
}

func getPowerMetric(name string, value interface{}, tags map[string]string) struct {
fields map[string]interface{}
tags map[string]string
Expand Down