Azure: implement sending memory metrics via diagnostic extension #2022

francescolavra · 2024-05-12T14:57:35Z

This change set enhances the cloud_init klib by implementing an Azure VM agent (this fixes the "virtual machine agent status is not ready" warning that is currently displayed for Nanos instances in the Azure portal), and adds a new "azure" klib that implements an Azure extension similar to the Linux Diagnostic extension.

The current implementation supports sending 4 types of memory metrics (i.e. available and used memory, each as both a number of bytes and a percentage of total memory). The azure klib is configured in the manifest options via an "azure" tuple; the diagnostic functionalities in this klib are enabled and configured by inserting a "diagnostics" tuple with the following attributes:

- storage_account: the Azure storage account used to store the metrics data generated by the klib; the storage account must be located in the same region as the Azure instance
- storage_account_sas: Shared Access Signature (SAS) token for accessing the storage account; this token must have permissions to create Azure storage tables and add table entities in the above storage account; SAS tokens for a given storage account can be generated, for example, via the Azure portal in the "Security + networking" section
- metrics: tuple that enables sending memory metrics; it can contain 2 optional attributes:
  - sample_interval: interval expressed in seconds at which metrics data is collected (default: 15)
  - transfer_interval: interval expressed in seconds at which metrics data is aggregated and sent to the storage account (default: 60)

Example snippet of Ops configuration file:

"ManifestPassthrough": {
  "azure": {
    "diagnostics": {
      "storage_account": "mystorageaccount",
      "storage_account_sas": "sv=2022-11-02&ss=bfqt&srt=sco&sp=rwdlacupiytfx&se=2024-05-22T14:50:28Z&st=2024-05-12T06:50:28Z&spr=https&sig=xxyyzz",
      "metrics": {"sample_interval": "15","transfer_interval": "60"}
    }
  }
}
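
As a rough illustration of how the two optional attributes and their defaults behave, the following Python sketch reads a configuration like the one above and resolves the intervals. It is not the klib's code (the klib is part of the Nanos kernel); the function name `metrics_intervals` is hypothetical, and the sketch assumes the snippet is wrapped in an enclosing JSON object and that interval values may be given as strings, as in the example.

```python
import json

DEFAULT_SAMPLE_INTERVAL = 15    # seconds, default stated in the description
DEFAULT_TRANSFER_INTERVAL = 60  # seconds, default stated in the description


def metrics_intervals(config_text):
    """Return (sample_interval, transfer_interval) in seconds.

    Returns None when no "metrics" tuple is present, i.e. when sending
    memory metrics is not enabled. Hypothetical helper for illustration.
    """
    config = json.loads(config_text)
    azure = config.get("ManifestPassthrough", {}).get("azure", {})
    metrics = azure.get("diagnostics", {}).get("metrics")
    if metrics is None:
        return None
    # Values are strings in the example configuration, so convert them.
    sample = int(metrics.get("sample_interval", DEFAULT_SAMPLE_INTERVAL))
    transfer = int(metrics.get("transfer_interval", DEFAULT_TRANSFER_INTERVAL))
    return sample, transfer
```

With the defaults, each transfer aggregates 60 / 15 = 4 samples.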

Aggregated memory metrics data consist of the number of samples, the minimum, maximum, last, and average value, and the sum of all values; these data are inserted in an Azure storage table (one entity per aggregated data). The name of the table is in the format "WADMetricsxxxxP10DV2Syyyymmdd", where xxxx is the transfer interval expressed with ISO8601 format, and yyyymmdd is a representation of the 10-day date interval to which the metrics refer (thus, a new table is created every 10 days). For example, a table named "WADMetricsPT1MP10DV2S20240503" contains metrics data aggregated every minute ("PT1M" is the ISO8601 representation of a 1-minute period) generated for a 10-day period starting on May 3, 2024.
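
The aggregation and table naming described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the dictionary keys, the function names, and the general ISO 8601 duration formatter are hypothetical and not taken from the klib, and the start date of the 10-day interval is passed in by the caller rather than computed, since the description does not say how the intervals are aligned.

```python
from datetime import date


def aggregate(samples):
    """Aggregate the samples collected during one transfer interval."""
    if not samples:
        raise ValueError("no samples to aggregate")
    total = sum(samples)
    return {
        "count": len(samples),
        "minimum": min(samples),
        "maximum": max(samples),
        "last": samples[-1],
        "average": total / len(samples),
        "total": total,
    }


def iso8601_duration(seconds):
    """Format a number of seconds as an ISO 8601 duration, e.g. 60 -> PT1M."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    text = "PT"
    if hours:
        text += f"{hours}H"
    if minutes:
        text += f"{minutes}M"
    if secs or text == "PT":
        text += f"{secs}S"
    return text


def table_name(transfer_interval, interval_start):
    """Build the table name for a transfer interval and a 10-day start date."""
    duration = iso8601_duration(transfer_interval)
    return f"WADMetrics{duration}P10DV2S{interval_start:%Y%m%d}"
```

For a 60-second transfer interval and a 10-day period starting on May 3, 2024, `table_name(60, date(2024, 5, 3))` yields the table name used in the example above.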

By default, the Azure portal does not display these metrics in its charts; in order for metrics to be available in the portal, the Linux Diagnostics Extension must be enabled and configured in a running instance (this can be done in the "Diagnostic settings" section in the portal) to match the settings in the Nanos manifest options. More specifically, the storage account and the metric aggregation interval specified in the Azure diagnostic settings must match those specified in the manifest options.
Note: the Azure VM agent implemented in the cloud_init klib responds to requests to enable and configure the diagnostic extension, but does not actually apply the extension settings specified in the requests; instead, it always applies the settings from the manifest.

Closes #2014

When a TLS handshake with a remote peer is complete, the TLS input buffer handler invokes the application layer connection handler, which returns the application layer input buffer handler for the connection. Any error at this stage should be reported by the application layer by returning INVALID_ADDRESS; this is consistent with the behavior for non-encrypted connections (see direct_receive_service() in net/direct.c), and allows applications to not implement an input buffer handler (e.g. when they connect to a remote peer only to send data and then close the connection), in which case their connection handler can return 0. This change modifies the TLS input buffer handler so that the check for errors uses INVALID_ADDRESS instead of 0, and modifies the gcp and cloudwatch code to align with this implementation.
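
The distinction between "error" and "no input buffer handler" can be modeled with a short Python sketch. It is not the kernel's C code: `INVALID_ADDRESS` is represented here by a sentinel object, the C value 0 by `None`, and the function name `on_handshake_complete` is hypothetical.

```python
INVALID_ADDRESS = object()  # stand-in for the kernel's INVALID_ADDRESS value


def on_handshake_complete(connection_handler):
    """Model the check done once the TLS handshake is complete.

    Returns (ok, input_handler). Only the INVALID_ADDRESS sentinel is
    treated as an error; None (0 in C) means the application does not
    need an input buffer handler.
    """
    input_handler = connection_handler()
    if input_handler is INVALID_ADDRESS:
        return False, None
    return True, input_handler


def send_only_app():
    # Sends data and closes the connection: no input handler needed.
    return None


def failing_app():
    # Reports an error to the TLS layer.
    return INVALID_ADDRESS
```

With a check against 0 instead of the sentinel, `send_only_app` would wrongly be treated as a failure.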

This function allows sending an arbitrary HTTP request and receiving a response without having to implement a connection handler and an input buffer handler. Callers can optionally implement a value handler to receive the server response, which is internally parsed by the utility code. The cloud_azure.c code has been refactored to use this new function.

An Azure instance must report its "ready" status at least once after being provisioned. In the current code, if for some reason the cloud_init klib fails to report ready at the first boot, it will never report ready even at subsequent boots, which prevents the instance status from transitioning to the running state. This change modifies the cloud_init klib so that cloud-specific initialization is executed at every boot; besides fixing the above potential issue, this will allow implementing an Azure VM agent. The first_boot() function, being no longer used, is being removed (the existing implementation had a flaw by which, if the TFS log is compacted at the first boot, the first_boot() function would return true even at the next boot).

This change makes http_request() insert the Content-Length HTTP header in every request, even when the request body is empty. This is necessary in order to support some types of requests which require this header (for example PUT requests to the Azure blob storage service to create a blob).
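
The effect on the wire can be illustrated with a small Python sketch that serializes a request and always emits the header. This is not the kernel's http_request() function; `build_request`, its parameters, and the host name are hypothetical.

```python
def build_request(method, path, host, body=b"", headers=None):
    """Serialize an HTTP/1.1 request, always including Content-Length."""
    lines = [f"{method} {path} HTTP/1.1", f"Host: {host}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    # Emitted unconditionally, so an empty body yields "Content-Length: 0".
    lines.append(f"Content-Length: {len(body)}")
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("ascii") + body
```

For example, `build_request("PUT", "/container/blob", "example.invalid")` produces a request whose header section contains `Content-Length: 0`.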

The kernel code that automatically loads klibs found in the /klib folder has a flaw by which a klib that at a first attempt fails to initialize due to missing dependencies is put in a state where it cannot be initialized even after these dependencies are satisfied by other klibs that are subsequently loaded. This is because the `pending` variable cannot be safely used in a lock-free manner to determine whether any other klibs are about to be initialized; for example, in an SMP VM one core could set `pending` to 0 and another core could put a klib in a failed state before the first core initializes the just loaded klib. This issue is causing sporadic CI test failures, such as https://app.circleci.com/pipelines/github/nanovms/nanos/4623/workflows/f30b3b7f-0732-49d6-9e09-f9efbd5d6e21/jobs/16230. This change fixes the above issue by introducing a spinlock-protected klib_autoload structure that keeps track of pending klibs and loaded klibs with missing dependencies. As a side effect, the lock in this structure protects the `klib_loaded` vector and the global kernel symbol table from concurrent modifications.
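
The idea of tracking klibs with missing dependencies under a single lock, and retrying them whenever another klib becomes available, can be sketched in Python. This is a simplified model and not the kernel's klib_autoload structure: the class, its fields, and the klib names in the usage note are hypothetical, and a `threading.Lock` stands in for the kernel spinlock.

```python
import threading


class KlibAutoload:
    """Toy model of lock-protected bookkeeping for klib autoloading."""

    def __init__(self):
        self.lock = threading.Lock()
        self.loaded = set()   # klibs that have been initialized
        self.waiting = {}     # klib name -> set of its dependencies

    def klib_loaded(self, name, deps):
        """Register a newly loaded klib; return the klibs initialized now.

        A klib with unmet dependencies stays in `waiting` and is retried
        each time another klib is initialized, instead of being put in a
        permanent failed state.
        """
        initialized = []
        with self.lock:
            self.waiting[name] = set(deps)
            progress = True
            while progress:
                progress = False
                for candidate in sorted(self.waiting):
                    if self.waiting[candidate] <= self.loaded:
                        del self.waiting[candidate]
                        self.loaded.add(candidate)
                        initialized.append(candidate)
                        progress = True
        return initialized
```

If a klib "c" depends on "a" and "b", and "b" depends on "a", loading them in the order c, a, b initializes nothing, then "a", then "b" followed by "c".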

This change makes the buffer_set_capacity() function work with buffers without contents (i.e. buffer structs with a zero `length` field). This allows buffers initialized via `init_buffer()` to be used as dynamically allocated buffers without having to do an initial allocation for buffer contents. The `buffer_set_capacity()` function is being moved from a header file to a source file because it can be computationally intensive (due to memory allocation and deallocation operations, as well as memory copying) and as such should not be called in hot code paths. This decreases the kernel binary size by about 55 KB.

This is done in preparation for the next commit which will add support for printing hexadecimal numbers with uppercase letters.

Besides adding new functionality to printf-style functions, this change makes the kernel compatible with third-party code (such as lwIP and mbedtls) that uses this format for printing hexadecimal numbers with uppercase letters.

This change enhances the cloud_init klib by implementing an Azure VM agent. This fixes the "virtual machine agent status is not ready" warning that is currently displayed for Nanos instances in the Azure portal. In addition, it adds support for implementing Azure extensions.

This change adds a new "azure" klib that implements an Azure extension similar to the Linux Diagnostic extension. The current implementation supports sending 4 types of memory metrics (i.e. available and used memory, each as both a number of bytes and a percentage of total memory). This klib is configured in the manifest options via an "azure" tuple; the diagnostic functionalities are enabled and configured by inserting a "diagnostics" tuple with the following attributes:

- storage_account: the Azure storage account used to store the metrics data generated by the klib; the storage account must be located in the same region as the Azure instance
- storage_account_sas: Shared Access Signature (SAS) token for accessing the storage account; this token must have permissions to create Azure storage tables and add table entities in the above storage account; SAS tokens for a given storage account can be generated, for example, via the Azure portal in the "Security + networking" menu
- metrics: tuple that enables sending memory metrics; it can contain 2 optional attributes:
  - sample_interval: interval expressed in seconds at which metrics data is collected (default: 15)
  - transfer_interval: interval expressed in seconds at which metrics data is aggregated and sent to the storage account (default: 60)

Example snippet of Ops configuration file:

```
"ManifestPassthrough": {
  "azure": {
    "diagnostics": {
      "storage_account": "mystorageaccount",
      "storage_account_sas": "sv=2022-11-02&ss=bfqt&srt=sco&sp=rwdlacupiytfx&se=2024-05-22T14:50:28Z&st=2024-05-12T06:50:28Z&spr=https&sig=xxyyzz",
      "metrics": {"sample_interval": "15","transfer_interval": "60"}
    }
  }
}
```

Aggregated memory metrics data consist of the number of samples, the minimum, maximum, last, and average value, and the sum of all values; these data are inserted in an Azure storage table (one entity per aggregated data). The name of the table is in the format "WADMetricsxxxxP10DV2Syyyymmdd", where xxxx is the transfer interval expressed with ISO8601 format, and yyyymmdd is a representation of the 10-day date interval to which the metrics refer (thus, a new table is created every 10 days). For example, a table named "WADMetricsPT1MP10DV2S20240503" contains metrics data aggregated every minute ("PT1M" is the ISO8601 representation of a 1-minute period) generated for a 10-day period starting on May 3, 2024.

By default, the Azure portal does not display these metrics in its charts; in order for metrics to be available in the portal, the Linux Diagnostics Extension must be enabled and configured in a running instance (this can be done in the "Diagnostic settings" section in the portal) to match the settings in the Nanos manifest options. More specifically, the storage account and the metric aggregation interval specified in the Azure diagnostic settings must match those specified in the manifest options.
Note: the Azure VM agent implemented in the cloud_init klib responds to requests to enable and configure the diagnostic extension, but does not actually apply the extension settings specified in the requests; instead, it always applies the settings from the manifest.

Closes #2014

francescolavra mentioned this pull request May 12, 2024

Support Azure VM Agent #2014

Closed

eyberg mentioned this pull request May 12, 2024

add docs for using azure metrics nanovms/ops-documentation#476

Closed

francescolavra added 10 commits May 27, 2024 12:38

format_hex_buffer(): use %B instead of %X as format specifier

7a1e2a3

This is done in preparation for the next commit which will add support for printing hexadecimal numbers with uppercase letters.

francescolavra force-pushed the feature/azure-metrics branch from 9dc963b to a92295e Compare May 27, 2024 10:57

francescolavra merged commit a92295e into master May 27, 2024
5 checks passed

francescolavra deleted the feature/azure-metrics branch May 27, 2024 11:17
