Metrics #23

Closed
millems opened this issue Jul 3, 2017 · 27 comments
Labels: 1.x Parity, feature-request (A feature should be added or improved.)

Comments

@millems
Contributor

millems commented Jul 3, 2017

Review the inherited state of V1 metrics and determine which changes are necessary for V2.

(Feel free to comment on this issue with desired changes).

@brharrington
Contributor

I'm curious if this has gotten any more thought; is there a current recommendation?

For our use cases, we mostly care about request volume and latency. I found that I can get most of what I care about using the pluggable HTTP client support: I'm creating a wrapper for an SdkHttpClient or SdkAsyncHttpClient that captures basic metrics about the requests. The main gap is that I don't see a good way to get the following information, which was easily accessible from the V1 metrics collector:

  • Service name (e.g. EC2)
  • Request type (e.g., DescribeInstances)
  • AWS error code (we get the HTTP status code, but it doesn't always map to the AWS error as we would expect)

For synchronous clients we have a hacky way to get the first two by looking at the stacktrace and seeing what the entry point to the SDK was. For async clients that is not viable. Is there a better way to access that type of information at the HTTP client layer?

More generally, is there a better integration point that could be used?

@millems
Contributor Author

millems commented Feb 22, 2019

@brharrington

For metrics like request volume and latency, the ExecutionInterceptor is probably the best integration point for a custom metrics implementation.
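
As a rough sketch (the interceptor class name here is illustrative; SERVICE_NAME and OPERATION_NAME are standard execution attributes in the v2 core module), an interceptor can surface the service and operation names that are hard to see at the HTTP client layer:

import software.amazon.awssdk.core.interceptor.Context;
import software.amazon.awssdk.core.interceptor.ExecutionAttributes;
import software.amazon.awssdk.core.interceptor.ExecutionInterceptor;
import software.amazon.awssdk.core.interceptor.SdkExecutionAttribute;

public class RequestInfoInterceptor implements ExecutionInterceptor {
    @Override
    public void afterExecution(Context.AfterExecution context, ExecutionAttributes attributes) {
        // Service name (e.g. EC2) and operation name (e.g. DescribeInstances) for this execution.
        String serviceName = attributes.getAttribute(SdkExecutionAttribute.SERVICE_NAME);
        String operationName = attributes.getAttribute(SdkExecutionAttribute.OPERATION_NAME);
        // Replace with your metrics sink of choice.
        System.out.println(serviceName + "." + operationName + " completed");
    }
}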

@brharrington
Contributor

Thanks for the pointer, @millems. Looks like it solves most of my problems.

@millems
Contributor Author

millems commented Feb 23, 2019

@brharrington LMK if you have any questions.

@justnance added the feature-request label and removed the Feature Request label Apr 19, 2019
@millems changed the title from "Refactor: Metrics" to "Metrics" Jul 8, 2019
@swaranga

@millems What is the right way to log latencies for a request using the interceptor API? One approach would be to record the start timestamp in the ExecutionAttributes instance in a beforeExecution call, then, in the afterExecution method, get the start time from the attributes, calculate the elapsed time, and log the latency.

Is that the recommended approach?

@millems
Contributor Author

millems commented Sep 10, 2019

@swaranga That's the recommended way to log the latency for the entire execution (which includes marshalling and retries), yes.
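
For reference, a minimal sketch of that approach (class and attribute names are illustrative):

import java.time.Duration;
import java.time.Instant;

import software.amazon.awssdk.core.interceptor.Context;
import software.amazon.awssdk.core.interceptor.ExecutionAttribute;
import software.amazon.awssdk.core.interceptor.ExecutionAttributes;
import software.amazon.awssdk.core.interceptor.ExecutionInterceptor;

public class LatencyLoggingInterceptor implements ExecutionInterceptor {
    // Carries the start time from beforeExecution to afterExecution for the same request.
    private static final ExecutionAttribute<Instant> START_TIME =
            new ExecutionAttribute<>("LatencyLoggingStartTime");

    @Override
    public void beforeExecution(Context.BeforeExecution context, ExecutionAttributes attributes) {
        attributes.putAttribute(START_TIME, Instant.now());
    }

    @Override
    public void afterExecution(Context.AfterExecution context, ExecutionAttributes attributes) {
        // Covers the entire execution, including marshalling and retries.
        // Note: afterExecution is only invoked for successful executions; failures go through onExecutionFailure.
        Duration elapsed = Duration.between(attributes.getAttribute(START_TIME), Instant.now());
        System.out.println("Execution took " + elapsed.toMillis() + " ms");
    }
}

The interceptor would then be registered on the client builder, e.g. .overrideConfiguration(c -> c.addExecutionInterceptor(new LatencyLoggingInterceptor())).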

@swaranga

swaranga commented Sep 10, 2019

@millems One problem with this approach: if I want to execute a different interceptor per request, so that I can tie the metrics logging to my parent unit of work, I cannot do that, since there is no way to add an interceptor to a request object.

This is useful, for instance, when I am serving a request for my callers and, as part of that, I need to call an AWS API; I would like to log metrics for that particular call along with the other metrics for the incoming request to my service. I could probably achieve this using hacky ThreadLocals, but that makes the whole code brittle to refactoring.

This could be addressed by adding support for per-request interceptors, but assuming both the SDK-level interceptors and the request-level interceptors are executed for a given request, that may not be the most optimal API design for metrics logging. Here is why:

Consider an application that processes messages from Kinesis using the KCL library, which are then transformed and written to a different Kinesis stream. In such a case, I will want to use the same SDK instance for both the read from and write to Kinesis, as recommended by the SDK docs. And since I have no way of assigning a request-level interceptor for the get records call from KCL (since the call is made directly from the KCL code), I will have to add an SDK-level interceptor. At the same time, I will also want to add a request-level interceptor for the part of my code that does the Kinesis write, so that the Kinesis write call metrics are tied to my unit of work, which is the processing of each message.

In such a setup, if both interceptors are executed, I will have incorrect metrics logged (call counts, failures, average latency, etc.). I therefore think the interceptor API is too low-level to be useful for flexible metrics logging, and I ask that you reconsider the design you had for the V1 SDK, which executed only one RequestMetricCollector even if multiple collectors were configured at different levels (HTTP client, SDK, request instance, in that order of preference).

A dedicated API, purpose-built for metrics logging, is definitely needed, and it should be addressed before the V2 SDK becomes more ubiquitous.

@millems
Contributor Author

millems commented Sep 13, 2019

We agree that the interceptor API isn't optimal for someone who just cares about metrics. There will be a separate interface for metrics. Here are our latest thoughts: #1304, #1362

@swaranga

Thank you, I have added my comments on the design PR

@jeffsteinmetz

jeffsteinmetz commented Oct 11, 2019

An incremental implementation that at least adds JVM memory, JVM thread, and OS file descriptor metrics would be beneficial while the interfaces for the AWS-specific metrics are worked out.

@swaranga

@jeffsteinmetz Why should the SDK provide metrics on your application's JVM memory usage, file descriptors, and threads? Do you mean memory, file descriptors, and threads used by the SDK itself? If so, threads might be doable, but I am not sure about memory usage. For file descriptors, maybe it could provide the number of open HTTP connections from the SDK, but beyond that I am not sure this should be in scope.

@jeffsteinmetz

@swaranga, this is about feature parity with the 1.x Java SDK, which provides JVM and machine metrics. We use this in the 1.x Java SDK; see

https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/metrics/package-summary.html

and

https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/generating-sdk-metrics.html

(The metrics it reports are listed below; a sketch of how these values map to the standard JVM management APIs appears at the end of this comment.)

Machine Metrics

Memory

  • TotalMemory - Total amount of memory currently available to the JVM for current and future objects, measured in bytes. This value may vary over time, depending on the host environment.
  • FreeMemory - An approximation of the total amount of memory currently available to the JVM for future allocated objects, measured in bytes.
  • UsedMemory - TotalMemory minus FreeMemory.
  • SpareMemory - The maximum amount of memory that the JVM will attempt to use, measured in bytes, minus UsedMemory.

Threads

  • ThreadCount - The current number of live threads, including both daemon and non-daemon threads.
  • DeadLockThreadCount - The number of threads that are deadlocked waiting for object monitors or ownable synchronizers, if any. Threads are deadlocked in a cycle waiting for a lock of these two types if each thread owns one lock while trying to acquire another lock already held by another thread in the cycle. No metric is generated when the value is zero.
  • DaemonThreadCount - The current number of live daemon threads. No metric is generated when the value is zero.
  • PeakThreadCount - The peak live thread count since the JVM started or since the peak was reset.
  • TotalStartedThreadCount - The total number of threads created and started since the JVM started.

File Descriptors

  • OpenFileDescriptorCount - Number of open file descriptors of the operating system.
  • SpareFileDescriptorCount - Maximum number of file descriptors of the operating system minus OpenFileDescriptorCount.

and from the 1.x docs:

The AWS SDK for Java can generate metrics for visualization and monitoring with CloudWatch that measure:

  • your application's performance when accessing AWS
  • the performance of your JVMs when used with AWS
  • runtime environment details such as heap memory, number of threads, and opened file descriptors
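
For context, most of the memory and thread values above come straight from the standard JVM management APIs. A rough, illustrative sketch of collecting them (independent of the SDK) could look like:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public final class JvmMetricsSnapshot {
    public static void main(String[] args) {
        Runtime runtime = Runtime.getRuntime();
        long totalMemory = runtime.totalMemory();                      // TotalMemory
        long freeMemory = runtime.freeMemory();                        // FreeMemory
        long usedMemory = totalMemory - freeMemory;                    // UsedMemory
        long spareMemory = runtime.maxMemory() - usedMemory;           // SpareMemory

        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        int threadCount = threads.getThreadCount();                    // ThreadCount
        int daemonThreadCount = threads.getDaemonThreadCount();        // DaemonThreadCount
        int peakThreadCount = threads.getPeakThreadCount();            // PeakThreadCount
        long totalStartedCount = threads.getTotalStartedThreadCount(); // TotalStartedThreadCount
        long[] deadlocked = threads.findDeadlockedThreads();           // DeadLockThreadCount (null if none)

        // Open/spare file descriptor counts need the platform-specific
        // com.sun.management.UnixOperatingSystemMXBean and are omitted here.
        System.out.printf("used=%d spare=%d threads=%d daemon=%d peak=%d started=%d deadlocked=%d%n",
                usedMemory, spareMemory, threadCount, daemonThreadCount, peakThreadCount,
                totalStartedCount, deadlocked == null ? 0 : deadlocked.length);
    }
}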

@millems
Contributor Author

millems commented Oct 14, 2019

+1, I could see it being nice to get JVM metrics into CloudWatch. It's more of a "Java on AWS" feature than a "Java SDK" feature, but it was included in the Java SDK 1.11.x, so it makes sense as a feature request.

@swaranga

Understood, although there seems to be no new development on the related PRs. Is this still a priority?

@millems
Contributor Author

millems commented Oct 15, 2019

We recognize that this feature is a blocker for many customers migrating from 1.x to 2.x. We still believe it is important and intend to implement it.

@debora-ito
Member

Feature request from v1: Allow a user to add a JVM shutdown hook to drain metrics queue - aws/aws-sdk-java#1296

@gigiigig

Hello! Do you have any ETA for this? Thanks.

@millems
Contributor Author

millems commented May 15, 2020

@gigiigig Sorry, we can't provide ETAs, but we will keep this issue up to date as we progress.

@dagnir
Contributor

dagnir commented Jul 7, 2020

Hi all,

We're happy to announce that the preview of the client-side metrics feature is now released! You can try it out starting with version 2.13.52. Please note that if you want to use the cloudwatch-metric-publisher module, the version to use is 2.13.52-PREVIEW. For more information, please take a look at our announcement blog post.
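
For anyone trying the preview, a minimal sketch of enabling the CloudWatch publisher at the client level (the service client and region here are just examples):

import software.amazon.awssdk.metrics.MetricPublisher;
import software.amazon.awssdk.metrics.publishers.cloudwatch.CloudWatchMetricPublisher;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;

// Requires the cloudwatch-metric-publisher module at 2.13.52-PREVIEW.
MetricPublisher publisher = CloudWatchMetricPublisher.create();

DynamoDbClient dynamoDb = DynamoDbClient.builder()
        .region(Region.US_WEST_2)
        .overrideConfiguration(c -> c.addMetricPublisher(publisher))
        .build();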

@swaranga

swaranga commented Jul 8, 2020

From the blog post, it seems like the metrics publisher is configured at the SDK level. Can I also attach one at the per-request level? The V1 SDK had an option to override it at the per-request level, which allowed us to publish the metrics for a request along with the current unit of work.

Is that feasible?

@dagnir
Contributor

dagnir commented Jul 8, 2020

@swaranga Yep, absolutely. You can set publishers at the request level as well using RequestOverrideConfiguration:

GetObjectRequest.builder()
                ...
                .overrideConfiguration(o -> o.addMetricPublisher(myPublisher))
                .build();

@gmariotti

Is there any plan to provide a Micrometer-based implementation for the metric publisher?

@dagnir
Contributor

dagnir commented Jul 8, 2020

Hi @gmariotti we don't have any plans currently for a Micrometer implementation.

@dagnir
Contributor

dagnir commented Sep 2, 2020

Hi everyone, the Metrics feature is now GA, so I'm going to go ahead and close this issue. Please check it out and feel free to open an issue for any problems or feedback you have. https://aws.amazon.com/blogs/developer/client-side-metrics-for-the-aws-sdk-for-java-v2-is-now-generally-available/

@dagnir dagnir closed this as completed Sep 2, 2020
@rehevkor5

It's possible to implement your own MetricPublisher in order to produce metrics in Micrometer format. However, there are some wrinkles due to differences between CloudWatch metrics and the approach used by, e.g., Prometheus metrics. Here are some details that people might find helpful.

Duration type metrics are directly convertible to Timer.

Boolean type metrics can be converted into two Counters: for example one which increments for true and one which increments for false.

Number type metrics can be tricky. Some of them (such as RetryCount) can be recorded as a Counter (which increments by each given value). However, metrics such as HttpMetric#AVAILABLE_CONCURRENCY, HttpMetric#LEASED_CONCURRENCY, and HttpMetric#PENDING_CONCURRENCY_ACQUIRES behave more like gauges. They are also difficult because they are only sent (via the MetricCollection passed to MetricPublisher#publish()) if a request was made, so if you treat them as a Gauge they will become "stuck" at their last value until another request occurs.

CloudWatch aggregates metrics like this once per minute, recording the sum of the values and the count of how many values were given (as well as the min and max value). If the metric isn't reported, CloudWatch seems to assume that the value was zero during that time. Therefore, what you can get from CloudWatch is the average value (by dividing sum / count) or the min/max of the metric in each time period.

We could record these as a DistributionSummary if we need more detail (though it might be somewhat misleading). However, a simpler and more reliable way is to imitate the CloudWatch approach and record the value sum as one Counter along with the value count as another Counter. This allows us to perform calculations such as rate(...sum_count{...}[...]) / rate(...sample_count{...}[...]) (or, equivalently, increase() / increase()), which basically gives us the average value for the time range. It's not necessarily 100% accurate when no requests are happening, but at least it will show zero instead of the last value. I'm about to implement this approach; hopefully it works as expected.
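
A minimal sketch of such a publisher, assuming a Micrometer MeterRegistry is available (the metric naming and the sum/count convention are illustrative choices, not part of the SDK):

import java.time.Duration;

import io.micrometer.core.instrument.MeterRegistry;
import software.amazon.awssdk.metrics.MetricCollection;
import software.amazon.awssdk.metrics.MetricPublisher;
import software.amazon.awssdk.metrics.MetricRecord;

public class MicrometerMetricPublisher implements MetricPublisher {
    private final MeterRegistry registry;

    public MicrometerMetricPublisher(MeterRegistry registry) {
        this.registry = registry;
    }

    @Override
    public void publish(MetricCollection metrics) {
        for (MetricRecord<?> record : metrics) {
            String name = "aws.sdk." + record.metric().name();
            Object value = record.value();
            if (value instanceof Duration) {
                // Duration metrics map directly to a Micrometer Timer.
                registry.timer(name).record((Duration) value);
            } else if (value instanceof Boolean) {
                // Boolean metrics become a counter tagged with the observed value.
                registry.counter(name, "value", value.toString()).increment();
            } else if (value instanceof Number) {
                // Record sum and sample count separately, mirroring the CloudWatch-style
                // aggregation described above, so averages can be derived later.
                registry.counter(name + ".sum").increment(((Number) value).doubleValue());
                registry.counter(name + ".count").increment();
            }
        }
        // Per-attempt metrics live in child collections; publish them the same way.
        metrics.children().forEach(this::publish);
    }

    @Override
    public void close() {
    }
}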

@ginkel

ginkel commented Dec 15, 2022

@rehevkor5 Has your attempt to expose metrics to Micrometer been successful? Thanks!

@PatrykGala

@rehevkor5 @ginkel this is my solution: https://gist.github.com/PatrykGala/e4aec004eb55cd8cbdee328f217771c7
The code creates both a gauge and a counter for Integer metrics; for a given metric, one of the two may not be necessary.
