CloudWatchMetricPublisher fails to publish high QPS detailed metrics #3080
Comments
@Kurru sorry this is biting you. Yes, the exception is probably related to a CloudWatch service quota, although the funny thing is I cannot reproduce with the repro code you provided. The CloudWatch service quota docs mention there's a 150 TPS limit for PutMetricData, so maybe the issue is the request rate?
I don't think it's the 150 TPS limit for PutMetricData but rather the request size limit of PutMetricData. I found a reference to this here:
I made a small edit to the sample code to ensure that the latency metric is a 'detailedMetric' so that the library would report individual values, not a pre-aggregation. It's also worth noting that the failure is silent: it is only exposed as a log message after 60 seconds (or whatever your CloudWatchMetricPublisher.uploadFrequency is configured to).
Thank you for the reference. I still cannot reproduce after the code update; I get a successful 200 response when it really shouldn't succeed, according to the docs. 918 is the sample count, so it seems some metrics are being dropped in the process. Do you see the exception when you run your repro code?
It looks like this issue has not been active for more than five days. In the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please add a comment to prevent automatic closure, or if the issue is already closed please feel free to reopen it.
Sorry for the delay, @debora-ito. I put together a small reproduction git repo here: https://github.com/Kurru/cloudwatch-highqps To run it
I've been able to reproduce the stack trace reliably with this script on my Mac laptop. One thing I found important was that without a dependency on
Oh, full stacktrace for this issue:
@Kurru I'm sorry for losing track of this issue. Thank you for the repro project. I still can't repro though.
I'll work backwards and start from the error message, then see how I can force the error to be thrown.
Can you enable the verbose wirelogs and share the logs with the error?
Do these verbose wirelogs include my AWS credentials? I have the logs, but I'm not sure how to share them with you if they contain something so sensitive.
I found a trace log in
Yes, verbose wirelogs will include the access key id, but you can remove it and any other sensitive data before sharing. I was hoping to see just the part of the request that shows the MetricData members, in the same format as the examples in the API Reference here, something like:
For reference, in my local tests the metrics are being split in 4 async PutMetricData calls, in each request
I was able to manipulate the hex logs into the format you requested using regex. pastebin I had a hard time getting that specific log format; it doesn't seem to be logged by AWS directly?
I've just been bitten by this. Changing this line to 150 should fix the issue. Line 60 in c2db6af
Describe the bug
When publishing a detailed metric into CloudWatch using CloudWatchMetricPublisher, cloudWatchClient.putMetricData API throws a request validation exception:
I believe this is due to there being more than 150 data points for a single metric, though I'm not sure as this seems to be internal to AWS servers.
This limitation seems to be documented here:
Expected behavior
The client library should construct the request object while taking this limitation into account, either by sending the additional data points in later requests or by dropping them.
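As a rough illustration of the first option (sending the extra data points in later requests), here is a minimal sketch in plain Java. It does not use the SDK and is not the SDK's actual implementation: the class name `DataPointChunker`, the method `chunk`, and the constant `MAX_VALUES_PER_DATUM` are hypothetical, and the value 150 is simply the per-metric limit assumed in this issue.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the AWS SDK.
class DataPointChunker {
    // Assumption taken from this issue: at most 150 values per metric in one request.
    static final int MAX_VALUES_PER_DATUM = 150;

    // Splits values into consecutive chunks of at most maxPerChunk elements,
    // preserving order, so each chunk could be sent in its own request.
    static <T> List<List<T>> chunk(List<T> values, int maxPerChunk) {
        if (maxPerChunk <= 0) {
            throw new IllegalArgumentException("maxPerChunk must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < values.size(); start += maxPerChunk) {
            int end = Math.min(start + maxPerChunk, values.size());
            // Copy the sub-list so each chunk is independent of the input list.
            chunks.add(new ArrayList<>(values.subList(start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // 400 made-up latency values for a single detailed metric.
        List<Double> latencies = new ArrayList<>();
        for (int i = 0; i < 400; i++) {
            latencies.add((double) i);
        }
        List<List<Double>> chunks = chunk(latencies, MAX_VALUES_PER_DATUM);
        for (List<Double> c : chunks) {
            System.out.println("chunk of " + c.size() + " values");
        }
    }
}
```

With 400 values this yields chunks of 150, 150 and 100 values. Dropping the extra data points instead would amount to keeping only the first chunk, at the cost of losing the rest.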
Current behavior
cloudWatchClient.putMetricData fails and a logging message is recorded.
All metric values for this high QPS service are lost.
Steps to Reproduce
I haven't validated but I believe this should cause the error to be triggered.
Possible Solution
software/amazon/awssdk/metrics/publishers/cloudwatch/internal/transform/MetricCollectionAggregator.java:128
could be updated from:
To something like:
Context
Trying to implement metrics to monitor the performance of my API server using custom CloudWatch metrics.
AWS Java SDK version used
17
JDK version used
Whatever is in docker image openjdk:17-oracle
Operating System and version
Docker image: openjdk:17-oracle linux-oracle?