Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CloudWatchMetricPublisher fails to publish high QPS detailed metrics #3080

Open
Kurru opened this issue Mar 4, 2022 · 13 comments
Open

CloudWatchMetricPublisher fails to publish high QPS detailed metrics #3080

Kurru opened this issue Mar 4, 2022 · 13 comments
Assignees
Labels
bug This issue is a bug. p2 This is a standard priority issue

Comments

@Kurru
Copy link

Kurru commented Mar 4, 2022

Describe the bug

When publishing a detailed metric into CloudWatch using CloudWatchMetricPublisher, cloudWatchClient.putMetricData API throws a request validation exception:

WARN cloudwatch:117 - Failed while publishing some or all AWS SDK client-side metrics to CloudWatch.

java.util.concurrent.CompletionException: software.amazon.awssdk.services.cloudwatch.model.InvalidParameterValueException: The collection MetricData.member.13.Values must not have a size greater than 150. (Service: CloudWatch, Status Code: 400, Request ID: a84e1dcb-bfba-495b-97f5-60d7316c1800)

I believe this is due to there being more than 150 data points for a single metric, though I'm not sure as this seems to be internal to AWS servers.

This limitation seems to be documented here:

For example, a single PutMetricData call can include 20 metrics and 150 data points.

Expected behavior

Client library should construct the request object while taking into account this limitation either by sending the additional data points into later requests or by dropping the additional data points.

Current behavior

cloudWatchClient.putMetricData fails and a logging message is recorded.

All metric values for this high QPS service are lost.

WARN cloudwatch:117 - Failed while publishing some or all AWS SDK client-side metrics to CloudWatch.

java.util.concurrent.CompletionException: software.amazon.awssdk.services.cloudwatch.model.InvalidParameterValueException: The collection MetricData.member.13.Values must not have a size greater than 150. (Service: CloudWatch, Status Code: 400, Request ID: a84e1dcb-bfba-495b-97f5-60d7316c1800)

Steps to Reproduce

I haven't validated but I believe this should cause the error to be triggered.

var publisher = CloudWatchMetricPublisher.builder()
    .namespace("Test")
    .cloudWatchClient(asyncClient)
    .detailedMetrics(CoreMetric.API_CALL_DURATION)
    .dimensions(CoreMetric.OPERATION_NAME)
    .build();

for (int i=0;i<1000;i++) {
    MetricCollector methodCollector = MetricCollector.create("RPC");
    methodCollector.reportMetric(CoreMetric.API_CALL_DURATION, i);
    methodCollector.reportMetric(CoreMetric.OPERATION_NAME, "YourRPC");
    metricPublisher.publish(methodCollector.collect());
}

// Wait for the periodic metric flush
Thread.sleep(120*1000);

Possible Solution

software/amazon/awssdk/metrics/publishers/cloudwatch/internal/transform/MetricCollectionAggregator.java:128 could be updated from:

MetricDatum data = detailedMetricDatum(timeBucket, detailedAggregator, startIndex, MAX_VALUES_PER_REQUEST - valuesInRequestCounter.get());

To something like:

MetricDatum data = detailedMetricDatum(timeBucket, detailedAggregator, startIndex, Math.min(MAX_VALUES_PER_REQUEST - valuesInRequestCounter.get(), 150));

Context

Trying to implement metrics to monitor the performance of my API server using custom CloudWatch metrics.

AWS Java SDK version used

17

JDK version used

Whatever is in docker image openjdk:17-oracle

Operating System and version

Docker image: openjdk:17-oracle linux-oracle?

@Kurru Kurru added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Mar 4, 2022
@debora-ito
Copy link
Member

@Kurru sorry this is biting you. Yes, the exception is probably related to a CloudWatch service quota, although the funny thing is I cannot reproduce with the repro code you provided.

The CloudWatch service quota docs mention there's a 150 TPS limit for PutMetricData, so maybe the issue is the request rate?

@debora-ito debora-ito added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. and removed needs-triage This issue or PR still needs to be triaged. labels Mar 8, 2022
@debora-ito debora-ito self-assigned this Mar 8, 2022
@Kurru
Copy link
Author

Kurru commented Mar 9, 2022

I don't think its the 150 TPS limit for PutMetricData but rather the request size limit of PutMetricData.

I found a reference to this here:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_PutMetricData.html

Using the Values and Counts method enables you to publish up to 150 values per metric with one PutMetricData request, and supports retrieving percentile statistics on this data.

I made a small edit to the sample code to ensure that the latency metric is a 'detailedMetric' so that the library would report individual values, not a pre-aggregation. Additionally, its worthy of note, the failure is silent. The failure is only exposed as a log message AFTER 60 seconds (or per your CloudWatchMetricPublisher.uploadFrequency config)

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Mar 9, 2022
@debora-ito
Copy link
Member

Thank you for the reference. I still cannot reproduce after the code update, I get a successful 200 response when it really shouldn't, according to the docs.

Screen Shot 2022-03-10 at 11 56 21 AM

918 is the sample count, it seems some metrics are being dropped in the process.

Do you see the exception when you run your repro code?
Can you provide the verbose wirelogs? Please make sure to redact any sensitive information.

@debora-ito debora-ito added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Mar 10, 2022
@github-actions
Copy link

It looks like this issue has not been active for more than five days. In the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please add a comment to prevent automatic closure, or if the issue is already closed please feel free to reopen it.

@github-actions github-actions bot added the closing-soon This issue will close in 4 days unless further comments are made. label Mar 15, 2022
@Kurru
Copy link
Author

Kurru commented Mar 15, 2022

Sorry for the delay @debora-ito

I put together a small reproduction git repo here: https://github.com/Kurru/cloudwatch-highqps

To run it

  • clone the repo
  • cd to /demo directory
  • run ./gradlew :installdist && build/install/demo/bin/demo

I've been able to reproduce the stack trace reliability with this script on my mac-laptop.

One thing I found important was that without a dependency on sl4j-simple the error log didn't appear due to disabled logging by SLF4J

@github-actions github-actions bot removed closing-soon This issue will close in 4 days unless further comments are made. response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. labels Mar 16, 2022
@Kurru
Copy link
Author

Kurru commented Mar 19, 2022

Oh, full stacktrace for this issue:

[sdk-async-response-0-3] WARN software.amazon.awssdk.metrics.publishers.cloudwatch - Failed while publishing some or all AWS SDK client-side metrics to CloudWatch.
java.util.concurrent.CompletionException: software.amazon.awssdk.services.cloudwatch.model.InvalidParameterValueException: The collection MetricData.member.1.Values must not have a size greater than 150. (Service: CloudWatch, Status Code: 400, Request ID: 51f0a1fc-4ea9-42db-9fa7-7b2215174cef)
        at software.amazon.awssdk.utils.CompletableFutureUtils.errorAsCompletionException(CompletableFutureUtils.java:62)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncExecutionFailureExceptionReportingStage.lambda$execute$0(AsyncExecutionFailureExceptionReportingStage.java:51)
        at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:934)
        at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:911)
        at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
        at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162)
        at software.amazon.awssdk.utils.CompletableFutureUtils.lambda$forwardExceptionTo$0(CompletableFutureUtils.java:76)
        at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
        at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
        at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
        at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryingExecutor.maybeAttemptExecute(AsyncRetryableStage.java:103)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryingExecutor.maybeRetryExecute(AsyncRetryableStage.java:181)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryingExecutor.lambda$attemptExecute$1(AsyncRetryableStage.java:167)
        at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
        at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
        at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
        at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage.lambda$null$0(MakeAsyncHttpRequestStage.java:105)
        at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
        at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
        at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
        at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage.lambda$executeHttpRequest$3(MakeAsyncHttpRequestStage.java:163)
        at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
        at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
        at java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:482)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: software.amazon.awssdk.services.cloudwatch.model.InvalidParameterValueException: The collection MetricData.member.1.Values must not have a size greater than 150. (Service: CloudWatch, Status Code: 400, Request ID: 51f0a1fc-4ea9-42db-9fa7-7b2215174cef)
        at software.amazon.awssdk.services.cloudwatch.model.InvalidParameterValueException$BuilderImpl.build(InvalidParameterValueException.java:118)
        at software.amazon.awssdk.services.cloudwatch.model.InvalidParameterValueException$BuilderImpl.build(InvalidParameterValueException.java:78)
        at software.amazon.awssdk.protocols.query.internal.unmarshall.AwsXmlErrorUnmarshaller.unmarshall(AwsXmlErrorUnmarshaller.java:99)
        at software.amazon.awssdk.protocols.query.unmarshall.AwsXmlErrorProtocolUnmarshaller.handle(AwsXmlErrorProtocolUnmarshaller.java:102)
        at software.amazon.awssdk.protocols.query.unmarshall.AwsXmlErrorProtocolUnmarshaller.handle(AwsXmlErrorProtocolUnmarshaller.java:82)
        at software.amazon.awssdk.core.http.MetricCollectingHttpResponseHandler.lambda$handle$0(MetricCollectingHttpResponseHandler.java:52)
        at software.amazon.awssdk.core.internal.util.MetricUtils.measureDurationUnsafe(MetricUtils.java:64)
        at software.amazon.awssdk.core.http.MetricCollectingHttpResponseHandler.handle(MetricCollectingHttpResponseHandler.java:52)
        at software.amazon.awssdk.core.internal.http.async.AsyncResponseHandler.lambda$prepare$0(AsyncResponseHandler.java:89)
        at java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1150)
        at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
        at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147)
        at software.amazon.awssdk.core.internal.http.async.AsyncResponseHandler$BaosSubscriber.onComplete(AsyncResponseHandler.java:132)
        at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler$DataCountingPublisher$1.onComplete(ResponseHandler.java:513)
        at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler.runAndLogError(ResponseHandler.java:249)
        at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler.access$600(ResponseHandler.java:74)
        at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler$PublisherAdapter$1.onComplete(ResponseHandler.java:370)
        at software.amazon.awssdk.http.nio.netty.internal.nrs.HandlerPublisher.publishMessage(HandlerPublisher.java:402)
        at software.amazon.awssdk.http.nio.netty.internal.nrs.HandlerPublisher.flushBuffer(HandlerPublisher.java:338)
        at software.amazon.awssdk.http.nio.netty.internal.nrs.HandlerPublisher.receivedDemand(HandlerPublisher.java:291)
        at software.amazon.awssdk.http.nio.netty.internal.nrs.HandlerPublisher.access$200(HandlerPublisher.java:61)
        at software.amazon.awssdk.http.nio.netty.internal.nrs.HandlerPublisher$ChannelSubscription$1.run(HandlerPublisher.java:495)
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        ... 1 more

@debora-ito debora-ito added the investigating This issue is being investigated and/or work is in progress to resolve the issue. label Nov 2, 2022
@debora-ito
Copy link
Member

@Kurru I'm sorry for loosing track of this issue. Thank you for the repro project. I still can't repro though.

Starting a Gradle Daemon (subsequent builds will be faster)

BUILD SUCCESSFUL in 6s
4 actionable tasks: 4 executed
[main] INFO demo.cloudwatch.HighQpsRepo - Connecting to AWS
[main] INFO demo.cloudwatch.HighQpsRepo - Creating lots of latency metrics
[main] INFO demo.cloudwatch.HighQpsRepo - Done. Waiting for exception...
[main] INFO demo.cloudwatch.HighQpsRepo - Shutdown

I'll work backwards and start from the error message, then see how I can force the error to be thrown.

@debora-ito
Copy link
Member

Can you enable the verbose wirelogs and share the logs with the error?

@Kurru
Copy link
Author

Kurru commented Nov 5, 2022

Do these verbose wirelogs include my AWS credentials?

I have the logs, though not sure how to share that to you if they contain something so sensitive?

@Kurru
Copy link
Author

Kurru commented Nov 5, 2022

I found a trace log in software.amazon.awssdk.metrics.publishers.cloudwatch.internal.MetricUploader#logRequest which should help.

TRACE software.amazon.awssdk.metrics.publishers.cloudwatch - Sending request to CloudWatch: PutMetricDataRequest(Namespace=Test, MetricData=[MetricDatum(MetricName=ApiCallDuration, Dimensions=[Dimension(Name=OperationName, Value=YourRPC)], Timestamp=2022-11-05T21:51:00Z, Values=[2.0, 0.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 33.0, 32.0, 35.0, 34.0, 37.0, 36.0, 39.0, 38.0, 41.0, 40.0, 43.0, 42.0, 45.0, 44.0, 47.0, 46.0, 49.0, 48.0, 51.0, 50.0, 53.0, 52.0, 55.0, 54.0, 57.0, 56.0, 59.0, 58.0, 61.0, 60.0, 63.0, 62.0, 67.0, 66.0, 65.0, 64.0, 71.0, 70.0, 69.0, 68.0, 75.0, 74.0, 73.0, 72.0, 79.0, 78.0, 77.0, 76.0, 83.0, 82.0, 81.0, 80.0, 87.0, 86.0, 85.0, 84.0, 91.0, 90.0, 89.0, 88.0, 95.0, 94.0, 93.0, 92.0, 99.0, 98.0, 97.0, 96.0, 103.0, 102.0, 101.0, 100.0, 107.0, 106.0, 105.0, 104.0, 111.0, 110.0, 109.0, 108.0, 115.0, 114.0, 113.0, 112.0, 119.0, 118.0, 117.0, 116.0, 123.0, 122.0, 121.0, 120.0, 127.0, 126.0, 125.0, 124.0, 135.0, 134.0, 133.0, 132.0, 131.0, 130.0, 129.0, 128.0, 143.0, 142.0, 141.0, 140.0, 139.0, 138.0, 137.0, 136.0, 151.0, 150.0, 149.0, 148.0, 147.0, 146.0, 145.0, 144.0, 159.0, 158.0, 157.0, 156.0, 155.0, 154.0, 153.0, 152.0, 1.0], Counts=[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], Unit=Milliseconds)])

@debora-ito
Copy link
Member

debora-ito commented Nov 9, 2022

Yes, verbose wirelogs will include the access key id, but you can remove it and any other sensitive data before sharing.

I was hoping to see just the part of the request that shows the MetricData members, in the same format of the examples in the API Reference here, something like:

Action=PutMetricData
&Version=2010-08-01
&Namespace=TestMetrics
&MetricData.member.1.MetricName=ApiCallDuration
&MetricData.member.1.Dimensions.member.1.Name=OperationName
&MetricData.member.1.Dimensions.member.1.Value=YourRPC
&MetricData.member.1.Unit=Milliseconds
&MetricData.member.1.Values.member.1=5
&MetricData.member.1.Values.member.2=8
&MetricData.member.1.Values.member.3=10
&MetricData.member.1.Values.member.4=9
&MetricData.member.1.Counts.member.1=1
&MetricData.member.1.Counts.member.2=1
&MetricData.member.1.Counts.member.3=1
&MetricData.member.1.Counts.member.4=1

For reference, in my local tests the metrics are being split in 4 async PutMetricData calls, in each request MetricData has only 1 member (MetricData.member.1) with around 300 Values and Counts members (the example above has 4 members).

@Kurru
Copy link
Author

Kurru commented Nov 11, 2022

I was able to manipulate hex logs using regex into the format you requested. pastebin

I had a hard time getting that specific log format, it doesnt seem to be logged by AWS directly?
This is what the verbose log instructions get me (800lines of hex encoding): pastebin
Perhaps a better/plaining debug logging could be provided by AWS team

@debora-ito debora-ito added p2 This is a standard priority issue and removed investigating This issue is being investigated and/or work is in progress to resolve the issue. labels Mar 20, 2023
@bvanderveen
Copy link

I've just been bitten by this. Changing this line to 150 should fix the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug This issue is a bug. p2 This is a standard priority issue
Projects
None yet
Development

No branches or pull requests

3 participants