Error sending hot-shots message: Error: congestion #131

Closed
nicgirault opened this issue Oct 10, 2019 · 20 comments · Fixed by #175

Comments

@nicgirault

Hello & thanks for your work.

I'm receiving plenty of "congestion" errors and I have no clue how to solve them. Any tips?

@bdeitte
Collaborator

bdeitte commented Oct 14, 2019

I haven't heard of this one before, and it sounds like an issue connecting with your statsd server rather than something that can be solved here. What protocol are you using? You can also set an errorHandler to catch the error and see if that gives you any more info.

@nicgirault
Author

Sorry, I didn't mention that I'm using the UDS protocol.

My configuration looks like this:

// datadog() is the request-metrics middleware factory used by the app
// (e.g. connect-datadog); logger is the application logger.
return datadog({
  dogstatsd: new StatsD({
    protocol: 'uds',
    path: '/var/run/datadog/dsd.socket',
    errorHandler: error => {
      logger.warn(`hot-shots UDS error: ${error.message}`)
    },
  }),
  method: true,
  response_code: true,
})

Since the socket exists, do you think this could come from connecting to the statsd server? Is the statsd server handled by the Datadog agent? The Datadog agent seems to be connected to the socket correctly, since I receive metrics and logs.

@bdeitte
Collaborator

bdeitte commented Oct 16, 2019

The Datadog agent is a form of statsd server. It seems to me that it would be a problem with this library connecting to dsd.socket. There have been some attempts to make retries work better in these cases, but this is still a work in progress: #128

@bdeitte
Collaborator

bdeitte commented Oct 25, 2019

I am going to assume this wasn't using the latest version and close this out to focus on the fixes in the ticket mentioned above. Once that is fixed, if this is still an issue, we can reopen.

@bdeitte bdeitte closed this as completed Oct 25, 2019
@pcothenet

@bdeitte I am getting this error right now, which I wasn't getting on a prior project. They both use the latest version (as of last week; I think a new, unrelated one just got published).

@pcothenet

We have temporarily alleviated our problems by adding an errorHandler that logs instead of throwing and crashing the program.

@bdeitte do you know what could be causing this "congestion"?

@bdeitte
Collaborator

bdeitte commented Mar 22, 2020

@pcothenet Can you provide more details on your protocol setting, version in use, what type of server you're using it with, etc?

I'm also surprised to hear it's crashing the program; that's different from what I've read about. So simply adding an errorHandler means it isn't restarting anymore? And you just saw those errors while it was restarting?

@pcothenet

pcothenet commented Mar 23, 2020

Hi @bdeitte,

Happy to share more details:

Here's the client I expose to my applications:

import { ClientOptions, StatsD } from 'hot-shots';
// initEnv() is an application helper that loads environment config (not part of hot-shots)
const { environment } = initEnv();

const options: ClientOptions = {
  protocol: process.env.DD_DOGSTATSD_PROTOCOL === 'uds' ? 'uds' : undefined,
  path: process.env.DD_DOGSTATSD_SOCKET,
  globalTags: { env: environment },
  errorHandler: (error) => {
    console.error('hot-shots error', error.message);
  }
};
const dog = new StatsD(options);

(I added the errorHandler yesterday.)

(The application then imports this library and uses dog.increment, etc.)

We've been using hot-shots over regular UDP (on an ElasticBeanstalk environment) for about two years now (thanks for all the work, this has been super helpful). We recently started migrating our backend-worker to ECS, with one Datadog agent running as a daemon container.

After figuring out the settings to run with UDS, we rolled this out to one application two weeks ago. We encountered issue #114, but it hasn't been much of a problem.

We rolled out to a second application on Friday and started seeing containers crashing over and over with the congestion error.

A couple more points of context:

  • This second application sends a lot more metrics to StatsD
  • We reduced the sampleRate to a meager 0.1 which seems to reduce the issue but does not eradicate it
  • We have several containers running. Only the ones handling high volumes of data seem to encounter this issue.

I'm pretty new to UDS, so I'm not sure what's causing the congestion or how to solve it. Any ideas on your end? I might reach out to Datadog support to get more info as well.

Thanks a lot again for your work here, really appreciated!

@bdeitte
Collaborator

bdeitte commented Mar 29, 2020

Hi @pcothenet, thanks for all the details. It makes sense that an unhandled error would restart things if you have an unhandledRejection handler set up that exits on it. So I'm not going to focus on that part; instead, it sounds like there's more than one person having this congestion issue. I'll reopen this issue, but to be honest it's still not clear to me whether this is an issue with hot-shots or somewhere else in the stack (Node, ECS, or elsewhere). Reopening will at least make it clearer to more people that this is something that gets run into.
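
(For reference, a rough sketch of the kind of process-level setup I mean, where anything nobody handles exits the process and the orchestrator restarts the container; the handlers here are only illustrative:)

process.on('unhandledRejection', (reason) => {
  console.error('unhandled rejection:', reason);
  process.exit(1); // the orchestrator (ECS here) then restarts the container
});

process.on('uncaughtException', (err) => {
  console.error('uncaught exception:', err);
  process.exit(1);
});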

I am not a big expert in UDS and haven't started using it on the project I work on. But hot-shots uses https://github.com/bnoordhuis/node-unix-dgram for this, so I spent a little time poking around there. I do see there's a congestion callback that I don't believe is in use in hot-shots. It would probably make sense to provide some optionality there in how much to retry when congestion is seen: https://github.com/bnoordhuis/node-unix-dgram#socketsendbuf-callback
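
Roughly, the pattern the node-unix-dgram README describes looks like the sketch below. This is that library's API, not something hot-shots does today, and the queueing here is only illustrative:

const unix = require('unix-dgram');

const socket = unix.createSocket('unix_dgram');
const pending = [];

socket.on('congestion', (buf) => {
  // Kernel buffer is full: stop sending and keep the message for later.
  pending.push(buf);
});

socket.on('writable', () => {
  // The buffer has drained: flush whatever queued up while congested.
  while (pending.length > 0) {
    socket.send(pending.shift());
  }
});

socket.on('error', (err) => console.error('uds error:', err));

socket.connect('/var/run/datadog/dsd.socket');
socket.send(Buffer.from('test.metric:1|c'));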

@bdeitte bdeitte reopened this Mar 29, 2020
@pcothenet

pcothenet commented Mar 29, 2020

Yeah, I've generally found that the error handling in the library makes sense. Thanks for maintaining this project; I've been a big user!

I'm not an expert in UDS either. It looks like one of those things that Node.js isn't particularly well suited for (node-unix-dgram adds the dreaded node-gyp steps when installing). I'll poke Datadog to see if it's something they can provide clarity on.

@coffenbacher

FWIW, I'm also seeing this error consistently using UDS, with a totally different stack (Rancher K3s on a DigitalOcean VM) and very low volume. It's weird because it works, but I get this error all the time, on roughly 10% of the calls I make. I suspect it might be because I make a few back-to-back statsd calls; even though the volume is low overall, that might be a clue.

Anyway, I don't have any new info for now. I'm going to experiment at some point, but I just wanted to add another data point to the thread.

@coffenbacher

I was able to repro this locally by running a dev UDS server:

socat UNIX-RECVFROM:/var/run/datadog/dsd.socket,fork -

and then spamming the socket:

// hotshot is a StatsD client configured for UDS (see the workaround below)
for (let i = 0; i < 1000; i++) {
  console.log(i)
  hotshot.increment("test", [i.toString()])
  console.log("Incremented", i)
}

Locally, about 300 messages would go through before crashing with the congestion error.

I don't have time at the moment to dig into the library and submit a well-designed PR, but in case it helps anyone else who needs a workaround, here's my quick fix that successfully sends all the messages (eventually):

import retry from "async-retry"
import StatsD from "hot-shots"

const hotshot = new StatsD({
    protocol: "uds",
    path: "/var/run/datadog/dsd.socket",
    errorHandler: (error) => {
      const isCongested = error.message.includes("congestion")
      if (isCongested) {
        throw error
      } else {
        console.error({
          message: error.message,
          error: error,
          library: "hot-shots",
        })
      }
    },
})

for (let i = 0; i < 1000; i++) {
  retry((bail) => {
    console.log(i)
    hotshot.increment("test", [i.toString()])
    console.log("Incremented")
  })
}

@DerGut
Contributor

DerGut commented Jul 1, 2020

I have run into this too, using a very similar setup without explicitly specifying the udsGracefulErrorHandling option:

new StatsD({
  protocol: "uds",
  errorHandler: (error) => logger.warn("hot-shots error: ", error)
});

The documentation states that udsGracefulErrorHandling is true by default, and looking at the following line it almost seems like it:

this.udsGracefulErrorHandling = 'udsGracefulErrorHandling' in options ? options.udsGracefulErrorHandling : true;

However, the code that sets up the error handling doesn't actually use this.udsGracefulErrorHandling. It looks like the error handling is only set up if udsGracefulErrorHandling is explicitly set to true in the options:

if (!options.isChild && options.protocol === PROTOCOL.UDS && options.udsGracefulErrorHandling) {
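
As a workaround until that check is fixed, the option can be passed explicitly so that options.udsGracefulErrorHandling is truthy. A minimal sketch:

import { StatsD } from "hot-shots";

// Passing udsGracefulErrorHandling explicitly so the options.udsGracefulErrorHandling
// check above sees a truthy value.
const client = new StatsD({
  protocol: "uds",
  udsGracefulErrorHandling: true,
  errorHandler: (error) => console.warn("hot-shots error:", error.message),
});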

I am still testing whether this removes the Error sending hot-shots message: Error: congestion errors for me, but I am happy to provide a PR if it does.

Thanks for your work!

@bdeitte
Collaborator

bdeitte commented Jul 2, 2020

Oh, nice catch there @DerGut! That's definitely a bug in any case, even if it doesn't help here, for that last line to use options instead of this.

@DerGut
Contributor

DerGut commented Jul 2, 2020

Definitely, you are right @bdeitte!
Unfortunately, it hasn't solved the congestion error for me, but I have created a separate issue #172 to track it.

Regarding the congestion error: I am getting an error code of 1, which isn't handled.

@DerGut
Contributor

DerGut commented Jul 3, 2020

I have looked a little more into this.

In the unix-dgram library, these are the lines that handle a failed send operation. They specifically handle the error codes EAGAIN, EWOULDBLOCK and ENOBUFS, which are all indicative of a full socket buffer.

When the message does not fit into the send buffer of the socket, send() normally blocks, unless the socket has been placed in nonblocking I/O mode. In nonblocking mode it would fail with the error EAGAIN or EWOULDBLOCK
sys/socket.h

The output queue for a network interface was full. This generally indicates that the interface has stopped sending, but may be caused by transient congestion. (Normally, this does not occur in Linux. Packets are just silently dropped when a device queue overflows.)
sys/socket.h

unix-dgram propagates these errors with an error code of 1 back to Node and returns a congestion error to hot-shots's handleCallback function.

Now, according to the Unix Domain Sockets support specification by Datadog

Timeout / blocking call: the server is not handling packets fast enough and the queue is full. Client should keep the connection open, but drop the packet. Reconnecting will not help.

On these types of errors the client should drop the packets, and in fact the datadogpy implementation (timeout and EAGAIN) does so.
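
Translated to this library, an application-level errorHandler could mirror that behaviour until it is handled internally. A sketch, not what hot-shots currently does on its own:

import { StatsD } from "hot-shots";

const client = new StatsD({
  protocol: "uds",
  path: "/var/run/datadog/dsd.socket",
  errorHandler: (error) => {
    if (error.message.includes("congestion")) {
      // Buffer is full: drop the packet, keep the socket open, don't reconnect.
      return;
    }
    console.error("hot-shots error:", error);
  },
});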

@bdeitte
Collaborator

bdeitte commented Jul 4, 2020

Nice finds, @DerGut. Given your investigation here, it makes sense that this issue would be fixed by handling the congestion callback internally in hot-shots, so that it doesn't propagate an error any further.

@DerGut
Contributor

DerGut commented Jul 6, 2020

@bdeitte, I thought so too until I found this document. Perhaps it could be valuable to know that packets are dropped, so that one can optimise for it?

I think it makes sense to either:

  • propagate the error but explain its meaning / refer to the document in the docs, or
  • implement some of the client-side metrics mentioned at the bottom of the document and omit the propagation (a rough sketch follows below)
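
For the second option, a rough sketch of what that client-side accounting could look like; the metric name here is only illustrative, the document lists the official telemetry names:

import { StatsD } from "hot-shots";

let packetsDropped = 0;

const client = new StatsD({
  protocol: "uds",
  errorHandler: (error) => {
    if (error.message.includes("congestion")) {
      packetsDropped += 1; // drop the packet, but keep count
      return;
    }
    console.error("hot-shots error:", error);
  },
});

// Periodically report how many packets were dropped since the last flush.
setInterval(() => {
  if (packetsDropped > 0) {
    client.increment("dogstatsd.client.packets_dropped", packetsDropped);
    packetsDropped = 0;
  }
}, 10000);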

@bdeitte
Collaborator

bdeitte commented Jul 7, 2020

I had never seen those metrics; good to know about them. If anybody at Datadog is reading this, you can always create issues here to keep this client more in sync. :)

Since you've been doing great research here @DerGut, what are your thoughts on this plan? I think those metrics make the most sense, to stay aligned here, but it's certainly a bunch of work. Keeping the error propagated but documenting it better could make sense first, though; that's close to being done already, we just need to add a README note. Let me know if you're going to do that, or I can one of these days. As for the metrics, I'm happy to take a PR from anyone on that, and once the other part is done I think a new issue would make sense to track it, referencing this one.

@DerGut
Contributor

DerGut commented Jul 12, 2020

I can submit a PR to add a section to the README and open another issue for the metrics.

You're right that this is probably the best solution for now, as I don't have time to implement the metrics at this point. If there hasn't been a PR for it in the next couple of weeks, I can take another look at it 👍
