Error sending hot-shots message: Error: congestion #131

Closed
nicgirault opened this issue Oct 10, 2019 · 20 comments · Fixed by #175

Comments

@nicgirault

Hello & thanks for your work.

I'm receiving plenty of "congestion" errors and I have no clue how to solve them. Any tips?

@bdeitte
Collaborator

bdeitte commented Oct 14, 2019

I haven't heard of this one before, and it sounds like an issue connecting with your statsd server rather than something that can be solved here. What protocol are you using? You can also set an errorHandler to catch the error and see if that gives you any more info.

@nicgirault
Author

Sorry, I didn't mention that I'm using the UDS protocol.

My configuration looks like this:

// datadog() is the request-metrics middleware factory used by the app
// (e.g. connect-datadog); logger is the application logger.
return datadog({
  dogstatsd: new StatsD({
    protocol: 'uds',
    path: '/var/run/datadog/dsd.socket',
    errorHandler: error => {
      logger.warn(`hot-shots UDS error: ${error.message}`)
    },
  }),
  method: true,
  response_code: true,
})

Since the socket exists, do you think this could come from connecting to the statsd server? Is the statsd server handled by the Datadog agent? The Datadog agent seems to be connected to the socket correctly, since I receive metrics and logs.

@bdeitte
Collaborator

bdeitte commented Oct 16, 2019

The Datadog agent is a form of statsd server. It seems to me that it would be a problem with this library connecting to dsd.socket. There have been some attempts to make retries work better in these cases, but this is still a work in progress: #128

@bdeitte
Collaborator

bdeitte commented Oct 25, 2019

I am going to assume this wasn't using the latest version and close this out to focus on the fixes in the ticket mentioned above. Once that is fixed, if this is still an issue, we can reopen.

@bdeitte bdeitte closed this as completed Oct 25, 2019
@pcothenet

@bdeitte I am getting this error right now, which I wasn't getting on a prior project. They both use the latest version (as of last week; I think a new, unrelated one just got published).

@pcothenet

We have temporarily alleviated our problems by adding an errorHandler that logs instead of throwing and crashing the program.

@bdeitte do you know what could be causing this "congestion"?

@bdeitte
Collaborator

bdeitte commented Mar 22, 2020

@pcothenet Can you provide more details on your protocol setting, version in use, what type of server you're using it with, etc?

I'm also surprised to hear it's crashing the program; that's different from what I've read about. So simply adding an errorHandler means it isn't restarting anymore? And you just saw those errors while it was restarting?

@pcothenet

pcothenet commented Mar 23, 2020

Hi @bdeitte,

Happy to share more details:

Here's the client I expose to my applications:

import { ClientOptions, StatsD } from 'hot-shots';
// initEnv() is an application helper that loads environment config (not part of hot-shots)
const { environment } = initEnv();

const options: ClientOptions = {
  protocol: process.env.DD_DOGSTATSD_PROTOCOL === 'uds' ? 'uds' : undefined,
  path: process.env.DD_DOGSTATSD_SOCKET,
  globalTags: { env: environment },
  errorHandler: (error) => {
    console.error('hot-shots error', error.message);
  }
};
const dog = new StatsD(options);

(I added the errorHandler yesterday.)

(The application then imports this library and uses dog.increment, etc.)

We've been using hot-shots over regular UDP (on an ElasticBeanstalk environment) for about two years now (thanks for all the work, this has been super helpful). We recently started migrating our backend-worker to ECS, with one Datadog agent running as a daemon container.

After figuring out the settings to run with UDS, we rolled this out to one application two weeks ago. We encountered issue #114, but it hasn't been much of a problem.

We rolled out to a second application on Friday and started seeing containers crashing over and over with the congestion error.

A couple more points of context:

  • This second application sends a lot more metrics to StatsD
  • We reduced the sampleRate to a meager 0.1 which seems to reduce the issue but does not eradicate it
  • We have several containers running. Only the ones handling high volumes of data seem to encounter this issue.

I'm pretty new to UDS, so I'm not sure what's causing the congestion or how to solve it. Any ideas on your end? I might reach out to Datadog support to get more info as well.

Thanks a lot again for your work here, really appreciated!

@bdeitte
Collaborator

bdeitte commented Mar 29, 2020

Hi @pcothenet, thanks for all the details. It makes sense that an unhandled error would restart things if you have an unhandledRejection handler set up that exits on it. So I'm not going to focus on that part; instead, it sounds like there's more than one person having this congestion issue. I'll reopen this issue, but to be honest it's still not clear to me whether this is an issue with hot-shots or somewhere else in the stack (Node, ECS, or elsewhere). Reopening will at least make it clearer to more people that this is something that gets run into.
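
(For reference, a rough sketch of the kind of process-level setup I mean, where anything nobody handles exits the process and the orchestrator restarts the container; the handlers here are only illustrative:)

process.on('unhandledRejection', (reason) => {
  console.error('unhandled rejection:', reason);
  process.exit(1); // the orchestrator (ECS here) then restarts the container
});

process.on('uncaughtException', (err) => {
  console.error('uncaught exception:', err);
  process.exit(1);
});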

I am not a big expert in UDS and haven't started using it on the project I work on. But hot-shots uses https://github.com/bnoordhuis/node-unix-dgram for this, so I spent a little time poking around there. I do see there's a congestion callback that I don't believe is in use in hot-shots. It would probably make sense to provide some optionality there in how much to retry when congestion is seen: https://github.com/bnoordhuis/node-unix-dgram#socketsendbuf-callback
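
Roughly, the pattern the node-unix-dgram README describes looks like the sketch below. This is that library's API, not something hot-shots does today, and the queueing here is only illustrative:

const unix = require('unix-dgram');

const socket = unix.createSocket('unix_dgram');
const pending = [];

socket.on('congestion', (buf) => {
  // Kernel buffer is full: stop sending and keep the message for later.
  pending.push(buf);
});

socket.on('writable', () => {
  // The buffer has drained: flush whatever queued up while congested.
  while (pending.length > 0) {
    socket.send(pending.shift());
  }
});

socket.on('error', (err) => console.error('uds error:', err));

socket.connect('/var/run/datadog/dsd.socket');
socket.send(Buffer.from('test.metric:1|c'));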

@bdeitte bdeitte reopened this Mar 29, 2020
@pcothenet

pcothenet commented Mar 29, 2020

Yeah, I've generally found that the error handling in the library makes sense. Thanks for maintaining this project; I've been a big user!

I'm not an expert in UDS either. It looks like one of those things that Node.js isn't particularly well suited for (node-unix-dgram adds the dreaded node-gyp steps when installing). I'll poke Datadog to see if it's something they can provide clarity on.

@coffenbacher

FWIW, I'm also seeing this error consistently using UDS, with a totally different stack (Rancher K3s on a DigitalOcean VM) and very low volume. It's weird because it works, but I get this error all the time, on roughly 10% of the calls I make. I suspect it might be because I make a few back-to-back statsd calls; even though the volume is low overall, that might be a clue.

Anyway, I don't have any new info for now. I'm going to experiment at some point, but I just wanted to add another data point to the thread.

@coffenbacher

I was able to repro this locally by running a dev UDS server:

socat UNIX-RECVFROM:/var/run/datadog/dsd.socket,fork -

and then spamming the socket:

// hotshot is a StatsD client configured for UDS (see the workaround below)
for (let i = 0; i < 1000; i++) {
  console.log(i)
  hotshot.increment("test", [i.toString()])
  console.log("Incremented", i)
}

Locally, about 300 messages would go through before crashing with the congestion error.

I don't have time at the moment to dig into the library and submit a well-designed PR, but in case it helps anyone else who needs a workaround, here's my quick fix that successfully sends all the messages (eventually):

import retry from "async-retry"
import StatsD from "hot-shots"

const hotshot = new StatsD({
    protocol: "uds",
    path: "/var/run/datadog/dsd.socket",
    errorHandler: (error) => {
      const isCongested = error.message.includes("congestion")
      if (isCongested) {
        throw error
      } else {
        console.error({
          message: error.message,
          error: error,
          library: "hot-shots",
        })
      }
    },
})

for (let i = 0; i < 1000; i++) {
  retry((bail) => {
    console.log(i)
    hotshot.increment("test", [i.toString()])
    console.log("Incremented")
  })
}

@DerGut
Contributor

DerGut commented Jul 1, 2020

I have run into this too, using a very similar setup without explicitly specifying the udsGracefulErrorHandling option:

new StatsD({
  protocol: "uds",
  errorHandler: (error) => logger.warn("hot-shots error: ", error)
});

The documentation states that udsGracefulErrorHandling is true by default, and looking at the following line it almost seems like it:

this.udsGracefulErrorHandling = 'udsGracefulErrorHandling' in options ? options.udsGracefulErrorHandling : true;

However, the code that sets up the error handling doesn't actually use this.udsGracefulErrorHandling. It looks like the error handling is only set up if udsGracefulErrorHandling is explicitly set to true in the options:

if (!options.isChild && options.protocol === PROTOCOL.UDS && options.udsGracefulErrorHandling) {
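
As a workaround until that check is fixed, the option can be passed explicitly so that options.udsGracefulErrorHandling is truthy. A minimal sketch:

import { StatsD } from "hot-shots";

// Passing udsGracefulErrorHandling explicitly so the options.udsGracefulErrorHandling
// check above sees a truthy value.
const client = new StatsD({
  protocol: "uds",
  udsGracefulErrorHandling: true,
  errorHandler: (error) => console.warn("hot-shots error:", error.message),
});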

I am still testing whether this removes the Error sending hot-shots message: Error: congestion errors for me, but I am happy to provide a PR if it does.

Thanks for your work!

@bdeitte
Collaborator

bdeitte commented Jul 2, 2020

Oh, nice catch there @DerGut! That's definitely a bug in any case, even if it doesn't help here, for that last line to use options instead of this.

@DerGut
Contributor

DerGut commented Jul 2, 2020

Definitely, you are right @bdeitte!
Unfortunately, it hasn't solved the congestion error for me, but I have created a separate issue #172 to track it.

Regarding the congestion error: I am getting an error code of 1, which isn't handled.

@DerGut
Contributor

DerGut commented Jul 3, 2020

I have looked a little more into this.

In the unix-dgram library, these are the lines that handle a failed send operation. They specifically handle the error codes EAGAIN, EWOULDBLOCK and ENOBUFS, which are all indicative of a full socket buffer.

When the message does not fit into the send buffer of the socket, send() normally blocks, unless the socket has been placed in nonblocking I/O mode. In nonblocking mode it would fail with the error EAGAIN or EWOULDBLOCK
sys/socket.h

The output queue for a network interface was full. This generally indicates that the interface has stopped sending, but may be caused by transient congestion. (Normally, this does not occur in Linux. Packets are just silently dropped when a device queue overflows.)
sys/socket.h

unix-dgram propagates these errors with an error code of 1 back to Node and returns a congestion error to hot-shots's handleCallback function.

Now, according to the Unix Domain Sockets support specification by Datadog

Timeout / blocking call: the server is not handling packets fast enough and the queue is full. Client should keep the connection open, but drop the packet. Reconnecting will not help.

On these types of errors the client should drop the packets, and in fact the datadogpy implementation (timeout and EAGAIN) does so.
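
Translated to this library, an application-level errorHandler could mirror that behaviour until it is handled internally. A sketch, not what hot-shots currently does on its own:

import { StatsD } from "hot-shots";

const client = new StatsD({
  protocol: "uds",
  path: "/var/run/datadog/dsd.socket",
  errorHandler: (error) => {
    if (error.message.includes("congestion")) {
      // Buffer is full: drop the packet, keep the socket open, don't reconnect.
      return;
    }
    console.error("hot-shots error:", error);
  },
});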

@bdeitte
Collaborator

bdeitte commented Jul 4, 2020

Nice finds, @DerGut. Given your investigation here, it makes sense that this issue would be fixed by handling the congestion callback internally in hot-shots, so that it doesn't propagate an error any further.

@DerGut
Contributor

DerGut commented Jul 6, 2020

@bdeitte, I thought so too until I found this document. Perhaps it could be valuable to know that packets are dropped, so that one can optimise for it?

I think it makes sense to either:

  • propagate the error but explain its meaning / refer to the document in the docs, or
  • implement some of the client-side metrics mentioned at the bottom of the document and omit the propagation (a rough sketch follows below)
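
For the second option, a rough sketch of what that client-side accounting could look like; the metric name here is only illustrative, the document lists the official telemetry names:

import { StatsD } from "hot-shots";

let packetsDropped = 0;

const client = new StatsD({
  protocol: "uds",
  errorHandler: (error) => {
    if (error.message.includes("congestion")) {
      packetsDropped += 1; // drop the packet, but keep count
      return;
    }
    console.error("hot-shots error:", error);
  },
});

// Periodically report how many packets were dropped since the last flush.
setInterval(() => {
  if (packetsDropped > 0) {
    client.increment("dogstatsd.client.packets_dropped", packetsDropped);
    packetsDropped = 0;
  }
}, 10000);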

@bdeitte
Collaborator

bdeitte commented Jul 7, 2020

I had never seen those metrics; good to know about them. If anybody at Datadog is reading this, you can always create issues here to keep this client more in sync. :)

Since you've been doing great research here @DerGut, what are your thoughts on this plan? I think those metrics make the most sense, to stay aligned here, but it's certainly a bunch of work. Keeping the error propagated but documenting it better could make sense first, though; that's close to being done already, we just need to add a README note. Let me know if you're going to do that, or I can one of these days. As for the metrics, I'm happy to take a PR from anyone on that, and once the other part is done I think a new issue would make sense to track it, referencing this one.

@DerGut
Contributor

DerGut commented Jul 12, 2020

I can submit a PR to add a section to the README and open another issue for the metrics.

You're right that this is probably the best solution for now, as I don't have time to implement the metrics at this point. If there hasn't been a PR for it in the next couple of weeks, I can take another look at it 👍
