Error sending hot-shots message: Error: congestion #131
I haven't heard of this one before, and it sounds like an issue connecting with your statsd server rather than something that can be solved in here. What protocol are you using? You can also set an errorHandler to catch the error and see if that gives you any more info.
Sorry, I didn't specify that I use the UDS protocol. My configuration looks like this:

```js
return datadog({
  dogstatsd: new StatsD({
    protocol: 'uds',
    path: '/var/run/datadog/dsd.socket',
    errorHandler: error => {
      logger.warn(`hot-shots UDS error: ${error.message}`)
    },
  }),
  method: true,
  response_code: true,
})
```

Since the socket exists, do you think this can come from connecting to the statsd server? Is the statsd server handled by the Datadog agent? The agent seems well connected to the socket, since I receive metrics and logs.
The datadog agent is a form of a statsd server. It seems to me that it would be a problem with this library connecting to the dsd.socket. There have been some attempts to make retries in these cases work better, but this is still a work in progress: #128
I am going to assume this wasn't using the latest version and close this out to focus on the fixes in the ticket mentioned above. After that is fixed, if this is still an issue, we can reopen.
@bdeitte I am getting this error right now, which I wasn't getting on a prior project. They both use the latest version (as of last week; I think a new, unrelated one just got published).
We have temporarily alleviated our problems by adding an errorHandler. @bdeitte, do you know what could be causing this "congestion"?
@pcothenet Can you provide more details on your protocol setting, the version in use, what type of server you're using it with, etc.? I'm also surprised to hear it's crashing the program; that's different from what I read about. So simply adding an errorHandler means it isn't restarting? And did you just see those errors when it was restarting?
Hi @bdeitte, happy to share more details. Here's the client I expose to my applications:

```ts
import { ClientOptions, StatsD } from 'hot-shots';

const { environment } = initEnv();

const options: ClientOptions = {
  protocol: process.env.DD_DOGSTATSD_PROTOCOL === 'uds' ? 'uds' : undefined,
  path: process.env.DD_DOGSTATSD_SOCKET,
  globalTags: { env: environment },
  errorHandler: (error) => {
    console.error('hot-shots error', error.message);
  }
};

const dog = new StatsD(options);
```

(I added the errorHandler yesterday. The application then imports this library and uses dog.increment, etc.)

We've been using hot-shots over regular UDP (on an ElasticBeanstalk environment) for about two years now (thanks for all the work, this has been super helpful). We recently started migrating our backend worker to ECS, with one Datadog agent running as a daemon container. After figuring out the settings to run with UDS, we rolled this out to one application two weeks ago. We encountered issue #114, but it hasn't been much of a problem. We rolled out to a second application on Friday and started seeing containers crashing over and over with the congestion error.
A couple more notes for context: I'm pretty new to UDS, so I'm not sure what's causing the congestion or how to solve it. Any ideas on your end? I might reach out to Datadog support to get more info as well. Thanks a lot again for your work here, it's really appreciated!
Hi @pcothenet, thanks for all the details. It makes sense that an unhandled error would restart things if you have an unhandledRejection handler set up that exits on it. So I'm not going to focus on that part; instead, it sounds like there's more than one person having this congestion issue. I'll reopen this issue, but to be honest it's still not clear to me whether this is an issue with hot-shots or somewhere else in the stack (Node, ECS, or elsewhere). But reopening will at least make it clearer to more people that this is something that gets run into. I am not a big expert in UDS and haven't started using it on the project I work on, but hot-shots uses https://github.com/bnoordhuis/node-unix-dgram for this, so I spent a little time poking around there. I do see there's a congestion callback that I don't believe is in use in hot-shots. It would probably make sense to provide some optionality there in how much to retry when congestion is seen: https://github.com/bnoordhuis/node-unix-dgram#socketsendbuf-callback
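For reference, a minimal sketch of what using that callback directly could look like, going off node-unix-dgram's documented `congestion`/`writable` events (the socket path is just the one from earlier in this thread):

```js
// Sketch of node-unix-dgram's documented congestion handling, independent of
// hot-shots: queue messages while the receiver's buffer is full and flush
// them once the socket reports it is writable again.
import unix from "unix-dgram";

const socket = unix.createSocket("unix_dgram");
const pending = [];
let congested = false;

socket.on("congestion", (buf) => {
  // The message could not be delivered; hold it until 'writable' fires.
  congested = true;
  pending.push(buf);
});

socket.on("writable", () => {
  congested = false;
  while (pending.length > 0 && !congested) {
    socket.send(pending.shift());
  }
});

socket.on("error", (err) => console.error("socket error:", err));
socket.connect("/var/run/datadog/dsd.socket");

function sendLine(line) {
  const buf = Buffer.from(line);
  congested ? pending.push(buf) : socket.send(buf);
}

sendLine("test.metric:1|c");
```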
Yeah, I've generally found that the error handling in the library makes sense. Thanks for maintaining this project; I've been a big user! I'm not an expert in UDS either. It looks like one of those things that Node.js isn't particularly well-suited for (node-unix-dgram adds the dreaded node-gyp steps when installing). I'll poke Datadog to see if it's something they can provide clarity on.
FWIW, I'm also seeing this error consistently using UDS, with a totally different stack (Rancher K3s on a DigitalOcean VM) and very low volume. It's weird because it works, but I get this error all the time, on about 10% of the calls I make. I suspect it might be because I make a few back-to-back statsd calls; even though the volume is low overall, that might be a clue. Anyway, I don't have any new info for now, and I'm going to experiment at some point, but I just wanted to add another data point to the thread.
I was able to repro this locally by running a dev UDS server:

```sh
socat UNIX-RECVFROM:/var/run/datadog/dsd.socket,fork -
```

and then spamming the socket:

```js
for (let i = 0; i < 1000; i++) {
  console.log(i)
  hotshot.increment("test", [i.toString()])
  console.log("Incremented", i)
}
```

Locally, about 300 messages would go through before crashing with the congestion error. I don't have time at the moment to dig into the library and submit a well-designed PR, but if it helps anyone else, here's my quick fix that successfully sends all the messages (eventually), in case anyone needs a workaround:

```js
import retry from "async-retry"
import StatsD from "hot-shots"

const hotshot = new StatsD({
  protocol: "uds",
  path: "/var/run/datadog/dsd.socket",
  errorHandler: (error) => {
    const isCongested = error.message.includes("congestion")
    if (isCongested) {
      // Rethrow congestion errors so the retry wrapper sees the failure
      throw error
    } else {
      console.error({
        message: error.message,
        error: error,
        library: "hot-shots",
      })
    }
  },
})

for (let i = 0; i < 1000; i++) {
  retry((bail) => {
    console.log(i)
    hotshot.increment("test", [i.toString()])
    console.log("Incremented")
  })
}
```
I have stepped into this too, using a very similar setup without explicitly specifying the `path` option:

```js
new StatsD({
  protocol: "uds",
  errorHandler: (error) => logger.warn("hot-shots error: ", error)
});
```

The documentation states that errors like this are passed to the `errorHandler` (Line 70 in b5477ec). However, when I look at the code that sets up the error handling, it doesn't actually use the `errorHandler` from `options` (Line 126 in 2da574f). I am still on my way to testing whether this removes the congestion errors. Thanks for your work!
Oh, nice catch there @DerGut! That's definitely a bug in any case, even if it doesn't help here: that last line should be `options` instead of `this`.
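As a hypothetical reduction of that bug (not the actual hot-shots source), the difference looks like this:

```js
// Hypothetical reduction of the bug above, not the actual hot-shots source:
// during setup, reading the handler from `this` before it has been assigned
// yields undefined, so socket errors get thrown instead of handled.
class Client {
  constructor(options) {
    const handler = this.errorHandler; // undefined: nothing assigned to `this` yet
    // const handler = options.errorHandler; // presumably what was intended
    this.errorHandler = options.errorHandler;
    console.log(typeof handler); // "undefined"
  }
}

new Client({ errorHandler: (err) => console.warn(err.message) });
```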
I have looked a little more into this. According to the Unix Domain Sockets support specification by Datadog, on these types of errors the client should be dropping the packets. And in fact, the datadogpy implementation does exactly that (timeout and EAGAIN).
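For illustration, that drop-the-packet behavior can be approximated from the outside with the public `errorHandler` option; the substring match on "congestion" below is an assumption based on the error messages reported in this thread:

```js
// Sketch: approximate the drop-on-congestion behavior described above using
// only the public errorHandler option. Matching on the "congestion" substring
// is an assumption based on the error messages seen in this thread.
import StatsD from "hot-shots";

let droppedPackets = 0;

const client = new StatsD({
  protocol: "uds",
  path: "/var/run/datadog/dsd.socket",
  errorHandler: (error) => {
    if (error.message.includes("congestion")) {
      droppedPackets += 1; // drop the metric, keep the process alive
      return;
    }
    console.error("hot-shots error:", error.message);
  },
});

client.increment("test.metric");
```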
Nice finds, @DerGut. Given your investigation here, it makes sense that this issue would be fixed by handling the congestion callback internally in hot-shots, so that it doesn't propagate an error any further.
@bdeitte, I thought so too until I found this document. I think it makes sense to either:

- keep propagating the error, but document the behavior better, or
- handle congestion internally and report it through client telemetry metrics, the way other Datadog clients do.
I have never seen those metrics; good to know about them. If anybody at Datadog is reading this, you can always create issues here to keep this client more in sync. :) Since you've been doing great research here, @DerGut, what are your thoughts on this plan? I think those metrics make the most sense, to be aligned here, but it's certainly a bunch of work. Keeping the error propagated but documenting it better could make sense first, though; that's close to being done already, it just needs a README note. Let me know if you're going to do that, or I can one of these days. As for metrics, I'm happy to take a PR from anyone on that, and once the other part is done I think a new issue referencing this one would make sense to track it.
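As a rough illustration, reporting those drops as a metric could look something like this (the counter and the metric name are hypothetical, not anything hot-shots emits today):

```js
// Sketch: count congestion drops in the errorHandler and flush them
// periodically as a metric. The metric name is hypothetical and not part of
// any client telemetry hot-shots emits today.
import StatsD from "hot-shots";

let dropped = 0;

const client = new StatsD({
  protocol: "uds",
  path: "/var/run/datadog/dsd.socket",
  errorHandler: (error) => {
    if (error.message.includes("congestion")) {
      dropped += 1;
      return;
    }
    console.error("hot-shots error:", error.message);
  },
});

setInterval(() => {
  if (dropped > 0) {
    client.increment("statsd.client.packets_dropped", dropped); // hypothetical name
    dropped = 0;
  }
}, 10000);
```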
I can submit a PR to add a section to the README and open another issue for the metrics. You are right that this is probably the best solution for now, as I don't have time to implement the metrics at this point. If there hasn't been a PR for it in the next couple of weeks, I can take another look at it 👍
Hello, and thanks for your work. I receive plenty of "congestion" errors and have no clue how to solve them. Any tips?