Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle issues with udsGracefulErrorHandling #128

Closed
bdeitte opened this issue Oct 7, 2019 · 5 comments · Fixed by #189
Closed

Handle issues with udsGracefulErrorHandling #128

bdeitte opened this issue Oct 7, 2019 · 5 comments · Fixed by #189

Comments

@bdeitte
Copy link
Collaborator

bdeitte commented Oct 7, 2019

See #123 (comment). It appears we need to handle more error cases for retries.

@msiebuhr
Copy link
Contributor

Shouldn't we just try re-connecting on all of them.

And a important detail I noticed: Error-codes are negative. My code sees -102, -107 and -111.

@bdeitte bdeitte changed the title Handle more error cases with udsGracefulErrorHandling Handle issues with udsGracefulErrorHandling Oct 14, 2019
@bdeitte
Copy link
Collaborator Author

bdeitte commented Oct 14, 2019

@msiebuhr I see your comments in #114 (comment) as well so making this a more generic ticket on potential issues with this feature. Also interesting on what you see for codes, as it's different than what @alessio-modus mentions in #123

@pcothenet
Copy link

We're experiencing something like #114. When the agent restart, hot-shots never seem to reconnect properly (error -107)

@dhermes
Copy link
Contributor

dhermes commented Sep 25, 2020

I'm currently looking into this as well. (I'll probably update this post as I go.) I've got a local-dd-agent Docker container / host UDS I've been using to reproduce this.

Error codes:

  • 102 / 0x66 (Linux): ENETRESET
  • 107 / 0x6b (Linux): ENOTCONN
  • 111 / 0x6f (Linux): ECONNREFUSED (see also require('util').getSystemErrorName(-111))
  • 54 / 0x36 (macOS): ECONNRESET, happens only on the first failure when UDS is deleted
  • 39 / 0x27 (macOS): EDESTADDRREQ, happens for every attempt to send after deleting the initial UDS (even if a new file replaces it)

I see the handling of these send() errors just hardcodes 107 and 111, which is unfortunately platform specific.

@dhermes
Copy link
Contributor

dhermes commented Sep 25, 2020

So here is a summary of the problems with the current implementation:

  1. The error codes are negative (i.e. -107, -111) so no recovery will happen as written:

    hot-shots/lib/statsd.js

    Lines 128 to 129 in c488cc7

    case 107:
    case 111: {
  2. If this.socket doesn't get created on start (as a UDS) then it will never recover (even if the UDS eventually gets fixed / recreated). This can be addressed by attempting to re-create a socket in sendMessage()
    if (!this.socket) {
  3. As mentioned above the list of errno codes are platform-specific to Linux

    hot-shots/lib/statsd.js

    Lines 128 to 129 in c488cc7

    case 107:
    case 111: {
    I recommend ditching the switch statement for a something like errorCodes.includes(-err.code) with errorCodes produced by
    function udsErrors() {
      if (process.platform == 'linux') {
        return [
          os.constants.errno.ENOTCONN,
          os.constants.errno.ECONNREFUSED,
        ]
      }
    
      if (process.platform == 'darwin') {
        return [
          os.constants.errno.EDESTADDRREQ,
          os.constants.errno.ECONNRESET,
        ]
      }
    
      // Unknown / not yet implemented
      return []
    }
  4. When a new socket is created in the UDS recovery code
    this.socket = createTransport(this, {
    it doesn't re-register socket error handling code, so this means a UDS failure can only be recovered from one time

I'll send a PR (based on dhermes#1) with fixes for all 4 but this may be a longer discussion

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants