
Debug the ConnectionReverse warning: [ERR_STREAM_WRITE_AFTER_END]: write after end #300

Closed
joshuakarp opened this issue Nov 30, 2021 · 17 comments · Fixed by #292
Labels: bug (Something isn't working), r&d:polykey:core activity 4 (End to End Networking behind Consumer NAT Devices)

Comments

@joshuakarp
Contributor

Describe the bug

We've started to notice a particular warning popping up in various parts of the codebase that utilise the networking domain. Namely, the following warning:

WARN:ConnectionReverse 127.0.0.1:59070:Connection Error: Error [ERR_STREAM_WRITE_AFTER_END]: write after end

My initial thought is that this is potentially just a warning message that is being erroneously propagated up to the console output when a connection is torn down. But further investigation is required.

To Reproduce

We've noticed this occur in a few different areas:

  1. When running tests/network/index.test.ts:

    [nix-shell:~/Documents/MatrixAI/js-polykey]$ npm test -- tests/network/index.test.ts 
    
    > @matrixai/[email protected] test /home/josh/Documents/MatrixAI/js-polykey
    > jest "tests/network/index.test.ts"
    
    Determining test suites to run...
    GLOBAL SETUP
    WARN:ConnectionReverse 127.0.0.1:59070:Connection Error: Error [ERR_STREAM_WRITE_AFTER_END]: write after end
     PASS  tests/network/index.test.ts (8.163 s)
      network index
        ✓ integration of forward and reverse proxy (4751 ms)
    
    Test Suites: 1 passed, 1 total
    Tests:       1 passed, 1 total
    Snapshots:   0 total
    Time:        8.217 s
    Ran all test suites matching /tests\/network\/index.test.ts/i.
    GLOBAL TEARDOWN
    
  2. In the vaults clone/pull tests (see @scottmmorris's comment below for the jest setup that triggered it).

Expected behavior

No warning message to appear at all (either by resolving the underlying issue that's causing a stream to be written to after it has ended, or by stopping the warning message from propagating up).

Additional context

Notify maintainers

@CMCDragonkai

@joshuakarp joshuakarp added the bug Something isn't working label Nov 30, 2021
@CMCDragonkai
Member

Is this in master?

Any new changes affecting networking domain atm?

@CMCDragonkai
Member

@joshuakarp have a read of the connection compose logic and see where writes could occur.

@joshuakarp
Contributor Author

joshuakarp commented Nov 30, 2021

> Is this in master?
>
> Any new changes affecting networking domain atm?

Yeah, in master. This has been around for several months (as early as when the test was added back in May: https://gitlab.com/MatrixAI/Engineering/Polykey/js-polykey/-/merge_requests/178#note_583511389) - just created this issue to keep track of it.

@CMCDragonkai
Member

This is the sort of bug we can use our new VSCode debugger powers for. Maybe have a try at it soon @joshuakarp.

@joshuakarp
Contributor Author

Good idea - will take a bit of a look into it now.

@joshuakarp
Contributor Author

Figured I'd log the debugging process I took in case it assists anyone else too.

Firstly, I wanted to figure out exactly what the various parts of the debugging toolbar did:

[screenshot: the VSCode debugging toolbar]

From left to right:

  1. Continue: the normal "move to next breakpoint"
    [screenshot: execution paused at the breakpoint on line 136]
    Stops just before executing the line at the breakpoint (line 136 here). Pressing Continue completes the execution.
  2. Step Over: takes a single "step" in your code (i.e. executes the current line you've stopped at, then stops at the next line)
    [screenshot: before pressing "Step Over"]
    press "Step Over"
    [screenshot: after pressing "Step Over"]
  3. Step Into: same as Step Over, except if the current line is a function call, it will enter the called function and stop on its first line
  4. Step Out: if currently in a function call, this allows you to immediately jump to and stop wherever the function returns

@joshuakarp
Contributor Author

It's actually rather difficult to use these options to debug something that uses await. As expected, there's a fair bit of plumbing for supporting the async-await pattern, so using "Step Into" on fwdProxy.stop causes us to enter the plumbing functions:

[screenshot: stepping into the async-await plumbing]

then "Step Into":

[screenshot: another internal frame of the async-await plumbing]

You can mitigate this a bit by adding a breakpoint in the called async function itself though, e.g. adding a breakpoint within ForwardProxy.stop and then using "Continue" to skip over the async-await plumbing.
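As an aside, a debugger; statement gives the same effect as a clickable breakpoint: with the VSCode debugger attached, execution pauses at the statement directly, without stepping through the await-scheduling internals. A minimal standalone sketch (not from the codebase):

// Minimal sketch: `debugger;` acts as an inline breakpoint when a
// debugger is attached (and is a no-op otherwise).
async function stopExample(): Promise<void> {
  debugger; // execution pauses here under an attached debugger
  await new Promise<void>((resolve) => setTimeout(resolve, 1000));
  console.log('stopped');
}

void stopExample();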

@joshuakarp
Contributor Author

For this particular issue though, the warning message arises after closing the ConnectionForward, during the await sleep call in ForwardProxy.stop:

  public async stop(): Promise<void> {
    this.logger.info('Stopping Forward Proxy Server');
    const serverClose = promisify(this.server.close).bind(this.server);
    await serverClose();
    // Ensure no new connections are created while this is iterating
    await Promise.all(
      Array.from(this.connections.ingress, ([, conn]) => conn.stop()),
    );
    // Delay socket close by about 1 second
    // this gives some time for the end/FIN packets to be sent
    await sleep(1000);
    // Even when all connections are destroyed
    // the utp socket sometimes hangs in closing
    // here we asynchronously close and unreference it
    // in order to speed up the closing
    this.utpSocket.close();
    this.utpSocket.unref();
    this.logger.info('Stopped Forward Proxy Server');
  }

It's also a little tricky to debug this with the debugger, because the operations aren't sequential (there's networking going on under the hood). So if we pause on a breakpoint for too long, the error will often be a timeout error instead.

@CMCDragonkai
Member

CMCDragonkai commented Nov 30, 2021

Well it looks like you found it. This error can arise due to:

    // Delay socket close by about 1 second
    // this gives some time for the end/FIN packets to be sent
    await sleep(1000);
    // Even when all connections are destroyed
    // the utp socket sometimes hangs in closing
    // here we asynchronously close and unreference it
    // in order to speed up the closing
    this.utpSocket.close();

Which means the socket got closed during the sleep. Then when you call this.utpSocket.close(), it tries to send a FIN packet, and you cannot write a FIN packet when the socket is already closed.

At least this is what seems to be the case. One way to get around this is maybe checking if the socket is already closed, and if so don't bother calling close again.

I didn't expect the socket to be closed at that point though. Maybe something else is causing the socket to be closed. The sleep is only there to wait for all connections to stop, not necessarily for the "socket" to be closed.

If you check if the socket is closed prior, does the warning still occur?
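For illustration, something like this hypothetical guard could work, tracking the 'close' event ourselves since I'm not sure utp-native exposes a closed flag directly (a sketch, not tested):

import type { EventEmitter } from 'events';

// Hypothetical stand-in type for the utp-native socket: any
// EventEmitter with close() and unref().
type UtpLike = EventEmitter & { close(): void; unref(): void };

// Remember whether 'close' has already fired, and only call close()
// (which tries to send the FIN) if it hasn't.
function makeGuardedClose(utpSocket: UtpLike): () => void {
  let closed = false;
  utpSocket.once('close', () => {
    closed = true;
  });
  return () => {
    if (!closed) {
      utpSocket.close();
    }
    utpSocket.unref();
  };
}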

@CMCDragonkai
Member

It could also be that we have closed the UTP socket but the connection failed to stop in time. That's why there's a 1000 ms sleep before closing the socket. What happens if that 1000 becomes 2000?

@CMCDragonkai
Member

The next question is why the connection is taking so long to close. I had investigated this before but didn't find the reason, hence the comment I wrote there. This may just be an intermittent error. It may not affect normal operations...

@CMCDragonkai
Member

The real reason may be in the C++ codebase, which is not worth going into until we have swapped out UTP for something more robust like QUIC or WebRTC. #234

@scottmmorris
Contributor

This is the jest test that it occurred on for me:

  beforeAll(async () => {
    localKeynode = await testUtils.setupRemoteKeynode({
      logger: logger.getChild('Local Keynode'),
    });
    await localKeynode.start({
      password: 'password',
    });
    localKeynodeId = localKeynode.keyManager.getNodeId();

    remoteKeynode1 = await testUtils.setupRemoteKeynode({
      logger: logger.getChild('Remote Keynode 1'),
    });
    await remoteKeynode1.start({
      password: 'password',
    });
    remoteKeynode1Id = remoteKeynode1.keyManager.getNodeId();
    remoteKeynode2 = await testUtils.setupRemoteKeynode({
      logger: logger.getChild('Remote Keynode 2'),
    });
    await remoteKeynode2.start({
      password: 'password',
    });
    remoteKeynode2Id = remoteKeynode2.keyManager.getNodeId();
    await testUtils.addRemoteDetails(localKeynode, remoteKeynode2);
    await testUtils.addRemoteDetails(remoteKeynode1, remoteKeynode2);
    await testUtils.addRemoteDetails(remoteKeynode2, localKeynode);
    await testUtils.addRemoteDetails(remoteKeynode2, remoteKeynode1);

    await gestaltGraph.setNode({
      id: remoteKeynode1Id,
      chain: {},
    });
    await gestaltGraph.setNode({
      id: remoteKeynode2Id,
      chain: {},
    });
    await remoteKeynode1.gestaltGraph.setNode({
      id: localKeynodeId,
      chain: {},
    });
    await remoteKeynode1.gestaltGraph.setNode({
      id: remoteKeynode2Id,
      chain: {},
    });
    await remoteKeynode2.gestaltGraph.setNode({
      id: localKeynodeId,
      chain: {},
    });
    await remoteKeynode2.gestaltGraph.setNode({
      id: remoteKeynode1Id,
      chain: {},
    });
    remoteVault = await remoteKeynode1.vaultManager.createVault(vaultName);
    await remoteKeynode1.vaultManager.shareVault(
      remoteVault.vaultId,
      localKeynodeId,
    );
    await remoteKeynode1.vaultManager.shareVault(
      remoteVault.vaultId,
      remoteKeynode2Id,
    );
  });

  afterAll(async () => {
    await testUtils.cleanupRemoteKeynode(remoteKeynode2);
    await testUtils.cleanupRemoteKeynode(remoteKeynode1);
    await testUtils.cleanupRemoteKeynode(localKeynode);
  });
  test(
    'clone and pull from other cloned vaults',
    async () => {
      const clonedVaultRemote2 = await remoteKeynode2.vaultManager.cloneVault(
        remoteKeynode1Id,
        remoteVault.vaultId,
      );
      await localKeynode.acl.setNodePerm(remoteKeynode2Id, {
        gestalt: {
          notify: null,
        },
        vaults: {},
      });
      await remoteKeynode2.vaultManager.shareVault(
        clonedVaultRemote2.vaultId,
        localKeynodeId,
      );
    },
    global.defaultTimeout * 2,
  );
});

When running this in a normal test.ts file, the code hung instead of giving the warning. That may have been because I set it up incorrectly though.

@joshuakarp
Contributor Author

> Well it looks like you found it. This error can arise due to:
>
>     // Delay socket close by about 1 second
>     // this gives some time for the end/FIN packets to be sent
>     await sleep(1000);
>     // Even when all connections are destroyed
>     // the utp socket sometimes hangs in closing
>     // here we asynchronously close and unreference it
>     // in order to speed up the closing
>     this.utpSocket.close();
>
> Which means the socket got closed during the sleep. Then when you call this.utpSocket.close(), it tries to send a FIN packet, and you cannot write a FIN packet when the socket is already closed.
>
> At least this is what seems to be the case. One way to get around this is maybe checking if the socket is already closed, and if so don't bother calling close again.
>
> I didn't expect the socket to be closed at that point though. Maybe something else is causing the socket to be closed. The sleep is only there to wait for all connections to stop, not necessarily for the "socket" to be closed.
>
> If you check if the socket is closed prior, does the warning still occur?

It's actually occurring before we even reach the utpSocket.close - the warning is displayed before this call. If I add a breakpoint on a logger output:

  public async stop(): Promise<void> {
    this.logger.info('Stopping Forward Proxy Server');
    const serverClose = promisify(this.server.close).bind(this.server);
    await serverClose();
    // Ensure no new connections are created while this is iterating
    await Promise.all(
      Array.from(this.connections.ingress, ([, conn]) => conn.stop()),
    );
    // Delay socket close by about 1 second
    // this gives some time for the end/FIN packets to be sent
    await sleep(1000);
    // Even when all connections are destroyed
    // the utp socket sometimes hangs in closing
    // here we asynchronously close and unreference it
    // in order to speed up the closing
    this.logger.info('Closing the socket'); // breakpoint here
    this.utpSocket.close();
    this.utpSocket.unref();
    this.logger.info('Stopped Forward Proxy Server');
  }

We get the following output:

[nix-shell:~/Documents/MatrixAI/js-polykey]$ npm test -- tests/network/index.test.ts 

> @matrixai/[email protected] test /home/josh/Documents/MatrixAI/js-polykey
> jest "tests/network/index.test.ts"

Debugger listening on ws://127.0.0.1:42905/3e75d593-56b0-445e-85d5-7ceae175bed2
For help, see: https://nodejs.org/en/docs/inspector
Debugger attached.
Determining test suites to run...
GLOBAL SETUP
WARN:ConnectionReverse 127.0.0.1:39800:Connection Error: Error [ERR_STREAM_WRITE_AFTER_END]: write after end

> It could also be that we have closed the UTP socket but the connection failed to stop in time. That's why there's a 1000 ms sleep before closing the socket. What happens if that 1000 becomes 2000?

No change, still the same.

> The next question is why the connection is taking so long to close. I had investigated this before but didn't find the reason, hence the comment I wrote there. This may just be an intermittent error. It may not affect normal operations...

I haven't come across an instance of this warning that's caused an actual issue with the implementation.

@joshuakarp
Contributor Author

joshuakarp commented Dec 1, 2021

Interestingly, if I change the order of shutdown in tests/network/index.test.ts from:

    await client.destroy();
    await fwdProxy.stop();
    await revProxy.stop();
    await closeTestServer(server);
  });

to

    await client.destroy();
    await revProxy.stop();
    await fwdProxy.stop();
    await closeTestServer(server);
  });

we no longer get the warning.

@CMCDragonkai
Member

This may actually be a problem coming from utp-native: mafintosh/utp-native#41

So the reason the error is being emitted is because when the ConnectionReverse receives an end event, it then tries to end the connection from its own side.

In the ReverseProxy, the allowHalfOpen is set to false. This is the same for ForwardProxy.

When allowHalfOpen is false, it means that when an end event is received, your side is automatically ended as well; you don't need to explicitly end from your side. Explanation is here: https://nodejs.org/api/net.html#event-end

However, in our case, even when we set allowHalfOpen to true, this still doesn't work; it still throws this error.

Strangely it doesn't occur when we do the same for ConnectionForward.

Due to mafintosh/utp-native#41 I believe that there are some underlying issues in utp-native that are causing this.

For now, what we can do is simply remove the end call:

      // Propagate end, error, close and data
      tlsSocket.on('end', () => {
        if (utpConn.destroyed) {
          // The utp connection may already be destroyed
          tlsSocket.destroy();
        } else {
          // Prevent half open connections
          // tlsSocket.end();
        }
      });

The else block can be removed entirely.

We can do the same to ConnectionForward as well, and just write a comment in the allowHalfOpen area about this behaviour and that allowHalfOpen: true is not well tested by the underlying library.

Alternatively, if we can find a boolean condition for whether something can be ended or not, that would also work, and possibly be more robust if in the future we do enable allowHalfOpen.
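For example, guarding the responding end on the tls socket's own bookkeeping, reusing the handler above (an untested sketch; writableEnded is standard Node stream state):

      tlsSocket.on('end', () => {
        if (utpConn.destroyed) {
          // The utp connection may already be destroyed
          tlsSocket.destroy();
        } else if (!tlsSocket.writableEnded) {
          // writableEnded becomes true once end() has been called on
          // our side, so this only sends our FIN if we haven't already
          tlsSocket.end();
        }
      });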

@CMCDragonkai
Member

The above problem I believe will be solved by applying this solution #292 (comment) to the reverse proxy as well. I'm also changing allowHalfOpen to stay true, since utp-native seems to ignore this setting and defaults to it being true. Copied the comment here too:

The original comment here #292 (comment) and here #292 (comment) said that a hack was needed when ending the tls socket:

utpConn.end();
tlsSocket.destroy();

Well, yes, a hack is needed, but a better solution was found. Solutions exist both for when receiving the end and when initiating the end:

When you receive end, and you want to respond to end, you must:

    this.logger.debug('Receives tlsSocket ending');
    if (this.utpConn.destroyed) {
      this.logger.debug('Destroys tlsSocket');
      this.tlsSocket.destroy();
    } else {
      this.logger.debug('Responds tlsSocket ending');
      this.tlsSocket.end();
      this.tlsSocket.destroy();
    }

When you initiate the end, and you want to do it gracefully, you must destroy the socket whether or not it timed out:

    const { p: endP, resolveP: resolveEndP, } = promise<void>();
    socket.once('end', resolveEndP);
    socket.end();
    const timer = timerStart(timeout);
    await Promise.race([ endP, timer.timerP ]);
    socket.removeListener('end', resolveEndP);
    if (timer.timedOut) {
      socket.emit('error', new networkErrors.ErrorConnectionEndTimeout());
    } else {
      timerStop(timer);
    }
    // Must be destroyed if timed out
    // If not timed out, force destroy the socket due to buggy tlsSocket and utpConn
    socket.destroy();
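For reference, simplified sketches of the promise and timer helpers used above (js-polykey keeps its own versions in its utils module; these are illustrative only):

// Deconstructed promise: exposes resolve/reject outside the executor.
function promise<T = void>(): {
  p: Promise<T>;
  resolveP: (value: T) => void;
  rejectP: (reason?: any) => void;
} {
  let resolveP!: (value: T) => void;
  let rejectP!: (reason?: any) => void;
  const p = new Promise<T>((resolve, reject) => {
    resolveP = resolve;
    rejectP = reject;
  });
  return { p, resolveP, rejectP };
}

type Timer = {
  timerP: Promise<void>;
  timedOut: boolean;
  timeout: ReturnType<typeof setTimeout>;
};

// Starts a timer whose timerP resolves after delay milliseconds,
// setting timedOut just beforehand.
function timerStart(delay: number): Timer {
  const timer = {} as Timer;
  timer.timedOut = false;
  timer.timerP = new Promise<void>((resolve) => {
    timer.timeout = setTimeout(() => {
      timer.timedOut = true;
      resolve();
    }, delay);
  });
  return timer;
}

// Cancels a timer that has not yet fired.
function timerStop(timer: Timer): void {
  clearTimeout(timer.timeout);
}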

If the former is not done, the reverse side hangs on receiving graceful confirmation.

If the latter is not done, the forward side's close event is not emitted. It should be emitted by default, since this behaviour is consistent with standard net sockets, such as in this prototype:

import net from 'net';

async function main () {

  let socket!: net.Socket;
  const p = new Promise<void>((resolve) => {
    socket = net.connect(
      {
        port: 80,
        host: '142.250.66.206',
        allowHalfOpen: false
      },
      () => {
        resolve();
      }
    );
    socket.on('close', () => {
      console.log('CLOSE EVENT EMITTED');
    });
  });
  // Wait for the connection to be established before ending it
  await p;

  const p2 = new Promise<void>((resolve) => {
    socket.on('end', () => {
      resolve();
    });
  });
  socket.end();
  await p2;

  console.log('allow half open', socket.allowHalfOpen);
  console.log('ready state', socket.readyState);
  console.log('destroyed', socket.destroyed);
}

main();

And with these 2 applied to the ConnectionForward, we end up with these 2 tests passing:

PASS  tests/network/ForwardProxy.test.ts (11.195 s)
  ForwardProxy
    ✓ open connection success - forward initiates end (1086 ms)
    ✓ open connection success - reverse initiates end (1052 ms)

I'm also going to leave those debug logger statements, as they are quite useful.

@CMCDragonkai CMCDragonkai self-assigned this Jul 24, 2022
@CMCDragonkai CMCDragonkai added the r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices label Jul 24, 2022