Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci: merge staging to master #474

Merged
merged 333 commits into from
Aug 3, 2023
Merged

ci: merge staging to master #474

merged 333 commits into from
Aug 3, 2023

Conversation

MatrixAI-Bot
Copy link
Member

@MatrixAI-Bot MatrixAI-Bot commented Oct 5, 2022

This is an automatic PR generated by the pipeline CI/CD. This will be automatically fast-forward merged if successful.

Tasks

  1. - Enable automated testnet connection tests
  2. [ ] - Investigate random crashes - unnecessary since CLI tests will go to Polykey-CLI
  3. - Testnet nodes should never crash regardless of what connecting nodes are doing

@MatrixAI-Bot
Copy link
Member Author

Pipeline Attempt on 658670966 for 06df14a

https://gitlab.com/MatrixAI/open-source/Polykey/-/pipelines/658670966

@MatrixAI-Bot MatrixAI-Bot self-assigned this Oct 5, 2022
@ghost
Copy link

ghost commented Oct 5, 2022

👇 Click on the image for a new way to code review

Review these changes using an interactive CodeSee Map

Legend

CodeSee Map legend

@CMCDragonkai
Copy link
Member

@tegefaulkes the Nat tests failed here. There's a failure that is not just a timeout.

@tegefaulkes
Copy link
Contributor

Looking at it now. The initial problem here is that the pings are failing after the default timeout of 20 seconds.

  console.time
    ping1: 22561 ms

      at Object.<anonymous> (tests/nat/DMZ.test.ts:213:15)

  console.log
    {"success":false,"message":"No response received"}

      at Object.<anonymous> (tests/nat/DMZ.test.ts:214:15)

  console.log
    {"type":"ErrorCLINodePingFailed","data":{"message":"No response received","timestamp":"2022-10-06T02:14:59.406Z","data":{},"stack":"ErrorCLINodePingFailed: No response received\n    at /home/faulkes/matrixcode/polykey/js-polykey/src/bin/nodes/CommandPing.ts:72:19\n    at processTicksAndRejections (node:internal/process/task_queues:96:5)\n    at async CommandPing.<anonymous> (/home/faulkes/matrixcode/polykey/js-polykey/src/bin/CommandPolykey.ts:81:7)\n    at async CommandPolykey.parseAsync (/home/faulkes/matrixcode/polykey/js-polykey/node_modules/commander/lib/command.js:923:5)\n    at async main (/home/faulkes/matrixcode/polykey/js-polykey/src/bin/polykey.ts:54:5)","description":"Node was not online or not found.","exitCode":1}}

      at Object.<anonymous> (tests/nat/DMZ.test.ts:215:15)

So it's not related to the bug from yesterday. I need to look at the bin command closer to know why the ping is failing. But right now it's consistent with failing to connect.

@tegefaulkes
Copy link
Contributor

It's odd, the core problem is that the connection is timing out with UTP_ETIMEDOUT. So it's UTP timing out here after 20 seconds. I'm not sure what change caused this, I need to review the the last merge for that.

Another odd thing is that setting a shorter timeout of '10000' doesn't change the outcome, I'd expect it to abort with a different timeout error but it doesn't. So there is a bug in the code here somewhere. But that relates to the nodes domain cancellability i'm currently working on so it should be addressed there.

@tegefaulkes
Copy link
Contributor

The only changes was updating the network domain with cancellability. The network and nodes tests are passing so the connection logic is still working, including pingNode. So why is it only failing in the NAT testing? Is it the network namespacing? because we're using alternate IP addresses? Because it's in a separate process?

@CMCDragonkai
Copy link
Member

Does the test work locally? Maybe you do need to fix the bug in the nodes domain first.

@tegefaulkes
Copy link
Contributor

tegefaulkes commented Oct 6, 2022

The tests are failing locally. the problem doesn't seem to be nodes domain specifically. So far as I can tell the ForwardConnection is just failing to connect and timing out, but only in the NAT tests. The problem does overlap enough that I'll need to fix it for the nodes PR. I may encounter a fix as part of that work.

I'll come back to this, I need to think on it for a little bit.

@MatrixAI-Bot
Copy link
Member Author

Pipeline Attempt on 660898712 for e428206

https://gitlab.com/MatrixAI/open-source/Polykey/-/pipelines/660898712

@tegefaulkes tegefaulkes mentioned this pull request Oct 10, 2022
3 tasks
@MatrixAI-Bot
Copy link
Member Author

Pipeline Attempt on 669366315 for d0ffca0

https://gitlab.com/MatrixAI/open-source/Polykey/-/pipelines/669366315

@MatrixAI-Bot
Copy link
Member Author

Pipeline Attempt on 675968580 for 0421f73

https://gitlab.com/MatrixAI/open-source/Polykey/-/pipelines/675968580

@MatrixAI-Bot
Copy link
Member Author

Pipeline Attempt on 677224551 for b381e29

https://gitlab.com/MatrixAI/open-source/Polykey/-/pipelines/677224551

@MatrixAI-Bot
Copy link
Member Author

Pipeline Attempt on 678321421 for bb9f39b

https://gitlab.com/MatrixAI/open-source/Polykey/-/pipelines/678321421

@MatrixAI-Bot
Copy link
Member Author

Pipeline Attempt on 678384368 for 04737bc

https://gitlab.com/MatrixAI/open-source/Polykey/-/pipelines/678384368

@CMCDragonkai
Copy link
Member

Make sure any logs left over should be following the rest of the code in their logging structure.

@MatrixAI-Bot
Copy link
Member Author

Pipeline Attempt on 688345460 for 1890801

https://gitlab.com/MatrixAI/open-source/Polykey/-/pipelines/688345460

@tegefaulkes
Copy link
Contributor

There seems to be a bug causing the seed nodes to crash when receiving certain connections.

{
    "type": "TypeError",
    "data": {
        "message": "Cannot read properties of null (reading 'finishShutdown')",
        "stack": "TypeError: Cannot read properties of null (reading 'finishShutdown')\n    at JSStreamSocket.finishShutdown (node:internal/js_stream_socket:160:12)\n    at node:internal/js_stream_socket:147:14\n    at processTicksAndRejections (node:internal/process/task_queues:78:11)\n    at runNextTicks (node:internal/process/task_queues:65:3)\n    at processImmediate (node:internal/timers:437:9)"
    }
}

It is internal to some node implementation. It may be related to this issue. nodejs/node#35695.

For now I'm going to disable the testnet connection tests due to the seed nodes instability.

@MatrixAI-Bot
Copy link
Member Author

Pipeline Attempt on 689345933 for 2e972e7

https://gitlab.com/MatrixAI/open-source/Polykey/-/pipelines/689345933

@MatrixAI-Bot
Copy link
Member Author

Pipeline Attempt on 689348586 for 6a18b6b

https://gitlab.com/MatrixAI/open-source/Polykey/-/pipelines/689348586

@CMCDragonkai
Copy link
Member

There seems to be a bug causing the seed nodes to crash when receiving certain connections.

{
    "type": "TypeError",
    "data": {
        "message": "Cannot read properties of null (reading 'finishShutdown')",
        "stack": "TypeError: Cannot read properties of null (reading 'finishShutdown')\n    at JSStreamSocket.finishShutdown (node:internal/js_stream_socket:160:12)\n    at node:internal/js_stream_socket:147:14\n    at processTicksAndRejections (node:internal/process/task_queues:78:11)\n    at runNextTicks (node:internal/process/task_queues:65:3)\n    at processImmediate (node:internal/timers:437:9)"
    }
}

It is internal to some node implementation. It may be related to this issue. nodejs/node#35695.

For now I'm going to disable the testnet connection tests due to the seed nodes instability.

I think we saw that exception before? Is this is what is causing the random failures? Or is the random failures coming elsewhere?

Are you sure, it's HTTP2? And not the UDP, uTP or elsewhere?

@CMCDragonkai
Copy link
Member

Also we should try to replicate that, and use rr to record the execution.

@tegefaulkes
Copy link
Contributor

Yeah, I think we've seen this one before. I'm not sure exactly where it's happening but I found that issue after a quick search.

Triggering this bug might be a little tricky. Right now it seems to be caused when I connect to a seed node with the testnetConnection.test.ts test.

@MatrixAI-Bot
Copy link
Member Author

Pipeline Attempt on 689376099 for ab8d759

https://gitlab.com/MatrixAI/open-source/Polykey/-/pipelines/689376099

@CMCDragonkai
Copy link
Member

In order to make it easier for us to see if our manual test is succeeding, we need a pk nodes connections command that lists out all the active node connections. This means we don't have to keep looking over the logs to see if we managed to connect to the testnet.

If this succeeds and we don't crash the testnet, then testnet connection tests should succeed automatically too. We just need to ensure we are doing the same thing.

However our networking system is getting too complex and it is resulting in hard to figure out bugs. We may need to start on a networking refactoring soon after we merge in the feature-crypto.

@CMCDragonkai
Copy link
Member

We can also add a pk network connections which shows active proxy connections. This can be compared to pk nodes connections.

@CMCDragonkai
Copy link
Member

We'll look into network refactoring after feature-crypto (this should eliminate the situations where we have random failures). That will also need to fix MatrixAI/js-mdns#1.

Also try to rule out this https://aws.amazon.com/premiumsupport/knowledge-center/ecs-resolve-outofmemory-errors/ happening. It seems it never gets past 50% memory usage in the service.

Remove the old polykey-testnet service and update the cloudwatch dashboard accordingly. We are now using 1 service per NodeId.

Turns out it wasn't needed. I ran all the tests and there were no errors relating to it.

[ci skip]
`js-quic` integration and Agent migration
@MatrixAI-Bot
Copy link
Member Author

Pipeline Attempt on 951440760 for fc82a3d

https://gitlab.com/MatrixAI/open-source/Polykey/-/pipelines/951440760

@CMCDragonkai
Copy link
Member

Once #534 is merged, this is expected to pass and merge into master. However CI deployment to testnet and mainnet will be moved to Polykey-CLI repository.

@MatrixAI-Bot
Copy link
Member Author

Pipeline Attempt on 954294586 for f7708f6

https://gitlab.com/MatrixAI/open-source/Polykey/-/pipelines/954294586

@MatrixAI-Bot
Copy link
Member Author

Pipeline Succeeded on 954294586 for f7708f6

https://gitlab.com/MatrixAI/open-source/Polykey/-/pipelines/954294586

@MatrixAI-Bot MatrixAI-Bot merged commit fc82a3d into master Aug 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants