
Re-merge go-algorand 3.1.3-beta #3230

Merged 23 commits into master from update-master-relbeta3.1.3 on Nov 18, 2021

Conversation

onetechnical
Contributor

go-algorand 3.1.3-beta re-merge

Algo-devops-service and others added 23 commits October 5, 2021 21:04
…cleci nightly testing. (#3016)

For CircleCI testing, increase machine size from medium to large and set parallelism to 4.
Our TravisCI keeps breaking for different reasons, so we decided to move our deploy to CircleCI.
We noticed our environment-variable passing was not working as intended, so we implemented a different way of passing it.
Fixing CircleCI deploy for Betanet and Stablenet (#3077)
## Summary

When a 2.1 (i.e. no txsync) client connects to a 3.0 relay (i.e. txsync), the relay needs to request the client to keep sending it TX messages; otherwise, these transactions would not get propagated.

The 2.1 and 3.0 above are network protocol versions, not algod release versions.
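
As a rough illustration of the behavior described above (not the actual go-algorand API; all names are illustrative), the relay can gate the request on the peer's negotiated network protocol version:

```go
// Minimal, self-contained sketch; identifiers are illustrative only.
package main

import "fmt"

// requestClassicTxGossip stands in for the relay asking a legacy peer to keep
// forwarding classic TX messages, since a 2.1 peer cannot participate in txnsync.
func requestClassicTxGossip(peer string) {
	fmt.Printf("asking %s to keep sending classic TX messages\n", peer)
}

func onPeerConnected(peer, negotiatedVersion string, weAreRelay bool) {
	if weAreRelay && negotiatedVersion == "2.1" {
		requestClassicTxGossip(peer)
	}
}

func main() {
	onPeerConnected("client-a", "2.1", true) // legacy client: ask for TX gossip
	onPeerConnected("client-b", "3.0", true) // txnsync-capable client: nothing to do
}
```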

## Test Plan

e2e test added.
…tion messages (#3102)

## Summary

This PR adds the missing bridge between txnsync and the classic transaction relaying:
when a transaction message arrives and is added to the transaction pool, we need to attempt to
relay it right away using the classic transaction messages. That allows relays to
be compatible with both 2.1 and 3.0 nodes.
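
A rough, self-contained sketch of that bridge (the type and function names here are illustrative, not the actual go-algorand identifiers):

```go
package main

import "fmt"

// txGroup stands in for a group of signed transactions received via txnsync.
type txGroup []byte

// txPool is a toy transaction pool.
type txPool struct{ accepted []txGroup }

func (p *txPool) remember(g txGroup) error {
	p.accepted = append(p.accepted, g)
	return nil
}

// relayClassic re-broadcasts the group over the pre-txnsync TX message, so that
// 2.1 peers still hear about transactions delivered via txnsync.
func relayClassic(g txGroup) {
	fmt.Printf("relaying %d bytes via classic TX message\n", len(g))
}

// onTxnSyncGroupReceived is the bridging step: once a txnsync-delivered group
// is accepted into the pool, relay it right away over the classic protocol.
func onTxnSyncGroupReceived(p *txPool, g txGroup) {
	if err := p.remember(g); err != nil {
		return // rejected by the pool; nothing to relay
	}
	relayClassic(g)
}

func main() {
	onTxnSyncGroupReceived(&txPool{}, txGroup("signed-transaction-bytes"))
}
```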

## Test Plan
e2e test was added.
## Summary

During fast catchup, we restart the transaction sync service very quickly.
This can cause a network message to be sent by the pre-restart instance while the response is delivered to the "restarted" txnsync.

Since we don't want to disconnect the network connection itself (which could have some messages enqueued), the transaction sync needs to store the "returned" channel before sending the message. That avoids the data race (and safely ignores the incoming message).
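
A minimal sketch of that pattern, assuming the response is delivered on a channel owned by the service (names are illustrative): capture the channel before sending, so a restart that replaces it cannot race with the in-flight response.

```go
package main

import "fmt"

type txnSyncService struct {
	responseCh chan string // replaced when the service restarts
}

// sendRequest captures the current response channel *before* sending, so the
// reply for this request is read from the captured channel even if a fast
// restart swaps responseCh underneath us; a stale reply is simply ignored.
func (s *txnSyncService) sendRequest(msg string) string {
	ch := s.responseCh
	go func() { ch <- "response to " + msg }() // simulated network round-trip
	return <-ch
}

func main() {
	s := &txnSyncService{responseCh: make(chan string, 1)}
	fmt.Println(s.sendRequest("transaction sync request"))
}
```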

## Test Plan

Use existing tests and confirm against them.
## Summary

This PR gives the node (both client and server) the ability to measure the time it takes to establish an outgoing connection (excluding the TCP connection time).
This duration is captured as the initial latency, which would subsequently be updated via ping-pong style logic.
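
A rough sketch of the measurement, under the assumption that the handshake happens right after the TCP dial (identifiers are illustrative, not the actual go-algorand code):

```go
package main

import (
	"fmt"
	"time"
)

// performHandshake stands in for the post-dial work (e.g. the HTTP/websocket
// upgrade) whose duration we want to attribute to latency.
func performHandshake() { time.Sleep(5 * time.Millisecond) }

// establishOutgoing times only the handshake; the TCP dial itself is assumed
// to have completed already and is deliberately excluded.
func establishOutgoing() time.Duration {
	start := time.Now()
	performHandshake()
	return time.Since(start)
}

func main() {
	initialLatency := establishOutgoing()
	fmt.Printf("initial latency estimate: %v (refined later via ping-pong)\n", initialLatency)
}
```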

## Test Plan

- [x] Extend existing unit tests
- [x] mainnet-model testing is needed as well to confirm correctness
## Summary

I noticed log messages like this:
```
{"file":"txHandler.go","function":"github.com/algorand/go-algorand/data.(*solicitedAsyncTxHandler).loop","level":"info","line":541,"msg":"solicitedAsyncTxHandler was unable to relay transaction message : %!v(MISSING)","time":"2021-10-20T23:18:40.220089Z"}
{"file":"txHandler.go","function":"github.com/algorand/go-algorand/data.(*solicitedAsyncTxHandler).loop","level":"info","line":541,"msg":"solicitedAsyncTxHandler was unable to relay transaction message : %!v(MISSING)","time":"2021-10-20T23:18:40.220226Z"}
{"file":"txHandler.go","function":"github.com/algorand/go-algorand/data.(*solicitedAsyncTxHandler).loop","level":"info","line":541,"msg":"solicitedAsyncTxHandler was unable to relay transaction message : %!v(MISSING)","time":"2021-10-20T23:18:40.220300Z"}
{"file":"txHandler.go","function":"github.com/algorand/go-algorand/data.(*solicitedAsyncTxHandler).loop","level":"info","line":541,"msg":"solicitedAsyncTxHandler was unable to relay transaction message : %!v(MISSING)","time":"2021-10-20T23:18:40.228731Z"}
{"file":"txHandler.go","function":"github.com/algorand/go-algorand/data.(*solicitedAsyncTxHandler).loop","level":"info","line":541,"msg":"solicitedAsyncTxHandler was unable to relay transaction message : %!v(MISSING)","time":"2021-10-20T23:18:40.228828Z"}
{"file":"txHandler.go","function":"github.com/algorand/go-algorand/data.(*solicitedAsyncTxHandler).loop","level":"info","line":541,"msg":"solicitedAsyncTxHandler was unable to relay transaction message : %!v(MISSING)","time":"2021-10-20T23:18:40.228893Z"}
{"file":"txHandler.go","function":"github.com/algorand/go-algorand/data.(*solicitedAsyncTxHandler).loop","level":"info","line":541,"msg":"solicitedAsyncTxHandler was unable to relay transaction message : %!v(MISSING)","time":"2021-10-20T23:18:40.228946Z"}
{"file":"txHandler.go","function":"github.com/algorand/go-algorand/data.(*solicitedAsyncTxHandler).loop","level":"info","line":541,"msg":"solicitedAsyncTxHandler was unable to relay transaction message : %!v(MISSING)","time":"2021-10-20T23:18:40.229012Z"}
{"file":"txHandler.go","function":"github.com/algorand/go-algorand/data.(*solicitedAsyncTxHandler).loop","level":"info","line":541,"msg":"solicitedAsyncTxHandler was unable to relay transaction message : %!v(MISSING)","time":"2021-10-20T23:18:40.229080Z"}
```
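
The `%!v(MISSING)` marker is what Go's fmt package prints when a format string contains a verb with no matching argument; a minimal reproduction (the error value is hypothetical):

```go
package main

import (
	"errors"
	"fmt"
)

func main() {
	err := errors.New("peer disconnected") // hypothetical error value

	// Buggy: the %v verb has no argument, so fmt prints "%!v(MISSING)",
	// matching the log lines above.
	fmt.Printf("unable to relay transaction message : %v\n")

	// Fixed: pass the value the verb refers to.
	fmt.Printf("unable to relay transaction message : %v\n", err)
}
```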

## Test Plan

No test changes.
## Summary

The previous websocket version was not flushing the write buffer when writing a Close control message.
As a result, we ended up disconnecting the connection correctly on one side, while leaving it in a zombie state on the other side.
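
A minimal, self-contained illustration (not the websocket library's actual code) of why an unflushed write buffer causes this: the Close frame sits in the buffer and never reaches the wire.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
)

func main() {
	var wire bytes.Buffer       // stands in for the underlying TCP connection
	w := bufio.NewWriter(&wire) // buffered writer used when framing messages

	w.Write([]byte("<close control frame>"))
	fmt.Printf("bytes on the wire before flush: %d\n", wire.Len()) // 0: the peer never sees the Close

	w.Flush() // the fix: flush when writing the Close control message
	fmt.Printf("bytes on the wire after flush:  %d\n", wire.Len())
}
```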

## Test Plan

Use existing tests. The `TestSlowPeerDisconnection` was already observing the issue being resolved.

## Summary


Improve the bandwidth estimation within the transaction sync by having the estimation account for latency, transaction compression time, and time spent waiting in the incoming queue.
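
In rough terms, the estimate divides the transferred bytes by the portion of the elapsed time actually spent transferring; a hedged sketch (not the actual go-algorand code):

```go
package main

import (
	"fmt"
	"time"
)

// estimateBandwidth returns bytes/second after subtracting the time spent on
// network latency, compressing the transactions, and waiting in the incoming
// queue, none of which should be attributed to the transfer itself.
func estimateBandwidth(bytes int, elapsed, latency, compression, queueWait time.Duration) float64 {
	transfer := elapsed - latency - compression - queueWait
	if transfer <= 0 {
		transfer = time.Millisecond // guard against degenerate measurements
	}
	return float64(bytes) / transfer.Seconds()
}

func main() {
	bw := estimateBandwidth(250_000, 300*time.Millisecond,
		80*time.Millisecond, 40*time.Millisecond, 30*time.Millisecond)
	fmt.Printf("estimated bandwidth: %.0f bytes/sec\n", bw)
}
```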

## Test Plan


Wrote unit tests for correctness, ran a network on the mainnet model, and observed the measured bandwidths. Before, the bandwidth would converge to the minimum over time and also showed erratic inaccuracies. Now the numbers look much more in range, at most a factor of 2 off.
## Summary

Running betanet, we've seen errors such as
```
unable to enqueue incoming message into peer incoming message backlog. disconnecting from peer.
```
as well as multiple
```
Peer 162.202.32.72:56543 disconnected: SlowConnection
```

The goal of this PR is to remove, from the pending incoming message queue, entries that are associated with network peers that have been disconnected (by the network package) or have been scheduled for disconnection (by the transaction sync package).

Removing these increases the "available" space in the incoming message queue and prevents redundant disconnections.
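
A minimal sketch of the pruning step (peer and message types are illustrative, not the actual go-algorand identifiers):

```go
package main

import "fmt"

type peerID int

type incomingMessage struct {
	from peerID
	data []byte
}

// pruneBacklog drops messages whose peer is disconnected or scheduled for
// disconnection, freeing space in the queue for peers that are still live.
func pruneBacklog(backlog []incomingMessage, gone map[peerID]bool) []incomingMessage {
	kept := backlog[:0]
	for _, m := range backlog {
		if !gone[m.from] {
			kept = append(kept, m)
		}
	}
	return kept
}

func main() {
	backlog := []incomingMessage{{1, []byte("a")}, {2, []byte("b")}, {1, []byte("c")}}
	backlog = pruneBacklog(backlog, map[peerID]bool{1: true})
	fmt.Printf("%d message(s) remain after pruning\n", len(backlog))
}
```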

## Test Plan

- [x] unit tests updated.
- [x] ran s1 network with and without load, successfully.
## Summary

The catchup was occasionally reporting
```
(1): fetchAndWrite(13932148): ledger write failed: block evaluation for round 13932148 requires sequential evaluation while the latest round is 13932148
```

This error indicates that the catchup was attempting to validate a block whose round is not latest+1: by the time the block is applied, the ledger has already advanced to (or past) that round.
In this case, we can safely ignore the error and skip applying this block, since the block was already added to the ledger.
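
A minimal sketch of the skip condition, under the assumption that the ledger exposes its latest round (names are illustrative):

```go
package main

import "fmt"

// alreadyInLedger reports whether a fetched block can be skipped: if its round
// is not newer than the ledger's latest round, it was already added.
func alreadyInLedger(blockRound, latestRound uint64) bool {
	return blockRound <= latestRound
}

func main() {
	// e.g. fetchAndWrite(13932148) while the ledger's latest round is already 13932148
	if alreadyInLedger(13932148, 13932148) {
		fmt.Println("block already in ledger; ignoring the error and skipping the write")
	}
}
```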

## Test Plan

Tested manually.
@algojohnlee merged commit 4634e98 into master on Nov 18, 2021
@algojohnlee deleted the update-master-relbeta3.1.3 branch on November 18, 2021 22:08
@egieseke mentioned this pull request on Nov 23, 2021