Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

transactions.py with nightly_protocol #4133

Closed
mikhailOK opened this issue Mar 17, 2021 · 7 comments · Fixed by #4135
Closed

transactions.py with nightly_protocol #4133

mikhailOK opened this issue Mar 17, 2021 · 7 comments · Fixed by #4135
Labels
A-python-test Area: issues related to python tests

Comments

@mikhailOK
Copy link
Contributor

pytest sanity/transactions.py --features nightly_protocol --features nightly_protocol_features

http://nayduck.eastus.cloudapp.azure.com:3000/#/test/117583

Traceback (most recent call last):
  File "tests/sanity/transactions.py", line 81, in <module>
    sent_height, height)
AssertionError: Balances before: [999999999999999999999999999999773, 999999999999999999999999999999786, 1000000000000000000000000000000876, 999999999999999999999999999999429, 1000000000000000000000000000000136]
Expected balances: [999999999999999999999999999999245, 999999999999999999999999999999057, 1000000000000000000000000000001438, 999999999999999999999999999999813, 1000000000000000000000000000000447]
Current balances: [1000000000000000000000000000000187, 999999999999999999999999999999786, 1000000000000000000000000000000876, 999999999999999999999999999999439, 999999999999999999999999999999712]
Sent at height: 20, cur height: 31
@mikhailOK mikhailOK added the A-python-test Area: issues related to python tests label Mar 17, 2021
@mikhailOK
Copy link
Contributor Author

Passes with disabled protocol_feature_forward_chunk_parts:
http://nayduck.eastus.cloudapp.azure.com:3000/#/run/1367

@mikhailOK
Copy link
Contributor Author

mikhailOK commented Mar 18, 2021

Investigating this test.

  1. The test uses status = nodes[3].get_status() and nodes[4].send_tx(tx)
    Sometimes it makes it fail right away like here: http://nayduck.eastus.cloudapp.azure.com:3000/#/test/118402 because of Invalid tx: expired or from a different fork. Getting status from node 4 fixes it
  2. status = nodes[4].get_status() suddenly breaks the test with nightly_protocol disabled. Example failure: http://nayduck.eastus.cloudapp.azure.com:3000/#/test/118417 , test2 gets stuck in catchup. No idea how a test change could cause the node to get stuck, will send more runs to investigate.
  3. Setting a low state_sync_timeout fixes nodes getting stuck in catchup.
  4. Changing chunk forwarding from get_epoch_block_producers_ordered to get_epoch_block_approvers_ordered by itself changes 10/10 fail to 2/10 pass (with nightly_protocol).
  5. Now the test with nightly_protocol fails because the observer node often gets a lot of orphans with missing chunks, example: http://nayduck.eastus.cloudapp.azure.com:3000/#/test/118432 - test4 goes up to 10 orphans. It seems like observer nodes struggle to get chunks. @birchmd - any idea what could be happening?

@mikhailOK
Copy link
Contributor Author

For 5, the problem seems to be in this logic:

            let old_block = header_head.last_block_hash != parent_hash
                && header_head.prev_block_hash != parent_hash;
...
            if !is_chunk_forwarding_enabled || fetch_from_archival || old_block {
                let request_result = self.request_partial_encoded_chunk(

Log from the node:

�[2mMar 18 01:23:15.446�[0m �[34mDEBUG�[0m chain: Process block 2gs4KKJGs89V7qBnFc8U9AUz6hpAkFG5Ga4rn4MtZAAd at 67, approvals: 3, me: Some("test4")
�[2mMar 18 01:23:15.448�[0m �[34mDEBUG�[0m chain: Process block: orphan: `2gs4KKJGs89V7qBnFc8U9AUz6hpAkFG5Ga4rn4MtZAAd`, # orphans 10
�[2mMar 18 01:23:15.450�[0m �[34mDEBUG�[0m client: Some("test4") Received block 2gs4KKJGs89V7qBnFc8U9AUz6hpAkFG5Ga4rn4MtZAAd <- 2e65xvAkgoF1i5jFdQwRWka4CQg6f9NDQHe3RLRm8WA9 at 67 from ed25519:FXXrTXiKWpXj1R6r5fBvMLpstd8gPyrBq3qMByqKVzKF, requested: false    
�[2mMar 18 01:23:15.450�[0m �[34mDEBUG�[0m chain: Process block header: 2gs4KKJGs89V7qBnFc8U9AUz6hpAkFG5Ga4rn4MtZAAd at 67
�[2mMar 18 01:23:15.451�[0m �[34mDEBUG�[0m client: Some("test4") Received block 2gs4KKJGs89V7qBnFc8U9AUz6hpAkFG5Ga4rn4MtZAAd <- 2e65xvAkgoF1i5jFdQwRWka4CQg6f9NDQHe3RLRm8WA9 at 67 from ed25519:5BVKQ5zZKtEUM3AHgj6KAerTpuh3CVZHtAbmVfxLbLhc, requested: false    
�[2mMar 18 01:23:15.451�[0m �[34mDEBUG�[0m chain: Process block header: 2gs4KKJGs89V7qBnFc8U9AUz6hpAkFG5Ga4rn4MtZAAd at 67
�[2mMar 18 01:23:15.452�[0m �[34mDEBUG�[0m client: Some("test4") Received block 2gs4KKJGs89V7qBnFc8U9AUz6hpAkFG5Ga4rn4MtZAAd <- 2e65xvAkgoF1i5jFdQwRWka4CQg6f9NDQHe3RLRm8WA9 at 67 from ed25519:Eo9W44tRMwcYcoua11yM7Xfr1DjgR4EWQFM3RU27MEX8, requested: false    
�[2mMar 18 01:23:15.452�[0m �[34mDEBUG�[0m chain: Process block header: 2gs4KKJGs89V7qBnFc8U9AUz6hpAkFG5Ga4rn4MtZAAd at 67
�[2mMar 18 01:23:15.457�[0m �[34mDEBUG�[0m chunks: Reconstructed and decoded chunk 4qiehQfuGDzeAnxCumncvYj1GHtYUAMZwxAp2MzkBH2y, encoded length was 8, num txs: 0, I'm Some("test4")    
�[2mMar 18 01:23:15.463�[0m �[34mDEBUG�[0m chunks: Reconstructed and decoded chunk BmVtgqef4hkbmjjXCE8rokmmBbbnRc2xhu4EKgWSRnHB, encoded length was 8, num txs: 0, I'm Some("test4")    
�[2mMar 18 01:23:15.467�[0m �[34mDEBUG�[0m chunks: Reconstructed and decoded chunk 4Pid7BG2KPQsfxJ474Brn6bQRi3HqicUw2Xyxj5tK593, encoded length was 8, num txs: 0, I'm Some("test4")    
�[2mMar 18 01:23:15.471�[0m �[34mDEBUG�[0m chunks: Reconstructed and decoded chunk D64uhaLHk25bgk7tXSqHj6iHuFj7GLVoJv3fv6EjSwWA, encoded length was 8, num txs: 0, I'm Some("test4")    
�[2mMar 18 01:23:15.473�[0m �[34mDEBUG�[0m chain: Process block ATiYcjgowN4CxfWhmAEQx4qQGRKXLDQazcPBgossHZaE at 57, approvals: 3, me: Some("test4")
�[2mMar 18 01:23:15.473�[0m �[34mDEBUG�[0m chain: Some("test4") Process block ATiYcjgowN4CxfWhmAEQx4qQGRKXLDQazcPBgossHZaE, is_caught_up: true, need_to_start_fetching_state: false
�[2mMar 18 01:23:15.480�[0m �[34mDEBUG�[0m chain: Header head updated to ATiYcjgowN4CxfWhmAEQx4qQGRKXLDQazcPBgossHZaE at 57

old_block is always false because header head doesn't update until we get chunks.
@SkidanovAlex @bowenwang1996 is header head supposed to update on blocks with missing chunks?

@mikhailOK
Copy link
Contributor Author

mikhailOK commented Mar 18, 2021

For 2, here is a commit: 28326d6 and a test run.
30/30 passed to 11/30 passed with nodes failing to catchup state sync

fixed by disabling storage consistency checking

@bowenwang1996
Copy link
Collaborator

is header head supposed to update on blocks with missing chunks?

Yes, as long as you have the headers. Also note that this logic should not matter because the request should be resent in resend_chunk_requests anyways

@mikhailOK
Copy link
Contributor Author

Yes, as long as you have the headers.

Then there is a bug, because header head is 57 in the snippet

mikhailOK added a commit that referenced this issue Mar 18, 2021
Use get_epoch_block_approvers_ordered instead of get_epoch_block_producers_ordered.
Make some changes to the test: query the observer node, disable storage
checking and set a low state sync timeout.
Fixes #4133

Test plan
---------
nayduck
mikhailOK added a commit that referenced this issue Mar 18, 2021
Some changes to the test: query the observer node, disable storage
checking and set a low state sync timeout.
Fixes #4133

Test plan
---------
nayduck
@birchmd
Copy link
Contributor

birchmd commented Mar 18, 2021

It seems like observer nodes struggle to get chunks.

Chunks are not forwarded to observer nodes, so observers need to send chunk requests, which is naturally slower than getting forwarded parts (this indeed was the whole point of chunk forwarding -- to make chunk distribution to validators faster).

I've noticed this before as well with sharded performance testing. I think in practice it is fine to have RPC nodes be a couple blocks behind validator nodes, though it may cause problems for tests, yes.

I think there are other ways we could try to optimize chunk distribution for RPC nodes as well (e.g. nodes could inspect routed messages and choose to process ones which are relevant to shards it tracks in addition to forwarding the message along; or if peers knew what shards each other tracked then we could expand the chunk forwarding logic). But I don't think we should implement those unless there are problems in a real network (e.g. testnet).

mikhailOK added a commit that referenced this issue Mar 18, 2021
Some changes to the test: query the observer node, disable storage
checking and set a low state sync timeout.
Fixes #4133

Test plan
---------
nayduck
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-python-test Area: issues related to python tests
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants