Skip to content
This repository has been archived by the owner on Nov 14, 2023. It is now read-only.

cannot restart validator: not connecting to peers #26

Closed
dckc opened this issue Aug 24, 2021 · 10 comments
Closed

cannot restart validator: not connecting to peers #26

dckc opened this issue Aug 24, 2021 · 10 comments
Assignees
Labels
testnet-problem issues / problems surfaced during testnet

Comments

@dckc
Copy link
Member

dckc commented Aug 24, 2021

Describe the bug
After tweaking commit_timeout in config.toml, chainflow is getting errors when trying to restart.

Additional context

logs: https://hackmd.io/@KFEZk8oMTz6vBlwADz0M4A/S1XlYT-ZK

esp this bit about block 55575:

Aug 24 03:17:06 cflow-agoric-tn-01 ag-chain-cosmos[2184852]: 2021-08-24T01:17:06.891Z block-manager: block 55575 begin
Aug 24 03:17:26 cflow-agoric-tn-01 ag-chain-cosmos[2184852]: 2021-08-24T01:17:26.898Z block-manager: block 55575 commit

discord chat: https://discord.com/channels/585576150827532298/819073555446759444/879535698984714290

cc @michaelfig @katelynsills @warner

@dckc dckc self-assigned this Aug 24, 2021
@katelynsills
Copy link

This looks like a "swap" on the AMM was being performed. Do we know what was being traded?

@warner
Copy link
Member

warner commented Aug 24, 2021

I extracted the cranks from block 55576, which appears to include the "rights were not conserved" rejection. Here's a map of vatIDs on this chain:

  v1:   xs-worker        583  bank
  v2:   xs-worker        718  board
  v3:   xs-worker         60  distributeFees
  v4:   xs-worker        395  ibc
  v5:   xs-worker       1416  mints
  v6:   xs-worker        403  network
  v7:   xs-worker        562  priceAuthority
  v8:   xs-worker        283  provisioning
  v9:   xs-worker          1  sharing
 v10:   xs-worker       2072  zoe
 v11:   xs-worker       1230  bootstrap
 v12:   xs-worker         61  vatAdmin
 v13:       local       5879  comms
 v14:   xs-worker       7252  vattp
 v15:   xs-worker         79  timer
 v16:   xs-worker        751  packages/treasury/src/stablecoinMachine.js
 v17:   xs-worker          3  packages/pegasus/src/pegasus.js
 v18:   xs-worker        226  packages/zoe/src/contracts/multipoolAutoswap/multipoolAutoswap.js
 v19:   xs-worker          2  packages/treasury/src/liquidateMinimum.js
 v20:   xs-worker          2  packages/treasury/src/liquidateMinimum.js
 v21:   xs-worker          2  packages/treasury/src/liquidateMinimum.js
 v22:   xs-worker          2  packages/treasury/src/liquidateMinimum.js
 v23:   xs-worker          2  packages/treasury/src/liquidateMinimum.js

and here are the deliveries in block 55576:

["v15",["message","ko38",{"method":"getCurrentTimestamp","args":{"body":"[]","slots":[]},"result":"kp488548"}]]
["v10",["notify",[["kp488543",{"state":"fulfilled","refCount":1,"data":{"body":"{\"@qclass\":\"slot\",\"iface\":\"Alleged: BLD payment\",\"index\":0}","slots":["ko105230"]}}]]]]
["v10",["notify",[["kp488544",{"state":"fulfilled","refCount":1,"data":{"body":"{\"brand\":{\"@qclass\":\"slot\",\"iface\":\"Alleged: BLD brand\",\"index\":0},\"value\":{\"@qclass\":\"bigint\",\"digits\":\"51053409\"}}","slots":["ko116"]}}]]]]
["v16",["notify",[["kp488545",{"state":"fulfilled","refCount":1,"data":{"body":"{\"@qclass\":\"undefined\"}","slots":[]}}]]]]
["v16",["notify",[["kp488546",{"state":"fulfilled","refCount":1,"data":{"body":"{\"updateCount\":{\"@qclass\":\"undefined\"},\"value\":{\"@qclass\":\"undefined\"}}","slots":[]}}]]]]
["v16",["notify",[["kp488547",{"state":"fulfilled","refCount":1,"data":{"body":"{\"updateCount\":{\"@qclass\":\"undefined\"},\"value\":{\"@qclass\":\"undefined\"}}","slots":[]}}]]]]
["v13",["notify",[["kp488516",{"state":"fulfilled","refCount":1,"data":{"body":"{\"@qclass\":\"slot\",\"iface\":\"Alleged: BLD payment\",\"index\":0}","slots":["ko105229"]}}]]]]
["v18",["notify",[["kp488548",{"state":"fulfilled","refCount":1,"data":{"body":"{\"@qclass\":\"bigint\",\"digits\":\"1629767527\"}","slots":[]}}]]]]
["v18",["message","ko90",{"method":"handleOffer","args":{"body":"[{\"@qclass\":\"slot\",\"iface\":\"Alleged: InvitationHandle\",\"index\":0},{\"@qclass\":\"slot\",\"iface\":\"Alleged: zoeSeatAdmin\",\"index\":1},{\"initialAllocation\":{\"In\":{\"brand\":{\"@qclass\":\"slot\",\"iface\":\"Alleged: BLD brand\",\"index\":2},\"value\":{\"@qclass\":\"bigint\",\"digits\":\"51053409\"}},\"Out\":{\"brand\":{\"@qclass\":\"slot\",\"iface\":\"Alleged: RUN brand\",\"index\":3},\"value\":{\"@qclass\":\"bigint\",\"digits\":\"0\"}}},\"notifier\":{\"@qclass\":\"slot\",\"iface\":\"Alleged: notifier\",\"index\":4},\"offerArgs\":{\"@qclass\":\"undefined\"},\"proposal\":{\"exit\":{\"onDemand\":null},\"give\":{\"In\":{\"brand\":{\"@qclass\":\"slot\",\"index\":2},\"value\":{\"@qclass\":\"bigint\",\"digits\":\"51053409\"}}},\"want\":{\"Out\":{\"brand\":{\"@qclass\":\"slot\",\"index\":3},\"value\":{\"@qclass\":\"bigint\",\"digits\":\"0\"}}}},\"seatHandle\":{\"@qclass\":\"slot\",\"iface\":\"Alleged: SeatHandleHandle\",\"index\":5}}]","slots":["ko105213","ko105231","ko116","ko71","ko105232","ko105233"]},"result":"kp488549"}]]
["v13",["notify",[["kp488520",{"state":"fulfilled","refCount":3,"data":{"body":"{\"@qclass\":\"slot\",\"iface\":\"Alleged: userSeat\",\"index\":0}","slots":["ko105234"]}}]]]]
["v10",["message","ko105234",{"method":"getPayout","args":{"body":"[\"In\"]","slots":[]},"result":"kp488521"}]]
["v10",["message","ko105234",{"method":"getPayout","args":{"body":"[\"Out\"]","slots":[]},"result":"kp488522"}]]
["v7",["notify",[["kp488542",{"state":"fulfilled","refCount":1,"data":{"body":"{\"quoteAmount\":{\"brand\":{\"@qclass\":\"slot\",\"iface\":\"Alleged: Quote brand\",\"index\":0},\"value\":[{\"amountIn\":{\"brand\":{\"@qclass\":\"slot\",\"iface\":\"Alleged: BLD brand\",\"index\":1},\"value\":{\"@qclass\":\"bigint\",\"digits\":\"0\"}},\"amountOut\":{\"brand\":{\"@qclass\":\"slot\",\"iface\":\"Alleged: RUN brand\",\"index\":2},\"value\":{\"@qclass\":\"bigint\",\"digits\":\"0\"}},\"timer\":{\"@qclass\":\"slot\",\"iface\":\"Alleged: timerService\",\"index\":3},\"timestamp\":{\"@qclass\":\"bigint\",\"digits\":\"1629767527\"}}]},\"quotePayment\":{\"@qclass\":\"slot\",\"iface\":\"Alleged: Quote payment\",\"index\":4}}","slots":["ko105235","ko116","ko71","ko38","ko105236"]}}]]]]
["v10",["notify",[["kp488549",{"state":"fulfilled","refCount":1,"data":{"body":"{\"exitObj\":{\"@qclass\":\"slot\",\"iface\":\"Alleged: exitObj\",\"index\":0},\"offerResultP\":{\"@qclass\":\"slot\",\"index\":1}}","slots":["ko105237","kp488550"]}}],["kp488550",{"state":"rejected","refCount":1,"data":{"body":"{\"@qclass\":\"error\",\"errorId\":\"error:liveSlots:v18#70090\",\"message\":\"rights were not conserved for brand (an object)\",\"name\":\"Error\"}","slots":[]}}]]]]
["v10",["message","ko105231",{"method":"fail","args":{"body":"[{\"@qclass\":\"error\",\"errorId\":\"error:liveSlots:v18#70089\",\"message\":\"rights were not conserved for brand (an object)\",\"name\":\"Error\"}]","slots":[]},"result":"kp488551"}]]
["v16",["notify",[["kp488538",{"state":"fulfilled","refCount":1,"data":{"body":"{\"quoteAmount\":{\"brand\":{\"@qclass\":\"slot\",\"iface\":\"Alleged: Quote brand\",\"index\":0},\"value\":[{\"amountIn\":{\"brand\":{\"@qclass\":\"slot\",\"iface\":\"Alleged: BLD brand\",\"index\":1},\"value\":{\"@qclass\":\"bigint\",\"digits\":\"0\"}},\"amountOut\":{\"brand\":{\"@qclass\":\"slot\",\"iface\":\"Alleged: RUN brand\",\"index\":2},\"value\":{\"@qclass\":\"bigint\",\"digits\":\"0\"}},\"timer\":{\"@qclass\":\"slot\",\"iface\":\"Alleged: timerService\",\"index\":3},\"timestamp\":{\"@qclass\":\"bigint\",\"digits\":\"1629767527\"}}]},\"quotePayment\":{\"@qclass\":\"slot\",\"iface\":\"Alleged: Quote payment\",\"index\":4}}","slots":["ko105235","ko116","ko71","ko38","ko105236"]}}]]]]
["v5",["message","ko131",{"method":"withdraw","args":{"body":"[{\"brand\":{\"@qclass\":\"slot\",\"iface\":\"Alleged: BLD brand\",\"index\":0},\"value\":{\"@qclass\":\"bigint\",\"digits\":\"51053409\"}}]","slots":["ko116"]},"result":"kp488552"}]]
["v18",["notify",[["kp488551",{"state":"fulfilled","refCount":1,"data":{"body":"{\"@qclass\":\"undefined\"}","slots":[]}}]]]]
["v13",["notify",[["kp488522",{"state":"fulfilled","refCount":1,"data":{"body":"{\"@qclass\":\"slot\",\"iface\":\"Alleged: RUN payment\",\"index\":0}","slots":["ko105238"]}}]]]]
["v10",["notify",[["kp488552",{"state":"fulfilled","refCount":1,"data":{"body":"{\"@qclass\":\"slot\",\"iface\":\"Alleged: BLD payment\",\"index\":0}","slots":["ko105239"]}}]]]]
["v13",["notify",[["kp488521",{"state":"fulfilled","refCount":1,"data":{"body":"{\"@qclass\":\"slot\",\"iface\":\"Alleged: BLD payment\",\"index\":0}","slots":["ko105239"]}}]]]]

(that may be slightly easier to read if you paste it into a file, then pipe it into jq or jq -c, which will colorized the property names)

@michaelfig
Copy link
Member

After tweaking commit_timeout in config.toml, chainflow is getting errors when trying to restart.

These errors are a red herring. They are completely harmless (except to the people expecting those user-level messages to succeed) and are printed only because we haven't yet explicitly silenced them when running a validator. The output is debug logging that is completely isolated from the correct functioning of a validator.

I agree that they are misleading, and we really should silence them.

But the more important question is: why didn't the node ever connect to any of its peers? I would expect to see several network failures after a restart, but eventually things should catch up. Patience is important here.

@warner
Copy link
Member

warner commented Aug 24, 2021

For reference, this particular instance of a "rights not conserved" error is sent from the AMM contract (v18) at crank 1346190, in block 55576. I've found 196 instances so far, starting with crank 1322587 at time epoch+1629762640.

@rowgraus
Copy link
Contributor

rowgraus commented Aug 24, 2021

Comment from Chris at Chainflow says he's still out of commission pending a resolution on this issue. We had one more question from Decentralizer with apparently the same issue.

What's the guidance on troubleshooting this?

@cfl0ws
Copy link

cfl0ws commented Aug 24, 2021

It appears I'm not picking up any peers at all.

Note that the logs above are from a restart after I cleared my addressbook. If I do a restart w/o clearing the addressbook, the logs start here -

Aug 24 17:05:27 cflow-agoric-tn-01 systemd[1]: Started Agoric Cosmos daemon.
Aug 24 17:05:32 cflow-agoric-tn-01 ag-chain-cosmos[2193283]: 5:05PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}
Aug 24 17:05:32 cflow-agoric-tn-01 ag-chain-cosmos[2193283]: 5:05PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}
Aug 24 17:05:32 cflow-agoric-tn-01 ag-chain-cosmos[2193283]: 5:05PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}
Aug 24 17:05:32 cflow-agoric-tn-01 ag-chain-cosmos[2193283]: 5:05PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}
Aug 24 17:05:32 cflow-agoric-tn-01 ag-chain-cosmos[2193283]: 5:05PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}
Aug 24 17:05:32 cflow-agoric-tn-01 ag-chain-cosmos[2193283]: 5:05PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}
Aug 24 17:05:32 cflow-agoric-tn-01 ag-chain-cosmos[2193283]: 5:05PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}

@cfl0ws
Copy link

cfl0ws commented Aug 24, 2021

More context, it looks like I'm slowly picking up peers. I have three peers now and appear to be syncing. FWIW, the Stopping peer for error log entries make it difficult to see anything else substantive in the logs.

Aug 24 17:09:02 cflow-agoric-tn-01 ag-chain-cosmos[2193283]: 5:09PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}
Aug 24 17:09:02 cflow-agoric-tn-01 ag-chain-cosmos[2193283]: 5:09PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193283]: 5:09PM ERR dialing failed (attempts: 2): dial tcp 173.249.40.87:26656: i/o timeout addr={"id":"f8c2937cc825188b0a6af52f8026cb735eb68111","ip":"173.249.40.87","port":26656} module=pex
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]: Logging sent error stack (Error#143)
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]: Error#143: rights were not conserved for brand [object Alleged: BLD brand]
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]: Error: rights were not conserved for brand (an object)
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]:  at makeError ()
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]:  at fail ()
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]:  at baseAssert ()
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]:  at ()
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]:  at forEach ()
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]:  at assertEqualPerBrand ()
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]:  at assertRightsConserved ()
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]:  at reallocate ()
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]:  at reallocate ()
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]:  at swapIn ()
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]:  at ()
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]:  at win ()
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]:  at ()
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]: Error#143 ERROR_NOTE: Thrown from: (Error#144) : 52522 . 0
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]: Error#143 ERROR_NOTE: Sent as error:liveSlots:v18#70095
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]: Error#144: Event: 52521.1
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]: Error: Event: 52521.1
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]:  at construct ()
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]:  at Error ()
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]:  at trackTurns ()
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]:  at handle ()
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]:  at apply ()
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]:  at handleOffer ()
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]:  at ()
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]:  at win ()
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]:  at ()
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]: Error#144 ERROR_NOTE: Caused by: (Error#145)
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]: Error#145: Event: 52520.1
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]: Error: Event: 52520.1
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]:  at construct ()
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]:  at Error ()
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]:  at trackTurns ()
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]:  at handle ()
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]:  at deliver ()
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]:  at dispatchToUserspace ()
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]:  at runWithoutMetering ()
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]:  at ()
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]: Error#143 ERROR_NOTE: Sent as error:liveSlots:v18#70096
Aug 24 17:09:03 cflow-agoric-tn-01 ag-chain-cosmos[2193371]: Logging sent error stack (Error#143)

@dckc dckc changed the title cannot restart validator: Error#125: rights were not conserved for brand [object Alleged: RUN brand] cannot restart validator: not connecting to peers Aug 24, 2021
@warner
Copy link
Member

warner commented Aug 24, 2021

I'm splitting out the rights-conservation error into a separate ticket #29, and I'll copy my preliminary analysis over there.

@dckc
Copy link
Member Author

dckc commented Aug 24, 2021

@michaelfig @dtribble what do you know about our network topology? What is the peer discovery algorithm?

@michaelfig
Copy link
Member

@michaelfig [...] what do you know about our network topology? What is the peer discovery algorithm?

On this testnet we ran 4 public persistent peers. They were also RPC nodes, so that was interfering with them being able to process blocks. We also ran 2 public seed nodes (which gossip other peers to all nodes that connect to the seeds).

My understanding is that this should be adequate, if the nodes could keep up. The absence of "state sync" (Agoric/agoric-sdk#3769) is what makes starting a new validator from scratch take so long, and the problem with ABCI queries being blocked by computing blocks (tendermint/tendermint#6899) caused the public peers to be slow.

There may be a deeper underlying problem, but we won't discover that until the above are fixed and we have more testing.

Closing for now.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
testnet-problem issues / problems surfaced during testnet
Projects
None yet
Development

No branches or pull requests

6 participants