getHistoryInterpreter: AcquireFailurePointTooOld #434

Closed
erikd opened this issue Dec 10, 2020 · 4 comments
Labels
bug: Something isn't working · priority high: issues/PRs that MUST be addressed. The release can't happen without this.

Comments

erikd (Contributor) commented Dec 10, 2020

If the node state is deleted and then the node and db-sync are started together, we sometimes get an error:

getHistoryInterpreter: AcquireFailurePointTooOld

It would be nice to find a neat workaround for this.

erikd (Contributor, Author) commented Dec 10, 2020

Currently, when we hit this issue, db-sync will retry (possibly multiple times) and eventually work correctly.
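For illustration only, a minimal sketch of that retry behaviour (the names here are hypothetical and not taken from the db-sync code base): keep re-running the startup action and pause briefly after each failure.

{-# LANGUAGE ScopedTypeVariables #-}
module RetrySketch where

import Control.Concurrent (threadDelay)
import Control.Exception (SomeException, try)

-- Re-run an action until it succeeds, pausing between attempts.
retryForever :: IO a -> IO a
retryForever action = do
  result <- try action
  case result of
    Right a -> pure a
    Left (_ :: SomeException) -> do
      threadDelay 5000000  -- wait 5 seconds before retrying
      retryForever action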

ArturWieczorek (Contributor) commented:

@erikd I am using:

db-sync:

git log -1
commit e9474143ba6ce07c2f6d498ee80178c007d68a28 (HEAD -> release/7.1.x, origin/release/7.1.x)
Author: Duncan Coutts <[email protected]>
Date:   Fri Dec 11 16:07:07 2020 +0000

    Version 7.1.2

node: 1.24.2 commit 400d18092ce604352cf36fe5f105b0d7c78be074

Steps for building/running:

node:

git clone https://github.com/input-output-hk/cardano-node
cd cardano-node
git checkout tags/1.24.2
nix-build -A scripts.mainnet.node -o mainnet-node-local

# Start node:
./mainnet-node-local

db-sync:

git clone https://github.com/input-output-hk/cardano-db-sync
cd cardano-db-sync
git checkout release/7.1.x
nix-build -A cardano-db-sync-extended -o db-sync-node-extended

# Create DB:
PGPASSFILE=config/pgpass-mainnet scripts/postgresql-setup.sh --createdb
All good!

# Start db-sync:
PGPASSFILE=config/pgpass-mainnet db-sync-node-extended/bin/cardano-db-sync-extended \
--config config/mainnet-config.yaml \
--socket-path ../cardano-node/state-node-mainnet/node.socket \
--schema-dir schema/ \
--state-dir ledger-state/mainnet 

This is how it ends for me after running db-sync for several minutes (both node and db-sync were started at around the same time, with maybe a 2 second delay between the two components). After displaying the error it just stops, without restarting:

[db-sync-node:Info:43] [2020-12-14 10:15:48.36 UTC] insertByronBlock: slot 380000, block 379948, hash d32a3695c15ff1a87fccae2100948f4ff19423136f98cc1a6aeaaf206c5f069b
[db-sync-node:Info:43] [2020-12-14 10:16:17.93 UTC] insertByronBlock: slot 385000, block 384948, hash 5266a9ca0a5195dcb4368968f35900bcf8d1f81a6fb53a93a115bd1e87eefe1b
[db-sync-node:Error:46] [2020-12-14 10:16:19.52 UTC] recvMsgRollForward: FatalError {fatalErrorMessage = "getHistoryInterpreter: AcquireFailurePointTooOld"}
[db-sync-node:Error:43] [2020-12-14 10:16:19.53 UTC] runDBThread: AsyncCancelled
[db-sync-node:Error:38] [2020-12-14 10:16:19.53 UTC] ChainSyncWithBlocksPtcl: FatalError {fatalErrorMessage = "getHistoryInterpreter: AcquireFailurePointTooOld"}
[db-sync-node.Mux:Info:35] [2020-12-14 10:16:19.53 UTC] Bearer on ConnectionId {localAddress = LocalAddress "", remoteAddress = LocalAddress "state-node-mainnet/node.socket"} event: State: Dead
[db-sync-node.Mux:Notice:35] [2020-12-14 10:16:19.53 UTC] Bearer on ConnectionId {localAddress = LocalAddress "", remoteAddress = LocalAddress "state-node-mainnet/node.socket"} event: Miniprotocol MiniProtocolNum 5 InitiatorDir terminated with exception FatalError {fatalErrorMessage = "getHistoryInterpreter: AcquireFailurePointTooOld"}
[db-sync-node.Subscription:Error:33] [2020-12-14 10:16:19.53 UTC] [String "Application Exception: LocalAddress \"../cardano-node/state-node-mainnet/node.socket\" FatalError {fatalErrorMessage = \"getHistoryInterpreter: AcquireFailurePointTooOld\"}",String "SubscriptionTrace"]
cardano-db-sync-extended: FatalError {fatalErrorMessage = "getHistoryInterpreter: AcquireFailurePointTooOld"}
[db-sync-node.ErrorPolicy:Error:4] [2020-12-14 10:16:19.53 UTC] [String "ErrorPolicyLocalNodeError (ApplicationExceptionTrace (FatalError {fatalErrorMessage = \"getHistoryInterpreter: AcquireFailurePointTooOld\"}))",String "ErrorPolicyTrace",String "LocalAddress \"../cardano-node/state-node-mainnet/node.socket\""]
artur@artur-desktop:~/Projects/db-sync-7-1/cardano-db-sync$

However, when I built a Docker image locally using this version of db-sync and used that image in the docker-compose file from the GQL repo, db-sync restarted shortly after displaying the above error...

kderme (Contributor) commented Dec 14, 2020

Looking at the logs above, hash 5266a9ca0a5195dcb4368968f35900bcf8d1f81a6fb53a93a115bd1e87eefe1b is in epoch 17. This is the epoch with a record number of transactions for the Byron era. In my tests the error appears between epochs 16-20, which have many transactions in general; this can't be a coincidence. Looking at the logs, during this period it takes 10-15s between recvMsgRollForward and when the block is finally added to the DB. This probably indicates significant congestion in the DB queue, though I haven't verified this with logs of the queue size.

evalSlotDetails always fails for EBBs, so a new interpreter is acquired. When there is congestion and a new interpreter is required, AcquireFailurePointTooOld is usually triggered.
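To illustrate the failure mode and one possible mitigation, here is a self-contained sketch; every name below (Point, Interpreter, getLatestPoint, acquireInterpreterAt) is a hypothetical stand-in, not the real db-sync or ouroboros-network API. The idea is that acquiring at a point the node has already forgotten yields AcquireFailurePointTooOld, and re-acquiring at a fresher point avoids dying with a fatal error.

module InterpreterSketch where

-- The two acquire failure modes of the local state query protocol.
data AcquireFailure
  = AcquireFailurePointTooOld
  | AcquireFailurePointNotOnChain
  deriving Show

newtype Point = Point Int deriving Show
newtype Interpreter = Interpreter Point deriving Show

-- Stand-in for asking the node for its current tip.
getLatestPoint :: IO Point
getLatestPoint = pure (Point 0)

-- Stand-in for the acquire step of the local state query protocol.
acquireInterpreterAt :: Point -> IO (Either AcquireFailure Interpreter)
acquireInterpreterAt p = pure (Right (Interpreter p))

-- If the point we asked for has already fallen out of the node's
-- history, retry with a fresh point instead of raising a fatal error.
getHistoryInterpreter :: IO Interpreter
getHistoryInterpreter = do
  point  <- getLatestPoint
  result <- acquireInterpreterAt point
  case result of
    Right interp                   -> pure interp
    Left AcquireFailurePointTooOld -> getHistoryInterpreter
    Left AcquireFailurePointNotOnChain ->
      error "getHistoryInterpreter: AcquireFailurePointNotOnChain"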

I'm not sure if there is any way to actually fix the race condition without changes in ouroboros-network. On our side, we can try to acquire the interpreter as soon as possible after recvMsgRollForward runs. I find it very suspicious that I sometimes get Error: Switchboard's queue full, dropping log items! warnings. Digging deeper, I found this in iohk-monitoring:

instance IsEffectuator Switchboard a where
    effectuate switchboard item = do
        let writequeue :: TBQ.TBQueue (LogObject a) -> LogObject a -> IO ()
            writequeue q i = do
                    nocapacity <- atomically $ TBQ.isFullTBQueue q
                    if nocapacity
                    then handleOverflow switchboard
                    else atomically $ TBQ.writeTBQueue q i

        sb <- readMVar (getSB switchboard)

        if (sbRunning sb) == SwitchboardRunning
            then writequeue (sbQueue sb) item
            else TIO.hPutStrLn stderr "Error: Switchboard is not running, dropping log items!"
        
    handleOverflow _ = TIO.hPutStrLn stderr "Error: Switchboard's queue full, dropping log items!"

It seems there is another race condition here: first it checks whether the queue is full and then writes to it. But these two steps are not atomic, so the queue may fill up in between, and the writing thread will block. A sketch of an atomic check-and-write follows below.
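For reference, a sketch of how the capacity check and the write could be combined into a single STM transaction so the queue cannot fill up in between; writeOrDrop and handleOverflowAction are hypothetical names, not a proposed patch to iohk-monitoring.

module SwitchboardSketch where

import Control.Concurrent.STM (atomically)
import Control.Concurrent.STM.TBQueue (TBQueue, isFullTBQueue, writeTBQueue)
import Control.Monad (unless)

-- Check for capacity and write in one atomic transaction; if the queue
-- is full, run the overflow handler instead of blocking on the write.
writeOrDrop :: TBQueue a -> a -> IO () -> IO ()
writeOrDrop q i handleOverflowAction = do
  wrote <- atomically $ do
    full <- isFullTBQueue q
    if full
      then pure False
      else writeTBQueue q i >> pure True
  unless wrote handleOverflowAction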

I need to take a closer look at exactly how these issues trigger the error.

vhulchenko-iohk added the bug and priority high labels on Jan 6, 2021
erikd (Contributor, Author) commented Jan 21, 2021

This should have been fixed in PR #495, specifically with commit dad6e36.

erikd closed this as completed on Jan 21, 2021