
Implement new logging policy #5359

Closed

philknows opened this issue Apr 13, 2023 · 7 comments
Labels
prio-high (Resolve issues as soon as possible), scope-logging (Issue about logs: hygiene, format issues, improvements), scope-ux (Issues for CLI UX or general consumer UX)

Comments

philknows (Member) commented Apr 13, 2023

This issue is to track the implementation of #5299 into Lodestar's logging. After the Shapella hard fork, we experienced a consistent stream of ignorable error messages relating to QUEUE_ERROR_QUEUE_MAX_LENGTH, caused by the flow of large batches of BLSToExecution messages, which created a poor logging UX for users of Lodestar during the fork transition. This is a wake-up call for us to urgently implement the new logging policy in the next attainable release to fix our poor logging UX.

There are many errors we would like to fix listed in #4400, which should now also abide by the new policy.

philknows added the prio-high, scope-logging and scope-ux labels on Apr 13, 2023
philknows added this to the v1.9.0 milestone on Apr 13, 2023
nflaig (Member) commented Apr 23, 2023

Potential candidate to move from warn to debug, as this log happens quite frequently even when there is nothing wrong with the node itself. This was mentioned by a user in Discord, but I also see this log a lot and don't know what to do with it.

this.logger.warn("Unclean disconnect", {peer: prettyPrintPeerId(peer)}, e as Error);

warn: Unclean disconnect peer=16...A3GmAv Socket close timeout
CodeError: Socket close timeout
    at Timeout._onTimeout (file:///usr/app/node_modules/@libp2p/tcp/src/socket-to-conn.ts:143:28)
    at listOnTimeout (node:internal/timers:569:17)
    at processTimers (node:internal/timers:512:7)

Probably someone with more experience in the networking stack needs to assess this, but I don't see how this log would be actionable to an end user. If a user is not able to get sufficient peers, the debug logs should be looked at instead.
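
For illustration, something along these lines could work, gating the level on known-benign causes. This is just a sketch; the Logger interface, onPeerDisconnect function and isBenignDisconnect helper are made up for the example and are not actual Lodestar code:

// Illustrative sketch only: keep "Unclean disconnect" at debug for known-benign
// causes (e.g. libp2p socket close timeouts) and reserve warn for anything unexpected.
interface Logger {
  debug(msg: string, ctx?: Record<string, unknown>, e?: Error): void;
  warn(msg: string, ctx?: Record<string, unknown>, e?: Error): void;
}

// Hypothetical helper: the benign case above surfaces as "Socket close timeout",
// so a simple message check is enough for this sketch.
function isBenignDisconnect(e: Error): boolean {
  return /socket close timeout/i.test(e.message);
}

function onPeerDisconnect(logger: Logger, peerIdStr: string, e: Error): void {
  if (isBenignDisconnect(e)) {
    logger.debug("Unclean disconnect", {peer: peerIdStr}, e);
  } else {
    logger.warn("Unclean disconnect", {peer: peerIdStr}, e);
  }
}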

nflaig (Member) commented Apr 30, 2023

These should be less verbose if the cause is Service Unavailable: Node is syncing. We do not need the stack trace, and should consider logging this as a warning: it is expected, but the user should still be aware that something is not yet working.

Error: Error on getProposerDuties - Service Unavailable: Node is syncing - headSlot 5529152 currentSlot 5529417
    at Function.assert (file:///usr/app/packages/api/src/utils/client/httpClient.ts:44:13)
    at BlockDutiesService.pollBeaconProposers (file:///usr/app/packages/validator/src/services/blockDuties.ts:155:14)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
    at BlockDutiesService.pollBeaconProposersAndNotify (file:///usr/app/packages/validator/src/services/blockDuties.ts:129:5)
    at BlockDutiesService.runBlockDutiesTask (file:///usr/app/packages/validator/src/services/blockDuties.ts:87:9)
    at Clock.runAtMostEvery (file:///usr/app/packages/validator/src/util/clock.ts:96:7)
Apr-30 13:23:24.418[]                error: Failed to download attester duties epoch=172794 Failed to obtain attester duty - Service Unavailable: Node is syncing - headSlot 5529152 currentSlot 5529417
Error: Failed to obtain attester duty - Service Unavailable: Node is syncing - headSlot 5529152 currentSlot 5529417
    at Function.assert (file:///usr/app/packages/api/src/utils/client/httpClient.ts:44:13)
    at AttestationDutiesService.pollBeaconAttestersForEpoch (file:///usr/app/packages/validator/src/services/attestationDuties.ts:214:14)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
    at AttestationDutiesService.pollBeaconAttesters (file:///usr/app/packages/validator/src/services/attestationDuties.ts:163:7)
    at async Promise.all (index 1)
    at AttestationDutiesService.runDutiesTasks (file:///usr/app/packages/validator/src/services/attestationDuties.ts:124:5)
    at Clock.runAtMostEvery (file:///usr/app/packages/validator/src/util/clock.ts:96:7)
Apr-30 13:23:24.423[]                error: Failed to download SyncDuties epoch=172794 Failed to obtain SyncDuties - Service Unavailable: Node is syncing - headSlot 5529152 currentSlot 5529417
Error: Failed to obtain SyncDuties - Service Unavailable: Node is syncing - headSlot 5529152 currentSlot 5529417
    at Function.assert (file:///usr/app/packages/api/src/utils/client/httpClient.ts:44:13)
    at SyncCommitteeDutiesService.pollSyncCommitteesForEpoch (file:///usr/app/packages/validator/src/services/syncCommitteeDuties.ts:233:14)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
    at SyncCommitteeDutiesService.pollSyncCommittees (file:///usr/app/packages/validator/src/services/syncCommitteeDuties.ts:178:7)
    at async Promise.all (index 1)
    at SyncCommitteeDutiesService.runDutiesTasks (file:///usr/app/packages/validator/src/services/syncCommitteeDuties.ts:140:5)
    at Clock.runAtMostEvery (file:///usr/app/packages/validator/src/util/clock.ts:96:7)
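
For illustration, the duties services could classify the error before logging, along the lines of this sketch. The Logger shape, isNodeSyncingError helper and logDutiesError function are made up for the example; a real check should probably key off the HTTP status or error code rather than the message text:

// Illustrative sketch: collapse "Node is syncing" API failures into a single warn
// line (message only, no stack trace), and keep full error logging for everything else.
interface Logger {
  warn(msg: string, ctx?: Record<string, unknown>): void;
  error(msg: string, ctx?: Record<string, unknown>, e?: Error): void;
}

function isNodeSyncingError(e: Error): boolean {
  return e.message.includes("Node is syncing");
}

function logDutiesError(logger: Logger, dutyType: string, epoch: number, e: Error): void {
  if (isNodeSyncingError(e)) {
    // Expected while the beacon node is still syncing; the user should be aware,
    // but does not need a stack trace repeated by every duties service.
    logger.warn(`Failed to download ${dutyType} duties`, {epoch, reason: e.message});
  } else {
    logger.error(`Failed to download ${dutyType} duties`, {epoch}, e);
  }
}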

dapplion (Contributor) commented May 2, 2023

There's a not-so-great attempt at controlling these types of error logs, see

isSyncing(e: Error) {

but it should be improved and applied to all paths that query the API.
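
As a rough sketch, the check could live in a single wrapper applied at every API call site, so the classification sits in one place. The names below (callApi, the Logger shape) are illustrative and not the existing isSyncing code:

// Illustrative sketch: wrap beacon API calls so "node is syncing" failures are
// reported uniformly, instead of each validator service re-implementing the check.
interface Logger {
  warn(msg: string, ctx?: Record<string, unknown>): void;
  error(msg: string, ctx?: Record<string, unknown>, e?: Error): void;
}

function isSyncing(e: Error): boolean {
  // A real check could inspect the HTTP status / error code instead of the message.
  return e.message.includes("Node is syncing");
}

async function callApi<T>(logger: Logger, routeId: string, fn: () => Promise<T>): Promise<T | null> {
  try {
    return await fn();
  } catch (e) {
    if (isSyncing(e as Error)) {
      logger.warn(`${routeId} not available yet, node is syncing`, {reason: (e as Error).message});
    } else {
      logger.error(`Error on ${routeId}`, {}, e as Error);
    }
    return null;
  }
}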

nflaig (Member) commented May 23, 2023

Another candidate we might want to rework.

error: Error on pollBeaconProposers  request to http://localhost:9596/eth/v1/validator/duties/proposer/177958 failed, reason: connect ECONNREFUSED 127.0.0.1:9596
FetchError: request to http://localhost:9596/eth/v1/validator/duties/proposer/177958 failed, reason: connect ECONNREFUSED 127.0.0.1:9596
    at ClientRequest.<anonymous> (/home/devops/goerli/lodestar/node_modules/node-fetch/lib/index.js:1461:11)
    at ClientRequest.emit (node:events:513:28)
    at Socket.socketErrorListener (node:_http_client:502:9)
    at Socket.emit (node:events:513:28)
    at emitErrorNT (node:internal/streams/destroy:151:8)
    at emitErrorCloseNT (node:internal/streams/destroy:116:3)
    at processTicksAndRejections (node:internal/process/task_queues:82:21)

Right now, this is quite noisy, since similar errors are spammed by different internal services of the VC.

I think we could handle this similarly to the node-is-syncing case and log a single-line error (no stack trace) stating that the node is unreachable/offline, including just the error message.
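
As a rough sketch of that behavior (the isNodeOffline check keying off node-fetch's ECONNREFUSED code/message and the logApiFailure helper are made up for the example):

// Illustrative sketch: when the beacon node is unreachable, log a single error line
// saying so, instead of spamming full FetchError stack traces from every VC service.
interface Logger {
  error(msg: string, ctx?: Record<string, unknown>, e?: Error): void;
}

function isNodeOffline(e: Error): boolean {
  // node-fetch system errors carry a code such as ECONNREFUSED; fall back to the
  // message in case the error shape differs.
  const code = (e as Error & {code?: string}).code;
  return code === "ECONNREFUSED" || e.message.includes("ECONNREFUSED");
}

function logApiFailure(logger: Logger, routeId: string, e: Error): void {
  if (isNodeOffline(e)) {
    logger.error(`Beacon node is offline or unreachable (${routeId})`, {reason: e.message});
  } else {
    logger.error(`Error on ${routeId}`, {}, e);
  }
}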

philknows (Member, Author) commented:

Agreed, the ECONNREFUSED with the IP address is already a pretty good explanation and clearly tells the user what the problem is so they can action it.

nflaig (Member) commented Jun 26, 2023

We create these two logs when a block error happens:

Jun-25 23:09:25.255[chain]           debug: Block error slot=5935547 code=BLOCK_ERROR_ALREADY_KNOWN, root=0xdc3da9c9eaf50c1fbb2a140e006dc6f8007feaee195eb4ed8ea7f18c9ff2dbb8
Error: BLOCK_ERROR_ALREADY_KNOWN
    at verifyBlocksSanityChecks (file:///home/devops/goerli/lodestar/packages/beacon-node/src/chain/blocks/verifyBlocksSanityChecks.ts:102:15)
    at BeaconChain.processBlocks (file:///home/devops/goerli/lodestar/packages/beacon-node/src/chain/blocks/index.ts:61:82)
    at JobItemQueue.BlockProcessor.jobQueue.JobItemQueue.maxLength [as itemProcessor] (file:///home/devops/goerli/lodestar/packages/beacon-node/src/chain/blocks/index.ts:27:30)
    at JobItemQueue.runJob (file:///home/devops/goerli/lodestar/packages/beacon-node/src/util/queue/itemQueue.ts:101:33)
    at JobItemQueue.runJob (file:///home/devops/goerli/lodestar/packages/beacon-node/src/util/queue/itemQueue.ts:118:15)
    at runNextTicks (node:internal/process/task_queues:60:5)
    at listOnTimeout (node:internal/timers:540:9)
    at processTimers (node:internal/timers:514:7)
Jun-25 23:09:25.255[network]         error: Error receiving block slot=5935547, peer=16Uiu2HAmR7yebF8ip58VVxCcbnSv27WFys9osJBncAi15wCSdGdk code=BLOCK_ERROR_ALREADY_KNOWN, root=0xdc3da9c9eaf50c1fbb2a140e006dc6f8007feaee195eb4ed8ea7f18c9ff2dbb8
Error: BLOCK_ERROR_ALREADY_KNOWN
    at verifyBlocksSanityChecks (file:///home/devops/goerli/lodestar/packages/beacon-node/src/chain/blocks/verifyBlocksSanityChecks.ts:102:15)
    at BeaconChain.processBlocks (file:///home/devops/goerli/lodestar/packages/beacon-node/src/chain/blocks/index.ts:61:82)
    at JobItemQueue.BlockProcessor.jobQueue.JobItemQueue.maxLength [as itemProcessor] (file:///home/devops/goerli/lodestar/packages/beacon-node/src/chain/blocks/index.ts:27:30)
    at JobItemQueue.runJob (file:///home/devops/goerli/lodestar/packages/beacon-node/src/util/queue/itemQueue.ts:101:33)
    at JobItemQueue.runJob (file:///home/devops/goerli/lodestar/packages/beacon-node/src/util/queue/itemQueue.ts:118:15)
    at runNextTicks (node:internal/process/task_queues:60:5)
    at listOnTimeout (node:internal/timers:540:9)
    at processTimers (node:internal/timers:514:7)

The error log might look scary to users and raise concerns (see Discord). AFAIK those errors are somewhat expected to happen; we could consider logging all of them at debug, or only a subset based on error code.

logger.error("Error receiving block", {slot: signedBlock.message.slot, peer: peerIdStr}, e as Error);
