Avoid writer poisoned errors #4533

Merged: lutter merged 5 commits into master from lutter/restart on Apr 25, 2023

Conversation

@lutter (Collaborator) commented Apr 11, 2023

If a subgraph encountered an error, it would get restarted with some backoff. If that error was caused by some store interaction, the WritableStore would remain in a poisoned state and refuse to process any more writes.

This PR changes that so that the WritableStore is restarted when a subgraph is restarted, clearing the error. That requires that we clear any internal state that is based on assumptions of what has been written so far. I am not entirely sure that the changes in core/src/subgraph/runner.rs are enough to do that, and would appreciate a thorough review of what other state might have to be cleared.
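The failure mode, as a minimal Rust sketch with a hypothetical `Writer` type rather than graph-node's actual `WritableStore`: once a write fails, the writer remembers the error and rejects every later write until it is recreated.

// Minimal sketch of the "poisoned writer" pattern; all names are illustrative.
struct Writer {
    // The first error encountered; once set, all later writes are rejected.
    poisoned: Option<String>,
}

impl Writer {
    fn new() -> Self {
        Writer { poisoned: None }
    }

    fn write(&mut self, batch: &str) -> Result<(), String> {
        if let Some(err) = &self.poisoned {
            return Err(format!("writer poisoned by earlier error: {err}"));
        }
        if batch.is_empty() {
            // Simulated store error: poison the writer.
            self.poisoned = Some("empty batch".to_string());
            return Err("empty batch".to_string());
        }
        Ok(())
    }

    // Restarting the writer clears the poison; the caller must also discard
    // any in-memory state that assumed the failed writes had succeeded.
    fn restart(&mut self) {
        self.poisoned = None;
    }
}

fn main() {
    let mut w = Writer::new();
    assert!(w.write("").is_err());         // first failure poisons the writer
    assert!(w.write("block 1").is_err());  // rejected even though the write is valid
    w.restart();                           // corresponds to restarting the subgraph
    assert!(w.write("block 1").is_ok());
}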

@leoyvens (Collaborator)

For correctness, you should use SubgraphRunner::revert_state, which is supposed to clear any dirty state written at or beyond a given block number.

The current approach also has a performance issue: it clears the entity cache on every Action::Restart. But Action::Restart is also used when a dynamic data source is created, where clearing the cache would be unnecessarily detrimental to performance. So we want to revert the state in the error case, but not in the created-data-source case. One way to do this would be to add an Action::RestartFromError(BlockPtr), used when restarting mid-block with potentially dirty state.
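A rough sketch of that suggestion, with made-up stand-ins for Action, BlockPtr and the runner (the real graph-node definitions differ): only the error variant reverts the runner state, while a plain restart keeps the entity cache.

// Illustrative only: hypothetical Action enum and runner types.
type BlockNumber = u64;

enum Action {
    Continue,
    Restart,                       // e.g. a dynamic data source was created
    RestartFromError(BlockNumber), // restarting mid-block with possibly dirty state
}

struct Runner {
    entity_cache: Vec<String>, // stand-in for the runner's cached writes
}

impl Runner {
    // Drop any state written at or beyond `block`.
    fn revert_state(&mut self, _block: BlockNumber) {
        self.entity_cache.clear();
    }

    fn handle(&mut self, action: Action) {
        match action {
            Action::Continue => {}
            // Plain restart: the cache is still valid, keep it for performance.
            Action::Restart => {}
            // Error restart: the cache may reflect writes that never landed.
            Action::RestartFromError(block) => self.revert_state(block),
        }
    }
}

fn main() {
    let mut runner = Runner { entity_cache: vec!["dirty entity".to_string()] };
    runner.handle(Action::Restart);
    assert_eq!(runner.entity_cache.len(), 1); // cache kept
    runner.handle(Action::RestartFromError(42));
    assert!(runner.entity_cache.is_empty());  // cache reverted
}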

@lutter (Collaborator, Author) commented Apr 14, 2023

I added a commit that hopefully addresses that.

let logger = self.store.logger.clone();
if let Err(e) = self.stop().await {
    warn!(logger, "Writable had error when stopping, it is safe to ignore this error"; "error" => e.to_string());
}
Collaborator:

Is it sufficient to send a stop request here? Shouldn't this also join the task handle of the writer, to ensure it has finished?

@lutter (Collaborator, Author):

I am actually wondering if this is too defensive: if the writer is poisoned, it's because we had an error in the Queue, and start_writer will return, i.e. shut down, on any error. Maybe I should just remove that code and leave a comment here?

@lutter (Collaborator, Author):

Actually, this would always fail: when a writer has been poisoned, it doesn't allow adding anything more to the queue. I changed this code to just log a warning if the join handle indicates that the background writer is still running. Joining here is a bit tricky since it risks blocking indexing of that subgraph if we really do have a problem with the writer stopping.
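A sketch of that "log a warning instead of joining" approach, assuming the writer runs as a tokio task and the Writable holds its JoinHandle (field and message names here are made up, not the PR's actual code):

use tokio::task::JoinHandle;

struct Writable {
    // Handle of the background writer task; illustrative field name.
    writer_handle: JoinHandle<()>,
}

impl Writable {
    // Called after the stop request: warn instead of awaiting the handle, so a
    // writer that fails to stop cannot block indexing of the subgraph here.
    fn warn_if_writer_still_running(&self) {
        if !self.writer_handle.is_finished() {
            eprintln!("warning: background writer is still running after stop request");
        }
    }
}

#[tokio::main]
async fn main() {
    let writable = Writable {
        // Simulate a writer that did not stop: a task that never completes.
        writer_handle: tokio::spawn(async { std::future::pending::<()>().await }),
    };
    writable.warn_if_writer_still_running();
}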

When the subgraph runner encounters an error, it needs to restart the store
to clear any errors that might have happened.
We only need to reset the state of the runner when the `WritableStore` actually had to be restarted because of an error; restarts that are not caused by an error leave the runner's state untouched.

Use `SubgraphRunner::revert_state` to properly reset the runner state.
@lutter (Collaborator, Author) commented Apr 25, 2023

Rebased to latest master.

@lutter merged commit 2ce7566 into master on Apr 25, 2023
@lutter deleted the lutter/restart branch on Apr 25, 2023 at 19:10
@lutter (Collaborator, Author) commented Apr 25, 2023

Whoops .. I screwed up; I thought this had already been approved, and merged it.
