
race in TestServerSetup #5837

Closed

AskAlexSharov opened this issue Oct 22, 2022 · 5 comments
Labels: gateway Issue assigned to gateway team

Comments

@AskAlexSharov (Collaborator)

==================
WARNING: DATA RACE
Write at 0x00c0002b4858 by goroutine 9231:
  runtime.racewrite()
      <autogenerated>:1 +0x24
  github.com/ledgerwatch/erigon-lib/kv/mdbx.(*MdbxKV).Close()
      github.com/ledgerwatch/[email protected]/kv/mdbx/kv_mdbx.go:411 +0x84
  github.com/ledgerwatch/erigon/p2p/enode.(*DB).Close()
      github.com/ledgerwatch/erigon/p2p/enode/nodedb.go:615 +0x1a8
  github.com/ledgerwatch/erigon/p2p.(*Server).Stop()
      github.com/ledgerwatch/erigon/p2p/server.go:424 +0x11d
  github.com/ledgerwatch/erigon/p2p.TestServerSetupConn.func1.2()
      github.com/ledgerwatch/erigon/p2p/server_test.go:466 +0x39
  runtime.deferreturn()
      runtime/panic.go:476 +0x32
  testing.tRunner()
      testing/testing.go:1446 +0x216
  testing.(*T).Run.func1()
      testing/testing.go:1493 +0x47

Previous read at 0x00c0002b4858 by goroutine 9234:
  runtime.raceread()
      <autogenerated>:1 +0x24
  github.com/ledgerwatch/erigon-lib/kv/mdbx.(*MdbxKV).BeginRw.func1()
      github.com/ledgerwatch/[email protected]/kv/mdbx/kv_mdbx.go:471 +0x6a
  runtime.deferreturn()
      runtime/panic.go:476 +0x32
  github.com/ledgerwatch/erigon-lib/kv/mdbx.(*MdbxKV).Update()
      github.com/ledgerwatch/[email protected]/kv/mdbx/kv_mdbx.go:654 +0xa5
  github.com/ledgerwatch/erigon/p2p/enode.(*DB).storeUint64()
      github.com/ledgerwatch/erigon/p2p/enode/nodedb.go:279 +0x1f6
  github.com/ledgerwatch/erigon/p2p/enode.(*DB).storeLocalSeq()
      github.com/ledgerwatch/erigon/p2p/enode/nodedb.go:536 +0x1d7
  github.com/ledgerwatch/erigon/p2p/enode.(*LocalNode).bumpSeq()
      github.com/ledgerwatch/erigon/p2p/enode/localnode.go:289 +0x270
  github.com/ledgerwatch/erigon/p2p/enode.(*LocalNode).sign()
      github.com/ledgerwatch/erigon/p2p/enode/localnode.go:274 +0x192
  github.com/ledgerwatch/erigon/p2p/enode.(*LocalNode).Node()
      github.com/ledgerwatch/erigon/p2p/enode/localnode.go:97 +0xd0
  github.com/ledgerwatch/erigon/p2p.(*Server).run()
      github.com/ledgerwatch/erigon/p2p/server.go:739 +0xc4
  github.com/ledgerwatch/erigon/p2p.(*Server).Start.func2()
      github.com/ledgerwatch/erigon/p2p/server.go:519 +0x39

Goroutine 9231 (running) created at:
  testing.(*T).Run()
      testing/testing.go:1493 +0x75d
  github.com/ledgerwatch/erigon/p2p.TestServerSetupConn()
      github.com/ledgerwatch/erigon/p2p/server_test.go:447 +0xc8c
  testing.tRunner()
      testing/testing.go:1446 +0x216
  testing.(*T).Run.func1()
      testing/testing.go:1493 +0x47

Goroutine 9234 (running) created at:
  github.com/ledgerwatch/erigon/p2p.(*Server).Start()
      github.com/ledgerwatch/erigon/p2p/server.go:519 +0xb1b
  github.com/ledgerwatch/erigon/p2p.(*Server).TestStart()
      github.com/ledgerwatch/erigon/p2p/server_test.go:94 +0x404
  github.com/ledgerwatch/erigon/p2p.TestServerSetupConn.func1()
      github.com/ledgerwatch/erigon/p2p/server_test.go:463 +0x405
  testing.tRunner()
      testing/testing.go:1446 +0x216
  testing.(*T).Run.func1()
      testing/testing.go:1493 +0x47
==================
--- FAIL: TestServerSetupConn (0.04s)
    --- FAIL: TestServerSetupConn/doEncHandshake,close, (0.01s)
@revitteth revitteth self-assigned this Nov 21, 2022
@nanevardanyan nanevardanyan added the gateway Issue assigned to gateway team label Dec 23, 2022
@hexoscott hexoscott self-assigned this Mar 6, 2023
@hexoscott (Collaborator)

The culprit here is a sync.WaitGroup having Add called on it from multiple goroutines. Add should be called from the same goroutine that calls Wait, and we're not guaranteeing that here. I'd imagine in a test scenario we're using the database without waiting for in-flight work to finish before asking it to close.

The question is: how should we handle this without a waitgroup? I've tried using atomics, but they're unreliable here and cause the same problem: "check for close, not closed, continue on, close happens, defer calls Add, boom!". Mutexes would defeat the point and make everything synchronous.

I've started experimenting with a goroutine spun up inside the call to Open, which makes use of the waitgroup synchronous in some respect by routing the calls to Wait/Add/Done over 3 channels, or at least makes it easier to handle an Add arriving after Wait has been called. But it seems a heavy approach and might end up in the same situation with more complicated code.

Any ideas?

@AskAlexSharov (Collaborator, Author)

AskAlexSharov commented Mar 14, 2023

General goodness:

  • call Add before creating the new goroutine; call Done in a defer.
  • cancel the goroutine by context, but wait for it to fully finish (wg.Wait is ok).
  • if you need to return an error from the goroutine, use the errgroup package; if not, a plain “go” is enough.
  • channels: the writer (producer) closes the channel when all writes are done.
  • for http (p2p) servers, “graceful shutdown” means: don’t accept any new requests, wait for existing requests to finish (cancel them by context, immediately or on a timeout; a waitgroup is ok), then close low-level resources like db/snapshots.

@hexoscott (Collaborator)

I think we're breaking quite a few of those rules in the mdbx code. We're not really in a position to call Add before the goroutine is created, because the call to db.BeginRo could come from anywhere at any time, off the back of an RPC call for example.

This is where the idea of using channels to manage communication with the waitgroup came from. The 3 waitgroup actions we care about would effectively all be handled within a single goroutine, the one that called Open on the db.

@AskAlexSharov (Collaborator, Author)

The waitgroup inside kv_mdbx.go isn't really about goroutines; it's just there so Close() isn't called while transactions are still open. We have many components which use their own mdbx db, and some of them are inherited from Geth and not really adapted to transaction use. So I was just too lazy to change the implementation of the higher-level components so they could guarantee not to call Close before all read/write activity has stopped.

My opinion: we don't need any fancy communication patterns. TestServerSetup must guarantee that all read/write goroutines have finished, and only then call db.Close(). If TestServerSetup can't guarantee that because we spawn goroutines it can't control, then just don't spawn such goroutines; they can ask TestServerSetup to spawn them instead.

I think if goroutine A creates goroutine B, then A is responsible for cancelling B, letting it finish successfully, and waiting for it to be done.

@hexoscott (Collaborator)

created #7111 to deal with this one
