Add document change event broadcast using members list #189

dc7303 · 2021-05-24T14:29:43Z

What this PR does / why we need it:
It broadcasts document change events to replicated agents.

Add grpc broadcast service.
Implement the Publish API.
Broadcast client management.
~~Add grpc_recover.~~
When one server sends a request to another, the calling server must not stop if the replicated agent is unreachable.
Add stress test
Add design document for broadcast
Code cleanup

Added issue

If several tests each start a server for the cluster mode test, the servers interfere with each other and the tests fail.
Example:

 Error Trace:    cluster_stress_test.go:32
                Error:          Received unexpected error:
                                listen tcp :21501: bind: address already in use
---
 Error Trace:    cluster_mode_test.go:59
                                cluster_mode_test.go:128
                Error:          "map[c2oi......]" should have 2 item(s), but has 7

Which issue(s) this PR fixes:

Address #183

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Additional documentation:

Checklist:

Added relevant tests or not required
Didn't break anything

1. Improved the test so that each agent can send requests concurrently.
2. Added handling of `negative WaitGroup counter` errors that appear intermittently in stress tests.
3. Fixed a situation where a change was rejected in the test.
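
For context on item 2: Go panics with `sync: negative WaitGroup counter` when `Done` is called more times than `Add`. The commit message does not say how the error was handled, so the following is only a sketch of the usual pattern that avoids it; `runConcurrently` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runConcurrently starts n workers and waits for all of them.
// wg.Add is called before each goroutine starts and Done is deferred
// exactly once per worker, so the counter can never drop below zero.
func runConcurrently(n int, work func(id int)) int64 {
	var wg sync.WaitGroup
	var completed int64
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			work(id)
			atomic.AddInt64(&completed, 1)
		}(i)
	}
	wg.Wait()
	return atomic.LoadInt64(&completed)
}

func main() {
	done := runConcurrently(8, func(id int) {})
	fmt.Println("workers completed:", done)
}
```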

dc7303 · 2021-05-28T18:59:58Z

yorkie/test/helper/helper.go

Lines 110 to 136 in a1b8407

    
    var portOffset = 0

    // TestConfig returns config for creating Yorkie instance.
    func TestConfig(authWebhook string) *yorkie.Config {
    	portOffset += 100
    	return &yorkie.Config{
    		RPC: &rpc.Config{
    			Port: RPCPort + portOffset,
    		},
    		Metrics: &prometheus.Config{
    			Port: MetricsPort + portOffset,
    		},
    		Backend: &backend.Config{
    			SnapshotThreshold:       SnapshotThreshold,
    			AuthorizationWebhookURL: authWebhook,
    		},
    		Mongo: &mongo.Config{
    			ConnectionURI:        MongoConnectionURI,
    			ConnectionTimeoutSec: MongoConnectionTimeoutSec,
    			PingTimeoutSec:       MongoPingTimeoutSec,
    			YorkieDatabase:       TestDBName(),
    		},
    		ETCD: &etcd.Config{
    			Endpoints: ETCDEndpoints,
    		},
    	}
    }

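Port conflicts occur when multiple test packages use this function. `go test` builds each package under test into a separate test binary, so `portOffset` starts from 0 again in every package and the same ports are handed out more than once. I am looking for a stable way to allocate ports when running multiple servers.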
go-test doc

'Go test' recompiles each package along with any files with names matching
the file pattern "*_test.go"
....
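
One common way to avoid a shared counter is to let the operating system pick an unused port by listening on port 0. This is a sketch of that technique, not what this PR adopted; `freePort` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"net"
)

// freePort asks the kernel for an unused TCP port by listening on port 0,
// then closes the listener and returns the port number. Another process
// could still take the port before the caller binds it, so this narrows
// the conflict window rather than eliminating it.
func freePort() (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer l.Close()
	return l.Addr().(*net.TCPAddr).Port, nil
}

func main() {
	port, err := freePort()
	if err != nil {
		fmt.Println("could not find a free port:", err)
		return
	}
	fmt.Println("free port:", port)
}
```

Because the choice is made by the kernel at run time, it does not depend on state that is reset in each test binary.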

hackerwins

Thanks for your contribution.
I left a few simple questions.

api/yorkie.proto

test/integration/cluster_mode_test.go

test/stress/cluster_stress_test.go

yorkie/backend/sync/etcd/membermap.go

yorkie/backend/sync/etcd/client.go

yorkie/rpc/broadcast.go

yorkie/rpc/server.go

api/yorkie.proto

yorkie/backend/sync/etcd/pubsub.go

hackerwins · 2021-05-31T09:39:21Z

yorkie/backend/sync/etcd/pubsub.go

+	docEvent, err := converter.ToDocEvent(event)
+	if err != nil {
+		log.Logger.Error(err)
+		return


Is there any reason we broke the error chain here?

I didn't think the API call that produced a document event should return an error just because the event could not be broadcast to another agent, so I logged the error where it occurred. My reasoning was that the document would still be synchronized through later sync events.

I have been thinking about how to raise an alert when problems inside this method (such as a conversion failure or a fault in the target agent) keep recurring, but I wasn't sure of the right approach, so I planned to propose it later.

Can you please let me know if there are more things I should consider?

Converting an unsupported event type returns an error, and once that happens there is no need for the caller to keep looping. I would like us to handle errors explicitly.

1. Improved 'watch document across agents test'.
2. Remove cluster stress test.
3. Cleanup code.
4. Change service's name Broadcast to Cluster.
5. Remove grpc recovery

codecov · 2021-05-31T13:18:16Z

Codecov Report

Merging #189 (28cc71b) into main (b0af224) will decrease coverage by 1.21%.
The diff coverage is 22.61%.

❗ Current head 28cc71b differs from pull request most recent head 6d267ab. Consider uploading reports for the commit 6d267ab to get more accurate results

@@            Coverage Diff             @@
##             main     #189      +/-   ##
==========================================
- Coverage   61.58%   60.36%   -1.22%     
==========================================
  Files          41       42       +1     
  Lines        3454     3550      +96     
==========================================
+ Hits         2127     2143      +16     
- Misses       1131     1211      +80     
  Partials      196      196
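
The summary line reports a 1.21% decrease while the table above shows -1.22%. Both figures can be reproduced from the hit and line counts if percentages are truncated, not rounded, to two decimals. That truncation is an assumption made for this sketch, not something the report states; `truncatedBasisPoints` is a hypothetical helper.

```go
package main

import "fmt"

// truncatedBasisPoints returns hits/lines in hundredths of a percent,
// truncated toward zero (6158 means 61.58%).
func truncatedBasisPoints(hits, lines int) int {
	return hits * 10000 / lines
}

func main() {
	base := truncatedBasisPoints(2127, 3454) // main: 6158
	head := truncatedBasisPoints(2143, 3550) // #189: 6036

	fmt.Printf("main: %d.%02d%%\n", base/100, base%100)
	fmt.Printf("#189: %d.%02d%%\n", head/100, head%100)

	// Difference of the truncated figures: 6158 - 6036 = 122, i.e. 1.22 points.
	fmt.Printf("table delta: %d.%02d\n", (base-head)/100, (base-head)%100)

	// Difference of the unrounded ratios, which is closer to 1.21 points.
	exact := 100 * (float64(2127)/3454 - float64(2143)/3550)
	fmt.Printf("unrounded delta: %.4f\n", exact)
}
```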

Impacted Files	Coverage Δ
api/converter/from_pb.go	`65.22% <0.00%> (-1.88%)`	⬇️
api/converter/to_pb.go	`82.93% <0.00%> (-2.72%)`	⬇️
client/client.go	`12.13% <0.00%> (ø)`
yorkie/backend/sync/etcd/membermap.go	`0.00% <0.00%> (ø)`
yorkie/backend/sync/etcd/pubsub.go	`0.00% <0.00%> (ø)`
yorkie/backend/sync/memory/coordinator.go	`0.00% <0.00%> (ø)`
yorkie/rpc/cluster.go	`0.00% <0.00%> (ø)`
yorkie/rpc/server.go	`50.72% <3.57%> (+0.91%)`	⬆️
yorkie/backend/sync/memory/pubsub.go	`91.89% <89.18%> (-1.33%)`	⬇️
yorkie/backend/sync/etcd/client.go	`71.42% <100.00%> (+1.73%)`	⬆️
... and 1 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b0af224...6d267ab.

yorkie/backend/sync/etcd/pubsub.go

yorkie/rpc/server.go

yorkie/backend/sync/etcd/client.go

yorkie/backend/sync/etcd/pubsub.go

1. Improve lock usage when accessing cluster server clients.
2. Improved error handling.
3. Remove duplicate lock calls.

The DocEvent and Event types seem to overlap. Merged the two types and updated the logic that previously used DocEvent.

Broadcasting should also stop when the parent context terminates.

- Remove Members from pubsub
- Remove code that uses two locks together
- Change the key of clusterClientMap to id
- Cleanup comments
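
To illustrate a client map keyed by id and guarded by a single lock, here is a small sketch. `clientRegistry`, `clusterClient`, and their methods are hypothetical; the PR's actual `clusterClientMap` may be structured differently.

```go
package main

import (
	"fmt"
	"sync"
)

// clusterClient is a hypothetical handle to one peer agent.
type clusterClient struct{ addr string }

// clientRegistry keeps one client per member id behind a single RWMutex.
type clientRegistry struct {
	mu      sync.RWMutex
	clients map[string]*clusterClient
}

func newClientRegistry() *clientRegistry {
	return &clientRegistry{clients: make(map[string]*clusterClient)}
}

// getOrCreate returns the client for id, creating it on first use.
// The read lock covers the common case; the write lock re-checks the map
// because another goroutine may have inserted the client in between.
func (r *clientRegistry) getOrCreate(id, addr string) *clusterClient {
	r.mu.RLock()
	c, ok := r.clients[id]
	r.mu.RUnlock()
	if ok {
		return c
	}

	r.mu.Lock()
	defer r.mu.Unlock()
	if existing, ok := r.clients[id]; ok {
		return existing
	}
	c = &clusterClient{addr: addr}
	r.clients[id] = c
	return c
}

// remove drops the client for id, for example when the member leaves.
func (r *clientRegistry) remove(id string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.clients, id)
}

func main() {
	reg := newClientRegistry()
	a := reg.getOrCreate("member-1", "10.0.0.1:11101")
	b := reg.getOrCreate("member-1", "10.0.0.9:11101")
	fmt.Println("same client reused:", a == b)
}
```

Keying by id means a member that reconnects from a different address still maps to one entry.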

hackerwins

Thank you for your work. In particular, changing the topic to multi-key based in PubSub seems to be neat.

Broadcast seems to have many weaknesses compared to gossip-based algorithms, but for now it should be enough to operate a small-scale cluster.

(I made a few changes during the review.)

yorkie-team/yorkie#189

Co-authored-by: Hackerwins <[email protected]>

hackerwins marked this pull request as draft May 25, 2021 00:52

Add document change event broadcast using members list

4eaf726

dc7303 force-pushed the broadcasting branch 2 times, most recently from ec27bbb to a5ed817 Compare May 26, 2021 14:01

Improve stress test

158fb36

dc7303 force-pushed the broadcasting branch from a5ed817 to 158fb36 Compare May 27, 2021 17:13

dc7303 added 3 commits May 28, 2021 02:16

Merge branch 'main' into broadcasting

a5a8bcb

Cleanup code

2629b83

Merge branch 'main' into broadcasting

2690d0f

dc7303 marked this pull request as ready for review May 28, 2021 19:09

dc7303 requested review from hackerwins, mojosoeun and ppeeou May 28, 2021 19:10

dc7303 assigned hackerwins, mojosoeun and dc7303 May 28, 2021

hackerwins requested changes May 29, 2021

hackerwins reviewed May 29, 2021

yorkie/rpc/server.go

hackerwins assigned ppeeou and unassigned hackerwins May 29, 2021

hackerwins reviewed May 31, 2021

api/yorkie.proto

hackerwins reviewed May 31, 2021

Fix based on review content

5aaffbf
