Test Synapse in worker mode under Complement #12638

Closed · clokep opened this issue Aug 13, 2021 · 17 comments · Fixed by #12810

Labels: T-Task (Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks.)
clokep (Member) commented Aug 13, 2021

There's been some work to support this in matrix-org/complement#62, matrix-org/complement#105, and matrix-org/complement#116, but it still isn't possible to run Complement against Synapse in worker mode. We should finish this up.

This is a blocker for removing tests from sytest / switching to Complement.

clokep (Member, Author) commented Aug 13, 2021

@anoadragon453 could you braindump what you think needs to get done here?

kegsay (Member) commented Aug 17, 2021

matrix-org/pipelines#123 (comment) is probably a good start

anoadragon453 (Member) commented:

#10065 (comment) is probably the best summary at the moment.

The main takeaway is that we need to signal to Complement that all processes have completed startup before testing begins.
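To illustrate, a minimal sketch of that kind of readiness signal, assuming Complement waits on the container's Docker healthcheck and that the main process and each worker expose Synapse's /health endpoint (the ports here are placeholders):

```sh
#!/bin/sh
# Hypothetical aggregate healthcheck: only report healthy once the main
# process *and* every worker answer on their /health endpoint.
# Ports are placeholders; the real ones come from the generated config.
for port in 8008 8081 8082; do
    curl -fSs "http://localhost:${port}/health" > /dev/null || exit 1
done
exit 0
```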

kegsay (Member) commented Nov 22, 2021

Was this done?

anoadragon453 (Member) commented:

@kegsay Nope, still a TODO :/

richvdh (Member) commented May 4, 2022

#10065 is now resolved, and it is relatively easy to run Complement against a synapse-with-workers by setting the WORKERS env var when running https://github.com/matrix-org/synapse/blob/develop/scripts-dev/complement.sh.

However, I just tried to do so, and (after 15 minutes) it exploded with a pile of fail, so more work to be done here :(.
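For example, from a Synapse checkout with Complement checked out alongside it (matching the invocation in the footnotes below):

```sh
WORKERS=1 COMPLEMENT_DIR=../complement ./scripts-dev/complement.sh
```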

richvdh (Member) commented May 5, 2022

ok, it works much better if you bump up the timeout for starting the homeserver (#12637).

The failing tests are now:

FAIL	github.com/matrix-org/complement/tests	605.382s
--- FAIL: TestCanRegisterAdmin (39.04s)
--- FAIL: TestServerNotices (23.15s)
--- FAIL: TestRegistration (34.00s)
    --- FAIL: TestRegistration/parallel (1.95s)
        --- FAIL: TestRegistration/parallel/POST_/_synapse/admin/v1/register_admin_with_shared_secret (0.06s)
        --- FAIL: TestRegistration/parallel/POST_/_synapse/admin/v1/register_with_shared_secret (0.04s)
        --- FAIL: TestRegistration/parallel/POST_/_synapse/admin/v1/register_with_shared_secret_disallows_symbols (0.02s)
        --- FAIL: TestRegistration/parallel/POST_/_synapse/admin/v1/register_with_shared_secret_downcases_capitals (0.02s)
FAIL	github.com/matrix-org/complement/tests/csapi	604.665s

... which doesn't sound so bad. Presumably there is some snafu with the routing of registration, which would explain 6 of those 7 failures.

richvdh (Member) commented May 5, 2022

oh hum, it looks like the whole of the /csapi package failed:

panic: test timed out after 10m0s

richvdh transferred this issue from matrix-org/complement on May 5, 2022
richvdh changed the title from "Test Synapse in worker mode" to "Test Synapse in worker mode under Complement" on May 5, 2022
richvdh (Member) commented May 5, 2022

right, if you bump the timeout right up, you still get this list of test failures:

--- FAIL: TestSyncBlocksDuringPartialStateJoin (48.45s)
--- FAIL: TestImportHistoricalMessages (280.31s)
FAIL
FAIL	github.com/matrix-org/complement/tests	3067.783s
--- FAIL: TestCanRegisterAdmin (50.52s)
--- FAIL: TestServerNotices (39.16s)
--- FAIL: TestRegistration (29.87s)
    --- FAIL: TestRegistration/parallel (2.07s)
        --- FAIL: TestRegistration/parallel/POST_/_synapse/admin/v1/register_admin_with_shared_secret (0.04s)
        --- FAIL: TestRegistration/parallel/POST_/_synapse/admin/v1/register_with_shared_secret (0.02s)
        --- FAIL: TestRegistration/parallel/POST_/_synapse/admin/v1/register_with_shared_secret_disallows_symbols (0.01s)
        --- FAIL: TestRegistration/parallel/POST_/_synapse/admin/v1/register_with_shared_secret_downcases_capitals (0.01s)
FAIL
FAIL	github.com/matrix-org/complement/tests/csapi	2861.820s
FAIL

Which still isn't so bad. (Though a runtime of getting on for an hour is obviously sub-optimal.)

anoadragon453 added the T-Task label on May 5, 2022
reivilibre (Contributor) commented:

I ran Complement with workers¹, obtaining a runtime of 11m 7s and the following failures:

--- FAIL: TestSyncBlocksDuringPartialStateJoin (12.50s)
--- FAIL: TestMediaFilenames (12.89s)
    --- FAIL: TestMediaFilenames/Parallel (0.00s)
        --- FAIL: TestMediaFilenames/Parallel/Unicode (0.00s)
            --- FAIL: TestMediaFilenames/Parallel/Unicode/Can_download_with_Unicode_file_name_over_federation (0.50s)
--- FAIL: TestLocalPngThumbnail (6.21s)
--- FAIL: TestRemotePngThumbnail (14.49s)
--- FAIL: TestImportHistoricalMessages (50.93s)
FAIL
FAIL    github.com/matrix-org/complement/tests  664.096s
--- FAIL: TestCanRegisterAdmin (8.45s)
--- FAIL: TestServerNotices (6.20s)
--- FAIL: TestRegistration (9.56s)
    --- FAIL: TestRegistration/parallel (0.42s)
        --- FAIL: TestRegistration/parallel/POST_/_synapse/admin/v1/register_admin_with_shared_secret (0.01s)
        --- FAIL: TestRegistration/parallel/POST_/_synapse/admin/v1/register_with_shared_secret (0.01s)
        --- FAIL: TestRegistration/parallel/POST_/_synapse/admin/v1/register_with_shared_secret_disallows_symbols (0.01s)
        --- FAIL: TestRegistration/parallel/POST_/_synapse/admin/v1/register_with_shared_secret_downcases_capitals (0.01s)
--- FAIL: TestRoomImageRoundtrip (7.46s)
FAIL
FAIL    github.com/matrix-org/complement/tests/csapi    586.197s
FAIL

(Not sure what happened between Rich's run taking most of an hour and my run!)

I also get some failures with no workers², so there might be something wrong with my setup (as CI doesn't reproduce these failures):

--- FAIL: TestRoomState (5.77s)
    --- FAIL: TestRoomState/Parallel (0.00s)
        --- FAIL: TestRoomState/Parallel/GET_/publicRooms_lists_newly-created_room (1.97s)
FAIL
FAIL    github.com/matrix-org/complement/tests/csapi    247.104s
FAIL

So the summary of 'What's the state of workerised Synapse under Complement?' is:

  • the current approach seems viable overall
  • some tests fail and will need to be fixed
  • some tests might take a long time (on Rich's machine, haven't reproduced here though)

Footnotes

  1. (With firewall disabled!) WORKERS=1 COMPLEMENT_DIR=../complement ./scripts-dev/complement.sh &> ../_misc/2022-05/comlog6_with_workers

  2. COMPLEMENT_DIR=../complement ./scripts-dev/complement.sh &> ../_misc/2022-05/comlog6_no_workers

richvdh (Member) commented May 16, 2022

some tests might take a long time (on Rich's machine, haven't reproduced here though)

they all take a long time. Every single test seems to take a good 30 seconds to spin up Synapse: AFAICT it's just busy spinning up 14(?) python processes - or 28, if it's a federated test.

It's entirely possible the difference is in hardware. FTR my machine is an i7-6560U with 16 GB of RAM, though it's a bit venerable and I suspect it gets thermally throttled when it's busy.

reivilibre (Contributor) commented:

Every single test seems to take a good 30 seconds to spin up Synapse: AFAICT it's just busy spinning up 14(?) python processes - or 28, if it's a federated test.

Why is Sytest not bitten by this problem? Come to think of it, does Sytest not spin up a new homeserver for every test?

I also wonder if supervisor isn't starting up the workers at the same time, or something like that.

It's entirely possible the difference is in hardware. FTR my machine is a i7-6560U with 16G of ram

That's a good point; the machine I use is very performant (Ryzen 7 5800H with 32 GB RAM). The cooling is also very competent, so I suspect it's not thermally throttled. It has enough threads for all the workers to get their own (crazy, I know). A 6× difference in runtime doesn't sound so crazy taking all that into account.

What does Complement's runtime look like for you when using a monolithic deployment? As a point of comparison, COMPLEMENT_DIR=../complement ./scripts-dev/complement.sh takes 5m8s here.

richvdh (Member) commented May 17, 2022

Why is Sytest not bitten by this problem? Come to think of it, does Sytest not spin up a new homeserver for every test?

No it does not, and that is why it is not bitten by this problem :). It spins up two HS instances and reuses them for every test. On the one hand, that's faster. On the other, it's very easy for state to leak from one test to another.

I also wonder if supervisor isn't starting up the workers at the same time or something like that;

it seems to be starting them together, or near enough. Ironically I wonder if it would be better if it staggered them. But 🤷‍♂️

What does Complement's runtime look like for you when using a monolithic deployment? As a point of comparison, COMPLEMENT_DIR=../complement ./scripts-dev/complement.sh takes 5m8s here.

Apparently, more than 10 minutes, since it timed out.

Something I forgot here: it's not just spinning up two homeservers at once (for the federated tests), it's doing three (because Complement likes to run the tests/csapi tests in parallel with those in tests). Adding -p 1 to the command line (see https://pkg.go.dev/cmd/go#hdr-Compile_packages_and_dependencies) helps with that.
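For illustration, assuming you invoke go test directly from a Complement checkout (the image name is a placeholder):

```sh
# build and run one test package at a time instead of in parallel
COMPLEMENT_BASE_IMAGE=complement-synapse:latest go test -p 1 ./tests/...
```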

reivilibre (Contributor) commented:

I'm down to just a couple of test failures (on a custom branch with all my in-flight PRs merged, that is):

--- FAIL: TestSyncBlocksDuringPartialStateJoin (11.01s)
--- FAIL: TestImportHistoricalMessages (45.01s)
FAIL
FAIL    github.com/matrix-org/complement/tests  626.584s
FAIL

These two tests are tracked in #12822 and #12825.

After that, I'm trying to get workerised Complement enabled in CI at #12810 (looks like it's going to involve a bit of fighting GitHub Actions since my first attempt didn't just work right away).
The slowness might bite us — if CI is bogged down by this then we'll need to think of some way to improve that (perhaps by disabling unnecessary workers in some tests? At the risk of maintaining a map from test ⇒ useless workers..).

Fixing the slowness would be nice anyway, to help Rich and likely others out, but if CI turns out to be tolerable (i.e. no slower than the slowest of SyTest or Trial?) then it'd be good to get this running in CI as soon as possible.

reivilibre (Contributor) commented:

#12810 is giving 52 minutes for running the suite in CI. That's not good; easily twice as slow as the next slowest thing in CI.
I have a few ideas (and it would be worth having a look to see where the time is actually going), but I'm doubtful that there will be any trivial 2× improvements!

I will note that this figure will almost certainly get worse as new tests are added (so aiming for SyTest parity will make it worse!)

anoadragon453 (Member) commented:

Since Complement tests are self-contained, one way to handle this is to run the test suite across multiple GHA workers. It's not ideal, but would work.
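As a rough sketch of the split (the package paths come from the runs above; everything else is hypothetical):

```sh
# hypothetical shard 1: the general/federation tests
COMPLEMENT_BASE_IMAGE=complement-synapse:latest go test ./tests
# hypothetical shard 2: the client-server API tests
COMPLEMENT_BASE_IMAGE=complement-synapse:latest go test ./tests/csapi
```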

richvdh (Member) commented May 23, 2022

I think we need to confirm what, exactly, is taking so long. I think it's Python loading all the .pyc(?) files, but it would be good to confirm that, possibly by adding extra logging.
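One cheap way to test the import-cost theory would be CPython's -X importtime flag; a sketch (the module name here assumes the generic worker entrypoint):

```sh
# print per-module import times (to stderr) for one worker process
python -X importtime -c "import synapse.app.generic_worker" 2> importtime.log
# show the modules with the largest cumulative import time first
sort -t'|' -k2 -rn importtime.log | head -n 20
```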
