Test Synapse in worker mode under Complement #12638

Closed · clokep opened this issue Aug 13, 2021 · 17 comments · Fixed by #12810

Labels: T-Task (Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks.)
clokep (Member) commented Aug 13, 2021

There's been some work to support this in matrix-org/complement#62, matrix-org/complement#105, and matrix-org/complement#116, but it still isn't possible to run Complement against Synapse in worker mode. We should finish this up.

This is a blocker for removing tests from sytest / switching to Complement.

clokep (Member, Author) commented Aug 13, 2021

@anoadragon453 could you braindump what you think needs to get done here?

kegsay (Member) commented Aug 17, 2021

matrix-org/pipelines#123 (comment) is probably a good start

anoadragon453 (Member) commented:

#10065 (comment) is probably the best summary at the moment.

The main takeaway is that we need to signal to Complement that all processes have completed startup before testing begins.
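To illustrate, a minimal sketch of that kind of readiness signal, assuming Complement waits on the container's Docker healthcheck and that the main process and each worker expose Synapse's /health endpoint (the ports here are placeholders):

```sh
#!/bin/sh
# Hypothetical aggregate healthcheck: only report healthy once the main
# process *and* every worker answer on their /health endpoint.
# Ports are placeholders; the real ones come from the generated config.
for port in 8008 8081 8082; do
    curl -fSs "http://localhost:${port}/health" > /dev/null || exit 1
done
exit 0
```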

kegsay (Member) commented Nov 22, 2021

Was this done?

anoadragon453 (Member) commented:

@kegsay Nope, still a TODO :/

richvdh (Member) commented May 4, 2022

#10065 is now resolved, and it is relatively easy to run Complement against a synapse-with-workers by setting the WORKERS env var when running https://github.com/matrix-org/synapse/blob/develop/scripts-dev/complement.sh.

However, I just tried to do so, and (after 15 minutes) it exploded with a pile of fail, so more work to be done here :(.
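For example, from a Synapse checkout with Complement checked out alongside it (matching the invocation in the footnotes below):

```sh
WORKERS=1 COMPLEMENT_DIR=../complement ./scripts-dev/complement.sh
```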

richvdh (Member) commented May 5, 2022

ok, it works much better if you bump up the timeout for starting the homeserver (#12637).

The failing tests are now:

FAIL	github.com/matrix-org/complement/tests	605.382s
--- FAIL: TestCanRegisterAdmin (39.04s)
--- FAIL: TestServerNotices (23.15s)
--- FAIL: TestRegistration (34.00s)
    --- FAIL: TestRegistration/parallel (1.95s)
        --- FAIL: TestRegistration/parallel/POST_/_synapse/admin/v1/register_admin_with_shared_secret (0.06s)
        --- FAIL: TestRegistration/parallel/POST_/_synapse/admin/v1/register_with_shared_secret (0.04s)
        --- FAIL: TestRegistration/parallel/POST_/_synapse/admin/v1/register_with_shared_secret_disallows_symbols (0.02s)
        --- FAIL: TestRegistration/parallel/POST_/_synapse/admin/v1/register_with_shared_secret_downcases_capitals (0.02s)
FAIL	github.com/matrix-org/complement/tests/csapi	604.665s

... which doesn't sound so bad. Presumably there is some snafu with the routing of registration, which would explain 6 of those 7 failures.

richvdh (Member) commented May 5, 2022

oh hum, it looks like the whole of the /csapi package failed:

panic: test timed out after 10m0s

richvdh transferred this issue from matrix-org/complement on May 5, 2022
richvdh changed the title from "Test Synapse in worker mode" to "Test Synapse in worker mode under Complement" on May 5, 2022
richvdh (Member) commented May 5, 2022

right, if you bump the timeout right up, you still get this list of test failures:

--- FAIL: TestSyncBlocksDuringPartialStateJoin (48.45s)
--- FAIL: TestImportHistoricalMessages (280.31s)
FAIL
FAIL	github.com/matrix-org/complement/tests	3067.783s
--- FAIL: TestCanRegisterAdmin (50.52s)
--- FAIL: TestServerNotices (39.16s)
--- FAIL: TestRegistration (29.87s)
    --- FAIL: TestRegistration/parallel (2.07s)
        --- FAIL: TestRegistration/parallel/POST_/_synapse/admin/v1/register_admin_with_shared_secret (0.04s)
        --- FAIL: TestRegistration/parallel/POST_/_synapse/admin/v1/register_with_shared_secret (0.02s)
        --- FAIL: TestRegistration/parallel/POST_/_synapse/admin/v1/register_with_shared_secret_disallows_symbols (0.01s)
        --- FAIL: TestRegistration/parallel/POST_/_synapse/admin/v1/register_with_shared_secret_downcases_capitals (0.01s)
FAIL
FAIL	github.com/matrix-org/complement/tests/csapi	2861.820s
FAIL

Which still isn't so bad. (Though a runtime of getting on for an hour is obviously sub-optimal.)

anoadragon453 added the T-Task label on May 5, 2022
reivilibre (Contributor) commented:

I ran Complement with workers¹, obtaining a runtime of 11m 7s and the following failures:

--- FAIL: TestSyncBlocksDuringPartialStateJoin (12.50s)
--- FAIL: TestMediaFilenames (12.89s)
    --- FAIL: TestMediaFilenames/Parallel (0.00s)
        --- FAIL: TestMediaFilenames/Parallel/Unicode (0.00s)
            --- FAIL: TestMediaFilenames/Parallel/Unicode/Can_download_with_Unicode_file_name_over_federation (0.50s)
--- FAIL: TestLocalPngThumbnail (6.21s)
--- FAIL: TestRemotePngThumbnail (14.49s)
--- FAIL: TestImportHistoricalMessages (50.93s)
FAIL
FAIL    github.com/matrix-org/complement/tests  664.096s
--- FAIL: TestCanRegisterAdmin (8.45s)
--- FAIL: TestServerNotices (6.20s)
--- FAIL: TestRegistration (9.56s)
    --- FAIL: TestRegistration/parallel (0.42s)
        --- FAIL: TestRegistration/parallel/POST_/_synapse/admin/v1/register_admin_with_shared_secret (0.01s)
        --- FAIL: TestRegistration/parallel/POST_/_synapse/admin/v1/register_with_shared_secret (0.01s)
        --- FAIL: TestRegistration/parallel/POST_/_synapse/admin/v1/register_with_shared_secret_disallows_symbols (0.01s)
        --- FAIL: TestRegistration/parallel/POST_/_synapse/admin/v1/register_with_shared_secret_downcases_capitals (0.01s)
--- FAIL: TestRoomImageRoundtrip (7.46s)
FAIL
FAIL    github.com/matrix-org/complement/tests/csapi    586.197s
FAIL

(Not sure what happened between Rich's run taking most of an hour and my run!)

I also get some failures with no workers², so there might be something wrong with my setup (as CI doesn't reproduce these failures):

--- FAIL: TestRoomState (5.77s)
    --- FAIL: TestRoomState/Parallel (0.00s)
        --- FAIL: TestRoomState/Parallel/GET_/publicRooms_lists_newly-created_room (1.97s)
FAIL
FAIL    github.com/matrix-org/complement/tests/csapi    247.104s
FAIL

So the summary of 'What's the state of workerised Synapse under Complement?' is:

  • the current approach seems viable overall
  • some tests fail and will need to be fixed
  • some tests might take a long time (on Rich's machine, haven't reproduced here though)

Footnotes

  1. (With firewall disabled!) WORKERS=1 COMPLEMENT_DIR=../complement ./scripts-dev/complement.sh &> ../_misc/2022-05/comlog6_with_workers

  2. COMPLEMENT_DIR=../complement ./scripts-dev/complement.sh &> ../_misc/2022-05/comlog6_no_workers

richvdh (Member) commented May 16, 2022

some tests might take a long time (on Rich's machine, haven't reproduced here though)

they all take a long time. Every single test seems to take a good 30 seconds to spin up Synapse: AFAICT it's just busy spinning up 14(?) python processes - or 28, if it's a federated test.

It's entirely possible the difference is in hardware. FTR my machine is an i7-6560U with 16 GB of RAM, though it's a bit venerable and I suspect it gets thermally throttled when it's busy.

reivilibre (Contributor) commented:

Every single test seems to take a good 30 seconds to spin up Synapse: AFAICT it's just busy spinning up 14(?) python processes - or 28, if it's a federated test.

Why is Sytest not bitten by this problem? Come to think of it, does Sytest not spin up a new homeserver for every test?

I also wonder if supervisor isn't starting up the workers at the same time, or something like that.

It's entirely possible the difference is in hardware. FTR my machine is a i7-6560U with 16G of ram

That's a good point; the machine I use is very performant (Ryzen 7 5800H with 32 GB RAM). The cooling is also very competent, so I suspect it's not thermally throttled. It has enough threads for all the workers to get their own (crazy, I know). A 6× difference in runtime doesn't sound so crazy taking all that into account.

What does Complement's runtime look like for you when using a monolithic deployment? As a point of comparison, COMPLEMENT_DIR=../complement ./scripts-dev/complement.sh takes 5m8s here.

richvdh (Member) commented May 17, 2022

Why is Sytest not bitten by this problem? Come to think of it, does Sytest not spin up a new homeserver for every test?

No it does not, and that is why it is not bitten by this problem :). It spins up two HS instances and reuses them for every test. On the one hand, that's faster. On the other, it's very easy for state to leak from one test to another.

I also wonder if supervisor isn't starting up the workers at the same time or something like that;

it seems to be starting them together, or near enough. Ironically I wonder if it would be better if it staggered them. But 🤷‍♂️

What does Complement's runtime look like for you when using a monolithic deployment? As a point of comparison, COMPLEMENT_DIR=../complement ./scripts-dev/complement.sh takes 5m8s here.

Apparently, more than 10 minutes, since it timed out.

Something I forgot here: it's not just spinning up two homeservers at once (for the federated tests), it's doing three (because Complement likes to run the tests/csapi tests in parallel with those in tests). Adding -p 1 to the command line (see https://pkg.go.dev/cmd/go#hdr-Compile_packages_and_dependencies) helps with that.
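For illustration, assuming you invoke go test directly from a Complement checkout (the image name is a placeholder):

```sh
# build and run one test package at a time instead of in parallel
COMPLEMENT_BASE_IMAGE=complement-synapse:latest go test -p 1 ./tests/...
```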

reivilibre (Contributor) commented:

I'm down to just a couple of test failures (on a custom branch with all my in-flight PRs merged, that is):

--- FAIL: TestSyncBlocksDuringPartialStateJoin (11.01s)
--- FAIL: TestImportHistoricalMessages (45.01s)
FAIL
FAIL    github.com/matrix-org/complement/tests  626.584s
FAIL

These two tests are tracked in #12822 and #12825.

After that, I'm trying to get workerised Complement enabled in CI at #12810 (looks like it's going to involve a bit of fighting GitHub Actions since my first attempt didn't just work right away).
The slowness might bite us — if CI is bogged down by this then we'll need to think of some way to improve that (perhaps by disabling unnecessary workers in some tests? At the risk of maintaining a map from test ⇒ useless workers..).

Fixing the slowness would be nice anyway, to help Rich and likely others out, but if CI turns out to be tolerable (i.e. no slower than the slowest of SyTest or Trial?) then it'd be good to get this running in CI as soon as possible.

reivilibre (Contributor) commented:

#12810 is giving 52 minutes for running the suite in CI. That's not good; easily twice as slow as the next slowest thing in CI.
I have a few ideas (and it would be worth having a look to see where the time is actually going), but I'm doubtful that there will be any trivial 2× improvements!

I will note that this figure will almost certainly get worse as new tests are added (so aiming for SyTest parity will make it worse!)

anoadragon453 (Member) commented:

Since Complement tests are self-contained, one way to handle this is to run the test suite across multiple GHA workers. It's not ideal, but would work.
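As a rough sketch of the split (the package paths come from the runs above; everything else is hypothetical):

```sh
# hypothetical shard 1: the general/federation tests
COMPLEMENT_BASE_IMAGE=complement-synapse:latest go test ./tests
# hypothetical shard 2: the client-server API tests
COMPLEMENT_BASE_IMAGE=complement-synapse:latest go test ./tests/csapi
```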

richvdh (Member) commented May 23, 2022

I think we need to confirm what, exactly, is taking so long. I think it's Python loading all the .pyc(?) files, but it would be good to confirm that, possibly by adding extra logging.
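One cheap way to test the import-cost theory would be CPython's -X importtime flag; a sketch (the module name here assumes the generic worker entrypoint):

```sh
# print per-module import times (to stderr) for one worker process
python -X importtime -c "import synapse.app.generic_worker" 2> importtime.log
# show the modules with the largest cumulative import time first
sort -t'|' -k2 -rn importtime.log | head -n 20
```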
