This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Sending messages to other servers is broken #9185

Closed
artemislena opened this issue Jan 20, 2021 · 19 comments
Labels
T-Other Questions, user support, anything else. X-Needs-Info This issue is blocked awaiting information from the reporter

Comments

@artemislena

Description

Sending messages to other servers does not work. Receiving works, however, and read receipts and typing notifications do go through.

Steps to reproduce

  • Install Synapse.
  • Run.
  • Try sending a message and see if it arrives on the other server.

Expected behaviour

The message should arrive.

Actual behaviour

It does not. The log says:

2021-01-20 20:24:52,681 - synapse.metrics.background_process_metrics - 210 - ERROR - process_event_queue_for_federation-21 - Background process 'process_event_queue_for_federation' threw an exception
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/synapse/metrics/background_process_metrics.py", line 208, in run
    return await maybe_awaitable(func(*args, **kwargs))
  File "/usr/lib/python3.9/site-packages/synapse/federation/sender/__init__.py", line 252, in _process_event_queue_loop
    await make_deferred_yieldable(
twisted.internet.defer.FirstError: FirstError[#0, [Failure instance: Traceback: <class 'psycopg2.errors.SerializationFailure'>: could not serialize access due to concurrent update

/usr/lib/python3.9/site-packages/twisted/internet/defer.py:501:errback
/usr/lib/python3.9/site-packages/twisted/internet/defer.py:568:_startRunCallbacks
/usr/lib/python3.9/site-packages/twisted/internet/defer.py:654:_runCallbacks
/usr/lib/python3.9/site-packages/twisted/internet/defer.py:1475:gotResult
--- <exception caught here> ---
/usr/lib/python3.9/site-packages/twisted/internet/defer.py:1416:_inlineCallbacks
/usr/lib/python3.9/site-packages/twisted/python/failure.py:512:throwExceptionIntoGenerator
/usr/lib/python3.9/site-packages/synapse/federation/sender/__init__.py:246:handle_room_events
/usr/lib/python3.9/site-packages/synapse/federation/sender/__init__.py:234:handle_event
/usr/lib/python3.9/site-packages/synapse/federation/sender/__init__.py:310:_send_pdu
/usr/lib/python3.9/site-packages/synapse/storage/databases/main/transactions.py:313:store_destination_rooms_entries
/usr/lib/python3.9/site-packages/synapse/storage/database.py:656:runInteraction
/usr/lib/python3.9/site-packages/synapse/storage/database.py:739:runWithConnection
/usr/lib/python3.9/site-packages/twisted/python/threadpool.py:250:inContext
/usr/lib/python3.9/site-packages/twisted/python/threadpool.py:266:<lambda>
/usr/lib/python3.9/site-packages/twisted/python/context.py:122:callWithContext
/usr/lib/python3.9/site-packages/twisted/python/context.py:85:callWithContext
/usr/lib/python3.9/site-packages/twisted/enterprise/adbapi.py:306:_runWithConnection
/usr/lib/python3.9/site-packages/twisted/python/compat.py:464:reraise
/usr/lib/python3.9/site-packages/twisted/enterprise/adbapi.py:297:_runWithConnection
/usr/lib/python3.9/site-packages/synapse/storage/database.py:734:inner_func
/usr/lib/python3.9/site-packages/synapse/storage/database.py:534:new_transaction
/usr/lib/python3.9/site-packages/synapse/storage/databases/main/transactions.py:345:_store_destination_rooms_entries_txn
/usr/lib/python3.9/site-packages/synapse/storage/database.py:268:execute_batch
/usr/lib/python3.9/site-packages/synapse/storage/database.py:314:_do_execute
/usr/lib/python3.9/site-packages/synapse/storage/database.py:268:<lambda>
/usr/lib/python3.9/site-packages/psycopg2/extras.py:1198:execute_batch
]]

Version information

  • Homeserver: fantasycookie17.onederfultech.com

  • Version: I first ran 1.25.0, but was then advised to switch to 1.26.0.rc3 because I was encountering an error in the logs that it fixed.

  • Install method: 1.25.0 via XBPS, 1.26.0.rc3 manually via git clone.

  • Platform: Void Linux in a KVM on Fedora Server. x86-64.

@clokep
Member

clokep commented Jan 20, 2021

1.26.0.rc3 is not a version that exists; can you confirm what version you're using?

Generally the "could not serialize access due to concurrent update" error isn't something to be concerned about; see #4993.

Did you have outgoing federation working and then it broke, or has it never worked? Additional logs of what the federation sender is doing would help.

@artemislena
Author

Sorry, I meant rc1… And federation sending used to work; it just broke today or so, after working for months.
Not sure what additional logs I should have… On loglevel ERROR, this is all I get. Should I increase it? If so, to what level?

@clokep
Member

clokep commented Jan 20, 2021

On loglevel ERROR, this is all I get. Should I increase it? If so, to what level?

We will not be able to help you without INFO logs.
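(For reference, the level is set in the YAML log config that your homeserver.yaml's log_config — or a worker's worker_log_config — points at. A minimal sketch, keeping whatever handlers your existing file already defines:)

loggers:
    synapse:
        level: INFO

root:
    level: INFO
    handlers: [file, console]   # use the handler names already defined in your log config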

@artemislena
Author

artemislena commented Jan 20, 2021

Oh, sure. Decided to use a pastebin because of the amount of output this generated…
https://privatebin.net/?bded4619a5024ad8#5kMm4FuxoW6SnAw853QgFumhD98rrx8M3qL9vc2Duw2Q
Edit: Due to the 24-hour time limit on the other one, here's another: https://commie.io/#UsPeL5BB

@anoadragon453
Member

(I had some difficulty loading the commie.io link, so here's a GitHub Gist mirror just in case.)

@anoadragon453 anoadragon453 added T-Other Questions, user support, anything else. X-Needs-Discussion labels Jan 21, 2021
@artemislena
Author

You have to wait a moment (maybe 30 seconds or so) until it loads the text… But sure, appreciated.

@artemislena
Author

I also tried setting up a federation sender worker, which did not help fix the issue, and seems to roughly give me the same logs. If those are needed, let me know.

@artemislena
Author

@clokep Btw, I looked at the issue you linked (#4993)… While the error message is the same, overall it seems to be a different issue, AFAICT. What I get is an error with a full backtrace, whereas the issue you linked only ever showed warnings.

@anoadragon453
Member

I believe "could not serialize access due to concurrent update" simply means that the worker was unable to get a spare connection to the database. However, the transaction will be retried automatically until a connection is free.

It's possible that this error isn't caught in the process_event_queue_for_federation background process and is causing it to exit early (this may be a bug). However, that background process is run in a loop, and should re-run a few moments later.

From your logs I see a lot of transactions succeeding, with 200s returned by the remote server.

Interestingly I don't see any transactions to matrix.org. Is that perhaps the other homeserver you were testing with?

@anoadragon453 anoadragon453 added X-Needs-Info This issue is blocked awaiting information from the reporter and removed X-Needs-Discussion labels Jan 26, 2021
@artemislena
Author

Various users of my server and I tested several different servers in total. I myself mainly tested with privacytools.io… Jonah (@jonaharagon), the admin of aragon.sh, also got an account on my server and performed some tests, concluding that it "doesn't make any sense": events from my server were clearly showing up in the logs of his server, yet no messages from it were visible.

@anoadragon453
Member

Hmm, I can see transactions going through to aragon.sh in your logs.

I can also see transactions from fantasycookie17.onederfultech.com on matrix.org with no errors, and your events don't look to be soft-failing. Perhaps @jonaharagon can query their database for the IDs of the missing events and see if any are marked as outliers or are rejected?

-- Check whether the event is marked as an outlier
select event_id, processed, outlier from events where event_id = '$xxx';
-- Check whether the event has been soft-failed
select event_id, internal_metadata from event_json where event_id = '$xxx';
-- Check for any rejected events from the homeserver
select * from rejections where event_id like '%fantasycookie17.onederfultech.com';

To see whether it helps with the original "could not serialize access due to concurrent update" error, you can try raising the database.cp_max value in your homeserver config. That value is the cap on the number of Postgres connections Synapse will open at any one time; raising it (say, to 15 or 20) will reduce connection contention.
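For reference, a minimal sketch of where that option lives, assuming a typical Postgres database block in homeserver.yaml (the credentials are placeholders, and the connection-pool options sit under args):

database:
  name: psycopg2
  args:
    user: synapse_user          # placeholder credentials
    password: secretpassword
    database: synapse
    host: localhost
    cp_min: 5
    cp_max: 20                  # raised from the default of 10 to reduce contention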

If that doesn't alleviate the issue, then I'm at a bit of a dead end. You could try PM'ing me from the affected account at @andrewm:amorgan.xyz and I can try some further troubleshooting.

@artemislena
Author

Another error that appeared in my fedsender.log:

2021-01-27 21:06:48,088 - synapse.replication.tcp.handler - 260 - ERROR - process-replication-data-3 - Failed to handle command <synapse.replication.tcp.commands.PositionCommand object at 0x7feb4ffcea30>
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/hyperlink/_url.py", line 1142, in from_text
    port = int(port)
ValueError: invalid literal for int() with base 10: ':1:None'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/synapse/replication/http/_base.py", line 235, in send_request
    result = await request_func(uri, data, headers=headers)
  File "/usr/lib/python3.9/site-packages/synapse/http/client.py", line 570, in get_json
    body = await self.get_raw(uri, args, headers=headers)
  File "/usr/lib/python3.9/site-packages/synapse/http/client.py", line 655, in get_raw
    response = await self.request("GET", uri, headers=Headers(actual_headers))
  File "/usr/lib/python3.9/site-packages/synapse/http/client.py", line 404, in request
    request_deferred = treq.request(
  File "/usr/lib/python3.9/site-packages/treq/api.py", line 119, in request
    return _client(**kwargs).request(method, url, **kwargs)
  File "/usr/lib/python3.9/site-packages/treq/client.py", line 157, in request
    url = URL.fromText(url).asURI().asText().encode('ascii')
  File "/usr/lib/python3.9/site-packages/hyperlink/_url.py", line 1502, in fromText
    return cls.from_text(s)
  File "/usr/lib/python3.9/site-packages/hyperlink/_url.py", line 1146, in from_text
    raise URLParseError('expected integer for port, not %r' % port)
hyperlink._url.URLParseError: expected integer for port, not ':1:None'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/synapse/replication/tcp/handler.py", line 258, in _unsafe_process_queue
    await self._process_command(cmd, conn, stream_name)
  File "/usr/lib/python3.9/site-packages/synapse/replication/tcp/handler.py", line 271, in _process_command
    await self._process_position(stream_name, conn, cmd)
  File "/usr/lib/python3.9/site-packages/synapse/replication/tcp/handler.py", line 554, in _process_position
    (updates, current_token, missing_updates) = await stream.get_updates_since(
  File "/usr/lib/python3.9/site-packages/synapse/replication/tcp/streams/_base.py", line 185, in get_updates_since
    updates, upto_token, limited = await self.update_function(
  File "/usr/lib/python3.9/site-packages/synapse/replication/tcp/streams/_base.py", line 211, in update_function
    result = await client(
  File "/usr/lib/python3.9/site-packages/synapse/replication/http/_base.py", line 254, in send_request
    raise SynapseError(502, "Failed to talk to main process") from e
synapse.api.errors.SynapseError: 502: Failed to talk to main process
2021-01-27 21:06:48,092 - synapse.replication.tcp.handler - 260 - ERROR - process-replication-data-11 - Failed to handle command <synapse.replication.tcp.commands.PositionCommand object at 0x7feb4ffcea30>
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/hyperlink/_url.py", line 1142, in from_text
    port = int(port)
ValueError: invalid literal for int() with base 10: ':1:None'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/synapse/replication/http/_base.py", line 235, in send_request
    result = await request_func(uri, data, headers=headers)
  File "/usr/lib/python3.9/site-packages/synapse/http/client.py", line 570, in get_json
    body = await self.get_raw(uri, args, headers=headers)
  File "/usr/lib/python3.9/site-packages/synapse/http/client.py", line 655, in get_raw
    response = await self.request("GET", uri, headers=Headers(actual_headers))
  File "/usr/lib/python3.9/site-packages/synapse/http/client.py", line 404, in request
    request_deferred = treq.request(
  File "/usr/lib/python3.9/site-packages/treq/api.py", line 119, in request
    return _client(**kwargs).request(method, url, **kwargs)
  File "/usr/lib/python3.9/site-packages/treq/client.py", line 157, in request
    url = URL.fromText(url).asURI().asText().encode('ascii')
  File "/usr/lib/python3.9/site-packages/hyperlink/_url.py", line 1502, in fromText
    return cls.from_text(s)
  File "/usr/lib/python3.9/site-packages/hyperlink/_url.py", line 1146, in from_text
    raise URLParseError('expected integer for port, not %r' % port)
hyperlink._url.URLParseError: expected integer for port, not ':1:None'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/synapse/replication/tcp/handler.py", line 258, in _unsafe_process_queue
    await self._process_command(cmd, conn, stream_name)
  File "/usr/lib/python3.9/site-packages/synapse/replication/tcp/handler.py", line 271, in _process_command
    await self._process_position(stream_name, conn, cmd)
  File "/usr/lib/python3.9/site-packages/synapse/replication/tcp/handler.py", line 554, in _process_position
    (updates, current_token, missing_updates) = await stream.get_updates_since(
  File "/usr/lib/python3.9/site-packages/synapse/replication/tcp/streams/_base.py", line 185, in get_updates_since
    updates, upto_token, limited = await self.update_function(
  File "/usr/lib/python3.9/site-packages/synapse/replication/tcp/streams/_base.py", line 211, in update_function
    result = await client(
  File "/usr/lib/python3.9/site-packages/synapse/replication/http/_base.py", line 254, in send_request
    raise SynapseError(502, "Failed to talk to main process") from e
synapse.api.errors.SynapseError: 502: Failed to talk to main process

I raised cp_max to 20 and I still get the original error… Anyway, I DMed you. And I notified Jonah…

@auscompgeek
Contributor

That error looks like the worker config is broken. Can you share your config?

@artemislena
Author

Well, I would; however, it seems like my server works again, for whatever reason… Messages have been arriving at other servers… This also means the config can't be broken, as I have send_federation: false in my homeserver.yaml, so only the worker could be sending these events. Idk, if nobody else is experiencing this issue, we can probably close this?

@anoadragon453
Member

I think @auscompgeek means that your worker is unable to talk back to the main Synapse process due to a misconfiguration in the worker_replication_* options in your federation sender's worker config.

If those options are misconfigured, federation traffic can still be sent, but things may break in subtle ways.

@artemislena
Author

Oh. I see.

worker_app: synapse.app.federation_sender
worker_name: fedsender

worker_replication_host: 127.0.0.1
worker_replication_port: 9093

worker_log_config: /etc/synapse/fedsender.log.config

That's my federation_sender.yaml. Do you need anything from the homeserver.yaml?

@anoadragon453
Member

anoadragon453 commented Jan 28, 2021

Nope, that's ok. I'm realising that this is probably something we should really bold in the worker documentation: worker_replication_port should be worker_replication_http_port.

Edit: or we should at least shout loudly if one hasn't included worker_replication_http_port in their worker config.
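In other words, the federation_sender.yaml posted above would become (host, port, and paths unchanged from that config):

worker_app: synapse.app.federation_sender
worker_name: fedsender

worker_replication_host: 127.0.0.1
worker_replication_http_port: 9093

worker_log_config: /etc/synapse/fedsender.log.config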

@artemislena
Author

Ah. My bad, I guess? Anyway, changed that now, and I'm not getting any errors anymore.

@anoadragon453
Member

Glad to hear it, and we should really fail in a louder way if that happens. The most important thing is that it's working now 🙂

Closing this issue then.
