Method `_get_state_map_for_room` seems to break in presence of some ill-formed events in the database. Reimplementing this method to use `get_current_state`, which is more robust to such events.
This happened if we encountered a stream ordering in `event_push_actions` that had more rows than the batch size of the delete, as If we don't delete any rows in an iteration then the next time round we get the exact same stream ordering and get stuck.
Whenever we want to persist an event, we first compute an event context,
which includes the state at the event and a flag indicating whether the
state is partial. After a lot of processing, we finally try to store the
event in the database, which can fail for partial state events when the
containing room has been un-partial stated in the meantime.
We detect the race as a foreign key constraint failure in the data store
layer and turn it into a special `PartialStateConflictError` exception,
which makes its way up to the method in which we computed the event
context.
To make things difficult, the exception needs to cross a replication
request: `/fed_send_events` for events coming over federation and
`/send_event` for events from clients. We transport the
`PartialStateConflictError` as a `409 Conflict` over replication and
turn `409`s back into `PartialStateConflictError`s on the worker making
the request.
All client events go through
`EventCreationHandler.handle_new_client_event`, which is called in
*a lot* of places. Instead of trying to update all the code which
creates client events, we turn the `PartialStateConflictError` into a
`429 Too Many Requests` in
`EventCreationHandler.handle_new_client_event` and hope that clients
take it as a hint to retry their request.
On the federation event side, there are 7 places which compute event
contexts. 4 of them use outlier event contexts:
`FederationEventHandler._auth_and_persist_outliers_inner`,
`FederationHandler.do_knock`, `FederationHandler.on_invite_request` and
`FederationHandler.do_remotely_reject_invite`. These events won't have
the partial state flag, so we do not need to do anything for then.
The remaining 3 paths which create events are
`FederationEventHandler.process_remote_join`,
`FederationEventHandler.on_send_membership_event` and
`FederationEventHandler._process_received_pdu`.
We can't experience the race in `process_remote_join`, unless we're
handling an additional join into a partial state room, which currently
blocks, so we make no attempt to handle it correctly.
`on_send_membership_event` is only called by
`FederationServer._on_send_membership_event`, so we catch the
`PartialStateConflictError` there and retry just once.
`_process_received_pdu` is called by `on_receive_pdu` for incoming
events and `_process_pulled_event` for backfill. The latter should never
try to persist partial state events, so we ignore it. We catch the
`PartialStateConflictError` in `on_receive_pdu` and retry just once.
Refering to the graph of code paths in
https://github.com/matrix-org/synapse/issues/12988#issuecomment-1156857648
may make the above make more sense.
Signed-off-by: Sean Quah <seanq@matrix.org>
* Cast to postgres types when handling postgres db
* Remove unused method
* Easy annotations
* Annotate create_room
* Use `ParamSpec` to annotate looping_call
* Annotate `default_config`
* Track `now` as a float
`time_ms` returns an int like the proper Synapse `Clock`
* Introduce a `Timer` dataclass
* Introduce a Looper type
* Suppress checking of a mock
* tests.utils is typed
* Changelog
* Whoops, import ParamSpec from typing_extensions
* ditch the psycopg2 casts
==============================
Bugfixes
--------
- Fix unread counts for users on large servers. Introduced in v1.62.0rc1. ([\#13140](https://github.com/matrix-org/synapse/issues/13140))
- Fix DB performance when deleting old push notifications. Introduced in v1.62.0rc1. ([\#13141](https://github.com/matrix-org/synapse/issues/13141))
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEgQG31Z317NrSMt0QiISIDS7+X/QFAmK+0FYACgkQiISIDS7+
X/RE7w//RJTQD+9rqanBj9IxE07Vy6nbMxRoxMbhdj9DMidepxNalxg9MqkhZXJC
AJquNIVhgMLs7HDO0KAu6xePEjr/E2/kmUJZwZX6cgeh8/Yp2NcSpWp4kq+gc2IJ
vtc5GZ5tgyvQ8UJB6xozL62g3aDcCtXppRJKDx+OkBgWWrczHg+zGi6XkLfsL86L
0yBgyp57cLRIyZ97isvAH7BEYF/vSVVvpjzA/m61LdftfMalTqgpUUt89rqWmbyw
iqcaspRTukHXpeCwFyvrpZviMl82wVTN3K2rUZAGisS6Xuht7K30I7uszIPx3gd/
S47fav2onOpAwSHni2IbZa+cLsz7vEaNFIj21/QS+wZvkFpbvf17m8AIxb7y31i4
cBKJC2qRyVFlQyGYNi3yZ4V1jY2nLV+/lC9Z1epYH5EmXAKSxHvxNL609QLaMovA
OPv/wgPESQnvxpHc9qqzGh9LJ8O+YaaPY4t2MB4Kf5CkIf25yOuDjvzUcJKiM3G8
ytwEdOipw80WozyfUuivz5F5skJ3ay8OvUeL/AxlA0k+2Q5nThyLO7LODA5pZSlO
vHLvIhp8FDdoFLwRtBiGfC42cvjVEsjiYD46M7uYQCGVVdu1/UegUoaA0beFwbHI
wFyFOTGip63nwAtsSAgqkq/yyBigHJsb9Z37X+sMRweY13337nM=
=PH+y
-----END PGP SIGNATURE-----
Merge tag 'v1.62.0rc2' into develop
Synapse 1.62.0rc2 (2022-07-01)
==============================
Bugfixes
--------
- Fix unread counts for users on large servers. Introduced in v1.62.0rc1. ([\#13140](https://github.com/matrix-org/synapse/issues/13140))
- Fix DB performance when deleting old push notifications. Introduced in v1.62.0rc1. ([\#13141](https://github.com/matrix-org/synapse/issues/13141))
When we receive an event over federation during a faster join, there is no need
to wait for full state, since we have a whole reconciliation process designed
to take the partial state into account.