# Matrix Protocol

### Introduction

*The authoritative place for learning about matrix is [matrix.org](https://matrix.org), but
it may be worthwhile to spend a moment on this introduction, which explains things
by distilling the formal core of the protocol before introducing all of the networking and
communicative accoutrements...*

#### Identity

The Matrix-ID or `mxid` is a universally unique plain-text string which allows
an entity to be addressed internet-wide. This is fundamental to the matrix
federation, in contrast to the traditional IRC server/network. An example of an
mxid is "@user:host", where `host` is a public DNS name, `user` is a party to
`host`, and the leading '@' character can be replaced to convey type information.
This character, called a `sigil`, is defined to be '!' for `room_id` identifiers,
'$' for `event_id` identifiers, '#' for room aliases, and '@' for users.

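The sigil convention above can be sketched in a few lines. This is a minimal illustration, not this project's parser: the names `SIGILS`, `Mxid` and `parse_mxid` are hypothetical, and the only grammar assumed is `<sigil>local:host` as described in the paragraph above.

```python
from typing import NamedTuple

# Sigil-to-type table taken from the paragraph above.
SIGILS = {"@": "user", "!": "room_id", "$": "event_id", "#": "room_alias"}


class Mxid(NamedTuple):
    kind: str        # type conveyed by the sigil
    localpart: str   # the `user` portion of "@user:host"
    host: str        # the `host` portion of "@user:host"


def parse_mxid(mxid: str) -> Mxid:
    """Split an identifier of the assumed form <sigil>local:host."""
    if not mxid or mxid[0] not in SIGILS:
        raise ValueError(f"unknown sigil in {mxid!r}")
    # Split on the first ':' only, so a host with a port stays intact.
    local, sep, host = mxid[1:].partition(":")
    if not sep or not local or not host:
        raise ValueError(f"expected <sigil>local:host, got {mxid!r}")
    return Mxid(SIGILS[mxid[0]], local, host)
```

For example, `parse_mxid("!abc:host")` yields a `room_id` whose host is `host`, while a string without a recognized sigil raises `ValueError`.
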
#### Event

The fundamental primitive of this protocol is the `event` object. This object
contains some set of key/value pairs, and the protocol defines a list of keys
which are meaningful to it. Other keys which are not meaningful to the
protocol can be included directly in the `event` object, but there are no
guarantees for if and how a party will pass these keys. To dive right in, here's
the list of recognized keys for an `event`:

```
auth_events
content
depth
event_id
hashes
membership
origin
origin_server_ts
prev_events
prev_state
room_id
sender
signatures
state_key
type
```

In the event structure, the values for `sender`, `room_id` and `event_id` are
all an `mxid` of the appropriate type.

The `event` object is also the *only* fundamental primitive of the protocol; in other
words: everything is an `event`. All information is conveyed in events, and governed
by rules for proper values behind these keys. The rest of the protocol specification
describes an *abstract state machine* which has its state updated by an event, in
addition to providing a standard means for communication of events between parties
over the internet. That's it.

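As a rough illustration of the key list above, the sketch below separates the recognized keys of an event from any extra ones. The helper `split_keys` and every value in the sample event are made up for this example; only the key names come from the list above.

```python
# Keys listed in the section above.
RECOGNIZED_KEYS = frozenset({
    "auth_events", "content", "depth", "event_id", "hashes", "membership",
    "origin", "origin_server_ts", "prev_events", "prev_state", "room_id",
    "sender", "signatures", "state_key", "type",
})


def split_keys(event):
    """Partition an event into (recognized, unrecognized) key/value pairs."""
    known = {k: v for k, v in event.items() if k in RECOGNIZED_KEYS}
    extra = {k: v for k, v in event.items() if k not in RECOGNIZED_KEYS}
    return known, extra


# Hypothetical event; all values are invented for illustration.
event = {
    "type": "m.room.message",
    "room_id": "!room:host",
    "sender": "@user:host",
    "event_id": "$event:host",
    "depth": 3,
    "content": {"body": "hello"},
    "x_custom": True,  # not a recognized key; other parties may not pass it on
}

known, extra = split_keys(event)
```

Here `extra` holds only `x_custom`, the one key for which the protocol makes no guarantee.
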
#### Timeline

The data tape of the matrix machine consists of a singly-linked list of `event`
objects, each referencing the `event_id` of its preceding parent somewhere
in the `prev_` keys; this is called the `timeline`. Each event is signed by its
creator and affirms all referenced events preceding it. This structure is very
similar to that used by software like Git and Bitcoin. It allows looking back
into the past from any point, but doesn't force a party to accept a future, and
it leaves dispute resolution open-ended (which will be explained later).

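Looking back into the past from any point can be sketched as a walk over the `prev_` references. This is only an illustration under stated assumptions: events are plain dicts, `prev_events` is assumed to be a list of `event_id` strings, and `ancestors` is a hypothetical helper name.

```python
def ancestors(events, start_id):
    """Return the event_ids reachable from start_id, in discovery order.

    `events` maps event_id -> event dict. References to events that are
    not (yet) known locally are skipped.
    """
    seen, order, stack = set(), [], [start_id]
    while stack:
        event_id = stack.pop()
        if event_id in seen or event_id not in events:
            continue
        seen.add(event_id)
        order.append(event_id)
        stack.extend(events[event_id].get("prev_events", []))
    return order


# A three-event chain: $0 <- $1 <- $2 (hypothetical identifiers).
events = {
    "$0:host": {"event_id": "$0:host", "prev_events": []},
    "$1:host": {"event_id": "$1:host", "prev_events": ["$0:host"]},
    "$2:host": {"event_id": "$2:host", "prev_events": ["$1:host"]},
}
history = ancestors(events, "$2:host")
```

Starting from `$1:host` instead would never reveal `$2:host`: references only point toward the past.
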
#### State

The `state` consists of a subset of events which are accumulated according to a
few rules when playing the tape through the machine. Events which are selected
as `state` will overwrite a matching previously selected `state event` and thus
reduce the number of events in this set to far fewer than the entire `timeline`.
The `state` is then used to satisfy queries for deciding valid transitions for
the machine. This is like the "work tree" in Git when positioned at some commit.

* Events with a `state_key` are considered state.

* The identity of a `state event` is the concatenation of the `room_id`
value with the `type` value with the `state_key` value. Thus an event
with the same `room_id, type, state_key` replaces an older event in `state`.

* Some `state_key` values are empty strings `""`. This is a convention for
singleton `state` events, like an `m.room.create` event. Otherwise the `state_key`
is used to represent a set, like with `m.room.member` events, where the
value of the `state_key` is a user `mxid`.

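The three rules above amount to a small fold over the timeline. The following is a minimal sketch assuming events are dicts and the timeline is already in play order; `accumulate_state` is a hypothetical name and the sample events are invented.

```python
def accumulate_state(timeline):
    """Play the timeline in order and return the resulting state map."""
    state = {}
    for event in timeline:
        if "state_key" not in event:
            continue  # no state_key: not a state event
        # Identity of a state event: (room_id, type, state_key).
        key = (event["room_id"], event["type"], event["state_key"])
        state[key] = event  # a newer event replaces the older one
    return state


room = "!room:host"
timeline = [
    {"room_id": room, "type": "m.room.create", "state_key": "", "content": {}},
    {"room_id": room, "type": "m.room.member", "state_key": "@bob:host",
     "content": {"membership": "join"}},
    {"room_id": room, "type": "m.room.message", "content": {"body": "hi"}},
    {"room_id": room, "type": "m.room.member", "state_key": "@bob:host",
     "content": {"membership": "leave"}},
]
state = accumulate_state(timeline)
```

Four events go in, but the state holds only two entries: the singleton `m.room.create` and the latest `m.room.member` event for `@bob:host`.
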
#### Rooms

The `room` structure encapsulates an instance of the matrix machine. A room
is a container of `event` objects in the form of a timeline. The query
complexity for information in a room timeline is as follows:

- Ephemeral (non-state) events in the timeline have a linear lookup time:
the timeline must be iterated in sequence to find a satisfying message.

- State events in the timeline have a logarithmic lookup: the implementation
is expected to maintain a map of the `type`,`state_key` values for events
present in the timeline.

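The two lookup costs can be contrasted with a short sketch. It assumes events are dicts; `find_in_timeline` and `StateIndex` are hypothetical names, and a sorted list with binary search stands in for whatever ordered map a real implementation would use. Lookups here are logarithmic; inserting into a Python list is not, which a tree-based map would avoid.

```python
import bisect


def find_in_timeline(timeline, predicate):
    """Linear scan, newest first, for the first event satisfying predicate."""
    for event in reversed(timeline):
        if predicate(event):
            return event
    return None


class StateIndex:
    """Sorted index of (type, state_key) -> event with binary-search lookup."""

    def __init__(self):
        self._keys = []    # sorted list of (type, state_key) tuples
        self._events = []  # events stored parallel to _keys

    def put(self, event):
        key = (event["type"], event["state_key"])
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._events[i] = event       # replace the older state event
        else:
            self._keys.insert(i, key)
            self._events.insert(i, event)

    def get(self, type_, state_key):
        key = (type_, state_key)
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._events[i]
        return None
```

A message is found with `find_in_timeline(timeline, lambda e: ...)`, touching every event in the worst case, while `index.get("m.room.name", "")` needs only a handful of comparisons regardless of how long the timeline has grown.
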
The matrix protocol specifies certain `event` types which are recognized to
affect the behavior of the `room`; here is a list of some types:

```
m.room.name
m.room.create
m.room.topic
m.room.avatar
m.room.aliases
m.room.canonical_alias
m.room.join_rules
m.room.power_levels
m.room.member
m.room.message
...
```

Some of these events are state events and some are ephemeral (these will be
detailed later). All `m.room.*` namespaced events govern the functionality of the
room. Rooms may contain events of any `type`, but we don't invent new `m.room.*`
type events ourselves. This project tends to create events in the `ircd.*`
namespace. These events should not alter the room's functionality in ways that a
client with knowledge of only the published `m.room.*` events wouldn't understand.

#### Coherence

Matrix is specified as a directed acyclic graph of messages. The conversation of
messages moves in one direction: past to future. Messages only reference other
messages which have a lower degree of separation, indicated by the `depth`, from
the first message in the graph (where `type` was `m.room.create`). Specifically,
each message makes a reference to all known messages at the last `depth`, or all
previously unknown messages at some lower `depth`. Each new message is broadcast
to all participants in a room.

* The monotonic increase in `depth` contributes to an intuitive "light cone"
read coherence. Knowledge of any piece of information (like an event) offers
strongly ordered knowledge of all known information which preceded it at
that point.

* Write consistency is relaxed. Multiple messages may be issued at the same
depth from independent actors and multiple reference trees may form
independent of others. This relaxation is what lets performance scale in a
large distributed internet system.

References to previous events:

```
[A0] <-- [A1] <-- [A2]     | A has seen B1 and includes a reference in A2
 ^                 |
 |        <---<----<
 |        |
 ^------ [B1] <-- [B2]     | B hasn't yet seen A1 or A2

[T0]  A release A0  :
[T1]  A release A1  :  B acquire A0
[T2]                :  B release B1
[T3]  A acquire B1  :  B release B2
[T4]  A release A2  :
```

Both actors will now have their clock (depth) set to 2 and will issue the
next new message at clock cycle 3, referencing all messages from cycle 2 to
merge the split shown in the illustration above.

```
[A0] <-- [A1] <-- [A2]           [A4]    | A now sees B3, B2, and B1
 ^                 | |            |
 |        <---<----< ^--<--<   <--<
 |        |                |   |
 ^------- [B1] <-- [B2] <-- [B3]         | B now sees A2, A1, and A0
```

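The T0 to T4 sequence illustrated above can be replayed with a toy model. This is a sketch under simplifying assumptions, not the protocol's exact rule: each new event references every known event that nothing else known references yet (the "heads"), and its depth is one more than the deepest head. The `Actor` class and its method names are hypothetical.

```python
class Actor:
    """A party that knows some events and issues new ones."""

    def __init__(self, name):
        self.name = name
        self.known = {}  # event_id -> event

    def heads(self):
        """Known events which no other known event references."""
        referenced = {p for e in self.known.values() for p in e["prev_events"]}
        return sorted(eid for eid in self.known if eid not in referenced)

    def release(self, event_id):
        """Issue a new event referencing all current heads."""
        prev = self.heads()
        depth = 1 + max((self.known[p]["depth"] for p in prev), default=-1)
        event = {"event_id": event_id, "prev_events": prev, "depth": depth}
        self.known[event_id] = event
        return event

    def acquire(self, event):
        """Learn of an event issued by someone else."""
        self.known[event["event_id"]] = event


a, b = Actor("A"), Actor("B")
a0 = a.release("A0")                  # T0
a1 = a.release("A1"); b.acquire(a0)   # T1
b1 = b.release("B1")                  # T2
a.acquire(b1); b2 = b.release("B2")   # T3
a2 = a.release("A2")                  # T4
```

After T4, `a2` references both `A1` and `B1`, and both `a2` and `b2` sit at depth 2. If A then acquires `B2`, its next release references `A2` and `B2` at depth 3, merging the split.
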
Keen observers may have realized by now that this system is not fully coherent.
To be coherent, a system must leverage *entry consistency* and/or *release
consistency*. Translated to this system:

* *Entry* is the point where an event is created containing references to
all previous events. *Entry consistency* would mean that the knowledge
of all those references is revealed from all parties to the issuer such that
the issuer would not be issuing a conflicting event.

* *Release* is the act of broadcasting that event to other servers. *Release
consistency* would mean that the integration of the newly issued event does not
conflict at the point of acceptance by each and every party.

This system appears to strive for *eventual consistency*. To be pedantic, that
is not a third lemma supplementing the above: it's a higher order composite (like
mutual exclusion, or other algorithms). What this system wants to achieve is a
byzantine tolerance which can be continuously corrected as more information is
learned. This is a *tolerance*, not a *prevention*, because the relaxed write
consistency is of extreme practical importance.

For *eventual consistency* to be coherent, the "seeds" of a correction have to
be planted early on, before any fault. When the fault occurs, all deviations
can be corrected toward some single coherent state as each party learns more
information. Once all parties learn all information from the system, there is
no possibility for incoherence. The caveat is that some parties may need to
roll back certain decisions they made without complete information.

Suppose `Alice` is a room founder and the room has one other member, `Bob`,
who is an op. `Alice` outranks `Bob`. Consider the following scenario:

> 1. `Charlie` joins the room. Now the room has three members. Everyone is
> still in full agreement.
>
> 2. `GNAA` ddos's `Alice` so she can't reach the internet but she can still
> use her server on her LAN.
>
> 3. `Alice` likes `Charlie` so she gives him `+e` or some ban immunity.
>
> 4. `Bob` doesn't like `Charlie` so he bans him.

Now there is a classic byzantine fault. The internet sees a room with two
members, `Alice` and `Bob`, again, while `Alice` sees a room with three: `Alice`,
`Bob` and `Charlie`.

> 5. `GNAA` stops the ddos.

This fault now has to be resolved. This is called "state conflict resolution"
and the matrix specification does not know how to do this. What is currently
specified is that `Alice` and `Bob` can only perform actions that are valid
with the knowledge they had when they performed them. In fact, that was true
in this scenario.

Intuitively, `Alice` needs to dominate the resolution because `Alice` outranks
`Bob`. `Charlie` must not be banned and the room must continue with three
members. Exactly how to roll back the ban and reinstate `Charlie` may seem
obvious, but there are practicalities to consider: perhaps `Alice` is ddosed for
something like a year straight and `Charlie` has entirely given up on socializing
over the internet. A seemingly random and irrelevant correction will be in store
for the room and the effects might be far more complicated.

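The intuition that `Alice` dominates the resolution can be made concrete with a toy model. Everything below is hypothetical and is not an algorithm from the matrix specification or from this project: the `action` and `target` fields, the numeric ranks, and the rule "replay concurrent events from the highest-ranked sender first" are all assumptions made purely for illustration.

```python
def resolve_fork(events, power):
    """Replay concurrent events, highest-ranked sender first.

    Returns (banned users, event_ids rejected during the replay).
    """
    # sorted() is stable, so equally ranked senders keep their input order.
    ordered = sorted(events, key=lambda e: -power[e["sender"]])
    immune, banned, rejected = set(), set(), []
    for event in ordered:
        if event["action"] == "immunize":
            immune.add(event["target"])
        elif event["action"] == "ban":
            if event["target"] in immune:
                rejected.append(event["event_id"])  # rolled back
            else:
                banned.add(event["target"])
    return banned, rejected


power = {"@alice:host": 100, "@bob:host": 50}  # made-up ranks
fork = [
    {"event_id": "$ban", "sender": "@bob:host",
     "action": "ban", "target": "@charlie:host"},
    {"event_id": "$immune", "sender": "@alice:host",
     "action": "immunize", "target": "@charlie:host"},
]
banned, rejected = resolve_fork(fork, power)
```

With these inputs the ban is rejected and nobody ends up banned, matching the intuitive outcome described above. The sketch deliberately ignores the practical problems just mentioned, such as a correction that arrives a year late.
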
### Implementation

#### Model

This system embraces the fact that "everything is an event." It then follows
that everything is a room. We use rooms for both communication and storage of
everything.

There is only one† backend database and it stores events. For example: there
is no "user accounts database" holding all of the user data for the server;
instead there is an `!accounts` *room*. To use these rooms as efficient
databases, we categorize a piece of data with an event `type`, key it with
the event `state_key`, and store the value in the event `content`. Iteration of
these events is also possible. This is now a sufficient key-value store, as good
as any other approach; better though, since such a databasing room retains all
features and distributed capabilities of any other room. We then focus our
efforts on optimizing the behavior of a room, to the benefit of all rooms, and
all things.

† Under special circumstances other databases may exist, but they are purely
slave to the events database: i.e. one could `rm -rf` a slave database and it
would be rebuilt from the events database. These databases only exist if an
event is *truly* inappropriate and doesn't fit the model even by a stretch.
An example of this is the search-terms database, which specializes in indexing
individual words to the events where they are found so content searches can be
efficient.

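The room-as-database idea can be sketched as a thin key-value wrapper over events. This is an illustration only: `RoomStore`, its methods, the `!accounts:host` identifier and the `ircd.example.account` type are hypothetical names chosen for the example, not this project's API or schema.

```python
class RoomStore:
    """Key-value view of a room: (type, state_key) -> content."""

    def __init__(self, room_id):
        self.room_id = room_id
        self.timeline = []  # every event ever written, in order
        self._state = {}    # (type, state_key) -> latest event

    def set(self, type_, state_key, content):
        event = {"room_id": self.room_id, "type": type_,
                 "state_key": state_key, "content": content}
        self.timeline.append(event)            # the write is just an event
        self._state[(type_, state_key)] = event
        return event

    def get(self, type_, state_key):
        event = self._state.get((type_, state_key))
        return None if event is None else event["content"]

    def items(self, type_):
        """Iterate (state_key, content) for one category of data."""
        pairs = [(k[1], e["content"])
                 for k, e in self._state.items() if k[0] == type_]
        return sorted(pairs, key=lambda pair: pair[0])


accounts = RoomStore("!accounts:host")
accounts.set("ircd.example.account", "@alice:host", {"active": True})
accounts.set("ircd.example.account", "@bob:host", {"active": True})
accounts.set("ircd.example.account", "@bob:host", {"active": False})
```

`accounts.get("ircd.example.account", "@bob:host")` returns the latest value, while the timeline still holds all three writes, so the history that any other room would keep is kept here too.
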
#### Flow

This is a single-writer/multiple-reader approach. The "core" is the only writer.
The write itself is just the saving of an event. This serves as a transaction
advancing the state of the machine with effects visible to all future
transactions and external actors.

The core takes the pattern of
`evaluate + exclude -> write commitment -> release sequence`. The single-writer
approach means that we resolve all incoherence using exclusion, reordering
or rejection on entry, before any writing and release of the event. Many
`ircd::ctx`'s can orbit the inner core resolving their evaluation, with the
tightest exclusion occurring around the write at the inner core. This also
gives us the benefit of a total serialization at this point.

           :::::::
           |||||||       <-- evaluation + rejection
             \|/         <-- evaluation + exclusion / reordering
              !
              *          <-- actor serialized core write commitment
           //|||\\
         //|// \\|\\
        :::::::::::::    <-- release sequence propagation cone

The evaluation phase ensures the event commitment will work: that the event
is valid, and that the event is a valid transition of the machine according
to the rules. This process may take some time and involve many yields and IO,
even network IO, if the server lacks a warm cache. During the evaluation phase
locks and exclusions may be acquired to maintain the validity of the
evaluation state through writing, at the expense of other contexts contending
for that resource.

> Many `ircd::ctx` are concurrently working their way through the core. The
> "velocity" is low when an `ircd::ctx` on this path may yield a lot for various
> IO and allow other events to be processed. The velocity increases when
> concurrent evaluation and reordering is no longer viable to maintain
> coherence. Any yielding of an `ircd::ctx` at a higher velocity risks stalling
> the whole core.

           :::::::       <-- event input (low velocity)
           |||||||       <-- evaluation process (low velocity)
             \|/         <-- serialization process (higher velocity)

The write commitment saves the event to the database. This is a relatively
fast operation which probably won't even yield the `ircd::ctx`, and all
future reads to the database will see this write.

              !          <-- serial write commitment (highest velocity)

The release sequence broadcasts the event so its effects can be consumed.
This works by yielding the `ircd::ctx` so all consumers can view the event
and apply its effects for their feature module or send the event out to
clients. This is usually faster than it sounds, as the consumers try not to
hold up the release sequence for more than their first execution-slice,
and copy the event if their output rate is slower.

              *          <-- event revelation (higher velocity)
           //|||\\
         //|// \\|\\
        :::::::::::::    <-- release sequence propagation cone (low velocity)

The entire core commitment process, relative to an event riding through it
on an `ircd::ctx`, has a duration tolerable for something like a REST interface,
so the response to the user can wait for the commitment to succeed or fail
and properly inform them after.

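The `evaluate + exclude -> write commitment -> release sequence` pattern described above can be approximated with coroutines. This is a loose analogy rather than the project's implementation: Python `asyncio` tasks stand in for `ircd::ctx`, a list stands in for the database, and the `Core` class, its methods and the validity rule are all invented for the sketch.

```python
import asyncio


class Core:
    def __init__(self):
        self.database = []                # committed events, in commit order
        self.write_lock = asyncio.Lock()  # tightest exclusion: the write
        self.consumers = []               # callables run on release

    async def evaluate(self, event):
        await asyncio.sleep(0)            # stands in for yields and IO
        if "type" not in event:           # made-up validity rule
            raise ValueError("rejected: event has no type")

    async def commit(self, event):
        await self.evaluate(event)        # many contexts may be in here
        async with self.write_lock:       # single serialized writer
            self.database.append(event)
            sequence = len(self.database)
        for consumer in self.consumers:   # release sequence
            consumer(event)
        return sequence                   # result back to the evaluator


async def main():
    core = Core()
    seen = []
    core.consumers.append(seen.append)
    results = await asyncio.gather(
        core.commit({"type": "m.room.message", "n": 1}),
        core.commit({"type": "m.room.message", "n": 2}),
        core.commit({"n": 3}),            # fails evaluation
        return_exceptions=True,
    )
    return core, seen, results


core, seen, results = asyncio.run(main())
```

All three events are evaluated concurrently, the invalid one is rejected before it reaches the write, and each caller learns the outcome only after its commitment has succeeded or failed, much as a REST handler would.
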
The core process is then optimized by the following facts:

* The resource exclusion zone around most matrix events is either
small or non-existent because of its relaxed write consistency.

* Writes in this implementation will not delay.

"Core dilation" is a phenomenon which occurs when large numbers of events
which have relaxed dependence are processed concurrently, because none of
them acquires any exclusivity which impedes the others.

           :::::::
           |||||||
           |||||||       <-- Core dilation; flow shape optimized for volume.
           |||||||
           /|||||\
          ///|||\\\
         //|/|||\|\\
        :::::::::::::

Close-up of the charybdis's write head when tight to one Schwarzschild radius of
the matrix room surface, which propagates only one event through at a time.
Vertical tracks are contexts on their journey through each evaluation and exclusion
step to the core.

    Input Events                                             Phase
    ::::::::::::::::::::::::::::::::::::::::::::::::::::::   validation / dupcheck
    ||||||||||||||||||||||||||||||||||||||||||||||||||||||   identity/key resolution
    ||||||||||||||||||||||||||||||||||||||||||||||||||||||   verification
    |||| ||||||||||||||| ||||||||||||||| |||||||||||||||||   head resolution
    --|--|----|-|---|--|--|---|---|---|---------|---|---|-   graph resolutions
    ----------|-|---|---------|-------|-----------------|-   module evaluations
            \   |   |         |       |            /
             == ==============|       |          ==          Lowest velocity locks
               \              |       |         /
                ==            |       |       ==             Mid velocity locks
                  \           |       |      /
                   ==         |      /     ==                High velocity locks
                     \        |     /     /
                      ==      =====/=   ==                   Highest velocity lock
                        \         /    /
                         \__     /  __/
                             _  |  _
                                !                            Write commitment

Above, two contexts are illustrated as contending for the highest velocity
lock. The highest velocity lock is not held for significant time, as the
holder has very little work left to be done within the core, and will
release the lock to the other context quickly. The lower velocity locks
may have to be held longer, but are also less exclusive to all contexts.

                              *                         Singularity
                            [   ]
              /-------------[---]-------------\
             /              :   :              \        Federation send
            /     /---------[---]---------\     \
           /                :   :                \      Client sync
       out      /    /------[---]------\    \      out
         /     /            :   :            \     \
        / out /             |   |             \ out \
       / out /                                 \ out \
      /                                               \
                           return
                        | result to |
                        | evaluator |
                        -------------

Above, a close-up of the release sequence. The new event is being "viewed" by
each consumer context; the horizontal lines represent a context switch from the
perspective of the event travelling down. Each consumer performs its own task
for propagating the committed event.

Each consumer has a shared-lock of the event which will hold up the completion
of the commitment until all consumers release it. The ideal consumer will only
hold its lock for a single context-slice while it plays its part in applying
the event, like non-blocking copies to sockets etc. These consumers then go on
to do the rest of their output without the original event data, whose memory
was supplied by the evaluator (like an HTTP client). Then all locks acquired on
the entry side of the core can be released. The evaluator then gets the result
of the successful commitment.

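The "ideal consumer" behavior described above, holding the event only for a first slice and then working from a copy, can be sketched as follows. As before this is an analogy under assumptions: `asyncio` tasks stand in for consumer contexts, `event.clear()` simulates the evaluator reclaiming its memory, and `make_consumer` and `release` are hypothetical names.

```python
import asyncio
import copy


def make_consumer(name, outbox, delay):
    """Build a consumer whose slow output works from a private copy."""
    def first_slice(event):
        snapshot = copy.deepcopy(event)    # copy while the original is valid

        async def rest():
            await asyncio.sleep(delay)     # slow output, e.g. a busy socket
            outbox.append((name, snapshot))

        return rest()
    return first_slice


async def release(event, consumers):
    """Show the event to each consumer once, then let go of it."""
    background = [asyncio.create_task(consume(event)) for consume in consumers]
    event.clear()                          # evaluator's memory is reclaimed
    return background


async def main():
    outbox = []
    consumers = [
        make_consumer("federation send", outbox, 0.02),
        make_consumer("client sync", outbox, 0.01),
    ]
    event = {"type": "m.room.message", "content": {"body": "hi"}}
    background = await release(event, consumers)
    await asyncio.gather(*background)
    return event, outbox


event, outbox = asyncio.run(main())
```

The original event is gone by the time either consumer finishes, yet both still deliver the full content, because neither held on to the shared data beyond its first slice.
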
#### Scaling

Scaling beyond the limit of a single CPU core can be done with multiple instances
of IRCd which form a cluster of independent actors. This cluster can extend
to other machines on the network too. The independent actors leverage the weak
write consistency and strong ordering of the matrix protocol to scale the same
way the federation scales.

Interference pattern of two IRCd'en:

```
::::::::::::::::::::::::::::::::::::
--------\:::::::/--\:::::::/--------
         |||||||    |||||||
           \|/        \|/
            !          !
            *          *
         //|||\\    //|||\\
       //|// \\|\\//|// \\|\\
      /|/|/|\|\|\/|/|/|\|\|\|\
```