2019-01-26 21:29:08 +01:00
|
|
|
## Database
|
2017-03-31 00:57:08 +02:00
|
|
|
|
2019-01-29 20:14:11 +01:00
|
|
|
### Organization
|
|
|
|
|
|
|
|
The database subsystem is split into three interfaces.
|
|
|
|
|
|
|
|
- IRCd developer interface. These classes are for developers to simply use the
|
|
|
|
database. An overview of them is given below. They do not require RocksDB headers
|
|
|
|
and expose no RocksDB internals. These interfaces are included in the standard
|
|
|
|
include stack. The headers are readily found listed in `<ircd/db.h>`. The
|
|
|
|
definitions are in `ircd/db.cc`.
|
|
|
|
|
|
|
|
- Database developer interface. These classes implement various RocksDB virtual
|
|
|
|
interfaces. They are all declared as nested classes of`ircd::db::database` Most of
|
|
|
|
these are not used directly by general users of the database; though they are
|
|
|
|
technically frontend interfaces to RocksDB. They help us create the IRCd
|
|
|
|
developer interface in `ircd/db.cc`. Headers are in `<ircd/db/database/*>`
|
|
|
|
but most are not included in the standard include stack.
|
|
|
|
|
|
|
|
- Database environment interface. These classes implement additional RocksDB
|
|
|
|
virtual interfaces for backend callbacks operations. We connect RocksDB to `ircd::fs`
|
|
|
|
and `ircd::ctx` et al on this surface. Headers are in `<ircd/db/database/env/*>`
|
|
|
|
and are never included in the standard include stack. Definitions are in a
|
|
|
|
separate unit `db_env.cc`.
|
|
|
|
|
|
|
|
- Database port interface (bonus). This hack helps us trick RocksDB into using
|
|
|
|
`ircd::ctx` rather than posix threads. We achieve this by conjuring magics using
|
|
|
|
shared library PLT indirection while praying that the class layout and semantics
|
|
|
|
RocksDB's compiler has assumed does not thwart our plans. These definitions are
|
|
|
|
in `ircd/db_port.cc`.
|
|
|
|
|
|
|
|
|
|
|
|
### Interface
|
2018-09-20 01:11:21 +02:00
|
|
|
The database here is a strictly schematized key-value grid built from the primitives of
|
|
|
|
`column`, `cell` and `row`. We use the database as both a flat-object store (matrix
|
|
|
|
event db) using cells, columns and rows; as well as binary data block store (matrix
|
|
|
|
media db) using just a single column.
|
2017-03-31 00:57:08 +02:00
|
|
|
|
2017-03-31 01:20:01 +02:00
|
|
|
#### Columns
|
2018-09-20 01:11:21 +02:00
|
|
|
A column is a key-value store. We specify columns in the database descriptors when
|
|
|
|
opening the database and (though technically somewhat possible) don't change them
|
|
|
|
during runtime. The database must be opened with the same descriptors in properly
|
|
|
|
aligned positions every single time. All keys and values in a column can be iterated.
|
|
|
|
Columns can also be split into "domains" of keys (see: `db/index.h`) based on the
|
|
|
|
key's prefix, allowing a seek and iteration isolated to just one domain.
|
|
|
|
|
|
|
|
In practice we create a column to present a single property in a JSON object. There
|
|
|
|
is no recursion and we also must know the name of the property in advance to specify
|
2019-01-23 22:07:47 +01:00
|
|
|
a descriptor for it. Applied, we create a column for each property in the Matrix
|
|
|
|
event object† and then optimize that column specifically for that property.
|
2018-09-20 01:11:21 +02:00
|
|
|
|
|
|
|
```
|
|
|
|
{
|
|
|
|
"origin_server_ts": 15373977384823,
|
|
|
|
"content": {"msgtype": "m.text", "body": "hello"}
|
|
|
|
}
|
|
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
For the above example consider two columns: first `origin_server_ts` is optimized for
|
|
|
|
timestamps by having values which are fixed 8 byte signed integers; next the content
|
|
|
|
object is stored in whole as JSON text in the content column. Recursion is not yet
|
|
|
|
supported but theoretically we can create more columns to hold nested properties
|
|
|
|
if we want further optimizations.
|
2017-03-31 00:57:08 +02:00
|
|
|
|
2019-01-23 22:07:47 +01:00
|
|
|
† Note that the application in the example has been further optimized; not
|
|
|
|
every property in the Matrix event has a dedicated column, and an additional
|
|
|
|
meta-column is used to store full JSON events (albeit compressed) which is
|
|
|
|
may be selected to improve performance by making a single larger request
|
|
|
|
rather than several smaller requests to compose the same data.
|
|
|
|
|
2017-03-31 01:20:01 +02:00
|
|
|
#### Rows
|
2017-03-31 00:57:08 +02:00
|
|
|
Since `columns` are technically independent key-value stores (they have their own
|
2018-09-20 01:11:21 +02:00
|
|
|
index), when an index key is the same between columns we call this a `row`. In
|
|
|
|
the Matrix event example, each property of the same event is sought together in a
|
|
|
|
`row`. A row seek is optimized and the individual cells are queried concurrently and
|
|
|
|
iterated in lock-step together.
|
2017-03-31 00:57:08 +02:00
|
|
|
|
2019-01-23 22:07:47 +01:00
|
|
|
There is a caveat to rows even though the queries to individual cells are
|
|
|
|
made concurrently: each cell still requires an individual request to the
|
|
|
|
storage device. The device has a fixed number of total requests it can service
|
|
|
|
concurrently.
|
|
|
|
|
2017-03-31 01:20:01 +02:00
|
|
|
#### Cells
|
2018-09-20 01:11:21 +02:00
|
|
|
A `cell` is a gratuitious interface representing of a single value in a `column` with
|
|
|
|
a common key that should be able to form a `row` between columns. A `row` is comprised
|
|
|
|
of cells.
|
2017-03-31 00:57:08 +02:00
|
|
|
|
2017-03-31 01:20:01 +02:00
|
|
|
### Important notes
|
2017-03-31 00:57:08 +02:00
|
|
|
|
|
|
|
!!!
|
|
|
|
The database system is plugged into the userspace context system to facilitate IO. This means
|
|
|
|
that an expensive database call (mostly on the read side) that has to do disk IO will suspend
|
|
|
|
your userspace context. Remember that when your userspace context resumes on the other side
|
|
|
|
of the call, the state of IRCd and even the database itself may have changed. We have a suite
|
|
|
|
of tools to mitigate this.
|
|
|
|
!!!
|
|
|
|
|
|
|
|
* While the database schema is modifiable at runtime (we can add and remove columns on
|
|
|
|
the fly) the database is very picky about opening the exact same way it last closed.
|
|
|
|
This means, for now, we have the full object schema explicitly specified when the DB
|
|
|
|
is first opened. All columns exist for the lifetime of the DB, whether or not you have
|
|
|
|
a handle to them.
|