mirror of
https://github.com/matrix-org/dendrite
synced 2024-12-13 19:53:14 +01:00
Created How p2p.riot.im works (markdown)
parent
986db8a12f
commit
ca0a8e361d
1 changed files with 67 additions and 0 deletions
How-p2p.riot.im-works.md
### How does p2p.riot.im work exactly: a tale of how to run a complex Go server in a browser
Chances are, if you've heard about running other programming languages in the browser, you might be thinking WebAssembly (WASM). We use it to run Dendrite in the browser, but there's an awful lot more to it which this blog post will explain. This is geared towards programmers and is not Matrix specific. If you want a higher level overview, check out [this blog post](https://matrix.org/blog/2020/06/02/introducing-p-2-p-matrix/).
#### Databases
First things first: databases. Dendrite previously only supported Postgres, which has no way of being compiled to WASM. SQLite is well known for its embeddability, and sure enough you can compile SQLite to WASM. We first had to make the server support multiple database engines, which itself was no small undertaking. Aside from SQL syntax differences, 'database is locked' errors were extremely common, but what does this mean exactly? SQLite doesn't allow multiple connections to write to the database at the same time. If you attempt to do so, you'll see 'database is locked'. You can mitigate this slightly by setting a [busy timeout](https://www.sqlite.org/c3ref/busy_timeout.html), which waits for the lock to be released before giving up - but this is just a plaster over a fundamental structural problem in the code. To resolve the issue completely, you need to painstakingly go through every place you make SQL queries, look for any write statements (CREATE, DELETE, DROP, INSERT, or UPDATE) and make sure that they only ever run sequentially, e.g. from the same goroutine.
#### Build tags
Assuming you now have both postgres/sqlite support, you might try to compile Go to WebAssembly. There's a good instruction guide at https://github.com/golang/go/wiki/WebAssembly but in essence it just equates to `GOOS=js GOARCH=wasm go build ....` which will hit the next problem: not all libraries support being compiled to WASM:
```
pq undefined: userCurrent
```
The `lib/pq` library for one does not. When built for WASM it doesn't have any idea of unix users so it cannot compile. Runtime switches to only use SQLite won't help here as the unused libraries still get compiled. A quick fix would be to fork the project and stub out the missing functions, but the long term fix is to not import that library at all when being built under WASM. You can do this using build tags, and there's a good explanation of this at [Dave Cheney's blog](https://dave.cheney.net/2013/10/12/how-to-use-conditional-compilation-with-the-go-build-tool).
In essence, build tags include or exclude **files** from being picked up by the `go` toolchain. They don't work at a package level. This means that if you have certain imports you don't want (e.g. `lib/pq`) then you need to take the functions that use those imports and put them in a separate file, define those very same functions in another file, and specify the build tags accordingly (`// +build wasm` or `// +build !wasm`). Go provides a shorthand way of doing this based on the filename: `foo_wasm.go` implies `// +build wasm`.
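
As a rough sketch of the file split, assuming made-up file and function names (`storage_native.go`, `storage_wasm.go`, `databaseDriver`), the non-WASM half might look like this. It uses the `//go:build` form, which Go 1.17 and later accept as the equivalent of `// +build`, and it returns a driver name instead of importing `lib/pq` so that it runs with the standard library alone.

```go
//go:build !wasm

// File: storage_native.go (hypothetical name). The constraint above means
// this file is skipped entirely when building with GOARCH=wasm.
package main

import "fmt"

// databaseDriver reports which SQL driver this build uses. In a real project
// this file would also hold the Postgres driver import, so that the import
// is never seen by the compiler in WASM builds.
func databaseDriver() string {
	return "postgres"
}

// The WASM counterpart lives in a second file. Naming it storage_wasm.go
// applies the wasm constraint through the filename alone:
//
//	package main
//
//	func databaseDriver() string { return "sqlite3-js" }

func main() {
	fmt.Println("database driver for this build:", databaseDriver())
}
```

Because both files define `databaseDriver` with the same signature, the rest of the package calls it without caring which file was compiled in.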
When writing your own Go code for use with WASM, bear in mind that integer limits on WASM are lower than int64, so operations involving say `math.MaxUint64` [may not behave as you'd expect](https://github.com/matrix-org/dendrite/commit/bfb954519bdf172451d999ac4c654b3d15eff124) and cause the entire runtime to collapse.
#### CGO
So you've done `GOOS=js GOARCH=wasm go build ....` and got it to produce a `.wasm` file. The instructions at https://github.com/golang/go/wiki/WebAssembly tell you how to load this into the browser: specifically you need to run `wasm_exec.js` (which comes from `GOROOT`) first, which sets up a `global.Go` object which can run the program (it sets up the runtime). If you do this, you'll find your next problem:
```
Binary was compiled with 'CGO_ENABLED=0', go-sqlite3 requires cgo to work. This is a stub
```
This is an error from `mattn/go-sqlite3` with a tantalisingly simple fix. However, running `CGO_ENABLED=1 GOOS=js GOARCH=wasm go build ....` won't do what you want:
```
unknown ptrSize for $GOARCH "wasm"
/usr/local/go/src/os/user/lookup.go:36:9: undefined: lookupUser
```
The problem is that Go doesn't know how to compile C to WASM. There are of course projects like Emscripten which do this, but the Go toolchain doesn't. There are a few solutions here; which one works depends on when you're reading this:
- Use [Dynamic Linking](https://webassembly.org/docs/dynamic-linking/).
- Manually transfer SQL requests to an Emscripten form of SQLite.
Dynamic linking would be the ideal solution: compile all C modules as side modules and link to them when producing `dendrite.wasm`. Unfortunately, the Go toolchain doesn't support this yet. Instead, [we have written a SQL driver](https://github.com/matrix-org/go-sqlite3-js) which will pass SQL queries up to JS via a global variable `_go_sqlite`. We use sql.js - an Emscripten-ised form of SQLite3 - to actually handle the queries. This global variable is expected to be the result of `await initSqlJs(...)` from https://github.com/sql-js/sql.js. There are a few limitations to this technique: we don't support transactions or all data types, just the subset of functionality that Dendrite requires. Using build tags, we import `mattn/go-sqlite3` as the SQL driver only in non-WASM builds; otherwise we use `matrix-org/go-sqlite3-js`.
At this point, the databases will only be in-memory. Ideally they would persist to IndexedDB. By default, sql.js will not do this. With a [few modifications](https://github.com/sql-js/sql.js/pull/397) you can make the in-memory Emscripten filesystem persist to IndexedDB via IDBFS. Unfortunately, you need to manually flush the filesystem, e.g. on a timer, which violates the Durability guarantee of ACID. We flush every 30 seconds.
#### Service Workers
Running a server in the browser is one thing: but how do you receive requests? We use service workers, specifically Fetch events, to intercept all outgoing requests from the open tab. Service workers are effectively a process that runs in the background of a browser. The service worker lifecycle is peculiar as it exists *outside* the scope of a tab, meaning the process will still be running even if you close the browser tab. This has privacy implications which is why service workers are disabled when running under incognito mode, and why p2p.riot.im doesn't work incognito. Originally, service workers were designed for offline caching. The idea was to intercept requests and serve up cached resources. We intercept all requests which have the `/_matrix` prefix.
Once a service worker is set up listening for fetch requests, requests can be passed to Go via another global variable which Dendrite sets up. This function has the signature `function(reqString: string): Promise<{result: string, error: string}>` where `reqString` is the entire stringified HTTP request (including headers) and `result` is the entire stringified HTTP response (including headers). In Go, we use `http.ReadRequest` and `httptest.NewRecorder()` to parse these requests/responses respectively. This [relatively simple process](https://github.com/matrix-org/dendrite/blob/353a5d6fc25cb0e31ccb4cd433fed6112089c0af/cmd/dendritejs/jsServer.go) works well because the function call naturally frames each request so we know when one request ends and another begins.
Countless issues with updating service workers, and differences between browsers, have made working with them challenging. A good guide which introduces service workers can be [found here](https://blog.sessionstack.com/how-javascript-works-service-workers-their-life-cycle-and-use-cases-52b19ad98b58). p2p.riot.im works in both Firefox and Chrome, which has highlighted a few extra undocumented differences between the two implementations:
- There are no guidelines to say when a service worker should be killed to save memory. Firefox will terminate service workers [after 30 seconds of inactivity](https://bugzilla.mozilla.org/show_bug.cgi?id=1378587) (no fetch requests) whereas Chrome tries to keep them around until there is memory pressure. This is a problem on Firefox because it would lead to the server being terminated very early on. This was fixed by forcing a 20 second `/sync` timeout to ensure we send requests to the service worker frequently enough.
- Chrome's developer console has better service worker support overall. It consistently shows service worker logs in the console whereas Firefox is sporadic, and usually doesn't. Chrome also has `chrome://serviceworker-internals` to see structured log output, and you can inspect the service worker globals by changing the JavaScript context from `top` to `dendrite_sw.js`.
- Firefox will immediately kill service workers in response to `self.skipWaiting()` but Chrome will wait around for several minutes before swapping new service workers in. This is a problem on Chrome as it means updates don't immediately take effect.
- The "ready promise" at `navigator.serviceWorker.ready` fires just prior to activation on Firefox, but just after activation on Chrome. This is a problem on Firefox as we use this to determine when we're ready to automatically register users.
- The "byte-for-byte" difference mentioned in the docs for [ServiceWorkerRegistration.update](https://developer.mozilla.org/en-US/docs/Web/API/ServiceWorkerRegistration/update) is only applied on the fetched script in Chrome, but extends to the WASM for Firefox. This is mainly a problem during development.
- On both browsers, [refreshing a tab is not enough for a new service worker to be swapped in](https://github.com/w3c/ServiceWorker/issues/1238), even when using `self.skipWaiting()`.
Overall, working with service workers has been the hardest part of this entire process: they are difficult to test and to debug when things go wrong.
#### libp2p
Finally, we use a [rendezvous server](https://github.com/libp2p/js-libp2p-websocket-star-rendezvous) to act as a relay for peer-to-peer traffic. This means all traffic goes via a central server: not quite as peer-to-peer as we'd like. We'd like to use WebRTC data channels in the future, but currently [service workers do not support them](https://github.com/w3c/webrtc-pc/issues/230). This means that if you have two browsers on the same laptop they'll still bounce via the relay server.
We implement a custom `http.RoundTripper` in our Federation client which hits out to the p2p network. We actually use the JS version of libp2p rather than the Go version because it was easier to get things set up that way. This code uses similar communication techniques to those already described and the [code can be found here](https://github.com/matrix-org/go-http-js-libp2p). We've found peer discovery can take a while, and latency when sending data scales badly with load - likely because the rendezvous server gets overloaded (it often maxes out a CPU core).
The libp2p JS libraries themselves are haphazard, with a mix of old and new style JS. Many libraries depend on other libraries, and it's easy to use incompatible versions of certain libraries together (e.g. a newer version of `peer-info` with an older version of `peer-id`), which then fail in obscure ways. We've hit various scenarios whereby requests seemingly black hole, even when specifying sensible timeouts. Overall though, it just about works for demo purposes but we wouldn't want to run anything "production grade" on it.
#### Conclusions
Running a Go server in a browser requires a lot of work. Most of the tools used to do this have room for improvement: from dynamic linking for WASM to WebRTC and debuggability for service workers. That being said, it's definitely possible and I hope this write-up will encourage others to give it a try, or at the very least avoid some of the obstacles we've encountered. You can give P2P a go by visiting https://p2p.riot.im or [build it yourself](https://github.com/matrix-org/dendrite/blob/dc3338d1f299555148cbf406c5b2bd823f3ca038/build/docker/DendriteJS.Dockerfile).