How does p2p.riot.im work exactly: a tale of how to run a complex Go server in a browser
Chances are, if you've heard about running other programming languages in the browser, you might be thinking WebAssembly (WASM). We use it to run Dendrite in the browser, but there's an awful lot more to it which this blog post will explain. This is geared towards programmers and is not Matrix specific. If you want a higher level overview, check out this blog post.
Databases
First things first: databases. Dendrite previously only supported Postgres, which has no way of being compiled to WASM. SQLite is well known for its embeddability, and sure enough you can compile SQLite to WASM. We first had to make the server support multiple database engines, which itself was no small undertaking. Aside from SQL syntax differences, 'database is locked' errors were extremely common, but what does this mean exactly? SQLite doesn't allow multiple connections to write to the database at the same time. If you attempt to do so, you'll see 'database is locked'. You can mitigate this slightly by setting a busy timeout, which makes a connection wait for the lock to be released before giving up, but this is just a plaster over a fundamental structural problem in the code. You could enable WAL, but this isn't supported in WASM builds. To resolve the issue completely, you need to painstakingly go through everywhere you make SQL queries, look for any write statements (CREATE, DELETE, DROP, INSERT, or UPDATE) and make sure that these only ever run sequentially, e.g. from the same goroutine.
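One way to guarantee that writes only ever happen from the same goroutine is to funnel them through a channel. The following is a minimal sketch of that idea, not Dendrite's actual code: serialWriter, writeTask and countWith are hypothetical names, and the "write" is just a counter increment standing in for a real db.Exec call.

```go
package main

import (
	"fmt"
	"sync"
)

// writeTask is one unit of work for the writer goroutine.
type writeTask struct {
	fn   func() error
	done chan error
}

// serialWriter funnels every write through a single goroutine so that no two
// writes ever run at the same time. (Hypothetical helper for illustration.)
type serialWriter struct {
	tasks chan writeTask
}

func newSerialWriter() *serialWriter {
	w := &serialWriter{tasks: make(chan writeTask)}
	go func() {
		// The only goroutine that ever executes a write.
		for t := range w.tasks {
			t.done <- t.fn()
		}
	}()
	return w
}

// Do queues fn, blocks until the writer goroutine has run it, and returns
// whatever error fn produced.
func (w *serialWriter) Do(fn func() error) error {
	t := writeTask{fn: fn, done: make(chan error, 1)}
	w.tasks <- t
	return <-t.done
}

// countWith issues n concurrent "writes" that each bump an unguarded counter.
// In real code fn would wrap an INSERT/UPDATE/DELETE instead (assumption).
func countWith(w *serialWriter, n int) int {
	counter := 0
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = w.Do(func() error { counter++; return nil })
		}()
	}
	wg.Wait()
	return counter
}

func main() {
	fmt.Println(countWith(newSerialWriter(), 100))
}
```

Callers can still be as concurrent as they like; only the body of each write is serialised, which is what SQLite needs. The sketch never shuts the writer goroutine down, which a long-lived server may not care about but a library probably would.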
Build tags
Assuming you now have both Postgres and SQLite support, you might try to compile Go to WebAssembly. There's a good instruction guide at https://github.com/golang/go/wiki/WebAssembly but in essence it just equates to GOOS=js GOARCH=wasm go build ... which will hit the next problem: not all libraries support being compiled to WASM:
pq undefined: userCurrent
The lib/pq library for one does not. When built for WASM it doesn't have any idea of Unix users so it cannot compile. Runtime switches to only use SQLite won't help here as the unused libraries still get compiled. A quick fix would be to fork the project and stub out the missing functions, but the long-term fix is to not import that library at all when being built under WASM. You can do this using build tags, and there's a good explanation of this at Dave Cheney's blog.
In essence, build tags will include or exclude files from being picked up by the go toolchain. It doesn't work at a package level. This means if you have certain imports you don't want (e.g. lib/pq) then you need to take the functions that use those imports and put them in a separate file, then define those very same functions in another file and specify the build tags accordingly (// +build wasm or // +build !wasm). Go provides a shorthand way of doing this based off the filename: foo_wasm.go implies // +build wasm.
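As a rough illustration of the file split, here is what the non-WASM half might look like. The file names, the defaultDriver function and the WASM driver name are all hypothetical; "postgres" is the name lib/pq registers itself under. The sketch uses the //go:build form, which is the preferred spelling since Go 1.17; the older // +build lines are still accepted.

```go
//go:build !wasm

// File: storage_native.go (hypothetical name). It is only compiled when
// GOARCH is not wasm, so anything imported here (lib/pq, say) never reaches
// a WASM build.
//
// The counterpart file, storage_wasm.go, would define the same function:
//
//	package main
//
//	func defaultDriver() string { return "some-wasm-driver" } // placeholder
//
// Its _wasm.go suffix restricts it to GOARCH=wasm without needing a tag.
package main

import "fmt"

// defaultDriver names the database/sql driver this build should use.
func defaultDriver() string { return "postgres" }

func main() {
	fmt.Println("driver:", defaultDriver())
}
```

The rest of the package calls defaultDriver() without caring which file supplied it, which is what keeps the unwanted import out of the WASM binary.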
When writing your own Go code for use with WASM, bear in mind that integer limits on WASM are lower than int64, so operations involving say math.MaxUint64 may not behave as you'd expect and cause the entire runtime to collapse.
CGO
So you've done GOOS=js GOARCH=wasm go build ... and got it to produce a .wasm file. The instructions at https://github.com/golang/go/wiki/WebAssembly tell you how to load this into the browser: specifically you need to run wasm_exec.js (which comes from GOROOT) first, which sets up a global.Go object which can run the program (it sets up the runtime). If you do this, you'll find your next problem:
Binary was compiled with 'CGO_ENABLED=0', go-sqlite3 requires cgo to work. This is a stub
This is an error from mattn/go-sqlite3 with a tantalisingly simple fix. However, running CGO_ENABLED=1 GOOS=js GOARCH=wasm go build ... won't do what you want:
unknown ptrSize for $GOARCH "wasm"
/usr/local/go/src/os/user/lookup.go:36:9: undefined: lookupUser
The problem is that Go doesn't know how to compile C to WASM. There are of course projects like Emscripten which do this, but the Go toolchain doesn't integrate with them. There are a few solutions here, and which one works depends on when you're reading this:
- Use Dynamic Linking.
- Manually transfer SQL requests to an Emscripten form of SQLite.
Dynamic linking would be the ideal solution: compile all C modules as side modules and link to them when producing dendrite.wasm. Unfortunately, the Go toolchain doesn't support this yet. Instead, we have written a SQL driver which will pass SQL queries up to JS via a global variable _go_sqlite. We use sql.js, an Emscripten-ised form of SQLite3, to actually handle the queries. This global variable is expected to be the result of await initSqlJs(...) from https://github.com/sql-js/sql.js. There are a few limitations to this technique: we don't support transactions or all data types, just the subset of functionality that Dendrite requires. Using build tags, we only import mattn/go-sqlite3 for the SQL driver when we run in non-WASM mode, else we use matrix-org/go-sqlite3-js.
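To give a feel for the shape of such a driver, here is a sketch of a database/sql driver that does no SQL work itself and simply forwards each statement elsewhere. It is not the go-sqlite3-js code: the real driver talks to JS through syscall/js, which only builds for js/wasm, so this sketch swaps in a plain Go struct (bridge) as a stand-in. The driver name "forwarding" and every type name here are hypothetical.

```go
package main

import (
	"database/sql"
	"database/sql/driver"
	"errors"
	"fmt"
)

// bridge stands in for the JS side (hypothetical). A WASM build would call
// into the JS global instead; here we only record what we were asked to run.
type bridge struct{ executed []string }

func (b *bridge) exec(query string, args []driver.Value) (int64, error) {
	b.executed = append(b.executed, query)
	return 1, nil // pretend one row was affected
}

var jsBridge = &bridge{}

type forwardingDriver struct{ b *bridge }

func (d *forwardingDriver) Open(name string) (driver.Conn, error) {
	return &conn{b: d.b}, nil
}

type conn struct{ b *bridge }

func (c *conn) Prepare(query string) (driver.Stmt, error) {
	return &stmt{b: c.b, query: query}, nil
}
func (c *conn) Close() error { return nil }

// Begin mirrors the limitation described above: no transactions.
func (c *conn) Begin() (driver.Tx, error) {
	return nil, errors.New("transactions are not supported")
}

type stmt struct {
	b     *bridge
	query string
}

func (s *stmt) Close() error  { return nil }
func (s *stmt) NumInput() int { return -1 } // -1: let the other side validate args

func (s *stmt) Exec(args []driver.Value) (driver.Result, error) {
	n, err := s.b.exec(s.query, args)
	if err != nil {
		return nil, err
	}
	return driver.RowsAffected(n), nil
}

func (s *stmt) Query(args []driver.Value) (driver.Rows, error) {
	return nil, errors.New("row queries are omitted from this sketch")
}

func init() { sql.Register("forwarding", &forwardingDriver{b: jsBridge}) }

func openForwardingDB() (*sql.DB, error) { return sql.Open("forwarding", "") }

func main() {
	db, err := openForwardingDB()
	if err != nil {
		panic(err)
	}
	defer db.Close()
	if _, err := db.Exec("INSERT INTO users(name) VALUES (?)", "alice"); err != nil {
		panic(err)
	}
	fmt.Println(jsBridge.executed)
}
```

Because the driver plugs in behind database/sql, the rest of the server keeps using the normal sql.DB API and only the import changes between builds.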
At this point, the databases will only be in-memory. Ideally they would persist to IndexedDB. By default, sql.js will not do this. With a few modifications you can make the in-memory Emscripten filesystem persist to IndexedDB via IDBFS. Unfortunately, you need to manually flush the filesystem, e.g. on a timer, which violates ACID's Durability as writes since the last flush can be lost. We flush every 30 seconds.
Service Workers
Running a server in the browser is one thing, but how do you receive requests? We use service workers, specifically Fetch events, to intercept all outgoing requests from the open tab. Service workers are effectively a process that runs in the background of a browser. The service worker lifecycle is peculiar as it exists outside the scope of a tab, meaning the process will still be running even if you close the browser tab. This has privacy implications, which is why service workers are disabled when running under incognito mode, and why p2p.riot.im doesn't work incognito. Originally, service workers were designed for offline caching. The idea was to intercept requests and serve up cached resources. We intercept all requests which have the /_matrix prefix.
Once a service worker is set up listening for fetch requests, requests can be passed to Go via another global variable which Dendrite sets up. This variable holds a function with the signature function(reqString: string): Promise<{result: string, error: string}> where reqString is the entire stringified HTTP request (including headers) and result is the entire stringified HTTP response (including headers). In Go, we use http.ReadRequest and httptest.NewRecorder() to parse the request and capture the response respectively. This relatively simple process works well because the function call naturally frames each request so we know when one request ends and another begins.
Countless issues with updating service workers and differences between browsers have made working with them challenging. A good guide which introduces service workers can be found here. p2p.riot.im works in both Firefox and Chrome, which has highlighted a few extra undocumented differences between the two implementations:
- There are no guidelines to say when a service worker should be killed to save memory. Firefox will terminate service workers after 30 seconds of inactivity (no fetch requests) whereas Chrome tries to keep them around until there is memory pressure. This is a problem on Firefox because it would lead to the server being terminated very early on. This was fixed by forcing a 20 second /sync timeout to ensure we send requests to the service worker frequently enough.
- Chrome's developer console has better service worker support overall. It consistently shows service worker logs in the console whereas Firefox is sporadic, and usually doesn't. Chrome also has chrome://serviceworker-internals to see structured log output, and you can inspect the service worker globals by changing the JavaScript context from top to dendrite_sw.js.
- Firefox will immediately kill service workers in response to self.skipWaiting() but Chrome will wait around for several minutes before swapping new service workers in. This is a problem on Chrome as it means updates don't immediately take effect.
- The "ready promise" at navigator.serviceWorker.ready fires just prior to activation on Firefox, but just after activation on Chrome. This is a problem on Firefox as we use this to determine when we're ready to automatically register users.
- The "byte-for-byte" difference mentioned in the docs for ServiceWorkerRegistration.update is only applied on the fetched script in Chrome, but extends to the WASM for Firefox. This is mainly a problem during development.
- On both browsers, refreshing a tab is not enough for a new service worker to be swapped in, even when using self.skipWaiting().
Overall, working with service workers has been the hardest part of this entire process: they are difficult to test and to debug when things go wrong.
libp2p
Finally, we use a rendezvous server to act as a relay for peer-to-peer traffic. This means all traffic goes via a central server: not quite as peer-to-peer as we'd like. We'd like to use WebRTC data channels in the future, but currently service workers do not support them. This means that if you have two browsers on the same laptop they'll still bounce via the relay server.
We implement a custom http.RoundTripper in our Federation client which hits out to the p2p network. We actually use the JS version of libp2p rather than the Go version to avoid having to shim all the p2p network activity (e.g. websockets or webrtc) through to JS from Go, and to make it easier to debug the libp2p side in-browser. This code uses similar communication techniques to those already described and the code can be found here. We've found peer discovery can take a while, and latency when sending data scales badly with load, likely because the rendezvous server gets overloaded (it often maxes out a CPU core).
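For readers who haven't written one, a custom http.RoundTripper is just a type with a RoundTrip method, and it can send the request anywhere it likes. The sketch below serialises the request, hands it to a function and parses whatever comes back. The sendFunc type and the fake peer are assumptions standing in for the libp2p bridge; this is not the actual Federation client code.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"net/http"
)

// sendFunc delivers a serialised HTTP request to a peer and returns the
// serialised response. Hypothetical: it stands in for the JS libp2p bridge.
type sendFunc func(peer string, rawReq []byte) ([]byte, error)

// p2pRoundTripper satisfies http.RoundTripper without opening a TCP socket.
type p2pRoundTripper struct{ send sendFunc }

func (rt *p2pRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	var buf bytes.Buffer
	if err := req.Write(&buf); err != nil { // wire-format request
		return nil, err
	}
	// The URL host is treated as the peer to route to (an assumption).
	rawResp, err := rt.send(req.URL.Host, buf.Bytes())
	if err != nil {
		return nil, err
	}
	return http.ReadResponse(bufio.NewReader(bytes.NewReader(rawResp)), req)
}

func newP2PClient(send sendFunc) *http.Client {
	return &http.Client{Transport: &p2pRoundTripper{send: send}}
}

// fetchBody performs a GET and returns the status code and body.
func fetchBody(client *http.Client, url string) (int, string, error) {
	resp, err := client.Get(url)
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return resp.StatusCode, string(body), err
}

func main() {
	fakePeer := func(peer string, rawReq []byte) ([]byte, error) {
		body := "hello from " + peer
		raw := fmt.Sprintf("HTTP/1.1 200 OK\r\nContent-Length: %d\r\n\r\n%s", len(body), body)
		return []byte(raw), nil
	}
	status, body, err := fetchBody(newP2PClient(fakePeer), "http://peer-id/_matrix/federation/v1/version")
	fmt.Println(status, body, err)
}
```

Everything above the transport, such as retries, JSON decoding and request signing, keeps working unchanged because it only ever sees an ordinary *http.Client.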
The libp2p JS libraries have some quirks: stylistically they are in the middle of a transition from old to new JS code styles, and the large number of dependencies means that it can be easy to use incompatible versions of certain libraries together (e.g. a newer version of peer-info with an older version of peer-id) which then fails in obscure ways. We've also hit a few scenarios whereby requests seemingly black hole, even when specifying sensible timeouts. Overall though, it works okay enough for the purposes of the demo.
Conclusions
Running a Go server in a browser requires a lot of work. Most of the tools used to do this have room for improvement: from dynamic linking for WASM to WebRTC and debuggability for service workers. That being said, it's definitely possible and I hope this write-up will encourage others to give it a try, or at the very least avoid some of the obstacles we've encountered. You can give P2P a go by visiting https://p2p.riot.im or build it yourself.