Use attrs internally for the URL preview code & add documentation. (#10753)

This commit is contained in:
Patrick Cloke 2021-09-07 09:10:34 -04:00 committed by GitHub
parent a23f3abb9b
commit 89ba834818
No known key found for this signature in database
GPG key ID: 4AEE18F83AFDEB23
5 changed files with 132 additions and 119 deletions

1
changelog.d/10753.misc Normal file
View file

@ -0,0 +1 @@
Use `attrs` internally for the URL preview code & update documentation.

View file

@ -34,7 +34,7 @@
- [Application Services](application_services.md)
- [Server Notices](server_notices.md)
- [Consent Tracking](consent_tracking.md)
- [URL Previews](url_previews.md)
- [URL Previews](development/url_previews.md)
- [User Directory](user_directory.md)
- [Message Retention Policies](message_retention_policies.md)
- [Pluggable Modules](modules.md)

View file

@ -0,0 +1,51 @@
URL Previews
============
The `GET /_matrix/media/r0/preview_url` endpoint provides a generic preview API
for URLs which outputs [Open Graph](https://ogp.me/) responses (with some Matrix
specific additions).
This does have trade-offs compared to other designs:
* Pros:
* Simple and flexible; can be used by any clients at any point
* Cons:
* If each homeserver provides one of these independently, all the HSes in a
room may needlessly DoS the target URI
* The URL metadata must be stored somewhere, rather than just using Matrix
itself to store the media.
* Matrix cannot be used to distribute the metadata between homeservers.
When Synapse is asked to preview a URL it does the following:
1. Checks against a URL blacklist (defined as `url_preview_url_blacklist` in the
config).
2. Checks the in-memory cache by URLs and returns the result if it exists. (This
is also used to de-duplicate processing of multiple in-flight requests at once.)
3. Kicks off a background process to generate a preview:
1. Checks the database cache by URL and timestamp and returns the result if it
has not expired and was successful (a 2xx return code).
2. Checks if the URL matches an oEmbed pattern. If it does, fetch the oEmbed
response. If this is an image, replace the URL to fetch and continue. If
if it is HTML content, use the HTML as the document and continue.
3. If it doesn't match an oEmbed pattern, downloads the URL and stores it
into a file via the media storage provider and saves the local media
metadata.
5. If the media is an image:
1. Generates thumbnails.
2. Generates an Open Graph response based on image properties.
6. If the media is HTML:
1. Decodes the HTML via the stored file.
2. Generates an Open Graph response from the HTML.
3. If an image exists in the Open Graph response:
1. Downloads the URL and stores it into a file via the media storage
provider and saves the local media metadata.
2. Generates thumbnails.
3. Updates the Open Graph response based on image properties.
7. Stores the result in the database cache.
4. Returns the result.
The in-memory cache expires after 1 hour.
Expired entries in the database cache (and their associated media files) are
deleted every 10 seconds. The default expiration time is 1 hour from download.

View file

@ -1,76 +0,0 @@
URL Previews
============
Design notes on a URL previewing service for Matrix:
Options are:
1. Have an AS which listens for URLs, downloads them, and inserts an event that describes their metadata.
* Pros:
* Decouples the implementation entirely from Synapse.
* Uses existing Matrix events & content repo to store the metadata.
* Cons:
* Which AS should provide this service for a room, and why should you trust it?
* Doesn't work well with E2E; you'd have to cut the AS into every room
* the AS would end up subscribing to every room anyway.
2. Have a generic preview API (nothing to do with Matrix) that provides a previewing service:
* Pros:
* Simple and flexible; can be used by any clients at any point
* Cons:
* If each HS provides one of these independently, all the HSes in a room may needlessly DoS the target URI
* We need somewhere to store the URL metadata rather than just using Matrix itself
* We can't piggyback on matrix to distribute the metadata between HSes.
3. Make the synapse of the sending user responsible for spidering the URL and inserting an event asynchronously which describes the metadata.
* Pros:
* Works transparently for all clients
* Piggy-backs nicely on using Matrix for distributing the metadata.
* No confusion as to which AS
* Cons:
* Doesn't work with E2E
* We might want to decouple the implementation of the spider from the HS, given spider behaviour can be quite complicated and evolve much more rapidly than the HS. It's more like a bot than a core part of the server.
4. Make the sending client use the preview API and insert the event itself when successful.
* Pros:
* Works well with E2E
* No custom server functionality
* Lets the client customise the preview that they send (like on FB)
* Cons:
* Entirely specific to the sending client, whereas it'd be nice if /any/ URL was correctly previewed if clients support it.
5. Have the option of specifying a shared (centralised) previewing service used by a room, to avoid all the different HSes in the room DoSing the target.
Best solution is probably a combination of both 2 and 4.
* Sending clients do their best to create and send a preview at the point of sending the message, perhaps delaying the message until the preview is computed? (This also lets the user validate the preview before sending)
* Receiving clients have the option of going and creating their own preview if one doesn't arrive soon enough (or if the original sender didn't create one)
This is a bit magical though in that the preview could come from two entirely different sources - the sending HS or your local one. However, this can always be exposed to users: "Generate your own URL previews if none are available?"
This is tantamount also to senders calculating their own thumbnails for sending in advance of the main content - we are trusting the sender not to lie about the content in the thumbnail. Whereas currently thumbnails are calculated by the receiving homeserver to avoid this attack.
However, this kind of phishing attack does exist whether we let senders pick their thumbnails or not, in that a malicious sender can send normal text messages around the attachment claiming it to be legitimate. We could rely on (future) reputation/abuse management to punish users who phish (be it with bogus metadata or bogus descriptions). Bogus metadata is particularly bad though, especially if it's avoidable.
As a first cut, let's do #2 and have the receiver hit the API to calculate its own previews (as it does currently for image thumbnails). We can then extend/optimise this to option 4 as a special extra if needed.
API
---
```
GET /_matrix/media/r0/preview_url?url=http://wherever.com
200 OK
{
"og:type" : "article"
"og:url" : "https://twitter.com/matrixdotorg/status/684074366691356672"
"og:title" : "Matrix on Twitter"
"og:image" : "https://pbs.twimg.com/profile_images/500400952029888512/yI0qtFi7_400x400.png"
"og:description" : "“Synapse 0.12 is out! Lots of polishing, performance & bugfixes: /sync API, /r0 prefix, fulltext search, 3PID invites https://t.co/5alhXLLEGP”"
"og:site_name" : "Twitter"
}
```
* Downloads the URL
* If HTML, just stores it in RAM and parses it for OG meta tags
* Download any media OG meta tags to the media repo, and refer to them in the OG via mxc:// URIs.
* If a media filetype we know we can thumbnail: store it on disk, and hand it to the thumbnailer. Generate OG meta tags from the thumbnailer contents.
* Otherwise, don't bother downloading further.

View file

@ -22,9 +22,11 @@ import re
import shutil
import sys
import traceback
from typing import TYPE_CHECKING, Any, Dict, Generator, Iterable, Optional, Union
from typing import TYPE_CHECKING, Dict, Generator, Iterable, Optional, Union
from urllib import parse as urlparse
import attr
from twisted.internet.error import DNSLookupError
from twisted.web.server import Request
@ -42,6 +44,7 @@ from synapse.metrics.background_process_metrics import run_as_background_process
from synapse.rest.media.v1._base import get_filename_from_headers
from synapse.rest.media.v1.media_storage import MediaStorage
from synapse.rest.media.v1.oembed import OEmbedError, OEmbedProvider
from synapse.types import JsonDict
from synapse.util import json_encoder
from synapse.util.async_helpers import ObservableDeferred
from synapse.util.caches.expiringcache import ExpiringCache
@ -71,7 +74,43 @@ OG_TAG_VALUE_MAXLEN = 1000
ONE_HOUR = 60 * 60 * 1000
@attr.s(slots=True, frozen=True, auto_attribs=True)
class MediaInfo:
"""
Information parsed from downloading media being previewed.
"""
# The Content-Type header of the response.
media_type: str
# The length (in bytes) of the downloaded media.
media_length: int
# The media filename, according to the server. This is parsed from the
# returned headers, if possible.
download_name: Optional[str]
# The time of the preview.
created_ts_ms: int
# Information from the media storage provider about where the file is stored
# on disk.
filesystem_id: str
filename: str
# The URI being previewed.
uri: str
# The HTTP response code.
response_code: int
# The timestamp (in milliseconds) of when this preview expires.
expires: int
# The ETag header of the response.
etag: Optional[str]
class PreviewUrlResource(DirectServeJsonResource):
"""
Generating URL previews is a complicated task which many potential pitfalls.
See docs/development/url_previews.md for discussion of the design and
algorithm followed in this module.
"""
isLeaf = True
def __init__(
@ -219,18 +258,17 @@ class PreviewUrlResource(DirectServeJsonResource):
logger.debug("got media_info of '%s'", media_info)
if _is_media(media_info["media_type"]):
file_id = media_info["filesystem_id"]
if _is_media(media_info.media_type):
file_id = media_info.filesystem_id
dims = await self.media_repo._generate_thumbnails(
None, file_id, file_id, media_info["media_type"], url_cache=True
None, file_id, file_id, media_info.media_type, url_cache=True
)
og = {
"og:description": media_info["download_name"],
"og:image": "mxc://%s/%s"
% (self.server_name, media_info["filesystem_id"]),
"og:image:type": media_info["media_type"],
"matrix:image:size": media_info["media_length"],
"og:description": media_info.download_name,
"og:image": f"mxc://{self.server_name}/{media_info.filesystem_id}",
"og:image:type": media_info.media_type,
"matrix:image:size": media_info.media_length,
}
if dims:
@ -240,14 +278,14 @@ class PreviewUrlResource(DirectServeJsonResource):
logger.warning("Couldn't get dims for %s" % url)
# define our OG response for this media
elif _is_html(media_info["media_type"]):
elif _is_html(media_info.media_type):
# TODO: somehow stop a big HTML tree from exploding synapse's RAM
with open(media_info["filename"], "rb") as file:
with open(media_info.filename, "rb") as file:
body = file.read()
encoding = get_html_media_encoding(body, media_info["media_type"])
og = decode_and_calc_og(body, media_info["uri"], encoding)
encoding = get_html_media_encoding(body, media_info.media_type)
og = decode_and_calc_og(body, media_info.uri, encoding)
# pre-cache the image for posterity
# FIXME: it might be cleaner to use the same flow as the main /preview_url
@ -255,14 +293,14 @@ class PreviewUrlResource(DirectServeJsonResource):
# just rely on the caching on the master request to speed things up.
if "og:image" in og and og["og:image"]:
image_info = await self._download_url(
_rebase_url(og["og:image"], media_info["uri"]), user
_rebase_url(og["og:image"], media_info.uri), user
)
if _is_media(image_info["media_type"]):
if _is_media(image_info.media_type):
# TODO: make sure we don't choke on white-on-transparent images
file_id = image_info["filesystem_id"]
file_id = image_info.filesystem_id
dims = await self.media_repo._generate_thumbnails(
None, file_id, file_id, image_info["media_type"], url_cache=True
None, file_id, file_id, image_info.media_type, url_cache=True
)
if dims:
og["og:image:width"] = dims["width"]
@ -270,12 +308,11 @@ class PreviewUrlResource(DirectServeJsonResource):
else:
logger.warning("Couldn't get dims for %s", og["og:image"])
og["og:image"] = "mxc://%s/%s" % (
self.server_name,
image_info["filesystem_id"],
)
og["og:image:type"] = image_info["media_type"]
og["matrix:image:size"] = image_info["media_length"]
og[
"og:image"
] = f"mxc://{self.server_name}/{image_info.filesystem_id}"
og["og:image:type"] = image_info.media_type
og["matrix:image:size"] = image_info.media_length
else:
del og["og:image"]
else:
@ -301,17 +338,17 @@ class PreviewUrlResource(DirectServeJsonResource):
# store OG in history-aware DB cache
await self.store.store_url_cache(
url,
media_info["response_code"],
media_info["etag"],
media_info["expires"] + media_info["created_ts"],
media_info.response_code,
media_info.etag,
media_info.expires + media_info.created_ts_ms,
jsonog,
media_info["filesystem_id"],
media_info["created_ts"],
media_info.filesystem_id,
media_info.created_ts_ms,
)
return jsonog.encode("utf8")
async def _download_url(self, url: str, user: str) -> Dict[str, Any]:
async def _download_url(self, url: str, user: str) -> MediaInfo:
# TODO: we should probably honour robots.txt... except in practice
# we're most likely being explicitly triggered by a human rather than a
# bot, so are we really a robot?
@ -423,18 +460,18 @@ class PreviewUrlResource(DirectServeJsonResource):
# therefore not expire it.
raise
return {
"media_type": media_type,
"media_length": length,
"download_name": download_name,
"created_ts": time_now_ms,
"filesystem_id": file_id,
"filename": fname,
"uri": uri,
"response_code": code,
"expires": expires,
"etag": etag,
}
return MediaInfo(
media_type=media_type,
media_length=length,
download_name=download_name,
created_ts_ms=time_now_ms,
filesystem_id=file_id,
filename=fname,
uri=uri,
response_code=code,
expires=expires,
etag=etag,
)
def _start_expire_url_cache_data(self):
return run_as_background_process(
@ -580,7 +617,7 @@ def get_html_media_encoding(body: bytes, content_type: str) -> str:
def decode_and_calc_og(
body: bytes, media_uri: str, request_encoding: Optional[str] = None
) -> Dict[str, Optional[str]]:
) -> JsonDict:
"""
Calculate metadata for an HTML document.