Merge pull request #688 from matrix-org/matthew/preview_urls

URL previewing support
2024-12-15 14:33:50 +01:00 · 2016-04-11 10:40:29 +01:00 · 2016-04-11 10:40:29 +01:00 · 4bd3d25218
commit 4bd3d25218
parent 79fc4ff6f9 5ffacc5e84
13 changed files with 952 additions and 10 deletions
--- a/README.rst
+++ b/README.rst
@ -104,7 +104,7 @@ Installing prerequisites on Ubuntu or Debian::
    sudo apt-get install build-essential python2.7-dev libffi-dev \
                         python-pip python-setuptools sqlite3 \
-                         libssl-dev python-virtualenv libjpeg-dev
+                         libssl-dev python-virtualenv libjpeg-dev libxslt1-dev
 Installing prerequisites on ArchLinux::
@ -557,6 +557,23 @@ as the primary means of identity and E2E encryption is not complete. As such,
 we are running a single identity server (https://matrix.org) at the current
 time.
 URL Previews
 ============
 Synapse 0.15.0 introduces an experimental new API for previewing URLs at
 /_matrix/media/r0/preview_url.  This is disabled by default.  To turn it on
 you must enable the `url_preview_enabled: True` config parameter and explicitly
 specify the IP ranges that Synapse is not allowed to spider for previewing in
 the `url_preview_ip_range_blacklist` configuration parameter.  This is critical
 from a security perspective to stop arbitrary Matrix users spidering 'internal'
 URLs on your network.  At the very least we recommend that your loopback and
 RFC1918 IP addresses are blacklisted.
 This also requires the optional lxml and netaddr python dependencies to be
 installed.
 Password reset
 ==============
--- a/UPGRADE.rst
+++ b/UPGRADE.rst
@ -30,6 +30,14 @@ running:
    python synapse/python_dependencies.py | xargs -n1 pip install
 Upgrading to v0.15.0
 ====================
 If you want to use the new URL previewing API (/_matrix/media/r0/preview_url)
 then you have to explicitly enable it in the config and update your
 dependencies.  See README.rst for details.
 Upgrading to v0.11.0
 ====================
--- a/docs/url_previews.rst
+++ b/docs/url_previews.rst
@ -0,0 +1,74 @@
 URL Previews
 ============
 Design notes on a URL previewing service for Matrix:
 Options are:
 1. Have an AS which listens for URLs, downloads them, and inserts an event that describes their metadata.
   * Pros:
     * Decouples the implementation entirely from Synapse.
     * Uses existing Matrix events & content repo to store the metadata.
   * Cons:
     * Which AS should provide this service for a room, and why should you trust it?
     * Doesn't work well with E2E; you'd have to cut the AS into every room
     * the AS would end up subscribing to every room anyway.
 2. Have a generic preview API (nothing to do with Matrix) that provides a previewing service:
   * Pros:
     * Simple and flexible; can be used by any clients at any point
   * Cons:
     * If each HS provides one of these independently, all the HSes in a room may needlessly DoS the target URI
     * We need somewhere to store the URL metadata rather than just using Matrix itself
     * We can't piggyback on matrix to distribute the metadata between HSes.
 3. Make the synapse of the sending user responsible for spidering the URL and inserting an event asynchronously which describes the metadata.
   * Pros:
     * Works transparently for all clients
     * Piggy-backs nicely on using Matrix for distributing the metadata.
     * No confusion as to which AS
   * Cons:
     * Doesn't work with E2E
     * We might want to decouple the implementation of the spider from the HS, given spider behaviour can be quite complicated and evolve much more rapidly than the HS.  It's more like a bot than a core part of the server.
 4. Make the sending client use the preview API and insert the event itself when successful.
   * Pros:
      * Works well with E2E
      * No custom server functionality
      * Lets the client customise the preview that they send (like on FB)
   * Cons:
      * Entirely specific to the sending client, whereas it'd be nice if /any/ URL was correctly previewed if clients support it.
 5. Have the option of specifying a shared (centralised) previewing service used by a room, to avoid all the different HSes in the room DoSing the target.
 Best solution is probably a combination of both 2 and 4.
 * Sending clients do their best to create and send a preview at the point of sending the message, perhaps delaying the message until the preview is computed?  (This also lets the user validate the preview before sending)
 * Receiving clients have the option of going and creating their own preview if one doesn't arrive soon enough (or if the original sender didn't create one)
 This is a bit magical though in that the preview could come from two entirely different sources - the sending HS or your local one.  However, this can always be exposed to users: "Generate your own URL previews if none are available?"
 This is tantamount also to senders calculating their own thumbnails for sending in advance of the main content - we are trusting the sender not to lie about the content in the thumbnail.  Whereas currently thumbnails are calculated by the receiving homeserver to avoid this attack.
 However, this kind of phishing attack does exist whether we let senders pick their thumbnails or not, in that a malicious sender can send normal text messages around the attachment claiming it to be legitimate.  We could rely on (future) reputation/abuse management to punish users who phish (be it with bogus metadata or bogus descriptions).   Bogus metadata is particularly bad though, especially if it's avoidable.
 As a first cut, let's do #2 and have the receiver hit the API to calculate its own previews (as it does currently for image thumbnails).  We can then extend/optimise this to option 4 as a special extra if needed.
 API
 ---
 GET /_matrix/media/r0/preview_url?url=http://wherever.com
 200 OK
 {
    "og:type"        : "article",
    "og:url"         : "https://twitter.com/matrixdotorg/status/684074366691356672",
    "og:title"       : "Matrix on Twitter",
    "og:image"       : "https://pbs.twimg.com/profile_images/500400952029888512/yI0qtFi7_400x400.png",
    "og:description" : "“Synapse 0.12 is out! Lots of polishing, performance &amp;amp; bugfixes: /sync API, /r0 prefix, fulltext search, 3PID invites https://t.co/5alhXLLEGP”",
    "og:site_name"   : "Twitter"
 }
 * Downloads the URL
  * If HTML, just stores it in RAM and parses it for OG meta tags
    * Download any media OG meta tags to the media repo, and refer to them in the OG via mxc:// URIs.
  * If a media filetype we know we can thumbnail: store it on disk, and hand it to the thumbnailer. Generate OG meta tags from the thumbnailer contents.
  * Otherwise, don't bother downloading further.
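
The "parse it for OG meta tags" step above can be illustrated with a minimal sketch. It is not the code in this PR: the PR uses lxml and XPath, whereas this sketch assumes Python 3 and swaps in the standard library's `html.parser` so that it runs without extra dependencies. The names `OpenGraphParser` and `extract_og` are hypothetical.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..." content="..."> tags into a dict.

    Hypothetical helper for illustration; the PR itself uses lxml.
    """

    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags (<meta ... />) are also routed here by HTMLParser.
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property") or ""
        content = attrs.get("content")
        if prop.startswith("og:") and content is not None:
            self.og[prop] = content


def extract_og(html_text):
    """Return a dict of Open Graph properties found in an HTML string."""
    parser = OpenGraphParser()
    parser.feed(html_text)
    return parser.og


if __name__ == "__main__":
    page = (
        '<html><head>'
        '<meta property="og:title" content="Matrix on Twitter">'
        '<meta property="og:type" content="article"/>'
        '</head></html>'
    )
    print(extract_og(page))
```

Only `og:` properties are collected; other meta tags such as `name="description"` are ignored, matching the XPath filter used in `_calc_og`.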
--- a/synapse/config/repository.py
+++ b/synapse/config/repository.py
@ -16,6 +16,8 @@
 from ._base import Config
 from collections import namedtuple
 import sys
 ThumbnailRequirement = namedtuple(
    "ThumbnailRequirement", ["width", "height", "method", "media_type"]
 )
@ -23,7 +25,7 @@ ThumbnailRequirement = namedtuple(
 def parse_thumbnail_requirements(thumbnail_sizes):
    """ Takes a list of dictionaries with "width", "height", and "method" keys
-    and creates a map from image media types to the thumbnail size, thumnailing
+    and creates a map from image media types to the thumbnail size, thumbnailing
    method, and thumbnail media type to precalculate
    Args:
@ -53,12 +55,25 @@ class ContentRepositoryConfig(Config):
    def read_config(self, config):
        self.max_upload_size = self.parse_size(config["max_upload_size"])
        self.max_image_pixels = self.parse_size(config["max_image_pixels"])
        self.max_spider_size = self.parse_size(config["max_spider_size"])
        self.media_store_path = self.ensure_directory(config["media_store_path"])
        self.uploads_path = self.ensure_directory(config["uploads_path"])
        self.dynamic_thumbnails = config["dynamic_thumbnails"]
        self.thumbnail_requirements = parse_thumbnail_requirements(
            config["thumbnail_sizes"]
        )
        self.url_preview_enabled = config["url_preview_enabled"]
        if self.url_preview_enabled:
            try:
                from netaddr import IPSet
                if "url_preview_ip_range_blacklist" in config:
                    self.url_preview_ip_range_blacklist = IPSet(
                        config["url_preview_ip_range_blacklist"]
                    )
                if "url_preview_url_blacklist" in config:
                    self.url_preview_url_blacklist = config["url_preview_url_blacklist"]
            except ImportError:
                sys.stderr.write("\nmissing netaddr dep - disabling preview_url API\n")
    def default_config(self, **kwargs):
        media_store = self.default_path("media_store")
@ -80,7 +95,7 @@ class ContentRepositoryConfig(Config):
        # the resolution requested by the client. If true then whenever
        # a new resolution is requested by the client the server will
        # generate a new thumbnail. If false the server will pick a thumbnail
-        # from a precalcualted list.
+        # from a precalculated list.
        dynamic_thumbnails: false
        # List of thumbnail to precalculate when an image is uploaded.
@ -100,4 +115,62 @@ class ContentRepositoryConfig(Config):
        - width: 800
          height: 600
          method: scale
        # Is the preview URL API enabled?  If enabled, you *must* specify
        # an explicit url_preview_ip_range_blacklist of IPs that the spider is
        # denied from accessing.
        url_preview_enabled: False
        # List of IP address CIDR ranges that the URL preview spider is denied
        # from accessing.  There are no defaults: you must explicitly
        # specify a list for URL previewing to work.  You should specify any
        # internal services in your network that you do not want synapse to try
        # to connect to, otherwise anyone in any Matrix room could cause your
        # synapse to issue arbitrary GET requests to your internal services,
        # causing serious security issues.
        #
        # url_preview_ip_range_blacklist:
        # - '127.0.0.0/8'
        # - '10.0.0.0/8'
        # - '172.16.0.0/12'
        # - '192.168.0.0/16'
        # Optional list of URL matches that the URL preview spider is
        # denied from accessing.  You should use url_preview_ip_range_blacklist
        # in preference to this, otherwise someone could define a public DNS
        # entry that points to a private IP address and circumvent the blacklist.
        # This is more useful if there is an entire shape of URL that you know
        # you will never want synapse to try to spider.
        #
        # Each list entry is a dictionary of url component attributes as returned
        # by urlparse.urlsplit as applied to the absolute form of the URL.  See
        # https://docs.python.org/2/library/urlparse.html#urlparse.urlsplit
        # The values of the dictionary are treated as a filename match pattern
        # applied to that component of URLs, unless they start with a ^ in which
        # case they are treated as a regular expression match.  If all the
        # specified component matches for a given list item succeed, the URL is
        # blacklisted.
        #
        # url_preview_url_blacklist:
        # # blacklist any URL with a username in its URI
        # - username: '*'
        #
        # # blacklist all *.google.com URLs
        # - netloc: 'google.com'
        # - netloc: '*.google.com'
        #
        # # blacklist all plain HTTP URLs
        # - scheme: 'http'
        #
        # # blacklist http(s)://www.acme.com/foo
        # - netloc: 'www.acme.com'
        #   path: '/foo'
        #
        # # blacklist any URL with a literal IPv4 address
        # - netloc: '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$'
        # The largest allowed URL preview spidering size in bytes
        max_spider_size: "10M"
        """ % locals()
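
The matching rules described in the `url_preview_url_blacklist` comment can be sketched as a standalone function. This is an illustration under assumptions, not the PR's code: it targets Python 3 (so it imports `urlsplit` from `urllib.parse` rather than `urlparse`), and the names `url_is_blacklisted` and `_component_matches` are hypothetical.

```python
import fnmatch
import re
from urllib.parse import urlsplit


def _component_matches(value, pattern):
    """Match one URL component against one blacklist pattern."""
    if value is None:
        # e.g. 'username' is None when the URL carries no credentials
        return False
    if pattern.startswith("^"):
        # Patterns starting with ^ are treated as regular expressions
        return re.match(pattern, value) is not None
    # Everything else is a filename-style (glob) pattern
    return fnmatch.fnmatch(value, pattern)


def url_is_blacklisted(url, blacklist):
    """Return True if every component pattern of any one entry matches."""
    parts = urlsplit(url)
    for entry in blacklist:
        if all(
            _component_matches(getattr(parts, attrib), pattern)
            for attrib, pattern in entry.items()
        ):
            return True
    return False


if __name__ == "__main__":
    blacklist = [
        {"username": "*"},
        {"netloc": "*.google.com"},
        {"netloc": "www.acme.com", "path": "/foo"},
    ]
    print(url_is_blacklisted("https://mail.google.com/", blacklist))
    print(url_is_blacklisted("https://www.acme.com/bar", blacklist))
```

Each list entry acts as an AND over its components, and the list as a whole acts as an OR over its entries, which is why `google.com` and `*.google.com` are written as two separate entries in the sample config.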
--- a/synapse/http/client.py
+++ b/synapse/http/client.py
@ -15,17 +15,24 @@
 from OpenSSL import SSL
 from OpenSSL.SSL import VERIFY_NONE
-from synapse.api.errors import CodeMessageException
+from synapse.api.errors import (
    CodeMessageException, SynapseError, Codes,
 )
 from synapse.util.logcontext import preserve_context_over_fn
 import synapse.metrics
 from synapse.http.endpoint import SpiderEndpoint
 from canonicaljson import encode_canonical_json
-from twisted.internet import defer, reactor, ssl
+from twisted.internet import defer, reactor, ssl, protocol
 from twisted.internet.endpoints import SSL4ClientEndpoint, TCP4ClientEndpoint
 from twisted.web.client import (
-    Agent, readBody, FileBodyProducer, PartialDownloadError,
+    BrowserLikeRedirectAgent, ContentDecoderAgent, GzipDecoder, Agent,
    readBody, FileBodyProducer, PartialDownloadError,
 )
 from twisted.web.http import PotentialDataLoss
 from twisted.web.http_headers import Headers
 from twisted.web._newclient import ResponseDone
 from StringIO import StringIO
@ -238,6 +245,107 @@ class SimpleHttpClient(object):
        else:
            raise CodeMessageException(response.code, body)
    # XXX: FIXME: This is horribly copy-pasted from matrixfederationclient.
    # The two should be factored out.
    @defer.inlineCallbacks
    def get_file(self, url, output_stream, max_size=None):
        """GETs a file from a given URL
        Args:
            url (str): The URL to GET
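            output_stream (file): File to write the response body to.
            max_size (int|None): If set, abort with a 502 error once the
                response body reaches this many bytes.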
        Returns:
            A (int,dict,string,int) tuple of the file length, dict of the response
            headers, absolute URI of the response and HTTP response code.
        """
        response = yield self.request(
            "GET",
            url.encode("ascii"),
            headers=Headers({
                b"User-Agent": [self.user_agent],
            })
        )
        headers = dict(response.headers.getAllRawHeaders())
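        # Header values are lists of strings, so compare the first value as an
        # int, and skip the check entirely when no limit was given.
        if (
            max_size is not None and
            'Content-Length' in headers and
            int(headers['Content-Length'][0]) > max_size
        ):
            logger.warn("Requested URL is too large > %r bytes" % (max_size,))
            raise SynapseError(
                502,
                "Requested file is too large > %r bytes" % (max_size,),
                Codes.TOO_LARGE,
            )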
        if response.code > 299:
            logger.warn("Got %d when downloading %s" % (response.code, url))
            raise SynapseError(
                502,
                "Got error %d" % (response.code,),
                Codes.UNKNOWN,
            )
        # TODO: if our Content-Type is HTML or something, just read the first
        # N bytes into RAM rather than saving it all to disk only to read it
        # straight back in again
        try:
            length = yield preserve_context_over_fn(
                _readBodyToFile,
                response, output_stream, max_size
            )
        except Exception as e:
            logger.exception("Failed to download body")
            raise SynapseError(
                502,
                ("Failed to download remote body: %s" % e),
                Codes.UNKNOWN,
            )
        defer.returnValue((length, headers, response.request.absoluteURI, response.code))
 # XXX: FIXME: This is horribly copy-pasted from matrixfederationclient.
 # The two should be factored out.
 class _ReadBodyToFileProtocol(protocol.Protocol):
    def __init__(self, stream, deferred, max_size):
        self.stream = stream
        self.deferred = deferred
        self.length = 0
        self.max_size = max_size
    def dataReceived(self, data):
        self.stream.write(data)
        self.length += len(data)
        if self.max_size is not None and self.length >= self.max_size:
            self.deferred.errback(SynapseError(
                502,
                "Requested file is too large > %r bytes" % (self.max_size,),
                Codes.TOO_LARGE,
            ))
            self.deferred = defer.Deferred()
            self.transport.loseConnection()
    def connectionLost(self, reason):
        if reason.check(ResponseDone):
            self.deferred.callback(self.length)
        elif reason.check(PotentialDataLoss):
            # stolen from https://github.com/twisted/treq/pull/49/files
            # http://twistedmatrix.com/trac/ticket/4840
            self.deferred.callback(self.length)
        else:
            self.deferred.errback(reason)
 # XXX: FIXME: This is horribly copy-pasted from matrixfederationclient.
 # The two should be factored out.
 def _readBodyToFile(response, stream, max_size):
    d = defer.Deferred()
    response.deliverBody(_ReadBodyToFileProtocol(stream, d, max_size))
    return d
 class CaptchaServerHttpClient(SimpleHttpClient):
    """
@ -269,6 +377,59 @@ class CaptchaServerHttpClient(SimpleHttpClient):
            defer.returnValue(e.response)
 class SpiderEndpointFactory(object):
    def __init__(self, hs):
        self.blacklist = hs.config.url_preview_ip_range_blacklist
        self.policyForHTTPS = hs.get_http_client_context_factory()
    def endpointForURI(self, uri):
        logger.info("Getting endpoint for %s", uri.toBytes())
        if uri.scheme == "http":
            return SpiderEndpoint(
                reactor, uri.host, uri.port, self.blacklist,
                endpoint=TCP4ClientEndpoint,
                endpoint_kw_args={
                    'timeout': 15
                },
            )
        elif uri.scheme == "https":
            tlsPolicy = self.policyForHTTPS.creatorForNetloc(uri.host, uri.port)
            return SpiderEndpoint(
                reactor, uri.host, uri.port, self.blacklist,
                endpoint=SSL4ClientEndpoint,
                endpoint_kw_args={
                    'sslContextFactory': tlsPolicy,
                    'timeout': 15
                },
            )
        else:
            logger.warn("Can't get endpoint for unrecognised scheme %s", uri.scheme)
 class SpiderHttpClient(SimpleHttpClient):
    """
    Separate HTTP client for spidering arbitrary URLs.
    Special in that it follows redirects and has a UA that looks
    like a browser.
    Used by the preview_url endpoint in the content repo.
    """
    def __init__(self, hs):
        SimpleHttpClient.__init__(self, hs)
        # clobber the base class's agent and UA:
        self.agent = ContentDecoderAgent(
            BrowserLikeRedirectAgent(
                Agent.usingEndpointFactory(
                    reactor,
                    SpiderEndpointFactory(hs)
                )
            ), [('gzip', GzipDecoder)]
        )
        # We could look like Chrome:
        # self.user_agent = ("Mozilla/5.0 (%s) (KHTML, like Gecko)
        #                   Chrome Safari" % hs.version_string)
 def encode_urlencode_args(args):
    return {k: encode_urlencode_arg(v) for k, v in args.items()}
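
The size cap enforced by `_ReadBodyToFileProtocol.dataReceived` can be shown without Twisted. The following is a simplified sketch, assuming Python 3 and a plain iterable of byte chunks in place of a Twisted transport; `copy_with_limit` and `BodyTooLargeError` are hypothetical names, the latter standing in for `SynapseError(502, ..., Codes.TOO_LARGE)`.

```python
import io


class BodyTooLargeError(Exception):
    """Hypothetical stand-in for SynapseError(502, ..., Codes.TOO_LARGE)."""


def copy_with_limit(chunks, output_stream, max_size=None):
    """Write chunks to output_stream, aborting once max_size bytes are reached.

    Mirrors the order of operations in dataReceived: write first, then check
    the running total with >=.
    """
    length = 0
    for chunk in chunks:
        output_stream.write(chunk)
        length += len(chunk)
        if max_size is not None and length >= max_size:
            raise BodyTooLargeError(
                "Requested file is too large > %r bytes" % (max_size,)
            )
    return length


if __name__ == "__main__":
    buf = io.BytesIO()
    print(copy_with_limit([b"ab", b"cd"], buf, max_size=10))
```

Because the check runs after the write, the stream may already hold the offending chunk when the error is raised, so a caller would need to discard the partial file.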
--- a/synapse/http/endpoint.py
+++ b/synapse/http/endpoint.py
@ -75,6 +75,37 @@ def matrix_federation_endpoint(reactor, destination, ssl_context_factory=None,
        return transport_endpoint(reactor, domain, port, **endpoint_kw_args)
 class SpiderEndpoint(object):
    """An endpoint which refuses to connect to blacklisted IP addresses
    Implements twisted.internet.interfaces.IStreamClientEndpoint.
    """
    def __init__(self, reactor, host, port, blacklist,
                 endpoint=TCP4ClientEndpoint, endpoint_kw_args={}):
        self.reactor = reactor
        self.host = host
        self.port = port
        self.blacklist = blacklist
        self.endpoint = endpoint
        self.endpoint_kw_args = endpoint_kw_args
    @defer.inlineCallbacks
    def connect(self, protocolFactory):
        address = yield self.reactor.resolve(self.host)
        from netaddr import IPAddress
        if IPAddress(address) in self.blacklist:
            raise ConnectError(
                "Refusing to spider blacklisted IP address %s" % address
            )
        logger.info("Connecting to %s:%s", address, self.port)
        endpoint = self.endpoint(
            self.reactor, address, self.port, **self.endpoint_kw_args
        )
        connection = yield endpoint.connect(protocolFactory)
        defer.returnValue(connection)
 class SRVClientEndpoint(object):
    """An endpoint which looks up SRV records for a service.
    Cycles through the list of servers starting with each call to connect
@ -120,7 +151,7 @@ class SRVClientEndpoint(object):
                return self.default_server
            else:
                raise ConnectError(
-                    "Not server available for %s", self.service_name
+                    "Not server available for %s" % self.service_name
                )
        min_priority = self.servers[0].priority
@ -174,7 +205,7 @@ def resolve_service(service_name, dns_client=client, cache=SERVER_CACHE, clock=t
                and answers[0].type == dns.SRV
                and answers[0].payload
                and answers[0].payload.target == dns.Name('.')):
-            raise ConnectError("Service %s unavailable", service_name)
+            raise ConnectError("Service %s unavailable" % service_name)
        for answer in answers:
            if answer.type != dns.SRV or not answer.payload:
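
The blacklist check in `SpiderEndpoint.connect` amounts to testing whether the resolved address falls inside any configured CIDR range. The sketch below illustrates that idea only: it assumes Python 3 and uses the standard library's `ipaddress` module in place of netaddr's `IPSet`, and `ip_is_blacklisted` is a hypothetical name.

```python
import ipaddress

# Loopback plus the RFC1918 ranges, as recommended in the README
RECOMMENDED_BLACKLIST = [
    "127.0.0.0/8",
    "10.0.0.0/8",
    "172.16.0.0/12",
    "192.168.0.0/16",
]


def ip_is_blacklisted(address, blacklist_cidrs):
    """Return True if the already-resolved address is in any listed range."""
    ip = ipaddress.ip_address(address)
    return any(ip in ipaddress.ip_network(cidr) for cidr in blacklist_cidrs)


if __name__ == "__main__":
    print(ip_is_blacklisted("192.168.1.10", RECOMMENDED_BLACKLIST))
    print(ip_is_blacklisted("8.8.8.8", RECOMMENDED_BLACKLIST))
```

The check is applied to the resolved address rather than the hostname, which is what stops a public DNS name that points at a private IP from slipping past the blacklist.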
--- a/synapse/python_dependencies.py
+++ b/synapse/python_dependencies.py
@ -41,7 +41,11 @@ REQUIREMENTS = {
 CONDITIONAL_REQUIREMENTS = {
    "web_client": {
        "matrix_angular_sdk>=0.6.8": ["syweb>=0.6.8"],
-    }
+    },
    "preview_url": {
        "lxml>=3.6.0": ["lxml"],
        "netaddr>=0.7.18": ["netaddr"],
    },
 }
--- a/synapse/rest/media/v1/base_resource.py
+++ b/synapse/rest/media/v1/base_resource.py
@ -72,6 +72,7 @@ class BaseMediaResource(Resource):
        self.store = hs.get_datastore()
        self.max_upload_size = hs.config.max_upload_size
        self.max_image_pixels = hs.config.max_image_pixels
        self.max_spider_size = hs.config.max_spider_size
        self.filepaths = filepaths
        self.version_string = hs.version_string
        self.downloads = {}
--- a/synapse/rest/media/v1/media_repository.py
+++ b/synapse/rest/media/v1/media_repository.py
@ -17,6 +17,7 @@ from .upload_resource import UploadResource
 from .download_resource import DownloadResource
 from .thumbnail_resource import ThumbnailResource
 from .identicon_resource import IdenticonResource
 from .preview_url_resource import PreviewUrlResource
 from .filepath import MediaFilePaths
 from twisted.web.resource import Resource
@ -78,3 +79,9 @@ class MediaRepositoryResource(Resource):
        self.putChild("download", DownloadResource(hs, filepaths))
        self.putChild("thumbnail", ThumbnailResource(hs, filepaths))
        self.putChild("identicon", IdenticonResource())
        if hs.config.url_preview_enabled:
            try:
                self.putChild("preview_url", PreviewUrlResource(hs, filepaths))
            except Exception as e:
                logger.warn("Failed to mount preview_url")
                logger.exception(e)
--- a/synapse/rest/media/v1/preview_url_resource.py
+++ b/synapse/rest/media/v1/preview_url_resource.py
@ -0,0 +1,462 @@
 # -*- coding: utf-8 -*-
 # Copyright 2016 OpenMarket Ltd
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 from .base_resource import BaseMediaResource
 from twisted.web.server import NOT_DONE_YET
 from twisted.internet import defer
 from urlparse import urlparse, urlsplit, urlunparse
 from synapse.api.errors import (
    SynapseError, Codes,
 )
 from synapse.util.stringutils import random_string
 from synapse.util.caches.expiringcache import ExpiringCache
 from synapse.http.client import SpiderHttpClient
 from synapse.http.server import (
    request_handler, respond_with_json_bytes
 )
 from synapse.util.async import ObservableDeferred
 from synapse.util.stringutils import is_ascii
 import os
 import re
 import fnmatch
 import cgi
 import ujson as json
 import logging
 logger = logging.getLogger(__name__)
 try:
    from lxml import html
 except ImportError:
    pass
 class PreviewUrlResource(BaseMediaResource):
    isLeaf = True
    def __init__(self, hs, filepaths):
        try:
            if html:
                pass
        except:
            raise RuntimeError("Disabling PreviewUrlResource as lxml not available")
        if not hasattr(hs.config, "url_preview_ip_range_blacklist"):
            logger.warn(
                "For security, you must specify an explicit target IP address "
                "blacklist in url_preview_ip_range_blacklist for url previewing "
                "to work"
            )
            raise RuntimeError(
                "Disabling PreviewUrlResource as "
                "url_preview_ip_range_blacklist not specified"
            )
        BaseMediaResource.__init__(self, hs, filepaths)
        self.client = SpiderHttpClient(hs)
        if hasattr(hs.config, "url_preview_url_blacklist"):
            self.url_preview_url_blacklist = hs.config.url_preview_url_blacklist
        # simple memory cache mapping urls to OG metadata
        self.cache = ExpiringCache(
            cache_name="url_previews",
            clock=self.clock,
            # don't spider URLs more often than once an hour
            expiry_ms=60 * 60 * 1000,
        )
        self.cache.start()
        self.downloads = {}
    def render_GET(self, request):
        self._async_render_GET(request)
        return NOT_DONE_YET
    @request_handler
    @defer.inlineCallbacks
    def _async_render_GET(self, request):
        # XXX: if get_user_by_req fails, what should we do in an async render?
        requester = yield self.auth.get_user_by_req(request)
        url = request.args.get("url")[0]
        if "ts" in request.args:
            ts = int(request.args.get("ts")[0])
        else:
            ts = self.clock.time_msec()
        # impose the URL pattern blacklist
        if hasattr(self, "url_preview_url_blacklist"):
            url_tuple = urlsplit(url)
            for entry in self.url_preview_url_blacklist:
                match = True
                for attrib in entry:
                    pattern = entry[attrib]
                    value = getattr(url_tuple, attrib)
                    logger.debug((
                        "Matching attrib '%s' with value '%s' against"
                        " pattern '%s'"
                    ) % (attrib, value, pattern))
                    if value is None:
                        match = False
                        continue
                    if pattern.startswith('^'):
                        if not re.match(pattern, getattr(url_tuple, attrib)):
                            match = False
                            continue
                    else:
                        if not fnmatch.fnmatch(getattr(url_tuple, attrib), pattern):
                            match = False
                            continue
                if match:
                    logger.warn(
                        "URL %s blocked by url_blacklist entry %s", url, entry
                    )
                    raise SynapseError(
                        403, "URL blocked by url pattern blacklist entry",
                        Codes.UNKNOWN
                    )
        # first check the memory cache - good to handle all the clients on this
        # HS thundering away to preview the same URL at the same time.
        og = self.cache.get(url)
        if og:
            respond_with_json_bytes(request, 200, json.dumps(og), send_cors=True)
            return
        # then check the URL cache in the DB (which will also provide us with
        # historical previews, if we have any)
        cache_result = yield self.store.get_url_cache(url, ts)
        if (
            cache_result and
            cache_result["download_ts"] + cache_result["expires"] > ts and
            cache_result["response_code"] / 100 == 2
        ):
            respond_with_json_bytes(
                request, 200, cache_result["og"].encode('utf-8'),
                send_cors=True
            )
            return
        # Ensure only one download for a given URL is active at a time
        download = self.downloads.get(url)
        if download is None:
            download = self._download_url(url, requester.user)
            download = ObservableDeferred(
                download,
                consumeErrors=True
            )
            self.downloads[url] = download
            @download.addBoth
            def callback(media_info):
                del self.downloads[url]
                return media_info
        media_info = yield download.observe()
        # FIXME: we should probably update our cache now anyway, so that
        # even if the OG calculation raises, we don't keep hammering on the
        # remote server.  For now, leave it uncached to aid debugging OG
        # calculation problems
        logger.debug("got media_info of '%s'" % media_info)
        if self._is_media(media_info['media_type']):
            dims = yield self._generate_local_thumbnails(
                media_info['filesystem_id'], media_info
            )
            og = {
                "og:description": media_info['download_name'],
                "og:image": "mxc://%s/%s" % (
                    self.server_name, media_info['filesystem_id']
                ),
                "og:image:type": media_info['media_type'],
                "matrix:image:size": media_info['media_length'],
            }
            if dims:
                og["og:image:width"] = dims['width']
                og["og:image:height"] = dims['height']
            else:
                logger.warn("Couldn't get dims for %s" % url)
            # define our OG response for this media
        elif self._is_html(media_info['media_type']):
            # TODO: somehow stop a big HTML tree from exploding synapse's RAM
            try:
                tree = html.parse(media_info['filename'])
                og = yield self._calc_og(tree, media_info, requester)
            except UnicodeDecodeError:
                # XXX: evil evil bodge
                # Empirically, sites like google.com mix Latin-1 and utf-8
                # encodings in the same page.  The rogue Latin-1 characters
                # cause lxml to choke with a UnicodeDecodeError, so if we
                # see this we go and do a manual decode of the HTML before
                # handing it to lxml as utf-8 encoding, counter-intuitively,
                # which seems to make it happier...
                file = open(media_info['filename'])
                body = file.read()
                file.close()
                tree = html.fromstring(body.decode('utf-8', 'ignore'))
                og = yield self._calc_og(tree, media_info, requester)
        else:
            logger.warn("Failed to find any OG data in %s", url)
            og = {}
        logger.debug("Calculated OG for %s as %s" % (url, og))
        # store OG in ephemeral in-memory cache
        self.cache[url] = og
        # store OG in history-aware DB cache
        yield self.store.store_url_cache(
            url,
            media_info["response_code"],
            media_info["etag"],
            media_info["expires"],
            json.dumps(og),
            media_info["filesystem_id"],
            media_info["created_ts"],
        )
        respond_with_json_bytes(request, 200, json.dumps(og), send_cors=True)
    @defer.inlineCallbacks
    def _calc_og(self, tree, media_info, requester):
        # suck our tree into lxml and define our OG response.
        # if we see any image URLs in the OG response, then spider them
        # (although the client could choose to do this by asking for previews of those
        # URLs to avoid DoSing the server)
        # "og:type"         : "video",
        # "og:url"          : "https://www.youtube.com/watch?v=LXDBoHyjmtw",
        # "og:site_name"    : "YouTube",
        # "og:video:type"   : "application/x-shockwave-flash",
        # "og:description"  : "Fun stuff happening here",
        # "og:title"        : "RemoteJam - Matrix team hack for Disrupt Europe Hackathon",
        # "og:image"        : "https://i.ytimg.com/vi/LXDBoHyjmtw/maxresdefault.jpg",
        # "og:video:url"    : "http://www.youtube.com/v/LXDBoHyjmtw?version=3&autohide=1",
        # "og:video:width"  : "1280"
        # "og:video:height" : "720",
        # "og:video:secure_url": "https://www.youtube.com/v/LXDBoHyjmtw?version=3",
        og = {}
        for tag in tree.xpath("//*/meta[starts-with(@property, 'og:')]"):
            og[tag.attrib['property']] = tag.attrib['content']
        # TODO: grab article: meta tags too, e.g.:
        # "article:publisher" content="https://www.facebook.com/thethudonline" />
        # "article:author" content="https://www.facebook.com/thethudonline" />
        # "article:tag" content="baby" />
        # "article:section" content="Breaking News" />
        # "article:published_time" content="2016-03-31T19:58:24+00:00" />
        # "article:modified_time" content="2016-04-01T18:31:53+00:00" />
        if 'og:title' not in og:
            # do some basic spidering of the HTML
            title = tree.xpath("(//title)[1] | (//h1)[1] | (//h2)[1] | (//h3)[1]")
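            # the matched element may have no direct text (e.g. <h1><a>..</a></h1>)
            title_text = title[0].text if title else None
            og['og:title'] = title_text.strip() if title_text else None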
        if 'og:image' not in og:
            # TODO: extract a favicon failing all else
            meta_image = tree.xpath(
                "//*/meta[translate(@itemprop, 'IMAGE', 'image')='image']/@content"
            )
            if meta_image:
                og['og:image'] = self._rebase_url(meta_image[0], media_info['uri'])
            else:
                # TODO: consider inlined CSS styles as well as width & height attribs
                images = tree.xpath("//img[@src][number(@width)>10][number(@height)>10]")
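                # number() in the XPath above accepts non-integer values such
                # as "100.5", so compare areas as floats rather than ints
                images = sorted(images, key=lambda i: (
                    -1 * float(i.attrib['width']) * float(i.attrib['height'])
                ))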
                if not images:
                    images = tree.xpath("//img[@src]")
                if images:
                    og['og:image'] = images[0].attrib['src']
        # pre-cache the image for posterity
        # FIXME: it might be cleaner to use the same flow as the main /preview_url request
        # itself and benefit from the same caching etc.  But for now we just rely on the
        # caching on the master request to speed things up.
        if 'og:image' in og and og['og:image']:
            image_info = yield self._download_url(
                self._rebase_url(og['og:image'], media_info['uri']), requester.user
            )
            if self._is_media(image_info['media_type']):
                # TODO: make sure we don't choke on white-on-transparent images
                dims = yield self._generate_local_thumbnails(
                    image_info['filesystem_id'], image_info
                )
                if dims:
                    og["og:image:width"] = dims['width']
                    og["og:image:height"] = dims['height']
                else:
                    logger.warn("Couldn't get dims for %s", og["og:image"])
                og["og:image"] = "mxc://%s/%s" % (
                    self.server_name, image_info['filesystem_id']
                )
                og["og:image:type"] = image_info['media_type']
                og["matrix:image:size"] = image_info['media_length']
            else:
                del og["og:image"]
        if 'og:description' not in og:
            meta_description = tree.xpath(
                "//*/meta"
                "[translate(@name, 'DESCRIPTION', 'description')='description']"
                "/@content")
            if meta_description:
                og['og:description'] = meta_description[0]
            else:
                # grab any text nodes which are inside the <body/> tag...
                # unless they are within an HTML5 semantic markup tag...
                # <header/>, <nav/>, <aside/>, <footer/>
                # ...or if they are within a <script/> or <style/> tag.
                # This is a very very very coarse approximation to a plain text
                # render of the page.
                text_nodes = tree.xpath("//text()[not(ancestor::header | ancestor::nav | "
                                        "ancestor::aside | ancestor::footer | "
                                        "ancestor::script | ancestor::style)]" +
                                        "[ancestor::body]")
                text = ''
                for text_node in text_nodes:
                    if len(text) < 500:
                        text += text_node + ' '
                    else:
                        break
                text = re.sub(r'[\t ]+', ' ', text)
                text = re.sub(r'[\t \r\n]*[\r\n]+', '\n', text)
                text = text.strip()[:500]
                og['og:description'] = text if text else None
        # TODO: delete the url downloads to stop diskfilling,
        # as we only ever cared about its OG
        defer.returnValue(og)
    def _rebase_url(self, url, base):
        base = list(urlparse(base))
        url = list(urlparse(url))
        if not url[0]:  # fix up scheme
            url[0] = base[0] or "http"
        if not url[1]:  # fix up hostname
            url[1] = base[1]
            if not url[2].startswith('/'):
                url[2] = re.sub(r'/[^/]+$', '/', base[2]) + url[2]
        return urlunparse(url)
    @defer.inlineCallbacks
    def _download_url(self, url, user):
        # TODO: we should probably honour robots.txt... except in practice
        # we're most likely being explicitly triggered by a human rather than a
        # bot, so are we really a robot?
        # XXX: horrible duplication with base_resource's _download_remote_file()
        file_id = random_string(24)
        fname = self.filepaths.local_media_filepath(file_id)
        self._makedirs(fname)
        try:
            with open(fname, "wb") as f:
                logger.debug("Trying to get url '%s'" % url)
                length, headers, uri, code = yield self.client.get_file(
                    url, output_stream=f, max_size=self.max_spider_size,
                )
                # FIXME: pass through 404s and other error messages nicely
            media_type = headers["Content-Type"][0]
            time_now_ms = self.clock.time_msec()
            content_disposition = headers.get("Content-Disposition", None)
            if content_disposition:
                _, params = cgi.parse_header(content_disposition[0])
                download_name = None
                # First check if there is a valid UTF-8 filename
                download_name_utf8 = params.get("filename*", None)
                if download_name_utf8:
                    if download_name_utf8.lower().startswith("utf-8''"):
                        download_name = download_name_utf8[7:]
                # If there isn't, check for an ASCII name.
                if not download_name:
                    download_name_ascii = params.get("filename", None)
                    if download_name_ascii and is_ascii(download_name_ascii):
                        download_name = download_name_ascii
                if download_name:
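                    # XXX: urlparse is called as a function in _rebase_url but
                    # used as a module here; only one of those can work with a
                    # given import, so this probably wants urllib.unquote
                    download_name = urlparse.unquote(download_name)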
                    try:
                        download_name = download_name.decode("utf-8")
                    except UnicodeDecodeError:
                        download_name = None
            else:
                download_name = None
            yield self.store.store_local_media(
                media_id=file_id,
                media_type=media_type,
                time_now_ms=time_now_ms,
                upload_name=download_name,
                media_length=length,
                user_id=user,
            )
        except Exception as e:
            os.remove(fname)
            raise SynapseError(
                500, ("Failed to download content: %s" % e),
                Codes.UNKNOWN
            )
        defer.returnValue({
            "media_type": media_type,
            "media_length": length,
            "download_name": download_name,
            "created_ts": time_now_ms,
            "filesystem_id": file_id,
            "filename": fname,
            "uri": uri,
            "response_code": code,
            # FIXME: we should calculate a proper expiration based on the
            # Cache-Control and Expire headers.  But for now, assume 1 hour.
            "expires": 60 * 60 * 1000,
            "etag": headers["ETag"][0] if "ETag" in headers else None,
        })
    def _is_media(self, content_type):
        return content_type.lower().startswith("image/")
    def _is_html(self, content_type):
        content_type = content_type.lower()
        return (
            content_type.startswith("text/html") or
            content_type.startswith("application/xhtml")
        )
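A standalone sketch of the rebasing that `_rebase_url` applies to image URLs found in a page. It mirrors the method's logic as a free function, but is written against Python 3's `urllib.parse` rather than the Python 2 `urlparse` module this change appears to target. The name `rebase_url` and the example.com URLs are illustrative assumptions, not part of the change.

```python
import re
from urllib.parse import urlparse, urlunparse


def rebase_url(url, base):
    """Resolve a possibly relative url against the page it was found on."""
    base_parts = list(urlparse(base))
    url_parts = list(urlparse(url))
    if not url_parts[0]:
        # no scheme given: inherit the page's, defaulting to http
        url_parts[0] = base_parts[0] or "http"
    if not url_parts[1]:
        # no hostname given: inherit the page's
        url_parts[1] = base_parts[1]
        if not url_parts[2].startswith('/'):
            # relative path: resolve against the directory of the page
            url_parts[2] = re.sub(r'/[^/]+$', '/', base_parts[2]) + url_parts[2]
    return urlunparse(url_parts)


page = "http://example.com/news/story.html"
print(rebase_url("img/a.png", page))
print(rebase_url("/logo.png", page))
print(rebase_url("//cdn.example.org/x.jpg", page))
```

Absolute URLs pass through unchanged, protocol-relative URLs pick up the page's scheme, and path-relative URLs are resolved against the page's directory.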
--- a/synapse/rest/media/v1/thumbnail_resource.py
+++ b/synapse/rest/media/v1/thumbnail_resource.py
@ -72,6 +72,11 @@ class ThumbnailResource(BaseMediaResource):
            self._respond_404(request)
            return
        if media_info["media_type"] == "image/svg+xml":
            file_path = self.filepaths.local_media_filepath(media_id)
            yield self._respond_with_file(request, media_info["media_type"], file_path)
            return
        thumbnail_infos = yield self.store.get_local_media_thumbnails(media_id)
        if thumbnail_infos:
@ -103,6 +108,11 @@ class ThumbnailResource(BaseMediaResource):
            self._respond_404(request)
            return
        if media_info["media_type"] == "image/svg+xml":
            file_path = self.filepaths.local_media_filepath(media_id)
            yield self._respond_with_file(request, media_info["media_type"], file_path)
            return
        thumbnail_infos = yield self.store.get_local_media_thumbnails(media_id)
        for info in thumbnail_infos:
            t_w = info["thumbnail_width"] == desired_width
@ -138,6 +148,11 @@ class ThumbnailResource(BaseMediaResource):
                                             desired_method, desired_type):
        media_info = yield self._get_remote_media(server_name, media_id)
        if media_info["media_type"] == "image/svg+xml":
            file_path = self.filepaths.remote_media_filepath(server_name, media_id)
            yield self._respond_with_file(request, media_info["media_type"], file_path)
            return
        thumbnail_infos = yield self.store.get_remote_media_thumbnails(
            server_name, media_id,
        )
@ -181,6 +196,11 @@ class ThumbnailResource(BaseMediaResource):
        # We should proxy the thumbnail from the remote server instead.
        media_info = yield self._get_remote_media(server_name, media_id)
        if media_info["media_type"] == "image/svg+xml":
            file_path = self.filepaths.remote_media_filepath(server_name, media_id)
            yield self._respond_with_file(request, media_info["media_type"], file_path)
            return
        thumbnail_infos = yield self.store.get_remote_media_thumbnails(
            server_name, media_id,
        )
@ -208,6 +228,8 @@ class ThumbnailResource(BaseMediaResource):
    @defer.inlineCallbacks
    def _respond_default_thumbnail(self, request, media_info, width, height,
                                   method, m_type):
        # XXX: how is this meant to work? store.get_default_thumbnails
        # appears to always return [] so won't this always 404?
        media_type = media_info["media_type"]
        top_level_type = media_type.split("/")[0]
        sub_type = media_type.split("/")[-1].split(";")[0]
--- a/synapse/storage/media_repository.py
+++ b/synapse/storage/media_repository.py
@ -25,7 +25,7 @@ class MediaRepositoryStore(SQLBaseStore):
    def get_local_media(self, media_id):
        """Get the metadata for a local piece of media
        Returns:
-            None if the meia_id doesn't exist.
+            None if the media_id doesn't exist.
        """
        return self._simple_select_one(
            "local_media_repository",
@ -50,6 +50,61 @@ class MediaRepositoryStore(SQLBaseStore):
            desc="store_local_media",
        )
    def get_url_cache(self, url, ts):
        """Get the cached preview metadata for a URL as of the given timestamp
        Returns:
            A dict with keys response_code, etag, expires, og, media_id and
            download_ts, or None if the URL isn't cached.
        """
        def get_url_cache_txn(txn):
            # get the most recently cached result (relative to the given ts)
            sql = (
                "SELECT response_code, etag, expires, og, media_id, download_ts"
                " FROM local_media_repository_url_cache"
                " WHERE url = ? AND download_ts <= ?"
                " ORDER BY download_ts DESC LIMIT 1"
            )
            txn.execute(sql, (url, ts))
            row = txn.fetchone()
            if not row:
                # ...or if we've requested a timestamp older than the oldest
                # copy in the cache, return the oldest copy (if any)
                sql = (
                    "SELECT response_code, etag, expires, og, media_id, download_ts"
                    " FROM local_media_repository_url_cache"
                    " WHERE url = ? AND download_ts > ?"
                    " ORDER BY download_ts ASC LIMIT 1"
                )
                txn.execute(sql, (url, ts))
                row = txn.fetchone()
            if not row:
                return None
            return dict(zip((
                'response_code', 'etag', 'expires', 'og', 'media_id', 'download_ts'
            ), row))
        return self.runInteraction(
            "get_url_cache", get_url_cache_txn
        )
    def store_url_cache(self, url, response_code, etag, expires, og, media_id,
                        download_ts):
        return self._simple_insert(
            "local_media_repository_url_cache",
            {
                "url": url,
                "response_code": response_code,
                "etag": etag,
                "expires": expires,
                "og": og,
                "media_id": media_id,
                "download_ts": download_ts,
            },
            desc="store_url_cache",
        )
    def get_local_media_thumbnails(self, media_id):
        return self._simple_select_list(
            "local_media_repository_thumbnails",
--- a/synapse/storage/schema/delta/31/local_media_repository_url_cache.sql
+++ b/synapse/storage/schema/delta/31/local_media_repository_url_cache.sql
@ -0,0 +1,27 @@
 /* Copyright 2016 OpenMarket Ltd
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
 CREATE TABLE local_media_repository_url_cache(
    url TEXT,              -- the URL being cached
    response_code INTEGER, -- the HTTP response code of this download attempt
    etag TEXT,             -- the etag header of this response
    expires INTEGER,       -- the number of ms this response was valid for
    og TEXT,               -- cache of the OG metadata of this URL as JSON
    media_id TEXT,         -- the media_id, if any, of the URL's content in the repo
    download_ts BIGINT     -- the timestamp of this download attempt
 );
 CREATE INDEX local_media_repository_url_cache_by_url_download_ts
    ON local_media_repository_url_cache(url, download_ts);
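A minimal sketch of the two-step lookup that `get_url_cache` performs, run against this table in an in-memory SQLite database. The helper `lookup`, the sample URL, the `media…` ids and the timestamps are illustrative assumptions. Synapse itself issues these queries through `runInteraction` rather than a raw connection.

```python
import sqlite3

COLUMNS = ("response_code", "etag", "expires", "og", "media_id", "download_ts")

SELECT = (
    "SELECT response_code, etag, expires, og, media_id, download_ts"
    " FROM local_media_repository_url_cache"
)


def lookup(conn, url, ts):
    # Prefer the most recent copy downloaded at or before ts...
    row = conn.execute(
        SELECT + " WHERE url = ? AND download_ts <= ?"
        " ORDER BY download_ts DESC LIMIT 1", (url, ts)
    ).fetchone()
    if row is None:
        # ...otherwise fall back to the oldest copy downloaded after ts.
        row = conn.execute(
            SELECT + " WHERE url = ? AND download_ts > ?"
            " ORDER BY download_ts ASC LIMIT 1", (url, ts)
        ).fetchone()
    return dict(zip(COLUMNS, row)) if row else None


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE local_media_repository_url_cache(
        url TEXT, response_code INTEGER, etag TEXT, expires INTEGER,
        og TEXT, media_id TEXT, download_ts BIGINT
    );
    CREATE INDEX local_media_repository_url_cache_by_url_download_ts
        ON local_media_repository_url_cache(url, download_ts);
""")

url = "http://example.com/"
for download_ts in (1000, 2000):
    conn.execute(
        "INSERT INTO local_media_repository_url_cache"
        " VALUES (?, ?, ?, ?, ?, ?, ?)",
        (url, 200, None, 3600000, "{}", "media%d" % download_ts, download_ts),
    )

print(lookup(conn, url, 1500))
```

Both queries filter on `url` and order by `download_ts`, which is the access pattern the `(url, download_ts)` index is defined for.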