Using Varnish to capitalize on image hotlinkers and retain referred visitors

by Rudd-O published 2008/12/09 22:35:00 GMT+0, last modified 2013-06-26T03:24:20+00:00
Hotlinkers suck your bandwidth dry, but referrals are valuable to you. Learn how to use Varnish to stop image hotlinkers while at the same time welcoming referred visitors and gaining a big performance edge.

By now, most of us who have run a site for a long time have already run into the problem of image hotlinkers -- people who take the URLs of your images to embed in their own sites using the <IMG SRC="..."> tag.  The usual approach to dealing with this problem is to use Apache URL rewriting rules to stop requests for images if they come from non-approved sites.

The usual problems with the Apache URL rewrite approach are these:

  1. You have to wake an Apache (or, in general, Web server) instance every time a hotlinked request is made.  This is expensive if your site is high-volume and your Web server processes take up a lot of RAM (PHP- and Python-configured Web servers do).
  2. You have no way of distinguishing an image from a page, other than by URL (file extension / site section).  Thus, if you serve dynamically-generated images, or the hotlinker tacks a query string to your image, your strategy fails unless you write extremely complicated regular expressions to offset their cunning.
  3. Finally, if someone links to an image in your site instead of embedding it, you generally want to capitalize on that by showing your Web site (embedding the requested image, of course) instead of showing an unwelcoming Forbidden message to visitors who otherwise would stay on your site and browse around.

Today, we're going to learn a different approach that uses Varnish, a high-performance HTTP front-end for your sites.  If you're running a high-traffic site, you are using either Varnish or Squid.  And you should be using Varnish -- with Squid, you're out of luck if you want to apply advanced URL rewriting techniques like these.

Heads up: these instructions require you to have Varnish 2 installed.  Varnish 1 does not have the advanced processing capabilities that Varnish 2 has.

How Varnish works

I'll keep this part simple:

  1. Varnish receives an HTTP request on port 80 from a Web browser.
  2. It runs the receive handler, which sets the appropriate backend and performs a number of processes that you specify.
  3. Depending on the receive handler, it passes the request to your real Web server, or looks the content up in its cache.
  4. Once the request has been fetched, Varnish either shoves the content in the cache, or it doesn't.
  5. Varnish returns the content to the Web browser.

If you have set up Varnish properly, you can serve thousands of requests per second by letting Varnish take the hit for static content service, all the while letting dynamic content stay 100% fresh directly from your real Web server.  However, that lesson is out of scope here -- we'll just focus on getting hotlinkers out, and getting referred visitors in.
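
As a rough map of the five steps above, here is a sketch that assumes the stock Varnish 2 handler names.  The bodies are left as comments so that Varnish's built-in defaults still apply; it is meant as orientation, not as something to paste next to the configuration shown later:

```vcl
sub vcl_recv {
        # step 2: runs for every incoming request; this is where you
        # choose the backend and decide whether to look up, pass or pipe
}

sub vcl_hit {
        # step 3: the requested object was found in the cache
}

sub vcl_miss {
        # step 3: the object is not cached, so Varnish will ask the backend
}

sub vcl_fetch {
        # step 4: the backend has answered; decide whether to cache it
}

sub vcl_deliver {
        # step 5: the response is about to be sent to the Web browser
}
```

The hotlinking detector described below hooks into the vcl_hit and vcl_fetch stages, because those are the points where the content type of the response is known.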

A default Varnish configuration file

The approach involves modifying your Varnish configuration file.  Here's an example file that we'll use as a starter -- one that defines a backend and a receive handler:

backend zope {
        .host = "127.0.0.1";
        .port = "82";
}
sub vcl_recv {
        if (req.http.host == "rudd-o.com") {
                set req.backend = zope;
        }
        if (req.request == "POST") {
                pipe;
        }
}

This is pretty standard -- the host name is detected in the receive handler, and Varnish is told to dispatch that processing to my Zope backend (with an exception for POST requests -- those are piped directly without any extra processing).  Then, the default receive handler in Varnish picks it up after the custom receive handler has finished.

Adding the code

What we're going to do does not involve touching those parts.  Instead, we're going to add the following stanzas to the configuration file:

sub detect_hotlinking {
        if (
                req.http.host == "rudd-o.com"
                &&
                obj.http.Content-Type ~ "image/"
                &&
                req.http.referer ~ "^http"
                &&
              ! req.http.referer ~ "^http(|s)://(([a-z-]+\.|)rudd-o\.com|([a-z]+\.|)turbochargedcms\.com|favic$
                &&
              ! req.url ~ "uploads/images/logos"
                &&
              ! req.url ~ "gravatar-picture"
                &&
              ! req.url ~ "^/favicon.ico$"
           )
        {
                set req.http.New-Location = regsub(req.url,"$","/view");
                error 307 "Redirecting you to an alternate representation...";
        }
}

sub vcl_hit {
        call detect_hotlinking;
}

sub vcl_fetch {
        call detect_hotlinking;
}

sub vcl_error {
        if (req.http.New-Location) {
                set obj.http.Location = req.http.New-Location;
        }
}

But what does that gibberish mean?

Let's go through what this does, step by step.  I'll repeat the code, this time with a comment explaining each line:

sub detect_hotlinking {
# this is a custom subroutine that will be invoked later
        if (    # if (duh!)

           # the host name of the request is rudd-o.com
                req.http.host == "rudd-o.com"
           # and the content type of the content that is going to be served
           # is an image
                &&
                obj.http.Content-Type ~ "image/"
           # and the referrer is another Web site
                &&
                req.http.referer ~ "^http"
           # and the referrer is not in this white list
           # of sites I trust in advance
                &&
              ! req.http.referer ~ "^http(|s)://(([a-z-]+\.|)rudd-o\.com|([a-z]+\.|)turbochargedcms\.com|favic$
                &&
           # and the URL does not match a white list of
           # URLs that may be explicitly hotlinked 
              ! req.url ~ "uploads/images/logos"
                &&
              ! req.url ~ "gravatar-picture"
                &&
              ! req.url ~ "^/favicon.ico$"
           )
        {
           # tag the request with a new URL
           # in this example, we will tack "/view" at the end
           # because in Zope, tacking "/view" at the end of an image URL
           # causes the image to be displayed in a nice Web page frame
           # in your site's template
                set req.http.New-Location = regsub(req.url,"$","/view");
           # trigger an HTTP 307 redirect
                error 307 "Redirecting you to an alternate representation...";
        }
}

As you can see, this is pretty straightforward -- any requests not matching your white list of referrers and linkable images are automatically redirected to a new URL that you get to specify.  That URL does not have to be an image; it can be a Web page too!  The reason behind this particular decision will be evident shortly.
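
One caveat with tacking "/view" onto the end: if the hotlinker has added a query string to the image URL, the suffix lands after the query string rather than after the path.  A possible variant of that one line is sketched below.  It assumes the query string can simply be dropped from the redirect target (my assumption, not something the configuration above does), and the example paths in the comments are made up:

```vcl
# Drop-in alternative for the "set req.http.New-Location" line inside
# detect_hotlinking.  Assumption: the /view page does not need the
# query string, so it is discarded.
#   /photos/cat.jpg         becomes  /photos/cat.jpg/view
#   /photos/cat.jpg?size=1  becomes  /photos/cat.jpg/view
set req.http.New-Location = regsub(req.url, "(\?.*|)$", "/view");
```

The pattern matches either a query string running to the end of the URL or, when there is none, the empty string at the end, so both cases end up with a clean "/view" suffix.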

Now we need to hook our subroutine into different parts of Varnish that actually process requests.  For that we use three special handlers:

sub vcl_hit {
# runs when an image is in the cache
        call detect_hotlinking;
}

sub vcl_fetch {
# runs when an image is fetched from your real Web server
        call detect_hotlinking;
}

sub vcl_error {
# traps the 307 triggered in detect_hotlinking and adds a Location:
# header pointing at the URL that detect_hotlinking stored
        if (req.http.New-Location) {
                set obj.http.Location = req.http.New-Location;
        }
}

As you can see, we've got all bases covered.  Varnish runs the detector if there's a cache hit, runs the detector if there's a cache miss, and traps the error routine to redirect if our detector has set a new destination URL.  Most importantly, the detector will only operate on images because it looks for an image content type (this is obviously something you can change if you're inclined to).

Benefits of this solution

Let's go through the benefits:

  1. Redirects are content-dependent instead of URL-dependent.  Since the redirect decision is made after your backend server has answered the request, Varnish knows the content type of the response, which lets you block hotlinking of all your site's images, regardless of file extension or path in your Web site.
  2. Redirects are very high-performance.  Once Varnish has fetched the requested file from your Web server, this solution does not need to contact your Web server again to decide whether to redirect -- the decision is made from the cache itself.  This is an unbelievably fast operation (microseconds!) because Varnish actually compiles this configuration to C and then to machine code; Varnish can sustain thousands of requests per second in this manner, without waking up Apache or Zope -- which is very important if your site is busy generating pages for memory-hungry Web apps based on Python or PHP.
  3. Redirects capitalize on your content.  The well-known problem with hotlinking avoidance strategies is that your Web server has no way of knowing whether the requested image was embedded in the hotlinker's site, or merely linked to your site.  By redirecting to a Web page presenting an image (in this example, the default Zope /view handler for images), you can capitalize on images linked from other sites, while embedded images will still show as broken images on hotlinkers' sites.  This ensures that people can freely link to your site's images (but not embed them) and you'll still get to show them your site's user interface, leading them to stay on your site for more interesting content.  You, of course, have ultimate flexibility in how you mangle the requested URLs to present your images within your site's user interface.

Other strategies

Varnish allows you to pursue other strategies.  Here are four:

  1. Serve a generic image if the image is hotlinked.  This image can come from your Web server itself -- you just need to move the hotlinking detection logic to the vcl_recv handler, and modify the req.url parameter if hotlinking was detected.
  2. Serve a dynamic image with your logo or instructions pasted on top of it.  As in the previous item, you can modify req.url to point to a PHP script in your Web server or some other artifice that will download the good image from your own site and dynamically paste text or your logo using GD or another image manipulation library.  This trick lets you use your images on your site without having to watermark them, while automatically watermarking them if they are misused by hotlinkers.
  3. Simply respond with a 412 Precondition Failed HTTP status code.  This alternative has the disadvantage of making links to images from other sites display a generic error page, but it would be the best-performing choice if you do not mind visitors just looking at the error and then hitting the Back button to return to the original site.  You can combine this status code with a synthetic page generated directly in the VCL configuration file -- that way, you can still present something nice (yes, even images, provided you set the Content-Type header in the VCL object response) to the user without involving a round-trip to your Web server.
  4. Use the same strategy you're using for images to protect the MP3s or streaming videos embedded in your site's media player.  That way, other sites cannot embed your videos or podcasts -- if they want to feature your content, by necessity they will have to link to the page with a hyperlink, which sends more visitors your way.
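
To make strategies 1 and 3 a little more concrete, here is a minimal sketch in the same bare-keyword Varnish 2 style as the listings above.  Several things in it are assumptions of mine: the subroutine name swap_hotlinked_image, the placeholder path /hotlinked.png, the simplified referrer white list and the wording of the error page.  Note also that in vcl_recv nothing has been fetched yet, so obj.http.Content-Type cannot be tested there; strategy 1 therefore has to fall back on a URL pattern:

```vcl
# Strategy 1: swap in a placeholder image before the cache lookup.
# Call it from your existing receive handler with:
#         call swap_hotlinked_image;
sub swap_hotlinked_image {
        if (
                req.http.host == "rudd-o.com"
                &&
                # no response yet, so match on file extension instead
                req.url ~ "\.(png|gif|jpg|jpeg)(\?.*|)$"
                &&
                req.http.referer ~ "^http"
                &&
                # simplified white list, for illustration only
              ! req.http.referer ~ "^http(|s)://([a-z-]+\.|)rudd-o\.com"
           )
        {
                # /hotlinked.png is a hypothetical placeholder image
                set req.url = "/hotlinked.png";
        }
}

# Strategy 3: answer from VCL without a round-trip to the backend.
# This assumes detect_hotlinking raises
#         error 412 "Hotlinking is not allowed";
# instead of setting New-Location and raising the 307.
sub vcl_error {
        if (req.http.New-Location) {
                set obj.http.Location = req.http.New-Location;
        }
        if (obj.status == 412) {
                set obj.http.Content-Type = "text/html; charset=utf-8";
                synthetic {"<html><body><p>This image lives on rudd-o.com.</p></body></html>"};
                deliver;
        }
}
```

If you keep the original detect_hotlinking in place next to strategy 1, remember to add the placeholder's URL to its white list of linkable images; otherwise the placeholder itself would be treated as a hotlinked image and redirected.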

This is just one of the many ways the Varnish HTTP front-end can help you.  We'll be continuing this story and revisiting other related topics in future posts.