
Creating and serving a static archive from a Drupal 6 site

Back in 2012, I was on a team of people who put together a website to commemorate the 100th anniversary of a historic journey to North America undertaken by 'Abdu'l-Bahá, then the head of the Bahá'í Faith. A religious prisoner of the Ottoman Empire for almost 40 years, 'Abdu'l-Bahá made the journey to America at almost 70 years of age. The trip made news headlines around the world. If you were alive and following the news at the time, you probably would have heard of it.

Anyhow, we built the site in Drupal 6 and it turned out great; and when 2013 came along we just let it sit there, humming along happily for the most part. A couple of years later, the site is still pretty cool--as long as you're not trying to browse it with a phone--but we're not adding any new information, and maintaining the Drupal 6 codebase is increasingly difficult. So, we're retiring the code, but keeping the site, and I was tasked with creating and serving a static archive from a Drupal 6 site.

Setting up the Server

I eventually decided to create a new server on DigitalOcean (though if you're stuck on Amazon, I've got some helpful hints for downsizing an EC2 instance), and after ensuring that auto-updates were enabled, I installed the only two tools that I would need for this: nginx and httrack.

sudo apt-get install nginx httrack
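
For what it's worth, the auto-updates piece on a Debian/Ubuntu droplet is typically handled by the unattended-upgrades package; a minimal sketch:

sudo apt-get install unattended-upgrades
sudo dpkg-reconfigure -plow unattended-upgrades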

I also decided to rsync the rather large files folder, so that httrack wouldn't have to fetch them through http. For this site, I put the files folder at nginx/centenary/centenary.bahai.us/sites/default/files, and I made sure to remove sensitive files like the scheduled database backups performed by backup_migrate.
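
The rsync itself was nothing special; something along these lines works, assuming ssh access to the old Drupal server (the hostname and remote path here are placeholders):

# Pull the files directory from the old server over ssh.
# Skip the backup_migrate dumps rather than deleting them afterwards.
rsync -avz --exclude='backup_migrate/' \
  user@olddrupal.example.com:/var/www/centenary/sites/default/files/ \
  /usr/share/nginx/centenary/centenary.bahai.us/sites/default/files/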

Setting up a Drupal site for archival

Here are a few things that I had to do before attempting to archive this Drupal 6 site (several of them can also be scripted with drush, as sketched after the list):

  • Block all accounts except mine, so that no changes would be made during the archival process. Perhaps an abundance of caution.
  • Turn off ajax for all views.
  • Remove login forms and menu links to /user/login.
  • Turn off search module.
  • Make sure there are no error or status messages on any of the pages. You may even want to remove the line that prints $messages from your page.tpl.php file.
  • Turn on aggregation for js and css.
  • Turn off cron and poormanscron.
  • Examine a few pages very carefully, especially the <head> section. I had to disable a few modules (poormanscron, block_edit) because they left crufty tags in the head.
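
Most of these are a minute of clicking in the admin UI, but a few can be scripted as well. A rough drush sketch, assuming drush is available for the Drupal 6 site and that your own account is uid 1:

# Disable modules that add cruft or dynamic behavior.
drush pm-disable search poormanscron block_edit -y

# Turn on css and js aggregation.
drush vset preprocess_css 1
drush vset preprocess_js 1

# Block every account except uid 1.
drush sql-query "UPDATE users SET status = 0 WHERE uid > 1"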

Archive static HTML from a Drupal site

To save every Drupal page as a static html file, I used httrack. I created a folder at nginx/centenary, and then from inside that folder ran the following httrack command:

httrack http://centenary.bahai.us -%p -%q -s0 -X0 -%P0 -N "%h%p/%n%[page].%t" -*sites/default/files* -*.jpg -*.gif -*.png -*.mp3 -*soundFile* +*.xml -%v

This downloaded a static archive of the entire site in flat html files, and put it in nginx/centenary/centenary.bahai.us. Note that this particular command makes files of the format [server]/[path]/[resource][page-query].html, or for example:

centenary.bahai.us/encounter/howard-colby-ives.html
    [server]        [path]       [resource]   .html

I chose this format because it made sense to me to match Drupal's url paths fairly closely, but all the links still point to [server]/[path]/[resource], which will not work unless you are able to configure your web server appropriately (see below for the nginx config). If you are trying to make a truly static site, such as might be served from GitHub Pages or similar, then you'll want a format more like [server]/[path]/[resource]/index[page-query].html, which in httrack would be something more like -N "%h%p/%n/index%[page].%t". I believe that was the approach presented in the excellent tutorial by Karen Stevenson of Lullabot, which I wish I had found before figuring a lot of the same stuff out for myself.

Links to paged views

You can't keep query strings as part of a static html filename, so in the httrack command I had paged views saved with the "page" query appended to the filename (i.e. "/news?page=5" gets translated to "/news5.html"). Owing to some other requirements, I also asked httrack to leave all links exactly as they were, so I had to go back in and fix all the links that still pointed to paged views with a query string. Sed worked well for this:

sed -i -re "s|(href=.[^'\"\?]+)\?page=([0-9%cC]+)[^'\"]*|\1\2|g" *.html
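
As written, that only touches the .html files in the current directory; to hit every file in the tree in one pass, the same expression can be handed to find:

find . -name '*.html' -exec sed -i -re \
  "s|(href=.[^'\"\?]+)\?page=([0-9%cC]+)[^'\"]*|\1\2|g" {} +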

Getting quasi-dynamic content from daily and weekly views

The site owes much of its dynamism to the time-sensitive nature of its content, with the newspaper clippings and recordings of talks all shown on the same day as they appeared a century ago. To maintain this dynamic element of the pages as simply as possible, I decided to use the server side include (SSI) functionality that is standard in nginx.

The first step was to get exactly the elements I needed from each of the dynamic views. So, I created a little module with throw-away urls for each day of the year; then for each day I used views_embed_view($name, $display, $argument) to get the output from the view and had php die() with that output. I then set httrack to crawl those pages, giving me a cached result of each view for each day, in a format suitable for including in the page via SSI.

The .module file is pretty simple:

<?php
function dnotes_siterip_menu() {
  $items['dnotes-siterip-daily'] = array(
    'title' => 'Daily Centenarian',
    'type' => MENU_CALLBACK,
    'access callback' => TRUE,
    'page callback' => 'dnotes_siterip_daily',
  );
  $items['dnotes-siterip-weekly'] = array(
    'title' => 'This Week',
    'type' => MENU_CALLBACK,
    'access callback' => TRUE,
    'page callback' => 'dnotes_siterip_weekly',
  );
  return $items;
}

function dnotes_siterip_daily($date = NULL) {
  $all_dates = dnotes_siterip_getdates();
  if (empty($date) || !in_array($date, $all_dates)) {
    $out = '';
    foreach ($all_dates as $d) {
      $out .= l($d, 'dnotes-siterip-daily/' . $d) . '<br />';
    }
    return $out;
  }
  else {
    $out = views_embed_view('daily', 'block_1', '1912-' . $date);
    die($out);
  }
}

function dnotes_siterip_weekly($date = NULL) {
  $all_dates = dnotes_siterip_getdates();
  if (empty($date) || !in_array($date, $all_dates)) {
    $out = '';
    foreach ($all_dates as $d) {
      $out .= l($d, 'dnotes-siterip-weekly/' . $d) . '<br />';
    }
    return $out;
  }
  else {
    $out = views_embed_view('front_weekly', 'page_1', '1912-' . $date);
    // This view has a static footer that we don't need to save for each day; instead we can insert only the contents of div.view-content
    $new_out = explode('<div class="view-footer">', $out);
    $out = $new_out[0] . '</div>';
    $out = str_replace('<div class="views_view view view-front-weekly view-id-front_weekly view-display-id-page_1 view-dom-id-3">', '', $out);
    die($out);
  }
}

function dnotes_siterip_getdates() {
  // Build an array of every 'm-d' date string in the year.
  $range = array();
  $start = strtotime('2012-01-01');
  $end = strtotime('2012-12-31');
  do {
    $range[] = date('m-d', $start);
    $start = strtotime('+ 1 day', $start);
  } while ($start <= $end);
  return $range;
}
?>

Once I enabled this module, I was able to have httrack scrape the provided urls and download only the snippets that I wanted to include in the static pages of the archive:

httrack http://centenary.bahai.us/dnotes-siterip-daily -%p -%F "" -s0 -X0 -* +*dnotes-siterip-daily* +*.jpg +*.png
httrack http://centenary.bahai.us/dnotes-siterip-weekly -%p -%F "" -s0 -X0 -* +*dnotes-siterip-weekly* +*.jpg +*.png

Inserting SSI commands into HTML files generated from Drupal

So now I had a snapshot of the content for each of two views for each day of the year, and I needed to include that content in the archived html files.

The weekly view was easy; since it only appeared on the front page, I just edited the nginx/centenary/centenary.bahai.us/index.html file and replaced the <div class="view-content">...</div> with the following SSI directives:

<!--# config timefmt="%m-%d" --><!--# include virtual="/dnotes-siterip-weekly/${DATE_LOCAL}" -->

The daily view, on the other hand, was scattered on every page of the site. So it was a recursive find-and-replace on a folder full of files: I had to replace each occurrence of that 30-line-long view with SSI commands, throughout every saved HTML file in the directory tree. For this I eventually figured out how to use sed, and the command ended up looking maddeningly simple after I had spent hours re-learning sed:

sed -i '/<div class="views_view view view-daily view-id-daily view-display-id-block_1/,+30c <!--# config timefmt="%m-%d" --><!--# include virtual="/dnotes-siterip-daily/${DATE_LOCAL}" -->' *.html
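
The same find trick from the paged-views step works here if sed needs to reach into subdirectories, and a quick grep afterwards shows any file the replacement missed:

# Any page still containing the daily view's wrapper div was not rewritten.
# (The cached snippets under dnotes-siterip-daily will match too, so skip them.)
grep -rl --exclude-dir='dnotes-siterip-daily' 'view-display-id-block_1' .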

Nginx configuration for serving an archived Drupal 6 site

I'm not going to post my entire nginx configuration here, but I'll post the parts of the server declaration that are relevant to serving a Drupal site archive.

The first bits are exceedingly standard. Note that the document root is the nginx/centenary/centenary.bahai.us folder that httrack created.

    server {
        listen  80 default_server;
        root    /usr/share/nginx/centenary/centenary.bahai.us;
        index   index.html index.htm;
        server_name     centenary.bahai.us;

The interesting parts happen within the location declarations. First, we have to deal with the fact that all our beautiful clean urls are now translated into .html files, and thanks to the try_files directive of nginx, this problem becomes trivial:

        # This is the default location; it should only be used for requests that would
        # normally be handled by drupal clean urls, which are now .html files.
        location / {
                try_files       $uri.html =404;
                ssi             on;
        }

So by default, for any request, nginx will first try to find a file at $uri.html, and if that file does not exist it will return a 404 file not found error. A request to /news will return /news.html, and a request to /news.html will return a 404 because /news.html.html does not exist. Note also that SSI is turned on for these requests.
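
With nginx running, the mapping is easy to sanity-check from the command line (this assumes you're testing locally against the default server, and that /news.html exists in the archive as in the example above):

curl -sI http://localhost/news | head -n 1        # HTTP/1.1 200 OK (served from /news.html)
curl -sI http://localhost/news.html | head -n 1   # HTTP/1.1 404 Not Found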

This works for all the html files in the folders that httrack translated from Drupal's virtual paths, like /news and /photos, but it doesn't work for actual files from the server: a request for /sites/default/files/image.jpg would return 404 because /sites/default/files/image.jpg.html does not exist. So we must create an nginx location for the actual files as well. I only had two folders in which such files were kept: the vast majority in /sites (and its subfolders), and a few images in /misc. I used a fairly standard way to do this in nginx, which is to test the extension of the request using a regular expression for the location name. This also lets me tell browsers to cache these files, since there is no dynamic content involved:

        # Match files with an extension located in standard locations.
        # These should all be static files: images, mp3s, the html book, etc.
        location ~* (/sites/|/misc/).+\.(css|js|gif|html|ico|jpe?g|png|mp3|swf|xml)$ {
                expires         max;
                add_header      Pragma public;
                add_header      Cache-Control "public, must-revalidate, proxy-revalidate";
        }
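
A similar spot check against the hypothetical image path from earlier confirms that the caching headers are coming through:

curl -sI http://localhost/sites/default/files/image.jpg | grep -iE 'expires|cache-control'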

There are a few other locations for specific needs of this site, but the above configuration covers 90% of what was needed.

Finished!

Check out the complete site: 'Abdu'l-Bahá in America 1912-2012