Blog pages and META tags

There are several pages that you want search engines to crawl in order to find other pages, but that you don't want indexed themselves. They are:

blog/blog.html
blog/files/category-*.html
blog/files/archive-*.html
blog/files/[0-9][0-9]-*-20[0-9][0-9].html

This assumes, of course, that you've set up "blog" as your blog page. These files change too often for their indexed content to be of any use to a search engine, so you want to add the following meta tag to each one:

<meta name="robots" content="noindex,follow">

But you simply can't do it. If you set the tag in the blog page inspector, it is applied to every file in the blog, not just the ones listed above. Not good.

The only way to do this is to massage the files after they have been updated on your site. I use Dreamhost, which gives you a shell account as a standard feature, so I can add a cron job that runs every five minutes. That job (run through a framework I've built for this type of thing) changes the "robots" meta tag from "all" to "noindex,follow", as I'd like.
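For illustration, here is roughly what that rewrite amounts to for a single file, sketched in shell rather than Perl. The function name fix_robots_tag is hypothetical, and this sed version is a simplification: it is case-sensitive and only matches a tag that sits entirely on one line.

```shell
#!/bin/bash
# Hypothetical helper (not the author's script): rewrite the robots meta tag
# in one HTML file so that it reads "noindex,follow".
fix_robots_tag() {
  local file=$1 tmp status
  tmp=$(mktemp) || return 1
  # Replace any <meta name="robots" ...> tag, whatever its current content.
  # Writing back with cat keeps the original file's permissions.
  sed 's|<meta name="robots"[^>]*>|<meta name="robots" content="noindex,follow"/>|g' \
      "$file" > "$tmp" && cat "$tmp" > "$file"
  status=$?
  rm -f "$tmp"
  return $status
}
```

If you ran a script like this straight from cron instead of through a framework, a crontab entry along the lines of `*/5 * * * * $HOME/bin/fixrobots` would fire it every five minutes. That path is an assumption, so adjust it to wherever your script lives.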

The source code for the script follows, but you can also download it: fixrobots source code.

#!/usr/bin/perl -w

use strict;
use File::Find();

my $search = 'meta name="robots"';
my $replace = 'meta name="robots" content="noindex,follow"';
my @filelist = ();
sub wanted
{
  push @filelist, $File::Find::name if /^blog\.html\z/s or
                                       /^archive-.*\.html\z/s or
                                       /^category-.*\.html\z/s or
                                       /^\d\d-.*-20\d\d\.html\z/s;
}

if ($#ARGV == -1)
{
  File::Find::find({wanted => \&wanted},
     '<my home directory>/derekwyatt.org/public/blog');
}
else
{
  @filelist = @ARGV;
}

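foreach my $f (@filelist)
{
  # Slurp the whole file so the tag is matched even if it spans lines.
  # Skip unreadable files rather than truncating them further down.
  open my $in, '<', $f or do { warn "Cannot read $f: $!\n"; next };
  my $file = do { local $/; <$in> };
  close $in;

  # Replace the existing robots tag, whatever its content, with ours.
  $file =~ s/<\Q$search\E.*?>/<$replace\/>/sig;

  open my $out, '>', $f or do { warn "Cannot write $f: $!\n"; next };
  print $out $file;
  close $out or warn "Cannot close $f: $!\n";
}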

Feel free to take it, and change it to suit your tastes.
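To check that the script has done its job, something like the following can list the pages that still lack the tag. This is only a sketch: unfixed_pages is a hypothetical name, the directory is whatever you pass in, and `grep -L` (print files without a match) is available in GNU and BSD grep but is not part of POSIX.

```shell
#!/bin/bash
# Hypothetical check: print blog index pages under a directory that do not
# yet contain the noindex,follow robots tag.
unfixed_pages() {
  local dir=$1
  # Same file name patterns as the script above, quoted so that the shell
  # passes them to find untouched.
  find "$dir" -type f \( -name 'blog.html' \
                      -o -name 'archive-*.html' \
                      -o -name 'category-*.html' \
                      -o -name '[0-9][0-9]-*-20[0-9][0-9].html' \) \
       -exec grep -L 'name="robots" content="noindex,follow"' {} +
}
```

Call it as `unfixed_pages ~/derekwyatt.org/public/blog` and judge by the output rather than the exit status, which differs between grep versions when `-L` is used.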

Note that it plugs very nicely into the changetrigger framework I put together, using the following plugin file (the above code being called "fixrobots"). The source code for the plugin follows, but you can also download it: blog_nasties source.

#!/bin/bash

CTfilelist()
{
  # Quote the patterns so the shell never expands them before find sees them.
  find ~/derekwyatt.org/public/blog        \
               -name 'blog.html' -o        \
               -name 'archive-*.html' -o   \
               -name 'category-*.html' -o  \
               -name '[0-9][0-9]-*-20[0-9][0-9].html'
}

CTaction()
{
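  local filelist=
  local opt OPTIND=1    # keep getopts state local to this call

  # -A and -C each name a file containing a list of paths; -D is accepted
  # by getopts but its list is not acted on.
  while getopts A:D:C: opt
  do
    case $opt in
      A) filelist="$filelist $(<"$OPTARG")"
         ;;
      C) filelist="$filelist $(<"$OPTARG")"
         ;;
    esac
  done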

  ~/bin/fixrobots $filelist

  return 3
}
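
One limitation of building the file list as a single string is that a path containing a space gets split in two. A possible variant, sketched here under the assumption that each file named by -A or -C holds one path per line, collects the paths into a bash array instead. collect_paths and the global paths array are hypothetical names, not part of changetrigger.

```shell
#!/bin/bash
# Hypothetical variant of the option parsing in CTaction: gather the listed
# paths into the global array "paths" so names with spaces survive intact.
collect_paths() {
  local opt line OPTIND=1
  paths=()
  while getopts A:D:C: opt
  do
    case $opt in
      A|C)
        # Read one path per line; also accept a last line with no newline.
        while IFS= read -r line || [ -n "$line" ]
        do
          if [ -n "$line" ]; then
            paths+=("$line")
          fi
        done < "$OPTARG"
        ;;
      D) ;;   # accepted but not acted on, as in the plugin above
    esac
  done
}
```

The caller would then run `~/bin/fixrobots "${paths[@]}"`. As with the original, an empty list means fixrobots gets no arguments and falls back to scanning the whole blog directory.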

Oh, and guys at RapidWeaver, please fix this bug :)