Sunday 14 September 2008

Moving Atoms

Having found an apartment in Malaysia, I'm now faced with the logistics of moving. One problem is that for the last 5 years I've run my own server, which now supports several blogs, a forum and my mobile phone's OTA backup. If it were just for a week or two, I could take it offline temporarily, but from what I can tell, internet connectivity in Malaysia is unlikely to be reliable enough to keep all this online for a reasonable proportion of the time, and unless I pay serious money I'll be stuck with the hassle of a dynamic IP address. So I'm looking to migrate at least the blogs and forum to other services.

I've just finished migrating the posts and comments from my first blog (there are still hard links back to my server that need fixing). It was no easy task: although draft.blogger.com supports Atom-based import, it is very particular about some things, has very little documentation, and its error handling is completely useless. It seems to stop processing as soon as it finds something it doesn't like. If nothing has been imported yet, you get a generic, meaningless error message; if one or more posts were successfully imported, it silently fails to import the remaining posts, starting from the one where it failed. So here is what I found.



In addition to being a valid Atom 1.0 feed, the import file needs a unique self-referencing link in each entry: <link rel="self" type="application/atom+xml" href="post-id"/> The href does not have to be real, just unique, so I used example.com in my export from Roller. The import process also does not accept empty tags for the post or comment author's email, name or uri (according to the RNC schema, only email cannot be empty).
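Since the importer gives no useful errors, the two requirements above can be checked before uploading. The following is a minimal sketch in Python, not part of the original export; the function name check_feed and the wording of the messages are assumptions made for illustration. It only looks for missing or duplicate self links and empty author children, and does not validate the feed against the Atom schema.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def check_feed(xml_text):
    """Return a list of human-readable problems found in an Atom feed."""
    problems = []
    seen = set()
    root = ET.fromstring(xml_text)
    for n, entry in enumerate(root.findall(ATOM + "entry"), start=1):
        # Collect the href of every rel="self" link on this entry.
        hrefs = [link.get("href") for link in entry.findall(ATOM + "link")
                 if link.get("rel") == "self"]
        if not hrefs:
            problems.append(f"entry {n}: no self link")
        for href in hrefs:
            if href in seen:
                problems.append(f"entry {n}: duplicate self link {href}")
            seen.add(href)
        # Flag empty name, uri or email elements inside each author.
        for author in entry.findall(ATOM + "author"):
            for child in author:
                if not (child.text or "").strip():
                    tag = child.tag.replace(ATOM, "")
                    problems.append(f"entry {n}: empty author {tag}")
    return problems
```

An empty list means neither problem was found; it does not guarantee that the import will succeed.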

The following is the template I used to export posts and comments from Apache Roller 4.0. It is based on an earlier export template for JRoller by Damien Bonvillain that output Atom 0.3, with updates for Roller 4.0, Atom 1.0 and blogger.com's undocumented quirks. The exported content still needs some post-processing: removing or filling in empty author child tags, checking that truncated comment titles or malformed content have not broken anything, and replacing relative references (which shouldn't be there in the first place). Generally, though, it works for short blogs. There seems to be a hard-coded limit of 100 posts in getRecentWeblogEntries, so it needs rework to use a pager and an external script to fetch all the pages.
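The first post-processing step, removing or filling in empty author child tags, could be scripted along these lines. This is a sketch under assumptions rather than the script actually used for the migration: the function name fix_author_fields and the placeholder name "Anonymous" are hypothetical. Empty uri and email elements are removed, while an empty name is filled in, because Atom requires every author to have a name.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
THR_NS = "http://purl.org/syndication/thread/1.0"
ATOM = "{" + ATOM_NS + "}"

# Keep the usual prefixes when the tree is written back out.
ET.register_namespace("", ATOM_NS)
ET.register_namespace("thr", THR_NS)


def fix_author_fields(xml_text, default_name="Anonymous"):
    """Drop empty uri/email elements and fill in empty author names.

    default_name is a hypothetical placeholder; pick whatever suits.
    """
    root = ET.fromstring(xml_text)
    for author in root.iter(ATOM + "author"):
        # Copy the child list so removal does not disturb iteration.
        for child in list(author):
            if (child.text or "").strip():
                continue
            if child.tag == ATOM + "name":
                child.text = default_name
            else:
                author.remove(child)
    return ET.tostring(root, encoding="unicode")
```

Note that ElementTree writes CDATA sections back as ordinary escaped text. That is equivalent XML, but whether the importer treats it identically is untested here.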

As with the original, paste the contents below into a new Roller template, then use the template to access your blog. If you have too many posts for blogger.com or the export script to handle, you could try using a pager in the export template, or break up the export file manually after extracting everything.


#set($entries = $model.weblog.getRecentWeblogEntries('', 100))
<?xml version="1.0" encoding='utf-8'?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:thr="http://purl.org/syndication/thread/1.0">
    <id>$model.weblog.id</id>
    <title>$utils.escapeXML($model.weblog.name)</title>
    <subtitle>$utils.escapeXML($model.weblog.description)</subtitle>
    <updated>$utils.formatIso8601Date($model.weblog.lastModified)</updated>

    #foreach( $entry in $entries )
    <entry>
        <id>$entry.id</id>
        <title>$utils.escapeXML($entry.title)</title>

        <author>
          <name>$utils.escapeXML($entry.creator.fullName)</name>
        </author>

        <published>$utils.formatIso8601Date($entry.pubTime)</published>
        <updated>$utils.formatIso8601Date($entry.updateTime)</updated>

        <content type="html"><![CDATA[$entry.text]]></content>
        <category scheme="http://schemas.google.com/g/2005#kind"
              term="http://schemas.google.com/blogger/2008/kind#post"/>
        <link rel="self" type="application/atom+xml" href="http://example.com/$entry.id"/>
    </entry>
        ## use Atom threading extensions for comment annotation
        #foreach( $comment in $entry.comments )
        <entry>
            <id>$comment.id</id>
            <title>$utils.escapeXML($utils.truncate($utils.removeHTML($comment.content), 40, 50, "..."))</title>

            <author>
              <name>$utils.escapeXML($comment.name)</name>
              <uri>$utils.escapeXML($comment.url)</uri>
              <email>$comment.email</email>
            </author>
            <published>$utils.formatIso8601Date($comment.postTime)</published>
            <updated>$utils.formatIso8601Date($comment.postTime)</updated>
            <content>$utils.escapeXML($utils.removeHTML($comment.content))</content>
            <thr:in-reply-to ref="$entry.id" type="application/atom+xml" href="$entry.permalink"/>
            <category scheme="http://schemas.google.com/g/2005#kind"
              term="http://schemas.google.com/blogger/2008/kind#comment"/>
            <link rel="self" type="application/atom+xml" href="http://example.com/$comment.id"/>
        </entry>
        #end
    #end
</feed>
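
If the exported file turns out to be too large to import in one go, breaking it up need not be done by hand. The sketch below is one possible approach in Python, not something from the original migration: the function name split_feed and the default of 25 posts per file are arbitrary assumptions. It treats any entry containing a thr:in-reply-to element as a comment and keeps it in the same file as the post it follows, which relies on comments appearing directly after their post, as they do in the template above.

```python
import copy
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
THR_NS = "http://purl.org/syndication/thread/1.0"
ET.register_namespace("", ATOM_NS)
ET.register_namespace("thr", THR_NS)

ENTRY = "{%s}entry" % ATOM_NS
IN_REPLY_TO = "{%s}in-reply-to" % THR_NS


def split_feed(xml_text, posts_per_file=25):
    """Split one Atom feed into several, keeping comments with their post."""
    root = ET.fromstring(xml_text)
    # Feed-level elements (id, title, updated, ...) are copied to every part.
    header = [el for el in root if el.tag != ENTRY]
    parts, current, posts = [], [], 0
    for entry in root.findall(ENTRY):
        is_comment = entry.find(IN_REPLY_TO) is not None
        if not is_comment:
            # Start a new part only at a post boundary, never mid-thread.
            if posts == posts_per_file:
                parts.append(current)
                current, posts = [], 0
            posts += 1
        current.append(entry)
    if current:
        parts.append(current)

    feeds = []
    for entries in parts:
        feed = ET.Element(root.tag, root.attrib)
        feed.extend(copy.deepcopy(header))
        feed.extend(copy.deepcopy(entries))
        feeds.append(ET.tostring(feed, encoding="unicode"))
    return feeds
```

Each returned string is a complete feed that can be saved to its own file and imported separately.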

2 comments:

Anonymous said...

You had the pager along my template :-D I'll take your modifications, I still haven't imported my old posts ^^;

Jason Rumney said...

Hi Damien,

There were a few bugs in the template in this post, see the next post, after I had subjected it to more testing.