
Dumping a WordPress site

Wednesday, July 23, 2008 - 2:02pm

I had a WordPress blog set up at my last job, and it's definitely fun to run a dynamic website. But once I left, I didn't feel like maintaining it (in particular, watching for crackers and comment spammers), so I decided to make it static. It's doable with a little MySQL- and unix-fu.

I found an excellent howto on the subject by Ammon. It needed a few tweaks (which I left in the comments), but the major work is there.

One big annoyance was that I had moved out before archiving. That was particularly troublesome because the database server for this particular site was a FreeBSD box under my desk. That box isn't connected to the internet at the moment, and if it were, it wouldn't have the same IP address anymore. Luckily, I routinely backed up that database, so I fired up a temporary MySQL server on my laptop and changed the WordPress configuration file (nicely DRY: I only had to change three lines, and two of those would have been avoidable had I kept the same username and password).
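For concreteness, it went roughly like this; the dump file name, database name, and credentials below are all placeholders, not the real ones:

    # Spin up a scratch MySQL server on the laptop and load the backup into it.
    mysqld_safe --datadir=/tmp/mysql-archive &
    mysql -u root -e 'CREATE DATABASE wordpress;'
    mysql -u root wordpress < wordpress-backup.sql

    # Then point WordPress at it by editing wp-config.php.
    # The three lines in question look like:
    #   define('DB_HOST', '127.0.0.1');   # the box under my desk is gone
    #   define('DB_USER', 'wpuser');      # avoidable had I kept the same username
    #   define('DB_PASSWORD', 'secret');  # likewise for the password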

The next part I couldn't figure out was how to insert boilerplate text into every page explaining that the site was an archive and readers should go to my new site. I think it's a testament to the flexibility of Drupal, my new content management system, that I already know how to do such a thing in that framework. Just make a block and make sure it goes in the content region. But WordPress keeps its custom widgets and gewgaws on the sidebar, which wasn't prominent enough for me. I ended up having to hack the theme file to hard-code the message in the page header. I still believe there's a "right" way to do this, but all I was going for was a dump of the site, so I'm not going to apologize for forking. :-)
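If you're curious, the hack amounts to something like the following; the theme directory, file name, and notice markup here are guesses for illustration, not the actual edit:

    cd wp-content/themes/default        # hypothetical theme directory
    cp header.php header.php.orig       # keep a pristine copy
    # Splice an archive notice in right after the opening body tag:
    perl -i -pe 's{<body>}{<body>\n<div class="archive-notice">This blog is an archive; please visit my new site.</div>}' header.php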

(A solution outside of WordPress would be to parse every HTML page after dumping [with XSLT, or even a regular-expression-enabled scripting language like perl] and insert the boilerplate into every page. I've done that with dynamic sites that had already been archived. But that seems like an unfair ask, since I already had a live, dynamic CMS working for me.)
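For what it's worth, the perl version of that trick is nearly a one-liner; the directory layout and wording are invented:

    # Splice the boilerplate into every dumped page, right after the body tag.
    find archive/ -name '*.html' -exec perl -i -pe \
      's{<body([^>]*)>}{<body$1><p><em>This is an archived copy. See my new site for current content.</em></p>}' {} +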

After that, wget is your friend. This little command-line tool has dozens of options for fetching web pages and even whole websites. Luckily Ammon had a sample usage that I could copy, although I also needed to add the "-k" flag to correct the links.
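I won't reproduce Ammon's exact invocation, but the general shape is something like this (the URL is a placeholder):

    wget --mirror -k -p -E http://blog.example.com/
    # --mirror   recurse through the whole site, with timestamping
    # -k         convert links so the dump browses correctly offline (the flag I added)
    # -p         grab page requisites like images and stylesheets
    # -E         append .html to pages served without an extension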

Then you get rid of the dynamic scripts (after backing them up) and move the archived material into their place. And that's it!
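In shell terms, with made-up paths, that last step is just:

    mv /var/www/blog /var/www/blog.bak               # back up the dynamic scripts
    cp -R ~/dump/blog.example.com /var/www/blog      # archived material into their place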

This is the kind of task that you almost feel justified spending five hours on if you can spend another 30 minutes writing a blog post about it. But I'm glad it's done.

automated website dumps

Mon, 03/02/2009 - 23:39
Travis Johnson (not verified)

Thanks for the Twitter follow. Somehow I missed it a while ago.

How does Drupal compare to WordPress in terms of the energy spent managing comment spam and so on? Also, I just activated the Akismet plugin on my blog... did you try any kind of aggregated, automated spam filtering like that?

These kinds of tools are interesting... I had a MediaWiki-to-HTML converter called mw2html set up on a weekly cron job at my last job, and I believe MoinMoin comes with one built in (moindump). It's almost surprising WordPress doesn't include a similar facility, or that someone hasn't written a WordPress backend that reads the XML file instead of a database (since WP makes it so easy to export XML files).
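For reference, the cron entry for such a converter is nothing fancy; something like this, with invented paths and schedule:

    # Weekly dump of the wiki to static HTML, Sundays at 3am (hypothetical paths):
    0 3 * * 0  /usr/local/bin/mw2html http://wiki.example.com /var/backups/wiki-html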

Best, Travis

Hi

Wed, 03/18/2009 - 12:48
leingang

Thanks for the comment. Somehow I missed it a while ago. :-)

I used Akismet for the WordPress blog and it worked very well. I haven't gotten around to installing automated spam-comment filtering here; right now everything's automatically put in a moderation queue and I have to approve it. Drupal has an Akismet module, but I'm also looking at Mollom.

I really like Drupal so far. There's a lot of stuff out there to enhance your site, and updates are pretty easy (not automatic, but they have a nice checklist). What I'm particularly excited about is using the FeedAPI module to automatically import content I upload to other sites (e.g., SlideShare, Scribd).