Technology Solutions for Everyday Folks

Moving an Old Website to Github Pages: Getting Started

Nearly 25 years ago I spun up a website featuring transcriptions of Monty Python material collected from college students in the 1980s. I don't remember how I came across the archive of these text files, but I still have the originals in part of my personal digital archive. Around the year 2000 I moved the website (and subsequently added more content) into a Wiki system (TWiki, to be exact). At the time, the TWiki platform made it substantially easier to handle the markup needed to convert these flat transcription files into something a little more presentable, but it still required a lot of hand-editing.

Around 2007 I made some hosting system changes, and one resulting change was a move away from TWiki. I used a scraping system to capture the TWiki content of the site as flat HTML files with CSS, and reproduced the structure (without the actual TWiki engine) on the new host. This eliminated the need to maintain the software and allowed a "lift and shift" to a new host without having to structurally change the flat files.

That's where things ended for the site. Since 2007 the site has kept its clunky and decidedly early-2000s look and feel. I really cringed at it. Due to the way the site was nested among others in a similar directory structure, it has periodically caused issues with access/rewrites or necessitated other fun .htaccess patterns. This is what I call 'technical debt', but it was also something I wasn't exactly interested in changing, simply due to the "what should I do with this site content?" question.

So, What Should I Do?

Several months back I was thinking about "new" side/investigation/tinkering projects to start. Having recently 'de-coupled' some other site and tech debt to atone for past sins, and having never finished porting the Python material to the old site, I decided it would be as good a time as any to look into moving the old site into something modern but with lower long-term maintenance overhead.

As the original files are simply transcriptions/flat text files, there's not really much in the way of necessary formatting. BINGO! Markdown would be perfect for this sort of content!

The content itself is pretty static. The Pythons haven't collectively put out new content in over three decades. I don't need a CMS or any other sort of "system" to keep and manage the material. BINGO! Github Pages would be a perfect host for the content!

I've used Github Pages for other projects but always with the built-in templates provided by Github. This project would necessitate its own customized template...and would be a learning and tinkering process. For added tinkering I decided against running Jekyll locally to test, instead pushing commits directly to Github to deploy. I gave this move a pause at first ("should I deploy locally to test?"), but it didn't take me long to appreciate moving right to Github (more about that below).

The Layout Conversion

To start, I needed to spin up a new Github repo for the project and create the Github Pages deployment/site. I needed to select an appropriate "base" template/theme for the site. In the end I chose to roll with the So Simple Jekyll Theme. It had the core bits I desired, a clean look, and seemed to be easy enough to extend/modify as I needed.

Creating the Base Structure

I started by stubbing out a landing page, basic navigation, site logo, and doing some of the basic configuration changes in _config.yml for the site. Wasn't too bad, since I wasn't really screwing around with the actual layout of templates. I was happy to see documentation on extending the template, so adding the requisite bits for collections, taxonomy, and front matter defaults was pretty straightforward.

Next came some basic layout and theme modifications beyond the "out of box" experience. This is where I first started running into some issues, primarily because I hadn't really wrapped my head around how the template system is structured and how overrides work. The TL;DR version: Modified copies of theme/template files in the "local" repo/structure take precedence, but which files to modify isn't always obvious.

Adding Basic/Test Content

It was time to add a few bits of content to get the basics functioning as expected (collections, tags, categories). The new site structure boils down to two distinct "page" types (and matching collections), where one is a pretty straight series of "the same thing" (scenes from Monty Python and the Holy Grail) and the other is a more variable set of skits, sketches, and other materials that are organized by tag and category. I created four testing files: two for each core collection.

The markdown formatting for the files is pretty simple but takes some time to "get right" (it's tedious), so I skipped the full formatting for some files to get started. The most important bits in this test content roll into the front matter of the various files:

---
title: "The Lumberjack Song"
categories:
  - flying circus
tags:
  - sketch
  - song
---

---
title: "Introduction"
categories:
  - holy grail
---

Loading up the pages worked pretty seamlessly right out of the box. There were some sorting and layout/style issues, but the basics were in place and generally working!
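Since nothing is built locally in this workflow, a small script can sanity-check front matter like the examples above before a push. What follows is only a minimal sketch under stated assumptions: `parse_front_matter` is a hypothetical helper (not part of Jekyll or the theme), and it deliberately understands just flat `key: value` pairs and simple `- item` lists, not full YAML.

```python
def parse_front_matter(text):
    """Split a page into (metadata, body).

    Hypothetical helper: handles only flat `key: value` pairs and
    simple `- item` lists, which covers the examples in this post.
    """
    lines = text.splitlines()
    # Front matter must start on the very first line.
    if not lines or lines[0].strip() != "---":
        return {}, text

    # Find the closing delimiter.
    end = next(
        (i for i in range(1, len(lines)) if lines[i].strip() == "---"),
        None,
    )
    if end is None:
        return {}, text

    meta = {}
    list_key = None  # key currently collecting list items, if any
    for line in lines[1:end]:
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("- ") and list_key is not None:
            meta[list_key].append(stripped[2:].strip())
        elif ":" in stripped:
            key, _, value = stripped.partition(":")
            key = key.strip()
            value = value.strip().strip('"')
            if value:
                meta[key] = value
                list_key = None
            else:
                # A bare `key:` introduces a list on the following lines.
                meta[key] = []
                list_key = key

    body = "\n".join(lines[end + 1:])
    return meta, body
```

Running this over each file and flagging any page whose metadata is empty, or is missing `title`, would catch a forgotten opening `---` before Github Pages does.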

Categorization and Tagging

Collections == Easy

As set in _config.yml, the collections I'd stubbed out worked pretty well right out of the box. I was able to segregate the content accordingly and pretty painlessly. But that was really the end of the easy road.

Tags == Terrible

Next step was to click through the tags on some of the content. Cue an immediate barf.

In retrospect, I spent an inordinate amount of time trying to get tags to behave. Googling things didn't really help -- I was clearly doing the "right things" and they should be working...so what the hell was wrong?

Turns out that Github Pages Jekyll doesn't "do tags" out of the box...and certainly not in the way standard Jekyll handles them.

Categories == Tags == Terrible

I switched gears to try using categories in lieu of tags, given what I'd discovered. Same problem.

I was at a crossroads and at this point I stepped away for a couple weeks...but in that time I thought a bit about whether or not categories and tags would be useful/necessary. The easy answer would be to ignore them and remove the functionality, but it'd be nice to have. I settled on a mode of "I'm coming back to this and going to give it one more go. If I can get one/both of them working in relatively short order, I will...but if not I'm cutting the loss and moving on."

I found a couple of useful walkthroughs for handling tags on Github Pages: Jekyll Tags on Github Pages and Use Tags and Categories in your Jekyll based Github Pages without plugins.

Each fundamentally does the same thing, but I settled on the latter method as my base since it more closely aligned with what I was doing and didn't require the additional overhead of adding an _includes file or fiddling with stuff in head.html of the site (and being executed on all pages). After a few minor iterations of the aforementioned, I had tags working as expected! The layout was shit, but the functionality was working!

I ported the tag functionality over to categories and Voila! It too was working as expected. Layout and styling will be fixed later...
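For readers who think better in a general-purpose language than in Liquid, the idea behind a plugin-free tag (or category) page can be sketched as plain grouping: walk every document, collect its tags, and list the matching titles under each one. This is an illustration of the concept only, not a transcription of either walkthrough; `build_tag_index` and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_tag_index(documents, field="tags"):
    """Group document titles by tag (or category), both sorted alphabetically.

    Hypothetical helper: `documents` is a list of dicts shaped like
    parsed front matter, e.g. {"title": ..., "tags": [...]}.
    """
    index = defaultdict(list)
    for doc in documents:
        # Documents without the field simply contribute nothing.
        for value in doc.get(field, []):
            index[value].append(doc["title"])
    return {value: sorted(titles) for value, titles in sorted(index.items())}


# Sample data for illustration only.
docs = [
    {"title": "The Lumberjack Song", "tags": ["sketch", "song"]},
    {"title": "Example Sketch", "tags": ["sketch"]},
    {"title": "Introduction"},
]
tag_index = build_tag_index(docs)
```

Passing `field="categories"` reuses the same grouping for categories, which mirrors how the tag approach ported over with little change.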

Tales of Sorting Woes

For the landing/collection/category/tag pages, sorting was behaving as expected with things falling into alphabetical order. This is perfect for the sketches content because there isn't really a better default sort order for it. Out of the box, it just worked.

For the film content, however, there is a definite order, and it's not alphabetic. I struggled with this at first since I didn't want to be locked into creating the individual files in their natural order.

There are plenty of ideas out on the Internet to fix this nuance with Github Pages; I just wanted something simple that could be added to the front matter of files.

My first go was to add a filmorder front matter value, like so:

filmorder: 1

I added support for filmorder by injecting the following among the other sort options in the documents-collection.html override, like so:

{% elsif include.sort_by == 'filmorder' %}
  {% if include.sort_order == 'reverse' %}
    {% assign entries = entries | sort: 'filmorder' | reverse %}
  {% else %}
    {% assign entries = entries | sort: 'filmorder' %}
  {% endif %}
{% endif %}

Add the proper sort_by value to the collection front matter, and things worked as intended!

sort_by: filmorder

This solved the sorting for the film landing/collection page, and had no impact on any of the other content of the site due to it being 'segmented' out with a front matter item.

But Wait: What About Pagination?!?

Pagination on individual files among the sketches behaved "as expected" in that it was being handled in alphabetic order like the other collection/category/tag pages. Not a problem. However, the "weird" sort order on the film content proved to be an issue.

After several failed attempts and dead-end Google searches, I found a sort of "hack" to make this behave in the way I wanted. One way to address this is to name the files in a specific manner (e.g. leading numbers for sort order in the file name), which could work for my purposes. I was already dealing with the collection custom sort order in the front matter, though, so I'd prefer to stick with the same "type" of mechanism if possible.

This led to the "hack" I chose: a "fake" date entry. Since this material is so old and not going to change, the date doesn't really matter. By adding a common date (I chose November 4, 2021 since it was the day I was working on this) with a time value aligned to the filmorder value, I could have both. Even better, since the Holy Grail script is only about 45 pages, I could easily do this with minute increments and not have to deal with other units of time!

As an example, the front matter of the first page included:

filmorder: 0
date: 2021-11-04 00:00:00

The front matter of the 22nd page similarly includes:

filmorder: 22
date: 2021-11-04 00:22:00

This little date hack made the individual file pagination behave as I expected! Bits of the script show up in the proper flow...and there was much rejoicing!
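The mapping from filmorder to the fake date is mechanical, so it can be generated rather than typed by hand. A minimal sketch, assuming the minute offset equals the filmorder value as in the two examples above; `fake_date` and `order_front_matter` are hypothetical helper names, not anything Jekyll provides.

```python
from datetime import datetime, timedelta

# The common "fake" date used in the post.
BASE_DATE = datetime(2021, 11, 4)


def fake_date(filmorder):
    """Return a front matter date whose minute offset equals filmorder."""
    if not 0 <= filmorder < 24 * 60:
        raise ValueError("filmorder must fit within one day of minutes")
    stamp = BASE_DATE + timedelta(minutes=filmorder)
    return stamp.strftime("%Y-%m-%d %H:%M:%S")


def order_front_matter(filmorder):
    """Return the two ordering lines for a scene's front matter."""
    return f"filmorder: {filmorder}\ndate: {fake_date(filmorder)}"


print(order_front_matter(22))
# filmorder: 22
# date: 2021-11-04 00:22:00
```

Because the timestamps are zero-padded, sorting pages by date gives the same order as sorting by filmorder, which is what lets the collection sort and the date-driven pagination agree.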

Converting the Content

With all the requisite bits in place, it was time to "mass convert" the content. There are about 45 "pages" of content in the Holy Grail series, about 75 sketches, and about 10 other bits. It's an undertaking, but it's mostly mechanical. I used downtime around the Thanksgiving holiday to do the sketches markup from the originals.

I'd love to have scripted it out, but there's just enough variability due to how the material was sourced or generated that it's not worth the effort. Fortunately, however, almost all of the content is pretty straightforward to format, especially with Markdown. It's tedious to go through each file/script, but with Markdown it's substantially easier than when I converted some of the material to straight HTML. Lots. Of. Tags.

A Visual Studio Code plug: without VSCode the process would've taken a lot longer. It only took a few hours' time to convert all the sketches, spread out over several days/efforts. Lots of find/replace by pattern matching and addition of basic markdown.
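To give a flavor of the kind of pattern-based find/replace involved, here is a rough sketch of two such rules written as regular expressions. The patterns are assumptions for illustration (an all-caps speaker name followed by a colon, and a stage direction on its own line in square brackets); the real transcriptions vary more than this, which is why the actual work stayed semi-manual in the editor.

```python
import re

# Assumed pattern: an all-caps speaker label, e.g. "ARTHUR: Whoa there!"
SPEAKER = re.compile(r"^([A-Z][A-Z .'#0-9-]*):\s*(.*)$")
# Assumed pattern: a stage direction alone on a line, e.g. "[clop clop clop]"
DIRECTION = re.compile(r"^\[(.+)\]$")


def to_markdown(line):
    """Apply the two illustrative rules to a single transcript line."""
    line = line.rstrip()
    match = DIRECTION.match(line)
    if match:
        # Stage directions become italic text.
        return f"*{match.group(1)}*"
    match = SPEAKER.match(line)
    if match:
        # Speaker labels become bold, followed by the dialogue.
        return f"**{match.group(1)}:** {match.group(2)}".rstrip()
    # Anything else passes through untouched.
    return line


print(to_markdown("ARTHUR: Whoa there!"))
# **ARTHUR:** Whoa there!
```

In VSCode the same patterns can be used in the find/replace box with regex mode switched on, reviewing each match instead of converting blindly.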

The Holy Grail Content

I decided to handle conversion of the Holy Grail screenplay a little differently than the sketches. The screenplay has notes related to material struck from and added to the film, and I had done some manipulation there for clarity on the old TWiki-based site. To that end, I decided instead to do a copy/paste for each scene from the old site into VSCode, where I could then do the more mechanical conversion to Markdown as I'd done with the sketches. A couple hours later, shortly after Christmas, I was done with the conversion.

More To Come

In the next post I'll review the other "getting ready to go live" steps to complete this long-awaited transition.