Sometimes I joke that I do web development to support my railfanning habit. It’s not entirely true, but it’s always pleasant when the two intersect.
I’m a Trains subscriber. Trains is a monthly publication which serves both those who actually work in the railroad industry and enthusiasts (railfans) like me. Beyond the monthly print publication (which I get electronically, but never mind), Trains publishes a daily news feed called News Wire. There’s lots of good information here on the various comings and goings in the industry, though US-centric.
I use Zotero to index information for research projects, mostly railroading but other topics as well. There’s an extension for Chrome, Zotero Connector, which lets you import web content directly into Zotero, saving a lot of manual entry. Many publications like The Atlantic and The Washington Post are natively supported. When one isn’t, Zotero makes a best guess based on page structure and metadata. How well that works depends on how well-formed the page is.
This is where we have a problem. The News Wire postings don’t have proper metadata at all; you need to scrape the page to find all the relevant parts. Zotero doesn’t know how to do that. As a result, one article imported with the following values:
- Title: Amtrak AEM-7 arrives in Strasburg | Trains Magazine
- Author: 12, Wayne Laepple | June
- Author: 2015
- Website Title: TrainsMag.com
- URL: http://trn.trains.com/news/news-wire/2015/06/12strasburg
Not so good. All of that came from a single page. The title comes from the page title, probably because there are multiple h1 headers defined. The date and the author are commingled in a “byline” div.
Fortunately you can roll your own definition; Zotero calls these “Translators”. There’s a primer which I found useful, though it omitted some steps. The easiest way to proceed is to use Zotero Scaffold, which is an IDE that runs in Mozilla Firefox. I use standalone Zotero with Chrome, so I didn’t have Zotero for Firefox installed. If you don’t have it, Scaffold will install but will not work. No error messages; it just sits there. This was incredibly frustrating until I realized my error.
Zotero Scaffold will write out a completed definition to the translators directory inside your Zotero data directory. On OS X mine was in /Users/foo/Library/Application Support/Zotero/Profiles/random string/zotero/translators. I’ve posted one to GitHub as a gist: https://gist.github.com/mackensen/981b1d5393e07e8435798eaee843e3fc. A few comments on this:
- Scaffold takes care of all the front matter, including the GUID.
- detectWeb and doWeb can be more complex if there are different types of data (such as a search results page). I deliberately provided a narrow page target so we’re only handling single posts.
- All the terror is in scrape, where I used xpath queries to extract the parts of the page I needed and then formatted them appropriately. Don’t overlook the utility function cleanAuthor(), which takes an author string and breaks out its component parts. In my first iteration I read the author as a plain string; everything seemed fine, but the author didn’t import into Zotero.
- Translators are loaded into memory by the browser. If you make a change, you’ll need to reload the browser (boo) or disable and reenable the extension (yay).
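To make the byline handling concrete, here is a rough sketch in plain JavaScript that runs outside Zotero. It is not the code from the gist. The byline text is an assumption inferred from the mangled import above, and parseByline and splitAuthor are hypothetical names; splitAuthor is only a simplified stand-in for the kind of work cleanAuthor() does, not Zotero’s implementation.

```javascript
// Hypothetical helpers for illustration; none of these names come from
// the gist or from Zotero itself.

// Assumed byline text, inferred from the mangled import shown earlier.
const SAMPLE_BYLINE = "By Wayne Laepple | June 12, 2015";

// Split a "By <author> | <date>" byline into its two parts.
function parseByline(text) {
  const parts = text.split("|").map((part) => part.trim());
  if (parts.length !== 2) {
    return null; // unexpected layout; the caller decides what to do
  }
  const author = parts[0].replace(/^by\s+/i, "");
  return { author: author, date: parts[1] };
}

// Simplified stand-in for splitting a "First Last" string into name
// parts: the final word is treated as the last name.
function splitAuthor(name) {
  const words = name.trim().split(/\s+/);
  const lastName = words.pop();
  return {
    firstName: words.join(" "),
    lastName: lastName,
    creatorType: "author",
  };
}

const byline = parseByline(SAMPLE_BYLINE);
const creator = splitAuthor(byline.author);
console.log(creator.lastName + ", " + creator.firstName); // Laepple, Wayne
console.log(byline.date); // June 12, 2015
```

The point of the second step is that Zotero wants a creator with separate name parts rather than one string, which is presumably why my first iteration appeared to work but imported nothing for the author.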
New result, same article:
- Title: Amtrak AEM-7 arrives in Strasburg
- Author: Laepple, Wayne
- Blog Title: Trains News Wire
- Date: June 12, 2015
- URL: http://trn.trains.com/news/news-wire/2015/06/12strasburg
Yeah, that’s much better.