I want to learn about linked data, ideally for use with online texts and library data, but that’s a big and complicated task, so it makes sense to start with something smaller and better defined. Having an interest in family history, I thought I’d try with data from 18th-century West Country parish registers. Many of these were transcribed in the 19th century in a series published by Phillimore and are now available as public domain scans on archive.org. Unfortunately the series covers marriages only, which limits what you can do with it, but that is still more than enough to get started.

If this were a database then each marriage would be a row, so it shouldn’t be too hard to get the data into a database as a starting point. Unfortunately there are only automatically generated OCR versions of the text, so there’ll be a lot of tedious proofing to do first. And to judge from previous experience of parish registers, there will also be a lot of variability in the quality of the data. What to do about entries that can’t quite all be read? Random comments from the vicar? Julian dates?
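A minimal sketch of the one-row-per-marriage idea, using Python's built-in sqlite3 module. The table layout, column names and the sample entry are all invented for illustration, not a finished design. The assumption is that illegible parts become NULLs plus an "uncertain" flag, stray comments go in a free-text note, and the date is kept exactly as written alongside a normalised year. The helper only handles the Old Style year (which began on 25 March in England until the end of 1751); it does not shift days from the Julian to the Gregorian calendar.

```python
import sqlite3

# Hypothetical schema: one row per marriage entry, keeping the transcribed
# text next to any normalised value so nothing from the register is lost.
SCHEMA = """
CREATE TABLE marriage (
    id              INTEGER PRIMARY KEY,
    parish          TEXT NOT NULL,
    groom_name      TEXT,                        -- NULL when illegible
    bride_name      TEXT,                        -- NULL when illegible
    date_as_written TEXT,                        -- exactly as transcribed
    year_new_style  INTEGER,                     -- year counted from 1 January
    uncertain       INTEGER NOT NULL DEFAULT 0,  -- 1 if any part is doubtful
    note            TEXT                         -- marginal comments and the like
);
"""


def new_style_year(recorded_year, month, day):
    """Return the year as counted from 1 January.

    Until the end of 1751 the English civil year began on 25 March, so an
    entry written as 10 February 1740 falls in what we would now call 1741.
    Only the year is adjusted; the day stays on the Julian calendar.
    """
    if recorded_year <= 1750 and (month, day) < (3, 25):
        return recorded_year + 1
    return recorded_year


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

# Invented example entry, not taken from a real register.
conn.execute(
    "INSERT INTO marriage (parish, groom_name, bride_name, date_as_written,"
    " year_new_style, uncertain, note) VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("Exampleton", "John Smith", None, "10 Feb 1740",
     new_style_year(1740, 2, 10), 1, "bride's name illegible"),
)

row = conn.execute(
    "SELECT year_new_style, bride_name FROM marriage"
).fetchone()
print(row)
```

Keeping date_as_written separate from the normalised year means a later correction to the conversion rule never touches the transcription itself.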

Once the database is designed and populated, there is fortunately a recipe-style howto on creating linked data by the great Jeni Tennison, which I’m going to try to follow through. That should give me a schema for the linked data version of the marriages. The only aspect I know of that isn’t really covered there - since she’s dealing with government data which must be true (TM) - is named graphs. There is already a huge amount of genealogical information on the web, much of it of completely unknown reliability, and I really don’t want to add to that. As I understand it, named graphs combined with the Semantic Web Publishing framework should allow me to describe the source of the data, not just the data.
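To make the named graph idea concrete, here is a rough sketch in plain Python that writes a few statements in N-Quads syntax, where an optional fourth term names the graph a statement belongs to. Everything under http://example.org/ is a made-up placeholder, the names are invented, and instead of the Semantic Web Publishing vocabulary it uses the Dublin Core `source` property simply to show the shape: the marriage statements sit in a named graph, and a separate statement about that graph records where they came from.

```python
EX = "http://example.org/"                        # hypothetical namespace
DCT_SOURCE = "http://purl.org/dc/terms/source"    # Dublin Core "source"


def iri(value):
    """Wrap an IRI in angle brackets, as N-Quads requires."""
    return "<" + value + ">"


def literal(text):
    """Quote a plain string literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def quad(subject, predicate, obj, graph=None):
    """Format one statement; without a graph it goes in the default graph."""
    parts = [subject, predicate, obj]
    if graph is not None:
        parts.append(graph)
    return " ".join(parts) + " ."


# One named graph per source volume (placeholder identifier).
graph = iri(EX + "graph/register-volume-1")
marriage = iri(EX + "marriage/1")

lines = [
    # The data itself, placed inside the named graph (invented names).
    quad(marriage, iri(EX + "groomName"), literal("John Smith"), graph),
    quad(marriage, iri(EX + "brideName"), literal("Mary Jones"), graph),
    # A statement about the graph: where its contents came from.
    quad(graph, iri(DCT_SOURCE),
         literal("Printed transcript; volume and page to be filled in")),
]

print("\n".join(lines))
```

Anyone consuming the data can then ask not only who married whom, but which transcript each claim rests on.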

And then I need a host to provide a SPARQL endpoint; D2R sounds like it should do everything I need. I’ll give it a try.
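For a sense of what using such an endpoint involves, here is a small sketch that builds (but does not send) a SPARQL query request with Python's standard library. The endpoint address is an assumption about a local test install, and the `ex:` property names are hypothetical placeholders rather than a real vocabulary. The only parts taken from the SPARQL protocol itself are the `query` parameter on a GET request and the JSON results media type.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed address of a locally running endpoint; adjust to the real one.
ENDPOINT = "http://localhost:2020/sparql"

# Hypothetical vocabulary: ex:groomName and ex:brideName are placeholders.
QUERY = """
PREFIX ex: <http://example.org/>
SELECT ?marriage ?groom ?bride
WHERE {
  ?marriage ex:groomName ?groom ;
            ex:brideName ?bride .
}
LIMIT 10
"""


def build_request(endpoint, query):
    """Build a GET request carrying the query in the 'query' parameter."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_request(ENDPOINT, QUERY)
print(request.full_url)

# To actually run it against a live endpoint:
#   from urllib.request import urlopen
#   with urlopen(request) as response:
#       print(response.read().decode("utf-8"))
```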

Maybe biting off a bit more than I can chew in very limited time; we’ll see. I’ll report back here with Part 2.