I’ve been struggling recently with how I want to encode the transcriptions of the very many digital facsimiles of my documents. I decided, even before I wanted to construct a digital project with my current work, that I wanted to move away from rtf, doc, and pdf files as much as possible for my research. There are a few reasons for this, not the least of which is that txt files are much smaller, more resilient, and, in a certain manner, more flexible. This has affected my workflow and presented a series of decisions about project-specific files.
I’ve spent some time playing with TextMate for transcribing using MultiMarkdown, which I like very much and which is integrated into Scrivener (my favorite app for writing). I’ve used the simple and free TextEdit to make plain txt files, which ultimately isn’t a solution because I do want/need some markup in the files. What I want to do is really mark these files up with both descriptive and, potentially, analytical bits that will ultimately be query-able. And so I keep finding myself drawn back to TEI, an XML schema designed specifically for the markup of humanist texts.
What I like about MultiMarkdown is the ease of transcription, especially using the bundle in TextMate, and the ability to transform to a variety of file types – xhtml, pdf, LaTeX, etc. But, even given its ease, it’s not designed to do the type of manuscript description and qualitative markup I’m looking for. The TEI is the opposite of mmd – it is completely overwhelming in its potential complexities, and as a result doesn’t leave me with a feeling of ease during transcription. Am I validated? Will I ever learn the elements and their attributes? What do I really need in that TEI Header, and what can I omit? At any rate, I bought <oxygen/> a few months ago because there is such a deep academic discount, and because I kept feeling this tug toward TEI. I’ve also been reading all the online tutorials and information I can find on TEI (lots of tutorials here, and I like these here and here).
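For anyone wrestling with the same header question: a teiHeader needs far less than the bulk of the Guidelines suggests. A minimal sketch – the titles and descriptions below are placeholders, not from any actual project file:

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <!-- fileDesc with these three children is the required core -->
      <titleStmt>
        <title>Transcription of [document title]</title>
      </titleStmt>
      <publicationStmt>
        <p>Unpublished draft transcription.</p>
      </publicationStmt>
      <sourceDesc>
        <p>Transcribed from a digital facsimile of the original manuscript.</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p><!-- transcription goes here --></p>
    </body>
  </text>
</TEI>
```

Everything else in the header – encoding descriptions, profiles, revision logs – can be added incrementally as the project warrants.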
The potential for TEI documents, as an xml data set, goes far beyond my personal technical skill. But, I’m planning for the future. I’m developing a digital archive that I want to last for the long term. So, what to do in the short term? I’m determined to get the work I’m doing up in an attractive and useful manner while I develop my personal technical skills and a community of digital historians and humanists.
So, where does that leave me for now? If you’ve ever glanced at this blog, you’ll know that I’m most familiar and comfortable with WordPress, which I’ve been using as a CMS for my teaching and other professional activities for a little while now. WordPress does so many things easily and well that work for edu deployment. But, I’m under no illusion that WordPress provides a framework for serious text analysis of a manuscript corpus like the one I’m developing. Which brings me back to transformations. As with other xml formats, it’s not THAT difficult to transform tei texts into other formats – xhtml, html, pdf, docx, odt, etc. And, using an xml editor like <oxygen/>, one could transform the documents with the built-in xslt scenarios, then save and upload to WordPress pages. Or, one could use this plugin, which allows you to embed shortcode in a page and have the xml transformed directly in WordPress. Nice. It took me a little playing around to get it to work. What I ended up doing was copying the whole xsl package from the TEI Consortium into wp-content, to locally host the full set of stylesheets for transforming into valid xhtml. What I’m thinking is that I can also hack the code from that plugin a bit to make a form allowing visitors to access the transcriptions in the format of their choice – as a pdf, a docx, or the raw xml.
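The Consortium’s stylesheets do the heavy lifting, but the basic mechanics of a tei-to-xhtml transformation amount to a template per element. A toy sketch – the element coverage and class names here are illustrative only, not the Consortium’s actual stylesheets:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">

  <!-- wrap the transcribed text body in a styleable container -->
  <xsl:template match="tei:body">
    <div class="tei-text"><xsl:apply-templates/></div>
  </xsl:template>

  <!-- tei paragraphs map straight to xhtml paragraphs -->
  <xsl:template match="tei:p">
    <p><xsl:apply-templates/></p>
  </xsl:template>

  <!-- names become classed spans, so css can highlight them -->
  <xsl:template match="tei:persName | tei:placeName">
    <span class="{local-name()}"><xsl:apply-templates/></span>
  </xsl:template>
</xsl:stylesheet>
```

A chain like this is also what makes the format-choice form plausible: the same tei source, run through different scenarios, yields the xhtml, pdf, or docx output.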
At any rate, here’s a sneak preview of what’s to come on the wordpress front end:
I’m hoping to have the site up before the summer is over, adding files I’ve been working on for the past year or two. But, I like the aesthetic of it as it is on the local dev right now. I’m also going to do a longer post specifically on how these decisions have affected my academic workflow.
TEI does get complex. My students usually have no problem with transcribing and applying structural markup – pagination, column breaks, and so on – to documents. However, when they have to mark up something like “John Doe, RIC District Inspector in Dunmanway,” where there is a name (with family-name and first-name parts), a job title (‘District Inspector’), an organisation (the RIC – the Royal Irish Constabulary), and two geographic identifiers (the ‘Irish’ in RIC and ‘Dunmanway’), they end up with more markup than text, and with nested markup. At that point, even grad students quickly find they can no longer follow what is going on. This level of markup is universally useful; but the next layer – where one applies some sort of ontology to reflect the particular research questions you are pursuing, marking up themes or topics – becomes more personal, and of less value to people who want to pursue completely different research questions through the text.
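To make the explosion concrete, here is one plausible (unvalidated) encoding of that phrase – element choices will vary by schema and project, but roughly seven words of text end up buried in half a dozen elements:

```xml
<p>
  <persName>
    <forename>John</forename>
    <surname>Doe</surname>
  </persName>,
  <orgName>RIC</orgName>
  <roleName>District Inspector</roleName>
  in <placeName>Dunmanway</placeName>
</p>
```

And this is before adding attributes – @ref keys to personography entries, @type values, and the like – which is where students tend to lose the thread entirely.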
Can’t beat oxygen though – I’ve tried a bunch of XML tools and it is the best.
Hi Mike – Thanks for the comment. The potential complexity is what has scared me away from really diving into the TEI for the two years or so I’ve been playing around with it. For now, my intention is to stick with a defined set of objectives for transcription – structural markup, names, places, etc. – because, as you say, that sort of thing is universally useful. The case set I’m transcribing consists of criminal and civil litigation, and those cases contain very predictable procedural sections. So, what I’m marking up are those sections, and then names, dates, places, job titles, and ethnic markers.
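As a hypothetical sketch of that restrained approach, a procedural section might look something like this – the type value, names, place, and date below are invented for illustration, not drawn from the actual case set:

```xml
<div type="deposition">
  <head>Deposition of <persName>Jane Roe</persName></head>
  <p>Taken before the magistrate at <placeName>Santa Fe</placeName>
     on <date when="1805-06-14">14 June 1805</date>, in which the
     deponent, a <roleName>tailor</roleName>, states that
     <!-- deposition text continues --></p>
</div>
```

Keeping to divs, names, dates, places, and roles keeps the markup legible while still leaving the files query-able later.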
I’ve been doing the next-layer type of analysis with QDA software and tag markup on plain txt files. In that case, if the structure of the document is relevant to the analysis I’m wanting to do, I code it.
As for oxygen– the continual validation, element wrapping, and built-in tools are great. The UI (at least for this Mac user) could use some serious improvement.
[…] tool, at least right now, can run into problems with shortcode. This has real implications for the post I wrote on importing TEI documents into a wordpress post or page using an xml processing plugin. […]