Why Version Control
In the long, long ago days of March 2010, Julie Meloni posted a piece on the ProfHacker blog on version control systems (VCS) and what they might offer to the everyday academic trying to hack her workflow. If you don’t know what version control is, take a look at that post. VCSs originated with the specific needs of software programmers to save, access, restore, share, and collaborate asynchronously on code. Version control has become a cornerstone of software development, and even more so of the social aspects that mark open source/free software communities. Sites such as GitHub have flourished because of the ease with which individuals can share, fork, and otherwise review and distribute code for open source projects super small and super large. (That’s right: the Linux kernel.) There are a number of open source VCSs that are available free of charge, including Subversion (svn), Git, and Mercurial (hg).
That’s all well and good for code, but what does it have to do with an historian’s research repository? Well, the basics of version control boil down to this: documents placed under revision control are tracked by the VCS. Changes to those documents are committed, together with a message summarizing the changes, to a repository. Repositories can be replicated, or cloned, on any machine that has the VCS installed. In essence, every change to a document that you save and commit to the repository is eternally there, and can be returned to. Differences between versions of documents can be visualized. There are other useful aspects for academics as well, including branching, which allows you to create a parallel version of any document and develop it independently. Just as an example, say you have a single CV that you want to adapt for specific venues (grant applications, job applications to different types of universities/colleges, etc.). With varying degrees of ease depending on the VCS, you could branch that CV into different forms tagged by their intended use. With a few simple commands, these changes can then be synced across your computers.
Suffice to say, a VCS offers an academic the ability to maintain versioned backups of his or her research and other personal files. And, for my own DIY predilections, I must admit that I like repurposing tools intended for wholly different audiences and putting them to work for historical research.
Choosing a VCS for research materials
Following Meloni’s advice back in March of 2010, I began to play with svn for maintaining a repo of my research and writing. (This post, for example, will be committed to my svn repo, which I haven’t yet migrated away from completely.) My webhost offers svn repos as part of its standard hosting package, and so I went that direction first. I was attracted by the idea in part because as a commuter, I have a variety of personal and work machines that I use on a regular basis, and I wanted all of my research files to sync on all those different computers. I am also attracted to the idea of redundancy in backup methods. I currently have offsite backups of my main laptop on Backblaze. I have some of my stuff in Dropbox. I also clone the hard drive of my laptop once a week to an external drive. Paranoid, I know. But I’ve come by that paranoia honestly.
When I started in earnest to collect, transcribe, and analyze sources for my next project, I decided I wanted a project-specific, version-controlled repository for my sources and Python scripts related to the project. I could have done this with my svn repo on my server. With svn, a single central repo tracks folders and files. You check out either all or part of the repo, work locally, and then commit everything back to that central repo. Works great, though svn makes some things like branching overly difficult.
After putting a small bit of code up on GitHub, I discovered that I liked working with git, which operates on a different model than svn. Git clones repositories to wherever you want them in their entirety. Changes to documents are made to a local working copy of the document, committed locally, and then pushed/pulled to/from other repositories. Branching is much easier, as is merging changes back in. But these are my research files, many of which aren’t ready for prime time yet. GitHub offers private repositories, but only for pay. I’m a cheapskate historian, so I went looking for a service that offered even a limited number of private repos for free. Enter Bitbucket, which offers free public and private repos for accounts with up to five users. In fact, Bitbucket also offers an unlimited license to academics with a verifiable .edu address. I like companies that offer educator discounts. Given the availability of free private repos, I decided to go with Bitbucket for my research archive.
Setting Up A Bitbucket Repo
Bitbucket now supports git, but when I started this project, it didn’t. So, the rest of this post will be a tutorial on using Mercurial with Bitbucket. As long as you’re not scared of using the command line a little (and would you be reading this if you were?), basic usage is very straightforward.
The most important first step is to install Mercurial on your machine. There are a couple of options for doing this. The easiest is to download the appropriate package installer for your operating system (OS X or Windows) just as you would any other piece of software. On Linux, install with your distribution’s package installer.
Alternatively on a Mac, if you use Homebrew you can install with:
$ brew install mercurial
Finally, if you’re familiar with Python and use either easy_install or pip, you can install that way. Mercurial is written in Python, and is available as a Python package. So, for a systemwide install, you simply need one of:
$ sudo easy_install mercurial
$ sudo pip install mercurial
Mercurial uses hg (the chemical symbol for mercury) as its command-line name. So, to verify that the installation worked, open a terminal prompt and enter:
$ hg --version
As of this writing, you should see something like:
Mercurial Distributed SCM (version 2.0)
(see http://mercurial.selenic.com for more information)
Copyright (C) 2005-2011 Matt Mackall and others
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
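One setup step is worth doing before your first commit: Mercurial records a name with every commit, and a fresh install will typically refuse to commit until one is configured. Here is a minimal sketch, assuming OS X or Linux, where per-user settings live in ~/.hgrc. The function name, the HGRC_FILE variable, and the sample identity are made up for illustration.

```shell
# Hypothetical helper: record your name in Mercurial's per-user config.
# HGRC_FILE is a name introduced for this sketch; it defaults to ~/.hgrc.
# Usage: set_hg_username "Jane Historian <jane@example.edu>"
set_hg_username() {
    HGRC_FILE="${HGRC_FILE:-$HOME/.hgrc}"
    if [ -z "$1" ]; then
        echo "usage: set_hg_username \"Name <email>\"" >&2
        return 1
    fi
    # Leave the file alone if a username line is already present.
    if [ -f "$HGRC_FILE" ] && grep -q '^username' "$HGRC_FILE"; then
        echo "username already set in $HGRC_FILE"
        return 0
    fi
    # Append a [ui] section holding the username setting.
    printf '[ui]\nusername = %s\n' "$1" >> "$HGRC_FILE"
}
```

You could just as well open ~/.hgrc in a text editor and type the two lines by hand; the function only saves you from adding them twice.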
Setting Up a BitBucket Repo
Now that hg is installed, the next step is to set up the private repository for your research materials on Bitbucket. It’s a painless process. Fill out the information on the sign up page. Please note that your username will define the urls for your repositories, so choose it carefully. If you may want an educator’s unlimited account in the future, provide your university email address.
Once your account is set up, log in. From the Repositories tab, click on create repository. On the next form, create a new repository from scratch by providing a name (like, say, “Research”). Make sure that Private is selected, along with Mercurial as the repository type. You can also choose to use either an issue tracker or wiki for Project Management. This is actually a cool feature. For humanities projects, I like to use a wiki, which I then use for project planning, large project ideas, and the like. Click on Create repository, and voila, you have an empty repository. Navigate on over to your new repository page, which will be in this pattern: https://bitbucket.org/<username>/<repo>. Nothing to see there yet, but now we’ll fix that.
You Need A Local Copy
OK, so now we need to set up a repo on your local machine to work with your bitbucket repo. We need to make a folder for the repo, we need to clone it, and then we can start to populate it with research materials that we’ll push back to bitbucket. This is the same process you’ll follow on any other computers you’d like to have your materials on as well. So, here we go.
First up, make a folder to put your repo in. I like to use a projects folder in my home directory. YMMV. Time to get a little used to the terminal, so let’s do it there. From your home directory, which is the directory Terminal will probably open in (if not, you can always get there with a simple $ cd), enter this command at the prompt:
$ mkdir projects
The road to version control forks here. When cloning your remote repo, you have two options. The first, using https, requires you to enter your bitbucket password each time. The second uses ssh and requires that you enter an ssh key once in your bitbucket account settings. Let’s look at the latter. On a Mac, this is pretty easy. We need to change into a hidden folder. You should already be in your home directory, and if you are, enter this:
$ cd .ssh
Then generate a key:
$ ssh-keygen -t rsa
This will create an ssh public/private key pair. Open the public key with vim or nano or your favorite text editor. It will most likely be named id_rsa.pub. You can check with $ ls. Highlight the entire key and copy it to your clipboard. Now, go back to your bitbucket account page and paste the key in the Add Key box. Now, back to your terminal. Change into your projects directory and clone the repo:
$ hg clone ssh://hg@bitbucket.org/<username>/<repo>
Of course, substitute your username and repo name there. This will create a new folder with the name of your repo, which is where your files will reside. The https method is similar, but requires no keygen, so you would have gone straight to cloning the repo for step two:
$ hg clone https://bitbucket.org/<username>/<repo>
That said, I prefer the ssh method.
Either way, you’re ready to go!
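If you already have an ssh key from some other service, there is no need to generate a new one with ssh-keygen (the standard OpenSSH tool for creating a key pair); you can paste the existing public key into Bitbucket instead. A small sketch of that check, not part of my original workflow: the function name and the KEY_FILE variable are made up for illustration, and the default path assumes an RSA key in the usual location.

```shell
# Hypothetical helper: print your public key, generating a key pair first
# only if none exists. KEY_FILE is a name introduced for this sketch.
show_public_key() {
    KEY_FILE="${KEY_FILE:-$HOME/.ssh/id_rsa}"
    if [ ! -f "$KEY_FILE.pub" ]; then
        # No key yet: make sure the folder exists, then generate one.
        # ssh-keygen will prompt you for an optional passphrase.
        mkdir -p "$(dirname "$KEY_FILE")"
        ssh-keygen -t rsa -f "$KEY_FILE" || return 1
    fi
    # Print the public half, ready to paste into the Add Key box.
    cat "$KEY_FILE.pub"
}
```

Only the .pub file ever gets pasted anywhere; the private half stays on your machine.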
Now that your repos are set up and linked, we can start putting files in them. Mercurial tracks files, and not folders. So empty folders won’t show up. Also, binary files (like, for example, .docx files) are more difficult than plain text files for VCSs to track. Mercurial will detect changes to a .docx file, but because the format is compressed, each new version tends to take up nearly as much space as a whole copy of the file, rather than just the changes as with plain text. Over time, this can make the repo large. But, as I’ve argued here before, plain text files are more durable, platform independent, and compact. I’ve switched to taking all my notes in .txt files, using markdown/multimarkdown syntax. I’ve also started switching excel spreadsheets to .csv files wherever I can. OK, back to the workflow.
On the most basic level, working with Mercurial will involve the following steps: 1. adding new files to version control; 2. committing changes in those files to the repository; 3. pushing changes from the local repo to bitbucket.
On a day-to-day basis, the workflow looks like this. If I’m transcribing a new document, I save that file to one of the folders in my local repo. The folder structure is normal: however you like to set up your research files. At some point, I add the new document to version control with the command:
$ hg add
This will add any file in the repo to version control that isn’t already. As I accumulate substantial work on a version controlled file, I commit it to the local repo with
$ hg commit -m "A message about the work."
Note that hg requires a commit message with any commit. I find the log of messages a nice way to track my progress. At least once a day (depending on how much work I’ve gotten done), I push the local repo changes to bitbucket with
$ hg push
I do this almost entirely using D-Term, which rules and provides a dropdown command line already in the directory you’re working from. And, that’s it for the most basic workflow. I now have an iterative back-up of all my research files both on my local machine and in the private repo at bitbucket. There are more advanced commands that are useful as well, but I’ll save those for another time.
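If you find yourself typing the same three commands every day, they can be bundled into a small shell function. This is only a sketch: the function names are made up, and it assumes you run them from inside your local repo. The second function covers the other half of syncing, pulling work down on a different computer, using hg pull with its --update option.

```shell
# Hypothetical helpers wrapping the daily commands; the function names
# are made up for this sketch. Run them from inside your local repo.

# Usage: research_save "Transcribed three letters"
research_save() {
    if [ -z "$1" ]; then
        echo "usage: research_save \"commit message\"" >&2
        return 1
    fi
    # Track any new files, commit locally, then push to the default
    # remote (the repo you cloned from). Stops at the first failure,
    # so nothing is pushed if there was nothing to commit.
    hg add && hg commit -m "$1" && hg push
}

# On another computer: fetch new changesets and update the working copy.
research_fetch() {
    hg pull --update
}
```

Dropping these into your shell profile is one option; typing the commands out by hand for a while is a better way to learn what each one does.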
A couple of final caveats: putting a file in the trash doesn’t tell Mercurial to stop tracking it. The proper way to delete a tracked file is with the command
$ hg remove <filename>
(Earlier versions of the file remain in the repo’s history either way.) Likewise, the proper way to rename a file is with
$ hg rename <oldname> <newname>
Otherwise, the file will show up as missing, and will need to be removed at the next commit. You can also get into trouble with conflicting versions in different repos, which require merging.
There is, obviously, much more to Mercurial than I’ve listed above. In the future, I’ll cover branching and merging scenarios for typical humanities work. Luckily, in the meantime there is some excellent documentation available. Bryan O’Sullivan has a free online version of his Mercurial: The Definitive Guide. There’s a nice tutorial by Joel Spolsky. And Bitbucket’s support pages are also useful.
Other version-control-like options
If you’re not ready to make the plunge into a full-fledged VCS, there are other options likely already at your fingertips. That’s certainly the case if you use a Mac. With the release of OS X Lion, Apple has integrated version control more explicitly into its operating system. If you look at the File menu in any program, you’ll see that Save now says Save Version. This form of save is actually tracking changes, and earlier versions can be browsed if the application supports the feature. Together with Apple’s Time Machine, that takes you a pretty long way towards version control of your documents. Additionally, for Mac users, programs like ForeverSave offer version control for pretty much any program and their binary files. If you use Dropbox, you can pay a little bit extra to turn on version control that can be accessed from their web interface for anything you keep in your Dropbox.
Dropbox isn’t platform specific, but I must admit that I don’t know what the ForeverSave or OS-specific options are on Windows or Linux. Feel free to share such in the comments.