playing with python

Posted on December 8, 2010 by ctb — 1 Comment

A few months ago I started a concerted effort to learn to code some Python. The last time I wrote a program for a computer was in 1985 in 8th grade, and it involved BASIC and some blip moving across the screen. It’s intimidating to figure out where to start. Initially I wasn’t sure if I should learn Python, or do what all the other DH kewl kids were doing, and opt for Ruby. Ultimately, my decision to go with Python was influenced by advice from a trusted coder friend and by the existence of two things: The Programming Historian, and the Natural Language Toolkit (nltk) for natural language processing (nlp). Both impacted my decision because I could see immediate payoff for things I want to do with the manuscript collection for my second book/digital project. As I am developing a very large body of file transcriptions, I want to use text mining and nlp tools to look at language use and trends within the case set. Because this case set is tightly determined by a particular type of litigation (criminal prosecutions of sexuality), I think it will be very useful for familiarizing myself with manipulating and analyzing content through scripts.

Ultimately, my goals are much larger. I want to do a DH project that utilizes nlp in order to analyze the relationship between prescriptive legal texts and actual legal practice across a broad range of the early modern Spanish judicial state. Law, legal reason, and jurisdictional thinking inflected the entirety of Spain’s imperial bureaucracy and, at least in my theory, of everyday life as well. So, the longer vision of what I want to do will use nlp and topic modeling to test that thesis across a broad range of documentary evidence.

In the meantime, I’m just getting my feet wet and figured it may be useful to write about the experience. I’m finding learning to code to be a very exciting, frustrating, and at times overwhelming experience. It’s exciting, and rewarding, because the payoff when things work is very quick. For someone who is used to writing articles, researching books, and the like, the payoff is usually years down the road. But, writing a script and making it work, that’s an immediate return (and one that builds toward the straight history project payoff anyway). The frustrations come, of course, in trying to figure out how to tell the computer what I want it to do, and then trying to figure out what it’s telling me when I’m wrong. Learning a new language, whether formal or natural, is never easy. Doing it on one’s own is even harder, especially given how quickly a programming language can become complex. That, and I’m rapidly approaching 40 years old, so the synapses in the language-acquisition part of my brain are tired and just trying to hold on to Spanish! The whole thing can be overwhelming because, well, it seems like every piece of the puzzle leads down a new rabbit hole of bewildering choices. Trying to figure out best practices on one’s own is difficult. Luckily, there are a ton of resources on the web for the autodidact interested in python, from MIT courses to beginner tutorials. And, for my purposes, one of the best of those tutorials is The Programming Historian.

I liked it because I was immediately typing in code that worked, and did useful things for an academic humanist. For me, actually typing the code in (instead of cut-paste) is one of the most important elements in learning. Somehow, reading the code and understanding what it’s doing works much better for me if I type it in. That site is geared towards using scripts to research on the web, and to make wordclouds or find kwics and n-grams from sources on the web. That doesn’t interest me so much, but what I quickly figured out was how to point the scripts to my local file structure so that I could play with my own files. And this is where it got more fun. I got tired of manually entering local file names, and decided it would be really cool to use the system file open/save dialog boxes. Python makes this so easy, simply by importing a library called easygui. So, in just a couple of weeks, I’d figured out how to open a file on the disk, read it, do some simple pre-processing of the text (strip tags, return everything lowercase, pull out stopwords), and do some simple analysis. Immediately rewarding, and now I want more.
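The pre-processing steps mentioned above (strip tags, lowercase, pull out stopwords) might look something like the sketch below. This is not The Programming Historian's code: the function name `preprocess`, the crude tag-stripping regex, the tiny stopword list, and the sample sentence are all made up for illustration, and it is written for Python 3.

```python
import re

# Tiny illustrative stopword list (an assumption; a real list would be much longer).
STOPWORDS = {'el', 'la', 'los', 'las', 'de', 'y', 'en', 'por', 'a', 'que'}

def preprocess(raw):
    """Strip tags, lowercase, split into words, and drop stopwords."""
    no_tags = re.sub(r'<[^>]+>', ' ', raw)   # crude tag removal, fine for simple markup
    lowered = no_tags.lower()
    words = re.findall(r'\w+', lowered)      # \w also matches accented letters in Python 3
    return [w for w in words if w not in STOPWORDS]

# Invented sample sentence, not taken from the archive.
sample = '<p>Causa criminal por el robo de una mula en la ciudad de Quito</p>'
print(preprocess(sample))
```

A regex is a blunt tool for stripping tags, but for lightly marked-up transcriptions it is usually enough to get started.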

To get more, I started looking at the tools the nltk offered for more sophisticated mining of texts. I’m very much at the beginning on this still, but already it’s paying off. For example, the nltk allows you to plot a word frequency distribution across the length of a single document. I have a release-candidate version of the pdf of the Criminales section of the Archivo Nacional del Ecuador. It’s password protected, but it turns out that Skim will still allow me to highlight sections, and export those highlights as a .txt file. So, I did that and produced a single .txt that spanned the years 1601 to 1800, and then did a distribution plot of a handful of terms. Interesting trends were immediately identifiable. Here’s one result:

I know that’s a bit hard to see, but the blue marks represent frequencies of the terms across the length of the document (and, therefore, across time). You’ll note that arrests for murder (muerte) and robbery (robo) remain fairly constant across the two centuries, but that concubinage (concubinato) and adultery (adulterio) greatly intensify in the later decades. Violence and property crime: always a state concern. But enforcement of sex norms through police power: very much an 18th-century phenomenon.
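The plot described above resembles what the nltk calls a lexical dispersion plot (`nltk.Text.dispersion_plot`, which also needs matplotlib). The idea underneath can be sketched without either library: record the word position of each occurrence of a term, then count occurrences per slice of the document. The helper names `term_offsets` and `counts_per_slice` are hypothetical, the sample text is invented, and the code assumes Python 3.

```python
import re

def term_offsets(text, terms):
    """Map each term to the word positions where it occurs; also return the word count."""
    words = re.findall(r'\w+', text.lower())
    offsets = {term: [] for term in terms}
    for i, word in enumerate(words):
        if word in offsets:
            offsets[word].append(i)
    return offsets, len(words)

def counts_per_slice(positions, total_words, slices):
    """Count occurrences in each of `slices` equal-width chunks of the document."""
    counts = [0] * slices
    for pos in positions:
        counts[pos * slices // total_words] += 1
    return counts

# Invented sample standing in for a chronologically ordered document.
sample = 'robo muerte robo muerte robo concubinato adulterio concubinato'
offsets, total = term_offsets(sample, ['robo', 'concubinato'])
print(offsets['robo'])
print(counts_per_slice(offsets['concubinato'], total, 2))
```

Because the source document runs in chronological order, a slice of the document stands in roughly for a span of years, which is why position can be read as time.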

I’m now in the midst of reproducing this process. I’ve produced files for each 10-year period, starting in 1601 and going through 1834 – the length of the Criminales section. I’m going to mine the descriptions of the crimes both in decade chunks, and all together to look for what amount to policing trends. Right now, I’m just putting together a few simple scripts to clean up the files, and combine them. Skim puts a heading on each “note” – which is written as a single string – that says “* Highlight, page xxx” where xxx is the page number of the pdf. To remove those, I wrote a script using a regex. Is this the best, most efficient, or easiest means to process the files? I doubt it, but it does help with learning. Here’s that script:

#! /usr/bin/python

import re, easygui

# choose criminales file to open.
f = open(easygui.fileopenbox(msg='open', default='/path/to/preferred/directory/', filetypes=['*.txt']), 'r')
text = f.read()

# uncomment next line to check if file is being read correctly
# print(text)

# compile regex -- not a necessary step, but useful later when
# batch processing. Matches Skim headings with 1- to 3-digit page numbers.
p = re.compile(r'\* Highlight, page [0-9]{1,3}\n')

# uncomment next line to test if regex is being compiled
# print(p)

# check for matches in the text for the regex, and return info
m = p.search(text)

if m:
	print('Match(es) found.')
else:
	print('No matches.')

# delete matching regex strings
newText = p.sub('', text)
f.close()

# choose file to save new text to, and write that file.
saveas = open(easygui.filesavebox(msg='save as', default='/path/to/preferred/directory/', filetypes=['*.txt']), 'w')

saveas.write(newText)

saveas.close()
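One way to check the pattern without the dialog boxes is to run it on a short made-up string. In this sketch the sample text and the function name `strip_skim_headings` are assumptions for illustration, and `[0-9]+` is used so that page numbers of any length are caught. It is written for Python 3.

```python
import re

# Skim heading lines such as "* Highlight, page 12", each followed by a newline
HEADING = re.compile(r'\* Highlight, page [0-9]+\n')

def strip_skim_headings(text):
    """Return (cleaned_text, number_of_headings_removed)."""
    return HEADING.subn('', text)

# Invented sample in the shape of a Skim notes export.
sample = ('* Highlight, page 3\n'
          'Criminales contra Juan por robo.\n'
          '* Highlight, page 214\n'
          'Autos por concubinato.\n')

cleaned, removed = strip_skim_headings(sample)
print(removed)
print(cleaned)
```

Using `subn` instead of `sub` reports how many headings were removed, which says a bit more than a single found/not-found check.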

Another useful script: open a directory, and combine all of the files in that directory into a single .txt file:

# open, read, return from series of files in a directory

import os, easygui

# choose directory to read files from
directory = easygui.diropenbox(title='Choose a directory', default='/Users/*/')

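# sort the names so the files are combined in a predictable order
fileList = sorted(os.listdir(directory))
singleFile = ''

# walk through the files, adding text from each into single file.
for name in fileList:
    # os.listdir returns bare names, so join each one to the chosen directory
    f = open(os.path.join(directory, name), 'r')
    newFile = f.read()
    f.close()
    singleFile += newFile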

# uncomment the next line to check the length of the new file, as a test to make
# sure it was combined.
#print(len(singleFile))

# choose place to save the new file, and name it.
f = open(easygui.filesavebox(msg='save as', default='/Users/*/', filetypes=['*.txt']), 'w')
f.write(singleFile)
f.close()

# confirmation the process finished.
print('File written.')

Both of those scripts require the easygui library.
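For batch runs where a dialog box isn't practical, the combining step can also be sketched with only the standard library. The function name `combine_text_files`, the `.txt` filter, the UTF-8 encoding, and the throwaway file names below are assumptions for illustration; this is Python 3.

```python
import os
import tempfile

def combine_text_files(directory):
    """Concatenate every .txt file in `directory`, in sorted filename order."""
    parts = []
    for name in sorted(os.listdir(directory)):
        if name.endswith('.txt'):            # skips stray files such as .DS_Store
            path = os.path.join(directory, name)
            with open(path, 'r', encoding='utf-8') as f:
                parts.append(f.read())
    return ''.join(parts)

# Demonstration with two throwaway files written in the "wrong" order.
with tempfile.TemporaryDirectory() as tmp:
    for name, body in [('1611-1620.txt', 'segundo\n'), ('1601-1610.txt', 'primero\n')]:
        with open(os.path.join(tmp, name), 'w', encoding='utf-8') as f:
            f.write(body)
    combined = combine_text_files(tmp)

print(combined)
```

Sorting by filename only gives chronological order if the files are named so that they sort that way, as the decade-style names here do.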

Which brings me to a closing note on what I’ve found useful in learning python so far:
1. Read/do both a task- or project-oriented book together with a standard book that provides a formal overview of the language.
2. Learn early how to install libraries/packages (in python I use either pip or easy_install).
3. Learn how to call things from those libraries.
4. Take other people’s code, and screw around with it till it does what you want it to do.
5. Utilize google searches constantly while you’re trying to figure out a problem.
6. Don’t get caught up, initially, if you don’t understand why something works. Just run with it, and reproduce later. When you start changing code is when you start understanding why it works.
7. If you can, find someone to learn this stuff with.
8. Watch and read as many tutorials as you can find.
9. Constantly think in terms of how the things you’re learning can connect with or augment your traditional academic work.
10. If you’re not a math person, stick to examples/tutorials/etc that work with strings, or you might get quickly discouraged. I know I did at first.
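
On points 2 and 3: installing is usually a one-liner at the command line (for example `pip install easygui`), and calling things from a library comes down to a few import styles. A small sketch, using only the standard library so it runs anywhere (Python 3); the sample sentence and path are made up.

```python
# Style 1: import the module, then call things through its name.
import re
words = re.findall(r'\w+', 'Causa criminal por robo')

# Style 2: import just the names you need.
from collections import Counter
counts = Counter(w.lower() for w in words)

# Style 3: give a long module name a shorter alias.
import os.path as osp
filename = osp.basename('/path/to/file.txt')

print(len(words), counts['robo'], filename)
```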

I’m hoping that five years from now, I can actually call myself a coder as well as a historian. It took me a long while to get very good with 18th-century Spanish and its paleography. I expect the same will be true with programming. For now, I’m just looking forward to playing around with Orange, and further developing the case set I have to mess with this stuff.

About ctb

Associate Professor of Early Latin America, Department of History, University of Tennessee-Knoxville

Tagged with: code, nltk, programming historian, python, text mining
Posted in Digital History, Miscellaneous

One comment on “playing with python”

Cameron Blevins says:

December 20, 2010 at 2:41 pm

I like your list of useful things when learning to program, agree with all of them. I especially relate to Number 6, although I go back and forth on whether this is a good or bad thing.
