Using the Stanford Named Entity Recognizer to extract data from texts

In our lesson on regular expressions, we were able to extract some of the metadata from the document because it was more or less already formatted in such a way that we could write a pattern to find it. Sometimes, however, clear-cut patterns are not as easy to apply. For instance, what if we were interested in the place names that appear in the documents? What if we suspected that the focus of diplomatic activity shifted over time? This is where ‘named entity recognition’ can be useful. Named entity recognition covers a broad range of techniques, from machine learning and statistical models of language to laboriously trained classifiers that use dictionaries. One of the easiest to use out of the box is the Stanford Named Entity Recognizer.[1] In essence, we tell it ‘here is a block of text – classify!’ It will then work through the text, looking at its structure and matching it against its statistical models of word use to identify persons, organizations, and locations. One can also expand that classification to extract times, money, percentages, and dates.

Let us use the NER to extract persons, organizations, and locations.[2] First, download the Stanford NER from http://nlp.stanford.edu/software/CRF-NER.shtml and extract it to your machine. Open the location where you extracted the files. On a Mac, double-click on the file called ‘ner-gui.command’. On a PC, double-click on ‘ner-gui.bat’. This opens up a new window (using Java) titled ‘Stanford Named Entity Recognizer’ and also a terminal window. Don’t touch the terminal window for now. (PC users, hang on a moment – there is a bit more that you need to know before you can use this tool successfully. You will have to use the command line in order to get the output out.)

In the ‘Stanford Named Entity Recognizer’ window there is some default text. Click inside this window and delete the text. Next, prepare your text file of the diplomatic correspondence of the Republic of Texas. This file contains a lot of extraneous information that we are not currently interested in, including the publishing information and the index table of letters, so open it in a text editor first, delete that information, and save the result under a new name. Then, in the Stanford NER, click on ‘File,’ then ‘Open,’ and select the new file. The file will open within the window. In the Stanford NER window, click on ‘classifier’ then ‘load CRF from file’. Navigate to where you unzipped the Stanford NER folder. Click on the ‘classifiers’ folder. There are a number of files here; the ones that you are interested in end with .gz:

english.all.3class.distsim.crf.ser.gz

english.all.4class.distsim.crf.ser.gz

english.muc.7class.distsim.crf.ser.gz

These files correspond to these entities to extract:

3 class: Location, Person, Organization

4 class: Location, Person, Organization, Misc

7 class: Time, Location, Organization, Person, Money, Percent, Date

Select the location, person, and organization classifier (the 3 class file) and then press ‘Run NER.’ At this point, the program will appear to ‘hang’ – nothing much will seem to be happening. However, in the background, the program has started to process your text. Depending on the size of your text, this could take anywhere from a few minutes to a few hours. Be patient! Watch the terminal window – once the program has results for you, these will start to scroll by in the terminal window. In the main program window, once the entire text has been processed, the text will appear with colour-coded highlighting showing which words are locations, which are persons, and which are organizations. You have now classified a text. Note: sometimes your computer may run out of memory – in that case, you’ll see an error referring to “Out of Heap Space” in your terminal window. That’s OK – just copy a smaller part of the document, say the first 10,000 lines or so, into a new file. Then try again with that file.
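If you would rather not copy and paste by hand, the ‘smaller bit of the document’ can be cut out with a few lines of Python. This is only a minimal sketch: the function name write_first_lines and the file names in the closing comment are our own illustrative choices, and it assumes your text file can be read as UTF-8.

```python
from itertools import islice


def write_first_lines(source, destination, max_lines=10_000):
    """Copy at most max_lines lines from source into destination.

    Returns the number of lines actually written.
    """
    count = 0
    # errors="replace" keeps the copy going if a stray byte is not valid UTF-8
    with open(source, encoding="utf-8", errors="replace") as src, \
         open(destination, "w", encoding="utf-8") as dst:
        for line in islice(src, max_lines):
            dst.write(line)
            count += 1
    return count


# Example call (file names are assumptions; use your own):
# write_first_lines("texas-letters.txt", "texas-letters-part1.txt")
```

You can then open the shorter file in the Stanford NER in place of the full text.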

Notice also that there are periodic blank lines in the output that scrolled by in the terminal window. These blanks correspond to the breaks between the individual letters (the pieces of correspondence) in the original document. Once the program has finished processing, we will want to grab that output and visualize it, or perhaps count items, or do something else. On a Mac, you can copy and paste the output from the terminal window into your text editor of choice. It will look something like:

LOCATION: Texas

PERSON: Moore

ORGANIZATION: Suprema

And so forth. You can now use regular expressions to manipulate the text further.
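As one example of that further manipulation, here is a rough Python sketch that uses a regular expression to count how often each entity appears. It assumes the pasted output consists of lines shaped exactly like the ones above (a label, a colon, then the text); the sample string is illustrative only, with one line repeated so that the counts differ, and the function name count_entities is our own.

```python
import re
from collections import Counter

# One entity per line: a label, a colon, then the entity text
ENTITY_LINE = re.compile(r"^(LOCATION|PERSON|ORGANIZATION):\s*(.+?)\s*$")


def count_entities(lines):
    """Count (label, entity) pairs, skipping blank or unrecognised lines."""
    counts = Counter()
    for line in lines:
        match = ENTITY_LINE.match(line)
        if match:
            label, name = match.groups()
            counts[(label, name)] += 1
    return counts


# Illustrative sample; in practice, read the lines from your saved output
sample = """LOCATION: Texas

PERSON: Moore

ORGANIZATION: Suprema

LOCATION: Texas
"""

counts = count_entities(sample.splitlines())
for (label, name), n in counts.most_common():
    print(label, name, n, sep="\t")
```

To run this over your own results, save the terminal output to a text file and pass the open file to count_entities in place of the sample.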

Things are not so simple for PC users

On a PC, things are not so simple because the command window only shows a small fraction of the complete output – you cannot copy and paste it all! Instead of using the graphical interface, we have to type a command at the command prompt and redirect the output into a new text file.

Open a command prompt in the Stanford NER folder on your Windows machine (you can right-click on the folder in Windows Explorer and select ‘open command prompt here’).

Type the following as a single line:

java -mx500m -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -textFile texas-letters.txt -outputFormat inlineXML > "my-ner-output.txt"

The first bit, java -mx500m, says how much memory to use. If you have 1 GB of memory available, you can type java -mx1g (or 2g, or 3g, etc.). The next part of the command calls the NER programme itself. You can set which classifier to use after -loadClassifier classifiers/ by typing in the exact file name of the classifier you wish to use (you are telling ‘loadClassifier’ the exact path to the classifier). At -textFile you give it the name of your input file (on our machine, called ‘texas-letters.txt’), and then you specify the outputFormat. The > character sends the output to a new text file, here called “my-ner-output.txt”. Hit enter, and a few moments later the programme will tell you something along the lines of

CRFClassifier tagged 375281 words in 13745 documents at 10833.43 words per second
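If you expect to run the tagger more than once, with different classifiers or memory settings, it may help to assemble the command in a script rather than retyping it. The sketch below is one possible way to do that in Python, not part of the Stanford NER itself: the function names build_ner_command and run_ner are our own, the flags are exactly those in the command above, and it assumes that java is on your path and that the script is run from inside the Stanford NER folder.

```python
import subprocess


def build_ner_command(text_file,
                      classifier="classifiers/english.all.3class.distsim.crf.ser.gz",
                      memory="500m",
                      jar="stanford-ner.jar"):
    """Return the command from the lesson as a list of arguments."""
    return [
        "java", f"-mx{memory}",            # memory setting, e.g. 500m or 1g
        "-cp", jar,
        "edu.stanford.nlp.ie.crf.CRFClassifier",
        "-loadClassifier", classifier,     # path to the classifier file
        "-textFile", text_file,            # the text to be tagged
        "-outputFormat", "inlineXML",
    ]


def run_ner(text_file, output_file, **options):
    """Run the tagger and send its output to output_file (like '>')."""
    command = build_ner_command(text_file, **options)
    with open(output_file, "wb") as out:
        # check=True raises an error if java exits unsuccessfully
        subprocess.run(command, stdout=out, check=True)
```

Calling run_ner("texas-letters.txt", "my-ner-output.txt", memory="1g") would then correspond to the command above with the memory raised to 1g.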

Open the text file in a text editor, and you’ll see output like this:

In the name of the <LOCATION>Republic of Texas</LOCATION>, Free, Sovereign and Independent. To all whom these Presents shall come or may in any wise concern. I <PERSON>Sam Houston</PERSON> President thereof send Greeting
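Because the entities are now wrapped in tags, a regular expression can pull them back out for counting or visualization. What follows is a minimal sketch in Python, assuming the output contains only the three tags of the 3 class classifier and that tags are never nested; the function name extract_entities is our own, and the sample is the passage above.

```python
import re
from collections import Counter

# Matches <LABEL>text</LABEL>; \1 requires the closing tag to match the opening one
TAGGED = re.compile(r"<(LOCATION|PERSON|ORGANIZATION)>(.*?)</\1>", re.DOTALL)


def extract_entities(tagged_text):
    """Return a list of (label, entity) pairs in order of appearance."""
    return [
        (label, " ".join(name.split()))    # collapse line breaks inside a name
        for label, name in TAGGED.findall(tagged_text)
    ]


sample = (
    "In the name of the <LOCATION>Republic of Texas</LOCATION>, Free, "
    "Sovereign and Independent. To all whom these Presents shall come or "
    "may in any wise concern. I <PERSON>Sam Houston</PERSON> President "
    "thereof send Greeting"
)

entities = extract_entities(sample)
tally = Counter(entities)
for (label, name), n in tally.most_common():
    print(label, name, n, sep="\t")
```

For a whole file, read the contents of your output file into a string and pass that to extract_entities instead of the sample.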

Congratulations – you’ve successfully tagged a document using a named entity recognizer. Please see Michelle Moravec’s post on using the NER for further guidance using this package; if you’re up for the challenge, Bill Turkel shows us how to do this from the command line in Linux.