Data Mining Tools: Techniques and Visualizations

In this chapter, we set up and explore some basic text mining tools, and consider the kinds of things these can tell us. We move on to more complex tools (including how to set up some of them on your own machine rather than using the web-based versions). Regular expressions are an important concept to learn that will aid you greatly; you will need to spend some time on that section. Finally, you will learn some of the principles of visualization, in order to make your results and your argument clear and effective.

Now that we have our data – whether gathered through wget, the Outwit Hub, or some other tool – we have to start thinking about what to do with it! Luckily, there are many tools that will help us take a large quantity of information and “mine” it for the information we might be looking for. These can be as simple as a word cloud, with which we begin our chapter, or as complicated as sophisticated topic modeling (the subject of Chapter Four) or network analysis (Chapters Five and Six). Some tools are as easy to run as clicking a button on your computer, and others require some under-the-hood investigation. This chapter aims to introduce you to the main contours of the field, providing a range of options, and to give you the tools to participate more broadly in this exciting field of research. A key rule to remember is that there is no ‘right’ or ‘wrong’ way to do these forms of analysis: they are tools, and for most historians the real heavy lifting will come once you have the results. Yet we do need to realize that these tools shape our research: they can occasionally occlude context, or mislead us, and these questions are at the forefront of this chapter.

Basic Text Mining: Word Clouds, their Limitations, and Moving Beyond

Having large datasets does not mean that you need to jump into programming right away to extract meaning from them: far from it. There are three approaches that, while each has its limits, can shed light on your research question very quickly and easily. In many ways, these are “gateway drugs” into the deeper reaches of data visualization. In this section, then, we briefly explore word clouds (via Wordle), the concordance tool AntConc, and the comprehensive data analysis suite Voyant Tools.

The simplest data visualization is a word cloud. In brief, they are generated through the following process. First, a computer program takes a text and counts how frequently each word appears. In many cases, it will normalize the text to some degree, or at least give the user options: if “racing” appears 80 times and “Racing” appears 5 times, you may want them to register as a total of 85 occurrences of the same term. Of course, there may be other times when you do not, for example if there was a character named “Dog” as well as generic animals referred to as “dog.” You may also want to remove stop words, which in many cases add little to the final visualization. Second, after generating a word frequency list and incorporating these modifications, the program puts the words into order, sizes them by frequency, and begins to print them. The word that appears most frequently is placed as the largest (and usually, in the centre); the second most frequent is a bit smaller, the third most frequent a bit smaller than that, and so on through dozens of words. While, as we shall see, word clouds have strong critics, they are a very useful entryway into the world of basic text mining.
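If you are curious what is happening under the hood, the counting step is easy to sketch in a few lines of Python. The snippet below is a minimal illustration of the process described above, not the code Wordle itself uses; the file name and the tiny stop word list are placeholders of our own.

from collections import Counter
import re

# A tiny, illustrative stop word list; real tools ship with much longer ones.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "that", "it", "was"}

def word_frequencies(path, normalize_case=True):
    """Count word frequencies in a plain text file, much as a word cloud generator does."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    if normalize_case:
        text = text.lower()  # so "Racing" and "racing" count as one term
    words = re.findall(r"[a-zA-Z']+", text)
    return Counter(w for w in words if w.lower() not in STOP_WORDS)

# The most frequent words would be drawn largest, usually in the centre of the cloud.
for word, count in word_frequencies("war_and_peace.txt").most_common(20):
    print(word, count)

A word cloud generator then simply maps those counts onto font sizes.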

Try creating one yourself using the website Wordle.net. You simply click on ‘Create,’ paste in a bunch of text, and see the results. For example, you could take the plain text of Leo Tolstoy’s novel War and Peace (a byword for an overly long book, at over half a million words) and paste it in: you would see major character names (Pierre, Prince, Natasha, who recur throughout), locations (Moscow), themes (warfare, and nations such as France and Russia), and at a glance get a sense of what this novel might be about (Figure 3.1).

[insert Figure 3.1 War and Peace as a word cloud]

Yet with such a visualization the main downside becomes clear: we lose context. Who are the protagonists? Who are the villains? As adjectives are separated from the concepts they modify, we lose the ability to derive meaning. For example, a politician may speak of “taxes” frequently: but from a word cloud alone, it is difficult to learn whether the references are positive or negative. With these shortcomings in mind, however, historians can still find utility in word clouds. If we are concerned with change over time, we can trace how the words used in historical documents evolved. While this is fraught with issues - words change meaning over time, different terms are used to describe similar concepts, and we still face the problems outlined above - we can arguably still learn something from this.

Take the example of a dramatically evolving political party in Canada: the New Democratic Party (NDP), which occupies similar space on the political spectrum as Britain’s Labour Party. While we understand that most of our readers are not Canadian, the example should still be easy to follow. The party had its origins in agrarian and labour movements, being formed in 1933 as the Co-operative Commonwealth Federation (CCF). Its defining and founding document was the ‘Regina Manifesto’ of that same year, drafted at the height of the Great Depression. Let’s visualize it as a word cloud (figure 3.2):

[insert Figure 3.2 The Regina Manifesto as word cloud]

Stop! What do you think that this document was about? When you have a few ideas, read on for our interpretation.

At a glance, we argue that you can see the main elements of the political spirit of the movement. We see the urgency of the need to profoundly change Canada’s economic system in the boldness of “must,” an evocative call that change needed to come immediately. Other prominent words such as “public,” “system,” “economic,” and “power” speak to the movement’s critique of the economic system, while words such as “capitalist,” “socialized,” “ownership,” and “worker” speak to a socialist frame of analysis. You can piece together key components of their platform from this visualization alone.

For historians, though, the important element comes in change over time. Remember, we need to keep in mind that words might change. Let’s take two other major documents within this single political tradition. In 1956, the CCF, in the context of the Cold War, released its second major political declaration, in Winnipeg. Again, via a word cloud, figure 3.3:

[Insert Figure 3.3 The CCF’s Winnipeg Declaration of 1956, as word cloud]

New words appear, representing a new thrust: “opportunity,” “freedom,” “international,” “democratic,” “world,” “resources,” and even “equality.” Compared to the more focused, trenchant words found in the previous declaration, we see a different direction here. There is more focus on the international, Canada receives more attention than before, and, most importantly, words like “socialized” have disappeared. Indeed, the CCF here was beginning to change its emphasis, backing away from overt calls for socialism. But the limitations of the word cloud also rear their head: take, for example, the word “private.” Is private good? Bad? Is opportunity good or bad? Is freedom good or bad? Without context, we cannot know from the image alone. But the changing words are useful.

For a more dramatic change, let’s compare it to modern platforms. Today’s New Democratic Party grew out of the CCF in 1961 and continues largely as an opposition party (traditionally a third party, although in 2011 it was propelled to second-party, Official Opposition status). What do recent party platforms speak of? Figure 3.4 gives us a sense:

[insert Figure 3.4 NDP platform for the 2011 Canadian general election]

Taxes, families, Canada (which keeps building in significance over the period), work, employment, funding, insurance, homes, and so forth. In three small images, we have watched a political party morph from an explicitly socialist party in 1933, through a waning of that socialism during the Cold War climate of 1956, to the mainstream political party that it is today.

Word clouds can tell us something. On his blog, digital historian Adam Crymble ran a quick study to see if historians would be able to reconstruct the contents of documents from these word clouds – could they look at a word cloud of a trial, for example, and correctly ascertain what the trial was about? He noted that while substantial guesswork is involved, “an expert in the source material can, with reasonable accuracy, reconstruct some of the more basic details of what’s going on.”1 Word clouds also represent an inversion of the traditional historical process: rather than looking at documents that we think may be important to our project and pre-existing thesis, we are looking at documents more generally to see what they might be about. With Big Data, it is sometimes important to let the sources speak to you, rather than approaching them with pre-conceptions of what you might find.

Word clouds need to be used cautiously. They do not explain context – so you can see that “taxes,” for example, is mentioned repeatedly in a speech, but you would not be able to learn whether the author liked taxes, hated taxes, or was simply telling the audience about them. This is the biggest shortcoming of word clouds, but we still believe that they are a useful entryway into the world of data visualization. Complementary to wider reading and other forms of inquiry, they present a quick and easy way in. Once we are using them, we are visualizing data, and it is only a matter of how. In the pages that follow, we move from this very basic stage to other entry-level techniques, including AntConc and Voyant Tools, before moving into more sophisticated methods involving text patterns (or regular expressions), spatial techniques, and programs that can detect significant patterns and phrases within your corpus.

AntConc

AntConc is an invaluable way to carry out some forms of textual analysis on data sets. While it does not scale to the largest datasets terribly well, if you have somewhere in the ballpark of 500 or even 1,000 newspaper-length articles you should be able to crunch data and receive tangible results. AntConc can be downloaded online from Dr. Laurence Anthony's personal webpage.2 Anthony, a researcher in corpus linguistics among many other varied pursuits, has created this software to carry out detailed textual analysis. Let's take a quick tour.

Installation, on all three operating systems, is a snap: one downloads the executables directly for OS X or Windows, and on Linux the user needs to change the file permissions to allow it to run as an executable. Let's explore a quick example to see what we can do with AntConc.

Once AntConc is running, you can import files by going to the File menu and clicking on either Import File(s) or Import Dir, the latter of which allows you to import all the files within a directory. In the screenshot below, we opened a directory containing plain text files of Toronto heritage plaques. The first visualization panel is 'Concordance.' We type in the search term 'York,' the old name of Toronto (pre-1834), and visualize the results (figure 3.5):

[insert Figure 3.5 The AntConc interface]

Later in this book, we will explore various ways that you could do this yourself using the Programming Historian - but, for the rest of your career, ready-made programs like this can get you to your results very quickly! In this case, we can see the various contexts in which ‘York’ is being used: North York (a separate municipality until 1998), ties to New York state and city, various companies, other boroughs, and so forth. A simple search for the keyword 'York' alone would reveal many plaques that might not fit our specific query; the concordance view lets us see at a glance how the term is actually being used.

The other possibilities are even more exciting. The Concordance Plot traces where various keywords appear in files, which can be useful for seeing the overall density of a certain term. For example, in the visualization below of newspaper articles, we trace when frequent media references to 'community' on the old web-hosting service GeoCities declined (figure 3.6):

[insert Figure 3.6 Concordance plot tool in AntConc]

References were dense in 1998 and 1999, but declined dramatically by 2000 - and even more dramatically as that year went on. It turns out, upon some close reading, that this is borne out by the archival record: Yahoo! acquired GeoCities and discontinued the neighbourhoods and many of the internal community functions that had defined that online community.

Collocates are an especially fruitful realm of exploration. Returning to our Toronto plaque example, if we look for the collocates of ‘York’ we see several interesting results: "Fort" (referring to the military installation Fort York), "Infantry," "Mills" (the area of York Mills), "radial" (referring to the York Radial Railway), and even slang such as "Muddy" ("Muddy York" being a Toronto nickname). With several documents, one could trace how collocates change over time: perhaps early documents refer to Fort York, and subsequently we see more collocates referring to North York? Finally, AntConc also provides options for overall word and phrase frequency, as well as specific n-gram searching.
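Under the hood, a collocate search is essentially a count of which words appear within a few words of your keyword. AntConc ranks its collocates with proper statistical measures, but the raw counting can be sketched in a few lines of Python (the file name and window size below are placeholder choices of our own, not AntConc's settings):

from collections import Counter
import re

# Placeholder path: a single file containing the plaque texts.
text = open("toronto_plaques.txt", encoding="utf-8").read().lower()
words = re.findall(r"[a-z]+", text)

keyword = "york"
window = 5                       # how many words on either side count as "nearby"
collocates = Counter()
for i, w in enumerate(words):
    if w == keyword:
        neighbours = words[max(0, i - window):i] + words[i + 1:i + window + 1]
        collocates.update(neighbours)

print(collocates.most_common(20))

Running something like this on the plaque texts would surface the same neighbours - "fort", "mills", "north", and so on - that AntConc reports, though without its statistical weighting.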

A free, powerful program, AntConc deserves to be the next step beyond Wordle for many undergraduates. It takes textual analysis to the next level. Finally, let's move to the last of the three tools that we explore in this section: Voyant Tools. This set of tools takes some of the graphical sheen of Wordle and weds it to the kind of sophisticated underlying textual analysis offered by AntConc.

Voyant Tools

With your appetite whetted, you might want a more sophisticated way to explore large quantities of information. The suite of tools known as Voyant (previously known as Voyeur) provides this: sophisticated output from simple input. Growing out of the Hermeneuti.ca project3, Voyant is an integrated textual analysis platform. Getting started is quick. Simply navigate to http://voyant-tools.org/ and either paste a large body of text into the box, provide a website address, or click on the ‘upload’ button to load text or PDF files into the system.

[Insert Figure 3.7 The standard Voyant Tools interface screen]

Voyant works on a single document or on a larger corpus. For the former, just upload one file or paste the text in; for the latter, upload multiple files at that initial stage. After uploading, the workbench will appear as demonstrated in figure 3.7. The workbench puts an array of basic visualization and text analysis tools at your disposal. For customization or more options, remember that for each of the smaller panes you can click on the ‘gear’ icon to get advanced options, including how you want to treat case (do you want upper and lower case to be treated the same?) and whether you want to include or exclude common stop words.

With a large corpus, you can compare individual terms across the documents and panes: if you press ctrl and click on multiple words, you can compare those words in each of the windows. These are all useful ways to interpret documents, and the barrier to entry for this sort of textual analysis work is low. Voyant is ideal for smaller corpora of information or for classroom purposes.

Voyant, however, is best understood - like Wordle, albeit far more sophisticated - as a “gateway drug” when it comes to textual analysis. The default version is hosted on McGill University servers, which limits the ability to process very large datasets. A home server installation is also offered (we describe how to set it up in an aside later in this chapter), and working through the basics of the Programming Historian 2 can help you achieve similar things while learning some code along the way.

None of this, however, is to minimize the importance and utility of Voyant Tools, arguably the best web-based text analysis portal in existence. Even the most seasoned Big Data humanist can turn to Voyant for quick checks, or when dealing with smaller (yet still large) repositories. A few megabytes of textual data is no issue for Voyant, and the lack of programming expertise required is a good thing, even for old hands. We have several years of programming experience amongst us and often use Voyant for both specialized and generalized inquiries: if a corpus is small enough, Voyant is the right tool to use.4

Intermediate Text Analysis: Clustering Data to Find Powerful Patterns

Journalists have recently taken to big data and data visualization, as a response to the massive data dumps occasioned by such things as Wikileaks or the Snowden revelations. The Knight Foundation, for instance, has been promoting the development of new tools to help journalists and communities deal with the deluge. Past winners of the Knight News Challenge have included crowd-mapping applications such as Ushahidi (ushahidi.com) and DocumentCloud (documentcloud.org). Indeed, some historians have used these applications in their own work (see Graham, Massie, Feurherm 2013, for instance), and historians could usefully repurpose these projects to their own ends.

A recent project to emerge from the Knight News Challenge is ‘Overview’. Overview has some affinities with topic modeling, discussed in the following chapter, and so we suggest it here as a more user-friendly approach to exploring themes within your data.5 Overview can be installed for free on your own computer.6 Alternatively, if your data warrants it (i.e., there are no privacy concerns), you can go to overviewproject.org, upload your materials to their servers, and begin exploring.

Overview explores word patterns in your text using a rather different process than topic modeling. It looks at the occurrence of words in every pair of documents: ‘If a word appears twice in one document, it’s counted twice… we multiply the frequencies of corresponding words, then add up the results’.7 (The technical phrase for the underlying weighting is ‘term frequency–inverse document frequency’.) Documents are then grouped together using a clustering algorithm based on the similarity of these scores. Overview sorts your documents into folders, sub-folders, and sub-sub-folders. Let’s say that we are interested in the ways historical sites and monuments are recorded in the city of Toronto. We upload the full text of these historical plaques (614 texts) into Overview (figure 3.8).

[insert Figure 3.8 The Overview interface sorting the text of historical plaques from Toronto]
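Overview's own implementation is more elaborate, but the general recipe it describes – weight words by term frequency–inverse document frequency, compare every pair of documents, then cluster similar documents into a tree of folders – can be sketched with standard Python libraries. The folder of plaque files and the choice of ten flat groups below are assumptions of our own for illustration, not Overview's actual code:

import glob
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Read every plaque text in a folder (the path is a placeholder).
paths = sorted(glob.glob("plaques/*.txt"))
texts = []
for p in paths:
    with open(p, encoding="utf-8") as f:
        texts.append(f.read())

# Weight each word by term frequency-inverse document frequency.
matrix = TfidfVectorizer(stop_words="english").fit_transform(texts)

# Similarity between every pair of documents.
similarity = cosine_similarity(matrix)

# Build a tree of clusters (Overview's nested folders are, in spirit, such a tree),
# then cut it into ten flat groups for a quick look.
distance = 1.0 - similarity
np.fill_diagonal(distance, 0.0)
tree = linkage(squareform(distance, checks=False), method="average")
groups = fcluster(tree, t=10, criterion="maxclust")

for path, group in zip(paths, groups):
    print(group, path)

The point of the sketch is simply to show that there is nothing mysterious going on: documents that share distinctive vocabulary end up in the same branch of the tree.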

Overview divides the historical plaques, at the broadest level of similarity, into the following groups:

‘church, school, building, toronto, canada, street, first, house, canadian, college’ (545 plaques)

‘road, john_graves, humber, graves_simcoe, lake, river, trail, plant’ (41 plaques)

‘community’, with ‘italian, north_york, lansing, store, shepard, dempsey, sheppard_avenue’ (13 plaques)

‘years’, with ‘years_ago, glacier, ice, temperance, transported, found, clay, excavation’ (11 plaques).

That’s interesting information to know. There appears to be a large division between what might be termed ‘architectural history’ - first school, first church, first street - and ‘social history’. Why does this division exist? That would be an interesting question to explore.

Overview is great for visualizing patterns of similar word use across sets of documents. Once you’ve examined those patterns, assigning tags for different patterns, Overview can export your texts with those descriptive labels as a CSV file, that is, as a table. Thus, you could use a spreadsheet program to create bar charts of the count of documents with an ‘architecture’ tag or a ‘union history’ tag, or ‘children’, ‘women’, ‘agriculture’, and so on. We might wonder how plaques concerned with ‘children’, ‘women’, ‘agriculture’, ‘industry’, and the like are grouped, so we could use Overview’s search function to identify these plaques by searching for a word or phrase, and applying that word or phrase as a tag to everything that is found. One could then visually explore the way various tags correspond with particular folders of similar documents.8

Using tags in this fashion, and comparing them with the visual structure presented by Overview, is a dialogue between close and distant reading. Overview thus does what it set out to do: it provides a quick and relatively painless way to get a broad sense of what is going on within your documents.

Installing Voyant-Tools on your own Machine: A Quick Aside

As of July 2014, it is possible to install Voyant Tools on your own machine. You might wish to do this to keep control of your documents. It could be a condition of the ethics review at your institution, for instance, that all oral history interview files are stored on a local machine without access to the Internet. You might like to run text analysis on the transcriptions, but you cannot upload them to the regular Voyant-Tools server. If you had Voyant-Tools on your own machine, this would not be a problem.

The instructions could change with newer versions of Voyant, but for the moment if you go to http://docs.voyant-tools.org/resources/run-your-own/voyant-server/, you will find all of the information and files that you need. In essence, Voyant-Tools installs itself on your machine as a ‘server’ – it will serve up files and the results of its analysis to you via your web browser, even though you are not in fact pulling anything over the Internet.

It is a very easy installation process. Download the server software, unzip it, and then execute the VoyantServer.jar file. A control console will open, and when you click ‘start server,’ your browser will also appear. In the control console, you can change how much memory Voyant-Tools can access to perform its operations. By default, it will use one gigabyte of memory. This should be enough for most of your text analysis, but if you get an error saying you need more memory – if you are putting in lots of data, for example – you can increase this. When you are done with Voyant, you can click ‘stop server’ and it will close.

In the browser, you will see this address in the address bar: http://127.0.0.1:8888. This means that the page is being served to you locally, through port 8888. You do not need to worry about that, unless you are running other servers on your machine at the same time.

Advanced Text Analysis: Unlocking the Power of Regular Expressions

A regular expression (also called regex) is a powerful tool for finding and manipulating text.9 At its simplest, a regular expression is just a way of looking through texts to locate patterns. A regular expression can help you find every line that begins with a number, or every instance of an email address, or whenever a word is used even if there are slight variations in how it's spelled. As long as you can describe the pattern you're looking for, regular expressions can help you find it. Once you've found your patterns, they can then help you manipulate your text so that it fits just what you need. Beware: this is a relatively difficult section. But the rewards are worth it.

This section will explain how to take a book scanned and made available on the Internet Archive, "Diplomatic correspondence of the Republic of Texas," and manipulate the raw text (for instance, into just the right format so that you can clean the data and load it into the Gephi network visualization package as a correspondence network; the data cleaning and network stages will appear in later chapters using the same data). In this section, you'll start with a simple unstructured index of letters, use regular expressions, and turn the text into a spreadsheet that can be edited in Excel.

Regular expressions can look pretty complex, but once you know the basic syntax and vocabulary, simple ‘regexes’ will be easy. Regular expressions can often be used right inside the 'Find and Replace' box in many text and document editors, such as Notepad++ on Windows or TextWrangler on OS X. Do not use Microsoft Word, however! Notepad++ is available at http://notepad-plus-plus.org/ and TextWrangler at http://www.barebones.com/products/textwrangler/. Both are free and well worth downloading.

You type the regular expression in the search bar, press 'find', and any words that match the pattern you're looking for will appear on the screen. As you proceed through this section, you may want to look for other things. In addition to the basics provided here, you will also be able to simply search regular expression libraries online: for example, if you want to find all postal codes, you can search “regular expression Canadian postal code” and learn what ‘formula’ to search for to find them.

Let's say you're looking for all the instances of "cat" or "dog" in your document. When you type the vertical bar on your keyboard (it looks like |, shift+backslash on Windows keyboards), that means 'or' in regular expressions. So, if your query is dog|cat and you press 'find', it will show you the first time either dog or cat appears in your text.

If you want to replace every instance of either "cat" or "dog" in your document with the word "animal", you would open your find-and-replace box, put dog|cat in the search query, put animal in the 'replace' box, hit 'replace all', and watch your entire document fill up with references to animals instead of dogs and cats.

The astute reader will have noticed a problem with the instructions above; simply replacing every instance of "dog" or "cat" with "animal" is bound to create problems. Simple searches don't differentiate between letters and spaces, so every time "cat" or "dog" appears within a word, it will also be replaced with "animal". "catch" will become "animalch"; "dogma" will become "animalma"; "certificate" will become "certifianimale". In this case, the solution appears simple; put a space before and after each word in your search query, so now it reads:

 dog | cat 

(note that there is also a space before "dog" and after "cat"). With the spaces, "animal" replaces "dog" or "cat" only in those instances where they're definitely complete words; that is, when they're surrounded by spaces.

The even more astute reader will notice that this still does not solve our problem of replacing every instance of "dog" or "cat". What if the word comes at the beginning of a line, so it is not preceded by a space? What if the word is at the end of a sentence or a clause, and thus followed by punctuation? Luckily, in the language of regex, you can represent the beginning or end of a word using special characters.

\<

means the beginning of a word. In some programs, like TextWrangler, this is used instead:

\b

so if you search for \<cat (or, in TextWrangler, \bcat ), it will find "cat", "catch", and "catsup", but not "copycat", because your query searched for words beginning with "cat". For patterns at the end of a word, you would use:

\>

or in TextWrangler,

\b

again. The remainder of this walk-through imagines that you are using Notepad++, but if you’re using TextWrangler, keep this quirk in mind. If you search for

cat\>

it will find "cat" and "copycat", but not "catch," because your query searched for words ending with -"cat".

Regular expressions can be mixed, so if you wanted to find words only matching "cat", no matter where in the sentence, you'd search for

\<cat\>

, which would find every instance. And, because all regular expressions can be mixed, if you searched for (in Notepad++; what would you change, if you were using TextWrangler?)

\<cat|dog\>

and replaced all with "animal", you would have a document that replaced all instances of "dog" or "cat" with "animal", no matter where in the sentence they appear.
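If you eventually script these searches rather than typing them into a find-and-replace box, the same ideas carry over directly. Here is a small Python illustration (Python's regex dialect, like TextWrangler's, uses \b at both ends of a word):

import re

text = "The dog chased the cat, but dogma and copycats were untouched."
# \b marks a word boundary, so only the whole words "dog" and "cat" are replaced.
print(re.sub(r"\b(dog|cat)\b", "animal", text))
# prints: The animal chased the animal, but dogma and copycats were untouched.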

You can also search for variations within a single word using parentheses. For example if you were looking for instances of "gray" or "grey", instead of the search query

gray|grey

you could type

gr(a|e)y

instead. The parentheses signify a group, and like the order of operations in arithmetic, regular expressions read the parentheses before anything else. Similarly, if you wanted to find instances of either "that dog" or "that cat", you would search for:

(that dog)|(that cat)

Notice that the vertical bar | can appear either inside or outside the parentheses, depending on what you want to search for.

The period character . in regular expressions directs the search to just find any character at all. For example, if we searched for:

d.g

the search would return "dig", "dog", "dug", and so forth.

Another special character from our cheat sheet, the plus + instructs the program to find any number of the previous character. If we search for

do+g

it would return any words that looked like "dog", "doog", "dooog", and so forth. Adding parentheses before the plus would make a search for repetitions of whatever is in the parentheses, for example querying

(do)+g

would return "dog", "dodog", "dododog", and so forth.

Combining the plus '+' and period '.' characters can be particularly powerful in regular expressions, instructing the program to find any amount of any characters within your search. A search for

d.+g

for example, might return "dried fruits are g", because the string begins with "d" and ends with "g", and has various characters in the middle. Searching for simply ".+" will yield query results that are entire lines of text, because you are searching for any character, and any amount of them.

Parentheses in regular expressions are also very useful when replacing text. The text within each set of parentheses forms what's called a group, and the software you use to search remembers those groups in the order of their appearance. For example, if you search for

(dogs)( and )(cats)

which would find all instances of "dogs and cats" in your document, your program would remember "dogs" is group 1, " and " is group 2, and "cats" is group 3. Notepad++ remembers them as "\1", "\2", and "\3" for each group respectively.

If you wanted to switch the order of "dogs" and "cats" every time the phrase "dogs and cats" appeared in your document, you would type

(dogs)( and )(cats)

in the 'find' box, and

\3\2\1

in the 'replace' box. That would replace the entire string with group 3 ("cats") in the first spot, group 2 (" and ") in the second spot, and group 1 ("dogs") in the last spot, thus changing the result to "cats and dogs".
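The same group-and-reorder trick works outside the text editor as well. Here is a quick Python illustration of the swap just described (Python happens to understand the same \1, \2, \3 back-references as Notepad++):

import re

text = "It was raining dogs and cats."
# Three groups: (dogs), ( and ), (cats); the replacement reorders them as \3\2\1.
print(re.sub(r"(dogs)( and )(cats)", r"\3\2\1", text))
# prints: It was raining cats and dogs.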

The vocabulary of regular expressions is pretty large, but there are many cheat sheets for regex online (one that we sometimes use is http://regexlib.com/CheatSheet.aspx; another good one is at http://docs.activestate.com/komodo/4.4/regex-intro.html).

To help, we've included a workflow for searching using regular expressions that draws on such a cheat sheet, to provide a sense of how you would form your own regular expressions. The example uses an index of nineteenth-century diplomatic correspondence from the Republic of Texas: we will turn an unformatted index of letters, drawn from a book, into a structured file that can be read in Excel or in any of a number of network analysis tools. A portion of the original file, drawn from the Internet Archive, looks like this:

Sam Houston to A. B. Roman, September 12, 1842 101

Sam Houston to A. B. Roman, October 29, 1842 101

Correspondence for 1843-1846 —

Isaac Van Zandt to Anson Jones, January 11, 1843 103

By the end of this workflow, it will look like this:

Sam Houston, A. B. Roman, September 12 1842

Sam Houston, A. B. Roman, October 29 1842

Isaac Van Zandt, Anson Jones, January 11 1843

While the changes appear insignificant, making sure data is properly formatted is essential for allowing computers to follow sets of instructions.

Begin by pointing your browser to the document at http://archive.org/stream/diplomaticcorre33statgoog/diplomaticcorre33statgoog_djvu.txt

Copy the text into Notepad++ or TextWrangler.10 Remember to save a spare copy of your file before you begin - this is very important, because you're going to make mistakes that you won't be sure how to fix. Now delete everything but the index listing the letters. Look for this in the text, as depicted in figure 3.9, and delete everything that comes before it:

[insert Figure 3.9 Screenshot of the metadata to delete in the archive.org file on the diplomatic correspondence of the Republic of Texas]

That is, you’re looking for the table of letters, starting with ‘Sam Houston to J. Pinckney Henderson, December 31, 1836 51’ and ending with ‘Wm. Henry Daingerfield to Ebenezer Allen, February 2, 1846 1582’. There are, before we clean them, approximately 2000 lines’ worth of letters indexed! The file you create in this step is also available in the online appendix as raw-correspondence.txt.

Notice that there is a lot of text that we are not interested in at the moment: page numbers, headers, footers, and categories. We're going to use regular expressions to get rid of them. What we want to end up with is a spreadsheet that looks like:

Sender, Recipient, Date

We are not really concerned about dates for this example, but they might be useful at some point so we'll still include them. We're eventually going to use another program, called OpenRefine, to fix things up further.

Scroll down through the text; notice there are many lines which don't include a letter, because they're either header info, or blank, or some other extraneous text. We're going to get rid of all of those lines. We want to keep every line that looks like this:

Sender to Recipient, Month, Date, Year, Page

This is a complex process, so first we'll outline exactly what we are going to do, and then walk you through how to do it. We start by finding every line that looks like a reference to a letter, and put a tilde (a ~ symbol) at the beginning of it so we know to save it for later. Next, we get rid of all the lines that don't start with tildes, so that we're left with only the relevant text. After this is done, we format the remaining text by putting commas in appropriate places, so we can import it into a spreadsheet and do further edits there.

There are lots of ways we can do this, but for the sake of clarity we're going to just delete every line that doesn't have the word "to" in it (as in sender TO recipient). We will walk you through the seven-step plan of how to manipulate these documents. At the end of each section, the regular expressions and commands are summarized.

Step One: Identifying Lines that have Correspondence Senders and Receivers in them

In Notepad++, press ctrl-f or search->find to open the find dialogue box.11 In that box, go to the 'Replace' tab, and check the radio box for 'Regular expression' at the bottom of the search box. In TextWrangler, hit command+f to open the find and replace dialogue box. Tick off the ‘grep’ radio button (which tells TextWrangler that we want to do a regex search) and the ‘wraparound’ button (which tells TextWrangler to search everywhere).

Remember from earlier that there's a way to see if the word "to" appears in full. Type

\<to\>

in the search bar. Recall that in TextWrangler we would look for \bto\b instead. This will find every instance of the word "to" (and not, for instance, ‘potato’ or ‘tomorrow’).12

We don't just want to find "to", but the entire line that contains it. We assume that every line that contains the word “to” in full is a line that has relevant letter information, and every line that does not is one we do not need. You learned earlier that the query ".+" returns any amount of text, no matter what it says. Assuming you are using Notepad++, if your query is

.+\<to\>.+

your search will return every line which includes the word "to" in full, no matter what comes before or after it, and none of the lines which don't. In TextWrangler, your query would be .+\bto\b.+.

As mentioned earlier, we want to add a tilde ~ before each of the lines that look like letters, so we can save them for later. This involves the find-and-replace function, and a query similar to the one before, but wrapped in parentheses (we can drop the final .+, since any text after "to" that is not matched will simply be left in place), so it looks like

(.+\<to\>)

and the entire line is placed within a parenthetical group. In the 'replace' box, enter

~\1

which just means: replace the matched text with itself (group 1), placing a tilde before it. In short, that's:

STEP ONE COMMANDS:

Find: (.+\<to\>)
(in TextWrangler, find this instead): (.+\bto\b)

Replace: ~\1

Click 'Replace All'.

Step Two: Removing Lines that Aren’t Relevant

After running the find-and-replace, you should note that your document now has most of its lines marked with tildes in front of them, and a few which are not. The next step is to remove all the lines that do not include a tilde. The search string to find all lines which don't begin with tildes is

\n[^~].+

A \n at the beginning of a query searches for a new line, which means it's going to start searching at the first character of each new line. However, given the evolution of computing, it may well be that this won’t quite work on your system. Linux-based systems use \n for a new line, while Windows often uses \r\n, and older Macs just use \r. These are the sorts of things that digital historians need to keep in mind! Since this will likely cause much frustration, your safest bet will be to save a copy of what you are working on, and then experiment to see what gives you the best result. In most cases, this will be:

\r\n[^~].+

Within a set of square brackets, the caret ^ means search for anything that isn't within these brackets; in this case, the tilde ~. The .+, as before, means search for all the rest of the characters in the line as well. All together, the query returns any full line which does not begin with a tilde; that is, the lines we did not mark as looking like letters.

STEP TWO COMMANDS:

Find: \r\n[^~].+

Replace:

Click 'Replace All'.

By finding all \r\n[^~].+ and replacing it with nothing, you effectively delete all the lines that don't look like letters. What you're left with is a series of letters, and a series of blank lines.

Step Three: Removing the Blank Lines

We need to remove those surplus blank lines. The find-and-replace query for that is:

STEP THREE COMMANDS:

Find: \n\r

(In TextWrangler): ^\r

Replace:
Click 'Replace All'.

Step Four: Beginning the Transformation into a Spreadsheet

Now that all the extraneous lines have been deleted, it's time to format the text document into something you can import into and manipulate with Excel as a *.csv, or comma-separated value, file. A *.csv is a text file which spreadsheet programs like Microsoft Excel can read, where every comma denotes a new column and every line denotes a new row.

To turn this text file into a spreadsheet, we'll want to separate it out into one column for sender, one for recipient, and one for date, each separated by a single comma. Notice that most lines have extraneous page numbers attached to them; we can get rid of those with regular expressions. There's also usually a comma separating the month-date and the year, which we'll get rid of as well. In the end, the first line should go from looking like:

~Sam Houston to J. Pinckney Henderson, December 31, 1836 51

to

Sam Houston, J. Pinckney Henderson, December 31 1836

such that each data point is in its own column.

Start by removing the page number after the year and the comma between the year and the month-date. To do this, first locate the year on each line by using the regex:

[0-9]{4}

As a cheat sheet shows, [0-9] finds any digit between 0 and 9, and {4} will find four of them together. Now extend that search out by appending .+ to the end of the query; as seen before, it will capture the entire rest of the line. The query

[0-9]{4}.+

will return, for example, "1836 51", "1839 52", and "1839 53" from the first three lines of the text. We also want to capture the comma preceding the year, so add a comma and a space before the query, resulting in

, [0-9]{4}.+

which will return ", 1836 51", ", 1839 52", etc.

The next step is making the parenthetical groups which will be used to remove parts of the text with find-and-replace. In this case, we want to remove the comma and everything after the year, but not the year or the space before it. Thus our query will look like:

(,)( [0-9]{4})(.+)

with the comma as the first group "\1", the space and the year as the second "\2", and the rest of the line as the third "\3". Given that all we care about retaining is the second group (we want to keep the year, but not the comma or the page number), the find-and-replace will look like this:

STEP FOUR COMMANDS:

Find: (,)( [0-9]{4})(.+)

Replace: \2

Click 'Replace All'.

Step Five: Removing Tildes

The next step is easy; remove the tildes we added at the beginning of each line, and replace them with nothing to delete them.

STEP FIVE COMMANDS:

Find: ~

Replace:

Click 'Replace All'.

Step Six: Separating Senders and Receivers

Finally, to separate the sender and recipient by a comma, we find all instances of the word "to" and replace them with a comma. Although we used \< and \> (in TextWrangler, \b) to denote the beginning and end of a word earlier in the lesson, we don't do exactly that here. We include the space preceding “to” in the regular expression, as well as the \> (\b in TextWrangler) to denote the word ending. Once we find instances of the word and the space preceding it, " to\>", we replace them with a comma ",".

STEP SIX COMMANDS:

Find:  to\>    (remember, there is a space in front of "to\>")

(In TextWrangler):  to\b    (again with a leading space)

Replace: ,

Click 'Replace All'.

Step Seven: Cleaning up messy data

You may notice that some lines still do not fit our criteria. Line 22, for example, reads "Abner S. Lipscomb, James Hamilton and A. T. Bumley, AugUHt 15, ". It has an incomplete date; these we don't need to worry about for our purposes. More worrisome are lines like line 61, "Copy and summary of instructions United States Department of State, ", which include none of the information we want. We can get rid of these lines later in Excel.

The only non-standard lines we need to worry about with regular expressions are the ones with more than 2 commas, like line 178, "A. J. Donelson, Secretary of State [Allen,. arf interim], December 10 1844". Notice that our second column, the name of the recipient, has a comma inside of it. If you were to import this directly into Excel, you would get four columns, one for sender, two for recipient, and one for date, which would break any analysis you would then like to run. Unfortunately these lines need to be fixed by hand, but happily regular expressions make finding them easy. The query:

.+,.+,.+,

will show you every line with more than 2 commas, because it finds any line that has any set of characters, then a comma, then any other set, then another comma, and so forth.

STEP SEVEN COMMANDS:

Find: .+,.+,.+,

Click 'Find Next'.

Celebrate! You’re almost done!

After using this query, just find each occurrence (there will be 13 of them), and replace the appropriate comma with another character that signifies it was there, like a semicolon. While you're searching, you may find some other lines, like line 387, "Barnard E. Bee, James Treat, April 28, 1»40 665", which are still not quite perfect. If you see them, go ahead and fix them by hand so they fit the proper format, deleting those that are not relevant. There will also be leftover lines that are clearly not letters; delete those as well. Finally, there may be snippets of text left over at the bottom of the file. Highlight these and delete them.

At the top of the file, add a new line that simply reads "Sender, Recipient, Date". These will be the column headers.

Go to file->save as, and save the file as cleaned-correspondence.csv.

Congratulations! You have used regular expressions to extract and clean data. This skill alone will save you valuable time. A copy of the cleaned correspondence file is available in the online appendix, named cleaned-correspondence.csv. Note that the file online has been fixed by hand, so it is well formatted. The file you worked on in this walkthrough still needs some additional cleaning, such as the removal of lines like line 61, "Copy and summary of instructions United States Department of State, ".
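If you ever need to repeat this sort of cleaning across many files, the steps above can also be scripted. The sketch below is a rough Python equivalent of the walkthrough, written under a few assumptions of our own (the input and output file names, Python's \b word boundary standing in for Notepad++'s \< and \>, and replacing only the first " to" on each line); it does not replace the by-hand fixes of step seven.

import re

with open("raw-correspondence.txt", encoding="utf-8") as f:
    lines = f.read().splitlines()   # splitlines() copes with \n, \r\n and \r line endings alike

cleaned = []
for line in lines:
    # Steps one to three: keep only lines containing the whole word "to" (sender TO recipient).
    if not re.search(r"\bto\b", line):
        continue
    # Step four: drop the comma before the year and the trailing page number.
    line = re.sub(r",( [0-9]{4}).+", r"\1", line)
    # Step six: replace the first " to" (as a whole word) with a comma between sender and recipient.
    line = re.sub(r" to\b", ",", line, count=1)
    cleaned.append(line.strip())

with open("cleaned-correspondence.csv", "w", encoding="utf-8") as f:
    f.write("Sender, Recipient, Date\n")
    f.write("\n".join(cleaned) + "\n")

# Step seven still applies: lines with more than two commas need attention by hand.
for line in cleaned:
    if line.count(",") > 2:
        print("check by hand:", line)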

Open Refine

Open Refine is a powerful tool that originated within Google. Since 2012, it has become a free, open-source tool under continual community development. It allows users to get a quick overview of their data, find the messy bits, and start to transform the data into a useable format for further analyses. At the Open Refine webpage, there are numerous tutorials, background information, and manuals that we strongly recommend any historian explore. Here, we continue our example from above and use Open Refine to clean up the data that we extracted using the regex (as with the Voyant-Tools installation described in the aside above, Open Refine runs as a server on your own computer).

Download Open Refine from http://openrefine.org/download.html. Follow the installation instructions. Start Open Refine by double clicking on its icon. This will open a new browser window, pointing to http://127.0.0.1:3333. This location is your own computer, so even though it looks like it’s running on the internet, it isn’t. The ‘3333’ is a ‘port’, meaning that Open Refine is running much like a server, serving up a webpage via that port to the browser.

Start a new project by clicking on the ‘create project’ tab on the left hand side of the screen. Click on ‘choose files’ and select the CSV file created in the previous section. This will give you a preview of your data. Name your project in the box on the top right side and then click ‘create project’. It may take a few minutes.

[insert Figure 3.10 The regex search results imported into Open Refine]

Once your project has started, one of the columns that should be visible in your data is ‘Sender’. Click on the arrow to the left of "Sender" in OpenRefine and select Facet->Text Facet. Do the same with the arrow next to "Recipient". A box will appear on the left side of the browser showing all 189 names listed as senders in the spreadsheet (Figure 3.10). The spreadsheet itself is nearly a thousand rows long, so immediately we see that, in this correspondence collection, some names appear multiple times. You may also have noticed that many of the names suffered from errors in the text scan (OCR, or Optical Character Recognition, errors), rendering some identical names from the book as similar, but not the same, in the spreadsheet. For example, the recipient "Juan de Dios Cafiedo" is occasionally listed as "Juan de Dios CaAedo". Any subsequent analysis will need these errors to be cleared up, and OpenRefine will help fix them.

Within the "Sender" facet box on the left-hand side, click on the button labeled "Cluster". This feature presents various automatic ways of merging values that appear to be the same.13 Play with the values in the drop-down boxes and notice how the number of clusters found change depending on which method is used. Because the methods are slightly different, each will present different matches that may or may not be useful. If you see two values which should be merged, e.g. "Ashbel Smith" and ". Ashbel Smith", check the box to the right in the 'Merge' column and click the 'Merge Selected & Re-Cluster' button below.

Go through the various cluster methods one at a time, including changing the number values, and merge the values which appear to be the same. "Juan de Dios CaAedo" clearly should be merged with "Juan de Dios Cafiedo"; however, "Correspondent in Tampico" probably should not be merged with "Correspondent at Vera Cruz." If you are not an expert in this period, use your best judgement. By the end, you should have reduced the number of unique senders from 189 to around 150. Repeat these steps with recipients, reducing unique recipients from 192 to about 160. To finish the automatic cleaning of the data, click the arrow next to "Sender" and select 'Edit Cells->Common transforms->Trim leading and trailing whitespace'. Repeat for "Recipient". The resulting spreadsheet will not be perfect, but it will be much easier to clean by hand than it would have been before taking this step (figure 3.11). Click on ‘export’ at the top right of the window to get your data back out as a .csv file.

[insert Figure 3.11 The output of using Open Refine to clean our data]
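OpenRefine's clustering is not magic: it is, in essence, looking for values that are nearly, but not exactly, identical. If you want to sanity-check the results, or you prefer to work in code, the rough sketch below does something similar with Python's standard difflib module and the pandas library (the file name and the 'Sender' column header are assumptions based on the file we built above, and the method is far cruder than OpenRefine's own key-collision and nearest-neighbour algorithms):

import difflib
import pandas as pd

# Placeholder file name: the CSV you exported from OpenRefine.
df = pd.read_csv("texas-correspondence.csv", skipinitialspace=True)
senders = list(df["Sender"].dropna().unique())
print(len(senders), "unique senders")

# Flag names that look suspiciously similar, much as OpenRefine's cluster feature does.
for name in senders:
    close = [m for m in difflib.get_close_matches(name, senders, n=3, cutoff=0.85) if m != name]
    if close:
        print(name, "->", close)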

We discuss the network analysis and visualization tool Gephi in more depth in Chapters 5 and 6, but right now you will see how easy it can be to visualize a network using Gephi. Before we talk about how to import data for Gephi, there is one more step to do in Open Refine. In order to get this data into Gephi, we will have to rename the "Sender" column to "source" and the "Recipient" column to "target". Using the arrow to the left of Sender in the main OpenRefine window, select Edit column->Rename this column, and rename the column "source". Do the same for the Recipient column, renaming it "target". Now, in the top right of the window, select 'Export->Custom tabular exporter'. Notice that "source", "target", and "Date" are checked in the content tab; uncheck "Date", as it will not be used in Gephi. Go to the download tab, change the download option from 'Tab-separated values (TSV)' to 'Comma-separated values (CSV)', and press download.14 The file will likely download to your automatic download directory.

Download and install Gephi from http://gephi.org. Now open Gephi by double clicking its icon. Click ‘new project’. The middle pane of the interface window is the ‘Data Laboratory’, where you can interact with your network data in spreadsheet format. This is where we can import the data cleaned up in Open Refine. In the Data Laboratory, select 'Import Spreadsheet'. Press the ellipsis '...' and locate the csv you just saved in the download directory. Make sure that the Separator is listed as 'Comma' and the 'As table' is listed as 'Edges table'. Press 'Next', then 'Finish'.

Your data should load up. Click on the ‘overview’ tab and you will be presented with a tangled network graph. You can save your work here and return to it after reading chapter 5! Don’t worry, we will explain what to do with it then.

Congratulations. You’ve downloaded a body of historical materials, used regular expressions to parse this material, trimming it to remove extraneous information that wasn’t of use to you, used Open Refine to clean it further, and loaded it into a software package for further analysis and visualization. With these skills, you will be able to handle very big historical data indeed!

Using the Stanford Named Entity Recognizer to extract data from texts

In our regular expressions example, we were able to extract some of the metadata from the document because it was more or less already formatted in such a way that we could write a pattern to find it. Sometimes, however, clear-cut patterns are not quite as easy to apply. For instance, what if we were interested in the place names that appear in the documents? What if we suspected that the focus of diplomatic activity shifted over time? This is where ‘named entity recognition’ can be useful. Named entity recognition covers a broad range of techniques, from those based on machine learning and statistical models of language to laboriously trained classifiers built with dictionaries. One of the easiest to use out of the box is the Stanford Named Entity Recognizer.15 In essence, we tell it ‘here is a block of text – classify!’ It will then work through the text, looking at its structure and matching it against statistical models of word use to identify persons, organizations, and locations. One can also expand that classification to extract times, money amounts, percentages, and dates.

Let us use the NER to extract persons, organizations, and locations.16 First, download the Stanford NER from http://nlp.stanford.edu/software/CRF-NER.shtml and extract it to your machine. Open the location where you extracted the files. On a Mac, double-click on the file called ‘ner-gui.command’. On a PC, double-click on ner-gui.bat. This opens up a new window (using Java) labelled ‘Stanford Named Entity Recognizer’, and also a terminal window. Don’t touch the terminal window for now. (PC users, hang on a moment – there is a bit more that you need to know before you can use this tool successfully. You will have to use the command line in order to get the output out.)

In the ‘Stanford Named Entity Recognizer’ window there is some default text. Click inside this window and delete the text. Then, click on ‘File,’ then ‘Open,’ and select your text for the diplomatic correspondence of the Republic of Texas. Since this text file contains a lot of extraneous information in it – information which we are not currently interested in, including the publishing information and the index table of letters – you should open the file in a text editor first and delete that information. Save with a new name and then open it using ‘File > open’ in the Stanford NER. The file will open within the window. In the Stanford NER window, click on ‘classifier’ then ‘load CRF from file’. Navigate to where you unzipped the Stanford NER folder. Click on the ‘classifier’ folder. There are a number of files here; the ones that you are interested in end with .gz:

english.all.3class.distsim.crf.ser.gz

english.all.4class.distsim.crf.ser.gz

english.muc.7class.distsim.crf.ser.gz

These files correspond to the entities they extract:

3 class: Location, Person, Organization

4 class: Location, Person, Organization, Misc

7 class: Time, Location, Organization, Person, Money, Percent, Date

Select the location, person, and organization classifier and then press ‘Run NER.’ At this point, the program will appear to ‘hang’ – nothing much will seem to be happening. However, in the background, the program has started to process your text. Depending on the size of your text, this could take anywhere from a few minutes to a few hours. Be patient! Watch the terminal window – once the program has results for you, these will start to scroll by in the terminal window. In the main program window, once the entire text has processed, the text will appear with colour-coded highlighting showing which words are location words, which ones are persons, which ones are organizations. You have now classified a text. Note: sometimes your computer may run out of memory – in that case, you’ll see an error referring to “Out of Heap Space” in your terminal window. That’s OK – just copy and paste a smaller bit of the document, say the first 10,000 lines or so. Then try again.

[insert figure 3.x terminal output from Stanford Named Entity Recognizer]

Notice also that there are blanks periodically in the output that scrolled by in the terminal window (as in figure 3.x). These blanks correspond to the blanks between the letters in the original document. Once the program has finished processing, we will want to grab that output and visualize it, or perhaps count items, or do something else. On a Mac, you can copy and paste the output from the terminal window into your text editor of choice. It will look something like:

LOCATION: Texas

PERSON: Moore

ORGANIZATION: Suprema

And so forth. You can now use regular expressions to manipulate the text further.

Things are not so simple for PC users

On a PC, things are not so simple because the command window only shows a small fraction of the complete output – you cannot copy and paste it all! What we have to do instead is type in a command at the command prompt, rather than using the graphical interface displayed in figure 3.x, and then redirect the output into a new text file.

  1. Open a command prompt in the Stanford NER folder on your Windows machine (you can right-click on the folder in your windows explorer, and select ‘open command prompt here’).

  2. Type the following as a single line:

    java -mx500m -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -textFile texas-letters.txt -outputFormat inlineXML > "my-ner-output.txt"

    The first bit, java -mx500m, says how much memory to use. If you have 1gb of memory available, you can type java -mx1g (or 2g, or 3g, etc.). The next part of the command calls the NER program itself. You can set which classifier to use after -loadClassifier classifiers/ by typing in the exact file name for the classifier you wish to use (you are telling ‘loadClassifier’ the exact path to the classifier). At -textFile you give it the name of your input file (on our machine, called ‘texas-letters.txt’), and then specify the outputFormat. The > character sends the output to a new text file, here called "my-ner-output.txt". Hit enter, and a few moments later the program will tell you something along the lines of

    CRFClassifier tagged 375281 words in 13745 documents at 10833.43 words per second

    Open the text file in Notepad++, and you’ll see output like this:

    In the name of the <LOCATION>Republic of Texas</LOCATION>, Free, Sovereign and Independent. To all whom these Presents shall come or may in any wise concern. I <PERSON>Sam Houston</PERSON> President thereof send Greeting

Congratulations – you’ve successfully tagged a document using a named entity recognizer!

Visualizing NER output

Let us imagine that we wanted to visualize the locations mentioned in the letters as a kind of network. You may wish to examine chapters five and six on networks before returning to this example.

REGEX on Mac to Useful Output (PC users: skip ahead to the next section):

We need to organize those locations so that locations mentioned in a letter are connected to each other. To begin, we trim away any line that does not start with LOCATION. We do this by using our regular expression skills from earlier, and adding in a few more commands:

FIND:
^(?!.*LOCATION).*$

and replace with nothing. There is a new piece of syntax here: the ?! creates a ‘negative lookahead’, so the pattern only matches a line if the phrase LOCATION does not appear anywhere in it, and the .* and $ extend the match to the end of that line. Replacing those matches with nothing therefore blanks out every line that does not contain a location tag. Now, let us also mark those blank lines we noticed as the start of a new letter.

FIND:
^\s*$

where:

^ marks the beginning of a line
$ marks the end of a line
\s indicates ‘whitespace’
* indicates a zero or more repetition of the thing in front of it (in this case, whitespace)

and we replace each match with the word “blankspace”. Because there might be a few blank lines in a row, you may see ‘blankspace’ repeated in places. Where you’ve got ‘blankspaceblankspace’, that is showing us the gaps between the original documents (whereas a single ‘blankspace’ marks where an ORGANIZATION or PERSON line was blanked out).

At this point, we might want to remove spaces between place names and use underscores instead, so that when we finally get this into a comma separated format, places like ‘United States’ will be ‘United_States’ rather than ‘united, states’.

Find: #a single space

Replace: _

Now, we want to reintroduce the space after ‘LOCATION:’, so

Find: :_ # this is a colon, followed by an underscore

Replace: : #this is a colon, followed by a space

Now we want to get all of the locations for a single letter onto a single line, so we want to get rid of the new line characters and the LOCATION: labels.

Find: \n(LOCATION:)

Replace:

It is beginning to look like a list. Let’s replace ‘blankspaceblankspace’ with ‘new-document’.

Find: blankspaceblankspace

Replace: new-document

And now let’s get those single blankspace lines excised:

Find: \n(blankspace)

Replace:

Now you have something that looks like this:

new-document Houston Republic_of_Texas United_States Texas

new-document Texas

new-document United_States Town_of_Columbia

new-document United_States Texas

new-document United_States United_States_of_ Republic_of_Texas

new-document New_Orleans United_States_Govern

new-document United_States Houston United_States Texas united_states

new-document United_States New_Orleans United_States

new-document Houston United_States Texas United_States

new-document New_Orleans

Why not leave ‘new-document’ in the file? It’ll make a handy marker. Let’s replace the spaces with commas:

Find: #a single space

Replace: ,

Save your work with a .csv file extension. You now have a comma separated values (CSV) file – a plain text table in which each line represents one letter and each value between commas is a place mentioned in it – that you can load into a variety of other platforms or tools to perform your analysis. Of course, NER is not perfect, especially when dealing with names like ‘Houston’ that were a personal name long before they were a place name.
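If you would rather script this sequence than repeat the find-and-replace steps by hand, the following is a minimal Python sketch of the same idea. It is not the procedure described above but an equivalent shortcut: it works directly from the raw terminal output (saved here to a hypothetical file called ner-terminal-output.txt), keeps only the LOCATION lines, treats any run of blank lines as the break between letters, and writes the same kind of comma separated file. Because it makes slightly different decisions about blank lines, its groupings may not match the hand-worked example exactly.

    letters = []      # each entry will hold the places mentioned in one letter
    current = []

    with open("ner-terminal-output.txt") as f:   # hypothetical name for the saved terminal output
        for line in f:
            line = line.strip()
            if not line:                         # a blank line marks a gap in the original document
                if current:
                    letters.append(current)
                    current = []
            elif line.startswith("LOCATION:"):
                place = line.split(":", 1)[1].strip().replace(" ", "_")
                current.append(place)
    if current:
        letters.append(current)

    with open("texas-locations.csv", "w") as out:
        for places in letters:
            out.write("new-document," + ",".join(places) + "\n")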

REGEX on PC to Useful Output:

Open the file in Notepad++. There are a lot of carriage returns, line breaks, and white spaces in this document that will make our regex very complicated and will, in fact, periodically break it if we try to work with the file as it is. Instead, let us remove all new lines and whitespace at the outset, and use regex to put the structure back. To begin, we want to get the entire document onto a single line:

Find: \n

Replace with nothing.

Notepad++ reports the number of lines in the document at the bottom of the window. It should now report lines: 1.

Let’s introduce line breaks to correspond with the original letters (i.e., a line break signals a new letter), since we want to visualize the network of locations mentioned, joining places together on the basis of being mentioned in the same letter. If we examine the original document, one candidate marker for a new letter could be <PERSON> to <PERSON>, but the named entity recognizer sometimes confuses persons for places or organizations, and the optical character recognition also complicates things. Since the letters are organized chronologically, and we can see ‘Digitized by Google’ at the bottom of every page (and since no writer in the 1840s is going to use the word digitized), let’s use that as our page break marker. For the most part, one letter corresponds to one page in the book. It’s not ideal, but this is something that the historian will have to discuss in her methods and conclusions. Perhaps, over the seven hundred odd letters in this collection, it doesn’t actually make much difference.

Find: (digitized)

Replace: \n\1

You now have on the order of 700 lines in the document.

Let’s find the locations and put them on individual lines, so that we can strip out everything else.

Find: (LOCATION)(.*?)(LOCATION)

Replace: \n\2

In the search string above, a greedy .* would match everything between the first LOCATION on the line and the last LOCATION on the line, so we would get a lot of junk. By using the lazy .*? we capture just the text between one LOCATION and the next LOCATION tag.
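You can see the difference for yourself by testing both versions on an invented line in Python (a minimal sketch; the sample text is made up, and the full <LOCATION> tags are used for clarity):

    import re

    line = "<LOCATION>Texas</LOCATION> some other words <LOCATION>Austin</LOCATION>"

    print(re.findall(r"<LOCATION>(.*)</LOCATION>", line))
    # greedy: ['Texas</LOCATION> some other words <LOCATION>Austin'] - one match full of junk

    print(re.findall(r"<LOCATION>(.*?)</LOCATION>", line))
    # lazy: ['Texas', 'Austin'] - one clean capture per tag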

We now need to replace the <, >, and / characters from the location tags, as they will get in the way of our remaining searches. Turn off regular expressions in Notepad++ by unticking the ‘regular expression’ box. Now search for <, >, and / in turn, replacing each with a tilde:

<LOCATION>Texas</LOCATION> becomes ~Texas~~~

Now we want to delete in each line everything that isn’t a location. Our locations start the line and are trailed by three tildes, so we just need to find those three tildes and everything that comes after, and delete. Turn regular expressions back on in the search window.

Find: (~~~)(.*)

Replace with nothing.

Our marker for a new page in this book was the line that began with ‘Digitized by’. We need to delete that line in its entirety, leaving a blank line.

Find: (Digitized)(.*)

Replace: \n

Now we want to get those locations all in the same line again. We need to find the new line character followed by the tilde, and then delete that new line, replacing it with a comma:

Find: (\n)(~)

Replace: ,

Now we remove extraneous blank lines.

Find: \s*$

Replace with nothing.

We’re almost there. Let’s remove the comma that starts each line by searching for

^,

(remembering that the caret character indicates the start of a line) and replacing with nothing. Congratulations! You’ve taken quite complicated output and cleaned it so that all the places mentioned on a single page of the original publication now sit together on one line, which means you can import it into a network visualization package, a spreadsheet, or some other tool!
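As with the Mac workflow, the whole sequence can also be scripted rather than worked through by hand. The following is a minimal Python sketch of the same logic, assuming your tagged output is in my-ner-output.txt (the name used above); the output file name is just an example. It splits the file at each ‘Digitized by Google’ page marker, pulls out whatever sits between <LOCATION> tags on that page, and writes one comma separated line per page.

    import re

    with open("my-ner-output.txt", encoding="utf-8") as f:
        text = f.read()

    # each chunk between page markers corresponds roughly to one letter
    pages = text.split("Digitized by Google")

    with open("texas-locations-by-page.csv", "w") as out:    # hypothetical output name
        for page in pages:
            # the lazy .*? grabs only the text between a tag and its closing tag
            places = re.findall(r"<LOCATION>(.*?)</LOCATION>", page, flags=re.DOTALL)
            if places:
                out.write(",".join(p.strip().replace(" ", "_") for p in places) + "\n")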

Other things you could do with this data

Open your csv (the one where each place name is separated by a comma) in Excel. Insert a column and add a unique number for each row. Insert a row at the top, and give each column a heading: record_id, place1, place2, place3, and so on up to place29 (one row has 29 place names in it!).
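If your collection is large, you can also build that table in a few lines of Python instead of by hand in Excel. This is a minimal sketch, assuming the cleaned file from the previous section is called texas-locations.csv (both file names here are examples): it numbers each row, pads shorter rows so that every row has the same number of columns, drops the ‘new-document’ marker (no longer needed once each letter sits on its own row), and writes a header of record_id, place1, place2, and so on.

    import csv

    with open("texas-locations.csv", newline="") as f:
        rows = [[cell for cell in row if cell != "new-document"]
                for row in csv.reader(f) if row]

    width = max(len(row) for row in rows)        # the longest row sets how many place columns we need
    header = ["record_id"] + ["place{}".format(i) for i in range(1, width + 1)]

    with open("texas-locations-table.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for record_id, row in enumerate(rows, start=1):
            writer.writerow([record_id] + row + [""] * (width - len(row)))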

Take this table and copy and paste it into Palladio, from the Centre for Spatial and Textual Analysis at Stanford. Go to http://palladio.designhumanities.org. Paste the table in. Click on the graph icon – this will make a network from your data, but you need to tell it the ‘source’ and ‘target’ nodes. When we look at the original letters, we see that the writer often identified the town in which he was writing and the town of the addressee. Click on ‘choose’ under ‘source’ and select place3. Under ‘target’, click on ‘choose’ and select place4. You will now have a network graph much like figure 3.xx. Why choose the third and fourth places? Perhaps it makes sense, for a given research question, to assume that, with the pleasantries out of the way, the writers will discuss the places important to their message.

It would also be useful to add the temporal information from the letters, so that the evolution of this network over time could be spotted. There are a number of other things you could do to enhance the visualization. Now that you are familiar with named entity recognition, regex, and a few easy visualization tools, you could set the NER to recognize dates as well. You could then set Palladio up to show a timeslider against your network, only showing those nodes within the correct date range.17 Or you could upload one column of place names at a time to http://scargill.inf.ed.ac.uk/geoparser.html to obtain geographical coordinates; the results could then be incorporated into your Palladio visualization so that your network graph is laid out against geographic space (see the Palladio documentation for extending your table with new subtables of data). Palladio allows you to export your visualization as a graphic (in svg format) or in .json format.18

[*insert figure 3.xx Palladio representation of the network of places mentioned in letters]*

Let’s try another tool. Copy your table of data again, and then paste it into http://app.raw.densitydesign.org/ . You can now experiment with a number of different visualizations that are all built on the d3.js code library. Try the ‘alluvial’ diagram. Pick place1 and place2 as your dimensions. Does anything jump out? Try place3 and place4. Try place1, place2, place3, and place4 in a single alluvial diagram. Experiment! This is one of the joys of working with data: experimenting to see how you can deform your materials to see them in a new light. Which visualization is the right visualization? We turn to that question momentarily.

Quickly Extracting Tables from PDFs

Much of the historical data that we would like to examine and that we find online comes in the cumbersome form of the PDF, the ‘portable document format’. Governments especially love the PDF because PDFs can be generated quickly in response to freedom of information requests, and because they preserve the look and layout of the original paper documents. Sometimes a PDF is little more than an image; the text it shows is just a pattern of dark and light dots. Other times, there is a hidden layer of machine-readable text that one can select with the mouse by clicking, dragging, and copying. When we are dealing with tens or hundreds or thousands of pages of PDFs, this quickly becomes an unworkable workflow. Journalists have the same problem, and have developed many tools that the historian may wish to incorporate into her own workflow. Recently, the ‘data journalist’ Jonathan Stray has written about the various free and paid tools that can be wrangled to extract meaningful data from thousands of PDFs at a time.19 One in particular that Stray mentions is called ‘Tabula’, which can be used to extract tables of information from PDFs, such as may be found in census documents.

Tabula is open source and runs on all the major platforms. You simply download it and double-click on the icon; it loads up inside your browser at the address http://127.0.0.1:8080.20 If for some reason it does not appear, try loading that address directly. You then load your PDF into it and manually select the tables you are interested in: when the PDF appears, draw boxes around them. Tabula will extract each table cleanly, allowing you to download it as a csv or tab separated file, or paste it directly into something else.

For instance, say you’re interested in the data that Gill and Chippindale compiled on Cycladic figures and the art market.21 If you have access to the JSTOR database, you can find it here: http://www.jstor.org/stable/506716?. There are a lot of charts, so it is a good example to play with. You would like to grab those tables of data, perhaps to compile them with data from other sources for some sort of meta-study of the antiquities market.

Download the paper and open it in your PDF reader. Let’s look at table 2 from the article. You could just highlight this table in your PDF reader and hit ctrl+c to copy it, but when you paste it into your spreadsheet, you’d get everything in a single column. For a small table, maybe that’s not such a big issue, but look at what you get with Tabula. With Tabula running, go to the PDF you are interested in and draw a bounding box around the table. Release the mouse, and you will be presented with a preview that you can then download as a csv. You can quickly drag the selection box around every table in the document and hit download just the one time.

Since you can copy directly to the clipboard, you can paste directly into a Google drive spreadsheet (thus taking advantage of all the visualization options that Google offers) or into something like Raw from Density Design, exploring the data and its patterns through a variety of quickly generated visualizations.22

Making Your Data Look Pretty: Visualizations

By this point in the book, you know how to gather data, and we are beginning to explore various ways to visualize it. While historians often make use of graphs and charts, the training of an historian typically includes very little on the principles of information visualization. The following section provides guidance on the various issues at play when we visualize data.

Principles of Information Visualization

It should now be clear that reading history through a macroscope will involve visualization. Visualization is a method of deforming, compressing, or otherwise manipulating data in order to see it in new and enlightening ways. A good visualization can turn hours of careful study into a flash of insight, or can convey a complex narrative in a single moment. Visualizations can also lie, confuse, or otherwise misrepresent if used poorly. This section will introduce historians to types of visualizations, why they might want to use them, and how to use them effectively. It will also show off some visualizations which have been used effectively by historians.

Why Visualize?

A 13th century Korean edition of the Buddhist canon contains over 52 million characters across 166,000 pages. Lewis Lancaster describes a traditional analysis of this corpus as such:

The previous approach to the study of this canon was the traditional analytical one of close reading of specific examples of texts followed by a search through a defined corpus for additional examples. When confronted with 166,000 pages, such activity had to be limited. As a result, analysis was made without having a full picture of the use of target words throughout the entire collection of texts. That is to say, our scholarship was often determined and limited by externalities such as availability, access, and size of written material. In order to overcome these problems, scholars tended to seek for a reduced body of material that was deemed to be important by the weight of academic precedent.23

As technologies advanced, the old limitations were no longer present; Lancaster and his team worked to create a search interface (figure 3.12) that would allow historians to see the evolution and use of glyphs over time, effectively allowing them to explore the entire text all at once. No longer would historians need to selectively pick which areas of this text to scrutinize; they could quickly see where in the corpus their query was most-often used, and go from there.

[insert Figure 3.12 Screenshot of Lancaster’s interface for studying the evolution of glyphs over time]

This approach to distant reading–seeing where in a text the object of inquiry is densest–has since become so common as to no longer feel like a visualization. Amazon’s Kindle has a search function called X-Ray (figure 3.13) which allows the reader to search for a series of words, and see the frequency with which those words appear in a text over the course of its pages. In Google’s web browser, Chrome, searching for a word on a webpage highlights the scroll bar on the right-hand side such that it is easy to see the distribution of that word use across the page.

[insert Figure 3.13 Amazon’s x-ray search juxtaposed against Google Chrome’s scroll-bar search term indicator]

The use of visualizations to show the distribution of words or topics in a document is an effective way of getting a sense for the location and frequency of your query in a corpus, and it represents only one of the many uses of information visualization. Uses of information visualization generally fall into two categories: exploration and communication.

Exploration

When first obtaining or creating a dataset, visualizations can be a valuable aid in understanding exactly what data are available and how they interconnect. In fact, even before a dataset is complete, visualizations can be used to recognize errors in the data collection process. Imagine you are collecting metadata from a few hundred books in a library collection, making note of the publisher, date of publication, author names, and so on. A few simple visualizations, made easily in software like Microsoft Excel, can go a long way in pointing out errors. In the chart in figure 3.14, for example, it is easy to spot that whoever entered the data on book publication dates accidentally typed “1909” rather than “1990” for one of the books.

[insert Figure 3.14 An error in data entry becomes apparent quickly when data is represented as a chart.]

Similarly, visualizations can be used to get a quick understanding of the structure of data being entered, right in the spreadsheet. The visualization in figure 3.15, of salaries at a university, makes it trivial to spot which department’s faculty have the highest salaries, and how those salaries are distributed. It utilizes basic functions in recent versions of Microsoft Excel.

[Insert Figure 3.15 Even a spreadsheet can be considered a visualization.]

More complex datasets can be explored with more advanced visualizations, and that exploration can be used for everything from getting a sense of the data at hand, to understanding the minute particulars of one data point in relation to another. The visualization in Figure 3.16, ORBIS, allows the user to explore transportation networks in the Ancient Roman world. This particular display is showing the most likely route from Rome to Constantinople under a certain set of conditions, but the user is invited to tweak those conditions, or the starting and ending cities, to whatever best suits their own research questions.

[*Insert Figure 3.16 Screenshot of a detail from ORBIS]*

Exploratory visualizations like this one form a key part of the research process when analyzing large datasets. They sit particularly well as an additional layer in the hermeneutic process of hypothesis formation. You may begin your research with a dataset and some preconceptions of what it means and what it implies, but without a well-formed thesis to be argued. The exploratory visualization allows you to notice trends or outliers that you may not have noticed otherwise, and those trends or outliers may be worth explaining or discussing in further detail. Careful historical research of those points might reveal even more interesting directions worth exploring, which can then be folded into future visualizations.

Communication

Once the research process is complete, visualizations still have an important role to play in translating complex data relationships into easily digestible units. The right visualization can replace pages of text with a single graph and still convey the same amount of information. The visualization created by Ben Schmidt reproduced in figure 3.17, for example, shows the frequency with which certain years are mentioned in the titles of history dissertations.24 The visualization clearly shows that the great majority of dissertations cover the years after 1750, with spikes around the American Civil War and the World Wars. While that description of the chart captures the trends accurately, it does not convey the sheer magnitude of difference between earlier and later years as covered by dissertations, nor does it mention the sudden drop in dissertations covering periods after 1970.

[Insert Figure 3.17 The years mentioned in dissertation titles, after Schmidt 2013]

Visualizations in publications are often, but not always, used to improve a reader’s understanding of the content being described. It is also common for visualizations to be used to catch the eye of readers or peer reviewers, to make research more noticeable, memorable, or publishable. In a world that values quantification so highly, visualizations may lend an air of legitimacy to a piece of research which it may or may not deserve. We will not comment on the ethical implications of such visualizations, but we do note that they are increasingly common and seem to play a role in successfully passing peer review, receiving funding, or catching the public eye. Whether the ends justify the means is a decision we leave to our readers.

Types of Visualizations

Up until this point, we have used the phrase information visualization without explaining it or differentiating it from other related terms. We remedy that here: information visualization is the mapping of abstract data to graphic variables in order to make a visual representation. We use these representations to augment our abilities to read data; we cannot hope to intuit all relationships in our data by memory and careful consideration alone, and visualizations make those relationships more apparent.

An information visualization differs from a scientific visualization in the data it aims to represent, and in how that representation is instantiated. Scientific visualizations maintain a specific spatial reference system, whereas information visualizations do not. Visualizations of molecules, weather, motors, and brains are all scientific visualizations because they each already have a physical instantiation, and their visual form is preserved in the visualization. Bar charts, scatter plots, and network graphs, on the other hand, are all information visualizations, because they lay out in space data which do not have inherent spatiality. An infographic is usually a combination of information and scientific visualizations embedded in a very explicit narrative and marked up with a good deal of text.

These types are fluid, and some visualizations fall between categories. Most information visualizations, for example, contain some text, and any visualization we create is imbued with the narrative and purpose we give it, whether or not we realize we have done so. A truly “objective” visualization, where the data speak for themselves, is impossible. Our decisions on how to encode our data and which data to present deeply influence the understanding readers take away from a visualization.

Visualizations also vary between static, dynamic, and interactive. Experts in the area have argued that the most powerful visualizations are static images with clear legends and a clear point, although that may be changing with increasingly powerful interactive displays which give users impressive amounts of control over the data. Some of the best modern examples come from the New York Times visualization team. Static visualizations are those which do not move and cannot be manipulated; dynamic visualizations are short animations which show change, either over time or across some other variable; interactive visualizations allow the user to manipulate the graphical variables themselves in real-time. Often, because of change blindness, dynamic visualizations may be confusing and less informative than sequential static visualizations. Interactive visualizations have the potential to overload an audience, especially if the controls are varied and unintuitive. The key is striking a balance between clarity and flexibility.

There is more to visualization than bar charts and scatter plots. Scholars are constantly creating new variations and combinations of visualizations, and have been for hundreds of years. An exhaustive list of all the ways information has or can be visualized would be impossible, although we will attempt to explain many of the more common varieties. Our taxonomy is influenced by visualizing.org, a website dedicated to cataloging interesting visualizations, but we take examples from many other sources as well.

Statistical Charts & Time Series

Statistical charts are likely those that will be most familiar to any audience. When visualizing for communication purposes, it is important to keep in mind which types of visualizations your audience will find legible. Sometimes the most appropriate visualization for the job is the one that is most easily understood, rather than the one that most accurately portrays the data at hand. This is particularly true when representing many abstract variables at once: it is possible to create a visualization with color, size, angle, position, and shape all representing different aspects of the data, but it may become so complex as to be illegible.

[insert Figure 3.18 a basic bar chart]

Figure 3.18 is a basic bar chart of the number of non-fiction books held in some small collection, categorized by genre. One dimension of the data is the genre, which is qualitative, and each genre is compared along a second dimension, the number of books, which is quantitative. Data with two dimensions, one qualitative and one quantitative, are usually best represented as bar charts such as this.
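A chart like this takes only a few lines to produce. The following is a minimal sketch using Python’s matplotlib library (assuming it is installed); the genres and counts are invented stand-ins for whatever collection you are describing.

    import matplotlib.pyplot as plt

    genres = ["Biography", "History", "Reference", "Science", "Travel"]   # qualitative dimension (invented)
    counts = [120, 85, 40, 60, 25]                                        # quantitative dimension (invented)

    positions = range(len(genres))
    plt.bar(positions, counts)
    plt.xticks(positions, genres)
    plt.ylabel("Number of books")
    plt.title("Non-fiction holdings by genre")
    plt.show()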

Sometimes you want to visualize data as parts of a whole, rather than as absolute values. In these cases, with the same qualitative/quantitative split in the data, most people will immediately reach for a pie chart such as the one in figure 3.19. This is often a poor choice: pie charts tend to be cluttered – especially as the number of categories increases – and people have a difficult time interpreting the area of a pie slice.

[insert Figure 3.19 a pie chart]

The same data can be rendered as a stacked bar chart (figure 3.20), which produces a visualization with much less clutter. This chart also significantly decreases the reader’s cognitive load, as they merely need to compare bar lengths rather than try to estimate the area of a slice of pie.

[insert Figure 3.20 A stacked bar chart]

When there are two quantitative variables to be represented, rather than a quantitative and a qualitative, the visualization most often useful is the line graph or scatterplot. Volumes of books in a collection ordered by publication year, for example, can be expressed with the year on the horizontal axis (x-axis) and the number of books on the vertical axis (y-axis). The line drawn between each (x,y) point represents our assumption that the data points are somehow related to each other, and an upward or downward trend is somehow meaningful (figure 3.21).

[insert Figure 3.21 a line graph]

We could replace the individual lines between years with a trend line, one that shows the general upward or downward trend of the data points over time (figure 3.22). This reflects our assumption that not only are the year-to-year changes meaningful, but that there is some underlying factor causing the total number of volumes to shift upward or downward across the entire timespan. In this case, on average the number of books in the collection decreases as publication dates approach the present day, which can easily be explained by the lag before the decision is made to purchase a book for the collection.

[insert Figure 3.22 The line in figure 3.21 is removed and replaced with a trendline on the points]
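Fitting such a trend line is a one-line calculation in most environments. The sketch below, again in Python and assuming the numpy and matplotlib libraries are installed, plots invented yearly counts as points and lays a least-squares straight line over them; the numbers are made up purely for illustration.

    import numpy as np
    import matplotlib.pyplot as plt

    years = np.arange(1990, 2015)                      # invented publication years
    volumes = np.array([34, 31, 36, 33, 30, 32, 29, 31, 28, 30,
                        27, 29, 26, 28, 25, 27, 24, 26, 23, 25,
                        22, 24, 21, 23, 20])           # invented counts, drifting downward

    slope, intercept = np.polyfit(years, volumes, 1)   # least-squares fit of a straight line
    plt.scatter(years, volumes)
    plt.plot(years, slope * years + intercept)         # the trend line itself
    plt.xlabel("Publication year")
    plt.ylabel("Volumes in collection")
    plt.show()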

Scatterplots have the added advantage of being amenable to additional dimensions of data. The scatterplot in figure 3.23 compares three dimensions: genre (qualitative), the number of volumes of each genre in the collection (quantitative), and the average number of pages per genre (quantitative). It shows us, for example, that the collection contains quite a few biographies, and that biographies have far fewer pages on average than reference books. The scatterplot also shows us that it is fairly useless; there are no discernible trends or correlations between any of the variables, and no new insights emerge from viewing the visualization.

[insert Figure 3.23 a scatterplot]

The histogram is a visualization that is both particularly useful and extremely deceptive for the unfamiliar. It appears to be a vertical bar chart, but instead of the horizontal axis representing categorical data, a histogram’s horizontal axis usually also represents quantitative data, sub-divided in a particular way. Another way of saying this is that in a bar chart the categories can be moved left or right without changing the meaning of the visualization, whereas in a histogram there is a definite order to the categories of the bars. For example, figure 3.24 represents a histogram of grade distributions in a college class; it would not make sense for the letter grades to be in any order but the one presented. Additionally, histograms always represent the distribution of certain values; that is, the height of the bar can never represent something like temperature or age, but instead represents the frequency with which some value appears. In figure 3.24, bar height represents the frequency with which students in a college course received certain grades.

[Figure 3.24 a histogram of students’ grades]

The histogram (figure 3.24) shows that the distribution of students’ grades does not follow a true bell curve, which would have as many As as Fs in the class. This is not surprising for anyone who has taught a course, but it is a useful visualization for representing such divergences from expected distributions.
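To see how a histogram differs from a bar chart in practice, here is a minimal matplotlib sketch. The marks are invented, and the bin edges follow a conventional F/D/C/B/A grading scheme, so the bars have a fixed order along the horizontal axis.

    import matplotlib.pyplot as plt

    marks = [42, 48, 55, 58, 61, 63, 65, 67, 68, 70, 71,
             72, 73, 74, 75, 76, 78, 81, 84, 88, 93]      # invented final marks for one class

    plt.hist(marks, bins=[0, 50, 60, 70, 80, 100])        # each bar counts the marks falling in a grade band
    plt.xlabel("Final mark")
    plt.ylabel("Number of students")
    plt.show()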

Despite their seeming simplicity, these very basic statistical visualizations can be instigators of extremely useful analyses. The visualization in figure 3.25 shows the changing frequency of the use of “aboue” and “above” (spelling variations of the same word) in English printed text from 1580 to 1700. Sam Kaislaniemi noted in a blog post how surprising it is that the dominant spelling seems to have changed so drastically over a period of two decades. This instigated further research, leading to an extended blog post and research into a number of other datasets from the same time period.

[Figure 3.25 The changing frequency of ‘aboue’ and ‘above’]

Maps

Basic maps may be considered scientific visualizations, because latitude and longitude form a pre-existing spatial reference system to which most geographic visualizations conform exactly. However, as content is added to a map, it may gain a layer or layers of information visualization.

One of the most common geographic visualizations is the choropleth, in which bounded regions are colored and shaded to represent some statistical variable (figure 3.26). Common uses for choropleths include representing population density or election results. The visualization in figure 3.26, created by Mike Bostock, colors counties by unemployment rate, with darker counties having higher unemployment. Choropleth maps should be used for ratios and rates rather than absolute values; otherwise larger areas may be disproportionately colored darker merely because there is more room for people to live in them.

[insert Figure 3.26 A choropleth map]

For some purposes, choropleths provide insufficient granularity for representing density. In the 1850s, a cholera outbreak in London left many concerned and puzzled over the origin of the epidemic. Dr. John Snow created a dot density map (Figure 3.27) showing the location of cholera cases in the city. The visualization revealed that most cases were around a single water pump, suggesting the outbreak was due to a contaminated water supply.

[insert Figure 3.27 John Snow’s dot density map]

For representing absolute values on maps, you should instead consider using a proportional symbol map. The map depicted in figure 3.28, created by Mike Bostock, shows the populations of some of the United States’ largest cities. These visualizations are good for directly comparing absolute values to one another, when geographic region size is not particularly relevant. Keep in mind that often, even if you plan on representing geographic information, the best visualizations may not be on a map. In this case, unless you are trying to show that the higher density of populous areas is in the Eastern U.S., you may be better served by a bar chart, with bar heights representative of population size. That is, the latitude and longitude of the cities is not particularly important in conveying the information we are trying to get across.

[insert Figure 3.28 A proportional symbol map showing the population of the largest cities in the US.]

Data that change continuously throughout geographic space (e.g. temperature or elevation) require more complex visualizations. The most common in this case are known as isopleth, isarithmic, or contour maps, and they represent gradual change using adjacent, curving lines. Note that these visualizations work best for data which contain smooth transitions. A topographic map (figure 3.29) uses adjacent lines to show gradual changes in elevation; the closer together the lines, the more rapidly the elevation changes.

[insert Figure 3.29 detail of a contour map]

Geographic maps have one feature that sets them apart from most other visualizations: we know them surprisingly well. While few people can label every U.S. state or European country on a map accurately, we know the shape of the world well enough that we can take some liberties with geographic visualizations that we cannot take with others. Cartograms are maps which distort the basic spatial reference system of latitude and longitude in order to represent some statistical value. They work because we know what the reference is supposed to look like, so we can immediately intuit how the cartogram differs from the “base map” we are familiar with. The cartogram depicted in figure 3.30, created by M.E.J. Newman, distorts state sizes by their population, and colors the states by how they voted in the 2008 U.S. presidential election.25 It shows that, although a greater area of the United States may have voted Republican, those areas tended to be quite sparsely populated.

[insert Figure 3.30 Newman’s cartogram of the 2008 US Presidential election results]

Maps are not necessarily always the most appropriate visualizations for the job, but when they are used well, they can be extremely informative.

In the humanities, map visualizations will often need to be of historical or imagined spaces. While there are many convenient pipelines to create custom data overlays of maps, creating new maps entirely can be a grueling process with few easy tools to support it. It is never as simple as taking a picture of an old map and scanning it into the computer; the aspiring cartographer will need to painstakingly match points on an old scanned map to their modern latitude and longitude, or to create new map tiles entirely. The below visualizations are two examples of such efforts: the first (figure 3.31) is a reconstructed map of the ancient world which includes aqueducts, defense walls, sites, and roads by Johan Åhlfeldt with Pelagios,26 and the second (figure 3.32) is a reconstructed map of Tolkien’s Middle Earth by Emil Johansson.27 Both are examples of extremely careful humanistic work which involved both additional data layers, and changes to the base map.

[Insert Figure 3.31 detail centred on Rome, Digital map of the Roman empire]

[insert Figure 3.32 Detail of the LOTRProject map of Middle earth, with The Shire to the left]

Hierarchies & Trees

Whereas the previous types of visualizations dealt with data that were some combination of categorical, quantitative, and geographic, some data are inherently relational, and do not lend themselves to these sorts of visualizations. Hierarchical and nested data are a variety of network data, but they are a common enough variety that many visualizations have been designed with them in mind specifically. Examples of this type of data include family lineages, organizational hierarchies, computer subdirectories, and the evolutionary branching of species.

The most common forms of visualization for this type of data are vertical and horizontal trees. The horizontal tree in figure 3.33, made in D3.js, shows the children and grandchildren of Josiah Wedgwood. These visualizations are extremely easy for most people to read, and have been used for many varieties of hierarchical data. Trees have the advantage of being more legible than most other network visualizations, but the disadvantage of being fairly restrictive in what they can visualize.

[insert Figure 3.33 Horizontal tree showing the descendants of Josiah Wedgwood]

Another form of hierarchical visualization, the radial tree, is often used to show ever-branching structures, as in an organization. The radial tree in figure 3.34, a 1924 organization chart from a volume on management statistics by W. H. Smith,28 emphasizes how power in the organization is centralized in one primary authority. It is important to remember that stylistic choices can deeply influence the message taken from a visualization. Horizontal and radial trees can represent the same information, but the former emphasizes change over time, whereas the latter emphasizes the centrality of the highest rung in the hierarchy. Both are equally valid, but they send very different messages to the reader.

[Insert Figure 3.34 Radial tree]

One of the more recently-popular hierarchical visualizations is the treemap designed by Ben Shneiderman. Treemaps use nested rectangles to display hierarchies, the areas of which represent some quantitative value. The rectangles are often colored to represent a third dimension of data, either categorical or quantitative. The visualization in figure 3.35 is of Washington D.C.’s budget in 2013, separated into governmental categories. Rectangles are sized proportionally to the amount of money received per category in 2013, and colored by the percentage that amount had changed since the previous fiscal year.

[insert Figure 3.35 Washington D.C.’s budget, 2013, as a treemap]

Networks & Matrices

Network visualizations can be complex and difficult to read. Nodes and edges are not always represented as dots and lines, and even when they are, the larger the network, the more difficult they are to decipher. The reasons behind visualizing a network can differ, but in general, visualizations of small networks are best at allowing the reader to understand individual connections, whereas visualizations of large networks are best for revealing global structure.

Network visualizations, much like network analysis, may or may not add insight depending on the context. A good rule of thumb is to ask a network-literate friend to read the final product and say whether the network visualization helps them understand the data or the narrative any more than the prose alone. It often will not. We recommend not including a visualization of the data solely for the purpose of revealing the complexity of the data at hand, as it conveys little information and feeds into a negative stereotype of network science as an empty methodology. We go into network visualizations in some depth in Chapter 6, and the reader may wish to skip ahead.

In fact, we recommend that, where possible, complex network visualizations should be avoided altogether. It is often easier and more meaningful for a historical narrative to simply provide a list of the most well-connected nodes, or, e.g., a scatterplot showing the relationship between connectivity and vocation. If the question at hand can be more simply answered with a traditional visualization which historians are already trained to read, it should be.

Small Multiples & Sparklines

Small multiples and sparklines are not exactly different types of visualization than what have already been discussed, but they represent a unique way of presenting visualizations that can be extremely compelling and effective. They embody the idea that simple visualizations can be more powerful than complex ones, and that multiple individual visualizations can often be more easily understood than one incredibly dense visualization.

Small multiples are exactly what they sound like: the use of multiple small visualizations adjacent to one another for the purposes of comparison. They are used in lieu of animations or of one single, extremely complex visualization attempting to represent the entire dataset. Figure 3.36, by Brian Abelson of OpenNews, shows cold- and warm-weather anomalies in the United States since 1964.29 Cold-weather anomalies are in blue, and warm-weather anomalies are in red. This visualization is used to show increasingly extreme warm weather due to global warming.

[insert *Figure 3.36 small multiples, a series of maps representing weather anomalies in the US since 1964, after Brian Abelson]*

Sparklines, a term coined by Edward Tufte, are tiny line charts with no axis or legend. They can be used in the middle of a sentence, for example to show a changing stock price over the last week ( ), which will show us general upward or downward trends, or in small multiples to compare several values. Microsoft Excel has a built-in sparkline feature for just such a purpose. Figure 3.37 is a screenshot from Excel, showing how sparklines can be used to compare the frequency of character appearances across different chapters of a novel.

[Insert Figure 3.37 sparklines in MS Excel, depicting character appearances in a novel.]

The sparklines above quickly show Carol as the main character, and that two characters were introduced in Chapter 3, without the reader needing to look at the numbers in the rest of the spreadsheet.

Choosing the Right Visualization

There is no right visualization. A visualization is a decision you make based on what you want your audience to learn. That said, there are a great many wrong visualizations. Using a scatterplot to show average rainfall by country is a wrong decision; using a bar chart is a better one. Ultimately, your choice of which type of visualization to use is determined by how many variables you are using, whether they are qualitative or quantitative, how you are trying to compare them, and how you would like to present them. Creating an effective visualization begins by choosing from one of the many appropriate types for the task at hand, and discarding inappropriate types as necessary. Once you have chosen the form your visualization will take, you must decide how you will create the visualization: what colors will you use? What symbols? Will there be a legend? The following sections cover these steps.

Visual Encoding

Once a visualization type has been chosen, the details may seem either self-evident or negligible. Does it really matter what color or shape the points are? In short, yes: it matters just as much as the choice of visualization itself. And when you know how to use the various types of visual encoding effectively, you can design new forms of visualization which suit your needs perfectly. The art of visual encoding lies in matching data variables and graphic variables appropriately. Graphic variables include the color, shape, or position of objects in the visualization, whereas data variables are whatever is being visualized (e.g. temperature, height, age, country name, etc.).

Scales of Measure

The most important aspect of choosing an appropriate graphic variable is knowing the nature of your data variables. Although the form data take will differ from project to project, they will likely conform to one of five varieties: nominal, relational, ordinal, interval, or ratio.

Nominal data, also called categorical data, is a completely qualitative measurement. It represents different categories or labels or classes. Countries, people’s names, and different departments in a university are all nominal variables. They have no intrinsic order, and their only meaning is in how they differentiate from one another. We can put country names in alphabetical order, but that order does not say anything meaningful about their relationships to one another.

Relational data is data on how nominal data relate to one another. It is not necessarily quantitative, although it can be. Relational data requires some sort of nominal data to anchor it, and can include friendships between people, the existence of roads between cities, and the relationship between a musician and the instrument she plays. This type of data is usually, but not always, visualized in trees or networks. A quantitative aspect of relational data might be the length of a phone call between people or the distance between two cities.

Ordinal data is that which has inherent order, but no inherent degree of difference between what is being ordered. The first, second, and third place winners in a race are on an ordinal scale, because we do not know how much faster first place was than second; only that one was faster than the other. Likert scales, commonly used in surveys (e.g. strongly disagree / disagree / neither agree nor disagree / agree / strongly agree), are an example of commonly-used ordinal data. Although order is meaningful for this variable, the fact that it lacks any inherent magnitude makes ordinal data a qualitative category.

Interval data is data which exists on a scale with meaningful quantitative magnitudes between values. It is like ordinal data in that the order matters, but additionally the distance between values is meaningful: the difference between the first and second values on the scale is the same as the difference between the second and third. Longitude, temperature in Celsius, and dates all exist on an interval scale.

Ratio data is data which, like interval data, has a meaningful order and a constant scale between ordered values, but which additionally has a meaningful zero value. A temperature of zero degrees Celsius or Fahrenheit does not mean an absence of temperature; the zero point is arbitrary. Compare this to weight, age, or quantity: having no weight is physically meaningful, and different both in quantity and kind from having some weight above zero.

Having a meaningful zero value allows us to use calculations with ratio data that we could not perform on interval data. For example, if one box weighs 50 lbs and another 100 lbs, we can say the second box weighs twice as much as the first. However, we cannot say a day that is 100°F is twice as hot as a day that is 50°F, and that is due to 0°F not being an inherently meaningful zero value.

The nature of each of these data types will dictate which graphic variables may be used to visually represent them. The following section discusses several possible graphic variables, and how they relate to the various scales of measure.

Graphic Variable Types

Graphic variables are any of those visual elements that are used to systematically represent information in a visualization. They are building blocks. Length is a graphic variable; in bar charts, longer bars are used to represent larger values. Position is a graphic variable; in a scatterplot, a dot’s vertical and horizontal placement are used to represent its x and y values, whatever they may be. Color is a graphic variable; in a choropleth map of United States voting results, red is often used to show states that voted Republican, and blue for states that voted Democrat.

Unsurprisingly, some graphic variable types are better than others at representing different data types. Position in a 2D grid is great for representing quantitative data, whether interval or ratio. Area or length is particularly good for showing ratio data, as size also has a meaningful zero point. These have the added advantage of offering a virtually unlimited number of discernible values, so they can be used just as easily for a dataset of 2 points or 2 million. Compare this with angle. You can conceivably create a visualization that uses angle to represent quantitative values, as in figure 3.38. This is fine if you have very few, widely varied data points, but you will eventually reach a limit beyond which minute differences in angle are barely discernible. Some graphic variable types are fairly limited in the number of potential variations, whereas others have a much wider range.

[insert Figure 3.38, a misuse of angle to indicate age.]

Most graphic variables that are good for fully quantitative data will work fine for ordinal data, although in those cases it is important to include a legend making sure the reader is aware that a constant change in a graphic variable is not indicative of any constant change in the underlying data. Changes in color intensity are particularly good for ordinal data, as we cannot easily tell the magnitude of difference between pairs of color intensity.

Color is a particularly tricky concept in information visualization. Three variables can be used to describe color: hue, value, and saturation (figure 3.39).

[insert Figure 3.39 hue, value, and saturation]

These three variables should be used to represent different variable types. Except in one circumstance, discussed below, hue should only ever be used to represent nominal, qualitative data. People are not well-equipped to understand the quantitative difference between e.g. red and green. In a bar chart showing the average salary of faculty from different departments, hue can be used to differentiate the departments. Saturation and value, on the other hand, can be used to represent quantitative data. On a map, saturation might represent population density; in a scatterplot, saturation of the individual data points might represent somebody’s age or wealth. The one time hue may be used to represent quantitative values is when you have binary diverging data. For example, a map may show increasingly saturated blues for states which lean more Democratic, and increasingly saturated reds for states which lean more Republican. Besides this special case of two opposing colors, it is best to avoid using hue to represent quantitative data.

Shape is good for nominal data, although only if there are under half a dozen categories. You will see shape used on scatterplots when differentiating between a few categories of data, but shapes run out quickly after triangle, square, and circle. Patterns and textures can also be used to distinguish categorical data; these are especially useful if you need to distinguish between categories on something like a bar chart, but the visualization must be printed in black & white.

Relational data is among the most difficult to represent. Distance is the simplest graphic variable for representing relationships (closer objects are more closely related), but that variable can get cumbersome quickly for large datasets. Two other graphic variables to use are enclosure (surrounding items which are related by an enclosed line), or line connections (connecting related items directly via a solid line). Each has its strengths and weaknesses, and a lot of the art of information visualization comes in learning when to use which variable.

Cognitive and Social Aspects of Visualization

Luckily for us, there are a few gauges for choosing between visualization types and graphic variables that go beyond the merely aesthetic. Social research has shown various ways in which people process what they see, and that research should guide our decisions in creating effective information visualizations.

About a tenth of all men and a hundredth of all women have some form of color blindness. There are many varieties of color blindness; some people see completely in monochrome, others have difficulty distinguishing between red and green, or between blue and green, or other combinations besides. To compensate, visualizations may encode the same data in multiple variables. Traffic lights present a perfect example of this; most of us are familiar with the red, yellow, and green color schemes, but for those people who cannot distinguish among these colors, top-to-bottom orientation sends the same signal. If you need to pick colors that are safely distinguishable by most audiences, a few online services can assist. One popular service is colorbrewer (http://colorbrewer2.org/), which allows you to create a color scheme that fits whatever set of parameters you may need.

In 2010, Randall Munroe conducted a massive online survey asking people to name colors they were randomly shown. The results showed that women named colors more specifically than men: where a woman might have labeled a color as neon green, a man might simply have called it green. This does not imply that women can more easily differentiate between colors, as is occasionally suggested, although part of the survey results does show the disproportionate number of men who have some form of color blindness. Beyond sex and gender, culture also plays a role in the interpretation of colors in a visualization. In some cultures, death is signified by the color black; in others, it is signified by white. In most cultures, both heat and passion are signified by the color red; illness is often, but not always, signified by yellow. Your audience should influence your choice of color palette, as readers will always come to a visualization with preconceived notions of what your graphic variables imply.

Gestalt psychology is a century-old approach to understanding how people perceive patterns. It attempts to show how we perceive separate visual elements as whole units – how we organize what we see into discernible objects (figure 3.40). Its principles include proximity (we group nearby elements together), similarity (we group elements that look alike), and closure and continuity (we mentally complete shapes and follow smooth paths).

These and other gestalt principles can be used to make informed decisions regarding graphic variables. Knowing which patterns tend to lead to perceptions of continuity or discontinuity is essential in making effective information visualizations.

At a more fine-grained level, when choosing between equally appropriate graphic variables, research on preattentive processing can steer us in the right direction. We process certain graphic variables preattentively, meaning we can spot differences in those variables in a fraction of a second, before conscious attention kicks in. Color is a preattentively processed graphic variable, and thus in figure 3.41 you will very quickly spot the dot that does not belong. That the processing is preattentive implies you do not need to actively search for the difference to find it.

[insert Figure 3.41 One of these things just doesn’t belong here]

Size, orientation, color, density, and many other variables are preattentively processed. The issue comes when you have to combine multiple graphic variables and, in most visualizations, that is precisely the task at hand. When combining graphic variables (e.g. shape and color), what was initially preattentively processed often loses its ease of discovery. Research into preattentive processing can then be used to show which combinations are still useful for quick information gathering. One such combination is spatial distance and color. In figure 3.42, you can quickly determine both the two spatially distinct groups, and the spatially distinct colours.

[insert Figure 3.42 spatially distinct groups, spatially distinct colours]

Another important limitation of human perception to keep in mind is change blindness. When people are presented two pictures of the same scene, one after the other, and the second picture is missing some object that was in the first, it is surprisingly difficult to discern what has changed between the two images. The same holds true for animated / dynamic visualizations. We have difficulty holding in our minds the information from previous frames, and so while an animation seems a noble way of visualizing temporal change, it is rarely an effective one. Replacing an animation with small multiples, or some other static visualization, will improve the reader’s ability to notice specific changes over time.

Making an Effective Visualization

If choosing the data to go into a visualization is the first step, picking a general form the second, and selecting appropriate visual encoding the third, the final step for putting together an effective information visualization is in following proper aesthetic design principles. This step will help your visualization be both effective and memorable. We draw inspiration for this section from Edward Tufte’s many books on the subject, and Angela Zoss’s excellent online guide to Information Visualization.30

One seemingly obvious principle that is often not followed is to make sure your visualization is high resolution. The smallest details and words on the visualization should be crisp, clear, and completely legible. In practice, this means saving your graphics at a high resolution or creating them as scalable vector graphics. Keep in mind that most classroom projectors still do not have as high a resolution as a piece of printed paper, so creating a printout for students or attendees of a lecture may be more effective than projecting your visualization on a screen.
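
Most plotting tools make this a one-line decision at export time. The sketch below, in Python with matplotlib purely by way of illustration, saves the same simple chart once as a scalable SVG file and once as a high-resolution PNG; the file names and the dpi value are arbitrary choices.

    # export.py -- saving one figure as vector (SVG) and as high-resolution raster (PNG)
    import matplotlib.pyplot as plt

    plt.plot([1914, 1918, 1939, 1945], [10, 80, 25, 95])
    plt.xlabel('year')
    plt.ylabel('mentions')

    plt.savefig('figure.svg')            # vector: rescales to any size without blurring
    plt.savefig('figure.png', dpi=600)   # raster: a high dpi keeps small text legible in print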

Another important element of a visualization that is often left out is the legend, which describes each graphic variable in detail and explains how those graphic variables relate to the underlying data. Much visualization software does not automatically create legends, and so they become a neglected afterthought. A good legend makes the difference between a pretty but undecipherable picture and an informative scholarly visualization. Adobe Photoshop and Illustrator, as well as the free Inkscape and Gimp, are all good tools for creating legends.
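
If your chart is generated by a plotting library rather than assembled in an image editor, you can usually produce at least a serviceable legend directly in code and refine it afterwards. The following matplotlib sketch (with two invented data series) labels each series so that the legend states what each colour means.

    # legend.py -- labelling each series so the legend explains the graphic variables
    import matplotlib.pyplot as plt

    years = [1890, 1900, 1910, 1920]
    plt.plot(years, [12, 30, 45, 60], color='steelblue', label='letters sent')
    plt.plot(years, [5, 10, 25, 55], color='darkorange', label='telegrams sent')

    plt.legend(title='correspondence type')  # states what each colour stands for
    plt.xlabel('year')
    plt.ylabel('items per year')
    plt.savefig('with-legend.png', dpi=300)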

A good rule of thumb when designing visualizations is to maximize your data:ink ratio: maximize the ink that conveys data, and minimize the ink that does not. Extraneous lines, bounding boxes, and other design elements can distract from the data being presented. Figure 3.43 shows a comparison between two otherwise identical charts that differ only in the amount of extraneous ink.

[insert Figure 3.43 The same chart, with and without extraneous lines]
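
In practice, raising the data:ink ratio often means deleting chart furniture that your software adds by default. The matplotlib sketch below (with invented counts) strips the gridlines and most of the bounding box from a simple bar chart, leaving little besides the bars and their labels.

    # data_ink.py -- removing extraneous ink from a simple bar chart
    import matplotlib.pyplot as plt

    categories = ['north', 'south', 'east', 'west']
    values = [23, 17, 35, 29]

    fig, ax = plt.subplots()
    ax.bar(categories, values, color='grey')

    ax.grid(False)                            # no gridlines
    for side in ['top', 'right', 'left']:
        ax.spines[side].set_visible(False)    # drop most of the bounding box
    ax.set_ylabel('documents')
    fig.savefig('lean-chart.png', dpi=300)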

A related rule is to avoid chartjunk at all costs. Chartjunk refers to the artistic flourishes that newspapers and magazines stick in their data visualizations to make them more eye-catching: a man blowing over in a heavy storm next to a visualization of today’s windy weather, or a house crumbling down to represent the collapsing housing market. Chartjunk may catch the eye, but it ultimately distracts from the data, and readers will take more time to digest the information being presented to them.

Stylized graphical effects can be just as distracting as chartjunk. “Exploded” or “blown out” pie charts whose slices are pulled apart from one another, 3D bar charts, and the other stylistic quirks that Excel provides are poor window dressing and can actually decrease your audience’s ability to read your visualization. In a tilted 3D pie chart, for example, it can be quite difficult to visually estimate the area of each pie slice. The tilt makes the slices at the back seem smaller than those at the front, and the 3D effect confuses readers about whether they should be estimating area or volume.

While not relevant for every visualization, it is important to remember to label your axes and to make sure each axis is scaled appropriately. In particular, the vertical axis of a bar chart should begin at zero. Figure 3.44 is a perfect example of how to lie with data visualization: by starting the axis far too high, a small difference in the data is made to look like a large one.

[insert Figure 3.44 A chart that lies.]
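
Avoiding this kind of lie is usually a matter of stating your choices explicitly rather than accepting whatever the software decides. The matplotlib sketch below (with invented enrolment figures) labels both axes and makes the zero baseline explicit, so the bars are read against an honest scale.

    # honest_axes.py -- labelled axes and a zero baseline for a bar chart
    import matplotlib.pyplot as plt

    years = ['2010', '2011', '2012']
    enrolment = [1020, 1035, 1050]        # a genuinely small difference

    fig, ax = plt.subplots()
    ax.bar(years, enrolment, color='grey')

    ax.set_ylim(bottom=0)                 # starting higher would exaggerate the change
    ax.set_xlabel('year')
    ax.set_ylabel('students enrolled')
    fig.savefig('honest-axes.png', dpi=300)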

There is an art to perfecting a visualization. No formula will tell you what to do in every situation, but by following these steps (1. Pick your data, 2. Pick your visualization type, 3. Pick your graphic variables, 4. Follow basic design principles), the visualizations you create will be effective and informative. Combining this with the lessons elsewhere in this book regarding text, network, and other analyses should form the groundwork for producing effective digital history projects.

Conclusion

This chapter has covered a lot of ground: we have moved from the relatively simple province of word clouds, to the more sophisticated world of Overview and regular expressions, and hinted at even more advanced techniques that lurk in the shadows. They all share the same goal, however: taking a large amount of information and exploring it in ways that a person reading alone could not. How can we use our computers – these powerful machines sitting on our desks or laps – for more than just word processing, and begin to tap their computational potential? As we will note in the next chapter, however, a potential pitfall is that in most of these cases we still needed to know what we were looking for. Data do not ‘speak’ for themselves: they require interpretation, and they require visualization.

Scholars often learn by reading and sifting: looking through archival boxes, reading the literature, not with an eye to a particular research outcome but with the goal of understanding the field holistically, approached from a particular perspective (theory). We can do the same with Big Data repositories, trying to get a macroscopic view of the field, through methods including topic modeling and network analysis. In the next chapters, we build on the more targeted investigations here with a full-scale implementation of our historian’s macroscope.


  1. Adam Crymble, “Can We Reconstruct a Text from a Wordcloud?” Thoughts on Digital and Public History, 5 August 2013, http://adamcrymble.blogspot.ca/2013/08/can-we-reconstruct-text-from-wordcloud.html.

  2. http://www.antlab.sci.waseda.ac.jp/software.html.

  3. Stéfan Sinclair and Geoffrey Rockwell, Hermeneuti.ca – The Rhetoric of Text Analysis, http://hermeneuti.ca.

  4. There are more tools available in Voyant, by clicking on the ‘save’ icon in the top-right side of the page in the blue ‘Voyant Tools: Reveal Your Texts’ title bar. This icon opens a pop-up with five different export options. The first, ‘a URL for this tool and current data’ will provide you with a direct URL to your corpus which you may then share with others or return to at a later time; the final option, ‘a URL for a different tool/skin and current data’ will open another menu allowing you to select which tool you’d like to use. If you selected ‘RezoViz’ (a tool for constructing a network with organizations, individuals, and place names extracted from your texts), you would end up with a URL like this:

    http://voyant-tools.org/tool/RezoViz/?corpus=1394819798940.8347

    The string of numbers is the corpus ID for your texts. If you know the name of another tool, you can type it in after /tool/ and before /?corpus.

  5. Jonathan Stray has written an excellent piece on using Overview as part of a ‘data journalism’ workflow, many points of which are appropriate to the historian. See Jonathan Stray, ‘You Got the Documents. Now What?’, Source.OpenNews.org, 14 March 2014, https://source.opennews.org/en-US/learning/you-got-documents-now-what/.

  6. The documentation for Overview may be found at http://overview.ap.org/; the software itself can be downloaded at https://github.com/overview/overview-server/wiki/Installing-and-Running-Overview

  7. http://overview.ap.org/blog/2013/04/how-overview-can-organize-thousands-of-documents-for-a-reporter/

  8. As in this example http://overview.ap.org/blog/2013/07/comparing-text-to-data-by-importing-tags/

  9. Regular expressions are sometimes implemented differently depending on which program you are working with; they simply do not work in Microsoft Word. For best results, try TextWrangler (on Mac) or Notepad++ (on Windows).

  10. Notepad++ (for Windows) can be downloaded at http://notepad-plus-plus.org/. TextWrangler (for Mac) can be found at http://www.barebones.com/products/textwrangler/

  11. Available at http://notepad-plus-plus.org/ if you haven’t already installed it. On Mac, try TextWrangler, http://www.barebones.com/products/textwrangler/

  12. These are markers for ‘word boundaries’. See http://www.regular-expressions.info/wordboundaries.html

  13. Described in more detail at https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth.

  14. It’s worth pointing out that once you have cleaned data in CSV or TSV format, your data can be imported into a variety of other tools, or be ready for other kinds of analysis. Many online visualization tools like Raw (http://app.raw.densitydesign.org/ ) and Palladio (http://palladio.designhumanities.org/ ) accept and expect data in this format.

  15. http://nlp.stanford.edu/software/CRF-NER.shtml . You can develop classifiers for your own particular domains using the Stanford NER; more information in this regard is provided here: http://nlp.stanford.edu/software/crf-faq.shtml

  16. Michelle Moravec provides a good tutorial on NER on her blog at ‘How to use Stanford’s NER and Extract Results’ History in the City June 28 2014 http://historyinthecity.blogspot.ca/2014/06/how-to-use-stanfords-ner-and-extract.html

  17. See for instance ‘Dynamic networks in Gephi’ at our draft site for this book, http://www.themacroscope.org/?page_id=525

  18. While we do not go into detail about the JSON format, it is increasingly common as the data format for web-based visualizations built with the d3.js code library. A very good reference for working with this library and with JSON files is Elijah Meeks, D3.js in Action (Manning, 2014), http://www.manning.com/meeks/

  19. Stray, Jonathan. ‘You Got the Documents. Now What?’ Source, 14 March 2014. https://source.opennews.org/en-US/learning/you-got-documents-now-what/.

  20. http://tabula.nerdpower.org/. Since it is open source, you can fork it on GitHub to obtain and maintain your own copy, in the event that the original ‘Tabula’ website goes offline. Indeed, this is a habit you should get into.

  21. Gill, David W. J., and Christopher Chippindale. ‘Material and Intellectual Consequences of Esteem for Cycladic Figures’. American Journal of Archaeology 97, no. 4 (October 1993): 601. doi:10.2307/506716.

  22. Another open source project, ‘Raw’ lets you paste your data into a box on a webpage, and then render that data using a variety of different kinds of visualizations. http://app.raw.densitydesign.org/ Raw does not send any data over the internet; it performs all calculations and visualizations within your browser, so your data stays secure. It is possible (but not easy) to install Raw locally on your own machine, if you wish. Follow the links on the Raw website to its github code repository.

  23. Lancaster, Lewis. “From Text to Image to Analysis: Visualization of Chinese Buddhist Canon”, abstract for Digital Humanities (DH 2010), King’s College London, 7–10 July 2010, http://dh2010.cch.kcl.ac.uk/academic-programme/abstracts/papers/html/ab-670.html

  24. Ben Schmidt “What years do historians write about?” Sapping Attention May 9, 2013 http://sappingattention.blogspot.ca/2013/05/what-years-do-historians-write-about.html

  25. M.E.J. Newman, ‘Maps of the 2008 US Presidential Election Results’ 2008 http://www-personal.umich.edu/~mejn/election/2008/ One may also download Newman’s data to explore.

  26. http://pelagios.dme.ait.ac.at/maps/greco-roman/. This map is discussed in detail in Johan Åhlfeldt ‘A digital map of the Roman Empire’ 19 September 2012 http://pelagios-project.blogspot.ca/2012/09/a-digital-map-of-roman-empire.html

  27. http://lotrproject.com/map/ For a discussion about this project, see http://lotrproject.com/about/

  28. W. H. Smith, Graphic Statistics in Management, 1st ed. (New York: McGraw-Hill Book Company, 1924); reproduced in the Visual Complexity database at http://www.visualcomplexity.com/vc/project.cfm?id=10

  29. See Brian Abelson, ‘Finding Evidence of Climate Change in a Billion Rows of Data’, Source Open News April 22, 2014 https://source.opennews.org/en-US/articles/finding-evidence-climate-change-billion-rows-data/ for details.

  30. http://guides.library.duke.edu/content.php?pid=355157.