Hello all! Welcome back!
This week I am talking about distant reading and text mining.
Distant reading means looking broadly across articles or other
documents to pull out patterns. This approach is extremely
useful when you are beginning research and need to see how
frequently terms are used, or are trying to determine which synonyms are worth
investigating. It can also help surface hidden tones and reveal
perceptions and biases of the time. One example of this is looking back at
reports during the US Civil War.
I thought Ayers's (2011) New York Times article was really
fascinating. As far as US history goes, the Civil War era is my favorite to
read and learn about. He stated that using computer-aided technology helps to build
a better understanding of a region from large amounts of sources, and that
these methods can elucidate alternative conclusions. In that
article, Ayers (2011) was able to identify a different “primary cause” of the Civil War.
It is interesting that these computer-aided tools can help uncover patterns that
are otherwise difficult to see.
Another interesting way to dig through a lot of information
is text mining. Text mining looks at the frequency of words or topics within a
certain period of time. Ewing et al. (2014) provide an excellent example in
their article on the flu epidemic. They used two text-mining methods: topic modeling
and tone classification. Through their work they were able to uncover how often
different words were used in reports during different stretches of time, both within the
local community and outside of it. They also looked at the tone of newspaper
reports about the flu. Ewing et al. (2014) developed four classifications: alarmist,
warning, reassuring, and explanatory. These were created to determine how the
tone in reporting prompted public health intervention. Through this exercise,
they were able to see the tone of reporting shift from the beginning of the
epidemic to the end.
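To get a feel for how tone classification might work, here is a minimal sketch in Python. The keyword lists and the simple match-counting rule are my own invented stand-ins, not Ewing et al.'s actual classifier, which was far more sophisticated; this only illustrates the basic idea of scoring an article against each tone category.

```python
from collections import Counter

# Hypothetical keyword lists -- invented for illustration only;
# Ewing et al.'s real classification did not work this simply.
TONE_KEYWORDS = {
    "alarmist": {"deadly", "panic", "ravaging"},
    "warning": {"caution", "avoid", "spread"},
    "reassuring": {"mild", "calm", "improving"},
    "explanatory": {"symptoms", "cause", "treatment"},
}

def classify_tone(article_text):
    """Score an article against each tone's keyword list and
    return the tone with the most keyword matches."""
    words = article_text.lower().split()
    scores = Counter()
    for tone, keywords in TONE_KEYWORDS.items():
        scores[tone] = sum(1 for w in words if w in keywords)
    return scores.most_common(1)[0][0]

print(classify_tone("Officials urge caution as the flu may spread further."))
```

Running a classifier like this over every article in a date range is what lets you chart how tone shifted as the epidemic progressed.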
After reading these
articles and beginning to understand their use, I wanted to see how I could
employ these methods in my own research. I attempted to use three different tools:
Google Ngram, Voyant, and JSTOR Data for Research. While playing around with
Google Ngram and Voyant was fun, I was unable to figure out how to use
JSTOR Data for Research. This is more likely an issue on my end than with the
tool itself.
The Google Ngram Viewer is a pretty cool tool. As a test, I used the phrases
Beauty and the Beast, Cinderella, Snow White, and Rapunzel. Then I compared the
frequencies of these phrases across the American English, British English,
French, and German corpora. It was interesting that there would be such a
difference (as seen below).
Then I wanted to see the results for terms I would use in my own research, so I searched for peasant, Christian, and religion. Again, I used the English, French, and German corpora. The reason for this is that information relevant to my research is not likely to be found in English, but rather in French or German, so I wanted to see if there were changes and whether I could reveal anything about them (see below). I wonder how searching in the French or German language would change the results.
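Under the hood, what an n-gram viewer plots is just each term's share of all the words published in a given year. Here is a tiny sketch of that calculation; the two-sentence "corpus" is invented purely for illustration, nothing like the scale of Google's actual data.

```python
# Toy stand-in for a dated corpus -- these snippets are invented
# for illustration; the real Ngram data covers millions of books.
corpus_by_year = {
    1850: "the peasant worked the land while the christian church grew",
    1900: "religion and the church shaped daily life for the peasant",
}

def term_frequency(term, year):
    """Share of that year's tokens matching the term -- the
    quantity an n-gram viewer charts over time."""
    tokens = corpus_by_year[year].lower().split()
    return tokens.count(term.lower()) / len(tokens)

for year in sorted(corpus_by_year):
    print(year, round(term_frequency("peasant", year), 3))
```

Repeating this for each year and each corpus (English, French, German) is what produces the diverging curves in the charts.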
Voyant provides a really cool visual of the words used most often
in a document. For this example, I used my Master’s thesis to create a word
cloud of the top 55 words (see below). While these tools are
undoubtedly useful (and pretty interesting), I am not sure how much they
would contribute to my own research, but they are definitely worth investigating!
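Before drawing a word cloud, a tool like Voyant essentially counts word frequencies, drops common stopwords, and keeps the top N. A rough sketch of that step (the sample sentence and tiny stopword list are my own placeholders, not Voyant's actual defaults):

```python
from collections import Counter

# Minimal stopword list -- a placeholder; Voyant ships a much
# longer list and lets you customize it.
STOPWORDS = {"the", "of", "and", "a", "to", "in", "is", "that", "it"}

def top_words(text, n=55):
    """Count word frequencies, skip stopwords, return the n most
    common (word, count) pairs -- the data behind a word cloud."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    counts = Counter(w for w in words if w and w not in STOPWORDS)
    return counts.most_common(n)

sample = "Isotopes in bone reveal diet; isotopes in teeth reveal migration."
print(top_words(sample, n=3))
```

The word cloud then simply scales each word's display size by its count.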
Thanks for stopping by! I hope you enjoyed your visit.
-The Migrant Isotopist
Articles I included in case you want to check them out for
yourself:
Ewing et al. (2014) https://www.historians.org/publications-and-directories/perspectives-on-history/january-2014/mining-coverage-of-the-flu-big-datas-insights-into-an-epidemic
Website links to the tools I used, if you want to play around
with them (warning: they can be addicting):