Wednesday, September 25, 2013

Text Analysis

This week in class we will be talking about text analysis, also known as text mining.  This is simply the process of deriving high-quality information from text.  But how does this benefit us in our field of history, and how do you even start this process?  Ted Underwood gives a simple account of what text analysis can do and how to get started in his article "Where To Start With Text Mining."

To begin gathering data, you will first need a good selection of quality, readable information.  As Underwood mentions in his article, a great deal of this information can be found on JSTOR.  JSTOR, short for Journal Storage, was founded in 1995.  It is a massive collection of academic journals and primary sources, and books are now beginning to find their way into it as well.  Around 80% of this is usable data that does not need to be translated.  However, there are some readings that you cannot pull text from directly, because the poor scan quality of the actual text limits its transferability.  Several programs are being developed to correct this problem, which brings the amount of transferable text up to 98% in some cases.

Text mining and the quantitative methods it requires can benefit us in several ways.  The results can be used to categorize information, contrast the vocabulary of different texts, trace the history of words or phrases over time, cluster features that are associated across different documents, extract entities, and visualize data.  This information can help historians understand what a word or phrase meant in the past and how that meaning has changed.  In short, a word used today could have meant something totally different 80 years ago.  This helps us as researchers to ensure that we fully understand the past when we read about it in historical accounts.
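To make "tracing the history of a word over time" a little more concrete, here is a minimal Python sketch.  It is only an illustration, not the method Underwood describes: the tiny corpus, the function names, and the choice of the word "broadcast" are all my own assumptions.  It counts how often a word appears in each year's documents and divides by the total number of words for that year.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def relative_frequency_by_year(documents, term):
    """Return {year: share of that year's tokens that equal `term`}.

    `documents` is an iterable of (year, text) pairs.
    """
    term = term.lower()
    term_counts = Counter()
    total_counts = Counter()
    for year, text in documents:
        tokens = tokenize(text)
        total_counts[year] += len(tokens)
        term_counts[year] += tokens.count(term)
    return {
        year: term_counts[year] / total_counts[year]
        for year in sorted(total_counts)
        if total_counts[year] > 0
    }


# Hypothetical toy corpus, invented purely for illustration.
sample_documents = [
    (1850, "The farmer broadcast the seed across the field."),
    (1950, "The radio broadcast reached every farm in the county."),
    (1950, "The county fair opened on Monday."),
]

frequencies = relative_frequency_by_year(sample_documents, "broadcast")
for year, share in frequencies.items():
    print(year, round(share, 3))
```

A count like this only shows how often a word was used, not what it meant.  To see a shift in meaning you would still need to read the surrounding sentences, which is where the historian comes back in.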

After Google launched its web search API in 2002, Roy Rosenzweig took it upon himself to develop a similar search tool that could be used on documents in a "document classification," or a large number of texts and syllabi.  The ability to search a large number of documents pertaining to a subject gives us as researchers a quick and easy way to find a phrase used in those documents.  Clemson's library and many of its databases utilize this technology, which has made it quite simple to search among the hundreds of thousands of documents stored within the databases.
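The basic idea of searching a pile of documents for a phrase can be sketched in a few lines of Python.  This is a rough illustration under my own assumptions, not how Rosenzweig's tool or the library databases actually work: the course titles, the sample text, and the `search_phrase` function are all hypothetical.

```python
def search_phrase(documents, phrase, context=30):
    """Return (title, snippet) for every document that contains `phrase`.

    `documents` maps a title to its full text. Matching ignores case,
    and `context` is the number of characters kept on each side of the match.
    """
    needle = phrase.lower()
    hits = []
    for title, text in documents.items():
        position = text.lower().find(needle)
        if position != -1:
            start = max(0, position - context)
            end = min(len(text), position + len(needle) + context)
            hits.append((title, text[start:end].strip()))
    return hits


# Hypothetical syllabi, invented purely for illustration.
sample_syllabi = {
    "HIST 101": "This course surveys the American Revolution and the early republic.",
    "HIST 220": "Readings cover the Industrial Revolution in Britain.",
    "HIST 305": "We examine the causes of the american revolution in detail.",
}

results = search_phrase(sample_syllabi, "American Revolution")
for title, snippet in results:
    print(title, "->", snippet)
```

A real database would build an index ahead of time rather than scanning every document on each search, but the result for the researcher is the same: a list of documents that mention the phrase, with a bit of context around each match.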

Using the information extracted from these documents, historians can then develop visualizations.  This is particularly important for the younger generations, as people today have a significantly shorter attention span when it comes to reading data than past generations did.  With the development of technology, there is no need to have people read through large amounts of information when a visual aid can provide the same information in a single picture.  This also helps historians convey history in a simple and concise manner that will make a larger impact on the reader.  Finally, I feel like history is being brought to life and made much more understandable to people outside of the history profession.  This in turn is making people much more interested in history, instead of dreading the class because all they see is dates, people, and places.

**UPDATE ON DIGITAL HISTORY PROJECT**
The groundwork has been laid to get the Fort Hill project in motion.  Information is being developed so that this particular site can be viewed digitally and can provide research assistance to individuals who cannot make the trek to Clemson's campus.  This in turn can give Clemson the ability to advertise its history on the internet, bringing attention to some of the things Clemson has to offer.
