Text Mining: Wikipedia Article Vs. Opinion Article Word Association Mining

Learning about word association mining is of little use without practical application. I therefore ran an experiment that let me apply the technique discussed in my previous article to actual text documents. I chose two very different pieces of text to analyze. The first was an article from the opinion section of CNN about Chris Christie’s “beachgate” scandal. The second was the Wikipedia article on the definition and history of the title “Governor”.

 

Instead of trying to analyze the relationships between every pair of words in both documents, I decided to “fix” the value of word 1 in my entropy calculations to “governor”. This let me compare the words associated with “governor” in the opinionated article about Christie with those in the expository Wikipedia article about the title of Governor.

 

I implemented the mining algorithms in Python. I wrote functions that accept word 1, word 2, and the document text as input and calculate the entropy and mutual information measures. Each document was tokenized into individual sentences, and then into a “bag of words”, using the NLTK Python package. The entropy of the word “governor” was then calculated conditional on the occurrence of each word in the “bag of words”, with each tokenized sentence treated as one observation.

 

I will not go into detail about, or share, the actual Python code in this particular post. If you are interested in seeing the code that was written, feel free to leave a comment or send me a message. If there is enough interest, it may become the subject of a subsequent post.
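That said, for readers who want a feel for the calculation, here is a minimal sketch of the idea. It is not the script used in the experiment. It assumes each sentence is one observation and each word is a yes/no variable (present or absent in the sentence), it uses a plain regular-expression splitter as a stand-in for NLTK so that it runs without extra downloads, and the function names (`sentences_as_word_sets`, `entropy`, `association`) are made up for illustration.

```python
import math
import re
from collections import Counter


def sentences_as_word_sets(text):
    """Split text into sentences, then turn each sentence into a set of lowercase words.

    Assumption: a simple regex splitter stands in for NLTK's tokenizers.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [set(re.findall(r"[a-z']+", s.lower())) for s in sentences if s]


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def association(word1, word2, sentences):
    """Return (H(X), H(X|Y), I(X;Y)) where X = word1 present, Y = word2 present."""
    n = len(sentences)
    # Count sentences for each (word1 present?, word2 present?) combination.
    joint = Counter((word1 in s, word2 in s) for s in sentences)

    # H(X): entropy of word1's presence across all sentences.
    p_x = [(joint[(x, True)] + joint[(x, False)]) / n for x in (True, False)]
    h_x = entropy(p_x)

    # H(X|Y): entropy of word1's presence within the sentences that do
    # (or do not) contain word2, weighted by how common each case is.
    h_x_given_y = 0.0
    for y in (True, False):
        n_y = joint[(True, y)] + joint[(False, y)]
        if n_y == 0:
            continue
        h_x_given_y += (n_y / n) * entropy([joint[(x, y)] / n_y for x in (True, False)])

    return h_x, h_x_given_y, h_x - h_x_given_y


if __name__ == "__main__":
    sample = (
        "The governor signed the bill. The beach was closed. "
        "The governor visited the beach. Taxes went up."
    )
    sents = sentences_as_word_sets(sample)
    for w2 in ("governor", "beach", "bill"):
        h_x, h_cond, mi = association("governor", w2, sents)
        print(f"{w2}: H(X)={h_x:.3f}  H(X|Y)={h_cond:.3f}  MI={mi:.3f}")
```

Ranking every word in the bag of words by the third value returned and keeping the top 30 would give a table shaped like the one described next.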

 

After parsing the text documents and calculating the association measures, the Python script wrote the top 30 results to a text file. The text file contained the following fields:

 

  • Word 1 (fixed at “governor”)
  • Word 2 (each word in the “bag of words”)
  • Conditional entropy value
  • Entropy of Word 1
  • Mutual information (reduction of entropy given the presence or absence of Word 2)
  • Text (sentences where Word 1 and Word 2 co-occur)

 

The final stage of the experiment used Tableau Public to build an interactive visualization of the word relationships in each document (see below). Darker circles indicate a lower conditional entropy in relation to the word “governor”, and larger circles represent larger values of mutual information. As you can see, “governor” always has the largest and darkest circle. The conditional entropy of “governor” given “governor” is 0 (the lowest possible value), and the mutual information is at its highest because it equals the entropy of the word “governor”: H(X) - H(X|X) = H(X), since H(X|X) = 0.
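For reference, these are the standard definitions behind that statement. The notation is my own shorthand for this post: X is the presence (1) or absence (0) of “governor” in a sentence, Y is the same for the second word, and base-2 logarithms are assumed so that entropy is measured in bits.

```latex
\begin{align*}
H(X) &= -\sum_{x \in \{0,1\}} p(x)\,\log_2 p(x) \\
H(X \mid Y) &= \sum_{y \in \{0,1\}} p(y)\, H(X \mid Y = y) \\
I(X;Y) &= H(X) - H(X \mid Y) \\
H(X \mid X) &= 0 \;\Longrightarrow\; I(X;X) = H(X)
\end{align*}
```

The last line is why no other word can have a larger circle than “governor” itself: conditioning can never push H(X|Y) below 0, so mutual information is capped at H(X).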

 

The interactive versions of the widgets below can be found HERE

 

Opinionated CNN Article about Chris Christie:

Governor (Wikipedia)

For those who were able to view the interactive version of the visualization in Tableau Public, you may have noticed that the circles for “east”, “india” and “company” are fairly large and dark. However, when hovered over, they show no text, because “governor” did not co-occur with any of the three words. This is a good illustration of the fact that words do not need to co-occur in order to show a syntagmatic relationship. The result tells us that the absence of these three words increases the likelihood of the occurrence of the word “governor”. Inverse relationships such as these can be filtered out if need be. However, to test this phenomenon of inverse word relationships, I added another sentence to the sample of sentences I took from the Wikipedia article. I made sure that this sentence contained both the word “governor” and the word “india”. The resulting visualization can be seen below. The word “india” no longer has the same impact on the likelihood of the occurrence of the word “governor”. Now, when the word “india” appears, the probability of the word “governor” appearing is closer to 50%, which is where entropy, or randomness, reaches its maximum. Therefore, the strength of the relationship between these two words is weakened.
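To see why a probability near 50% is the worst case, here is a small sketch of the entropy of a single yes/no event. It assumes base-2 logarithms, and the helper name `binary_entropy` is made up for illustration; it is not taken from the script used in the experiment.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a yes/no event that occurs with probability p."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no uncertainty
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(f"p = {p:.2f}  entropy = {binary_entropy(p):.3f} bits")
```

The values rise from 0 at p = 0 to a peak of 1 bit at p = 0.5, then fall back to 0 at p = 1. A word that always or never appears alongside “governor” keeps the conditional probability near one of the ends, while a mix of both cases moves it toward the middle, where that word tells us least.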

In conclusion, it is important to note that this type of analysis may serve as a gateway to deeper investigation. Perhaps these related words could be used for document generation, to summarize a topic within the realm of topic modeling. Or one might analyze the polarity of each associated word to discover hidden sentiment about a noun (e.g., “governor”). In other words, this is only the beginning.

ryan harrison
rh64@njit.edu