Text Mining: A Short Overview of Word Association Mining
Text mining seeks to extract meaningful knowledge from unstructured text “documents”. It is often done on data found on websites such as Twitter or Wikipedia. There are extraordinary opportunities for knowledge extraction from text. Some use cases involve automatically organizing and indexing a body of text according to its content. One can also extract sentiment to analyze the consensus of a group of authors on a particular topic. In fact, topics themselves can be detected and inferred through probabilistic linguistic analysis. This is just the 1,000 ft view of the general topic of text mining.
My main purpose in writing this article is to touch the tip of the iceberg that is text analysis. So, let’s begin with the concept of mining the associations between words in a document. Word association mining often precedes other text mining activities such as opinion mining and topic modeling. The goal of word association mining is to find words that are related either because they tend to occur together (a syntagmatic relation) or because they can stand in for one another (a paradigmatic relation).
Paradigmatically related words share a paradigm. One should be able to swap these words within a sentence without losing too much meaning. For example, “ketchup” and “mustard” have a paradigmatic relation. These words tend to appear in similar contexts. Another example of paradigmatically related words is “pen” and “pencil”. Mining words such as these aids in detecting documents that may be comparing and contrasting two different entities that serve a similar purpose. Mining this type of relationship is typically done by computing the similarity between the words that appear to the left and right of the two words we are assessing. We can treat the left and right words as a “bag of words” and compute the cosine similarity, the Jaccard index, or use the EOWC (expected overlap of words in context) method. By applying a threshold to the result of our computation, we can decide whether the two words have a paradigmatic relation. I will not go into the details of the computation at this point in time. This will be the topic of a future post.
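As a rough preview of the idea, here is a minimal Python sketch of comparing the left and right contexts of two words. The function names (context_bag, cosine_similarity, jaccard_index), the one-word window, and the toy sentence are all my own assumptions for illustration. EOWC is left out here, and a real analysis would use a much larger corpus and probably some term weighting.

```python
import math
from collections import Counter


def context_bag(tokens, target, window=1):
    """Count the words within `window` positions left or right of `target`."""
    bag = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                bag[tokens[j]] += 1
    return bag


def cosine_similarity(a, b):
    """Cosine of the angle between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_index(a, b):
    """Shared context words divided by all context words (counts ignored)."""
    union = a.keys() | b.keys()
    if not union:
        return 0.0
    return len(a.keys() & b.keys()) / len(union)


# Toy corpus, invented for this example.
tokens = "i put ketchup on my fries and i put mustard on my hotdog".split()

ketchup = context_bag(tokens, "ketchup")
mustard = context_bag(tokens, "mustard")
fries = context_bag(tokens, "fries")

print(cosine_similarity(ketchup, mustard), jaccard_index(ketchup, mustard))
print(cosine_similarity(ketchup, fries), jaccard_index(ketchup, fries))
```

In this toy sentence “ketchup” and “mustard” are both surrounded by “put” and “on”, so their contexts score as highly similar, while “ketchup” and “fries” share no context words at all.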
Words that share a syntagmatic relation can be used together in the same sentence. “Actor” and “movie” would have a syntagmatic relation because an “actor may be featured in a movie”. Syntagmatic relations are useful when trying to find positive or negative words that are used in the same sentences as a noun of interest. For instance, the fact that the noun “LinkedIn” was used in conjunction with the words “useful” and “enlightening” might be of interest if you wanted to see, in general, what was being said about LinkedIn.com on a product review site. Furthermore, syntagmatic analysis can be applied to topic modeling. Given a particular topic, one can create a profile of words that are “connected” in a way that is unique to that topic.
Syntagmatic relationships are typically found by calculating the conditional entropy or mutual information between two terms. Entropy is a measure of the randomness (or impurity) in a data set. The higher the entropy, the higher the impurity and randomness. Conditional entropy measures how random the occurrence of word#1 remains once we know whether word#2 occurs. If the conditional entropy is low, the occurrence of word#2 makes the occurrence of word#1 more predictable, so the two words are more likely to have a syntagmatic relationship. Mutual information measures how much the randomness of word#1 is reduced by knowing about the presence of word#2. So we want to find word pairs that have low conditional entropy and high mutual information.
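To make these quantities a little more concrete, here is a small Python sketch that treats each sentence as a segment and each word as a binary variable (present or absent in the segment). The function names, the choice of sentences as segments, and the four toy sentences are my own assumptions, not a definitive implementation. In practice you would want far more text and some smoothing of the probabilities.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a present/absent variable with P(present) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def presence_entropy(segments, w1):
    """H(X_w1): randomness of whether w1 appears in a segment."""
    p = sum(w1 in s for s in segments) / len(segments)
    return binary_entropy(p)


def conditional_entropy(segments, w1, w2):
    """H(X_w1 | X_w2): randomness left in w1 once we know whether w2 appears."""
    total = len(segments)
    h = 0.0
    for present in (True, False):
        subset = [s for s in segments if (w2 in s) == present]
        if not subset:
            continue
        p_w1 = sum(w1 in s for s in subset) / len(subset)
        h += (len(subset) / total) * binary_entropy(p_w1)
    return h


def mutual_information(segments, w1, w2):
    """I(X_w1; X_w2) = H(X_w1) - H(X_w1 | X_w2)."""
    return presence_entropy(segments, w1) - conditional_entropy(segments, w1, w2)


# Toy segments, invented for this example; each sentence becomes a set of words.
segments = [set(s.split()) for s in [
    "the actor starred in the movie",
    "the movie featured a famous actor",
    "the dog chased the ball",
    "the cat slept all day",
]]

for other in ("movie", "the"):
    print(other,
          conditional_entropy(segments, "actor", other),
          mutual_information(segments, "actor", other))
```

Here “movie” perfectly predicts whether “actor” appears, so the conditional entropy drops to zero and the mutual information is high. A word like “the”, which appears in every sentence, tells us nothing about “actor”, so the conditional entropy stays as high as the plain entropy and the mutual information is zero.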
I hope to go into more detail about how to compute paradigmatic and syntagmatic relationships using EOWC and conditional entropy in my subsequent posts. I will also be posting a mini experiment I conducted with articles about the July 4th beach scandal in New Jersey.