Let’s talk about latent semantic indexing.
We all know machine learning is magical. One of the most magical things I’ve seen machine learning do is document categorization.
What is latent semantic indexing?
Let’s say you have 10,000 text documents and you’d like to know what categories or topics exist. Machine learning can define the categories and label each of the documents in the corpus de novo, without any previous knowledge of which categories might be there.
To do this, you don’t need any fancy neural nets. In fact, all you need is linear algebra. The method I’m writing about here sometimes goes by the name “latent semantic indexing” (LSI), and it has been written about in the scholarly literature.
In this post, I’m taking a more practical approach.
I usually do some basic preprocessing like converting to lower case and removing punctuation (Python):
lower_doc = document.lower()
no_punc_doc = ' '.join( re.findall( '[a-z]+', lower_doc ) )
Once you have a corpus of processed documents, it’s ready to be vectorized. There are different ways to vectorize, but one of the simplest and most effective is TF-IDF (term frequency-inverse document frequency).
TF-IDF ignores word order and weights uncommon terms (term occurs in few documents in the corpus) and frequent terms (term occurs more than once in a given document) more highly. Here’s a sample implementation using scikit-learn in Python:
tfidf_vectorizer = TfidfVectorizer(
tfidf = tfidf_vectorizer.fit_transform( corpus )
After TF-IDF transformation, the corpus is represented as a matrix of numbers. Each row is a document and each column is a feature/term. A zero means that word doesn’t appear in the document. A non-zero number means the word appears at least once. The higher the number, the more frequently the term appears in the document, and the more uncommon the term is among all documents in the corpus.
A corpus represented as a TF-IDF matrix is ‘sparse’, which means that many of the elements are zero. Most words in the vocabulary don’t occur in a given document.
By default, each row (document) is L2-normalized, meaning that the euclidean ‘length’ of each row equals 1. In my experience, L2 normalization tends to work well for document categorization. As a bonus, the dot product of L2 normalized vectors is the same as their cosine similarity.
The next step is to look for internal structure within the corpus. In mathematical terms, we’ll use a singular value decomposition (SVD) to look for common axes of variation.
As with any SVD, we get to choose how many of the top axes we want to look at. In practical terms, this means we get to choose how many categories we want to break our corpus into.
So how does it work?
In practical terms, we look for which words most frequently occur in the same document together.
In mathematical terms, this means looking for vectors that approximately represent the whole corpus. Specifically, we specify k number of categories in a corpus of n documents with m features. Non-negative matrix factorization (NMF) will find k m-dimensional ‘category vectors’ that represent archetypal documents in the corpus. It also finds n k-dimensional ‘compressed’ documents, which represent the documents as a linear combination of categories.
C ~ rep * cat
dim(C) = [n,m]; dim(rep) = [n,k]; dim(cat) = [k,m];
n = number of documents
m = number of terms in vocabulary (number of features)
k = number of topics
from sklearn.decomposition import NMF
nmf = NMF() ## there are some parameters like alpha and l1_frac to play around with here
compressed_docs = nmf.fit_transform(tfidf, n_components=10)
topics = nmf.components_
## tfidf ~ compressed_docs * topics
## tfidf.shape == [n,m]
## compressed_docs.shape == [n,k]
## topics.shape == [k,m]
Non-negative matrix factorization (NMF) is a type of singular value decomposition where input and output values are restricted to be non-negative. This happens to work well for document categorization. As with other types of SVD, one metric to look at is the fraction of variance explained. As the number of topics increases, we can look for an ‘elbow’ where the fraction of variance explained increases more slowly. This may suggest a ‘natural’ number of topics.
As I said, NMF uses word co-occurrence to create distinct groups of words, which then become the labels that correspond to categories.
In an extreme example, imagine you have a corpus of French and English documents. Let’s say you use NMF to find the two most distinctive categories. The two category vectors would then look something like the average English document and the average French document. The representation matrices would then look approximately like a [1,0] or a [0,1] vector, depending on which language it was.
Now that we have the topic vectors, we have to find out whats in them. We will look at the most characteristic words in each topic. This will give us a good general idea about the semantic meaning of the topic.
In my experience, the topics tend to be fairly distinct and contained. It’s not unusual to have one or more ‘miscellaneous’ topics, which aren’t well-demarcated. You can either recategorize the whole corpus with more topics, or categorize the miscellaneous topic alone to break it up into distinct chunks.
import numpy as np
words = tfidf_vectorizer.get_feature_names()
for topic_number in topics.shape:
topic = topics[topic_number,:]
word_indices = np.argsort(topic)[::-1][:20]
print( ', '.join( [ words[ind] for ind in word_indices ] ) )
We also have to label all of our documents. In compressed form, each document is represented as a linear combination of topics. The simplest way to assign a document to a topic is to look at which coefficient of the compressed document is greatest.
doc_labels = [np.argmax( compressed_docs[i,:] ) for i in range(n) )]
Limitations and caution
This method works well, is easy and fast to implement. On a corpus of 10,000 documents of 500 words each, it should run on a Macbook Pro in under a minute.
Still, as with other machine learning algorithms, it’s not perfect. While there is no strict definition, some documents will clearly be miscategorized. Depending on expectations, even a small number of mistakes could be a deal-breaker, and you may have to resort to hand-categorization.
This algorithm is the kind of thing that data scientists like to geek out over. Elegant math, surprisingly good results, and nice implementations in scikit-learn. In a work situation, though, the business applicability of this kind of topic modeling is less clear. Using algorithms for tasks like churn modeling and product recommendation generate clear ROI. But this type of topic modeling can sometimes seem more useful than it is, just cause it’s so darn cool. So be careful how much time you spend on it at work.