Confusion matrices are simple in principle: four numbers that describe the performance of a binary classifier. Yet fully understanding how a confusion matrix behaves and what it means is far subtler than those four numbers suggest. This is a confusion matrix. Looks simple, right? Accuracy The…
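As a rough illustration of the "four numbers" mentioned above, here is a minimal Python sketch that tallies them for a binary classifier and derives accuracy from them. The function names (`confusion_counts`, `accuracy`), the 0/1 label coding, and the toy labels are assumptions made for this example; they are not taken from the original post.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded as 0/1 (assumed coding)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Count each (actual, predicted) pair once.
    counts = Counter(zip(y_true, y_pred))
    tp = counts[(1, 1)]  # actual positive, predicted positive
    fp = counts[(0, 1)]  # actual negative, predicted positive
    fn = counts[(1, 0)]  # actual positive, predicted negative
    tn = counts[(0, 0)]  # actual negative, predicted negative
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that were correct."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else float("nan")


# Hypothetical toy labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy(tp, fp, fn, tn):.2f}")
```

Every other summary metric is some ratio of these same four counts, which is part of why the matrix is less simple than it looks.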

## Latent semantic indexing: practical example of document categorization with topic modeling

Let's talk about latent semantic indexing. We all know machine learning is magical. One of the most magical things I've seen machine learning do is document categorization. What is latent semantic indexing? Let's say you have 10,000 text documents and you'd like to know what categories or topics exist. Machine learning can define the categories…

## Explainable machine learning: the next frontier

It's time for machine learning to grow up. Machine learning has proved valuable in many areas of technology and business. ML has a seat at the table, but for ML to truly mature, we need to know we can trust it. For ML to integrate more fully and contribute everything it can, people need to…

## The problems with the simulation argument

We're all living in a simulation. Or so the argument goes. Lots of people, not least Elon Musk, make this argument or something like it: Technology increases over time. Eventually, the technology to create a simulation as complex as our world will exist. Therefore life-like simulations are inevitable and more numerous than base realities. Ergo,…

## The limits of data

Beware someone who says data can solve any problem. They're naive, malicious, or won't be around to deal with the aftermath. Of course, data can accomplish a lot. As a data scientist, my job depends on it. But if we're being honest, the limitations are becoming more and more obvious. As a community, data scientists have…

## 5 Pitfalls & Solutions for Today’s Data Leaders

The data revolution is in full swing: data science practitioners are prospering and creating huge value for their companies. Despite this success, data science leaders across the industry are facing stress and difficult conditions. Data leaders must avoid these pitfalls to succeed and generate value for their organizations. Limited Experience Pitfall #1: Because the field…

## How I got into data science

In the third year of my PhD, I had a decision to make. Research, especially the academic kind, just wasn't working for me. I was getting results and publications, but I knew it could never be a lifelong career for me. Things moved slowly. I disliked the territorialism and being a master of abstractions, outside…

## Analyzing resumes with data science

Ever wonder if there is a "secret code" for resumes, some keywords that will actually make you stand out? It turns out there are indeed some very characteristic differences between experienced and novice resumes. Gathering Resume Data: Resumes from an indeed.com resume search were used to analyze the differences between experienced and inexperienced resumes. For maximum…

## What I’m reading/watching/taking

Books, videos and courses I've done, am doing or want to do, for the curious and for my own reference. Read/watched/took: Machine Learning with Andrew Ng; Biostatistics Boot Camp 1&2 with Brian Caffo; Data Science For Business; CS231N with Andrej Karpathy; Statistical Rethinking; Mathematical Statistics and Data Analysis; Introduction to Statistical Learning…

## Linear Regression vs. Decision Trees: Handling Outliers

In regression tasks, it's often assumed that decision trees are more robust to outliers than linear regression. See this Quora question for a typical example. I believe this is also mentioned in the book "Introduction to Statistical Learning", which may be the source of the notion. Predictions from a decision tree are based on the…