SQL: Combining tables without JOIN

SQL is a versatile language: there are many ways to get the job done.

One way to combine information from multiple tables is the JOIN clause, but there are sometimes alternatives.

Let’s say I’ve got two tables, office and person:

(office: office_num, num_chairs, has_printer; person: name, office_room)

Let's say I need to know which people have printers. I could do a JOIN ON office_room = office_num (shown for comparison below). On the other hand, if you just SELECT FROM multiple tables at the same time, you can filter with WHERE:

SELECT name, has_printer FROM office, person WHERE office_room = office_num;

(Result: each person's name with their office's has_printer value)
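For comparison, the JOIN version mentioned above might look something like this (a sketch using the same column names; it should return the same rows):

-- Inner join: keep only the pairs where the person's room matches the office
SELECT name, has_printer
FROM person
JOIN office
  ON office_room = office_num;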

It turns out that SELECT ... FROM tbl1, tbl2 is actually the same as a CROSS JOIN, also known as the Cartesian product of the two tables: every row of the first table is paired with every row of the second. Use WHERE to keep only the rows you want.
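Spelled out explicitly, the comma form above could be written as a CROSS JOIN filtered by WHERE (a sketch; it should behave identically):

-- Pair every office with every person, then keep the matching rows
SELECT name, has_printer
FROM office
CROSS JOIN person
WHERE office_room = office_num;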

Now let’s say I want to know the number of chairs left in each office. One way would be to JOIN, something like:

SELECT office_num,
       num_chairs - COUNT(name) AS chairs_left
FROM person
JOIN office
  ON office_room = office_num
GROUP BY office_num, num_chairs;

But another option is a correlated subquery, where the subquery refers to a column in the outer query, and doesn’t use JOIN at all:

SELECT office_num,
       num_chairs -
         (SELECT COUNT(name)
          FROM person
          WHERE office_room = office.office_num)
       AS chairs_left
FROM office;

(Result: office_num and chairs_left for each office)

Correlated subqueries can carry a performance hit, since the inner query may be re-evaluated for each row of the outer query, but they can be quite handy.

Note: these aren't best-practice recommendations, just handy syntax to be aware of that is sometimes convenient.


Information Criteria & Cross Validation

A common problem with predictive models is overfitting. This happens when the model is too responsive and picks up trends that are due to quirks of a specific sample rather than general, underlying trends in the data-generating process. The result is a model that doesn't predict very well. A model can be made less responsive through regularization, i.e. penalizing the model for complexity, and/or by reducing the number of parameters.
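For concreteness, one common form of regularization is a ridge-style penalty, which adds a term proportional to the squared size of the parameters to the usual least-squares objective (a sketch, using \lambda for the penalty strength):

\min_{\theta} \; \sum_{i=1}^{n} \big(y_i - f_\theta(x_i)\big)^2 + \lambda \lVert \theta \rVert_2^2

Larger \lambda penalizes complexity more heavily, making the fitted model less responsive.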

Of course, a completely unresponsive model isn't right either, so a balance must be struck. But how much regularization is the right amount? What is the right number of parameters? There are (at least) two main ways to decide: cross validation and information criteria.

Cross validation

Cross validation is the practice of holding some of the data (20-40% of it) out of the training set and using it to select between different models. The idea is that the held-out data is 'fresh' to the model, and is thus representative of new data the model would face in production. The candidate models can all be tested against the held-out data, and the one that scores best is selected.
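In symbols (a sketch, assuming a loss function \ell such as squared error): if \hat{f} is a model fit on the training set and V is the held-out set, each candidate is scored by its average held-out loss

\frac{1}{\lvert V \rvert} \sum_{i \in V} \ell\big(y_i, \hat{f}(x_i)\big)

and the model with the lowest score is chosen.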

Information criteria

Information criteria are a slightly different approach. In general, they rely on calculating the joint likelihood of the data under the model and taking minus two times its log, a quantity known as the 'deviance'. An information criterion is then some function of the deviance; the Akaike Information Criterion (AIC), for instance, is the deviance plus two times the number of parameters. Like cross validation, information criteria can be used for model selection.
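In symbols, with \hat{\theta} the fitted parameters and k the number of free parameters:

D = -2 \log p(\text{data} \mid \hat{\theta}), \qquad \mathrm{AIC} = D + 2k

Lower AIC is better; the 2k term is the penalty for extra parameters.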

Calculating the deviance requires a model that includes an error term, not just a point estimate (otherwise the likelihood of any observed data point is simply zero). In some model-fitting methods (e.g. the normal equation for linear regression), the error isn't explicitly estimated, so information criteria can't be used for these 'off-the-shelf' types of models.
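For example (a sketch, assuming linear regression with Gaussian errors), the log-likelihood depends on an explicit error scale \sigma as well as the point predictions \hat{y}_i:

\log \mathcal{L} = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

Without an estimate of \sigma, the deviance, and hence the AIC, can't be computed.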

When they are used

Information criteria are often used in the context of Bayesian modeling, where the model explicitly includes an error term and quantifies uncertainty in all parameters. Information criteria are somewhat abstract, but they seem more soundly based, theoretically speaking.

In contrast, cross validation can be used even when error and uncertainty are not modeled. Additionally, cross validation is very practical, and its underlying principle makes intuitive sense even to machine learning novices.

Overall, both methods are highly useful and informative. Information criteria may be more sound, theoretically speaking, and may appeal to academic types for that reason. In general, however, more people are likely to be familiar with cross validation, and it's probably easier to explain and sell that technique to a non-technical audience.