A common problem with predictive models is overfitting. This happens when the model is too responsive and picks up trends that are quirks of a specific sample rather than general, underlying trends in the data-generating process. The result is a model that doesn’t predict new data very well. A model can be made less responsive by regularization (i.e. penalizing the model for complexity) and/or by reducing the number of parameters.
Of course, a completely unresponsive model is not correct either, so a balance must be struck. But how much regularization is the right amount? How many parameters is the right number? There are (at least) two main ways to test: cross validation and information criteria.
Cross validation is the practice of leaving some of the data out of the training set (typically 20-40% of the data) and using it to select between different models. The notion here is that the left-out data is ‘fresh’ to the model, and is thus representative of new data that the model would face in production. Each candidate model is tested against the left-out data, and the one that scores best is selected. In its simplest form this is a single holdout split; k-fold cross validation repeats the process, rotating which portion is held out, and averages the scores.
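To make the selection step concrete, here is a minimal sketch of a single holdout split used to choose a regularization strength. Everything in it is an assumption made for illustration: the synthetic data, the one-feature ridge model with no intercept, the candidate penalties, and the helper names (`fit_ridge_slope`, `holdout_select`). It uses only the Python standard library.

```python
import random


def fit_ridge_slope(xs, ys, penalty):
    """Slope of a no-intercept line y = b * x, shrunk toward zero by an L2 penalty."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)


def mean_squared_error(xs, ys, slope):
    """Average squared prediction error of the line y = slope * x."""
    return sum((y - slope * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def holdout_select(xs, ys, penalties, holdout_fraction=0.3, seed=0):
    """Fit one model per penalty on the training split, score each on the held-out split."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    n_holdout = max(1, int(len(xs) * holdout_fraction))
    holdout_idx, train_idx = indices[:n_holdout], indices[n_holdout:]

    train_x = [xs[i] for i in train_idx]
    train_y = [ys[i] for i in train_idx]
    hold_x = [xs[i] for i in holdout_idx]
    hold_y = [ys[i] for i in holdout_idx]

    scores = {}
    for penalty in penalties:
        slope = fit_ridge_slope(train_x, train_y, penalty)
        # The held-out points were never seen during fitting.
        scores[penalty] = mean_squared_error(hold_x, hold_y, slope)

    best_penalty = min(scores, key=scores.get)
    return best_penalty, scores


# Synthetic data (assumed for the example): true slope 2, Gaussian noise.
rng = random.Random(42)
xs = [rng.uniform(-1.0, 1.0) for _ in range(50)]
ys = [2.0 * x + rng.gauss(0.0, 0.5) for x in xs]

best_penalty, scores = holdout_select(xs, ys, penalties=[0.0, 0.1, 1.0, 10.0, 100.0])
for penalty, mse in scores.items():
    print(f"penalty={penalty:>6}: holdout MSE={mse:.3f}")
print("selected penalty:", best_penalty)
```

The same loop works for any set of candidate models, as long as each one is fitted only on the training split and scored only on the held-out split.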
Information criteria are a slightly different approach. In general, they rely on calculating the joint likelihood of observing the data under the model, then taking the log and multiplying by -2, a quantity known as ‘deviance’. An information criterion is some function of the deviance. The Akaike Information Criterion (AIC), for example, is the deviance plus two times the number of parameters, and lower values indicate a better trade-off between fit and complexity. Information criteria, just like cross validation, can be used for model selection.
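As a rough illustration, the sketch below computes AIC as -2 times the log-likelihood plus 2 times the number of parameters, and uses it to compare two models of the same data. The assumptions are mine, not the post’s: the data are synthetic, the errors are taken to be Gaussian with the variance set to its maximum-likelihood estimate, and the helper names (`gaussian_log_likelihood`, `aic`, `fit_line`) are hypothetical.

```python
import math
import random


def gaussian_log_likelihood(residuals):
    """Joint log-likelihood of the residuals under N(0, sigma^2), with sigma^2 at its MLE."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n  # maximum-likelihood variance estimate
    return -0.5 * n * (math.log(2.0 * math.pi * sigma2) + 1.0)


def aic(log_likelihood, n_params):
    """AIC = -2 * log-likelihood + 2 * number of parameters (lower is better)."""
    return -2.0 * log_likelihood + 2.0 * n_params


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Synthetic data (assumed for the example): intercept 1, slope 2, Gaussian noise.
rng = random.Random(42)
xs = [rng.uniform(-1.0, 1.0) for _ in range(50)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 0.5) for x in xs]

# Model A: constant mean. Parameters: mean and sigma.
y_bar = sum(ys) / len(ys)
resid_constant = [y - y_bar for y in ys]

# Model B: straight line. Parameters: intercept, slope, and sigma.
intercept, slope = fit_line(xs, ys)
resid_linear = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

loglik_constant = gaussian_log_likelihood(resid_constant)
loglik_linear = gaussian_log_likelihood(resid_linear)
aic_constant = aic(loglik_constant, n_params=2)
aic_linear = aic(loglik_linear, n_params=3)

print(f"constant model: AIC={aic_constant:.1f}")
print(f"linear model:   AIC={aic_linear:.1f}")
```

The extra parameter always improves the likelihood of the linear model on the training data; the 2-per-parameter penalty is what lets AIC prefer the simpler model when that improvement is too small.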
Calculating deviance requires a model that includes error, not just a point estimate. (Otherwise, the likelihood of any data point the model does not predict exactly is just zero.) In some methods of model generation (e.g. the normal equation for linear regression), the error isn’t explicitly modeled. Thus, information criteria can’t be applied directly to these ‘off-the-shelf’ types of models unless an error distribution is added, for instance by assuming normally distributed residuals and estimating their variance.
When they are used
Information criteria are often used in the context of Bayesian modeling, where the model explicitly includes error and quantifies uncertainty in all parameters. Information criteria are somewhat abstract but seem more soundly based, theoretically speaking.
In contrast, cross validation can be used even when error and uncertainty are not modeled. Additionally, cross validation is widely used in practice, and its principle is intuitive even to machine learning novices.
Overall, both methods are highly useful and informative. Information criteria may be more sound, theoretically speaking, and may appeal to academic types for this reason. In general, however, more people are likely to be familiar with cross validation, and it’s probably easier to explain and sell this technique to a non-technical audience.