Probabilistic Model Selection with AIC, BIC

3 min read · Jun 7, 2023

Akaike Information Criterion (AIC) derived from frequentist probability vs Bayesian Information Criterion (BIC)

The Cathedral of Saint George in Sicily, Italy

Hi everyone, I'm back.

When we are unsure which model is the best, there are tons of model comparison techniques available, for example comparing the performance of the previous model and the new model after retraining.

Today we will look into a technique where we can compare the model performance and complexity thoroughly using a probabilistic approach.

Models are usually compared based on 2 main factors:

Model Performance
Model Complexity

Model performance can be measured with the log-likelihood under the framework of maximum likelihood estimation. Model complexity can be measured as the number of degrees of freedom or parameters in the model.

One of the benefits of probabilistic model selection is that we don't need a test dataset; the model can be scored directly on the training data. However, one limitation that I can think of is that the same general statistic cannot be calculated across a range of different types of models.
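To make the log-likelihood idea concrete, here is a minimal sketch for a binary classifier, assuming the model outputs a probability for class 1. The function name bernoulli_log_likelihood and the example numbers are made up for illustration; they are not from a specific library.

```python
import math

def bernoulli_log_likelihood(y_true, p_pred, eps=1e-12):
    """Sum of log-probabilities the model assigns to the observed labels."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip probabilities so log(0) can never occur.
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

# Hypothetical labels and predicted probabilities, for illustration only.
labels = [1, 0, 1, 1]
confident_model = [0.9, 0.1, 0.8, 0.7]
vague_model = [0.5, 0.5, 0.5, 0.5]

ll_confident = bernoulli_log_likelihood(labels, confident_model)
ll_vague = bernoulli_log_likelihood(labels, vague_model)
print(ll_confident, ll_vague)
```

A higher (less negative) log-likelihood means the model assigns more probability to what was actually observed, so the confident and correct model scores better than the one that always predicts 0.5.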

There are 3 approaches:

Akaike Information Criterion (AIC). Derived from frequentist probability.
Bayesian Information Criterion (BIC). Derived from Bayesian probability.
Minimum Description Length (MDL). Derived from information theory.

1. AIC = -2/N * LL + 2 * k/N

Where N is the number of examples in the training dataset, LL is the log-likelihood of the model on the training dataset, and k is the number of parameters in the model.
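The formula above can be sketched in a few lines of Python. The function names and the example values are assumptions for illustration. The second function shows the more common unnormalized form, -2 * LL + 2 * k, which is simply N times the normalized one, so both rank models the same way when N is fixed.

```python
def aic_normalized(log_likelihood: float, n_samples: int, n_params: int) -> float:
    """AIC divided by N, as in the formula above: -2/N * LL + 2 * k/N."""
    return (-2.0 / n_samples) * log_likelihood + 2.0 * n_params / n_samples

def aic_total(log_likelihood: float, n_params: int) -> float:
    """Unnormalized AIC: -2 * LL + 2 * k."""
    return -2.0 * log_likelihood + 2.0 * n_params

# Hypothetical values: LL = -50 on N = 100 training examples with k = 3 parameters.
example_ll = -50.0
example_n = 100
example_k = 3

normalized = aic_normalized(example_ll, example_n, example_k)
total = aic_total(example_ll, example_k)
print(normalized, total)
```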

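Python version (for a regression model with Gaussian errors, where -2 * LL equals n * log(mse) plus a constant that is the same for every model): aic = n * log(mse) + 2 * num_params

Note that this version is not divided by N like the formula above; for a fixed dataset that does not change which model has the lowest AIC.

Here the number of parameters is num_params = len(model.coef_) + 1, where model.coef_ holds the fitted model coefficients (one per feature/attribute used in the model) and the + 1 accounts for the intercept.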
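Why can mse stand in for the log-likelihood? Assuming Gaussian errors with the noise variance estimated as the mse, the maximized log-likelihood is LL = -n/2 * (log(2 * pi) + log(mse) + 1). The sketch below checks that the mse shortcut and the full formula differ only by the constant n * (log(2 * pi) + 1). The data and helper names are made up for illustration, and no fitting library is used.

```python
import math

def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def gaussian_log_likelihood(y_true, y_pred):
    """Maximized Gaussian log-likelihood, with variance set to the mse."""
    n = len(y_true)
    sigma2 = mean_squared_error(y_true, y_pred)
    return -0.5 * n * (math.log(2 * math.pi) + math.log(sigma2) + 1)

def aic_from_mse(y_true, y_pred, num_params):
    """The shortcut from the article: n * log(mse) + 2 * num_params."""
    n = len(y_true)
    return n * math.log(mean_squared_error(y_true, y_pred)) + 2 * num_params

# Hypothetical targets and predictions from a model with 2 parameters.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.0]
num_params = 2

shortcut = aic_from_mse(y_true, y_pred, num_params)
full = -2 * gaussian_log_likelihood(y_true, y_pred) + 2 * num_params
constant = len(y_true) * (math.log(2 * math.pi) + 1)
print(shortcut, full, full - shortcut, constant)
```

Because the constant is identical for every model fitted to the same data, dropping it does not change which model has the lowest AIC. Some references also count the noise variance as one more parameter; that adds the same amount to every model and leaves the ranking unchanged.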

And the model with the lowest AIC is selected.

Compared to BIC, AIC puts more emphasis on model performance on the training dataset and penalizes complex models less.

So this may result in overfitting. BIC addresses this problem with a stronger penalty on the number of parameters: each parameter costs log(N) instead of 2, so the penalty grows with the size of the training dataset.

2. BIC = -2 * LL + log(N) * k

Where log() is the natural logarithm (base e), LL is the log-likelihood of the model, N is the number of examples in the training dataset, and k is the number of parameters in the model. The model with the lowest BIC is selected.

Python version: bic = n * log(mse) + num_params * log(n)
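Here is a small sketch that puts the two Python versions side by side. The two models and their mse values are hypothetical numbers chosen for illustration, not results from a real experiment.

```python
import math

def aic_from_mse(n, mse, num_params):
    return n * math.log(mse) + 2 * num_params

def bic_from_mse(n, mse, num_params):
    return n * math.log(mse) + num_params * math.log(n)

n = 200  # hypothetical number of training examples

# Hypothetical candidates: the complex model fits slightly better
# but uses more parameters.
simple = {"mse": 1.00, "num_params": 3}
complex_model = {"mse": 0.95, "num_params": 8}

aic_simple = aic_from_mse(n, **simple)
aic_complex = aic_from_mse(n, **complex_model)
bic_simple = bic_from_mse(n, **simple)
bic_complex = bic_from_mse(n, **complex_model)

print("AIC:", aic_simple, aic_complex)
print("BIC:", bic_simple, bic_complex)
```

With these made-up numbers AIC is slightly lower for the complex model, while BIC is clearly lower for the simple one, so the two criteria can disagree on the same pair of models.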

BIC penalizes model complexity more heavily than AIC.

The limitation of BIC is that for a smaller, less representative training dataset, it is more likely to end up choosing models that are too simple.

A lower BIC value indicates a better balance between fit and complexity, hence a better model.
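To see how much more, compare the penalty terms from the two formulas directly: AIC charges 2 per parameter and BIC charges log(N) per parameter. A quick sketch, with function names made up for illustration:

```python
import math

def aic_penalty(num_params):
    return 2 * num_params

def bic_penalty(num_params, n):
    return num_params * math.log(n)

# Penalty for a single parameter at different training set sizes.
for n in (5, 7, 8, 100, 10000):
    print(n, aic_penalty(1), round(bic_penalty(1, n), 3))
```

Since log(N) is larger than 2 once N is 8 or more, BIC is the stricter criterion for practically every real dataset, and the gap widens as N grows.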

I hope this helps. Thanks again for your time. If you enjoyed this short article, there are tons of topics in advanced analytics, data science, and machine learning available in my Medium repo: https://medium.com/@bobrupakroy

You might also enjoy MultiLabel Multi Class Algorithms

Kaggle Implementation:

https://www.kaggle.com/rupakroy/multilabel-multi-class-algorithms-ii

Git Repo:

OutputCodeClassifier: https://github.com/rupak-roy/MultiLabel-OutputCodeClassifier

MultiLabel-MultiClass-Power-Transformation-Approach: https://github.com/rupak-roy/MultiLabel-MultiClass-Power-Transformation-Approach

Some of my alternative internet presences: Facebook, Instagram, Udemy, Blogger, Issuu, Slideshare, Scribd and more.

Also available on Quora @ https://www.quora.com/profile/Rupak-Bob-Roy

Let me know if you need anything. Talk Soon.
