Is there any reason to prefer the AIC or BIC over the other?

181


The AIC and BIC are both methods of assessing model fit penalized for the number of estimated parameters. As I understand it, BIC penalizes models more for free parameters than does AIC. Beyond a preference based on the stringency of the criteria, are there any other reasons to prefer AIC over BIC or vice versa?

russellpierce

Posted 2010-07-23T20:49:12.340

Reputation: 9 334

I don't know if your question applies specifically to phylogeny (bioinformatics), but if so, this study can provide some thoughts on this aspect: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2925852/

– tlorin – 2018-01-03T09:09:30.307

1I think it is more appropriate to call this discussion "feature" selection or "covariate" selection. To me, model selection is much broader, involving specification of the distribution of errors, the form of the link function, and the form of the covariates. When we talk about AIC/BIC, we are typically in the situation where all aspects of model building are fixed, except the selection of covariates. – None – 2012-08-13T21:17:47.313

5Deciding which specific covariates to include in a model does commonly go by the term model selection, and there are a number of books with model selection in the title that are primarily about deciding which covariates/parameters to include in the model. – Michael Chernick – 2012-08-24T14:44:28.847

Answers

152

Your question implies that AIC and BIC try to answer the same question, which is not true. AIC tries to select the model that most adequately describes an unknown, high-dimensional reality. This means that reality is never in the set of candidate models that are being considered. By contrast, BIC tries to find the TRUE model among the set of candidates. I find it quite odd to assume that reality is instantiated in one of the models that the researchers built along the way. This is a real issue for BIC.

Nevertheless, there are a lot of researchers who say BIC is better than AIC, using model-recovery simulations as an argument. These simulations consist of generating data from models A and B, and then fitting both datasets with the two models. Overfitting occurs when the wrong model fits the data better than the generating one. The point of these simulations is to see how well AIC and BIC correct these overfits. Usually, the results point to the fact that AIC is too liberal and still frequently prefers a more complex, wrong model over a simpler, true model. At first glance these simulations seem to be really good arguments, but the problem with them is that they are meaningless for AIC. As I said before, AIC does not consider that any of the candidate models being tested is actually true. According to AIC, all models are approximations to reality, and reality is never low dimensional; at least it is not lower dimensional than some of the candidate models.

My recommendation: use both AIC and BIC. Most of the time they will agree on the preferred model; when they don't, just report it.

If you are unhappy with both AIC and BIC and you have free time to invest, look up Minimum Description Length (MDL), a totally different approach that overcomes the limitations of AIC and BIC. There are several measures stemming from MDL, like normalized maximum likelihood or the Fisher information approximation. The problem with MDL is that it is mathematically demanding and/or computationally intensive.

Still, if you want to stick to simple solutions, a nice way of assessing model flexibility (especially when the numbers of parameters are equal, rendering AIC and BIC useless) is the parametric bootstrap, which is quite easy to implement. Here is a link to a paper on it: link text
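As a rough illustration of the idea (not the specific procedure in the linked paper), here is a minimal Python sketch of a parametric bootstrap for comparing two models of similar flexibility; `fit_A`, `fit_B`, `simulate_A`, and `simulate_B` are hypothetical stand-ins for your own fitting and data-generating code:

```python
# Parametric bootstrap sketch: how much of the observed fit advantage of one
# model over another could be produced by flexibility alone?
import numpy as np

def bootstrap_fit_difference(data, fit_A, fit_B, simulate_A, simulate_B,
                             n_boot=500, rng=None):
    rng = np.random.default_rng(rng)
    params_A, ll_A = fit_A(data)          # each fit_* returns (params, log-likelihood)
    params_B, ll_B = fit_B(data)
    observed_diff = ll_A - ll_B           # > 0 favours model A on the real data

    diffs_when_A_true, diffs_when_B_true = [], []
    for _ in range(n_boot):
        # Synthetic data generated from model A, both models refit
        fake = simulate_A(params_A, size=len(data), rng=rng)
        diffs_when_A_true.append(fit_A(fake)[1] - fit_B(fake)[1])
        # Synthetic data generated from model B, both models refit
        fake = simulate_B(params_B, size=len(data), rng=rng)
        diffs_when_B_true.append(fit_A(fake)[1] - fit_B(fake)[1])

    # Compare observed_diff against the two reference distributions to judge
    # whether the observed advantage exceeds what flexibility alone produces.
    return observed_diff, np.array(diffs_when_A_true), np.array(diffs_when_B_true)
```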

Some people here advocate the use of cross-validation. I personally have used it and don't have anything against it, but the issue is that the choice of sample-splitting rule (leave-one-out, K-fold, etc.) is an unprincipled one.

Dave Kellen

Posted 2010-07-23T20:49:12.340

Reputation: 2 258

6The difference can be viewed purely from a mathematical standpoint -- BIC was derived as an asymptotic expansion of log P(data) where the true model parameters are sampled according to an arbitrary nowhere-vanishing prior; AIC was similarly derived with the true parameters held fixed – Yaroslav Bulatov – 2011-01-24T05:57:44.100

3You said that "there are a lot of researchers who say BIC is better than AIC, using model recovery simulations as an argument. These simulations consist of generating data from models A and B, and then fitting both datasets with the two models." Would you be so kind as to point to some references? I'm curious about them! :) – deps_stats – 2011-05-03T16:21:43.600

These slides http://myweb.uiowa.edu/cavaaugh/ms_lec_2_ho.pdf say that AIC assumes that the generating model is among the set of candidate models.

– João Abrantes – 2015-10-22T11:09:15.607

discussion on comment by @gui11aume: http://stats.stackexchange.com/questions/205222/does-bic-try-to-find-a-true-model

– Erosennin – 2016-04-03T16:39:49.317

1I do not believe the statements in this post. – user9352 – 2012-05-02T14:06:57.923

1I don't completely agree with Dave, especially regarding the objectives being different. I think both methods look to find a good and in some sense optimal set of variables for a model. In practice we never really assume that we can construct a "perfect" model. I think that in a purely probabilistic sense, if we assume that there is a "correct" model, then BIC will be consistent and AIC will not. By this the mathematical statisticians mean that as the sample size grows to infinity BIC will find it with probability tending to 1. – Michael Chernick – 2012-05-04T17:21:57.897

I think that is why some people think that AIC does not provide a strong enough penalty. – Michael Chernick – 2012-05-04T17:22:04.433

14

(-1) Great explanation, but I would like to challenge an assertion. @Dave Kellen Could you please give a reference for the idea that the TRUE model has to be in the set for BIC? I would like to investigate this, since in this book the authors give a convincing proof that this is not the case.

– gui11aume – 2012-05-27T21:47:49.233

When you work through the proof of the AIC, for the penalty term to equal the number of linearly independent parameters, the true model must hold. Otherwise it is equal to $\text{Trace}(J^{-1} I)$ where $J$ is the variance of the score, and $I$ is the expectation of the hessian of the log-likelihood, with these expectations evaluated under the truth, but the log-likelihoods are from a mis-specified model. I am unsure why many sources comment that the AIC is independent of the truth. I had this impression, too, until I actually worked through the derivation. – Andrew M – 2017-10-06T22:05:35.547

66

Though AIC and BIC are both driven by the maximum likelihood estimate and penalize free parameters in an effort to combat overfitting, they do so in ways that result in significantly different behavior. Let's look at one commonly presented version of the methods (which results from stipulating normally distributed errors and other well-behaved assumptions):

  • AIC = -2*ln(likelihood) + 2*k,

and

  • BIC = -2*ln(likelihood) + ln(N)*k,

where:

  • k = model degrees of freedom
  • N = number of observations

The best model in the group compared is, in both cases, the one that minimizes these scores. Clearly, AIC does not depend directly on sample size. Moreover, generally speaking, AIC presents the danger that it might overfit, whereas BIC presents the danger that it might underfit, simply in virtue of how they penalize free parameters (2*k in AIC; ln(N)*k in BIC). Diachronically, as data is introduced and the scores are recalculated, at relatively low N (7 and fewer) BIC is more tolerant of free parameters than AIC, but less tolerant at higher N (as the natural log of N exceeds 2).
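As a quick illustration of the two formulas above (a minimal sketch; the log-likelihood is left as an input rather than computed from any particular model), the per-parameter penalties show the crossover at N = 8, where ln(N) first exceeds 2:

```python
import numpy as np

def aic(log_lik, k):
    # AIC = -2*ln(likelihood) + 2*k
    return -2.0 * log_lik + 2.0 * k

def bic(log_lik, k, n):
    # BIC = -2*ln(likelihood) + ln(N)*k
    return -2.0 * log_lik + np.log(n) * k

# Per-parameter penalties: AIC always charges 2, BIC charges ln(N),
# which first exceeds 2 at N = 8.
for n in (4, 7, 8, 20, 1000):
    print(f"N={n:5d}  AIC penalty/param: 2.00  BIC penalty/param: {np.log(n):.2f}")
```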

Additionally, AIC is aimed at finding the best approximating model to the unknown data generating process (via minimizing expected estimated K-L divergence). As such, it fails to converge in probability to the true model (assuming one is present in the group evaluated), whereas BIC does converge as N tends to infinity.

So, as in many methodological questions, which is to be preferred depends upon what you are trying to do, what other methods are available, and whether or not any of the features outlined (convergence, relative tolerance for free parameters, minimizing expected K-L divergence) speak to your goals.

John L. Taylor

Posted 2010-07-23T20:49:12.340

Reputation: 2 461

5Nice answer. A possible alternative take on AIC and BIC is that AIC says that "spurious effects" do not become easier to detect as the sample size increases (or that we don't care if spurious effects enter the model), while BIC says that they do. From the OLS perspective, as in Raftery's 1994 paper, an effect becomes approximately "significant" (i.e. the larger model is preferred) under AIC if its t-statistic satisfies $|t|>\sqrt{2}$, and under BIC if $|t|>\sqrt{\log(n)}$ – probabilityislogic – 2011-05-13T14:33:54.803

1Nice answer, +1. I especially like the caveat about whether the true model is actually present in the group evaluated. I would argue that "the true model" is never present. (Box & Draper said that "all models are false, but some are useful", and Burnham & Anderson call this "tapering effect sizes".) Which is why I am unimpressed by the BIC's convergence under unrealistic assumptions and more by the AIC's aiming at the best approximation among the models we actually look at. – Stephan Kolassa – 2012-11-24T20:00:49.677

58

My quick explanation is

  • AIC is best for prediction as it is asymptotically equivalent to cross-validation.
  • BIC is best for explanation as it allows consistent estimation of the underlying data generating process (see the sketch below).
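A small simulation sketch of the prediction half of this claim, assuming Gaussian linear models fit by least squares; the leave-one-out errors use the standard hat-matrix shortcut, and the data and candidate models are purely illustrative:

```python
# For Gaussian linear models, AIC and leave-one-out CV tend to rank candidate
# models the same way. LOO residuals use the shortcut e_i / (1 - h_ii).
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(scale=1.0, size=n)

def fit_stats(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix
    sigma2 = resid @ resid / len(y)             # ML variance estimate
    k = X.shape[1] + 1                          # coefficients + variance
    aic = len(y) * np.log(sigma2) + 2 * k       # up to an additive constant
    loo_mse = np.mean((resid / (1 - np.diag(H))) ** 2)
    return aic, loo_mse

for degree in (1, 2, 3, 5):
    X = np.vander(x, degree + 1, increasing=True)   # polynomial design matrix
    print(degree, fit_stats(X, y))
```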

Rob Hyndman

Posted 2010-07-23T20:49:12.340

Reputation: 31 725

AIC is equivalent to K-fold cross-validation, BIC is equivalent to leave-one-out cross-validation. Still, both theorems hold only in case of linear regression. – mbq – 2010-07-24T08:23:58.813

3mbq, it's AIC/LOO (not LKO or K-fold) and I don't think the proof in Stone 1977 relied on linear models. I don't know the details of the BIC result. – ars – 2010-07-24T11:01:35.353

7ars is correct. It's AIC=LOO and BIC=K-fold where K is a complicated function of the sample size. – Rob Hyndman – 2010-07-24T12:42:31.373

Congratulations, you've got me; I was in a hurry writing that and so I made this error; obviously it's how Rob wrote it. Nevertheless it is from Shao 1995, where there was an assumption that the model is linear. I'll analyse Stone; still, I think you, ars, may be right, since LOO in my field has an equally bad reputation as various *ICs. – mbq – 2010-07-24T20:10:12.740

The description on Wikipedia (http://en.wikipedia.org/wiki/Cross-validation_(statistics)#K-fold_cross-validation) makes it seem like K-fold cross-validation is sort of like a repeated simulation to estimate the stability of the parameters. I can see why AIC would be expected to be stable with LOO (since LOO can easily be conducted exhaustively), but I don't understand why BIC would be stable with K-fold unless K is also exhaustive. Does the complex formula underlying the value for K make it exhaustive? Or is something else happening?

– russellpierce – 2010-07-25T03:18:18.817

BIC is also equivalent to cross validation, but a "learning" type cross validation. For BIC the CV procedure is to predict the first observation with no data (prior information alone). Then "learn" from the first observation, and predict the second. Then learn from the first and second, and predict the third, and so on. This is true because of the representation $p(D_1\dots D_n|MI)=p(D_1|MI)\prod_{i=2}^{n}p(D_i|D_1\dots D_{i-1}MI)$ – probabilityislogic – 2012-04-04T07:31:51.093

15

In my experience, BIC results in serious underfitting and AIC typically performs well, when the goal is to maximize predictive discrimination.

Frank Harrell

Posted 2010-07-23T20:49:12.340

Reputation: 49 422

12

An informative and accessible "derivation" of AIC and BIC by Brian Ripley can be found here: http://www.stats.ox.ac.uk/~ripley/Nelder80.pdf

Ripley provides some remarks on the assumptions behind the mathematical results. Contrary to what some of the other answers indicate, Ripley emphasizes that AIC is based on assuming that the model is true. If the model is not true, a general computation will reveal that the "number of parameters" has to be replaced by a more complicated quantity. Some references are given in Ripley's slides. Note, however, that for linear regression (strictly speaking, with known variance) the generally more complicated quantity simplifies to the number of parameters.

NRH

Posted 2010-07-23T20:49:12.340

Reputation: 13 032

1When selecting covariance structures for longitudinal data (mixed effects models or generalized least squares), AIC can easily find the wrong structure if there are more than 3 candidate structures. If there are more than 3 you will have to use the bootstrap or other means to adjust for model uncertainty caused by using AIC to select the structure. – Frank Harrell – 2016-04-04T12:40:09.670

3(+1) However, Ripley is wrong on the point where he says that the models must be nested. There is no such constraint on Akaike's original derivation, or, to be clearer, on the derivation using the AIC as an estimator of the Kullback-Leibler divergence. In fact, in a paper that I'm working on, I show somewhat "empirically" that the AIC can even be used for model selection of covariance structures (different number of parameters, clearly non-nested models). From the thousands of simulations of time-series that I ran with different covariance structures, in none of them does the AIC get it wrong... – Néstor – 2012-08-14T17:06:43.550

...if "the correct" model is in fact on the set of models (this, however, also implies that for the models I'm working on, the variance of the estimator is very small...but that's only a technical detail). – Néstor – 2012-08-14T17:07:15.633

1@Néstor, I agree. The point about the models being nested is strange. – NRH – 2012-08-16T06:43:46.717

8

Indeed the only difference is that BIC is AIC extended to take the number of objects (samples) into account. I would say that while both are quite weak (in comparison to, for instance, cross-validation) it is better to use AIC, since more people will be familiar with the abbreviation -- indeed I have never seen a paper or a program where BIC would be used (still I admit that I'm biased towards problems where such criteria simply don't work).

Edit: AIC and BIC are equivalent to cross-validation provided two important assumptions -- that the model is fitted by maximum likelihood, and that you are only interested in model performance on the training data. In the case of collapsing some data into some kind of consensus they are perfectly OK.
In the case of making a prediction machine for some real-world problem the first is false, since your training set represents only a scrap of information about the problem you are dealing with, so you just can't optimize your model; the second is false, because you expect your model to handle new data for which you can't even expect the training set to be representative. And to this end CV was invented: to simulate the behavior of the model when confronted with independent data. In the case of model selection, CV gives you not only an approximation of the quality but also the distribution of that approximation, so it has the great advantage that it can say "I don't know; whatever new data comes, either of them can be better."
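As a small illustration of that last point (a sketch using scikit-learn; the models and data are made up for the example), K-fold CV returns one error estimate per fold, so you get a spread rather than a single number:

```python
# Each fold yields its own error estimate, so you can see how uncertain the
# comparison between two candidate models really is.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 1))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(scale=1.0, size=150)

for degree in (1, 4):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = -cross_val_score(model, X, y, cv=10,
                              scoring="neg_mean_squared_error")
    print(degree, "mean MSE:", scores.mean().round(3),
          "spread (sd):", scores.std().round(3))
```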

mbq

Posted 2010-07-23T20:49:12.340

Reputation: 19 511

4@mbq - I don't see how cross validation overcomes the "un-representativeness" problem. If your training data is un-representative of the data you will receive in the future, you can cross-validate all you want, but it will be unrepresentative of the "generalisation error" that you are actually going to be facing (as "the true" new data is not represented by the non-modeled part of the training data). Getting a representative data set is vital if you are to make good predictions. – probabilityislogic – 2011-05-13T14:21:44.363

@probabilityislogic Sure; I tried here to explain that *IC based selection may become invalidated by looking at it from a CV perspective; of course CV may be equally easily broken by bad sample selection. However, this won't help with selecting a better model. – mbq – 2011-05-13T16:16:08.747

1@mbq - my point is that you seem to "gently reject" IC based selection based on an alternative which doesn't fix the problem. Cross-validation is good (although is the computation worth it?), but un-representative data can't be dealt with using a data-driven process. At least not reliably. You need to have prior information which tells you how it is un-representative (or more generally, what logical connections the "un-representative" data has to the actual future data you will observe). – probabilityislogic – 2011-05-13T17:12:27.210

@probabilityislogic Well, I show that IC sux in comparison to CV, so the fact that CV sux too only makes IC sux even more. But you are right that I've abused the word "representative" in the answer -- I'll try to fix it. And in fact I'm a general denier of model selection =) – mbq – 2011-05-13T18:45:40.870

@mbq - model average ftw! – probabilityislogic – 2011-05-13T19:08:33.370

Does that mean that for certain sample sizes BIC may be less stringent than AIC? – russellpierce – 2010-07-23T21:36:49.540

1Stringent is not the best word here; rather, more tolerant of parameters; still, yup, for the common definitions (with natural log) it happens for 7 and fewer objects. – mbq – 2010-07-23T22:13:56.467

AIC is asymptotically equivalent to cross-validation. – Rob Hyndman – 2010-07-24T01:47:58.633

@Rob Can you give a reference? I doubt that it is general. – mbq – 2010-07-24T08:03:12.467

@Rob From what I could find, this is true only for linear models. – mbq – 2010-07-24T08:13:15.780

@mbq. I was thinking of Shao 1995 which is, indeed, only for linear models. I don't know if the result has been extended to other models. – Rob Hyndman – 2010-07-27T13:30:26.377

5

From what I can tell, there isn't much difference between AIC and BIC. They are both mathematically convenient approximations one can make in order to efficiently compare models. If they give you different "best" models, it probably means you have high model uncertainty, which is more important to worry about than whether you should use AIC or BIC. I personally like BIC better because it asks more (less) of a model if it has more (less) data to fit its parameters - kind of like a teacher asking for a higher (lower) standard of performance if their student has more (less) time to learn about the subject. To me this just seems like the intuitive thing to do. But then I am certain there also exist equally intuitive and compelling arguments for AIC, given its simple form.

Now, any time you make an approximation, there will surely be some conditions when those approximations are rubbish. This can certainly be seen for AIC, where there exist many "adjustments" (AICc) to account for certain conditions which make the original approximation bad. This is also present for BIC, because various other more exact (but still efficient) methods exist, such as fully Laplace approximations to mixtures of Zellner's g-priors (BIC is an approximation to the Laplace approximation method for integrals).

One place where they are both crap is when you have substantial prior information about the parameters within any given model. AIC and BIC unnecessarily penalise models where parameters are partially known compared to models which require parameters to be estimated from the data.

One thing I think is important to note is that BIC does not assume a "true" model a) exists, or b) is contained in the model set. BIC is simply an approximation to an integrated likelihood $P(D|M,A)$ (D=Data, M=model, A=assumptions). Only by multiplying by a prior probability and then normalising can you get $P(M|D,A)$. BIC simply represents how likely the data was if the proposition implied by the symbol $M$ is true. So from a logical viewpoint, any proposition which would lead one to BIC as an approximation is equally supported by the data. So if I state $M$ and $A$ to be the propositions

$$\begin{array}{l} M_{i}:\text{the ith model is the best description of the data} \\ A:\text{out of the set of K models being considered, one of them is the best} \end{array}$$

and then continue to assign the same probability models (same parameters, same data, same approximations, etc.), I will get the same set of BIC values. It is only by attaching some sort of unique meaning to the logical letter "M" that one gets drawn into irrelevant questions about "the true model" (echoes of "the true religion"). The only thing that "defines" M is the mathematical equations which use it in their calculations - and this hardly ever singles out one and only one definition. I could equally put in a prediction proposition about M ("the ith model will give the best predictions"). I personally can't see how this would change any of the likelihoods, and hence how good or bad BIC will be (AIC for that matter as well - although AIC is based on a different derivation).

And besides, what is wrong with the statement "If the true model is in the set I am considering, then there is a 57% probability that it is model B"? Seems reasonable enough to me, or you could go with the more "soft" version: "there is a 57% probability that model B is the best out of the set being considered".

One last comment: I think you will find about as many opinions about AIC/BIC as there are people who know about them.

probabilityislogic

Posted 2010-07-23T20:49:12.340

Reputation: 17 954

5

As you mentioned, AIC and BIC are methods to penalize models for having more regressor variables. A penalty function is used in these methods, which is a function of the number of parameters in the model.

  • When applying AIC, the penalty function is z(p) = 2 p.

  • When applying BIC, the penalty function is z(p) = p ln(n), which is based on interpreting the penalty as deriving from prior information (hence the name Bayesian Information Criterion).

When n is large the two criteria can produce quite different results: BIC applies a much larger penalty for complex models, and hence will lead to simpler models than AIC. However, as stated in the Wikipedia article on BIC:

it should be noted that in many applications..., BIC simply reduces to maximum likelihood selection because the number of parameters is equal for the models of interest.

Amanda

Posted 2010-07-23T20:49:12.340

Reputation: 385

3note that AIC is also equivalent to ML when dimension doesn't change. Your answer makes it seem like this is only for BIC. – probabilityislogic – 2011-05-13T12:10:48.610

4

AIC and BIC are information criteria for comparing models. Each tries to balance model fit and parsimony and each penalizes differently for number of parameters.

AIC is the Akaike Information Criterion; the formula is AIC = 2k - 2ln(L), where k is the number of parameters and L is the likelihood. With this formula, smaller is better. (I recall that some programs output the opposite, 2ln(L) - 2k, but I don't remember the details.)

BIC is the Bayesian Information Criterion; the formula is BIC = k ln(n) - 2ln(L), where n is the number of observations. It favors more parsimonious models than AIC.

I haven't heard of KIC.

Peter Flom

Posted 2010-07-23T20:49:12.340

Reputation: 67 912

haven't heard of KIC either, but for AIC and BIC have a look at the linked question, or search for AIC. http://stats.stackexchange.com/q/577/442

– Henrik – 2011-09-16T10:30:38.417

1(This reply was merged from a duplicate question that also asked for interpretation of "KIC".) – whuber – 2011-09-16T17:49:12.867

3The models don't need to be nested to be compared with AIC or BIC. – Macro – 2012-04-05T13:03:51.293

4

AIC should rarely be used, as it is really only valid asymptotically. It is almost always better to use AICc (AIC with a correction for finite sample size). AIC tends to overparameterize: that problem is greatly lessened with AICc. The main exception to using AICc is when the underlying distributions are heavily leptokurtic. For more on this, see the book Model Selection by Burnham & Anderson.
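For reference, the correction is simple to apply on top of an existing AIC value; a minimal sketch (the numbers are purely illustrative):

```python
# Small-sample correction recommended above:
# AICc = AIC + 2k(k + 1) / (n - k - 1).
# As n grows the correction vanishes and AICc converges to AIC;
# for small n it penalizes extra parameters harder.
def aicc(log_likelihood, k, n):
    aic = -2.0 * log_likelihood + 2.0 * k
    return aic + 2.0 * k * (k + 1) / (n - k - 1)

# Example: same log-likelihood and k = 5 parameters at two sample sizes.
print(aicc(-100.0, 5, 20))    # small n: noticeably larger than AIC = 210
print(aicc(-100.0, 5, 2000))  # large n: essentially equal to AIC
```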

user2875

Posted 2010-07-23T20:49:12.340

Reputation: 126

1So, what you are saying is that AIC doesn't sufficiently punish models for parameters, so using it as a criterion may lead to overparametrization. You recommend the use of AICc instead. To put this back in the context of my initial question: since BIC is already more stringent than AIC, is there a reason to use AICc over BIC? – russellpierce – 2011-01-25T05:41:24.113

1What do you mean by "AIC is valid asymptotically"? As pointed out by John Taylor, AIC is inconsistent. I think his comments contrasting AIC with BIC are the best ones given. I do not see the two being the same as cross-validation. They all have a nice property that they usually peak at a model with fewer than the maximum number of variables. But they all can pick different models. – Michael Chernick – 2012-05-06T01:05:11.570