Is there any reason to prefer the AIC or BIC over the other?



The AIC and BIC are both methods of assessing model fit penalized for the number of estimated parameters. As I understand it, BIC penalizes models more for free parameters than does AIC. Beyond a preference based on the stringency of the criteria, are there any other reasons to prefer AIC over BIC or vice versa?


Posted 2010-07-23T20:49:12.340

Reputation: 9 334

I don't know if your question applies specifically to phylogeny (bioinformatics), but if so, this study can provide some thoughts on this aspect:

– tlorin – 2018-01-03T09:09:30.307

I think it is more appropriate to call this discussion "feature" selection or "covariate" selection. To me, model selection is much broader, involving specification of the distribution of errors, the form of the link function, and the form of the covariates. When we talk about AIC/BIC, we are typically in the situation where all aspects of model building are fixed except the selection of covariates. – None – 2012-08-13T21:17:47.313

Deciding the specific covariates to include in a model does commonly go by the term model selection, and there are a number of books with model selection in the title that are primarily about deciding which covariates/parameters to include in the model. – Michael Chernick – 2012-08-24T14:44:28.847



Your question implies that AIC and BIC try to answer the same question, which is not true. AIC tries to select the model that most adequately describes an unknown, high-dimensional reality. This means that reality is never in the set of candidate models being considered. By contrast, BIC tries to find the TRUE model among the set of candidates. I find quite odd the assumption that reality is instantiated in one of the models that the researchers built along the way. This is a real issue for BIC.

Nevertheless, there are a lot of researchers who say BIC is better than AIC, using model recovery simulations as an argument. These simulations consist of generating data from models A and B, and then fitting both datasets with the two models. Overfitting occurs when the wrong model fits the data better than the generating model. The point of these simulations is to see how well AIC and BIC correct these overfits. Usually, the results point to the fact that AIC is too liberal and still frequently prefers a more complex, wrong model over a simpler, true model. At first glance these simulations seem to be really good arguments, but the problem with them is that they are meaningless for AIC. As I said before, AIC does not consider that any of the candidate models being tested is actually true. According to AIC, all models are approximations to reality, and reality is never low-dimensional, at least not of lower dimension than some of the candidate models.

My recommendation: use both AIC and BIC. Most of the time they will agree on the preferred model; when they don't, just report it.

If you are unhappy with both AIC and BIC, and you have free time to invest, look up Minimum Description Length (MDL), a totally different approach that overcomes the limitations of AIC and BIC. There are several measures stemming from MDL, like normalized maximum likelihood or the Fisher Information approximation. The problem with MDL is that it is mathematically demanding and/or computationally intensive.

Still, if you want to stick to simple solutions, a nice way of assessing model flexibility (especially when the numbers of parameters are equal, rendering AIC and BIC useless) is the parametric bootstrap, which is quite easy to implement. Here is a link to a paper on it: link text
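As a hedged illustration of the parametric-bootstrap idea (a toy Bernoulli example of my own devising, not the one from the linked paper): generate data from the simpler model, refit both models, and record how much the extra flexibility improves the fit on pure noise.

```python
import math
import random

random.seed(0)

def loglik(data, p):
    # Bernoulli log-likelihood; keep p away from 0 and 1 for stability
    p = min(max(p, 1e-9), 1 - 1e-9)
    k = sum(data)
    return k * math.log(p) + (len(data) - k) * math.log(1 - p)

# Model A: p fixed at 0.5 (0 free parameters).
# Model B: p estimated by maximum likelihood (1 free parameter).
n, reps = 50, 1000
gaps = []
for _ in range(reps):
    sim = [1 if random.random() < 0.5 else 0 for _ in range(n)]  # data from model A
    p_hat = sum(sim) / n
    gaps.append(loglik(sim, p_hat) - loglik(sim, 0.5))  # B's advantage on noise

# The bootstrap distribution of these gaps shows how much better the
# flexible model fits data that contain no effect at all; an observed
# gap on real data is only impressive if it exceeds typical values here.
print(sum(gaps) / reps)
```

The gap is never negative, since the maximum-likelihood fit can only improve on the fixed one; its typical size is the "flexibility" being measured.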

Some people here advocate the use of cross-validation. I have personally used it and don't have anything against it, but the issue with it is that the choice of sample-splitting rule (leave-one-out, K-fold, etc.) is an unprincipled one.

Dave Kellen


Reputation: 2 258

The difference can be viewed purely from a mathematical standpoint -- BIC was derived as an asymptotic expansion of log P(data) where the true model parameters are sampled according to an arbitrary nowhere-vanishing prior; AIC was similarly derived with the true parameters held fixed – Yaroslav Bulatov – 2011-01-24T05:57:44.100

You said that "there are a lot of researchers who say BIC is better than AIC, using model recovery simulations as an argument. These simulations consist of generating data from models A and B, and then fitting both datasets with the two models." Would you be so kind as to point to some references? I'm curious about them! :) – deps_stats – 2011-05-03T16:21:43.600

These slides say that AIC assumes that the generating model is among the set of candidate models.

– João Abrantes – 2015-10-22T11:09:15.607

discussion on comment by @gui11aume:

– Erosennin – 2016-04-03T16:39:49.317

I do not believe the statements in this post. – user9352 – 2012-05-02T14:06:57.923

I don't completely agree with Dave, especially regarding the objectives being different. I think both methods look to find a good and in some sense optimal set of variables for a model. In practice we really never assume that we can construct a "perfect" model. I think that in a purely probabilistic sense, if we assume that there is a "correct" model, then BIC will be consistent and AIC will not. By this the mathematical statisticians mean that as the sample size grows to infinity, BIC will find it with probability tending to 1. – Michael Chernick – 2012-05-04T17:21:57.897

I think that is why some people think that AIC does not provide a strong enough penalty. – Michael Chernick – 2012-05-04T17:22:04.433


(-1) Great explanation, but I would like to challenge an assertion. @Dave Kellen Could you please give a reference for the idea that the TRUE model has to be in the set for BIC? I would like to investigate this, since in this book the authors give a convincing proof that this is not the case.

– gui11aume – 2012-05-27T21:47:49.233

When you work through the proof of the AIC, for the penalty term to equal the number of linearly independent parameters, the true model must hold. Otherwise it is equal to $\text{Trace}(J^{-1} I)$, where $J$ is the variance of the score and $I$ is the expectation of the Hessian of the log-likelihood, with these expectations evaluated under the truth but the log-likelihoods coming from a mis-specified model. I am unsure why many sources state that the AIC is independent of the truth. I had this impression, too, until I actually worked through the derivation. – Andrew M – 2017-10-06T22:05:35.547


Though AIC and BIC are both maximum-likelihood driven and penalize free parameters in an effort to combat overfitting, they do so in ways that result in significantly different behavior. Let's look at one commonly presented version of the methods (which results from stipulating normally distributed errors and other well-behaved assumptions):

  • AIC = -2*ln(likelihood) + 2*k
  • BIC = -2*ln(likelihood) + ln(N)*k
  • k = model degrees of freedom
  • N = number of observations

The best model in the group compared is, in both cases, the one that minimizes these scores. Clearly, AIC does not depend directly on sample size. Moreover, generally speaking, AIC presents the danger that it might overfit, whereas BIC presents the danger that it might underfit, simply in virtue of how they penalize free parameters (2*k in AIC; ln(N)*k in BIC). Diachronically, as data are introduced and the scores are recalculated, at relatively low N (7 or fewer) BIC is more tolerant of free parameters than AIC, but less tolerant at higher N (as the natural log of N exceeds 2).
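A minimal Python sketch of the two formulas above (the log-likelihood value is a placeholder) makes the crossover at ln(N) = 2 concrete:

```python
import math

def aic(loglik, k):
    # AIC = -2*ln(likelihood) + 2*k
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    # BIC = -2*ln(likelihood) + ln(N)*k
    return -2.0 * loglik + math.log(n) * k

# With the same fit, only the per-parameter penalty differs:
# 2 for AIC versus ln(N) for BIC, so BIC is more tolerant of
# free parameters when ln(N) < 2, i.e. for N <= 7.
for n in (5, 7, 8, 100):
    print(n, bic(-10.0, 3, n) - aic(-10.0, 3))
```

For N of 5 or 7 the printed difference is negative (BIC is the milder criterion); from N = 8 on it is positive and grows with N.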

Additionally, AIC is aimed at finding the best approximating model to the unknown data generating process (via minimizing expected estimated K-L divergence). As such, it fails to converge in probability to the true model (assuming one is present in the group evaluated), whereas BIC does converge as N tends to infinity.

So, as in many methodological questions, which is to be preferred depends upon what you are trying to do, what other methods are available, and whether or not any of the features outlined (convergence, relative tolerance for free parameters, minimizing expected K-L divergence), speak to your goals.

John L. Taylor


Reputation: 2 461

Nice answer. A possible alternative take on AIC and BIC is that AIC says that "spurious effects" do not become easier to detect as the sample size increases (or that we don't care if spurious effects enter the model), while BIC says that they do. Seen from the OLS perspective, as in Raftery's 1994 paper, an effect becomes approximately "significant" (i.e. the larger model is preferred) under AIC if its t-statistic satisfies $|t|>\sqrt{2}$, and under BIC if $|t|>\sqrt{\log(n)}$ – probabilityislogic – 2011-05-13T14:33:54.803

Nice answer, +1. I especially like the caveat about whether the true model is actually present in the group evaluated. I would argue that "the true model" is never present. (Box & Draper said that "all models are false, but some are useful", and Burnham & Anderson call this "tapering effect sizes".) Which is why I am unimpressed by the BIC's convergence under unrealistic assumptions and more by the AIC's aiming at the best approximation among the models we actually look at. – Stephan Kolassa – 2012-11-24T20:00:49.677


My quick explanation is

  • AIC is best for prediction as it is asymptotically equivalent to cross-validation.
  • BIC is best for explanation as it allows consistent estimation of the underlying data generating process.
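This prediction/explanation split can be seen numerically. The following sketch (toy data and models of my own choosing) fits an intercept-only model and a line, then checks that AIC's choice matches leave-one-out cross-validation's choice:

```python
import math
import random

random.seed(3)

n = 40
xs = [i / n for i in range(n)]
ys = [1.0 + 2.0 * x + random.gauss(0.0, 0.5) for x in xs]  # the truth has a slope

def fit_mean(x, y):
    m = sum(y) / len(y)
    return lambda t: m

def fit_line(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    return lambda t: a + b * t

def gaussian_aic(x, y, fitter, k):
    # up to an additive constant, -2*ln(L) = n*ln(RSS/n) for Gaussian errors
    f = fitter(x, y)
    rss = sum((yi - f(xi)) ** 2 for xi, yi in zip(x, y))
    return len(y) * math.log(rss / len(y)) + 2 * k

def loo_mse(x, y, fitter):
    errs = []
    for i in range(len(x)):
        f = fitter(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        errs.append((y[i] - f(x[i])) ** 2)
    return sum(errs) / len(errs)

models = {"mean": (fit_mean, 1), "line": (fit_line, 2)}
aic_pick = min(models, key=lambda m: gaussian_aic(xs, ys, *models[m]))
loo_pick = min(models, key=lambda m: loo_mse(xs, ys, models[m][0]))
print(aic_pick, loo_pick)  # the two criteria agree here
```

With a real slope in the data, both criteria pick the line; the asymptotic-equivalence result says such agreement is the rule rather than the exception.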

Rob Hyndman


Reputation: 31 725

AIC is equivalent to K-fold cross-validation, BIC is equivalent to leave-one-out cross-validation. Still, both theorems hold only in the case of linear regression. – mbq – 2010-07-24T08:23:58.813

mbq, it's AIC/LOO (not LKO or K-fold) and I don't think the proof in Stone 1977 relied on linear models. I don't know the details of the BIC result. – ars – 2010-07-24T11:01:35.353

ars is correct. It's AIC=LOO and BIC=K-fold where K is a complicated function of the sample size. – Rob Hyndman – 2010-07-24T12:42:31.373

Congratulations, you've got me; I was in a hurry writing that, so I made this error; obviously it's how Rob wrote it. Nevertheless it is from Shao 1995, where there was an assumption that the model is linear. I'll analyse Stone; still, I think you, ars, may be right, since LOO in my field has an equally bad reputation as the various *ICs. – mbq – 2010-07-24T20:10:12.740

The description on Wikipedia makes it seem like K-fold cross-validation is sort of like a repeated simulation to estimate the stability of the parameters. I can see why AIC would be expected to be stable with LOO (since LOO can easily be conducted exhaustively), but I don't understand why BIC would be stable with K-fold unless K is also exhaustive. Does the complicated formula underlying the value for K make it exhaustive? Or is something else happening?

– russellpierce – 2010-07-25T03:18:18.817

BIC is also equivalent to cross-validation, but a "learning" type of cross-validation. For BIC the CV procedure is to predict the first observation with no data (prior information alone), then "learn" from the first observation and predict the second, then learn from the first and second and predict the third, and so on. This is true because of the representation $p(D_1\dots D_n|MI)=p(D_1|MI)\prod_{i=2}^{n}p(D_i|D_1\dots D_{i-1}MI)$ – probabilityislogic – 2012-04-04T07:31:51.093
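This prequential identity can be checked exactly in a conjugate toy model. A sketch assuming a Beta(1,1) prior on a Bernoulli parameter (an illustrative choice, not from the comment): the product of one-step-ahead predictive probabilities equals the marginal likelihood $p(D|MI)$ that BIC approximates.

```python
import math

data = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # a fixed sequence of coin flips

# Prequential pass: predict each observation from the posterior
# given only the observations before it, then update.
a, b = 1.0, 1.0  # Beta(1,1) prior
preq_log = 0.0
for x in data:
    p_one = a / (a + b)  # posterior predictive P(x = 1 | past)
    preq_log += math.log(p_one if x == 1 else 1.0 - p_one)
    a, b = a + x, b + (1 - x)

# Closed-form marginal likelihood for the Beta(1,1)-Bernoulli model:
# P(D) = k!(n-k)!/(n+1)!  with k successes out of n.
k, n = sum(data), len(data)
marg_log = math.lgamma(k + 1) + math.lgamma(n - k + 1) - math.lgamma(n + 2)

print(abs(preq_log - marg_log) < 1e-12)  # the two computations agree
```

The agreement is exact (up to floating point), not asymptotic: the chain rule of probability guarantees it for any model with a proper prior.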


In my experience, BIC results in serious underfitting and AIC typically performs well, when the goal is to maximize predictive discrimination.

Frank Harrell


Reputation: 49 422


An informative and accessible "derivation" of AIC and BIC by Brian Ripley can be found here:

Ripley provides some remarks on the assumptions behind the mathematical results. Contrary to what some of the other answers indicate, Ripley emphasizes that AIC is based on assuming that the model is true. If the model is not true, a general computation will reveal that the "number of parameters" has to be replaced by a more complicated quantity. Some references are given in Ripley's slides. Note, however, that for linear regression (strictly speaking, with a known variance) this generally more complicated quantity simplifies to the number of parameters.



Reputation: 13 032

When selecting covariance structures for longitudinal data (mixed effects models or generalized least squares), AIC can easily find the wrong structure if there are more than 3 candidate structures. If there are more than 3, you will have to use the bootstrap or other means to adjust for the model uncertainty caused by using AIC to select the structure. – Frank Harrell – 2016-04-04T12:40:09.670

(+1) However, Ripley is wrong on the point where he says that the models must be nested. There is no such constraint on Akaike's original derivation, or, to be clearer, on the derivation using the AIC as an estimator of the Kullback-Leibler divergence. In fact, in a paper that I'm working on, I show somewhat "empirically" that the AIC can even be used for model selection of covariance structures (different numbers of parameters, clearly non-nested models). From the thousands of simulations of time series that I ran with different covariance structures, in none of them did the AIC get it wrong... – Néstor – 2012-08-14T17:06:43.550

...if "the correct" model is in fact on the set of models (this, however, also implies that for the models I'm working on, the variance of the estimator is very small...but that's only a technical detail). – Néstor – 2012-08-14T17:07:15.633

@Néstor, I agree. The point about the models being nested is strange. – NRH – 2012-08-16T06:43:46.717


Indeed, the only difference is that BIC is AIC extended to take the number of objects (samples) into account. I would say that while both are quite weak (compared to, for instance, cross-validation), it is better to use AIC, as more people will be familiar with the abbreviation; indeed, I have never seen a paper or a program where BIC was used (still, I admit that I'm biased toward problems where such criteria simply don't work).

Edit: AIC and BIC are equivalent to cross-validation under two important assumptions: that the model is fitted by maximum likelihood, and that you are only interested in model performance on the training data. For collapsing some data into some kind of consensus they are perfectly OK.
When building a prediction machine for a real-world problem, the first assumption is false, since your training set represents only a scrap of information about the problem you are dealing with, so you simply cannot optimize your model; the second is false, because you expect your model to handle new data for which you cannot even expect the training set to be representative. And to this end CV was invented: to simulate the behavior of the model when confronted with independent data. In the case of model selection, CV gives you not only an approximation of the quality but also the distribution of that approximation, so it has the great advantage that it can say "I don't know; whatever new data comes, either model could be better."
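The point about CV returning a distribution rather than a single number can be sketched as follows (a toy "predict the training mean" model, my own example):

```python
import random
import statistics

random.seed(7)
data = [random.gauss(10.0, 2.0) for _ in range(60)]

def kfold_scores(values, k):
    # Shuffle once, split into k folds, and return one held-out
    # mean-squared-error per fold for the mean-prediction model.
    idx = list(range(len(values)))
    random.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held = set(fold)
        train = [values[i] for i in idx if i not in held]
        m = sum(train) / len(train)
        scores.append(sum((values[i] - m) ** 2 for i in fold) / len(fold))
    return scores

scores = kfold_scores(data, 5)
# Unlike a single AIC/BIC number, CV yields a spread you can inspect:
# when two models' score spreads overlap, CV is telling you "I don't know".
print(statistics.mean(scores), statistics.stdev(scores))
```

Comparing the per-fold score lists of two candidate models, rather than just their means, is what gives CV the extra diagnostic power described above.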



Reputation: 19 511

@mbq - I don't see how cross-validation overcomes the "unrepresentativeness" problem. If your training data are unrepresentative of the data you will receive in the future, you can cross-validate all you want, but the result will be unrepresentative of the "generalisation error" you are actually going to face (as the "true" new data are not represented by the non-modeled part of the training data). Getting a representative data set is vital if you are to make good predictions. – probabilityislogic – 2011-05-13T14:21:44.363

@probabilityislogic Sure; I tried here to explain that *IC-based selection may be invalidated when viewed from the CV perspective; of course CV may be equally easily broken by bad sample selection. However, this won't help with selecting a better model. – mbq – 2011-05-13T16:16:08.747

@mbq - my point is that you seem to "gently reject" IC-based selection based on an alternative which doesn't fix the problem. Cross-validation is good (although is the computation worth it?), but unrepresentative data can't be dealt with using a data-driven process. At least not reliably. You need prior information which tells you how the data are unrepresentative (or, more generally, what logical connections the "unrepresentative" data have to the actual future data you will observe). – probabilityislogic – 2011-05-13T17:12:27.210

@probabilityislogic Well, I show that IC sux in comparison to CV, so the fact that CV sux too only makes IC sux even more. But you are right that I've abused the word "representative" in the answer -- I'll try to fix it. And in fact I'm a general denier of model selection =) – mbq – 2011-05-13T18:45:40.870

@mbq - model average ftw! – probabilityislogic – 2011-05-13T19:08:33.370

Does that mean that for certain sample sizes BIC may be less stringent than AIC? – russellpierce – 2010-07-23T21:36:49.540

Stringent is not the best word here; rather, more tolerant of parameters. Still, yup, for the common definitions (with the natural log) it happens for 7 or fewer objects. – mbq – 2010-07-23T22:13:56.467

AIC is asymptotically equivalent to cross-validation. – Rob Hyndman – 2010-07-24T01:47:58.633

@Rob Can you give a reference? I doubt that it is general. – mbq – 2010-07-24T08:03:12.467

@Rob From what I could find, this is true only for linear models. – mbq – 2010-07-24T08:13:15.780

@mbq. I was thinking of Shao 1995 which is, indeed, only for linear models. I don't know if the result has been extended to other models. – Rob Hyndman – 2010-07-27T13:30:26.377


From what I can tell, there isn't much difference between AIC and BIC. They are both mathematically convenient approximations one can make in order to efficiently compare models. If they give you different "best" models, it probably means you have high model uncertainty, which is more important to worry about than whether you should use AIC or BIC. I personally like BIC better because it asks more (less) of a model if it has more (less) data to fit its parameters, kind of like a teacher asking for a higher (lower) standard of performance if the student has more (less) time to learn the subject. To me this just seems like the intuitive thing to do. But then I am certain there also exist equally intuitive and compelling arguments for AIC, given its simple form.

Now, any time you make an approximation, there will surely be conditions under which that approximation is rubbish. This is certainly the case for AIC, where there exist many "adjustments" (AICc) to account for conditions that make the original approximation bad. This is also present for BIC, because various other more exact (but still efficient) methods exist, such as fully Laplace approximations to mixtures of Zellner's g-priors (BIC is an approximation to the Laplace approximation method for integrals).

One place where they are both crap is when you have substantial prior information about the parameters within any given model. AIC and BIC unnecessarily penalise models where parameters are partially known compared to models which require parameters to be estimated from the data.

One thing I think is important to note is that BIC does not assume a "true" model a) exists, or b) is contained in the model set. BIC is simply an approximation to an integrated likelihood $P(D|M,A)$ (D=Data, M=model, A=assumptions). Only by multiplying by a prior probability and then normalising can you get $P(M|D,A)$. BIC simply represents how likely the data were if the proposition implied by the symbol $M$ is true. So from a logical viewpoint, all propositions that would lead one to BIC as an approximation are equally supported by the data. So if I state $M$ and $A$ to be the propositions

$$\begin{array}{l} M_{i}:\text{the }i\text{th model is the best description of the data} \\ A:\text{out of the set of K models being considered, one of them is the best} \end{array} $$

And then continue to assign the same probabilities to the models (same parameters, same data, same approximations, etc.), I will get the same set of BIC values. It is only by attaching some unique meaning to the logical letter "M" that one gets drawn into irrelevant questions about "the true model" (echoes of "the true religion"). The only thing that "defines" M is the mathematical equations that use it in their calculations, and this hardly ever singles out one and only one definition. I could equally put in a prediction proposition about M ("the ith model will give the best predictions"). I personally can't see how this would change any of the likelihoods, and hence how good or bad BIC will be (the same goes for AIC, although AIC is based on a different derivation).

And besides, what is wrong with the statement "If the true model is in the set I am considering, then there is a 57% probability that it is model B"? Seems reasonable enough to me. Or you could go with the "softer" version: "there is a 57% probability that model B is the best out of the set being considered".

One last comment: I think you will find about as many opinions about AIC/BIC as there are people who know about them.



Reputation: 17 954


As you mentioned, AIC and BIC are methods to penalize models for having more regressor variables. A penalty function is used in these methods, which is a function of the number of parameters in the model.

  • When applying AIC, the penalty function is z(p) = 2 p.

  • When applying BIC, the penalty function is z(p) = p ln(n), which is based on interpreting the penalty as deriving from prior information (hence the name Bayesian Information Criterion).

When n is large the two criteria can produce quite different results: BIC applies a much larger penalty for complex models and hence leads to simpler models than AIC. However, as stated in the Wikipedia article on BIC:

it should be noted that in many applications..., BIC simply reduces to maximum likelihood selection because the number of parameters is equal for the models of interest.



Reputation: 385

Note that AIC is also equivalent to ML when the dimension doesn't change. Your answer makes it seem like this holds only for BIC. – probabilityislogic – 2011-05-13T12:10:48.610


AIC and BIC are information criteria for comparing models. Each tries to balance model fit and parsimony and each penalizes differently for number of parameters.

AIC is the Akaike Information Criterion; the formula is AIC = 2k - 2ln(L), where k is the number of parameters and L is the likelihood. With this formula, smaller is better. (I recall that some programs output the opposite, 2ln(L) - 2k, but I don't remember the details.)

BIC is the Bayesian Information Criterion; the formula is BIC = k ln(n) - 2ln(L), where n is the number of observations. It favors more parsimonious models than AIC.

I haven't heard of KIC.

Peter Flom


Reputation: 67 912

I haven't heard of KIC either, but for AIC and BIC have a look at the linked question, or search for AIC.

– Henrik – 2011-09-16T10:30:38.417

(This reply was merged from a duplicate question that also asked for interpretation of "KIC".) – whuber – 2011-09-16T17:49:12.867

The models don't need to be nested to be compared with AIC or BIC. – Macro – 2012-04-05T13:03:51.293


AIC should rarely be used, as it is really only valid asymptotically. It is almost always better to use AICc (AIC with a correction for finite sample size). AIC tends to overparameterize: that problem is greatly lessened with AICc. The main exception to using AICc is when the underlying distributions are heavily leptokurtic. For more on this, see the book Model Selection and Multimodel Inference by Burnham & Anderson.
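The finite-sample correction referred to here has a standard closed form, AICc = AIC + 2k(k+1)/(n-k-1), as given in Burnham & Anderson; a small sketch:

```python
import math

def aic(loglik, k):
    # AIC = -2*ln(likelihood) + 2*k
    return -2.0 * loglik + 2.0 * k

def aicc(loglik, k, n):
    # Small-sample correction; requires n > k + 1.
    return aic(loglik, k) + 2.0 * k * (k + 1) / (n - k - 1)

# The correction is large when n is close to k and vanishes as n grows,
# which is why AICc and AIC agree asymptotically.
for n in (10, 20, 1000):
    print(n, aicc(-100.0, 5, n) - aic(-100.0, 5))
```

Because the extra term only increases the penalty, AICc can never prefer a more complex model than AIC does on the same data.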



Reputation: 126

So what you are saying is that AIC doesn't sufficiently punish models for extra parameters, so using it as a criterion may lead to overparametrization, and you recommend the use of AICc instead. To put this back in the context of my initial question: since BIC is already more stringent than AIC, is there a reason to use AICc over BIC? – russellpierce – 2011-01-25T05:41:24.113

What do you mean by AIC being valid asymptotically? As pointed out by John Taylor, AIC is inconsistent. I think his comments contrasting AIC with BIC are the best ones given. I do not see the two as being the same as cross-validation. They all have the nice property that they usually peak at a model with fewer than the maximum number of variables. But they all can pick different models. – Michael Chernick – 2012-05-06T01:05:11.570