The statement "it tests combinations of features" is **not true**. A tree tests individual features at each split. However, a tree can approximate *any* continuous function $f$ over the training points, since it is a universal approximator, just like neural networks.

In a Random Forest (or a Decision Tree, or a Regression Tree), candidate splits on *individual* features are compared to each other, not splits on combinations of features; the most informative individual feature is then picked to split a leaf. Therefore, there is no notion of a "better combination" anywhere in the process.
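The point can be made concrete with a minimal pure-Python sketch of split selection in a regression tree (function names are my own, not from any library). Note that the inner loop scores each feature *on its own*; no combination of features is ever evaluated as a unit:

```python
def variance(ys):
    """Mean squared deviation of a list of targets (0.0 if empty)."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def best_split(X, y):
    """Return (feature_index, threshold) minimizing the weighted child variance.

    Each candidate split tests a single feature j against a threshold t;
    combinations of features are never considered.
    """
    n, d = len(X), len(X[0])
    best = (None, None, variance(y) * n)     # baseline: no split
    for j in range(d):                       # each individual feature
        for t in sorted({row[j] for row in X}):
            left  = [y[i] for i in range(n) if X[i][j] <= t]
            right = [y[i] for i in range(n) if X[i][j] >  t]
            score = variance(left) * len(left) + variance(right) * len(right)
            if score < best[2]:
                best = (j, t, score)
    return best[0], best[1]
```

For example, on the four points of `Y = X1*X2` over binary inputs, `best_split([[0, 0], [0, 1], [1, 0], [1, 1]], [0, 0, 0, 1])` picks a split on a single feature (feature 0 at threshold 0) even though the target is a product of both features.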

Furthermore, Random Forest is a bagging algorithm that does not favor any of the randomly-built trees over the others; they all have the same weight in the aggregated output.
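The aggregation step is literally just an unweighted average. A sketch (the trees here are stand-in functions, not real fitted trees):

```python
def forest_predict(trees, x):
    """Random-Forest-style regression aggregation: every tree's
    prediction carries exactly the same weight."""
    preds = [tree(x) for tree in trees]
    return sum(preds) / len(preds)

# Three stand-in "trees" (plain functions); none is favored over another.
trees = [lambda x: 1.0, lambda x: 2.0, lambda x: 3.0]
```

Calling `forest_predict(trees, x)` for any `x` returns the plain mean of the three predictions; there is no per-tree weight that could encode "this tree found a better combination of features."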

It is worth noting that "Rotation Forest" first applies PCA to the features, which means each new feature is a linear combination of the original features. However, this does not count, since the same pre-processing could be applied before any other method too.

**EDIT**:

@tam provided a counter-example for XGBoost, which is not the same as Random Forest. However, the issue is the same for XGBoost: its learning process comes down to splitting each leaf **greedily** based on a single feature, *instead of* selecting the best combination of features among a set of combinations, or the best tree among a set of trees.

From this explanation, you can see that *the structure score* is defined for a tree (which is a function) based on the first- and second-order derivatives of the loss function in each leaf $j$ ($G_j$ and $H_j$ respectively), summed over all $T$ leaves, i.e.
$$\text{obj}^*=-\frac{1}{2} \sum_{j=1}^{T}\frac{G_j^2}{H_j + \lambda} + \gamma T$$
However, the optimization process **greedily** splits a leaf on the single feature (and threshold) that yields the highest gain in $\text{obj}^*$.
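Given the $\text{obj}^*$ formula above, the gain of one candidate split is the change in structure score from replacing one leaf by two. A sketch (`lam` and `gamma` stand for $\lambda$ and $\gamma$; the subscripts L/R denote the proposed left and right children):

```python
def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    """Gain of splitting one leaf into (L, R), per the structure score:
    1/2 * [G_L^2/(H_L+lam) + G_R^2/(H_R+lam) - (G_L+G_R)^2/(H_L+H_R+lam)] - gamma
    """
    def leaf_score(G, H):
        # Contribution of a leaf to -2 * obj* (the G^2/(H+lambda) term).
        return G * G / (H + lam)
    return 0.5 * (leaf_score(G_L, H_L) + leaf_score(G_R, H_R)
                  - leaf_score(G_L + G_R, H_L + H_R)) - gamma
```

XGBoost evaluates this scalar for every candidate *(single feature, threshold)* pair and keeps the argmax; there is no enumeration over feature combinations or whole trees.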

A tree $t$ is built by **greedily** minimizing the loss, i.e. branching on the best individual feature, and when the tree is built, the process moves on to create the next tree $t+1$ in the same way, and so on.
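That sequential loop can be sketched in a few lines of pure Python. This is a toy gradient-boosting loop on one feature, with depth-1 trees (stumps) as the base learners; all names are my own, and the learning rate and round count are arbitrary illustration values:

```python
def fit_stump(xs, ys):
    """Greedily fit the best single-threshold split on a 1-D feature."""
    best = None
    for t in sorted(set(xs)):
        left  = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm = sum(left) / len(left) if left else 0.0
        rm = sum(right) / len(right) if right else 0.0
        err = (sum((y - lm) ** 2 for y in left)
               + sum((y - rm) ** 2 for y in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, rounds=3, lr=0.5):
    """Build trees one after another: each round greedily fits a new
    stump to the residuals of the current ensemble, then moves on."""
    ensemble = []
    pred = [0.0] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, residuals)      # tree t: greedy, individual feature
        ensemble.append(stump)                # then proceed to tree t+1
        pred = [p + lr * stump(x) for p, x in zip(pred, xs)]
    return lambda x: lr * sum(s(x) for s in ensemble)
```

At no point does the loop compare alternative trees or feature combinations against each other; each round commits to one greedily-built learner and continues.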

Here is the key quote from XGBoost paper:

> This score is like the impurity score for evaluating decision trees, except that it is derived for a wider range of objective functions [..] Normally it is impossible to enumerate all the possible tree structures q. A greedy algorithm that starts from a single leaf and iteratively adds branches to the tree is used instead.

In summary:

Although a tree represents a combination of features (a function), neither XGBoost nor Random Forest selects between functions. They build and aggregate multiple functions by greedily favoring **individual** features.

(+1) As an example, Tree 1 works with features (A, B) and gives 80% accuracy, Tree 2 works with features (C, D) and gives 60%. A boosting algorithm puts more weight on Tree 1, thus effectively favors f(A, B) over g(C, D). – Esmailian – 2019-03-31T19:14:45.003

Thank you for your answer. However, to be honest, I would like a more in-depth answer. To start with, my second question is still unanswered, I think: "Also, if it is true, then is it about any kind of combination of features (e.g. X*W, X+W+Z etc.) or only for some specific ones (e.g. X+W)?" – Outcast – 2019-04-01T10:49:24.077

Please refer to this link ( http://mariofilho.com/can-gradient-boosting-learn-simple-arithmetic/ ). The article discusses how boosting trees can model arithmetic operations like X*W, X/W, etc. Theoretically, it is possible: trees, like neural networks, are universal approximators (theoretically). And I am stressing the word theoretically. – tam – 2019-04-01T11:05:46.760

Ok, thank you for this too. However, to start with, both of the other people here are claiming the opposite of you, so it is quite difficult for me to draw a definite conclusion. – Outcast – 2019-04-01T11:26:52.760

Also, by the way, in your answer you are saying "... has the capability of capturing different feature interactions". However, my question is whether this is built into Random Forest (or into boosting algorithms). In a sense, linear regression also has the "capability" of doing this, but you would have to program it explicitly, i.e. add some lines of code where you add or multiply some of the features, etc. – Outcast – 2019-04-01T14:04:47.907

By capability, I meant the ability to capture the relationship without explicitly specifying it, as one must in linear regression. I think we are now all on the same page about boosting algorithms. Coming to a simple decision tree, I think it also has the capability to capture any relationship: it does not just learn linear decision functions, it can learn a non-linear function too, since its decision boundary can approximate any shape. You can try fitting a decision tree to Y = X1*X2. So to summarize, theoretically, all tree-based algorithms can capture any feature interactions. – tam – 2019-04-01T23:32:12.563
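The Y = X1*X2 experiment suggested in this comment can be checked with a minimal pure-Python recursive regression tree (no libraries; all names are my own). Even though every split tests a single feature, a depth-2 tree fits the product exactly on binary inputs, because the interaction is captured by the tree *structure* rather than by any explicit X1*X2 feature:

```python
def fit_tree(X, y, depth=2):
    """Minimal CART-style regression tree: greedy splits on one feature at a time."""
    mean = sum(y) / len(y)
    if depth == 0 or len(set(y)) == 1:
        return lambda x: mean                  # leaf: predict the mean target
    best = None
    for j in range(len(X[0])):                 # individual features only
        for t in sorted({row[j] for row in X}):
            li = [i for i in range(len(X)) if X[i][j] <= t]
            ri = [i for i in range(len(X)) if X[i][j] > t]
            if not li or not ri:
                continue
            lm = sum(y[i] for i in li) / len(li)
            rm = sum(y[i] for i in ri) / len(ri)
            err = (sum((y[i] - lm) ** 2 for i in li)
                   + sum((y[i] - rm) ** 2 for i in ri))
            if best is None or err < best[0]:
                best = (err, j, t, li, ri)
    if best is None:
        return lambda x: mean
    _, j, t, li, ri = best
    left  = fit_tree([X[i] for i in li], [y[i] for i in li], depth - 1)
    right = fit_tree([X[i] for i in ri], [y[i] for i in ri], depth - 1)
    return lambda x, j=j, t=t: left(x) if x[j] <= t else right(x)

# Fit Y = X1*X2 on all binary input combinations.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [x1 * x2 for x1, x2 in X]
tree = fit_tree(X, y)
```

On this grid the fitted tree reproduces the product exactly, which illustrates the "theoretically possible" side of the discussion; on continuous inputs it would only approximate X1*X2 piecewise, and never by selecting an X1*X2 combination as a split variable.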