Gini Impurity vs Entropy

28

20

Can someone give a practical explanation of the rationale behind Gini impurity vs. information gain (based on entropy)?

Which metric is better to use in different scenarios while using decision trees?

Krish Mahajan

Posted 2016-02-12T22:05:41.193

Reputation: 141

5

@Anony-Mousse I guess that was obvious before your comment. The question is not whether both have their advantages, but in which scenarios one is better than the other.

Martin Thoma 2016-02-14T10:34:55.057

I have proposed "Information gain" instead of "Entropy", since it is a closer fit (IMHO), as marked in the related links. The question was then asked in a different form in When to use Gini impurity and when to use information gain?

Laurent Duval 2016-02-16T07:23:36.827

1

I have posted here a simple interpretation of the Gini impurity that may be helpful.

Picaud Vincent 2017-11-06T11:33:25.850

Answers

20

Gini impurity and information-gain entropy are pretty much the same, and people do use them interchangeably. Below are the formulae of both:

  1. $\textit{Gini}: \mathit{Gini}(E) = 1 - \sum_{j=1}^{c}p_j^2$
  2. $\textit{Entropy}: H(E) = -\sum_{j=1}^{c}p_j\log p_j$

Given a choice, I would use the Gini impurity, as it doesn't require me to compute logarithmic functions, which are computationally intensive. The closed form of its solution can also be found.
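
For concreteness, here is a minimal sketch of both measures computed from a node's class probabilities (the `gini` and `entropy` helpers below are illustrative, not from any particular library):

```python
# A minimal sketch of both impurity measures, computed from a node's class
# probabilities p_j (function names are illustrative, not from any library).
import numpy as np

def gini(p):
    """Gini impurity: 1 - sum(p_j^2)."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Entropy: -sum(p_j * log2(p_j)); zero-probability classes contribute 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Example: a node whose samples fall into three classes with proportions 0.7/0.2/0.1
print(gini([0.7, 0.2, 0.1]))     # 0.46
print(entropy([0.7, 0.2, 0.1]))  # ~1.157 bits
```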

Which metric is better to use in different scenarios while using decision trees?

The Gini impurity, for reasons stated above.

So, they are pretty much the same when it comes to CART analytics.

Helpful reference for computational comparison of the two methods

Dawny33

Posted 2016-02-12T22:05:41.193

Reputation: 4 688

It is so common to see the formula for entropy, while what is really used in decision trees looks like conditional entropy. I think that is an important distinction, or am I missing something?

user1700890 2017-08-12T13:01:18.253

@user1700890 The ID3 algorithm uses information-gain entropy. I need to read up on conditional entropy. Probably an improvement over ID3 :)

Dawny33 2017-08-12T13:55:34.813

1

I think your definition of the Gini impurity might be wrong: https://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity

Martin Thoma 2017-10-19T11:30:58.387

7

Generally, your performance will not change whether you use Gini impurity or Entropy.

Laura Elena Raileanu and Kilian Stoffel compared both in "Theoretical comparison between the gini index and information gain criteria". The most important remarks were:

  • It matters in only 2% of cases whether you use Gini impurity or entropy.
  • Entropy might be a little slower to compute (because it makes use of the logarithm).
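
To check this on a concrete dataset, a rough sketch (assuming scikit-learn is available; the iris data and the split below are arbitrary choices, not taken from the cited paper) is to fit the same tree with each criterion:

```python
# Fit the same decision tree with each splitting criterion and compare
# test accuracy; on most datasets the two results are identical or very close.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for criterion in ("gini", "entropy"):
    clf = DecisionTreeClassifier(criterion=criterion, random_state=0)
    clf.fit(X_train, y_train)
    print(criterion, accuracy_score(y_test, clf.predict(X_test)))
```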

I was once told that both metrics exist because they emerged in different disciplines of science.

Archie

Posted 2016-02-12T22:05:41.193

Reputation: 373

7

  • Gini is intended for continuous attributes, and entropy for attributes that occur in classes.
  • Gini is to minimize misclassification.
  • Entropy is for exploratory analysis.
  • Entropy may be a little slower to compute.

NIMISHAN

Posted 2016-02-12T22:05:41.193

Reputation: 173

5

For the case of a variable with two values, appearing with fractions $f$ and $(1-f)$, the Gini impurity and entropy are given by:

$$\mathit{Gini} = 2f(1-f)$$
$$H = f\ln\frac{1}{f} + (1-f)\ln\frac{1}{1-f}$$

These measures are very similar when scaled to a maximum of 1.0 (plotting $2 \cdot \mathit{Gini}$ and $H/\ln 2$):

Gini (purple) and entropy (green) values scaled for comparison
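
A small sketch (assuming NumPy and Matplotlib) that reproduces this kind of plot:

```python
# Plot 2*gini and entropy/ln(2) over the class fraction f; both curves are
# scaled so that their maximum is 1.0 at f = 0.5.
import numpy as np
import matplotlib.pyplot as plt

f = np.linspace(1e-6, 1 - 1e-6, 500)
gini = 2 * f * (1 - f)
entropy = f * np.log(1 / f) + (1 - f) * np.log(1 / (1 - f))

plt.plot(f, 2 * gini, label="2 * gini")
plt.plot(f, entropy / np.log(2), label="entropy / ln(2)")
plt.xlabel("f")
plt.legend()
plt.show()
```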

DanLvii Dewey

Posted 2016-02-12T22:05:41.193

Reputation: 51

3

To add to the fact that they are more or less the same, consider also that:
$$
\begin{split}
\forall \; 0 < u < 1,\; \log (1-u) &= -u - u^2/2 - u^3/3 \, - \, \cdots\\
\forall \; 0 < p < 1,\; \log (p) &= p-1 - (1-p)^2/2 - (1-p)^3/3 \, - \, \cdots\\
\end{split}
$$
so that:
$$
\forall \; 0 < p < 1,\; -p \log (p) = p(1-p) + p(1-p)^2/2 + p(1-p)^3/3 \, + \, \cdots
$$
See the following plot of the two functions, normalised so that their maximum value is 1: the red curve is Gini, the black one is entropy.

Normalised Gini and entropy criteria
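
A quick numerical check of that expansion (a sketch; the value of $p$ and the number of terms are arbitrary):

```python
# Verify that -p*log(p) matches the series p(1-p) + p(1-p)^2/2 + p(1-p)^3/3 + ...
import numpy as np

p = 0.3
exact = -p * np.log(p)
series = sum(p * (1 - p) ** k / k for k in range(1, 200))
print(exact, series)  # both ~0.36119; they agree to many decimal places
```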

In the end, as explained by @NIMISHAN, Gini is more suitable for minimising misclassification as it is symmetric about 0.5, while entropy penalises small probabilities more heavily.

clemlaflemme

Posted 2016-02-12T22:05:41.193

Reputation: 131