## Micro Average vs Macro Average Performance in a Multiclass Classification Setting

200

I am trying out a multiclass classification setting with 3 classes. The class distribution is skewed, with most of the data falling in 1 of the 3 classes (class labels being 1, 2, 3, with 67.28% of the data falling in class 1, 11.99% in class 2, and the remaining 20.73% in class 3).

I am training a multiclass classifier on this dataset and I am getting the following performance:

|               | Precision | Recall | F1-Score |
|---------------|-----------|--------|----------|
| Micro Average | 0.731     | 0.731  | 0.731    |
| Macro Average | 0.679     | 0.529  | 0.565    |


I am not sure why all the micro-averaged metrics are equal, or why the macro-averaged metrics are lower than the micro-averaged ones.

Can't you look at the individual true positives etc. before averaging? Also, macro averages tend to be lower than micro averages. – oW_ – 2016-12-30T00:53:05.760

Are micro and macro F-measures specific to text classification or retrieval, or can they be used for any recognition or classification problem? If so, where can we find the significance of each, or any other reference? – idrees – 2018-02-19T07:24:50.427

Isn't the Micro Average Precision the same as the Accuracy of a data set? From what I understand, for Micro Average Precision you calculate the sum of all true positives and divide it by the sum of all true positives plus the sum of all false positives. So basically you divide the number of correctly identified predictions by the total number of predictions. Where is that any different from the accuracy calculation? Why do we need a new special precision term which makes things more complicated instead of simply sticking to the accuracy value? Please prove me wrong so I can sleep peacefully. – Nico Zettler – 2018-11-15T10:06:39.633

7

@NicoZettler You are correct. Micro-averaged precision and micro-averaged recall are both equal to the accuracy when each data point is assigned to exactly one class. As to your second question, micro-averaged metrics are different from the overall accuracy when the classifications are multi-labeled (each data point may be assigned more than one label) and/or when some classes are excluded in the multi-class case. See https://scikit-learn.org/stable/modules/model_evaluation.html#multiclass-and-multilabel-classification.

– Johnson – 2019-05-07T17:30:56.613

Just to add to Nico's point, in the micro average framework there's no concept of false negative – Tommaso Guerrini – 2020-05-13T11:27:22.773

You can refer to this article as well. It elaborates on using micro-averaged F1 score while dealing with multiclass classification.

– Shayan Shafiq – 2021-01-06T20:04:50.733

"Isn't the Micro Average Precision the same as the Accuracy of a data set?" > No, it isn't. The have the same numerator but the denominator is different. – Amr Keleg – 2021-01-29T17:27:09.790

## Answers

303

Micro- and macro-averages (for whatever metric) will compute slightly different things, and thus their interpretation differs. A macro-average will compute the metric independently for each class and then take the average (hence treating all classes equally), whereas a micro-average will aggregate the contributions of all classes to compute the average metric. In a multi-class classification setup, the micro-average is preferable if you suspect there might be class imbalance (i.e. you may have many more examples of one class than of other classes).

To illustrate why, take for example precision $Pr=\frac{TP}{(TP+FP)}$. Let's imagine you have a One-vs-All (there is only one correct class output per example) multi-class classification system with four classes and the following numbers when tested:

• Class A: 1 TP and 1 FP
• Class B: 10 TP and 90 FP
• Class C: 1 TP and 1 FP
• Class D: 1 TP and 1 FP

You can see easily that $Pr_A = Pr_C = Pr_D = 0.5$, whereas $Pr_B=0.1$.

• A macro-average will then compute: $Pr=\frac{0.5+0.1+0.5+0.5}{4}=0.4$
• A micro-average will compute: $Pr=\frac{1+10+1+1}{2+100+2+2}=0.123$

These are quite different values for precision. Intuitively, in the macro-average the "good" precision (0.5) of classes A, C and D is contributing to maintain a "decent" overall precision (0.4). While this is technically true (across classes, the average precision is 0.4), it is a bit misleading, since a large number of examples are not properly classified. These examples predominantly correspond to class B, so they only contribute 1/4 towards the average in spite of constituting 94.3% of your test data. The micro-average will adequately capture this class imbalance, and bring the overall precision average down to 0.123 (more in line with the precision of the dominating class B (0.1)).

For computational reasons, it may sometimes be more convenient to compute class averages and then macro-average them. If class imbalance is known to be an issue, there are several ways around it. One is to report not only the macro-average, but also its standard deviation (for 3 or more classes). Another is to compute a weighted macro-average, in which each class contribution to the average is weighted by the relative number of examples available for it. In the above scenario, we obtain:

$$Pr_{macro\text{-}mean}=0.25\cdot0.5+0.25\cdot0.1+0.25\cdot0.5+0.25\cdot0.5=0.4 \qquad Pr_{macro\text{-}stdev}=0.173$$

$$Pr_{macro\text{-}weighted}=0.0189\cdot0.5+0.943\cdot0.1+0.0189\cdot0.5+0.0189\cdot0.5=0.0094+0.0943+0.0094+0.0094 \approx 0.123$$

The large standard deviation (0.173) already tells us that the 0.4 average does not stem from a uniform precision among classes, but it might be just easier to compute the weighted macro-average, which in essence is another way of computing the micro-average.
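
For anyone who wants to verify these numbers, here is a minimal Python sketch (not part of the original answer; the per-class TP/FP counts are the ones from the example above) that computes the per-class, macro-averaged, micro-averaged, and prediction-weighted precisions:

```python
# Per-class (TP, FP) counts from the four-class example above.
counts = {"A": (1, 1), "B": (10, 90), "C": (1, 1), "D": (1, 1)}

# Per-class precision: TP / (TP + FP)
precision = {c: tp / (tp + fp) for c, (tp, fp) in counts.items()}
print(precision)                      # {'A': 0.5, 'B': 0.1, 'C': 0.5, 'D': 0.5}

# Macro-average: unweighted mean of the per-class precisions.
macro = sum(precision.values()) / len(precision)
print(round(macro, 3))                # 0.4

# Micro-average: pool all TP and FP before dividing.
total_tp = sum(tp for tp, fp in counts.values())
total_fp = sum(fp for tp, fp in counts.values())
print(round(total_tp / (total_tp + total_fp), 3))   # 0.123

# Weighted macro-average, weighting each class by its share of predictions
# (TP + FP); with these weights it recovers the micro-average.
total = total_tp + total_fp
weighted = sum((tp + fp) / total * precision[c] for c, (tp, fp) in counts.items())
print(round(weighted, 3))             # 0.123
```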

This answer deserves more upvotes, because it helps build an understanding of why micro and macro behave differently instead of just listing the formulas (and it is original content). – steffen – 2018-01-22T13:40:10.700

How does this explain the different macro values in the original question? – shakedzy – 2018-02-07T09:54:54.970

If you flip the scenario sketched in the reply, with the large class performing better than the small ones, you would expect to see micro average being higher than the macro average (which is the behavior reported in the question). That the macro values are different is more or less to be expected, since you are measuring different things (precision, recall...). Why the micro averages are all the same I believe is the question. – pythiest – 2018-02-08T16:57:47.043

25

I disagree with the statement that the micro average should be preferred over the macro average in the case of imbalanced datasets. In fact, for F scores, macro is preferred over micro, as the former gives equal importance to each class whereas the latter gives equal importance to each sample (which means the more samples a class has, the more say it has in the final score, thus favoring majority classes, much like accuracy).

Sources:

– shahensha – 2018-07-31T05:49:30.530

5

Is the "weighted macro-average" always going to equal the micro average? In Scikit-Learn, the definition of "weighted" is slightly different: "Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label). " From the docs for F1 Score.

– willk – 2018-08-06T20:13:59.737

This is the best explanation of micro/macro/weighted averaging I've ever seen. Thank you! – Alber8295 – 2018-11-03T11:06:55.303

Very good explanation, but I disagree with the part "In multi-class classification, micro-average is preferable if class is imbalance". It depends what the objective is. If it cares about the overall data (without preferring any class), 'micro' is just fine. However, let's say the class X is rare but very important; then 'macro' should be the better choice because it treats each class equally. 'micro' is better if we care more about overall accuracy. From my point of view, 'micro' is closer to 'accuracy', while 'macro' is a bit different since it is not dominated by the prevalent class. – Catbuilts – 2019-09-20T08:52:24.560

Does it make sense to add weight to prefer 1 class over another in macro average? Meaning if we consider class A (preferred) and B, low precision of B doesn't bother me too much as long as A's precision is high – Minh Thai – 2020-01-09T11:00:22.313

1@Catbuilts correct, it all depends on your application and how you want to handle the performance reporting. If you have multiple classes and care equally about them, the macro-average is a good solution. – pythiest – 2020-01-10T12:52:10.280

@MinhThai Yes, that is what the macro-weighted solution does. It can be generalized to introduce arbitrary weighting, but then you should always state what the weighting scheme is and why it was introduced. – pythiest – 2020-01-10T12:55:25.587

I agree with @shahensha, because if you have a classification problem which deals with rare events, a classification model which always predicts the common event will look like a great model from the micro F1 point of view. – caiohamamura – 2020-04-15T22:08:23.393

While this explanation makes sense for micro-averaged precision, how does it apply to recall? I am confused, since it looks like micro-averaged recall is the same as micro-averaged precision? – arun – 2020-12-03T18:04:53.737

32

This is the Original Post.

Tricky, but I found this very interesting. There are two methods by which you can get such an average statistic for information retrieval and classification.

## 1. Micro-average Method

In the micro-average method, you sum up the individual true positives, false positives, and false negatives of the system for the different sets and then apply them to get the statistics. For example, for one set of data, the system's counts are

True positive (TP1)  = 12
False positive (FP1) = 9
False negative (FN1) = 3


Then precision (P1) and recall (R1) will be $$P1=\frac{TP1}{TP1+FP1}=\frac{12}{12+9}=57.14\%$$ and $$R1=\frac{TP1}{TP1+FN1}=\frac{12}{12+3}=80\%$$

and for a different set of data, the system's

True positive (TP2)  = 50
False positive (FP2) = 23
False negative (FN2) = 9


Then precision (P2) and recall (R2) will be 68.49% and 84.75% respectively.

Now, the average precision and recall of the system using the Micro-average method is

$$\text{Micro-average of precision} = \frac{TP1+TP2}{TP1+TP2+FP1+FP2} = \frac{12+50}{12+50+9+23} \approx 65.96\%$$

$$\text{Micro-average of recall} = \frac{TP1+TP2}{TP1+TP2+FN1+FN2} = \frac{12+50}{12+50+3+9} \approx 83.78\%$$

The Micro-average F-Score will be simply the harmonic mean of these two figures.

## 2. Macro-average Method

The method is straightforward: just take the average of the precision and recall of the system on the different sets. For example, the macro-average precision and recall of the system for the given example are

$$\text{Macro-average precision} = \frac{P1+P2}{2} = \frac{57.14+68.49}{2} \approx 62.82\%$$ $$\text{Macro-average recall} = \frac{R1+R2}{2} = \frac{80+84.75}{2} \approx 82.38\%$$

The Macro-average F-Score will be simply the harmonic mean of these two figures.

Suitability: The macro-average method can be used when you want to know how the system performs overall across these sets of data. You should not come to any specific decision based on this average alone.

On the other hand, the micro-average can be a useful measure when your datasets vary in size.
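
To make the two methods concrete, here is a small Python sketch (not part of the original answer) that reproduces the numbers above from the raw counts:

```python
# (TP, FP, FN) counts for the two sets used above.
tp1, fp1, fn1 = 12, 9, 3
tp2, fp2, fn2 = 50, 23, 9

# Per-set precision and recall.
p1, r1 = tp1 / (tp1 + fp1), tp1 / (tp1 + fn1)            # 0.5714, 0.8000
p2, r2 = tp2 / (tp2 + fp2), tp2 / (tp2 + fn2)            # 0.6849, 0.8475

# Micro-average: pool the raw counts first, then apply the formulas.
micro_p = (tp1 + tp2) / (tp1 + tp2 + fp1 + fp2)          # 62/94 ≈ 0.6596
micro_r = (tp1 + tp2) / (tp1 + tp2 + fn1 + fn2)          # 62/74 ≈ 0.8378
micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)   # ≈ 0.7381

# Macro-average: average the per-set metrics, then take the harmonic mean.
macro_p = (p1 + p2) / 2                                   # ≈ 0.6282
macro_r = (r1 + r2) / 2                                   # ≈ 0.8237
macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r)    # ≈ 0.7128
```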

32

should you give credit to this blog post?

– xiaohan2012 – 2017-08-16T12:30:16.000

1

It might be worth noting that the F1-score here is not necessarily the same as the macro-averaged F1 score commonly used (as implemented in scikit-learn or described in this paper). Usually, the F1 score is calculated for each class/set separately and then the average is calculated from the different F1 scores (here, it is done in the opposite way: first calculating the macro-averaged precision/recall and then the F1-score).

– Milania – 2018-08-23T14:55:33.910

FYI original link is dead – Y.Terz – 2020-06-21T16:55:28.917

19

In a multi-class setting micro-averaged precision and recall are always the same.

$$P = \frac{\sum_c TP_c}{\sum_c TP_c + \sum_c FP_c}\\ R = \frac{\sum_c TP_c}{\sum_c TP_c + \sum_c FN_c}$$ where c is the class label.

Since in a multi-class setting every misclassified instance is counted once on each side, it turns out that $$\sum_c FP_c = \sum_c FN_c$$

Hence $P = R$. In other words, every false prediction is a false positive for the predicted class and a false negative for the true class. If you treat a binary classification problem as a two-class classification and compute the micro-averaged precision and recall, they will also be the same.

The answer given by Rahul covers the case of averaging binary precision and recall over multiple datasets, in which case the micro-averaged precision and recall can differ.
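
To see this concretely, here is a small scikit-learn sketch (the toy labels are made up for illustration, not taken from the question): in the single-label multiclass case the micro-averaged precision, recall, and F1 all collapse to the plain accuracy, while the macro-averaged values differ.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy single-label multiclass ground truth and predictions (illustrative only).
y_true = [1, 1, 1, 1, 1, 1, 2, 2, 3, 3]
y_pred = [1, 1, 1, 1, 1, 2, 2, 1, 3, 1]

# Micro-averaged precision, recall, and F1 all equal the accuracy.
print(precision_score(y_true, y_pred, average="micro"))   # 0.7
print(recall_score(y_true, y_pred, average="micro"))      # 0.7
print(f1_score(y_true, y_pred, average="micro"))          # 0.7
print(accuracy_score(y_true, y_pred))                     # 0.7

# Macro averages weight each class equally and therefore differ.
print(precision_score(y_true, y_pred, average="macro"))   # ≈ 0.738
print(recall_score(y_true, y_pred, average="macro"))      # ≈ 0.611
```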

3

That's how it should be. I had the same result in my research. It seemed weird at first, but precision and recall should be the same when micro-averaging the results of a multi-class, single-label classifier. This is because if you consider a misclassification of c1 as c2 (where c1 and c2 are two different classes), the misclassification is a false positive (fp) with respect to c2 and a false negative (fn) with respect to c1. If you sum the fn and fp over all classes, you get the same number, because each misclassification is counted as an fp with respect to one class and an fn with respect to another class.

2

The advantage of using the macro F1 score is that it gives equal weight to every class rather than to every data point.

For example, suppose the per-class F1 scores are T1 = 90%, T2 = 80%, and T3 = 5%. The macro F1 averages these values equally, so the poor performance on T3 clearly pulls the score down, whereas the micro F1 is dominated by the prevalent classes and can hide it.

1

I think the reason why macro average is lower than micro average is well explained by pythiest's answer (dominating class has better predictions and so the micro average increase).

But the fact that the micro average is equal for precision, recall, and F1 score is because micro-averaging these metrics results in the overall accuracy (as the micro average pools the predictions of all classes). Note that if precision and recall are equal, then the F1 score, being their harmonic mean, is equal to that same value.

As for the question of whether the "weighted macro-average" is always going to equal the "micro average": I did some experiments with different numbers of classes and different class imbalances, and it turns out that this is not necessarily true.

These statements are made under the assumption that we are considering all the classes of the same dataset (in contrast to Rahul Reddy Vemireddy's answer, which averages over separate datasets).
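
In case it helps, here is a tiny made-up scikit-learn example (not from the original answers) where the support-weighted macro precision differs from the micro precision, supporting the point above:

```python
from sklearn.metrics import precision_score

# Two-class toy example (illustrative only).
# Class 0: precision 1/1 = 1.0, support 3.  Class 1: precision 1/3, support 1.
y_true = [0, 0, 0, 1]
y_pred = [0, 1, 1, 1]

print(precision_score(y_true, y_pred, average="micro"))     # 0.5 (= accuracy)
print(precision_score(y_true, y_pred, average="weighted"))  # (3*1.0 + 1*(1/3)) / 4 ≈ 0.83
```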