Micro- and macro-averages (of whatever metric) compute slightly different things, and thus their interpretation differs. A macro-average computes the metric independently for each class and then takes the average (hence treating all classes equally), whereas a micro-average aggregates the contributions of all classes to compute the average metric. In a multi-class classification setup, the micro-average is preferable if you suspect there might be class imbalance (i.e. you may have many more examples of one class than of the other classes).

To illustrate why, take for example precision $Pr=\frac{TP}{TP+FP}$. Let's imagine you have a *One-vs-All* multi-class classification system (there is only one correct class output per example) with four classes and the following numbers when tested:

- Class A: 1 TP and 1 FP
- Class B: 10 TP and 90 FP
- Class C: 1 TP and 1 FP
- Class D: 1 TP and 1 FP

You can see easily that $Pr_A = Pr_C = Pr_D = 0.5$, whereas $Pr_B=0.1$.

- A macro-average will then compute: $Pr=\frac{0.5+0.1+0.5+0.5}{4}=0.4$
- A micro-average will compute: $Pr=\frac{1+10+1+1}{2+100+2+2}=\frac{13}{106}\approx 0.123$
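As a quick sketch, both averages can be reproduced directly from the per-class TP/FP counts above (plain Python; the variable names are illustrative, not from any library):

```python
# TP/FP counts for the four classes in the example above
counts = {"A": (1, 1), "B": (10, 90), "C": (1, 1), "D": (1, 1)}

# Per-class precision: TP / (TP + FP)
per_class = {c: tp / (tp + fp) for c, (tp, fp) in counts.items()}

# Macro-average: unweighted mean of the per-class precisions
macro = sum(per_class.values()) / len(per_class)

# Micro-average: pool all TPs and FPs first, then compute a single precision
total_tp = sum(tp for tp, fp in counts.values())
total_fp = sum(fp for tp, fp in counts.values())
micro = total_tp / (total_tp + total_fp)

print(per_class)  # precisions of roughly 0.5, 0.1, 0.5, 0.5
print(macro)      # ~0.4
print(micro)      # 13/106, ~0.123
```

Note that the macro-average touches each class's counts only through its own precision, while the micro-average pools the raw counts before dividing.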

These are quite different values for precision. Intuitively, in the macro-average the "good" precision (0.5) of classes A, C and D helps maintain a "decent" overall precision (0.4). While this is technically true (across classes, the average precision is 0.4), it is a bit misleading, since a large number of examples are not properly classified. Those examples predominantly belong to class B, which nevertheless contributes only 1/4 towards the average despite accounting for 94.3% of your test data. The micro-average adequately captures this class imbalance and brings the overall precision down to 0.123, more in line with the precision of the dominating class B (0.1).

For computational reasons, it may sometimes be more convenient to compute per-class metrics and then macro-average them. If class imbalance is known to be an issue, there are several ways around it. One is to report not only the macro-average but also its standard deviation (for 3 or more classes). Another is to compute a weighted macro-average, in which each class's contribution to the average is weighted by the relative number of examples available for it. In the above scenario, we obtain:

$Pr_{macro\text{-}mean}=0.25\cdot 0.5+0.25\cdot 0.1+0.25\cdot 0.5+0.25\cdot 0.5=0.4$
$Pr_{macro\text{-}stdev}=0.173$

$Pr_{macro\text{-}weighted}=0.0189\cdot 0.5+0.9434\cdot 0.1+0.0189\cdot 0.5+0.0189\cdot 0.5=0.0094+0.0943+0.0094+0.0094\approx 0.123$

The large standard deviation (0.173) already tells us that the 0.4 average does not stem from a uniform precision across classes, but it may simply be easier to compute the weighted macro-average, which in essence is just another way of computing the micro-average.
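The three summaries above can be sketched with the standard-library `statistics` module; here the weights are each class's share of the predictions (TP+FP), matching the 0.0189/0.9434 factors used in the formula:

```python
import statistics

# Per-class precisions (A, B, C, D) and the number of predictions
# made for each class (TP + FP), taken from the example above
precisions = [0.5, 0.1, 0.5, 0.5]
n_predicted = [2, 100, 2, 2]

macro_mean = statistics.mean(precisions)     # ~0.4
macro_stdev = statistics.pstdev(precisions)  # ~0.173 (population std dev)

# Weighted macro-average: weight each class by its share of the predictions
total = sum(n_predicted)
macro_weighted = sum(n / total * p for n, p in zip(n_predicted, precisions))

# macro_weighted recovers the micro-average, 13/106 ~ 0.123
```

The equivalence is easy to see algebraically: each term is $\frac{TP_i+FP_i}{N}\cdot\frac{TP_i}{TP_i+FP_i}=\frac{TP_i}{N}$, so the weighted sum collapses to the pooled micro-average.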

Can't you look at the individual true positives etc. before averaging? Also, macro averages tend to be lower than micro averages – oW_ – 2016-12-30T00:53:05.760

Are micro and macro F-measures specific to text classification or retrieval, or can they be used for any recognition or classification problem? If so, where can we find the significance of each, or any other reference? – idrees – 2018-02-19T07:24:50.427

Isn't the Micro Average Precision the same as the Accuracy of a data set? From what I understand, for Micro Average Precision you calculate the sum of all true positives and divide it by the sum of all true positives plus the sum of all false positives. So basically you divide the number of correctly identified predictions by the total number of predictions. Where is that any different from the accuracy calculation? Why do we need a new special precision term which makes things more complicated instead of simply sticking to the accuracy value? Please prove me wrong so I can sleep peacefully. – Nico Zettler – 2018-11-15T10:06:39.633


@NicoZettler You are correct. Micro-averaged precision and micro-averaged recall are both equal to the accuracy when each data point is assigned to exactly one class. As to your second question, micro-averaged metrics are different from the overall accuracy when the classifications are multi-labeled (each data point may be assigned more than one label) and/or when some classes are excluded in the multi-class case. See https://scikit-learn.org/stable/modules/model_evaluation.html#multiclass-and-multilabel-classification. – Johnson – 2019-05-07T17:30:56.613

Just to add to Nico's point, in the micro-average framework there's no concept of false negative – Tommaso Guerrini – 2020-05-13T11:27:22.773

You can refer to this article as well. It elaborates on using the micro-averaged F1 score when dealing with multiclass classification. – Shayan Shafiq – 2021-01-06T20:04:50.733

"Isn't the Micro Average Precision the same as the Accuracy of a data set?" > No, it isn't. They have the same numerator but the denominator is different. – Amr Keleg – 2021-01-29T17:27:09.790