One major difference is that the F1-score does not care at all about how many negative examples you classified correctly, or how many negative examples are in the dataset at all; the balanced accuracy metric, on the other hand, gives half its weight to how many positives you labeled correctly and half to how many negatives you labeled correctly.

When you are working on a heavily imbalanced dataset AND you care more about detecting positives than detecting negatives (as in outlier/anomaly detection), you would prefer the F1-score.

Say, for example, you have a validation set that contains 1000 negative samples and 10 positive samples. A model predicts 15 positive examples (5 of them truly positive and 10 incorrectly labeled) and the rest as negative, giving

```
TP=5; FP=10; TN=990; FN=5
```

Then its F1-score and balanced accuracy will be

$Precision = \frac{5}{15}=0.33...$

$Recall = \frac{5}{10}= 0.5$

$F_1 = 2 * \frac{0.5*0.33}{0.5+0.33} = 0.4$

$Balanced\ Acc = \frac{1}{2}(\frac{5}{10} + \frac{990}{1000}) = 0.745$

You can see that balanced accuracy still cares about the negative datapoints unlike the F1 score.
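If you want to check these numbers yourself, here is a small Python sketch (not part of the original answer; the helper names are my own) that computes both metrics straight from the confusion-matrix counts above:

```python
def f1_score(tp, fp, fn):
    # F1 is the harmonic mean of precision and recall;
    # note that TN never appears anywhere in this formula.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def balanced_accuracy(tp, fp, tn, fn):
    tpr = tp / (tp + fn)  # recall on the positive class
    tnr = tn / (tn + fp)  # recall on the negative class
    return (tpr + tnr) / 2

# First example: TP=5, FP=10, TN=990, FN=5
print(round(f1_score(5, 10, 5), 3))                # 0.4
print(round(balanced_accuracy(5, 10, 990, 5), 3))  # 0.745
```

Notice that `tn` is simply never an argument to `f1_score`; that is the whole difference in a nutshell.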

To dig a bit deeper, we can see what changes when the model gets exactly one extra positive example correct and one fewer negative example wrong (it still predicts 15 positives in total):

```
TP=6; FP=9; TN=991; FN=4
```

$Precision = \frac{6}{15}=0.4$

$Recall = \frac{6}{10}= 0.6$

$F_1 = 2 * \frac{0.6*0.4}{0.6+0.4} = 0.48$

$Balanced\ Acc = \frac{1}{2}(\frac{6}{10} + \frac{991}{1000}) = 0.796$

Correctly classifying an extra positive example increased the F1-score a bit more than the balanced accuracy (+0.08 versus roughly +0.05).
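A quick sketch to compare the two confusion matrices (helper definitions repeated so the snippet runs on its own; I use TN = 991 for the second matrix, i.e. the 1000 negatives minus the 9 false positives):

```python
def f1_score(tp, fp, fn):
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r)

def balanced_accuracy(tp, fp, tn, fn):
    return (tp / (tp + fn) + tn / (tn + fp)) / 2

# Before: TP=5, FP=10, TN=990, FN=5  ->  After: TP=6, FP=9, TN=991, FN=4
d_f1 = f1_score(6, 9, 4) - f1_score(5, 10, 5)
d_ba = balanced_accuracy(6, 9, 991, 4) - balanced_accuracy(5, 10, 990, 5)

# F1 gains ~0.08, balanced accuracy only ~0.05
print(round(d_f1, 3), round(d_ba, 4))
```

The F1-score moves more because the one flipped positive affects both precision and recall, while for balanced accuracy it only nudges one of the two averaged terms.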

Finally let's look at what happens when a model predicts there are still 15 positive examples (5 truly positive and 10 incorrectly labeled); *however*, this time the dataset is *balanced* and there are exactly 10 positive and 10 negative examples:

```
TP=5; FP=10; TN=0; FN=5
```

$Precision = \frac{5}{15}=0.33...$

$Recall = \frac{5}{10}= 0.5$

$F_1 = 2 * \frac{0.5*0.33}{0.5+0.33} = 0.4$

$Balanced\ Acc = \frac{1}{2}(\frac{5}{10} + \frac{0}{10}) = 0.25$

You can see that the F1-score did not change at all (compared to the first example), while the balanced accuracy took a massive hit (dropping from 0.745 to 0.25).
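The same sketch confirms the balanced-dataset case (helpers repeated so this runs standalone):

```python
def f1_score(tp, fp, fn):
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r)

def balanced_accuracy(tp, fp, tn, fn):
    return (tp / (tp + fn) + tn / (tn + fp)) / 2

# Same predictions as the first example (TP=5, FP=10, FN=5),
# but now there are only 10 negatives, all of them misclassified (TN=0).
print(round(f1_score(5, 10, 5), 3))        # 0.4  -- unchanged
print(balanced_accuracy(5, 10, 0, 5))      # 0.25 -- collapsed
```

Because `f1_score` never sees the true negatives, shrinking the negative class from 990 correct to 0 correct leaves it completely untouched.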

This shows how the F1-score only cares about the points the model *said* are positive and the points that *actually* are positive, and doesn't care at all about the plethora of points that are negative.

I really liked your answer, the concept and the examples are very clear! Thank you. One more question (maybe a stupid one): in case negative samples are almost as important as positive samples (even though the dataset is imbalanced), I think that balanced accuracy should be taken more into consideration than the F1-score. Does it make sense? – Ric S – 2020-05-12T08:24:44.830

Yes, I would say in that case more attention should be placed on balanced accuracy and Area Under ROC. – A Kareem – 2020-05-12T08:26:17.200