## How to calculate mAP for detection task for the PASCAL VOC Challenge?

How to calculate the mAP (mean Average Precision) for the detection task for the Pascal VOC leaderboards?

It says there, on page 11:

Average Precision (AP). For the VOC2007 challenge, the interpolated average precision (Salton and Mcgill 1986) was used to evaluate both classification and detection. For a given task and class, the precision/recall curve is computed from a method’s ranked output. Recall is defined as the proportion of all positive examples ranked above a given rank. Precision is the proportion of all examples above that rank which are from the positive class. The AP summarises the shape of the precision/recall curve, and is defined as the mean precision at a set of eleven equally spaced recall levels [0,0.1,...,1]: AP = 1/11 ∑ r∈{0,0.1,...,1} pinterp(r)

The precision at each recall level r is interpolated by taking the maximum precision measured for a method for which the corresponding recall exceeds r: pinterp(r) = max p(r˜), where p(r˜) is the measured precision at recall ˜r

So does it mean that:

1. We calculate Precision and Recall:
• A) For many different IoU thresholds {0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1} we calculate True/False Positive/Negative values

Where True positive = number of detections with IoU above the given threshold {0, 0.1, ..., 1}, as said here, and then we calculate:

Precision = True positive / (True positive + False positive)

Recall = True positive / (True positive + False negative)

• B) Or for many different thresholds of detection algorithms we calculate:

Precision = True positive / (True positive + False positive)

Recall = True positive / (True positive + False negative)

Where True positive = number of detections with IoU > 0.5, as said here

• C) Or for many different thresholds of detection algorithms we calculate:

Precision = Intersect / Detected_box

Recall = Intersect / Object

As shown here?

2. Then we build the Precision-Recall curve, as shown here:

3. Then we calculate AP (average precision) as the average of 11 values of Precision at the points where Recall = {0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1}, i.e. AP = 1/11 ∑ recall∈{0,0.1,...,1} Precision(Recall)

(In general, for each point, for example 0.3, we take the MAX of Precision over Recall >= 0.3, instead of the value of Precision at the point Recall = 0.3)

4. And when we calculate AP for only one object class across all images, we get the AP (average precision) for this class, for example, only for aeroplanes.

So AP is an integral (the area under the curve).

But when we calculate AP for all object classes across all images, we get the mAP (mean average precision) for the whole dataset.
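The 11-point interpolated AP described above can be sketched in a few lines of Python. The `curve` values below are made up for illustration, not real VOC numbers:

```python
def interpolated_ap_11(points):
    """11-point interpolated AP (VOC2007 style).

    points: list of (recall, precision) pairs from the raw P-R curve.
    """
    ap = 0.0
    for t in [i / 10 for i in range(11)]:  # recall thresholds 0.0, 0.1, ..., 1.0
        # p_interp(t): maximum precision over all points whose recall >= t,
        # or 0 if no point reaches that recall level
        p_interp = max((p for r, p in points if r >= t), default=0.0)
        ap += p_interp
    return ap / 11

# Hypothetical raw P-R curve with three (recall, precision) points
curve = [(0.1, 1.0), (0.5, 0.6), (1.0, 0.2)]
print(interpolated_ap_11(curve))
```

Note that precision is taken as the maximum over all points at or beyond each recall threshold, which is exactly the `pinterp(r)` definition quoted from the devkit paper.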

Questions:

1. Is this right, and if not, then how do we calculate mAP for the Pascal VOC Challenge?
2. And which of the 3 formulas (A, B or C) is correct for calculating Precision and Recall in paragraph 1?

• mAP = AVG(AP for each object class)
• AP = AVG(Precision for each of 11 Recalls {recall = 0, 0.1, ..., 1})
• PR-curve = Precision and Recall (for each confidence threshold present among the predicted bounding boxes)
• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
• TP = number of detections with IoU>0.5
• FP = number of detections with IoU <= 0.5, or detections of an object already detected
• FN = number of objects not detected, or detected only with IoU <= 0.5
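The last averaging step in this summary is just an unweighted mean over classes. A minimal sketch (the class names and AP values below are hypothetical):

```python
def mean_average_precision(ap_per_class):
    """mAP: unweighted mean of the per-class AP values.

    Note there is no weighting by how many ground-truth objects
    each class has; every class contributes equally.
    """
    return sum(ap_per_class.values()) / len(ap_per_class)

# Hypothetical per-class AP values
aps = {"aeroplane": 0.5, "bicycle": 0.7, "bird": 0.6}
print(mean_average_precision(aps))  # mean of the three APs, 0.6
```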

1. Yes, your approach is right.
2. Of A, B and C, the right answer is B.

The explanation is the following: in order to calculate Mean Average Precision (mAP) in the context of object detection, you must compute the Average Precision (AP) for each class and then take the mean across all classes. The key is computing the AP for each class. In general, to compute Precision (P) and Recall (R) you must first define the True Positives (TP), False Positives (FP), True Negatives (TN) and False Negatives (FN). In the object-detection setting of the Pascal VOC Challenge, these are:

• TP: bounding boxes (BB) whose intersection over union (IoU) with the ground truth (GT) is above 0.5
• FP: two cases: (a) BBs whose IoU with the GT is below 0.5; (b) BBs whose matching GT object has already been detected (duplicate detections)
• TN: there are no true negatives; every image is expected to contain at least one object
• FN: ground-truth objects for which the method failed to produce a BB

Now each predicted BB has a confidence value for the given class. The scoring method sorts the predictions in decreasing order of confidence and computes P = TP / (TP + FP) and R = TP / (TP + FN) at each possible rank k = 1 up to the number of predictions. You now have a (P, R) pair for each rank; these pairs are the "raw" precision-recall curve. To compute the interpolated P-R curve, for each value of R you select the maximum P whose corresponding R' >= R.

There are two different ways to sample points of the P-R curve, according to the VOC devkit documentation. For VOC challenges before 2010, we select the maximum P obtained for any R' >= R, where R ranges over {0, 0.1, ..., 1} (eleven points); the AP is then the average of the precisions at these recall thresholds. For VOC challenges from 2010 onwards, we still select the maximum P for any R' >= R, but R now ranges over all unique recall values (including 0 and 1); the AP is then the area under the P-R curve. Note that if no point has a recall above some threshold, the precision value there is taken as 0.

For instance, consider the following output of a method for the class "Aeroplane":

BB         | confidence | GT
----------------------------
BB1        |  0.9       | 1
BB2        |  0.9       | 1
BB1 (dup.) |  0.8       | 1
BB3        |  0.7       | 0
BB4        |  0.7       | 0
BB5        |  0.7       | 1
BB6        |  0.7       | 0
BB7        |  0.7       | 0
BB8        |  0.7       | 1
BB9        |  0.7       | 1


Besides these, the method failed to detect the objects in two images, so we have FN = 2. The table above is the method's output ranked by confidence value; GT = 1 means the prediction matches a ground-truth object and GT = 0 means it does not. So TP = 5 (BB1, BB2, BB5, BB8 and BB9) and FP = 5. At rank 3 the precision drops because BB1's object was already detected, so even though the object is indeed present, the duplicate detection counts as a FP.

rank=1  precision=1.00 and recall=0.14
rank=2  precision=1.00 and recall=0.29
rank=3  precision=0.66 and recall=0.29
rank=4  precision=0.50 and recall=0.29
rank=5  precision=0.40 and recall=0.29
rank=6  precision=0.50 and recall=0.43
rank=7  precision=0.43 and recall=0.43
rank=8  precision=0.38 and recall=0.43
rank=9  precision=0.44 and recall=0.57
rank=10 precision=0.50 and recall=0.71


Given the previous results: if we use the pre-VOC2010 method, the interpolated precision values at the eleven recall thresholds are 1, 1, 1, 0.5, 0.5, 0.5, 0.5, 0.5, 0, 0, 0, so AP = 5.5 / 11 = 0.5 for the class "Aeroplane". If we use the method from VOC2010 onwards, the interpolated precision values are 1, 1, 1, 0.5, 0.5, 0.5, 0 at the seven unique recall values 0, 0.14, 0.29, 0.43, 0.57, 0.71, 1, so AP = (0.14-0)*1 + (0.29-0.14)*1 + (0.43-0.29)*0.5 + (0.57-0.43)*0.5 + (0.71-0.57)*0.5 + (1-0.71)*0 = 0.5 for the class "Aeroplane".
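As a sanity check, the whole worked example can be reproduced in a short Python sketch. The `tp_flags` list encodes whether each ranked detection counts as a TP: the rank-3 entry is the duplicate of BB1, so its flag is 0 (it counts as FP), and `n_positives = 7` counts the 5 matched objects plus the 2 missed ones:

```python
def precision_recall_curve(tp_flags, n_positives):
    """Cumulative (recall, precision) at each rank of the sorted detections."""
    tp = fp = 0
    curve = []
    for hit in tp_flags:
        tp += hit
        fp += 1 - hit
        curve.append((tp / n_positives, tp / (tp + fp)))
    return curve

def ap_voc2007(curve):
    """11-point interpolated AP (pre-2010 metric)."""
    return sum(
        max((p for r, p in curve if r >= t / 10), default=0.0) for t in range(11)
    ) / 11

def ap_voc2010(curve):
    """Area under the interpolated P-R curve (2010-onwards metric)."""
    recalls = [0.0] + [r for r, _ in curve] + [1.0]
    precs = [0.0] + [p for _, p in curve] + [0.0]
    # make precision monotonically non-increasing from right to left,
    # i.e. p_interp(r) = max precision over all recalls >= r
    for i in range(len(precs) - 2, -1, -1):
        precs[i] = max(precs[i], precs[i + 1])
    return sum(
        (recalls[i + 1] - recalls[i]) * precs[i + 1] for i in range(len(recalls) - 1)
    )

# TP flag per rank; rank 3 (duplicate detection of BB1's object) is a FP
tp_flags = [1, 1, 0, 0, 0, 1, 0, 0, 1, 1]
curve = precision_recall_curve(tp_flags, n_positives=7)
print(ap_voc2007(curve), ap_voc2010(curve))  # both come out to (approximately) 0.5
```

Both metrics agree at 0.5 here, matching the hand calculation above, though in general the two sampling schemes give slightly different numbers.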

Repeat this for each class, and then you have the mAP.

More information can be found in the following links 1, 2. Also you should check the paper: The PASCAL Visual Object Classes Challenge: A Retrospective for a more detailed explanation.

Thank you very much! So we should calculate rank/precision/recall across all images, not for each image separately. And in your first table, GT is equal to 1 if (IoU > 0.5), isn't it? One clarification about "besides it not detected bounding boxes in two images, so we have FN = 2": if we have 2 images with 2 objects each = 4 objects total, and we detected only 1 object, then how many FN will there be, 2 (as images) or 3 (as undetected objects)? – Alex – 2017-12-01T11:10:41.397

You're welcome! Yes, you should compute across all images. And GT is 1 if IoU > 0.5. Lastly, FN will be 3, for the 3 undetected objects – Dani Mesejo – 2017-12-01T12:26:27.417

So, for Pascal VOC challenge http://host.robots.ox.ac.uk:8080/leaderboard/displaylb.php?challengeid=11&compid=4 mAP = AVG(AP for each object class), and AP = AVG(Precision for each of 11 Recalls {0, 0.1, ..., 1}), where Precision = TP / (TP + FP) and Recall = TP / (TP + FN), where TP = number of detections with IoU>0.5, FP = number of detections with IoU<=0.5 and FN = number of objects not detected or detected with IoU<=0.5

– Alex – 2017-12-01T14:49:34.253

FN is the number of images where no prediction was made, FP is the number of detections with IoU <= 0.5 or detected more than once. See this pseudocode https://stats.stackexchange.com/a/263758/140597

– Dani Mesejo – 2017-12-01T15:53:06.340

"FN is the number of images were no prediction was made", but if 2 images has 3 objects, and all 3 objects are not detected, then you said "FN will be 3 for 3 not detected objects". So is FN the number of images were no prediction was made or number of objects that not detected? – Alex – 2017-12-01T17:37:37.140

Sorry, you're right, it is the number of objects not detected. – Dani Mesejo – 2017-12-01T18:15:39.953

@feynman410 so mAP is computed from the class APs simply by taking the simple average as sum(APs)/num_classes? There is no weighting by the number of ground-truth boxes in each class or anything? That just seems weird to me, since some object classes appear a lot more in the data than others. – Alex – 2018-01-18T00:48:46.437

@feynman410 Thank you for the great explanation. For anyone else who would like to see more examples, maybe a little bit easier to understand with images, I found this one: https://medium.com/@jonathan_hui/map-mean-average-precision-for-object-detection-45c121a31173

– Stav Bodik – 2018-05-27T17:07:55.113

feynman410's example: if I understand correctly there are 5 different items in the example (BB1, BB2, BB5, BB8 and BB9), which should have had easier names. Therefore the actual precision/recall values are as follows: rank=1 precision=1.0 and recall=0.2 ... rank=8 precision=3/8 and recall=0.6, rank=9 p – Daniel ziv – 2018-05-20T20:31:08.597

@feynman410 I got confused, can you please tell us where you place, in the table, objects that should have been detected but were not? At the end of the table? (because there is no score for them) – Martin Brisiak – 2018-07-31T10:29:37.823

1So "Precision" and "Recall" are computed separately for each class - in order to compute AP per class. Right? So are they computed separately on each image and then averaged, or are they computed over the total detections on all the images? – SomethingSomething – 2019-01-15T15:40:41.137

If a detection is recognized as 60% dog and 40% cat, then when thresholding at <40%, should I count the box twice, once for dog and once for cat? – SomethingSomething – 2019-01-15T15:43:24.730

There is a nice and detailed explanation, with easy-to-use code, on my GitHub.