How to compare paired count data?


I am working with a machine learning approach that counts cars in images. I have a predicted dataset, which is the predicted output from the machine learning approach and a paired "true" dataset, which is the result of a human going through each image and counting the number of cars.

The following is a sample of what the datasets look like (note that the actual dataset has 2500 paired samples):

import pandas as pd

d = {'true': [0,0,0,1,1,0,1,0,0,0,0,0,0,0,4,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1], 
     'predicted': [0,0,0,0,0,0,1,0,0,0,0,0,0,0,2,2,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1]}
df = pd.DataFrame(data=d)

    true  predicted
0      0          0
1      0          0
2      0          0
3      1          0
4      1          0
5      0          0
6      1          1
7      0          0
8      0          0
9      0          0
10     0          0
11     0          0
12     0          0
13     0          0
14     4          2
15     2          2
16     0          0
17     0          0
18     0          0
19     0          0
20     0          0
21     0          0
22     0          0
23     0          0
24     0          1
25     0          0
26     0          0
27     0          0
28     0          0
29     0          0
30     0          0
31     0          0
32     1          1

I am looking for a way to present the predictions to an audience so that they can see whether the predictions are statistically the same as the true observations, and to visualize any trends in the data (e.g. whether the approach tends to over- or under-predict). If these were categorical data I would use a confusion matrix, but I am not sure how to handle these paired, discrete datasets that are heavily weighted with 0's.

What approach can I take to statistically compare the predicted vs true datasets?


Posted 2019-04-16T02:47:22.027


How about a confusion matrix along with a weighted F1 score? You can choose weights that try to reflect the class imbalance. Maybe look into metrics used within medical research, because they often have to deal with large class imbalance too. – n1k31t4 – 2019-04-26T09:50:18.503
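A minimal sketch of this suggestion, assuming scikit-learn is available and treating each count value (0, 1, 2, 4) as a class:

```python
import pandas as pd
from sklearn.metrics import f1_score

d = {'true': [0,0,0,1,1,0,1,0,0,0,0,0,0,0,4,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1],
     'predicted': [0,0,0,0,0,0,1,0,0,0,0,0,0,0,2,2,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1]}
df = pd.DataFrame(data=d)

# Confusion matrix: rows are true counts, columns are predicted counts.
cm = pd.crosstab(df['true'], df['predicted'])
print(cm)

# Weighted F1 averages per-class F1 by class support, which partially
# accounts for the heavy imbalance toward 0.
score = f1_score(df['true'], df['predicted'], average='weighted')
print(f'weighted F1 = {score:.3f}')
```

Note that rare counts (here, the single 4) may never be predicted, so their per-class F1 is 0 and scikit-learn will warn about undefined precision.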



For visualizing paired 1D data, i.e. true vs predicted counts, you may use something like ggpaired (from the R package ggpubr).

You may also visualize the distribution of differences, where each sample is the difference between the true count and its predicted value; excluding the zero differences could better accentuate any deviation.
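A minimal pandas sketch of this idea, tabulating the nonzero differences (the same counts could just as well be drawn as a bar plot):

```python
import pandas as pd

d = {'true': [0,0,0,1,1,0,1,0,0,0,0,0,0,0,4,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1],
     'predicted': [0,0,0,0,0,0,1,0,0,0,0,0,0,0,2,2,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1]}
df = pd.DataFrame(data=d)

# Positive differences mean under-prediction, negative mean over-prediction.
diff = df['true'] - df['predicted']

# Dropping the zero differences accentuates any deviation.
nonzero = diff[diff != 0]
print(nonzero.value_counts().sort_index())
```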

For a statistical test, you may use the Wilcoxon signed-rank test (available in Python via scipy.stats.wilcoxon), or alternatively the sign test if the distribution of differences (true count - predicted count) is not symmetric around the mean.
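A sketch using scipy (assuming scipy >= 1.7 for binomtest). The Wilcoxon test discards zero differences by default, and the sign test can be carried out as an exact binomial test on the signs of the nonzero differences:

```python
import pandas as pd
from scipy.stats import wilcoxon, binomtest

d = {'true': [0,0,0,1,1,0,1,0,0,0,0,0,0,0,4,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1],
     'predicted': [0,0,0,0,0,0,1,0,0,0,0,0,0,0,2,2,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1]}
df = pd.DataFrame(data=d)

# Wilcoxon signed-rank test; zero differences are discarded by default.
stat, p = wilcoxon(df['true'], df['predicted'])
print(f'Wilcoxon: statistic={stat}, p={p:.3f}')

# Sign test: exact binomial test on the signs of the nonzero differences.
diff = df['true'] - df['predicted']
pos, neg = (diff > 0).sum(), (diff < 0).sum()
sign_p = binomtest(int(pos), n=int(pos + neg), p=0.5).pvalue
print(f'sign test: p={sign_p:.3f}')
```

With this tiny sample only four pairs survive the zero-discarding, so scipy may warn that the sample is small; the full 2500-pair dataset would not have that problem.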

It is worth noting that a statistical test may reject the "the same" null hypothesis for an arbitrarily small difference when a large enough sample is provided, so be careful when interpreting "significant". For example, the statement "predicted counts under-estimate the true counts" could be statistically significant with a mean difference of only 0.05; depending on the task, such a difference might be negligible. Therefore, reporting the "mean ± std" of the differences alongside the statistical test would help interpret "significant" better. Also, take a look at this post on effect size.
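For instance, the summary statistics can be computed directly; as one simple effect-size measure for paired data, Cohen's d (mean difference divided by its standard deviation) is sketched below:

```python
import pandas as pd

d = {'true': [0,0,0,1,1,0,1,0,0,0,0,0,0,0,4,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1],
     'predicted': [0,0,0,0,0,0,1,0,0,0,0,0,0,0,2,2,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1]}
df = pd.DataFrame(data=d)

diff = df['true'] - df['predicted']
mean, std = diff.mean(), diff.std()

# Paired effect size: mean difference in units of its standard deviation.
cohens_d = mean / std
print(f'difference = {mean:.3f} ± {std:.3f} (d = {cohens_d:.2f})')
```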




You can use a simple error measure of $\sum (\text{real.people}-\text{predicted.people})^2+\sum (\text{real.cars}-\text{predicted.cars})^2$.

In fact, counting algorithms of this kind typically implement this measure as their objective function.
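For the single-class car-counting data in the question, this measure reduces to one sum of squared differences; a minimal sketch:

```python
import pandas as pd

d = {'true': [0,0,0,1,1,0,1,0,0,0,0,0,0,0,4,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1],
     'predicted': [0,0,0,0,0,0,1,0,0,0,0,0,0,0,2,2,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1]}
df = pd.DataFrame(data=d)

# Squaring prevents over- and under-predictions from cancelling out.
sse = ((df['true'] - df['predicted']) ** 2).sum()
print(f'sum of squared errors = {sse}')
```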

Juan Esteban de la Calle


This approach would yield two numbers, one for each class. Would the results of your approach, for example, "person" -7 and "car" +4, be sufficient to describe the predicted accuracy? – Borealis – 2019-04-16T04:31:12.480

You are right, there is something to be corrected in the post.

I edited it; I put the square in the difference so that errors of opposite sign will not cancel out. – Juan Esteban de la Calle – 2019-04-16T04:45:26.500

I appreciate your help in this. I had to reword my question to clarify the problem I am trying to solve. – Borealis – 2019-04-26T03:56:00.630