## Evaluation metrics for multiple values per session

I have an application that executes my foo() function several times for each user session. There are two alternative algorithms that I can implement as the foo() function, and my goal is to evaluate them based on execution delay.

The number of times foo() is called per user session varies but will not exceed 10000. Say the delay values (one inner list per session) are:

Algo1: [ [12, 30, 20, 40, 24, 280] , [13, 14, 15, 100], [20, 40] ]
Algo2: [ [1, 10, 5, 4, 150, 20] , [14, 10, 20], [21, 33, 41, 79] ]


My question is: what's the best metric to pick the winner?

Possible options:

1. Take the average of each session, then evaluate the CDF of those averages
2. Take the median of each session, then evaluate the CDF of those medians
3. Anything else?
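For reference, options 1 and 2 can be sketched roughly as follows, using the sample delays from the question. The helper names `per_session` and `ecdf` are made up for illustration, and the empirical CDF is only one plausible reading of "evaluate cdf".

```python
from statistics import mean, median

# Delay values copied from the question (units are not stated there).
algo1 = [[12, 30, 20, 40, 24, 280], [13, 14, 15, 100], [20, 40]]
algo2 = [[1, 10, 5, 4, 150, 20], [14, 10, 20], [21, 33, 41, 79]]

def per_session(sessions, stat):
    """Reduce each session's delays to a single number with `stat`."""
    return [stat(session) for session in sessions]

def ecdf(values):
    """Return the step points (value, cumulative fraction) of the empirical CDF.

    With tied values, the last pair for a value gives the fraction <= that value.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [(v, (i + 1) / n) for i, v in enumerate(ordered)]

for name, sessions in (("Algo1", algo1), ("Algo2", algo2)):
    print(name, "mean CDF:  ", ecdf(per_session(sessions, mean)))
    print(name, "median CDF:", ecdf(per_session(sessions, median)))
```

On this sample the single 280 pulls the first Algo1 session's mean far above its median, which is the usual argument for preferring medians when outliers are expected.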

Here is a suggestion:

Standardise everything (if you omit this, then a single big number like 9999 can ruin everything), then take the average value per user session. Then, optionally, multiply this number by x/10, for example, where x is the sample size in the user session (think of it as evidence, where more samples add more confidence), and finally average over the number of sessions for the algorithm.
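A minimal sketch of this suggestion is below. It assumes "standardise" means a z-score computed over the pooled delays of both algorithms, which the suggestion does not actually specify; `standardised_score` and its parameters are hypothetical names.

```python
from statistics import mean, pstdev

algo1 = [[12, 30, 20, 40, 24, 280], [13, 14, 15, 100], [20, 40]]
algo2 = [[1, 10, 5, 4, 150, 20], [14, 10, 20], [21, 33, 41, 79]]

def standardised_score(sessions, mu, sigma, weight_by_size=False):
    """Average of per-session mean z-scores; lower means less delay."""
    session_scores = []
    for session in sessions:
        z_mean = mean([(d - mu) / sigma for d in session])
        if weight_by_size:
            # Optional x/10 weighting from the suggestion (x = session size).
            z_mean *= len(session) / 10
        session_scores.append(z_mean)
    return mean(session_scores)

# Assumption: standardise against all observed delays from both algorithms.
pooled = [d for sessions in (algo1, algo2) for s in sessions for d in s]
mu, sigma = mean(pooled), pstdev(pooled)

for name, sessions in (("Algo1", algo1), ("Algo2", algo2)):
    print(name, standardised_score(sessions, mu, sigma),
          standardised_score(sessions, mu, sigma, weight_by_size=True))
```

One thing to keep in mind: standardised values can be negative, so the x/10 weight pushes a session's score away from zero in whichever direction it already points.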

It is common to look at 90th or 99th percentile latency in computer systems.

A user won't notice a difference of a couple of milliseconds of lag, but a function that occasionally takes several seconds is very noticeable.
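As a rough illustration of the percentile approach, the sketch below pools every call for each algorithm and reports nearest-rank percentiles. Pooling across sessions and the nearest-rank definition are assumptions made here (other percentile definitions interpolate), and `percentile` is a hypothetical helper.

```python
import math

algo1 = [[12, 30, 20, 40, 24, 280], [13, 14, 15, 100], [20, 40]]
algo2 = [[1, 10, 5, 4, 150, 20], [14, 10, 20], [21, 33, 41, 79]]

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

for name, sessions in (("Algo1", algo1), ("Algo2", algo2)):
    flat = [d for session in sessions for d in session]  # pool all calls
    print(name, {f"p{p}": percentile(flat, p) for p in (50, 90, 99)})
# Algo1 {'p50': 20, 'p90': 100, 'p99': 280}
# Algo2 {'p50': 20, 'p90': 79, 'p99': 150}
```

With only a dozen or so samples the 99th percentile is just the maximum; with the thousands of calls per session mentioned in the question, the tail percentiles become much more informative.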