I do not know a standard answer to this, but I thought about it some times ago and I have some ideas to share.
When you have one confusion matrix, you have more or less a picture of how you classification model confuse (mis-classify) classes. When you repeat classification tests you will end up having multiple confusion matrices. The question is how to get a meaningful aggregate confusion matrix. The answer depends on what is the meaning of meaningful (pun intended). I think there is not a single version of meaningful.
One way is to follow the rough idea of multiple testing. In general, you test something multiple times in order to get more accurate results. As a general principle one can reason that averaging on the results of the multiple tests reduces the variance of the estimates, so as a consequence, it increases the precision of the estimates. You can proceed in this way, of course, by summing position by position and then dividing by the number of tests. You can go further and instead of estimating only a value for each cell of the confusion matrix, you can also compute some confidence intervals, t-values and so on. This is OK from my point of view. But it tell only one side of the story.
The other side of the story which might be investigated is how stable are the results for the same instances. To exemplify that I will take an extreme example. Suppose you have a classification model for 3 classes. Suppose that these classes are in the same proportion. If your model is able to predict one class perfectly and the other 2 classes with random like performance, you will end up having 0.33 + 0.166 + 0.166 = 0.66 misclassification ratio. This might seem good, but even if you take a look on a single confusion matrix you will not know that your performance on the last 2 classes varies wildly. Multiple tests can help. But averaging the confusion matrices would reveal this? My belief is not. The averaging will give the same result more or less, and doing multiple tests will only decrease the variance of the estimation. However it says nothing about the wild instability of prediction.
So another way to do compose the confusion matrices would better involve a prediction density for each instance. One can build this density by counting for each instance, the number of times it was predicted a given class. After normalization, you will have for each instance a prediction density rather a single prediction label. You can see that a single prediction label is similar with a degenerated density where you have probability of 1 for the predicted class and 0 for the other classes for each separate instance. Now having this densities one can build a confusion matrix by adding the probabilities from each instance and predicted class to the corresponding cell of the aggregated confusion matrix.
One can argue that this would give similar results like the previous method. However I think that this might be the case sometimes, often when the model has low variance, the second method is less affected by how the samples from the tests are drawn, and thus more stable and closer to the reality.
Also the second method might be altered in order to obtain a third method, where one can assign as prediction the label with highest density from the prediction of a given instance.
I do not implemented those things but I plan to study further because I believe might worth spending some time.