Given data that is labeled as outliers, how can I classify data as outliers?


I have a dataset that is a mixture of sparse binary features and quantitative features. I only have definite outliers labeled. How should I approach trying to classify unlabeled data?

I considered using OSVM or other methods of one-class classification.

However, in my data the normal data points are clustered close to the mean. The outliers are generally points that deviate from the mean in any direction. My problem is that the outliers form a sort of high dimensional doughnut around the normal data.

Considering that the deviations occur in all directions, what algorithms would be best suited to the task? Keep in mind that I have significantly fewer labeled normal data points for training, although the normal points will outnumber the outliers in the unlabeled data.

PS I posted this question on Cross Validated as well. Which site should this question be posted on?

EDIT: Mahalanobis distance works fairly well. However, I have the labeled outliers. Is there some way I could use them to improve accuracy?


Posted 2018-08-28T21:35:51.240

Reputation: 123

Regarding "Which site should this question be posted on?": I think either is suitable in this case, although it would be best not to cross-post, but to wait a while for an answer on one before trying the other. Cross Validated answers are more likely to be theory-based and answers here might be more practice-based, but there is a large overlap. – Neil Slater – 2018-08-28T22:32:44.530



If you are sure that your data are actually normally distributed and that your outliers really do form a high-dimensional ring around your "good" data, you simply need a distance metric and a threshold on the distance from the mean. Mahalanobis distance is suitable for normal data. All points beyond the resulting high-dimensional ball (or ellipsoid, depending on the variance structure) are then outliers according to your description.
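As a minimal sketch of that idea, assuming roughly multivariate-normal quantitative features: the squared Mahalanobis distance is compared against a chi-square quantile. The synthetic data, the helper name `mahalanobis_sq`, and the 0.99 quantile are illustrative assumptions, not part of the original question, and sparse binary features may well violate the normality assumption.

```python
import numpy as np
from scipy.stats import chi2


def mahalanobis_sq(X, mean, cov):
    """Squared Mahalanobis distance of each row of X from `mean`."""
    diff = X - mean
    # Solve cov @ z = diff.T rather than forming the inverse explicitly.
    z = np.linalg.solve(cov, diff.T)
    return np.einsum("ij,ji->i", diff, z)


# Synthetic stand-in for the "normal" points (assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))

mean = X.mean(axis=0)
cov = np.cov(X, rowvar=False)

# For multivariate-normal data the squared distance is approximately
# chi-square distributed with d degrees of freedom, so a quantile of that
# distribution is one way to pick the threshold (0.99 is an arbitrary choice).
threshold = chi2.ppf(0.99, df=X.shape[1])

d2 = mahalanobis_sq(X, mean, cov)
is_outlier = d2 > threshold
```

Points whose squared distance exceeds `threshold` fall outside the ellipsoid and would be flagged. The quantile controls how many genuinely normal points get flagged by mistake.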


Posted 2018-08-28T21:35:51.240

Reputation: 291

Yeah, right after I posted this I started using Mahalanobis distance. I wasn't completely sure it was correct though. Thanks. – Halbort – 2018-08-29T06:12:36.067

I was wondering if I could do better. Mahalanobis does not require any labeling. But, I have a bunch of labeled outliers. – Halbort – 2018-08-29T08:08:28.583

The labelling can help you calculate the covariance matrix required for the Mahalanobis distance more accurately: leave the known outliers out of the calculation so that the variance is not overestimated. – Alex2006 – 2018-08-29T09:17:49.030
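A rough sketch of that suggestion, assuming a boolean mask marks the known outliers: the mean and covariance are estimated once on all points and once with the labeled outliers excluded. The synthetic ring data, the mask `known_outlier`, and the helper names are hypothetical and only there for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3

# Synthetic inliers near the mean (assumption for illustration).
inliers = rng.normal(size=(300, d))

# Synthetic "ring" outliers: random directions scaled to radius 6.
directions = rng.normal(size=(60, d))
outliers = 6.0 * directions / np.linalg.norm(directions, axis=1, keepdims=True)

X = np.vstack([inliers, outliers])
known_outlier = np.zeros(len(X), dtype=bool)   # hypothetical label mask
known_outlier[len(inliers):] = True


def fit(X):
    """Sample mean and covariance of the rows of X."""
    return X.mean(axis=0), np.cov(X, rowvar=False)


def mahalanobis_sq(X, mean, cov):
    """Squared Mahalanobis distance of each row of X from `mean`."""
    diff = X - mean
    return np.einsum("ij,ji->i", diff, np.linalg.solve(cov, diff.T))


# Estimate with and without the labeled outliers.
mean_all, cov_all = fit(X)
mean_clean, cov_clean = fit(X[~known_outlier])

# Distances of the outliers under each estimate.
d2_all = mahalanobis_sq(outliers, mean_all, cov_all)
d2_clean = mahalanobis_sq(outliers, mean_clean, cov_clean)
```

In this toy setup the covariance fitted on all points is inflated by the ring, so the outliers look closer to the mean than they do under the estimate fitted without them. Real data will only have some of the outliers labeled, so the effect would be smaller.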