What is the difference between outlier detection and anomaly detection?

10

4

I would like to know the difference in terms of applications (e.g. which one is credit card fraud detection?) and in terms of used techniques.

Example papers which define the task would be welcome.

Martin Thoma

Posted 2017-11-15T11:17:32.773

Reputation: 15 590

Have you had a look at this? https://stats.stackexchange.com/questions/189664/difference-between-anomaly-and-outlier. It seems the answer to your question is there.

– moh – 2017-11-15T12:00:19.183

@Moh I've seen it, and I think the answers are not very clear. This is why I asked for applications and techniques to be included in answers to my question. – Martin Thoma – 2017-11-15T12:04:15.283

In particular, there seems to be no consensus on whether those two terms have different meanings. Let's see if this community finds a consensus / authoritative resources. – Martin Thoma – 2017-11-15T12:22:24.027

Answers

8

(I actually wanted to write this as an answer to the Cross Validated question: Difference between Anomaly and Outlier, but the question is protected - I think answering it here should be fine, despite the lower visibility)

People occasionally argue that there is no difference between an outlier and an anomaly by citing Charu Aggarwal, author of the book "Outlier Analysis" - particularly, this statement:

Outliers are also referred to as abnormalities, discordants, deviants, or anomalies in the data mining and statistics literature.

(Source: "Outlier Analysis" (Springer), Charu Aggarwal, 2017, http://charuaggarwal.net/outlierbook.pdf )

However, this statement does not imply that outliers and anomalies are the same thing, just as saying "Dogs are sometimes referred to as animals" does not mean that dogs and animals are the same thing.

It's hard to give a formal definition of the terms. The Wikipedia page about outliers refers to the Wikipedia page about anomaly detection and vice versa, and both contain many possible definitions and interpretations of the terms. Domain-specific definitions and colloquialisms make things worse: it often seems to be sufficient that two people from the same field roughly know what the other one is talking about...

However, Varun Chandola tries to give a more precise meaning to the term "anomaly" in his anomaly detection survey. Particularly, he classifies anomalies into three categories:

  • Point anomalies: An individual data instance can be considered as anomalous with respect to the rest of data
  • Contextual Anomalies: If a data instance is anomalous in a specific context (but not otherwise)
  • Collective Anomalies: If a collection of related data instances is anomalous with respect to the entire data set

(Summarized from "Anomaly Detection - A Survey", Varun Chandola et al, ACM Computing Surveys 2009, http://cucis.ece.northwestern.edu/projects/DMS/publications/AnomalyDetection.pdf )


Here, the term "point anomaly" seems to be closest to what I'd consider as a possible definition of the word "outlier". And this is in line with the statement by Aggarwal: An outlier is an anomaly. But not every anomaly is an outlier.

(The latter may depend on the definition of the word outlier. Of course, one can define it on a meta-level, and say that an outlier is whatever a certain outlier detection algorithm (or model) detects as such. But most definitions that I encountered so far are based on some sort of "distance", "dissimilarity", or "difference" from a "majority" of other data elements. That sounds reasonable...)

An example: There may be several data points:

14.5, 14.2, 14.4, 14.4, 14.4, 14.4, 14.4, 14.4, 14.4, 14.3, 14.2, 14.6

One can compute the mean and standard deviation and will have a hard time arguing why one of these points should be an "outlier".

For a sequence of data points like this

14.5, 14.2, 14.4, 14.4, -64564.4, 14.4, 14.4, 14.4, 14.4, 14.3, 14.2, 14.6

spotting "the outlier" should be easy.
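A minimal sketch of the mean/standard-deviation check described above, applied to both sequences. The function name `z_score_outliers` and the threshold of 3 standard deviations are assumptions chosen for illustration (a common rule of thumb, not a definition taken from the cited sources):

```python
from statistics import mean, pstdev


def z_score_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold.

    The threshold of 3.0 is an assumed rule of thumb, not a fixed standard.
    """
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical: no point can be flagged
    return [v for v in values if abs(v - mu) / sigma > threshold]


first = [14.5, 14.2, 14.4, 14.4, 14.4, 14.4, 14.4, 14.4, 14.4, 14.3, 14.2, 14.6]
second = [14.5, 14.2, 14.4, 14.4, -64564.4, 14.4, 14.4, 14.4, 14.4, 14.3, 14.2, 14.6]

print(z_score_outliers(first))   # no value is flagged
print(z_score_outliers(second))  # only -64564.4 is flagged
```

Note that the extreme value clears the threshold only narrowly: it inflates the mean and the standard deviation it is measured against, and with 12 points a population z-score cannot exceed sqrt(11), roughly 3.32. A distance-based check of this kind also says nothing about the run of identical 14.4 values in the first sequence.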

However, assuming that the first sequence describes, for example, average daily outside temperatures, the fact that the exact same average temperature of 14.4 degrees was measured for a whole week could certainly be considered as an "anomaly".

(Probably a "collective anomaly" according to the definitions above, but I won't argue about that...)


Although I'm on thin ice when arguing about the precise or intuitive meaning of certain terms (because I'm neither a data science expert nor a native English speaker), this would mean that "anomaly" is a much broader term than "outlier". But maybe the data science community is just in the process of sorting out proper definitions of these terms.

Update:

Maybe my gut feeling about the literal meaning of certain words is wrong. But for me, the word "outlier" seems to say "lying somewhere out of (or far away from) something (based on some distance measure)". In that sense, the 14.4s in the first example are not "outliers" per se. But of course, things become tricky very quickly here: One could imagine a model for the data that contains the number of consecutive days with equal temperatures (as in a run length encoding). Computing this model for the given data would yield

1 * 14.5
1 * 14.2
7 * 14.4
1 * 14.3
1 * 14.2
1 * 14.6

where the value 7 does have large distance (difference) to the other values in the model. So the "collective anomaly" of 7 consecutive days with equal temperatures has been turned into a "point anomaly" by this transformation.
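The run-length transformation above can be sketched with `itertools.groupby` from the Python standard library. The variable names are hypothetical, and picking the longest run is just one simple way to surface the unusual entry:

```python
from itertools import groupby

temps = [14.5, 14.2, 14.4, 14.4, 14.4, 14.4, 14.4, 14.4, 14.4, 14.3, 14.2, 14.6]

# Run-length encode: one (run_length, value) pair per block of equal consecutive values.
runs = [(len(list(group)), value) for value, group in groupby(temps)]

for length, value in runs:
    print(f"{length} * {value}")

# In the transformed data, the run length itself is the feature to inspect.
longest = max(runs)  # tuples compare by run length first
print("longest run:", longest)
```

After the transformation, any distance-based check can be applied to the run lengths `1, 1, 7, 1, 1, 1` instead of the raw temperatures.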

Marco13

Posted 2017-11-15T11:17:32.773

Reputation: 390

Very informative. What is keeping us from using "point outliers", "contextual outliers", and "collective outliers"? I think nothing is forcing a distinction. – Esmailian – 2019-03-06T12:35:57.660

@Esmailian I think that the distinction between "outlier" and "anomaly" can make sense. But giving a precise definition of each of these terms that is applicable in every context could be hard (or maybe impossible). I added a short Update pointing out what my interpretation/definition of the word "outlier" is, and how difficult it can be to apply such a definition rigorously... – Marco13 – 2019-03-06T15:01:08.940

The problem with this is that it is a subjective interpretation. If you could underline the difference with exact citations, it would be much more helpful. – Code Pope – 2019-10-08T13:38:47.880

@CodePope What exactly does this refer to? I added four "citations", for that matter, but pointed out that even the most widely used definitions are vague and sometimes even contradict each other. – Marco13 – 2019-10-08T13:48:39.847

Of course, you added four citations, but none of them says that there is a difference between outlier and anomaly or that outlier is a subelement of anomaly. Additionally, none of your citations, nor any other paper that I have read, agrees with your interpretation that outliers are point anomalies. It is the common intuition that outliers are single points, but this is not what formal definitions imply. As an example: "An observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data." (Barnett and Lewis - 1994) – Code Pope – 2019-10-08T15:23:09.367

Thus, if you know any paper which agrees with one of your statements (1. outlier is a subset (or subelement) of anomaly 2. outliers are point anomalies), then I would appreciate it if you could cite it here. – Code Pope – 2019-10-08T15:28:17.660

I have not said that outliers are point anomalies. Not saying "are", but "...seems to be closest to what I'd consider as a possible definition..." was intentional - I'm very careful here, and tried to emphasize that the water is really muddy here. E.g. your definition (for "outlier"?, by Barnett) mentions a "subset of observations", which Chandola would probably refer to more specifically as a "Collective Anomaly". I do not claim that what I said are "The Only Correct Definitions®". Quite the contrary. There are multiple, vague definitions, and none that everybody will agree on. – Marco13 – 2019-10-08T15:31:32.617

Regarding one specific point: People have claimed that outliers and anomalies are essentially the same, seemingly because Aggarwal said "Outliers are ... anomalies". And I said that ""anomaly" [would be] a much broader term than "outlier"", because what Aggarwal said simply does not imply that they are "the same". Generally, "A is a B" does not mean "A equals B". – Marco13 – 2019-10-08T15:37:56.313

Regarding the point that ""anomaly" [would be] a much broader term than "outlier"": Here is a part of the book where Aggarwal defines the relationship between these two terms in just the opposite way, at least for his book: "throughout this book, the term “outlier” refers to a data point that could either be considered an abnormality or noise, whereas an “anomaly” refers to a special kind of outlier that is of interest to an analyst". – Code Pope – 2019-10-08T18:53:28.363

@CodePope I see. One could argue that this is the point where it becomes domain specific (until now, I have just been talking about "points in n-D space", so to speak). For example: When you observe the light intensity in a room, and someone like the janitor (regularly) switches on the light between 7:00 and 7:01 am, then the sudden spike in light intensity (as a measured value) could be considered as an "outlier" according to any reasonable definition. But when you know that the janitor does that regularly, it's not an "anomaly" (for the analyst). But when it happens at 9:23pm, it may be. – Marco13 – 2019-10-08T22:57:41.360

7

Fundamentally there is no difference. Say you have data and you want to build a model of it. As the name suggests, modeling is about finding a model, that is, a simplified representation of your data. In turn, we can view the model as an underlying process that generated your data in the first place, plus some noise. From that point of view, the data you see was generated by the model - and we can say that some of the points you see are less likely to have been generated by your model than others.

For example, if you build a linear regression model, points far away from the regression line are less likely to have been generated by the model. The distance of a point from the line is what people call its 'residual' in normal statistical parlance. Under the usual assumption of Gaussian noise, a larger residual means a lower likelihood of that point under the model.

Data points that have low likelihood, according to the model you've created, are anomalies or outliers. From a model-building point of view, they are the same thing.
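A minimal sketch of the regression example above, under stated assumptions: the data are made up (a perfect line `y = 2x + 1` with one point shifted by hand), and `fit_line` is a hypothetical helper implementing ordinary least squares rather than a library function:

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Made-up data: points on y = 2x + 1, with the point at x = 6 shifted upwards.
xs = list(range(10))
ys = [2.0 * x + 1.0 for x in xs]
ys[6] += 15.0

slope, intercept = fit_line(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# The point with the largest absolute residual is the least plausible under the model.
worst = max(range(len(xs)), key=lambda i: abs(residuals[i]))
print("least likely point: x =", xs[worst], "residual =", round(residuals[worst], 2))
```

Whether that point is then dropped as an "outlier" or reported as an "anomaly" is a decision about what to do with it; the computation is the same in both cases.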

Colloquially, people use the term 'outlier' to mean "something I should remove from the dataset so that it doesn't skew the model I'm building", usually because they have a hunch that there is something wrong with that data and the model they want to build shouldn't need to account for it. An outlier is often considered a hindrance to building a model that describes the data overall - simply because the model will ALSO try to explain the outlier, which is not what the practitioner wants.

On the other hand, you can use the fact that a model also assigns a likelihood to each data point to your advantage - you might build a model that describes a simpler trend in the data, and then actively look for existing or new values that have very low likelihood. These are what people mean when they say 'anomalies'. If your goal is to detect anomalies, especially in new data, this is a great thing. One person's outlier is another person's anomaly!

tom

Posted 2017-11-15T11:17:32.773

Reputation: 1 938

0

An outlier is a data point that is out of the ordinary relative to the rest of the data.

An anomaly is a special case of an outlier: one that may carry special or useful information, or have a specific cause.

jatin gupta

Posted 2017-11-15T11:17:32.773

Reputation: 1