Example papers which define the task would be welcome.
(I actually wanted to write this as an answer to the Cross Validated question: Difference between Anomaly and Outlier, but the question is protected - I think answering it here should be fine, despite the lower visibility)
People occasionally argue that there is no difference between an outlier and an anomaly by citing Charu Aggarwal, author of the book "Outlier Analysis" - particularly, this statement:
Outliers are also referred to as abnormalities, discordants, deviants, or anomalies in the data mining and statistics literature.
(Source: "Outlier Analysis" (Springer), Charu Aggarwal, 2017, http://charuaggarwal.net/outlierbook.pdf )
However, this statement does not imply that outliers and anomalies are the same thing, just as saying that "Dogs are sometimes referred to as animals" does not mean that dogs and animals are the same thing.
It's hard to give a formal definition of the terms. The Wikipedia page about outliers refers to the Wikipedia page about anomaly detection and vice versa, and they both contain lots of possible definitions and interpretations of the terms. Things get worse due to domain-specific definitions and colloquialisms, where it seems to be sufficient when two people from the same field roughly know what the other one is talking about...
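However, Varun Chandola tries to give a more precise meaning to the term "anomaly" in his anomaly detection survey. Particularly, he classifies anomalies into three categories: point anomalies (a single data instance that is anomalous with respect to the rest of the data), contextual anomalies (an instance that is anomalous only in a specific context), and collective anomalies (a collection of related instances that is anomalous with respect to the whole data set, even though the individual instances may not be).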
(Summarized from "Anomaly Detection - A Survey", Varun Chandola et al, ACM Computing Surveys 2009, http://cucis.ece.northwestern.edu/projects/DMS/publications/AnomalyDetection.pdf )
Here, the term "point anomaly" seems to be closest to what I'd consider as a possible definition of the word "outlier". And this is in line with the statement by Aggarwal: An outlier is an anomaly. But not every anomaly is an outlier.
(The latter may depend on the definition of the word outlier. Of course, one can define it on a meta-level, and say that an outlier is whatever a certain outlier detection algorithm (or model) detects as such. But most definitions that I encountered so far are based on some sort of "distance", "dissimilarity", or "difference" from a "majority" of other data elements. That sounds reasonable...)
An example: There may be several data points:
14.5, 14.2, 14.4, 14.4, 14.4, 14.4, 14.4, 14.4, 14.4, 14.3, 14.2, 14.6
One can compute the mean and standard deviation and will have a hard time arguing why one of these points should be an "outlier".
For a sequence of data points like this
14.5, 14.2, 14.4, 14.4, -64564.4, 14.4, 14.4, 14.4, 14.4, 14.3, 14.2, 14.6
spotting "the outlier" should be easy.
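To make the "compute the mean and standard deviation" argument concrete, here is a minimal sketch using the two sequences above. The function name `z_score_outliers` and the threshold of 3 standard deviations are my own assumptions for illustration, not a standard definition of "outlier".

```python
import statistics

# The two sequences from the text above.
stable = [14.5, 14.2, 14.4, 14.4, 14.4, 14.4, 14.4, 14.4, 14.4, 14.3, 14.2, 14.6]
spiked = [14.5, 14.2, 14.4, 14.4, -64564.4, 14.4, 14.4, 14.4, 14.4, 14.3, 14.2, 14.6]


def z_score_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` standard deviations from the mean.

    The threshold of 3.0 is an assumed, conventional cut-off, not a universal rule.
    """
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []
    return [v for v in values if abs(v - mean) / std > threshold]


print(z_score_outliers(stable))  # no value stands out
print(z_score_outliers(spiked))  # only the extreme value is reported
```

Note that such a rule is fragile for small samples: with 12 points and the population standard deviation, a single value can never reach a z-score above sqrt(11), which is roughly 3.32, so a slightly stricter threshold would not flag anything here.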
However, assuming that the first sequence describes, for example, average daily outside temperatures, the fact that the exact same average temperature of 14.4 degrees was measured for a whole week could certainly be considered as an "anomaly".
(Probably a "collective anomaly" according to the definitions above, but I won't argue about that...)
Although I'm on thin ice when arguing about the precise or intuitive meaning of certain terms (because I'm neither a data science expert nor a native English speaker), this would mean that "anomaly" is a much broader term than "outlier". But maybe the data science community is just in the process of sorting out proper definitions of these terms.
Maybe my gut feeling about the literal meaning of certain words is wrong. But for me, the word "outlier" seems to say "lying somewhere out of (or far away from) something (based on some distance measure)". In that sense, the 14.4s in the first example are not "outliers" per se. But of course, things become tricky very quickly here: One could imagine a model for the data that contains the number of consecutive days with equal temperatures (as in a run length encoding). Computing this model for the given data would yield
1 * 14.5
1 * 14.2
7 * 14.4
1 * 14.3
1 * 14.2
1 * 14.6
where the value 7 does have a large distance (difference) to the other values in the model. So the "collective anomaly" of 7 consecutive days with equal temperatures has been turned into a "point anomaly" by this transformation.
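The run length transformation described above can be sketched in a few lines. The cut-off `min_run` is a hypothetical choice of mine for what counts as a suspiciously long run; the text does not prescribe one.

```python
from itertools import groupby

# The first sequence from the text (assumed to be daily average temperatures).
temperatures = [14.5, 14.2, 14.4, 14.4, 14.4, 14.4, 14.4, 14.4, 14.4, 14.3, 14.2, 14.6]

# Run length encoding: collapse consecutive equal values into (length, value) pairs.
runs = [(len(list(group)), value) for value, group in groupby(temperatures)]

for length, value in runs:
    print(f"{length} * {value}")

# Hypothetical rule: treat any run of at least `min_run` equal readings as suspicious.
min_run = 3
long_runs = [(length, value) for length, value in runs if length >= min_run]
print(long_runs)
```

In the transformed representation, the single run of length 7 is the only entry that differs from the others, so a simple per-point rule is enough to find it.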
Fundamentally there is no difference. Say you have data and you want to build a model of it. As the name suggests, modeling is about finding a model, that is, a simplified representation of your data. In turn, we can view the model as an underlying process that generated your data in the first place, plus some noise. From that point of view, the data you see was generated by the model - and we can say that some of the points you see are less likely to have been generated by your model than others.
For example, if you build a linear regression model, points far away from the regression line are less likely to have been generated by the model. The distance of a point from the line is what statisticians call its 'residual', and under the usual assumption of Gaussian noise, a larger residual means a lower likelihood for that data point.
Data points that have low likelihood, according to the model you've created, are anomalies or outliers. From a model-building point of view, they are the same thing.
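As a rough illustration of "low likelihood under the model", here is a sketch that fits a straight line by ordinary least squares and scores each point with a Gaussian log-likelihood of its residual. The data in `xs` and `ys` is made up for this example (roughly y = 2x + 1 with one value deliberately corrupted), and the Gaussian noise model is an assumption.

```python
import math

# Made-up toy data: roughly y = 2x + 1, with the value at x = 5 deliberately corrupted.
xs = list(range(10))
ys = [1.0, 3.1, 4.9, 7.0, 9.1, 30.0, 13.0, 15.1, 16.9, 19.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least squares fit of y = intercept + slope * x.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x

# Residual: observed value minus the value predicted by the line.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Assumed noise model: residuals are Gaussian with standard deviation sigma.
sigma = math.sqrt(sum(r * r for r in residuals) / n)


def log_likelihood(residual):
    """Log-density of a residual under a zero-mean Gaussian with std sigma."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - residual**2 / (2 * sigma**2)


scores = [log_likelihood(r) for r in residuals]
least_likely = min(range(n), key=lambda i: scores[i])
print(least_likely, xs[least_likely], ys[least_likely])
```

The point with the lowest log-likelihood is the one with the largest absolute residual. Note that the corrupted value also pulls the fitted line towards itself, which is exactly the skewing effect that makes people want to remove such points.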
Colloquially, people use the term 'outlier' to mean "something I should remove from the dataset so that it doesn't skew the model I'm building", usually because they have a hunch that there is something wrong with that data and the model they want to build shouldn't need to account for it. An outlier is often considered a hindrance to building a model that describes the data overall - simply because the model will ALSO try to explain the outlier, which is not what the practitioner wants.
On the other hand, you can use the fact that a model also assigns a likelihood to each data point to your advantage - you might build a model that describes a simpler trend in the data, and then actively look for existing or new values that have very low likelihood. These are what people mean when they say 'anomalies'. If your goal is to detect anomalies, especially in new data, this is a great thing. One person's outlier is another person's anomaly!
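A minimal sketch of that second use: fit a simple model on existing data, then score new values against it. The history below reuses the temperature sequence from the other answer; the `new_readings`, the function name `is_anomalous`, and the threshold of 3 standard deviations are hypothetical choices for illustration.

```python
from statistics import NormalDist

# Existing data used to build the model (temperature sequence from the other answer).
history = [14.5, 14.2, 14.4, 14.4, 14.4, 14.4, 14.4, 14.4, 14.4, 14.3, 14.2, 14.6]

# Deliberately simple model: a single normal distribution fitted to the history.
model = NormalDist.from_samples(history)

# Hypothetical new observations to be checked against the model.
new_readings = [14.3, 25.0]


def is_anomalous(value, threshold=3.0):
    """Flag a value lying more than `threshold` standard deviations from the model mean."""
    return abs(value - model.mean) / model.stdev > threshold


flags = [is_anomalous(v) for v in new_readings]
print(list(zip(new_readings, flags)))
```

Here the model is not "cleaned" of anything; the interesting output is the set of new values it considers unlikely.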
An outlier is a data point that is out of the ordinary relative to the rest of the data.
An anomaly is a special case of an outlier: it may carry special or useful information, or there may be a specific reason behind it.