
I'm working with big-data time series and am trying to detect outliers. In my research I've come across a variety of simple methods (e.g. here and here), and I'm trying to understand how they differ from the most basic ones I've used in the past (quantiles and the IQR). I'm especially wondering about the point of, e.g., Grubbs'/Dixon's tests and Rosner's test, when the recommendation for the former is to validate the detections against boxplots, and Rosner's requires an a priori estimate of the number of outliers, e.g. as obtained from visual inspection of boxplots.

I guess I fail to see why I would use any of those tests when I could merely use 1.5*IQR +- Q1/Q3?
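Concretely, by that expression I mean flagging points outside the Tukey fences $[Q_1 - 1.5\,\mathrm{IQR},\ Q_3 + 1.5\,\mathrm{IQR}]$, i.e. something like this rough sketch (synthetic data, just to illustrate the rule):

```python
# Rough sketch of the rule I mean; data are synthetic, just for illustration.
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=10_000)  # stand-in for one (unordered) series of values

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
flagged = (x < lower) | (x > upper)
print(f"{flagged.sum()} of {x.size} points flagged")
```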

Any insight is highly appreciated.

  • All the tests and methods you mention are for one-dimensional data and don't take any time series structure into account. They can be used in various ways, depending on what your problem actually is: (1) do you want to detect an outlier within a single time series? (2) do you want to detect an outlying observation at a specific time point across various time series? Or (3) do you want to detect a whole time series that is outlying relative to the others? (3) can't be done using the IQR or the cited tests; for (1) they are questionable because the time order is informative.
  • As is often flagged here, and elsewhere, whether points lie beyond Q3 + 1.5 IQR or Q1 − 1.5 IQR was at most Tukey's rule-of-thumb suggestion for which points should be plotted individually on a box plot in a first-pass exploratory analysis. It was not, and is not, a test in any but a uselessly loose sense of the term, and it is dangerously crude if used as a single criterion for deciding which points are problematic, let alone to be excluded or omitted. (What you mean by your notation is clear only to those who recognise the allusion.)
  • Why are you testing for outliers at all? Also, "big data" naturally has outliers; Tukey's guideline was not only for one-dimensional data but also for small data sets. With large N, even perfectly normal data will have lots of outliers per Tukey.

2 Answers


I would be hesitant to discuss outlier detection methods "in general". In my experience, how to usefully (!) detect issues in data (I heartily dislike the term "outlier") will depend heavily on the context in which your data arise.

For instance, I do forecasting for retail demands, mainly for highly granular data on SKU/day/store level. At this level, there are many products that do not sell at all at a given store on 75% or more of days. But then, of course, the first and third quartiles are both zero, so the IQR is also zero, and any purely IQR-based rule will flag every nonzero demand as an "outlier", which of course makes no sense (see the toy example below). Conversely, promotions can heavily increase demands (and also the variability of demands!), so any given promotional demand may be quite far away from non-promotional demands, yet may or may not be an "outlier", or more precisely a data point that we may not want to train a forecasting model on.
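As a toy illustration of this failure mode (the numbers are invented, nothing like real sales data):

```python
# Intermittent demand: >= 75% of days are zero, so Q1 = Q3 = 0, IQR = 0,
# and the Tukey fences collapse to [0, 0]: every sale becomes an "outlier".
import numpy as np

demand = np.array([0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 3])  # invented

q1, q3 = np.percentile(demand, [25, 75])  # both 0 here
iqr = q3 - q1                             # 0
flagged = demand[(demand < q1 - 1.5 * iqr) | (demand > q3 + 1.5 * iqr)]
print(flagged)  # [2 1 3] -- every nonzero demand is flagged
```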

Thus, which data cleaning method works best in a given situation really depends on the context, and I can easily imagine situations in which one or another test does better. Especially in settings with "many" "different" time series, different methods may work best for different series, depending on their characteristics.

  • +1 for "I heartily dislike the term 'outlier'". Makes 2 of us :-)

I second the answer by @StephanKolassa. Also note that none of the tests and methods you cite takes the time series structure of your data into account, which should be done if (a) you want to detect entire time series that are outlying w.r.t. the overall set of series, or (b) you want to detect an outlier within a single time series w.r.t. the other values of the same series, independently of other series. The IQR rule and the tests above only make sense if the task is to detect an outlier at a given fixed time point, where one series may be outlying w.r.t. the values of the others.

That said, you are right to question the use of the formal tests vs. the IQR rule in practice, where these tests are not used that often. Note in particular that the tests start from the assumption that the non-outliers are normally distributed. This should not be tested in advance using QQ-plots or normality tests like Shapiro-Wilk (as somewhat misleadingly recommended in the cited document), because the assumption does not mean that the data overall are normally distributed (at least not if they have outliers). What it means is that the normal distribution is used as a "reference distribution": the user decides to brand as outliers those observations that deviate from normality in the "outer regions".

The normal distribution is something of a standard choice as "non-outlier reference" (the justification of which is a rather complicated and controversial issue), because it has most observations in the middle, and observations far away from the mean (in terms of standard deviations) are very unlikely; but it may be inappropriate, for example, for data that are inherently skewed. The choice of reference distribution should be based on knowledge of what kind of data behaviour is "standard" and how "outlyingness" plays out in the given specific situation. Many choose the normal distribution without thinking much about this, which may or may not be appropriate. For the problem of detecting an outlier within a time series, the reference should probably be a time series model, or something else that takes the time series character into account.

Based on the normal distribution assumption, the tests check whether the most extreme observation(s) are in line with what is expected in the extremes of the normal distribution. Under this assumption the tests can guarantee the chosen significance level as the type I error rate, and via a p-value they can formalise how sure we can be that certain extreme observations deviate from normality. This is useful information in some applications and is not given by the boxplot rule. Note, however, that the mere fact that an observation deviates from what would be expected under normality doesn't necessarily make this point "problematic" or "special". Particularly with big data sets, an observation doesn't even need to be particularly far away from the rest to be branded an "outlier"; it is enough if the standard data-generating process deviates somewhat from normality, which happens very often in practice. (Note also that outlier identification needs to depend on the consequences: do you want to remove all observations identified as outliers (often a bad idea), or would you take some steps to check how they came about and whether they are OK (often a better idea)?)
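To make the mechanics concrete, here is a hand-rolled sketch of the two-sided Grubbs test; scipy has no built-in Grubbs test, so the critical value is computed from the standard t-distribution formula, and the data with a planted outlier are synthetic:

```python
# Hand-rolled two-sided Grubbs test; critical value from the standard
# t-distribution formula (see e.g. the NIST handbook). Synthetic data.
import numpy as np
from scipy import stats

def grubbs_test(x, alpha=0.05):
    x = np.asarray(x, dtype=float)
    n = x.size
    G = np.max(np.abs(x - x.mean())) / x.std(ddof=1)  # test statistic
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)       # upper t critical value
    G_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return G, G_crit, G > G_crit

rng = np.random.default_rng(1)
x = np.append(rng.normal(size=50), 6.0)  # N(0,1) data plus one planted extreme
print(grubbs_test(x))  # G is well above G_crit (about 3.1 here), so it flags
```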

The boxplot rule is a quick exploratory tool, which can also be interpreted based on normality. For large numbers of observations from a normal distribution, it will identify somewhat less than 1% of the data as outliers (about 0.7%; see the simulation below). It is constructed not to rely that strongly on normality, but it still has some "similar-to-normal" characteristics (basically there should still be more concentration in the middle, i.e., between the first and third quartiles, with density going down toward the outer regions, and the rule for identifying outliers is symmetric; see Stephan Kolassa's answer for an example where this goes wrong). It is nice in many situations as a first orientation, but there is no guarantee that boxplot-outliers are "bad" or should be deleted. Still, for a first exploratory analysis, its information is usually more useful in practice (if not overinterpreted) than what you get from the formal tests.
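If you want to check that figure yourself, a quick simulation (sample size and seed are arbitrary):

```python
# Share of standard normal draws outside the Tukey fences, by simulation.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
outside = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
print(f"{outside.mean():.3%}")  # about 0.7%, matching 2*Phi(-2.698)
```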

