Revisions to Running ANOVA - must I remove outliers?

deleted 1 character in body


edited Dec 21, 2021 at 0:41


By contrast, if we take ranks of the combined data, the ranks will run from 1 to 40, so outliers are not likely. Yet, the relative standing of the values in the two samples is preserved. Consequently, the pooled 2-sample t test (correctly) does not reject.
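
The rank step described above can be sketched as follows. This is only an illustration under stated assumptions: the samples are freshly simulated (they are not the fictitious data of the answer), and the seed and the names `x1`, `x2`, `g`, `r` are arbitrary choices. The p-values it prints will therefore differ from those discussed in the answer.

```r
set.seed(2021)                        # arbitrary seed, for reproducibility only
x1 = rexp(20, 1);  x2 = rexp(20, 1)   # two samples from the same exponential population
x  = c(x1, x2)                        # combined data, 40 values
g  = rep(1:2, each = 20)              # group labels (assumed layout)

r = rank(x)                           # ranks of the combined data
range(r)                              # smallest and largest rank

# Pooled 2-sample t tests: raw values versus ranks
t.test(x ~ g, var.equal = TRUE)$p.value
t.test(r ~ g, var.equal = TRUE)$p.value
```

Because the ranks are bounded by the combined sample size, a single extreme observation can move its rank only as far as 40, which is why far outliers lose their leverage.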

added 49 characters in body


edited Dec 21, 2021 at 0:33

BruceET


For simplicity, I will illustrate with pooled 2-sample t tests instead of ANOVAs. Consider the following fictitious data comparing two samples of size 20 from the same exponential distribution. There is no difference between the two populations, so a test should not reject the null hypothesis. We look first at a pooled 2-sample t test.

Nevertheless, the pooled 2-sample t test is (narrowly) significant at the 5% level.

added 10 characters in body


edited Dec 20, 2021 at 19:40

BruceET


One difficulty (out of several) with far outliers in what ought to be normal data is that the null hypothesis (no differences) may be rejected too often, leading to false discovery.

g = rep(1:2, each=20)   # group labels; assumed here, not shown in this excerpt
pv.r = replicate(105, t.test(rank(c(rexp(20,1),rexp(20,1)))~g, 
                 var.eq=T)$p.val)
mean(pv.r <= .05)
[1] 0.01904762

In this case, the simulated rejection rate when there is no difference between the two populations is about 2%. Granted, about 5% would be better, but doing the pooled 2-sample t test on ranked data is better than ignoring the skewness of exponential data and the resulting outliers.
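
The simulation above uses 105 replications, so its estimate of the rejection rate is rough. A hedged sketch of a larger run that puts the raw-data and ranked-data tests side by side might look like this. The replication count `m`, the seed, and the helper `pv.one` are assumptions introduced for illustration and are not part of the original answer; no output is shown because it depends on the random draws.

```r
set.seed(1221)                  # arbitrary seed, for reproducibility only
m = 5000                        # number of simulated data sets (assumed)
g = rep(1:2, each = 20)         # group labels for two samples of size 20

# Hypothetical helper: p-value of a pooled t test on one fresh pair of
# exponential samples, either on the raw values or on their ranks
pv.one = function(use.ranks) {
  x = c(rexp(20, 1), rexp(20, 1))
  if (use.ranks) x = rank(x)
  t.test(x ~ g, var.equal = TRUE)$p.value
}

pv.raw  = replicate(m, pv.one(FALSE))
pv.rank = replicate(m, pv.one(TRUE))

mean(pv.raw  <= .05)            # estimated rejection rate, raw data
mean(pv.rank <= .05)            # estimated rejection rate, ranked data
```

Both populations are identical here, so each proportion estimates a type I error rate and can be compared directly with the nominal 5% level.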

answered Dec 20, 2021 at 19:32

BruceET
