Outlier detection using the difference between two z-scores

Question

Long story short: can the difference between the z-scores of two variables be used as an outlier detector?

I have a data set of poor quality, with lots of measurement and human error and probably instrumental error as well. Because of this, the measured variables had large variance, and where linear relationships were expected between two variables, little or none was found.

After searching for valid methods to remove outliers and not finding anything particularly helpful, I tried the old but golden method of z-score detection, where anything with a z-score of 2.5 or higher was removed. This helped a bit but was not sufficient. However, while looking at two variables which I knew should theoretically have a strong positive linear correlation (r = 0.80 expected) but didn't, it occurred to me that their z-scores were wildly different: when variable 1 had a z-score close to 1, variable 2 had something close to -2.5, for example.

I then thought that the problem with the data wasn't that the variables varied too much from their own means, but that they varied differently from each other. I then took the difference in z-score between the two variables and set an arbitrary outlier threshold of 1 and -1, meaning any observation with a difference above 1 or below -1 was removed.
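The filtering step described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library; the function names (`z_scores`, `z_difference_filter`) and the toy data are hypothetical and are not taken from the original analysis.

```python
import statistics

def z_scores(values):
    """Classical z-scores: (value - mean) / sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def z_difference_filter(x, y, threshold=1.0):
    """Return the indices of observations whose z-score difference
    lies within [-threshold, +threshold]; everything else is dropped."""
    zx, zy = z_scores(x), z_scores(y)
    return [i for i, (a, b) in enumerate(zip(zx, zy)) if abs(a - b) <= threshold]

# Hypothetical toy data: y roughly tracks x, except for the last pair.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 2.0, 2.9, 4.2, 5.0, 0.5]

kept = z_difference_filter(x, y)
print(kept)  # indices of the observations that survive the filter
```

Note that the z-scores here are computed from the mean and SD of the full sample, so the points being screened also influence the scale they are screened against.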

Unsurprisingly, this removed about 50% of my observations, but surprisingly it cleaned up the data set to such a degree that I had to double-check everything to make sure what exactly had happened. The whole data set now showed stronger relationships between variables where they were expected, and variables that had not even been investigated previously had few outliers as well. After the z-score removal, the cross-validated RMSE (RMSECV) was halved.

I'm very worried that I've committed a major no-no in statistical theory. Searching the web was fruitless, as I couldn't find anyone trying this method.

So my questions are:

Has anyone else tried this method of outlier removal?
Have I just reinvented the wheel, and is this a normal method of outlier removal?
What can I do to verify that this isn't just plain cherry-picking?
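On the last question, one possible check is to apply the same filter to data that has no relationship at all and see what it produces. The sketch below is only an illustration under stated assumptions: the two variables are simulated as independent standard normals, and the sample size, seed, and helper names (`z_scores`, `pearson_r`) are arbitrary choices, not part of the original analysis.

```python
import random
import statistics

def z_scores(values):
    """Classical z-scores: (value - mean) / sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def pearson_r(x, y):
    """Pearson correlation computed from z-scores (n - 1 divisor)."""
    zx, zy = z_scores(x), z_scores(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)

random.seed(0)  # arbitrary seed, for reproducibility only
n = 2000

# Assumption: x and y are generated independently, so any correlation
# seen after filtering is produced by the filter, not by the data.
x = [random.gauss(0, 1) for _ in range(n)]
y = [random.gauss(0, 1) for _ in range(n)]

zx, zy = z_scores(x), z_scores(y)
keep = [i for i in range(n) if abs(zx[i] - zy[i]) <= 1.0]
x_kept = [x[i] for i in keep]
y_kept = [y[i] for i in keep]

r_before = pearson_r(x, y)
r_after = pearson_r(x_kept, y_kept)
kept_fraction = len(keep) / n

print(f"kept fraction: {kept_fraction:.2f}")
print(f"r before filtering: {r_before:.2f}")
print(f"r after filtering:  {r_after:.2f}")
```

Under these assumptions roughly half of the points survive, and the survivors show a strong positive correlation even though none exists in the generated data. This suggests that a stronger relationship after filtering is not, on its own, evidence that the removed points were errors.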

If by chance this is a never-before-heard-of method (delusions of grandeur), I propose it be called the "Z-score split", as it halves your observations.

The z-score, presumably meaning (value $-$ mean) / SD, is a lousy criterion for identifying problematic values. The difficulty is that the mean and SD may themselves be thrown off by such outliers. I am with @Peter Flom in distrusting all automated methods, even for identifying possible outliers, but if compelled to choose I would start with (value $-$ median) / IQR, assuming values are on an appropriate and approximately symmetric scale.
– Nick Cox, Commented Jan 2, 2020 at 12:16
@Nick Cox, I've made the necessary edits. I agree that automated outlier removal is risky. I'll have to try and see whether the modified z-score method is useful for my data.
– Ari, Commented Jan 2, 2020 at 12:18
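As a rough illustration of the point raised in the comments, the sketch below compares the classical z-score with the (value $-$ median) / IQR score on a small sample containing one gross value. The data and function names are hypothetical, and the quartiles come from `statistics.quantiles` with its default method, which is only one of several quartile conventions.

```python
import statistics

def classical_z(values):
    """(value - mean) / sample SD; both statistics are sensitive to outliers."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def robust_score(values):
    """(value - median) / IQR; both statistics resist a few gross values."""
    med = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    return [(v - med) / iqr for v in values]

# Hypothetical sample: nine ordinary values and one gross error.
data = [10, 11, 12, 13, 14, 15, 16, 17, 18, 500]

classical = classical_z(data)
robust = robust_score(data)

print("classical z of the gross value:", round(classical[-1], 2))
print("robust score of the gross value:", round(robust[-1], 2))
```

In this toy sample the gross value inflates the mean and SD so much that its own classical z-score stays modest, while its robust score is far larger. Either score is at most a way to flag points for inspection, not a rule for deleting them.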

Peter Flom · Accepted Answer · 2020-01-02 12:06:59Z

First, I wouldn't trust any automatic method of outlier removal. You might use an automatic method to identify points that might be outliers, but you should then look at each of those points before removing it and not remove it just because it is an outlier. Only remove points if you know they are errors.

Second, there are methods for dealing with data that has outliers. It seems like you are doing regression so you might look into robust regression or quantile regression.
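To give a feel for what a robust fit does, here is a small sketch using the Theil-Sen estimator (the median of all pairwise slopes), which is one simple robust regression technique and not necessarily the one meant above. It uses only the Python standard library; the data and function names are hypothetical.

```python
import statistics
from itertools import combinations

def ols_fit(x, y):
    """Ordinary least squares slope and intercept, for comparison."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def theil_sen_fit(x, y):
    """Theil-Sen: slope is the median of all pairwise slopes,
    intercept is the median of the residuals y - slope * x."""
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[j] != x[i]
    ]
    slope = statistics.median(slopes)
    intercept = statistics.median([b - slope * a for a, b in zip(x, y)])
    return slope, intercept

# Hypothetical data: y = 2x + 1 exactly, with one corrupted observation.
x = [float(v) for v in range(10)]
y = [2.0 * a + 1.0 for a in x]
y[9] = 100.0  # gross error in the last point

ols_slope, ols_intercept = ols_fit(x, y)
ts_slope, ts_intercept = theil_sen_fit(x, y)

print("OLS:      ", round(ols_slope, 3), round(ols_intercept, 3))
print("Theil-Sen:", round(ts_slope, 3), round(ts_intercept, 3))
```

In this toy case the single corrupted point pulls the least-squares line well away from the underlying relationship, while the Theil-Sen fit is unaffected, and no observation has to be deleted to get there.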

Third (and maybe most important), if you have so many errors that result in outliers, you probably also have errors that result in inliers, that is, wrong data that doesn't stand out. If your data is so full of errors, you might not be able to find any good relationships, and you might want to abandon the project because any conclusions you come to will have a high risk of being wrong.

@Peter Flom, I'll follow your suggestion and look into robust and quantile regression. I agree with you that the data is probably of poor quality and should be abandoned, but I have to try and see if something can come of it.
– Ari, Commented Jan 2, 2020 at 12:22
