Long story short: can you use the difference between the z-scores of two variables as an outlier detector?
I have a data set of poor quality: lots of measurement and human error, and probably instrumental error as well. Because of this, the measured variables had large variance, and where linear relationships were expected between two variables, little or none was found.
After searching for valid methods to remove outliers and not finding anything particularly helpful, I tried the old but golden method of z-score detection, where anything with a z-score of 2.5 or higher was removed. This helped a bit but was not sufficient. However, while looking at two variables that I knew should theoretically have a strong positive linear correlation (r = 0.80 expected) but didn't, it occurred to me that their z-scores were wildly different: when variable 1 had a z-score close to 1, variable 2 might have something close to -2.5, for example.
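For concreteness, the single-variable z-score filter could be sketched roughly as below. This is a minimal illustration, not the exact code that was run: the function names and the sample values are made up, and it assumes the 2.5 cutoff is applied to the absolute z-score, using the sample standard deviation.

```python
from statistics import mean, stdev

def z_scores(values):
    """Return (x - mean) / sample standard deviation for each value."""
    m = mean(values)
    s = stdev(values)
    return [(x - m) / s for x in values]

def keep_by_z(values, threshold=2.5):
    """Return the indices of observations with |z| below the threshold.

    Assumption: the cutoff is applied to the absolute z-score.
    """
    return [i for i, z in enumerate(z_scores(values)) if abs(z) < threshold]

# Hypothetical measurements: nine ordinary values and one gross error.
data = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.1, 9.9, 25.0]
kept = keep_by_z(data)
print(kept)  # indices of the observations that survive the filter
```

Note that the mean and standard deviation here are computed from the same data that contain the outliers, so large errors inflate the standard deviation and can hide one another.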
I then thought that the problem with the data wasn't that the variables varied too much from their own means, but that they deviated differently from one another. I took the difference in z-score between the two variables for each observation and set an arbitrary outlier threshold of 1 and -1, meaning any observation with a difference above 1 or below -1 was removed.
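The z-score difference filter described above could be sketched as follows. Again this is only an illustration under assumptions: the function names and the paired values are hypothetical, and the z-scores use the sample standard deviation of each variable.

```python
from statistics import mean, stdev

def z_scores(values):
    """Return (x - mean) / sample standard deviation for each value."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def keep_by_z_difference(x, y, threshold=1.0):
    """Return indices of paired observations with |z_x - z_y| <= threshold."""
    zx, zy = z_scores(x), z_scores(y)
    return [i for i, (a, b) in enumerate(zip(zx, zy)) if abs(a - b) <= threshold]

# Hypothetical paired data: y roughly tracks x except for the last pair.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 2.1, 2.9, 4.2, 5.0, 1.0]
kept = keep_by_z_difference(x, y)
print(kept)  # indices of the pairs that survive the filter
```

Geometrically, this keeps only the points lying in a band around the line z_x = z_y in the standardized scatter plot.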
Unsurprisingly, this removed about 50% of my observations, but surprisingly it cleaned up the data set to such a degree that I had to double-check everything to make sure what exactly had happened. The whole data set now showed stronger relationships between the variables where they were expected, and variables that had not even been investigated beforehand had few outliers as well. After the z-score removal, RMSECV (root mean square error of cross-validation) was halved.
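One possible way to double-check what the filter does by itself is to apply it to synthetic data in which the two variables are generated independently, so any relationship that appears afterwards must come from the filtering. The sketch below assumes standard normal variables, a fixed random seed, and made-up helper names; it is a sanity check, not a reproduction of my data.

```python
import random
from statistics import mean, stdev

def z_scores(values):
    """Return (x - mean) / sample standard deviation for each value."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def pearson_r(x, y):
    """Sample Pearson correlation computed from z-scores."""
    zx, zy = z_scores(x), z_scores(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)

# Assumption: two independent standard normal variables (no true relationship).
rng = random.Random(0)
n = 5000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [rng.gauss(0, 1) for _ in range(n)]

# Apply the z-score difference filter with the threshold of 1 and -1.
zx, zy = z_scores(x), z_scores(y)
kept = [i for i in range(n) if abs(zx[i] - zy[i]) <= 1.0]
x_kept = [x[i] for i in kept]
y_kept = [y[i] for i in kept]

r_before = pearson_r(x, y)
r_after = pearson_r(x_kept, y_kept)
fraction_kept = len(kept) / n
print(f"kept {fraction_kept:.0%} of pairs, r before = {r_before:.2f}, r after = {r_after:.2f}")
```

Because the filter only keeps pairs close to the line z_x = z_y, the correlation among the surviving pairs goes up by construction, even when the variables are unrelated. Comparing the improvement on real data against a baseline like this is one way to judge how much of it is an artifact of the selection.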
I'm very worried that I've committed a major NO-NO in statistical theory. Searching the web for this was fruitless, as I couldn't find anyone trying this method out.
So my questions are:
- Has anyone else tried this method of outlier removal?
- Have I just reinvented the wheel, and is this a normal method of outlier removal?
- What can I do to verify that this isn't just plain cherry-picking?
If by chance this is a never-before-heard-of method (delusions of grandeur), I propose it be called the "Z-score split", as it halves your observations.