Long story short: can the difference between the z-scores of two variables be used as an outlier detector?
I have a data set of poor quality, with a lot of measurement and human error and probably instrumental error as well. Because of this, the measured variables had large variance, and where linear relationships were expected between two variables, little or none was found.
After searching for valid methods to remove outliers and not finding anything particularly helpful, I tried the old but golden method of z-score detection, removing anything with a z-score of 2.5 or higher. This helped a bit but was not sufficient. However, while looking at two variables that I knew should theoretically have a strong positive linear correlation (r = 0.80 expected) but didn't, it occurred to me that their z-scores were wildly different: when variable 1 had a z-score close to 1, variable 2 might have one close to -2.5, for example.
I then thought that the problem with the data wasn't that the variables varied too much from their own means, but that they varied differently from each other. So I took the difference in z-score between the two variables and set an arbitrary outlier threshold of 1 and -1, meaning any observation with a difference above 1 or below -1 was removed.
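To make the filter concrete, here is a minimal sketch of what I did, assuming NumPy and the sample standard deviation (`ddof=1`). The function names `zscore` and `zdiff_mask` and the toy arrays are made up for illustration and are not my real data.

```python
import numpy as np

def zscore(x):
    """Standardize a 1-D array using its own mean and sample standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

def zdiff_mask(x, y, threshold=1.0):
    """Return True for observations whose z-score difference lies within +/- threshold."""
    d = zscore(x) - zscore(y)
    return np.abs(d) <= threshold

# Toy example (hypothetical values): the fourth pair is out of line with the rest.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 2.0, 2.9, 9.0, 5.2, 6.1])

keep = zdiff_mask(x, y)          # boolean mask of observations to keep
x_clean, y_clean = x[keep], y[keep]
```

Note that the z-scores are computed once on the full data; recomputing them after each removal would give a different, iterative procedure.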
Unsurprisingly, this removed about 50% of my observations, but surprisingly it cleaned up the data set to such a degree that I had to double-check everything to make sure what exactly had happened. The result was that the whole data set showed stronger relationships between variables where they were expected, and variables that had not even been investigated beforehand had few outliers as well. After the z-score removal, the RMSECV (root mean square error of cross-validation) was halved.
I'm very worried that I've committed a major no-no in statistical theory. Searching the web for this was fruitless, as I couldn't find anyone trying this method.
So my questions are:
- Has anyone else tried this method of outlier removal?
- Have I just reinvented the wheel, and is this a normal method of outlier removal?
- What can I do to verify that this isn't just plain cherry-picking?
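One check I can think of for the last question, as a rough sketch and not an established procedure: apply the same filter to two simulated variables that are independent by construction and see what happens to their correlation. The sample size, seed, and variable names below are arbitrary assumptions.

```python
import numpy as np

def zscore(x):
    """Standardize a 1-D array using its own mean and sample standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
n = 10_000

# Two variables drawn independently, so there is no true relationship.
x = rng.normal(size=n)
y = rng.normal(size=n)

r_before = np.corrcoef(x, y)[0, 1]

# Same rule as in my data: keep only rows with a z-score difference within +/- 1.
keep = np.abs(zscore(x) - zscore(y)) <= 1.0
r_after = np.corrcoef(x[keep], y[keep])[0, 1]
frac_kept = keep.mean()

print(f"kept {frac_kept:.0%} of rows, r before = {r_before:.2f}, r after = {r_after:.2f}")
```

If `r_after` comes out much larger than `r_before` here, where no real relationship exists, then the stronger correlation in my real data cannot by itself be taken as evidence that the removed points were outliers.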
If by chance this is a never-before-heard-of method (delusions of grandeur), I propose it be called the "Z-score split", as it halves your observations.