deleted 11 characters in body

edited Jan 2, 2020 at 12:16

Long story short: can the difference between the z-scores of two variables be used as an outlier detector?

I have a data set of poor quality, with lots of measurement and human error and probably instrumental error as well. Because of this, the measured variables had large variances, and where linear relationships were expected between two variables, little or none was found.

After searching for valid methods to remove outliers and not finding anything particularly helpful, I tried the old but golden method of z-score detection, removing anything with a z-score of 2.5 or higher. This helped a bit but was not sufficient. However, while looking at two variables which I knew should theoretically have a strong positive linear correlation (r = 0.80 expected) but didn't, I noticed that their z-scores were wildly different: when variable 1 had a z-score close to 1, variable 2 had something close to -2.5, for example.
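A minimal sketch of that first, per-variable z-score filter. The function names are made up for illustration, and two details are assumptions not stated above: the z-score is taken in absolute value, and the sample standard deviation (`ddof=1`) is used.

```python
import numpy as np

def zscores(x):
    """Standardize a 1-D array: subtract the mean, divide by the sample standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

def keep_by_zscore(x, threshold=2.5):
    """Boolean mask that is True for observations to keep.

    Assumption: an observation is dropped when its absolute z-score
    is at or above the threshold.
    """
    return np.abs(zscores(x)) < threshold

# Hypothetical data: 19 identical readings and one extreme value.
values = np.array([0.0] * 19 + [100.0])
mask = keep_by_zscore(values)
cleaned = values[mask]
```

Here only the extreme reading is dropped, because its z-score is far above 2.5 while the rest sit close to the mean.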

I then thought that the problem with the data wasn't that the variables varied too much from their own means, but that they varied differently from each other. So I took the difference between the z-scores of the two variables for each observation and set an arbitrary outlier threshold of 1 and -1, meaning any observation with a difference above 1 or below -1 was removed.
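The difference filter described above could be sketched roughly as follows. The function names and the toy data are hypothetical, and the use of the sample standard deviation (`ddof=1`) is an assumption.

```python
import numpy as np

def zscores(x):
    """Standardize a 1-D array using the sample standard deviation (assumed)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

def keep_by_zscore_difference(x, y, threshold=1.0):
    """Boolean mask that is True where the z-score difference lies within [-threshold, threshold]."""
    diff = zscores(x) - zscores(y)
    return np.abs(diff) <= threshold

# Hypothetical paired measurements: the first and last observations disagree strongly.
var1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
var2 = np.array([5.0, 2.0, 3.0, 4.0, 1.0])
mask = keep_by_zscore_difference(var1, var2)
var1_kept, var2_kept = var1[mask], var2[mask]
```

Worth keeping in mind when choosing the threshold: for two standardized variables with correlation r, the difference of their z-scores has variance 2(1 - r), so the spread of the difference itself depends on how correlated the raw data are.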

Unsurprisingly, this removed about 50% of my observations, but surprisingly it cleaned up the data set to such a degree that I had to double-check everything to make sure what exactly had happened. The whole data set now showed stronger relationships between the variables where they were expected, and variables that had not even been investigated beforehand had few outliers as well. After the z-score removal, the RMSECV (root mean square error of cross-validation) was halved.

I'm very worried that I've committed a major no-no in statistical theory. Searching the web was fruitless, as I couldn't find anyone trying this method out.

So my questions are:

Has anyone else tried this method of outlier removal?
Have I just reinvented the wheel, and is this a normal method of outlier removal?
What can I do to verify that this isn't just plain cherry-picking?

If by chance this is a never-before-heard-of method (delusions of grandeur), I propose it be called the "Z-score split", as it halves your observations.

asked Jan 2, 2020 at 11:10

Ari

Outlier detection using the difference between two z-scores
