
My df has a price column that looks like

0         2125.000000
1        14469.483703
2        14101.832820
3        20287.619019
4        14469.483703
             ...
12561     2490.000000
12562     2931.283333
12563     1779.661017
12566     2200.000000
12567     2966.666667

I want to remove all the rows of df with outliers in the price_m2 column. I tried two methods:

1st:

df_w_o = df[np.abs(df.price_m2-df.price_m2.mean())<=(1*df.price_m2.std())] 

2nd:

df['z_score'] = (df['price_m2'] - df['price_m2'].mean()) / df['price_m2'].std()
df_w_o = df[(df['z_score'] < 1) & (df['z_score'] > -1)]

When I check the min and max afterwards I get:

print(df_w_o.price_m2.min())
print(df_w_o.price_m2.max())
0.0
25438.022812290565

Before the removal I get:

print(df.price_m2.min())
print(df.price_m2.max())
0.0
589933.4267822268

This doesn't feel right: how can I get such a large price range on data that is supposed to be about real estate? In this example, 0 is the extreme low and it remains after the outlier removal.

  • Remember that outliers are at > mean + 2*std and < mean - 2*std in a normal distribution, two-tailed. (A minimal sketch of this two-tailed filter follows these comments.) Commented May 14, 2022 at 13:54
  • Do you mean that this df_w_o = df[(df['z_score'] < 1) & (df['z_score'] > -1)] should be df_w_o = df[(df['z_score'] < std) & (df['z_score'] > -std)]? My reasoning for using 1 std is: since it's a price dataset for a narrow geographic area, I assumed 1 times the std should be more accurate. Commented May 14, 2022 at 17:54
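As a concrete illustration of that two-tailed rule, here is a minimal sketch (not code from the thread; it assumes the same df and price_m2 column as in the question):

# Two-tailed filter at 2 standard deviations: keep rows whose price_m2
# lies between mean - 2*std and mean + 2*std (assumes `df` from the question).
mean = df['price_m2'].mean()
std = df['price_m2'].std()
df_w_o = df[(df['price_m2'] > mean - 2 * std) & (df['price_m2'] < mean + 2 * std)]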

3 Answers


The presumption here is that the raw data the OP has is normally distributed and that there are no outliers. It is very possible that the high value in the original dataset, approximately 589933, is an outlier. Let's create a quantile-quantile (Q-Q) plot of a randomly generated dataset:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

n = 100
np.random.seed(0)
df = pd.DataFrame({"price": np.random.normal(25000, 3000, n)})
qqplt = sm.qqplot(df["price"], line='s', fit=True)
plt.show()

Normal Data Distribution

However, we can completely skew this with one single outlier.

outlier = 600000
df.loc[n] = outlier
qqplt = sm.qqplot(df["price"], line='s', fit=True)
plt.show()

Normal Data with one outlier

Anytime we talk about outlier removal and it "doesn't feel right", we really need to take a step back and look at the data. As @kndahl suggests, using a package that includes heuristics and methods for data removal is a good option. Otherwise, gut feelings should be backed up with your own statistical analysis.
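As one possible way of backing that up with analysis (a sketch only; @kndahl's suggested package is not shown in this thread, so this simply uses scipy.stats.zscore as an example):

import numpy as np
from scipy import stats

# Sketch: z-score filter on the simulated "price" column from above,
# keeping rows within 2 standard deviations of the mean.
z = np.abs(stats.zscore(df["price"]))
df_filtered = df[z <= 2]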

Finally, as to why 0 was still in the final dataset, let's take another look. We will add 0 to the dataset and run your outlier removal. First, we'll run your outlier removal as-is; then we will remove the extremely high $600,000 value before running your outlier method again.

## simulated data with 0 also added
df.loc[n+1] = 0
df_w_o = df[np.abs(df.price - df.price.mean()) <= (1 * df.price.std())]
print(f"With the high outlier of 600,000 still in the original dataset, the new range is \nMin:{df_w_o.price.min()}\nMax:{df_w_o.price.max()}")
## With the high outlier of 600,000 still in the original dataset, the new range is
## Min:0.0
## Max:31809.263871962823

## now let's remove the high outlier first before doing our outlier removal
df = df.drop(n)
df_w_o = df[np.abs(df.price - df.price.mean()) <= (1 * df.price.std())]
print(f"\n\nWith the outlier of 600,000 removed prior to analyzing the data, the new range is \nMin:{df_w_o.price.min()}\nMax:{df_w_o.price.max()}")
## With the outlier of 600,000 removed prior to analyzing the data, the new range is
## Min:21241.61391985022
## Max:28690.87204218316

In this simulated case, the high outlier skewed the statistics so much that 0 fell within one standard deviation of the mean. Once we scrubbed the data before processing, that 0 was removed. Relatedly, this may be better suited for Cross Validated, with a more complete dataset provided.
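To make the skew concrete, here is a quick sketch that regenerates the same simulated data (same seed and parameters as above) and compares the lower edge of the one-standard-deviation band with and without the 600,000 row:

import numpy as np
import pandas as pd

np.random.seed(0)
prices = np.append(np.random.normal(25000, 3000, 100), [600000, 0])
df_demo = pd.DataFrame({"price": prices})

# With the 600,000 row included, the mean and std are inflated enough that
# the lower edge of the keep band (mean - 1*std) is negative, so 0 survives.
print(df_demo.price.mean() - df_demo.price.std())

# Drop the extreme value and the band tightens to well above 0,
# so the 0 row would now be removed by the same filter.
df_demo = df_demo[df_demo.price < 600000]
print(df_demo.price.mean() - df_demo.price.std())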


3 Comments

This makes sense. But I can't manually remove it because my database is very large and this is just a geographic sample (a circle of 1 km around the center of the query). I need a solution that scales over the whole country. If I remove the top 1-2% and bottom 1-2% of values in my sample before df[np.abs(df.price-df.price.mean())<=(1*df.price.std())], would it still be considered acceptable from a data analysis perspective? Or is it just bad practice? (A sketch of this kind of trim appears after these comments.)
Update: I did remove the top percentile before using the z-score outlier cleaning method, and oh boy, the results are so much more like what I was expecting in the first place!
I am not sure I would do that; I would want to look at the distribution first. But I would say you could do some cleanup - removing all prices that are 0 is sound. Maybe look at the top 10 values as well, because it does not take many bad values to skew the statistics. Overall, you are trying to clean fake values out. Regardless, scrubbing the top percentile is not the worst thing ever. If this answer helped, consider accepting.
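For reference, one way the percentile trim discussed in these comments could look (a sketch, not necessarily what the OP ran; it assumes the df and price_m2 column from the question):

import numpy as np

# Hypothetical sketch: drop the bottom and top 1% of price_m2 first...
low, high = df['price_m2'].quantile(0.01), df['price_m2'].quantile(0.99)
trimmed = df[df['price_m2'].between(low, high)]

# ...then apply the original 1-std filter on the trimmed data.
df_w_o = trimmed[np.abs(trimmed.price_m2 - trimmed.price_m2.mean()) <= trimmed.price_m2.std()]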

@SlimPun, this is what I meant:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.normal(loc=10, scale=5, size=1000))  ## 1000 items in the price column
df.columns = ["Price"]

Replace outliers with NaN:

df[(df.Price > (np.mean(df.Price) + 2 * np.std(df.Price))) |
   (df.Price < (np.mean(df.Price) - 2 * np.std(df.Price)))] = np.nan

Drop the outliers:

df = df.dropna(how='all')
df.shape
## (951, 1) - without outliers ** this can change according to your distribution given by numpy
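A minimal alternative sketch (not from the original answer): the same two-standard-deviation cut can be done in one step with boolean indexing, skipping the intermediate NaN step:

import numpy as np

# Keep rows directly instead of marking outliers NaN and dropping them.
lower = np.mean(df.Price) - 2 * np.std(df.Price)
upper = np.mean(df.Price) + 2 * np.std(df.Price)
df_filtered = df[df.Price.between(lower, upper)]

The result matches the NaN-and-dropna approach, because between is inclusive of both bounds while the original marks only strictly greater/smaller values as NaN.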



This will clean the outliers by filtering each numerical column that requires outlier treatment, removing data points that lie beyond the upper cap and the lower cap.

import numpy as np

column_list = ['col1', 'col2']

def outlier_clean(df, column_list):
    for i in column_list:
        q1 = np.quantile(df[i], 0.25)
        q3 = np.quantile(df[i], 0.75)
        median = np.median(df[i])
        IQR = q3 - q1
        upper_cap = median + (1.5 * IQR)
        lower_cap = median - (1.5 * IQR)
        mask1 = df[i] < upper_cap
        mask2 = df[i] > lower_cap
        # keep only rows that fall between both caps (the original `|` kept
        # nearly every row, since almost all values satisfy at least one condition)
        df = df[mask1 & mask2]
    return df

df = outlier_clean(df, column_list)
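A hypothetical usage example with synthetic data (the demo DataFrame, column names, and planted outliers below are illustrative only, not from the original answer):

import numpy as np
import pandas as pd

np.random.seed(1)
demo = pd.DataFrame({
    "col1": np.append(np.random.normal(100, 10, 200), [500, -300]),  # two planted extremes
    "col2": np.random.normal(50, 5, 202),
})

cleaned = outlier_clean(demo, ["col1", "col2"])
print(demo.shape, cleaned.shape)  # the planted extremes (plus some tail points) are removed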

