I am trying to create the following function. However, when I assign the function's result back to the original DataFrame, the DataFrame becomes empty.

    def remove_outliers(feature, df):
        q1 = np.percentile(df[feature], 25)
        q2 = np.percentile(df[feature], 50)
        q3 = np.percentile(df[feature], 75)
        iqr = q3-q1
        lower_whisker = df[df[feature] <= q1-1.5*iqr][feature].max()
        upper_whisker = df[df[feature] <= q3+1.5*iqr][feature].max()
        return df[(df[feature] < upper_whisker) & (df[feature]>lower_whisker)]

I am assigning as follows:

train = remove_outliers('Power',train) 
  • I believe your issue is that your variables lower_whisker and/or upper_whisker are set to NaN, hence the result from the function is an empty DataFrame. Commented Jun 5, 2020 at 22:22
  • Yeah, as Cedric pointed out, either df[df[feature] <= q1-1.5*iqr][feature] or df[df[feature] <= q3+1.5*iqr][feature] is coming out empty, causing your function to return an empty DataFrame. Commented Jun 5, 2020 at 22:32
  • Actually, those variables I pointed out are meant to be numbers. Once they are set to NaN, the result is an empty DataFrame. Commented Jun 5, 2020 at 22:38
  • The issue @Chaos_Adm is having is dependent on the data. If the data has values lower than 25 and higher than 75, no issues would arise from the OP's code. Commented Jun 5, 2020 at 22:41
  • Okay, my condition for lower_whisker was wrong lol. It should be train[train['Power'] >= q1-1.5*iqr]['Power'].min() Commented Jun 6, 2020 at 6:21
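
To illustrate the point made in these comments, here is a minimal sketch of how the NaN can arise. The `Power` values are made up for the example and are not the OP's data. When no value lies at or below q1 - 1.5*iqr, the selection is empty, its `.max()` is NaN, and every comparison against NaN is False, so the mask selects no rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one high outlier, no low outliers
df = pd.DataFrame({"Power": [10.0, 12.0, 13.0, 14.0, 15.0, 100.0]})

q1 = np.percentile(df["Power"], 25)
q3 = np.percentile(df["Power"], 75)
iqr = q3 - q1

# No value lies at or below the lower fence, so this selection is empty
below_fence = df[df["Power"] <= q1 - 1.5 * iqr]["Power"]
lower_whisker = below_fence.max()  # max of an empty Series is NaN

# Any comparison with NaN evaluates to False, so the mask is all False
mask = df["Power"] > lower_whisker
result = df[mask]

print(lower_whisker)  # nan
print(len(result))    # 0
```

So the function can return an empty DataFrame even when most of the data sits between the 25th and 75th percentiles; all it takes is the absence of low outliers.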

1 Answer

The problem you are facing is that lower_whisker and/or upper_whisker end up as NaN. Every comparison against NaN is False, so the function returns an empty DataFrame. You can resolve this by checking for NaN and applying only the bounds that are valid.

Below you can see a possible way to rewrite the function to resolve this:

    import numpy as np
    import pandas as pd

    def remove_outliers(feature, df):
        q1 = np.percentile(df[feature], 25)
        q3 = np.percentile(df[feature], 75)
        iqr = q3 - q1
        lower_whisker = df[df[feature] <= q1 - 1.5 * iqr][feature].max()
        upper_whisker = df[df[feature] <= q3 + 1.5 * iqr][feature].max()
        # .max() of an empty selection is NaN; pd.isna is a more reliable check than `is np.nan`
        if pd.isna(lower_whisker):
            # no valid lower bound, so filter on the upper bound only
            return df[df[feature] < upper_whisker]
        elif pd.isna(upper_whisker):
            # no valid upper bound, so filter on the lower bound only
            return df[df[feature] > lower_whisker]
        else:
            return df[(df[feature] < upper_whisker) & (df[feature] > lower_whisker)]

2 Comments

How can they be set to NaN? There must be a lot of data points within the 25th and 75th percentile of data, unless I am missing something here?
Okay my condition for lower_whisker was wrong lol. It should be train[train['Power'] >= q1-1.5*iqr]['Power'].min()
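
Following the correction in the last comment, a version with the fixed lower-whisker condition might look like the sketch below. The sample `train` frame is hypothetical, and the inclusive comparisons in the return statement are an assumption on my part; the original code used strict `<` and `>`, which would also drop the whisker values themselves.

```python
import numpy as np
import pandas as pd

def remove_outliers(feature, df):
    """Keep rows whose `feature` value lies between the box-plot whiskers."""
    values = df[feature]
    q1 = np.percentile(values, 25)
    q3 = np.percentile(values, 75)
    iqr = q3 - q1
    # Lower whisker: smallest value at or above the lower fence
    lower_whisker = values[values >= q1 - 1.5 * iqr].min()
    # Upper whisker: largest value at or below the upper fence
    upper_whisker = values[values <= q3 + 1.5 * iqr].max()
    # Inclusive bounds (an assumption) so the whisker values are kept
    return df[(values >= lower_whisker) & (values <= upper_whisker)]

# Hypothetical data: 100.0 is the only outlier
train = pd.DataFrame({"Power": [10.0, 12.0, 13.0, 14.0, 15.0, 100.0]})
train = remove_outliers("Power", train)
print(train["Power"].tolist())
```

With this condition both selections always contain the values between q1 and q3, so neither whisker is NaN as long as the column itself has no missing values.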
