Returning dataframe from function is not working?

Question

I am trying to create the following function. However, when I assign its return value back to the original dataframe, the dataframe becomes empty.

    def remove_outliers(feature, df):
        q1 = np.percentile(df[feature], 25)
        q2 = np.percentile(df[feature], 50)
        q3 = np.percentile(df[feature], 75)
        iqr = q3-q1
        lower_whisker = df[df[feature] <= q1-1.5*iqr][feature].max()
        upper_whisker = df[df[feature] <= q3+1.5*iqr][feature].max()
        return df[(df[feature] < upper_whisker) & (df[feature]>lower_whisker)]

I am assigning as follows:

train = remove_outliers('Power',train)

I believe your issue is that your variables lower_whisker and/or upper_whisker are set to NaN, hence the result from the function is an empty DataFrame.
– Cedric Zoppolo, Commented Jun 5, 2020 at 22:22
Yeah, as Cedric pointed out, either df[df[feature] <= q1-1.5*iqr][feature] or df[df[feature] <= q3+1.5*iqr][feature] is coming out empty, which causes your function to return an empty DataFrame.
– Raghul Raj, Commented Jun 5, 2020 at 22:32
Actually, those variables I pointed out are meant to be numbers. Once they are set to NaN, the result is an empty DataFrame.
– Cedric Zoppolo, Commented Jun 5, 2020 at 22:38
The issue @Chaos_Adm is having depends on the data. If the data has values lower than 25 and higher than 75, no issues would arise from the OP's code.
– Cedric Zoppolo, Commented Jun 5, 2020 at 22:41
Okay, my condition for lower_whisker was wrong lol. It should be train[train['Power'] >= q1-1.5*iqr]['Power'].min()
– ChaoS Adm, Commented Jun 6, 2020 at 6:21
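
The corrected condition from the last comment can be sketched as a complete function. This is only an illustration, not the OP's final code: the sample data is hypothetical, and the use of inclusive bounds via `Series.between` is an assumption (the original function used strict `<` and `>`, which would also drop the whisker values themselves).

```python
import numpy as np
import pandas as pd


def remove_outliers(feature, df):
    """Keep rows whose `feature` value lies between the two whiskers."""
    q1 = np.percentile(df[feature], 25)
    q3 = np.percentile(df[feature], 75)
    iqr = q3 - q1
    # Smallest observation at or above the lower fence (the corrected condition).
    lower_whisker = df.loc[df[feature] >= q1 - 1.5 * iqr, feature].min()
    # Largest observation at or below the upper fence.
    upper_whisker = df.loc[df[feature] <= q3 + 1.5 * iqr, feature].max()
    # Assumption: inclusive bounds, so the whisker values themselves are kept.
    return df[df[feature].between(lower_whisker, upper_whisker)]


# Hypothetical data: 500 is an obvious high outlier.
train = pd.DataFrame({"Power": [70, 72, 74, 75, 76, 78, 80, 500]})
train = remove_outliers("Power", train)
```

With this definition, both selections always contain at least the values between q1 and q3 for non-empty input, so neither whisker should come out as NaN.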

Cedric Zoppolo · Accepted Answer · 2020-06-05 22:34:57Z

The problem you are facing is that lower_whisker and/or upper_whisker is set to NaN, so the function returns an empty DataFrame. You can resolve this by checking each whisker for NaN and filtering only on the ones that are valid.

Below you can see a possible way to rewrite the function to resolve this:

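    import numpy as np
    import pandas as pd

    def remove_outliers(feature, df):
        q1 = np.percentile(df[feature], 25)
        q3 = np.percentile(df[feature], 75)
        iqr = q3 - q1
        # .max() of an empty selection is NaN
        lower_whisker = df[df[feature] <= q1 - 1.5 * iqr][feature].max()
        upper_whisker = df[df[feature] <= q3 + 1.5 * iqr][feature].max()
        # pd.isna is used because `x is np.nan` is an identity check
        # and is not reliable for NaN values computed by pandas
        if pd.isna(lower_whisker):
            # No usable lower whisker: filter on the upper whisker only
            return df[df[feature] < upper_whisker]
        elif pd.isna(upper_whisker):
            # No usable upper whisker: filter on the lower whisker only
            return df[df[feature] > lower_whisker]
        else:
            return df[(df[feature] < upper_whisker) & (df[feature] > lower_whisker)]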

How can they be set to NaN? There must be a lot of data points within the 25th and 75th percentile of data, unless I am missing something here?
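
To the question of how a whisker can end up as NaN: a minimal sketch, using hypothetical data with no low outliers, shows the mechanism. The selection below the lower fence is empty, `.max()` of an empty Series is NaN, and every comparison against NaN is False.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one high outlier, no low outliers.
df = pd.DataFrame({"Power": [70, 72, 74, 75, 76, 78, 80, 500]})

q1 = np.percentile(df["Power"], 25)
q3 = np.percentile(df["Power"], 75)
iqr = q3 - q1

# No value lies at or below the lower fence, so this selection is empty.
below_fence = df.loc[df["Power"] <= q1 - 1.5 * iqr, "Power"]
# The max of an empty Series is NaN.
lower_whisker = below_fence.max()

# Comparing against NaN yields False for every row, so nothing is selected.
mask = df["Power"] > lower_whisker
result = df[mask]

print(below_fence.empty, pd.isna(lower_whisker), len(result))
```

So the NaN does not come from the data between the 25th and 75th percentiles; it comes from the selection used to compute the whisker being empty.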
