
I have a relatively large DataFrame object (about a million rows, hundreds of columns), and I'd like to clip outliers in each column by group. By "clip outliers for each column by group" I mean - compute the 5% and 95% quantiles for each column in a group and clip values outside this quantile range.

Here's the setup I'm currently using:

```python
def winsorize_series(s):
    q = s.quantile([0.05, 0.95])
    if isinstance(q, pd.Series) and len(q) == 2:
        s[s < q.iloc[0]] = q.iloc[0]
        s[s > q.iloc[1]] = q.iloc[1]
    return s

def winsorize_df(df):
    return df.apply(winsorize_series, axis=0)
```

and then, with my DataFrame called features and indexed by DATE, I can do

```python
grouped = features.groupby(level='DATE')
result = grouped.apply(winsorize_df)
```

This works, except that it's very slow, presumably due to the nested apply calls: one on each group, and then one for each column in each group. I tried getting rid of the second apply by computing quantiles for all columns at once, but got stuck trying to threshold each column by a different value. Is there a faster way to accomplish this procedure?

  • It seems like this question is addressing the tool of Winsorization (which I'm looking for right now), while the related question is about removing rows from the data frame. Different questions, imo, and linked, but one does not solve the other's problem. Commented Nov 16, 2022 at 19:23

4 Answers


There is a winsorize function in scipy.stats.mstats which you might consider using. Note, however, that it returns slightly different values than winsorize_series: mstats.winsorize replaces the extreme values with the nearest retained data point, whereas winsorize_series clips at the interpolated quantile itself:

```python
In [126]: winsorize_series(pd.Series(range(20), dtype='float'))[0]
Out[126]: 0.95000000000000007

In [127]: mstats.winsorize(pd.Series(range(20), dtype='float'), limits=[0.05, 0.05])[0]
Out[127]: 1.0
```

Using mstats.winsorize instead of winsorize_series is, depending on N, M, P, roughly 1.5x faster:

```python
import numpy as np
import pandas as pd
from scipy.stats import mstats

def using_mstats_df(df):
    return df.apply(using_mstats, axis=0)

def using_mstats(s):
    return mstats.winsorize(s, limits=[0.05, 0.05])

N, M, P = 10**5, 10, 10**2
dates = pd.date_range('2001-01-01', periods=N//P, freq='D').repeat(P)
df = pd.DataFrame(np.random.random((N, M)), index=dates)
df.index.names = ['DATE']
grouped = df.groupby(level='DATE')
```

```python
In [122]: %timeit result = grouped.apply(winsorize_df)
1 loops, best of 3: 17.8 s per loop

In [123]: %timeit mstats_result = grouped.apply(using_mstats_df)
1 loops, best of 3: 11.2 s per loop
```

7 Comments

Thanks, that's a good pointer; I didn't realize scipy had a winsorize function. However, I presume a more substantial speedup would be achieved if there's a way to do the operation in bulk on the DataFrame without having to operate column by column, similar to how one can standardize or normalize in bulk, e.g., stackoverflow.com/questions/12525722/normalize-data-in-pandas
Are there the same number of dates in each group?
the group by operation is by date, so each group only has one date. Do you mean to ask whether each group has the same number of rows? The answer to that is no, each date can (and typically does) have a different number of rows.
@YT As you alluded to in the OP, pandas now has a .clip() function that should work for you, especially when combined with .quantile().
See this question I just posted, then answered, using clip() and quantile() as suggested by @Zhang18 to handle missing values: stackoverflow.com/questions/50612095/…
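Following up on the .clip()/.quantile() suggestion in these comments, here is a hedged sketch of the group-wise version; the toy frame, its size, and the column names below are illustrative stand-ins for the OP's features, not their actual data:

```python
import numpy as np
import pandas as pd

# Illustrative DATE-indexed frame standing in for the OP's `features`
dates = pd.date_range('2001-01-01', periods=4, freq='D').repeat(25)
features = pd.DataFrame(np.random.random((100, 3)), index=dates,
                        columns=['a', 'b', 'c'])
features.index.name = 'DATE'

def clip_group(g):
    # Per-column 5%/95% quantiles; axis=1 aligns the quantile Series
    # with the frame's columns, so every column gets its own bounds
    return g.clip(lower=g.quantile(0.05), upper=g.quantile(0.95), axis=1)

result = features.groupby(level='DATE', group_keys=False).apply(clip_group)
```

Because clip broadcasts the per-column bounds in one call, each group needs only a single quantile pass rather than a Python-level apply over every column.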

Here is a solution without using scipy.stats.mstats:

```python
def clip_series(s, lower, upper):
    # Clip a Series at its own quantiles; note that the axis=1 argument
    # in the original snippet is invalid for a Series and has been dropped
    return s.clip(lower=s.quantile(lower), upper=s.quantile(upper))

# Winsorize each feature column
for f in features.columns:
    features[f] = clip_series(features[f], 0.05, 0.95)
```

Note that this clips each column over the whole frame; to match the question it would still need to be applied per DATE group.

1 Comment

Can you add a short description?

I found a rather straightforward way to get this to work, using the transform method in pandas.

```python
from scipy.stats import mstats

lower_lim, upper_lim = 0.05, 0.05  # fraction to winsorize at each tail

def winsorize_series(group):
    return mstats.winsorize(group, limits=[lower_lim, upper_lim])

grouped = features.groupby(level='DATE')
result = grouped.transform(winsorize_series)
```
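A variant of the same transform idea that avoids scipy altogether, using the quantile-clip semantics from the question; this is a sketch on an illustrative toy frame, not the OP's data:

```python
import numpy as np
import pandas as pd

# Illustrative DATE-indexed frame
dates = pd.date_range('2001-01-01', periods=3, freq='D').repeat(30)
features = pd.DataFrame(np.random.random((90, 2)), index=dates,
                        columns=['x', 'y'])
features.index.name = 'DATE'

# transform hands each column of each group to the function as a Series,
# so a plain per-Series quantile clip is enough
result = features.groupby(level='DATE').transform(
    lambda s: s.clip(s.quantile(0.05), s.quantile(0.95))
)
```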

Comments


A good way to approach this is with vectorization, and for that I like to use np.where.

```python
import pandas as pd
import numpy as np
from scipy.stats import mstats
import timeit

data = pd.Series(range(20), dtype='float')

def WinsorizeCustom(data):
    quantiles = data.quantile([0.05, 0.95])
    q_05 = quantiles.loc[0.05]
    q_95 = quantiles.loc[0.95]
    # Replace everything at or below the 5% quantile with that quantile,
    # and everything at or above the 95% quantile likewise
    out = np.where(data.values <= q_05, q_05,
                   np.where(data >= q_95, q_95, data))
    return out
```

For comparison, I wrapped the function from scipy in a function:

```python
def WinsorizeStats(data):
    out = mstats.winsorize(data, limits=[0.05, 0.05])
    return out
```

But as you can see, even though my function is pretty fast, it's still far from the Scipy implementation:

```python
%timeit WinsorizeCustom(data)
# 1000 loops, best of 3: 842 µs per loop

%timeit WinsorizeStats(data)
# 1000 loops, best of 3: 212 µs per loop
```

If you are interested to read more about speeding up pandas code, I would suggest Optimization Pandas for speed and From Python to Numpy.
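Extending the vectorized idea above to the grouped setting of the question, here is a sketch (on an illustrative frame, not benchmarked) that computes the per-group, per-column bounds once and clips the whole frame in a single aligned call:

```python
import numpy as np
import pandas as pd

# Illustrative DATE-indexed frame standing in for `features`
dates = pd.date_range('2001-01-01', periods=5, freq='D').repeat(20)
features = pd.DataFrame(np.random.randn(100, 3), index=dates,
                        columns=['a', 'b', 'c'])
features.index.name = 'DATE'

g = features.groupby(level='DATE')
# Broadcast each group's quantile back to row level...
lo = g.transform(lambda s: s.quantile(0.05))
hi = g.transform(lambda s: s.quantile(0.95))
# ...then clip the whole frame element-wise against the aligned bounds
result = features.clip(lower=lo, upper=hi)
```

This does only two quantile passes over the grouped data and leaves all the thresholding to a single vectorized clip.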

Comments
