
I have a relatively large DataFrame object (about a million rows, hundreds of columns), and I'd like to clip outliers in each column by group. By "clip outliers for each column by group" I mean - compute the 5% and 95% quantiles for each column in a group and clip values outside this quantile range.

Here's the setup I'm currently using:

```python
def winsorize_series(s):
    q = s.quantile([0.05, 0.95])
    if isinstance(q, pd.Series) and len(q) == 2:
        s[s < q.iloc[0]] = q.iloc[0]
        s[s > q.iloc[1]] = q.iloc[1]
    return s

def winsorize_df(df):
    return df.apply(winsorize_series, axis=0)
```

and then, with my DataFrame called features and indexed by DATE, I can do

```python
grouped = features.groupby(level='DATE')
result = grouped.apply(winsorize_df)
```

This works, except that it's very slow, presumably due to the nested apply calls: one on each group, and then one for each column in each group. I tried getting rid of the second apply by computing quantiles for all columns at once, but got stuck trying to threshold each column by a different value. Is there a faster way to accomplish this procedure?

  • It seems like this question is addressing the tool of Winsorization (which I'm looking for right now), while the related question is about removing rows from the data frame. Different questions, imo, and linked, but one does not solve the other's problem. Commented Nov 16, 2022 at 19:23

4 Answers


There is a winsorize function in scipy.stats.mstats which you might consider using. Note, however, that it returns slightly different values than winsorize_series: mstats.winsorize replaces the extreme values with the nearest retained data point, whereas winsorize_series clips at the interpolated quantile itself:

```python
In [126]: winsorize_series(pd.Series(range(20), dtype='float'))[0]
Out[126]: 0.95000000000000007

In [127]: mstats.winsorize(pd.Series(range(20), dtype='float'), limits=[0.05, 0.05])[0]
Out[127]: 1.0
```

Using mstats.winsorize instead of winsorize_series is, depending on N, M, P, roughly 1.5x faster:

```python
import numpy as np
import pandas as pd
from scipy.stats import mstats

def using_mstats_df(df):
    return df.apply(using_mstats, axis=0)

def using_mstats(s):
    return mstats.winsorize(s, limits=[0.05, 0.05])

N, M, P = 10**5, 10, 10**2
dates = pd.date_range('2001-01-01', periods=N//P, freq='D').repeat(P)
df = pd.DataFrame(np.random.random((N, M)), index=dates)
df.index.names = ['DATE']
grouped = df.groupby(level='DATE')
```

```python
In [122]: %timeit result = grouped.apply(winsorize_df)
1 loops, best of 3: 17.8 s per loop

In [123]: %timeit mstats_result = grouped.apply(using_mstats_df)
1 loops, best of 3: 11.2 s per loop
```

7 Comments

Thanks, that's a good pointer; I didn't realize scipy had a winsorize function. However, I presume a more substantial speedup would be achieved if there's a way to do the operation in bulk on the DataFrame without having to operate column by column, similar to how one can standardize or normalize in bulk, e.g., stackoverflow.com/questions/12525722/normalize-data-in-pandas
Are there the same number of dates in each group?
the group by operation is by date, so each group only has one date. Do you mean to ask whether each group has the same number of rows? The answer to that is no, each date can (and typically does) have a different number of rows.
@YT As you alluded to in the OP, pandas now has a .clip() function that should work for you, especially when combined with .quantile().
See this question I just posted, then answered, using clip() and quantile() as suggested by @Zhang18 to handle missing values: stackoverflow.com/questions/50612095/…
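Following up on the .clip()/.quantile() suggestion in these comments, here is a hedged sketch of the group-wise version; the toy frame, its size, and the column names below are illustrative stand-ins for the OP's features, not their actual data:

```python
import numpy as np
import pandas as pd

# Illustrative DATE-indexed frame standing in for the OP's `features`
dates = pd.date_range('2001-01-01', periods=4, freq='D').repeat(25)
features = pd.DataFrame(np.random.random((100, 3)), index=dates,
                        columns=['a', 'b', 'c'])
features.index.name = 'DATE'

def clip_group(g):
    # Per-column 5%/95% quantiles; axis=1 aligns the quantile Series
    # with the frame's columns, so every column gets its own bounds
    return g.clip(lower=g.quantile(0.05), upper=g.quantile(0.95), axis=1)

result = features.groupby(level='DATE', group_keys=False).apply(clip_group)
```

Because clip broadcasts the per-column bounds in one call, each group needs only a single quantile pass rather than a Python-level apply over every column.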

Here is a solution without using scipy.stats.mstats:

```python
def clip_series(s, lower, upper):
    # Clip a Series at its own quantiles; note that the axis=1 argument
    # in the original snippet is invalid for a Series and has been dropped
    return s.clip(lower=s.quantile(lower), upper=s.quantile(upper))

# Winsorize each feature column
for f in features.columns:
    features[f] = clip_series(features[f], 0.05, 0.95)
```

Note that this clips each column over the whole frame; to match the question it would still need to be applied per DATE group.

1 Comment

Can you add a short description?

I found a rather straightforward way to get this to work, using the transform method in pandas.

```python
from scipy.stats import mstats

lower_lim, upper_lim = 0.05, 0.05  # fraction to winsorize at each tail

def winsorize_series(group):
    return mstats.winsorize(group, limits=[lower_lim, upper_lim])

grouped = features.groupby(level='DATE')
result = grouped.transform(winsorize_series)
```
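A variant of the same transform idea that avoids scipy altogether, using the quantile-clip semantics from the question; this is a sketch on an illustrative toy frame, not the OP's data:

```python
import numpy as np
import pandas as pd

# Illustrative DATE-indexed frame
dates = pd.date_range('2001-01-01', periods=3, freq='D').repeat(30)
features = pd.DataFrame(np.random.random((90, 2)), index=dates,
                        columns=['x', 'y'])
features.index.name = 'DATE'

# transform hands each column of each group to the function as a Series,
# so a plain per-Series quantile clip is enough
result = features.groupby(level='DATE').transform(
    lambda s: s.clip(s.quantile(0.05), s.quantile(0.95))
)
```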

Comments


A good way to approach this is with vectorization, and for that I like to use np.where.

```python
import pandas as pd
import numpy as np
from scipy.stats import mstats
import timeit

data = pd.Series(range(20), dtype='float')

def WinsorizeCustom(data):
    quantiles = data.quantile([0.05, 0.95])
    q_05 = quantiles.loc[0.05]
    q_95 = quantiles.loc[0.95]
    # Replace everything at or below the 5% quantile with that quantile,
    # and everything at or above the 95% quantile likewise
    out = np.where(data.values <= q_05, q_05,
                   np.where(data >= q_95, q_95, data))
    return out
```

For comparison, I wrapped the function from scipy in a function:

```python
def WinsorizeStats(data):
    out = mstats.winsorize(data, limits=[0.05, 0.05])
    return out
```

But as you can see, even though my function is pretty fast, it's still far from the Scipy implementation:

```python
%timeit WinsorizeCustom(data)
# 1000 loops, best of 3: 842 µs per loop

%timeit WinsorizeStats(data)
# 1000 loops, best of 3: 212 µs per loop
```

If you are interested to read more about speeding up pandas code, I would suggest Optimization Pandas for speed and From Python to Numpy.
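Extending the vectorized idea above to the grouped setting of the question, here is a sketch (on an illustrative frame, not benchmarked) that computes the per-group, per-column bounds once and clips the whole frame in a single aligned call:

```python
import numpy as np
import pandas as pd

# Illustrative DATE-indexed frame standing in for `features`
dates = pd.date_range('2001-01-01', periods=5, freq='D').repeat(20)
features = pd.DataFrame(np.random.randn(100, 3), index=dates,
                        columns=['a', 'b', 'c'])
features.index.name = 'DATE'

g = features.groupby(level='DATE')
# Broadcast each group's quantile back to row level...
lo = g.transform(lambda s: s.quantile(0.05))
hi = g.transform(lambda s: s.quantile(0.95))
# ...then clip the whole frame element-wise against the aligned bounds
result = features.clip(lower=lo, upper=hi)
```

This does only two quantile passes over the grouped data and leaves all the thresholding to a single vectorized clip.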

Comments
