I have a relatively large DataFrame object (about a million rows, hundreds of columns), and I'd like to clip outliers in each column by group. By "clip outliers for each column by group" I mean - compute the 5% and 95% quantiles for each column in a group and clip values outside this quantile range.
Here's the setup I'm currently using:
def winsorize_series(s): q = s.quantile([0.05, 0.95]) if isinstance(q, pd.Series) and len(q) == 2: s[s < q.iloc[0]] = q.iloc[0] s[s > q.iloc[1]] = q.iloc[1] return s def winsorize_df(df): return df.apply(winsorize_series, axis=0) and then, with my DataFrame called features and indexed by DATE, I can do
grouped = features.groupby(level='DATE') result = grouped.apply(winsorize_df) This works, except that it's very slow, presumably due to the nested apply calls: one on each group, and then one for each column in each group. I tried getting rid of the second apply by computing quantiles for all columns at once, but got stuck trying to threshold each column by a different value. Is there a faster way to accomplish this procedure?