I'd like to winsorize several columns of data in a pandas DataFrame. Most columns contain some NaN values, which distort the winsorization, so they need to be excluded. The only way I know to do this is to drop every row that contains a NaN in any column, rather than excluding NaN values column by column.

MWE:

    import numpy as np
    import pandas as pd
    from scipy.stats.mstats import winsorize

    # Create DataFrame
    N, M, P = 10**5, 4, 10**2
    dates = pd.date_range('2001-01-01', periods=N//P, freq='D').repeat(P)
    df = pd.DataFrame(np.random.random((N, M)), index=dates)
    df.index.names = ['DATE']
    df.columns = ['one', 'two', 'three', 'four']

    # Now scale them differently so you can see the winsorization
    df['four'] = df['four']*(10**5)
    df['three'] = df['three']*(10**2)
    df['two'] = df['two']*(10**-1)
    df['one'] = df['one']*(10**-4)

    # Create NaN
    df.loc[df.index.get_level_values(0).year == 2002, 'three'] = np.nan
    df.loc[df.index.get_level_values(0).month == 2, 'two'] = np.nan
    df.loc[df.index.get_level_values(0).month == 1, 'one'] = np.nan

Here is the baseline distribution:

    df.quantile([0, 0.01, 0.5, 0.99, 1])

Output:

                   one           two      three          four
    0.00  2.336618e-10  2.294259e-07   0.002437      2.305353
    0.01  9.862626e-07  9.742568e-04   0.975807   1003.814520
    0.50  4.975859e-05  4.981049e-02  50.290946  50374.548980
    0.99  9.897463e-05  9.898590e-02  98.978263  98991.438985
    1.00  9.999983e-05  9.999966e-02  99.996793  99999.437779

This is how I'm winsorizing:

    def using_mstats(s):
        return winsorize(s, limits=[0.01, 0.01])

    wins = df.apply(using_mstats, axis=0)
    wins.quantile([0, 0.01, 0.25, 0.5, 0.75, 0.99, 1])

Which gives this:

    Out[356]:
               one       two      three          four
    0.00  0.000001  0.001060   1.536882   1003.820149
    0.01  0.000001  0.001060   1.536882   1003.820149
    0.25  0.000025  0.024975  25.200378  25099.994780
    0.50  0.000050  0.049810  50.290946  50374.548980
    0.75  0.000075  0.074842  74.794537  75217.343920
    0.99  0.000099  0.098986  98.978263  98991.436957
    1.00  0.000100  0.100000  99.996793  98991.436957

Column four is correct because it has no NaN, but the others are incorrect: after winsorizing, the 99th percentile and the maximum should be the same. The observation counts are identical before and after:

    In [357]: df.count()
    Out[357]:
    one       90700
    two       91600
    three     63500
    four     100000
    dtype: int64

    In [358]: wins.count()
    Out[358]:
    one       90700
    two       91600
    three     63500
    four     100000
    dtype: int64

This is how I can 'solve' it, but at the cost of losing a lot of my data:

    wins2 = df.loc[df.notnull().all(axis=1)].apply(using_mstats, axis=0)
    wins2.quantile([0, 0.01, 0.25, 0.5, 0.75, 0.99, 1])

Output:

    Out[360]:
                   one       two      three          four
    0.00  9.686203e-07  0.000928   0.965702   1005.209503
    0.01  9.686203e-07  0.000928   0.965702   1005.209503
    0.25  2.486052e-05  0.024829  25.204032  25210.837443
    0.50  4.980946e-05  0.049894  50.299004  50622.227179
    0.75  7.492750e-05  0.075059  74.837900  75299.906415
    0.99  9.895563e-05  0.099014  98.972310  99014.311761
    1.00  9.895563e-05  0.099014  98.972310  99014.311761

    In [361]: wins2.count()
    Out[361]:
    one      51700
    two      51700
    three    51700
    four     51700
    dtype: int64

How can I winsorize the data, by column, that is not NaN, while maintaining the data shape (i.e. not removing rows)?
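To make the requirement concrete, here is a minimal sketch of one way it might be met while still using scipy's `winsorize`: apply it only to the non-NaN entries of each column and write the results back in place. The helper name `winsorize_non_nan` and the small `sample` series are hypothetical, introduced only for illustration; they are not part of the MWE above.

```python
import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize


def winsorize_non_nan(s, limits=(0.01, 0.01)):
    """Hypothetical helper: winsorize only the non-NaN entries of a Series."""
    out = s.astype(float)           # astype returns a new Series, so s is untouched
    mask = out.notna()              # True where a real observation exists
    if mask.any():
        # winsorize returns a masked array; convert it to a plain ndarray
        clipped = np.asarray(winsorize(out[mask].to_numpy(), limits=limits))
        out.loc[mask] = clipped     # NaN positions are left as NaN
    return out


# Illustrative data: the values 1..100 followed by five NaN entries
sample = pd.Series(np.concatenate([np.arange(1.0, 101.0), [np.nan] * 5]))
result = winsorize_non_nan(sample)
```

Under these assumptions it could be applied per column with `df.apply(winsorize_non_nan, axis=0)`, which keeps the original number of rows.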

1 Answer

As often happens, simply creating the MWE helped clarify. I need to use clip() in combination with quantile() as below:

    df2 = df.clip(lower=df.quantile(0.01), upper=df.quantile(0.99), axis=1)
    df2.quantile([0, 0.01, 0.25, 0.5, 0.75, 0.99, 1])

Output:

                   one       two      three          four
    0.00  9.862626e-07  0.000974   0.975807   1003.814520
    0.01  9.862666e-07  0.000974   0.975816   1003.820092
    0.25  2.485043e-05  0.024975  25.200378  25099.994780
    0.50  4.975859e-05  0.049810  50.290946  50374.548980
    0.75  7.486737e-05  0.074842  74.794537  75217.343920
    0.99  9.897462e-05  0.098986  98.978245  98991.436977
    1.00  9.897463e-05  0.098986  98.978263  98991.438985

    In [384]: df2.count()
    Out[384]:
    one       90700
    two       91600
    three     63500
    four     100000
    dtype: int64

The numbers differ from those above because each column keeps all of its non-missing values: quantile() skips NaN when computing the bounds, and clip() leaves NaN entries as they are.
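The clip-plus-quantile approach can be wrapped in a small reusable function. The following is only a sketch: the name `winsorize_columns`, its default limits, and the `demo` frame are assumptions made for illustration, not part of the original code.

```python
import numpy as np
import pandas as pd


def winsorize_columns(frame, lower=0.01, upper=0.99):
    """Hypothetical helper: clip each column to its own quantile bounds."""
    lo = frame.quantile(lower)   # per-column lower bound, NaN skipped
    hi = frame.quantile(upper)   # per-column upper bound, NaN skipped
    # axis=1 aligns the bound Series with the column labels
    return frame.clip(lower=lo, upper=hi, axis=1)


# Illustrative data: two columns, one of which has 100 missing values
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((1000, 2)), columns=['a', 'b'])
demo.loc[:99, 'a'] = np.nan      # label-based slice, rows 0..99 inclusive

clipped = winsorize_columns(demo)
```

Note that this clips to interpolated quantile values rather than to the nearest observed value, so results may differ slightly from scipy's `winsorize` at the boundaries.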
