I'd like to winsorize several columns of data in a pandas DataFrame. Each column contains some NaN values, which distort the winsorization, so they need to be excluded. The only way I know to do this is to drop every row that has a NaN in any column, rather than excluding NaN values column by column.
MWE:

    import numpy as np
    import pandas as pd
    from scipy.stats.mstats import winsorize

    # Create Dataframe
    N, M, P = 10**5, 4, 10**2
    dates = pd.date_range('2001-01-01', periods=N//P, freq='D').repeat(P)
    df = pd.DataFrame(np.random.random((N, M)), index=dates)
    df.index.names = ['DATE']
    df.columns = ['one','two','three','four']

    # Now scale them differently so you can see the winsorization
    df['four'] = df['four']*(10**5)
    df['three'] = df['three']*(10**2)
    df['two'] = df['two']*(10**-1)
    df['one'] = df['one']*(10**-4)

    # Create NaN
    df.loc[df.index.get_level_values(0).year == 2002,'three'] = np.nan
    df.loc[df.index.get_level_values(0).month == 2,'two'] = np.nan
    df.loc[df.index.get_level_values(0).month == 1,'one'] = np.nan

Here is the baseline distribution:

    df.quantile([0, 0.01, 0.5, 0.99, 1])

Output:

                   one           two      three          four
    0.00  2.336618e-10  2.294259e-07   0.002437      2.305353
    0.01  9.862626e-07  9.742568e-04   0.975807   1003.814520
    0.50  4.975859e-05  4.981049e-02  50.290946  50374.548980
    0.99  9.897463e-05  9.898590e-02  98.978263  98991.438985
    1.00  9.999983e-05  9.999966e-02  99.996793  99999.437779

This is how I'm winsorizing:

    def using_mstats(s):
        return winsorize(s, limits=[0.01, 0.01])

    wins = df.apply(using_mstats, axis=0)
    wins.quantile([0, 0.01, 0.25, 0.5, 0.75, 0.99, 1])

Which gives this:

    Out[356]:
               one       two      three          four
    0.00  0.000001  0.001060   1.536882   1003.820149
    0.01  0.000001  0.001060   1.536882   1003.820149
    0.25  0.000025  0.024975  25.200378  25099.994780
    0.50  0.000050  0.049810  50.290946  50374.548980
    0.75  0.000075  0.074842  74.794537  75217.343920
    0.99  0.000099  0.098986  98.978263  98991.436957
    1.00  0.000100  0.100000  99.996793  98991.436957

Column four is correct because it has no NaN, but the others are incorrect: after winsorizing, the 99th percentile and the max should be the same. The observation counts are identical for df and wins:

    In [357]: df.count()
    Out[357]:
    one       90700
    two       91600
    three     63500
    four     100000
    dtype: int64

    In [358]: wins.count()
    Out[358]:
    one       90700
    two       91600
    three     63500
    four     100000
    dtype: int64

This is how I can 'solve' it, but at the cost of losing a lot of my data:

    wins2 = df.loc[df.notnull().all(axis=1)].apply(using_mstats, axis=0)
    wins2.quantile([0, 0.01, 0.25, 0.5, 0.75, 0.99, 1])

Output:

    Out[360]:
                   one       two      three          four
    0.00  9.686203e-07  0.000928   0.965702   1005.209503
    0.01  9.686203e-07  0.000928   0.965702   1005.209503
    0.25  2.486052e-05  0.024829  25.204032  25210.837443
    0.50  4.980946e-05  0.049894  50.299004  50622.227179
    0.75  7.492750e-05  0.075059  74.837900  75299.906415
    0.99  9.895563e-05  0.099014  98.972310  99014.311761
    1.00  9.895563e-05  0.099014  98.972310  99014.311761

    In [361]: wins2.count()
    Out[361]:
    one      51700
    two      51700
    three    51700
    four     51700
    dtype: int64

How can I winsorize each column using only its non-NaN values, while keeping the shape of the data (i.e. without removing rows)?
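For illustration, here is one direction that might achieve this. It is a minimal sketch, not a confirmed solution, and it swaps `scipy`'s `winsorize` for quantile-based clipping. It relies on `DataFrame.quantile` skipping NaN by default and on `DataFrame.clip` leaving NaN untouched. The helper name `winsorize_by_column` and the small `demo` frame are hypothetical and used only for this example.

```python
import numpy as np
import pandas as pd


def winsorize_by_column(frame, lower=0.01, upper=0.01):
    """Clip each column at its own quantiles (hypothetical helper).

    Assumption: quantile-based clipping is an acceptable stand-in for
    scipy.stats.mstats.winsorize.
    """
    # quantile() ignores NaN, so each column's bounds come only from
    # that column's non-missing values.
    lo = frame.quantile(lower)
    hi = frame.quantile(1 - upper)
    # axis=1 aligns the per-column bounds with the columns; NaN stay NaN
    # and no rows are dropped.
    return frame.clip(lower=lo, upper=hi, axis=1)


# Small illustrative frame with NaN in only one column
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((1000, 2)), columns=["a", "b"])
demo.loc[:299, "a"] = np.nan

out = winsorize_by_column(demo)
```

Because the bounds are interpolated quantiles rather than observed values, the clipped results can differ slightly from what `winsorize` returns on a NaN-free column.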