Pandas groupby quantiles with indices

Question

Say I have the following:

    >>> import numpy
    >>> import pandas
    >>> numpy.random.seed(42)
    >>> df = pandas.DataFrame(numpy.random.randint(0, 100, 19), columns=['val'],
    ...                       index=pandas.date_range('2021-03-01', '2021-03-04', freq='4H'))
    >>> df
                         val
    2021-03-01 00:00:00   51
    2021-03-01 04:00:00   92
    2021-03-01 08:00:00   14
    2021-03-01 12:00:00   71
    2021-03-01 16:00:00   60
    2021-03-01 20:00:00   20
    2021-03-02 00:00:00   82
    2021-03-02 04:00:00   86
    2021-03-02 08:00:00   74
    2021-03-02 12:00:00   74
    2021-03-02 16:00:00   87
    2021-03-02 20:00:00   99
    2021-03-03 00:00:00   23
    2021-03-03 04:00:00    2
    2021-03-03 08:00:00   21
    2021-03-03 12:00:00   52
    2021-03-03 16:00:00    1
    2021-03-03 20:00:00   87
    2021-03-04 00:00:00   29
    >>> df.groupby(pandas.Grouper(freq='1D')).quantile(0.95, interpolation='higher')
                val
    2021-03-01   92
    2021-03-02   99
    2021-03-03   87
    2021-03-04   29

How can I also get the indices where quantiles are located within each group? I.e. my desired output is:

                val                 idx
    2021-03-01   92 2021-03-01 04:00:00
    2021-03-02   99 2021-03-02 20:00:00
    2021-03-03   87 2021-03-03 20:00:00
    2021-03-04   29 2021-03-04 00:00:00

ALollz · Accepted Answer · 2021-03-04 15:52:07Z

Instead of quantile, calculate the rank within each group and find which values are >= your quantile (since you use interpolation='higher'). Then sort the DataFrame, keep only the rows at or above your quantile, and take the first row within each group. Assigning the index as a column carries the timestamp along.

    m = df.resample('D')['val'].rank(method='dense', pct=True).ge(0.95)
    df1 = df.assign(index=df.index)[m].sort_values('val')
    df1.groupby(df1.index.normalize()).first()

                val               index
    2021-03-01   92 2021-03-01 04:00:00
    2021-03-02   99 2021-03-02 20:00:00
    2021-03-03   87 2021-03-03 20:00:00
    2021-03-04   29 2021-03-04 00:00:00
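As a quick check on the question's sample data, the same steps can be sketched with a plain groupby on the normalized dates for the rank step as well (a substitution made here, not the answer's exact code) and compared against the quantile call from the question. The values are hard-coded from the question's printout, and the names `day`, `candidates`, and `result` are illustrative only. The agreement is verified for this data and q=0.95 only; a dense percentile rank and a 'higher' quantile need not select the same row for every q.

```python
import pandas as pd

# Values copied from the question's printout (seed 42), 4-hourly index.
vals = [51, 92, 14, 71, 60, 20, 82, 86, 74, 74, 87, 99, 23, 2, 21, 52, 1, 87, 29]
idx = pd.date_range("2021-03-01", periods=19, freq=pd.Timedelta(hours=4))
df = pd.DataFrame({"val": vals}, index=idx)

# Group key: the calendar day of each timestamp (assumption: daily groups).
day = df.index.normalize()

# Dense percentile rank within each day, then keep rows at or above 0.95.
mask = df.groupby(day)["val"].rank(method="dense", pct=True).ge(0.95)

# Carry the timestamp along as a column, sort by value, take the first row per day.
candidates = df.assign(idx=df.index)[mask].sort_values("val")
result = candidates.groupby(candidates.index.normalize()).first()

# Reference: the quantile call from the question, grouped the same way.
expected = df.groupby(day)["val"].quantile(0.95, interpolation="higher")

print(result)
print(result["val"].tolist() == expected.tolist())
```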

Thanks! Is there a reason df.assign(idx=df.index)[m].resample('D').min() would not work?
Am I right that sort_values(j) is being negated by resample('D') afterwards? E.g. for .ge(0.5) (or any value <= 0.5, which is the rank of 51, timewise the first value for 2021-03-01) I get 2021-03-01 51 2021-03-01 00:00:00. pandas.DataFrame(numpy.arange(24, 0, -1), columns=['val'], index=pandas.date_range('2021-03-01', periods=24, freq='H')).sort_values('val').resample('D').first() returns 2021-03-01 24
@levantpied Yes, you seem to be correct. Instead of the resample methods, I've changed the answer to use a normal groupby, which doesn't negate the sort.
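The ordering issue discussed in these comments can be reproduced with a small sketch based on the commenter's own 24-value example. It assumes that resample puts the rows back into time order before taking the first one (which is what the comment reports), while a plain groupby keeps the row order it is given within each group. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# 24 hourly values counting down from 24 to 1 within a single day.
idx = pd.date_range("2021-03-01", periods=24, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"val": np.arange(24, 0, -1)}, index=idx)

# Sort by value, so the smallest value (1, at 23:00) comes first.
by_val = df.sort_values("val")

# resample: the first row in time order is returned, so the value sort is lost.
via_resample = by_val.resample("D").first()

# groupby on the normalized dates: row order within the group is preserved,
# so the first row is the smallest value.
via_groupby = by_val.groupby(by_val.index.normalize()).first()

print(via_resample["val"].iloc[0])
print(via_groupby["val"].iloc[0])
```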

Quang Hoang · Accepted Answer · 2021-03-04 04:19:54Z

One option is to use groupby().transform:

    import pandas as pd

    q95 = (df.groupby(pd.Grouper(freq='1D'))['val']
             .transform('quantile', q=0.95, interpolation='higher')
          )
    df[df['val'] == q95]

Output:

                         val
    2021-03-01 04:00:00   92
    2021-03-02 20:00:00   99
    2021-03-03 20:00:00   87
    2021-03-04 00:00:00   29
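If the layout from the question is wanted (one row per day, with the timestamp in its own column), the filtered rows can be reshaped afterwards. This is a minimal sketch under a few assumptions: the values are hard-coded from the question's printout, the groups are formed with `df.index.normalize()` rather than `pd.Grouper` (a substitution made here), and the column name `idx` follows the question's desired output. It relies on each day having exactly one row equal to its quantile, which holds for this data.

```python
import pandas as pd

# Values copied from the question's printout (seed 42), 4-hourly index.
vals = [51, 92, 14, 71, 60, 20, 82, 86, 74, 74, 87, 99, 23, 2, 21, 52, 1, 87, 29]
idx = pd.date_range("2021-03-01", periods=19, freq=pd.Timedelta(hours=4))
df = pd.DataFrame({"val": vals}, index=idx)

# Broadcast each day's 0.95 quantile back onto every row of that day.
day = df.index.normalize()
q95 = df.groupby(day)["val"].transform("quantile", q=0.95, interpolation="higher")

# Keep the rows that equal their day's quantile.
matches = df[df["val"] == q95]

# Move the timestamp into a column and index the result by day.
out = matches.assign(idx=matches.index)
out.index = out.index.normalize()

print(out)
```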

Thanks! Am I right that this doesn't work when there are duplicates within groups (e.g. with df = pandas.DataFrame(numpy.tile(numpy.arange(3), 10)[0:19], columns=['j'], index=pandas.date_range('2021-03-01', '2021-03-04', freq='4H')))?
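The duplicates case raised in this comment can be sketched as follows. With the commenter's tiled data, the equality filter returns every row that ties with the day's quantile, so some days contribute more than one row. One possible way to keep a single row per day, assumed here rather than taken from the answer, is to drop all but the earliest match for each date; `first_hits` is an illustrative name.

```python
import numpy as np
import pandas as pd

# The commenter's data: 0, 1, 2 repeated, 19 values on a 4-hourly index.
idx = pd.date_range("2021-03-01", periods=19, freq=pd.Timedelta(hours=4))
df = pd.DataFrame({"j": np.tile(np.arange(3), 10)[:19]}, index=idx)

# Per-day 0.95 quantile broadcast to every row (days formed via normalize()).
day = df.index.normalize()
q95 = df.groupby(day)["j"].transform("quantile", q=0.95, interpolation="higher")

# Every row equal to its day's quantile; ties give several rows per day.
hits = df[df["j"] == q95]

# Keep only the earliest matching row for each date.
first_hits = hits[~hits.index.normalize().duplicated(keep="first")]

print(len(hits))
print(first_hits)
```

Using `keep="last"` instead would return the latest matching timestamp per day, if that is the preferred tie-break.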
