Skip to content

BUG: groupby sort issue when using rolling #50296

@domjanbaric

Description

@domjanbaric

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Simpler reproducible example

index = pd.MultiIndex.from_product( [[580, 546], [520, 655, 529], pd.date_range(dt.date(2020, 11, 1), dt.date(2020, 11, 3))], names=["machine_id",'product_id', "sample_date"], ) df = pd.DataFrame( data={'sales_qty': np.random.random(index.shape[0])}, index=index, ).reset_index() df = df.sort_values(by=['machine_id', 'product_id', 'sample_date']) df = df[df['machine_id'].ne(546) | df['product_id'].ne(529)] result = df.groupby(['machine_id','product_id'], sort=False)['sales_qty'].rolling(2, center=True).sum() print(result)

Reproducible Example

import pandas as pd df = pd.MultiIndex.from_product( [[580,546], [519,520,527,526,528,655,526,529,532],pd.date_range(dt.date(2020, 11, 1), dt.date(2022, 10, 31))], names=["machine_id",'product_id', "sample_date"], ) df = pd.DataFrame(index=df).reset_index() df["sample_date"] = pd.to_datetime(df["sample_date"]).dt.date df['sales_qty']=np.random.random(df.shape[0]) df=df.sort_values(by=['machine_id','product_id','sample_date']) df=df.reset_index(drop=True) df=df[~((df['machine_id']==546) & (df['product_id'].isin([526,529,532])))] df = df.groupby(['machine_id','product_id'],sort=False)['sales_qty'].rolling(7, center=True).sum().reset_index()[['machine_id','product_id']].drop_duplicates() df

Issue Description

When using pandas groupby with rolling, flag sort in groupby, sort order of group keys is changed. This occurs when we have more than 1 group key and values in key 2 are not equal for all values of key 1.

In example above we create dataframe with 4 columns, machine_id, product_id, sample_date and sales_qty. Dataframe consist of 2 different machine and 9 different products. Dataframe is sorted by: machine_id, product_id, sample_date. We want to calculate rolling mean for each machine_id-product_id group. However, if we remove some products for machine with lower machine_id (machine_id which is prior in sort, in this example 546) and then calculate rolling mean with flag sort= False in groupby order of group keys in result is actually changed. product_ids which were removed from machine with machine_id = 546 (product_id = 526,529,532) are at the end of dataframe.

 machine_id product_id 0 546 519 730 546 520 1460 546 527 2190 546 528 2920 546 655 3650 580 519 4380 580 520 5110 580 527 5840 580 528 6570 580 655 7300 580 526 <--- 8760 580 529 <--- 9490 580 532 <---

Expected Behavior

Because flag sort = False in groupby expected behaviour is that result will have same sort as the original dataframe, i.e. expected result is:

 machine_id product_id 0 546 519 730 546 520 1460 546 527 2190 546 528 2920 546 655 3650 580 519 4380 580 520 5110 580 526 <--- 6570 580 527 7300 580 528 8030 580 529 <--- 8760 580 532 <--- 9490 580 655

Installed Versions

INSTALLED VERSIONS

commit : 91111fd
python : 3.10.8.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-56-generic
Version : #62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.5.1
numpy : 1.23.4
pytz : 2022.6
dateutil : 2.8.2
setuptools : 63.2.0
pip : 22.2.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.5
jinja2 : 3.1.2
IPython : 8.6.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.5.3
numba : 0.56.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 10.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.9.3
snappy : None
sqlalchemy : 1.4.42
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

Metadata

Metadata

Assignees

Labels

BugGroupbySortinge.g. sort_index, sort_valuesWindowrolling, ewma, expanding

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions