Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
    import pandas as pd
    import numpy as np

    # Create a dataframe with a categorical column with two categories and
    # a (numpy) boolean column that is randomly True or False
    df = pd.DataFrame.from_dict({'category': ['A']*10 + ['B']*10,
                                 'bool_numpy': np.random.rand(20) > 0.5})

    # Now make another column that is a copy of the numpy boolean column,
    # but converted to pyarrow
    df['bool_arrow'] = df['bool_numpy'].astype('bool[pyarrow]')
    print(df.head())
    #   category  bool_numpy  bool_arrow
    # 0        A        True        True
    # 1        A        True        True
    # 2        A        True        True
    # 3        A        True        True
    # 4        A       False       False

    # Now do a groupby and aggregate to compute the fraction of True values
    # in each column:
    true_fracs = df.groupby('category').agg(lambda x: x.sum()/x.count())
    print(true_fracs)
    #           bool_numpy  bool_arrow
    # category
    # A                0.7        True
    # B                0.6        True

    # I expect both columns above to have identical floating-point values,
    # not boolean.

Issue Description
Doing a groupby and aggregation on a bool[pyarrow] column returns a different datatype than the same operation on a numpy bool column. In particular, it seems to always return another bool[pyarrow] regardless of the aggregation performed.
Expected Behavior
I would expect the same datatype and results to be returned regardless of the backend chosen. Specifically, I would expect the result for category 'A' to be the same as the result of the following calculation, which is the same regardless of backend:
    # Restrict both the sum and the count to category 'A'
    subset = df.query("category=='A'")[['bool_numpy', 'bool_arrow']]
    print(subset.sum() / subset.count())
    # bool_numpy    0.7
    # bool_arrow    0.7
    # dtype: float64

Or, if this is the intended behavior, I would expect this change to be prominently displayed in the groupby documentation.
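A possible workaround, sketched here and not a statement about how pandas intends this to behave: cast the boolean columns to float64 before grouping, so the aggregation never sees a boolean dtype it might try to preserve. The helper name `true_fraction_by_group` and the fixed data below are hypothetical, chosen so the numbers are reproducible without a random seed. The pyarrow column is added only if pyarrow can be imported, and the sketch assumes the columns contain no missing values.

```python
import pandas as pd


def true_fraction_by_group(df, key, columns):
    """Fraction of True values per group (hypothetical helper).

    The columns are cast to float64 first, so the groupby mean is a
    plain floating-point reduction whatever the original backend was.
    For columns without missing values, mean equals sum / count.
    """
    as_float = df[columns].astype("float64")
    return as_float.groupby(df[key]).mean()


# Deterministic stand-in for the random data in the report:
# 7 of 10 True in category A, 6 of 10 True in category B.
df = pd.DataFrame({
    "category": ["A"] * 10 + ["B"] * 10,
    "bool_numpy": [True] * 7 + [False] * 3 + [True] * 6 + [False] * 4,
})

try:
    import pyarrow  # noqa: F401  (only needed for the bool[pyarrow] dtype)
    df["bool_arrow"] = df["bool_numpy"].astype("bool[pyarrow]")
except ImportError:
    pass  # without pyarrow, only the numpy-backed column is checked

value_columns = [c for c in df.columns if c != "category"]
result = true_fraction_by_group(df, "category", value_columns)
print(result)
print(result.dtypes)
```

For the numpy-backed column this gives 0.7 for category A and 0.6 for category B as float64. The cast happens before the groupby, so it does not depend on how `agg` chooses the result dtype for a user-supplied lambda.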
Installed Versions
pandas : 2.0.1
numpy : 1.23.5
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 57.5.0
pip : 23.0.1
Cython : 0.29.33
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.10.0
pandas_datareader: None
bs4 : 4.11.2
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.0
numba : 0.56.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.1
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : 2023.1.0
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : 2.3.0
pyqt5 : None