Description
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas (1.3.0rc1).
- (optional) I have confirmed this bug exists on the master branch of pandas
Code Sample, a copy-pastable example
import pandas as pd

data = pd.array([0, 1, 2, 3], dtype="Int32")
df = expected = pd.DataFrame({"data": pd.Series(data)})
result = pd.DataFrame(index=df.index)
result.loc[df.index, "data"] = df["data"]
print(df["data"].dtype)  # prints: Int32
print(result["data"].dtype)  # prints: float64 <--

Problem description
In my mind, this behavior seems unexpected because the provided dtype should be preserved and not coerced to the default type for an empty Series. This occurs for the nullable integer dtypes as well as Float32/Float64.
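For comparison, here is a minimal sketch of the same setup with the column assigned directly rather than through .loc expansion. It is offered only as a contrast with the snippet above, on the assumption that direct column assignment is an acceptable stand-in for the .loc call; it does not address the .loc behavior itself.

```python
import pandas as pd

data = pd.array([0, 1, 2, 3], dtype="Int32")
df = pd.DataFrame({"data": pd.Series(data)})

# Assumption: assigning the whole column is acceptable in place of
# expanding the frame through .loc with a full indexer.
result = pd.DataFrame(index=df.index)
result["data"] = df["data"]

print(df["data"].dtype)      # prints: Int32
print(result["data"].dtype)  # prints: Int32
```

With direct assignment the extension dtype carries over, which suggests the coercion is specific to the .loc expansion path.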
I came across this while implementing an ExtensionDtype, which ended up failing BaseSetitemTests.test_setitem_with_expansion_dataframe_column:
pandas/pandas/tests/extension/base/setitem.py
Lines 335 to 343 in 648eb40
def test_setitem_with_expansion_dataframe_column(self, data, full_indexer):
    # https://github.com/pandas-dev/pandas/issues/32395
    df = expected = pd.DataFrame({"data": pd.Series(data)})
    result = pd.DataFrame(index=df.index)
    key = full_indexer(df)
    result.loc[key, "data"] = df["data"]
    self.assert_frame_equal(result, expected)
Interestingly, in the tests for IntegerArray and FloatingArray, the test data includes missing values, and in that case the dtype is not coerced to float64:
import pandas as pd

data = pd.array([0, pd.NaT, 2, 3], dtype="Int32")
df = expected = pd.DataFrame({"data": pd.Series(data)})
result = pd.DataFrame(index=df.index)
result.loc[df.index, "data"] = df["data"]
print(df["data"].dtype)  # prints: Int32
print(result["data"].dtype)  # prints: Int32 <--

My expectation was that the dtype should be preserved in such cases, with/without NaN values.
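To see how far the behavior extends, the two cases can be run side by side across several nullable dtypes. The following is a rough diagnostic sketch, not part of the original report: the helper name `loc_expansion_dtype` is hypothetical, and it uses `pd.NA` as the missing-value marker (an assumption; the snippet above uses `pd.NaT`). It only reports what each pandas version does and asserts nothing about the outcome.

```python
import pandas as pd


def loc_expansion_dtype(values, dtype):
    """Hypothetical helper: return (source dtype, dtype after .loc expansion)."""
    df = pd.DataFrame({"data": pd.Series(pd.array(values, dtype=dtype))})
    result = pd.DataFrame(index=df.index)
    # Same operation as in the report: create the column via .loc
    result.loc[df.index, "data"] = df["data"]
    return str(df["data"].dtype), str(result["data"].dtype)


cases = [
    ("no missing", [0, 1, 2, 3]),
    ("with pd.NA", [0, pd.NA, 2, 3]),  # assumption: pd.NA as the missing marker
]

report = {}
for dtype in ["Int8", "Int16", "Int32", "Int64", "Float32", "Float64"]:
    for label, values in cases:
        report[(dtype, label)] = loc_expansion_dtype(values, dtype)

# Print one line per combination: source dtype -> dtype after expansion
for (dtype, label), (before, after) in report.items():
    print(f"{dtype:8} {label:12} {before} -> {after}")
```

Any row where the two dtypes differ marks a combination in which the .loc expansion did not preserve the extension dtype on the installed pandas version.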
Expected Output
I would expect the dtype of the pd.Series being added to result to be preserved; in the minimal example, result["data"] should have dtype Int32 (Int32Dtype).
print(df["data"].dtype)  # prints: Int32
print(result["data"].dtype)  # prints: Int32 <--

Output of pd.show_versions()
This was generated from the latest release candidate, but it appears to also occur on the master branch (1.4.0.dev0+56.g648eb40abc)
INSTALLED VERSIONS
commit : 2dd9e9b
python : 3.8.5.final.0
python-bits : 64
OS : Darwin
OS-release : 17.7.0
Version : Darwin Kernel Version 17.7.0: Fri Oct 30 13:34:27 PDT 2020; root:xnu-4570.71.82.8~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.3.0rc1
numpy : 1.20.3
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.2
setuptools : 49.6.0.post20210108
Cython : None
pytest : 6.2.1
hypothesis : None
sphinx : 3.3.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.24.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 2021.05.0
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.3
sqlalchemy : None
tables : None
tabulate : 0.8.7
xarray : None
xlrd : None
xlwt : None
numba : None