-
- Notifications
You must be signed in to change notification settings - Fork 19.3k
BUG: Add np.uintc to _factorizers in merge.py to fix KeyError when merging DataFrames with uintc columns #58727
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… DataFrames with uintc columns
| hiii @myles the PR is ready to merge |
| thanks @Tirthchoksi22 the same issue was fixed for This is only for Windows and is it a regression from a previous release? if so could probably use the previous fix/tests/release note for this PR? |
| Yes, it appears that there is indeed a regression on Windows machines. The issue seems to have resurfaced after being resolved in a previous release.
|
| @simonjayhawkins After reviewing the code and considering the previous fix, it appears that the solution implemented in this pull request is indeed identical to the one used in the previous resolution of the regression.
|
| @simonjayhawkins Also guide me what to do next do i have to create new PR with previous fix/tests/release or this is ok ??as this would be my first Open Source Contribution |
| you can make changes to you branch and push those changes to update this PR. No need to close and open a new PR. All bug fixes and regression fixes need a test to ensure that the issue does not resurface here. In this case, looking at the fix for For the release note, if you (I can't check a windows only bug easily) determine for which pandas release it was a regression, or if that is many releases ago, then we maybe need a note either under the bug fixes or regression section of the release notes. If a bug fix would go in the 3.0 what's new. If a regression, would go in the 2.2.x release notes (but that depends if we are doing another patch release) cc @lithomas1 |
| @simonjayhawkins Thank you for your feedback. Upon further review, I realized that the changes made to the merge.py file were minimal and consisted of adding just one extra line ( Additionally, I'd like to confirm that the release note has already been included in the bug fixes and regression section, documenting the resolution of the issue. Given this information, I believe that the changes made align with the expectations outlined in your feedback. Please let me know if there are any further adjustments or considerations needed. Thank you for your continued guidance and support throughout this process. |
| ci is failing, /pre-commit |
so what to do now ?? |
| I thought that comment should have triggered the bot. |
| /pre-commit |
| @github-actions pre-commit |
I dont think there's a command like this |
| pre-commit.ci autofix |
for more information, see https://pre-commit.ci
| @pytest.mark.parametrize("d2", [np.int64, np.float64, np.float32, np.float16]) | ||
| def test_join_multi_dtypes(self, any_int_numpy_dtype, d2): | ||
| dtype1 = np.dtype(any_int_numpy_dtype) | ||
| def test_join_multi_dtypes(self, d1, d2): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it appears that this has been updated since the last fix.
I doubt that this should be changed. it's probably that any_int_numpy_dtype (and therefore tm.UNSIGNED_INT_NUMPY_DTYPES ) should be updated.
However, tm.UNSIGNED_INT_NUMPY_DTYPES does not appear to contain np.intc either. (maybe how the regression was uncaught)
So looks like will need to add both np.intc and np.uintc to tm.UNSIGNED_INT_NUMPY_DTYPES
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok i will add it in tm.UNSIGNED_INT_NUMPY_DTYPES
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain the issue to me (as I don't have the platform to test)
Although my previous comment was to align better with the previous fix, I don't know why we need to explicitly cater for np.intc and np.uintc since we don't need to for the other Python API “C-like” names such as numpy.short.
So adding these does not really make sense to me as I don't really understand the issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@simonjayhawkins what i think is the issue arises because certain functionalities within pandas were not properly handling np.intc and np.uintc data types. As a result, when these data types were encountered, it led to unexpected keyerrors.that's what i feel but don't know why we don't need to mention this in other Python API's
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@simonjayhawkins I still didnt understand why regression occur on windows side perhaps compatibility issue ???
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@simonjayhawkins if you think to change this according to last fix then tell me as this fix will also work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
functionalities within
pandaswere not properly handlingnp.intc
well it works on other platforms, so it maybe not a pandas issue?
| @simonjayhawkins Could you kindly confirm if the changes proposed in this pull request are satisfactory, or if there are any additional adjustments needed before merging? |
simonjayhawkins left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needs a release note. put in 3.0.0 notes for now.
| @pytest.mark.parametrize("d2", [np.int64, np.float64, np.float32, np.float16]) | ||
| def test_join_multi_dtypes(self, any_int_numpy_dtype, d2): | ||
| dtype1 = np.dtype(any_int_numpy_dtype) | ||
| def test_join_multi_dtypes(self, d1, d2): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
functionalities within
pandaswere not properly handlingnp.intc
well it works on other platforms, so it maybe not a pandas issue?
| pre-commit.ci autofix |
for more information, see https://pre-commit.ci
| I was browsing recent pull requests and have a Windows machine so thought I would give this a quick test for you. It looks like the difference is that on Windows np.uintc is not aliased to any of the types already in the _factorizers dictionary: Python 3.12.3 (tags/v3.12.3:f6650f9, Apr 9 2024, 14:05:25) [MSC v.1938 64 bit (AMD64)] on win32 Type "help", "copyright", "credits" or "license" for more information. >>> import numpy as np >>> np.uintc is np.int64 False >>> np.uintc is np.longlong False >>> np.uintc is np.int32 False >>> np.uintc is np.int16 False >>> np.uintc is np.int8 False >>> np.uintc is np.uint64 False >>> np.uintc is np.uint32 False >>> np.uintc is np.uint16 False >>> np.uintc is np.uint8 False >>> np.uintc is np.bool_ False >>> np.uintc is np.float64 False >>> np.uintc is np.float32 False >>> np.uintc is np.complex64 False >>> np.uintc is np.complex128 False >>> np.uintc is np.object_ FalseWhereas at least in Linux it is aliased to the np.uint32 type: Python 3.12.3 (main, Apr 10 2024, 05:33:47) [GCC 13.2.0] on linux Type "help", "copyright", "credits" or "license" for more information. >>> import numpy as np >>> np.uintc is np.int64 False >>> np.uintc is np.longlong False >>> np.uintc is np.int32 False >>> np.uintc is np.int16 False >>> np.uintc is np.int8 False >>> np.uintc is np.uint64 False >>> np.uintc is np.uint32 True >>> np.uintc is np.uint16 False >>> np.uintc is np.uint8 False >>> np.uintc is np.bool_ False >>> np.uintc is np.float64 False >>> np.uintc is np.float32 False >>> np.uintc is np.complex64 False >>> np.uintc is np.complex128 False >>> np.uintc is np.object_ FalseThis is probably related to Windows using the LLP64 64-bit data model vs LP64 used elsewhere (https://en.wikipedia.org/wiki/64-bit_computing#64-bit_data_models). https://stackoverflow.com/questions/76155091/np-uint32-np-uintc-on-windows gives a possible explanation. |
| @Tirthchoksi22 on a procedural note, the |
ok sorry I didn't know about that |
| Can you please use that comment now because i didnt have install pre-commit in my local environment right now |
feel free to issue that command yourself. I was not suggesting that you could not use it. (However, if you look at the pandas documentation, you will see a section for setting up a development environment) |
Co-authored-by: William Ayd <william.ayd@icloud.com>
| pre-commit.ci autofix |
doc/source/whatsnew/v3.0.0.rst Outdated
| Reshaping | ||
| ^^^^^^^^^ | ||
| - Bug in :meth:`DataFrame.join` inconsistently setting result index name (:issue:`55815`) | ||
| - Fixed issue in `pd.merge` (`#58713`) where merging DataFrames with `np.intc` or `np.uintc` data types caused unexpected behavior or errors. Comprehensive testing now ensures consistent behavior across diverse data type combinations, enhancing stability and robustness of data merging operations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you format like the notes around it.
start with "Bug in ...", keep the description short (one line) but be specific (mention "Windows"), and end with (:issue:`58713`)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok done
| pre-commit.ci autofix |
for more information, see https://pre-commit.ci
| @simonjayhawkins @WillAyd @mroeschke Guys any update ??? |
| I thought that it might be interesting to look at the differences in aliases between Windows and Linux for these integer types: Python 3.12.3 (tags/v3.12.3:f6650f9, Apr 9 2024, 14:05:25) [MSC v.1938 64 bit (AMD64)] on win32 Type "help", "copyright", "credits" or "license" for more information. >>> from typing import reveal_type >>> import numpy as np >>> reveal_type(np.uintc(0)) Runtime type is 'uintc' 0 >>> reveal_type(np.intc(0)) Runtime type is 'intc' 0 >>> reveal_type(np.uint(0)) Runtime type is 'uint32' 0 >>> reveal_type(np.longlong(0)) Runtime type is 'int64' 0 >>> reveal_type(np.ulonglong(0)) Runtime type is 'uint64' 0Python 3.12.3 (main, Apr 10 2024, 05:33:47) [GCC 13.2.0] on linux Type "help", "copyright", "credits" or "license" for more information. >>> from typing import reveal_type >>> import numpy as np >>> reveal_type(np.uintc(0)) Runtime type is 'uint32' 0 >>> reveal_type(np.intc(0)) Runtime type is 'int32' 0 >>> reveal_type(np.uint(0)) Runtime type is 'uint64' 0 >>> reveal_type(np.longlong(0)) Runtime type is 'longlong' 0 >>> reveal_type(np.ulonglong(0)) Runtime type is 'ulonglong' 0This implies that you will encounter the same behaviour as the initial bug report on Linux, but not Windows if the ulonglong type is used as this isn't aliased on that platform, and this is indeed the case: Python 3.12.3 (tags/v3.12.3:f6650f9, Apr 9 2024, 14:05:25) [MSC v.1938 64 bit (AMD64)] on win32 Type "help", "copyright", "credits" or "license" for more information. >>> import numpy as np >>> import pandas as pd >>> df1 = pd.DataFrame({'a':['foo','bar'],'b':np.array([1,2], dtype=np.ulonglong)}) >>> df2 = pd.DataFrame({'a':['foo','baz'],'b':np.array([3,4], dtype=np.ulonglong)}) >>> df3=df1.merge(df2, how = 'outer') >>> print(df3) a b 0 bar 2 1 baz 4 2 foo 1 3 foo 3Python 3.12.3 (main, Apr 10 2024, 05:33:47) [GCC 13.2.0] on linux Type "help", "copyright", "credits" or "license" for more information. >>> import numpy as np >>> import pandas as pd >>> df1 = pd.DataFrame({'a':['foo','bar'],'b':np.array([1,2], dtype=np.ulonglong)}) >>> df2 = pd.DataFrame({'a':['foo','baz'],'b':np.array([3,4], dtype=np.ulonglong)}) >>> df3=df1.merge(df2, how = 'outer') Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/lib/python3/dist-packages/pandas/core/frame.py", line 10487, in merge return merge( ^^^^^^ File "/usr/lib/python3/dist-packages/pandas/core/reshape/merge.py", line 183, in merge return op.get_result(copy=copy) ^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/lib/python3/dist-packages/pandas/core/reshape/merge.py", line 883, in get_result join_index, left_indexer, right_indexer = self._get_join_info() ^^^^^^^^^^^^^^^^^^^^^ File "/usr/lib/python3/dist-packages/pandas/core/reshape/merge.py", line 1133, in _get_join_info (left_indexer, right_indexer) = self._get_join_indexers() ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/lib/python3/dist-packages/pandas/core/reshape/merge.py", line 1105, in _get_join_indexers return get_join_indexers( ^^^^^^^^^^^^^^^^^^ File "/usr/lib/python3/dist-packages/pandas/core/reshape/merge.py", line 1703, in get_join_indexers zipped = zip(*mapped) ^^^^^^^^^^^^ File "/usr/lib/python3/dist-packages/pandas/core/reshape/merge.py", line 1700, in <genexpr> _factorize_keys(left_keys[n], right_keys[n], sort=sort, how=how) File "/usr/lib/python3/dist-packages/pandas/core/reshape/merge.py", line 2477, in _factorize_keys klass, lk, rk = _convert_arrays_and_get_rizer_klass(lk, rk) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/lib/python3/dist-packages/pandas/core/reshape/merge.py", line 2556, in _convert_arrays_and_get_rizer_klass klass = _factorizers[lk.dtype.type] ~~~~~~~~~~~~^^^^^^^^^^^^^^^ KeyError: <class 'numpy.ulonglong'>This also suggests that if you were following the same pattern as with intc/uintc that the longlong assignment in _factorizers should be conditional. |
| Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen. |
doc/source/whatsnew/vX.X.X.rstfile if fixing a bug or adding a new feature.