1 change: 1 addition & 0 deletions doc/source/whatsnew/v3.0.0.rst
@@ -458,6 +458,7 @@ Other
- Bug in :meth:`DataFrame.sort_index` when passing ``axis="columns"`` and ``ignore_index=True`` and ``ascending=False`` not returning a :class:`RangeIndex` columns (:issue:`57293`)
- Bug in :meth:`DataFrame.where` where using a non-bool type array in the function would return a ``ValueError`` instead of a ``TypeError`` (:issue:`56330`)
- Bug in :meth:`Index.sort_values` when passing a key function that turns values into tuples, e.g. ``key=natsort.natsort_key``, would raise ``TypeError`` (:issue:`56081`)
- Bug in :meth:`Series.unique` returning incorrect values for unique, non-UTF-8-encodable strings (:issue:`45929`)
- Bug in Dataframe Interchange Protocol implementation was returning incorrect results for data buffers' associated dtype, for string and datetime columns (:issue:`54781`)

.. ***DO NOT USE THIS SECTION***
8 changes: 7 additions & 1 deletion pandas/_libs/hashtable_class_helper.pxi.in
@@ -1179,6 +1179,8 @@ cdef class StringHashTable(HashTable):
use_na_value = na_value is not None

# assign pointers and pre-filter out missing (if ignore_na)
# https://cython.readthedocs.io/en/latest/src/userguide/language_basics.html#caveats-when-using-a-python-string-in-a-c-context
keep_bad_unicode_refs = []
Member:

This doesn't fix the underlying issue. It may make the failure less likely, but the lifecycle of the temporary strings is still not properly controlled.

I think what needs to be done is something akin to:

'1 \udcd6a NY'.encode('utf8', errors="surrogatepass").decode('utf8', errors="surrogatepass")

I think you can just do that directly in Cython, but we can look at the underlying C API if needed.

(side note: I would really love to get rid of get_c_string; it is not a value-added layer of indirection)

Member Author:

I was able to replace get_c_string with the associated C APIs, but if I understand https://cython.readthedocs.io/en/latest/src/userguide/language_basics.html#caveats-when-using-a-python-string-in-a-c-context correctly, since we're storing char pointers obtained from these encoded Python strings and using them later, we need to keep a reference to those Python strings.
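
For illustration, a minimal Cython sketch (hypothetical names, not the pandas code) of that caveat, assuming only PyUnicode_AsUTF8 from the CPython C API: the returned char* is a borrowed pointer into the str object's cached UTF-8 buffer, so it dangles once the str is garbage collected.

    from libc.stdlib cimport malloc, free

    cdef extern from "Python.h":
        const char* PyUnicode_AsUTF8(object unicode) except NULL

    def first_bytes(list strings):
        # Store borrowed char* pointers now and read them after the loop,
        # the same shape as the pandas hashtable code above.
        cdef Py_ssize_t i, n = len(strings)
        cdef const char** vecs = <const char**>malloc(n * sizeof(char*))
        if vecs is NULL:
            raise MemoryError()
        keep_alive = []                      # keeps each temporary str alive
        for i in range(n):
            tmp = repr(strings[i])           # temporary, UTF-8-encodable str
            keep_alive.append(tmp)           # without this, vecs[i] can dangle
            vecs[i] = PyUnicode_AsUTF8(tmp)  # borrowed pointer into tmp
        out = []
        for i in range(n):
            out.append(vecs[i][0])           # pointers dereferenced after the loop
        free(vecs)
        return out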

Member (@WillAyd, Apr 11, 2024):

Oh OK, I see the difference: CPython can cache the UTF-8 bytes of a string alongside the string object. When the string is not UTF-8 encodable and we have to create temporary objects, we run into trouble. So the list here artificially extends the lifetime of new objects that are guaranteed to be UTF-8 encodable, to rely on that caching mechanism.

To be honest I would be -1 on this change and would rather call it a wontfix. It is a very niche issue that fights the internals of CPython (and, for that matter, of pyarrow, whose strings are UTF-8).

If someone wanted mixed-encoding Python strings like this, I think pa.binary() is a better data type choice (a sketch follows below).
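
For reference, a minimal sketch of that alternative (assuming pyarrow is installed; the names here are illustrative): lone surrogates survive an errors="surrogatepass" encode, and the resulting bytes can be stored in a binary column as-is.

    import pyarrow as pa

    data = ["1 \udcd6a NY", "2 \udcd6b NY"]
    # surrogatepass produces lossless bytes even for lone surrogates
    raw = [s.encode("utf-8", errors="surrogatepass") for s in data]
    arr = pa.array(raw, type=pa.binary())
    # round-trip back to the original strings, lone surrogates included
    assert [b.as_py().decode("utf-8", errors="surrogatepass") for b in arr] == data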

Member Author:

OK, I'll close this PR then. My main motivation was to make test_unique_bad_unicode not flaky due to this issue, but I'll open a follow-up PR making this test permanently xfail(strict=True) instead.

Member:

I think the name test_unique_bad_unicode is a misnomer. The code point "\ud83d" exists in Unicode; it is a high surrogate:

https://www.unicode.org/charts/PDF/UD800.pdf

The problem is that by itself that high surrogate doesn't mean anything (it would need to be paired with a low surrogate). As such, it doesn't represent any glyph in any encoding.

AFAIU if you wanted to keep that unicode code point, you would have to:

  1. Convert the Python Unicode object to bytes via str.encode(<encoding>, errors="surrogatepass")
  2. Run your algorithms against the surrogatepass bytes
  3. Convert your surrogatepassed bytes back to a unicode string via bytes.decode(<encoding>, errors="surrogatepass")

I realize step 2 may not exist today, but it is a good impetus to work on interop with pa.binary() if this is required.
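
Concretely, steps 1 and 3 in plain Python (step 2 would be whatever algorithm runs against the bytes):

    s = "3 \ud800c NY"  # contains a lone high surrogate

    # step 1: strict UTF-8 rejects lone surrogates; surrogatepass keeps them
    raw = s.encode("utf-8", errors="surrogatepass")

    # step 3: decoding with surrogatepass restores the original code points
    assert raw.decode("utf-8", errors="surrogatepass") == s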

vecs = <const char **>malloc(n * sizeof(char *))
if vecs is NULL:
raise MemoryError()
@@ -1197,7 +1199,9 @@ cdef class StringHashTable(HashTable):
try:
v = get_c_string(<str>val)
except UnicodeEncodeError:
v = get_c_string(<str>repr(val))
rval = <str>repr(val)
keep_bad_unicode_refs.append(rval)
v = get_c_string(rval)
vecs[i] = v

# compute
@@ -1223,6 +1227,8 @@ cdef class StringHashTable(HashTable):
idx = self.table.vals[k]
labels[i] = idx

keep_bad_unicode_refs.clear()
del keep_bad_unicode_refs
free(vecs)

# uniques
22 changes: 21 additions & 1 deletion pandas/tests/base/test_unique.py
@@ -99,7 +99,6 @@ def test_nunique_null(null_obj, index_or_series_obj):
assert obj.nunique(dropna=False) == max(0, num_unique_values)


@pytest.mark.single_cpu
@pytest.mark.xfail(using_pyarrow_string_dtype(), reason="decoding fails")
def test_unique_bad_unicode(index_or_series):
# regression test for #34550
@@ -116,6 +115,27 @@ def test_unique_bad_unicode(index_or_series):
tm.assert_numpy_array_equal(result, expected)


def test_unique_bad_unicode2(index_or_series):
# regression test for #45929
data_list = [
"1 \udcd6a NY",
"2 \udcd6b NY",
"3 \ud800c NY",
"4 \udcd6d NY",
"5 \udcc3e NY",
]

obj = index_or_series(data_list)
result = obj.unique()
if isinstance(obj, pd.Index):
expected = pd.Index(data_list, dtype=object)
tm.assert_index_equal(result, expected)
else:
expected = np.array(data_list, dtype=object)
tm.assert_numpy_array_equal(result, expected)


@pytest.mark.parametrize("dropna", [True, False])
def test_nunique_dropna(dropna):
# GH37566
ser = pd.Series(["yes", "yes", pd.NA, np.nan, None, pd.NaT])