Skip to content

Irregular errors when reading certain categorical strings from hdf #10366

@cottrell

Description

@cottrell

It seems that there is something bad happening when we use certain strings with special characters AND the empty string with categoricals:

 # -*- coding: latin-1 -*- import pandas import os examples = [ pandas.Series(['EÉ, 17', '', 'a', 'b', 'c'], dtype='category'), pandas.Series(['EÉ, 17', 'a', 'b', 'c'], dtype='category'), pandas.Series(['', 'a', 'b', 'c'], dtype='category'), pandas.Series(['EE, 17', '', 'a', 'b', 'c'], dtype='category'), pandas.Series(['øü', 'a', 'b', 'c'], dtype='category'), pandas.Series(['Aøü', '', 'a', 'b', 'c'], dtype='category'), pandas.Series(['EÉ, 17', 'øü', 'a', 'b', 'c'], dtype='category') ] def test_hdf(s): f = 'testhdf.h5' if os.path.exists(f): os.remove(f) s.to_hdf(f, 'data', format='table') return pandas.read_hdf(f, 'data') for i, s in enumerate(examples): flag = True e = '' try: test_hdf(s) except Exception as ex: e = ex flag = False print('%d: %s\t%s\t%s' % (i, 'pass' if flag else 'fail', s.tolist(), e))

Results in:

 0: fail ['EÉ, 17', '', 'a', 'b', 'c'] Categorical categories must be unique 1: pass ['EÉ, 17', 'a', 'b', 'c'] 2: pass ['', 'a', 'b', 'c'] 3: pass ['EE, 17', '', 'a', 'b', 'c'] 4: pass ['øü', 'a', 'b', 'c'] 5: fail ['Aøü', '', 'a', 'b', 'c'] Categorical categories must be unique 6: pass ['EÉ, 17', 'øü', 'a', 'b', 'c'] 

Not sure if I am using this incorrectly or if this is actually a corner case.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions