
This is a string I'm getting out of pandas.DataFrame.to_json(), putting it into redis, getting it out of redis elsewhere, and trying to read it via pandas.read_json():

DFJ {"args":{"0":"[]","1":"[]","2":"[]","3":"[]","4":"[]","5":"[]","6":"[]","7":"[]"},
    "date":{"0":1385944439000000000,"1":1385944439000000000,"2":1385944440000000000,"3":1385944440000000000,"4":1385944440000000000,"5":1385944440000000000,"6":1385944440000000000,"7":1385944440000000000},
    "host":{"0":"yy38.segm1.org","1":"kyy1.segm1.org","2":"yy10.segm1.org","3":"yy24.segm1.org","4":"yy24.segm1.org","5":"yy34.segm1.org","6":"yy15.segm1.org","7":"yy15.segm1.org"},
    "kwargs":{"0":"{}","1":"{}","2":"{}","3":"{}","4":"{}","5":"{}","6":"{}","7":"{}"},
    "operation":{"0":"x_gbinf","1":"x_initobj","2":"x_gobjParams","3":"gtfull","4":"x_gbinf","5":"gxyzinf","6":"deletemfg","7":"gxyzinf"},
    "thingy":{"0":"a13yy38","1":"a19kyy1","2":"a14yy10","3":"a14yy24","4":"a14yy24","5":"a12yy34","6":"a15yy15","7":"a15yy15"},
    "status":{"0":-101,"1":1,"2":-101,"3":-101,"4":-101,"5":-101,"6":1,"7":-101},
    "time":{"0":0.000801,"1":0.003244,"2":0.002247,"3":0.002787,"4":0.001067,"5":0.002652,"6":0.004371,"7":0.000602}}

It does not seem to contain any unicode. Yet when I try to read_json() it, I get:

Traceback (most recent call last):
  File "./sqlprofile.py", line 160, in <module>
    maybe_save_dataframes(rconn, configd, results)
  File "./sqlprofile.py", line 140, in maybe_save_dataframes
    h5store.append(out_queue, df)
  File "/home/username/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py", line 658, in append
    self._write_to_group(key, value, table=True, append=True, **kwargs)
  File "/home/username/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py", line 923, in _write_to_group
    s.write(obj = value, append=append, complib=complib, **kwargs)
  File "/home/username/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py", line 2985, in write
    **kwargs)
  File "/home/username/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py", line 2717, in create_axes
    raise e
TypeError: [unicode] is not implemented as a table column
> /home/username/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py(2717)create_axes()
-> raise e
(Pdb) locals()

This is what I get in locals(): it seems that the append_axis values (the column names?) are unicode. Why?

{'append_axis': [u'args', u'date', u'host', u'kwargs', u'operation', u'thingy', u'status', u'time'],
 'existing_table': None,
 'blocks': [FloatBlock: [time], 1 x 8, dtype float64,
            ObjectBlock: [args, host, kwargs, operation, thingy], 5 x 8, dtype object,
            IntBlock: [status], 1 x 8, dtype int64,
            DatetimeBlock: [date], 1 x 8, dtype datetime64[ns]],
 'axis': 1,
 'self': frame_table (typ->appendable,nrows->None,ncols->1,indexers->[index]),
 'axes': [0],
 'kwargs': {},
 'klass': <class 'pandas.io.pytables.DataCol'>,
 'block_obj':
   args                date            host kwargs     operation   thingy  status      time
 0   [] 2013-12-02 00:33:59  yy38.segm1.org     {}       x_gbinf  a13yy38    -101  0.000801
 1   [] 2013-12-02 00:33:59  kyy1.segm1.org     {}     x_initobj  a19kyy1       1  0.003244
 2   [] 2013-12-02 00:34:00  yy10.segm1.org     {}  x_gobjParams  a14yy10    -101  0.002247
 3   [] 2013-12-02 00:34:00  yy24.segm1.org     {}        gtfull  a14yy24    -101  0.002787
 4   [] 2013-12-02 00:34:00  yy24.segm1.org     {}       x_gbinf  a14yy24    -101  0.001067
 5   [] 2013-12-02 00:34:00  yy34.segm1.org     {}       gxyzinf  a12yy34    -101  0.002652
 6   [] 2013-12-02 00:34:00  yy15.segm1.org     {}     deletemfg  a15yy15       1  0.004371
 7   [] 2013-12-02 00:34:00  yy15.segm1.org     {}       gxyzinf  a15yy15    -101  0.000602,
 'axis_labels': [u'args', u'date', u'host', u'kwargs', u'operation', u'thingy', u'status', u'time'],
 'nan_rep': 'nan',
 'data_columns': [],
 'obj': (same 8-row frame as 'block_obj' above),
 'validate': True,
 'a': (1, [u'args', u'date', u'host', u'kwargs', u'operation', u'thingy', u'status', u'time']),
 'index_axes_map': {0: name->index,cname->index,axis->0,pos->0,kind->integer},
 'b': ObjectBlock: [args, host, kwargs, operation, thingy], 5 x 8, dtype object,
 'e': TypeError('[unicode] is not implemented as a table column',),
 'name': None,
 'existing_col': None,
 'j': 2,
 'i': 1,
 'min_itemsize': None,
 'col': name->values_block_1,cname->values_block_1,dtype->None,shape->None}

How can I fix that? Is this a bug in Pandas / pytables?
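For anyone trying to reproduce this, here is a minimal sketch of the roundtrip, with the Redis leg stood in for by a plain UTF-8 encode/decode (all Redis does to the payload is store and return it as bytes). The DataFrame contents are trimmed to two columns for brevity; on Python 3 this runs cleanly, since str is already unicode there:

```python
import io
import pandas as pd

# Trimmed-down version of the frame that goes into to_json()
df = pd.DataFrame({
    "host": ["yy38.segm1.org", "kyy1.segm1.org"],
    "status": [-101, 1],
})

payload = df.to_json()          # str
wire = payload.encode("utf-8")  # stand-in for the Redis set/get roundtrip
df2 = pd.read_json(io.StringIO(wire.decode("utf-8")))

# Values and dtypes survive the roundtrip
print(df2.sort_index(axis=1).equals(df.sort_index(axis=1)))
```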

Environment:

- Python 2.7
- pandas==0.12.0
- tables==3.0.0

  • Without DFJ at the beginning, read_json on the first string works fine for me in both 0.12 and 0.13rc. Commented Dec 18, 2013 at 23:13

2 Answers


It seems your roundtripping IS introducing some unicode. Not sure why, but it's easy to fix: you cannot store unicode in an HDFStore Table in Python 2 (this works correctly in Python 3, however). You could store it in Fixed format if you want, though (it would be pickled). See here.

In [33]: df = pd.read_json(s)

In [25]: df
Out[25]:
  args                date            host kwargs     operation  status   thingy      time
0   [] 2013-12-02 00:33:59  yy38.segm1.org     {}       x_gbinf    -101  a13yy38  0.000801
1   [] 2013-12-02 00:33:59  kyy1.segm1.org     {}     x_initobj       1  a19kyy1  0.003244
2   [] 2013-12-02 00:34:00  yy10.segm1.org     {}  x_gobjParams    -101  a14yy10  0.002247
3   [] 2013-12-02 00:34:00  yy24.segm1.org     {}        gtfull    -101  a14yy24  0.002787
4   [] 2013-12-02 00:34:00  yy24.segm1.org     {}       x_gbinf    -101  a14yy24  0.001067
5   [] 2013-12-02 00:34:00  yy34.segm1.org     {}       gxyzinf    -101  a12yy34  0.002652
6   [] 2013-12-02 00:34:00  yy15.segm1.org     {}     deletemfg       1  a15yy15  0.004371
7   [] 2013-12-02 00:34:00  yy15.segm1.org     {}       gxyzinf    -101  a15yy15  0.000602

[8 rows x 8 columns]

In [26]: df.dtypes
Out[26]:
args                 object
date         datetime64[ns]
host                 object
kwargs               object
operation            object
status                int64
thingy               object
time                float64
dtype: object

This infers the actual type of the object-dtyped Series. They will come out as unicode only if at least one string is unicode (otherwise they would be inferred as string).
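On a recent pandas, the same probe looks like this (infer_dtype has since moved from pd.lib to pandas.api.types; 'unicode' is only reported on Python 2, while Python 3 strings infer as 'string'):

```python
import pandas as pd
from pandas.api.types import infer_dtype  # modern home of pd.lib.infer_dtype

s = pd.Series(["x_gbinf", "x_initobj"])
# On Python 3 this prints 'string'; a Python 2 Series of u"..." values
# would have reported 'unicode' instead.
print(infer_dtype(s))
```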

In [27]: df.apply(lambda x: pd.lib.infer_dtype(x.values))
Out[27]:
args           unicode
date        datetime64
host           unicode
kwargs         unicode
operation      unicode
status         integer
thingy         unicode
time          floating
dtype: object

Here's how to 'fix' it

In [28]: types = df.apply(lambda x: pd.lib.infer_dtype(x.values))

In [29]: types[types=='unicode']
Out[29]:
args         unicode
host         unicode
kwargs       unicode
operation    unicode
thingy       unicode
dtype: object

In [30]: for col in types[types=='unicode'].index:
   ....:     df[col] = df[col].astype(str)
   ....:

It looks the same:

In [31]: df
Out[31]:
  args                date            host kwargs     operation  status   thingy      time
0   [] 2013-12-02 00:33:59  yy38.segm1.org     {}       x_gbinf    -101  a13yy38  0.000801
1   [] 2013-12-02 00:33:59  kyy1.segm1.org     {}     x_initobj       1  a19kyy1  0.003244
2   [] 2013-12-02 00:34:00  yy10.segm1.org     {}  x_gobjParams    -101  a14yy10  0.002247
3   [] 2013-12-02 00:34:00  yy24.segm1.org     {}        gtfull    -101  a14yy24  0.002787
4   [] 2013-12-02 00:34:00  yy24.segm1.org     {}       x_gbinf    -101  a14yy24  0.001067
5   [] 2013-12-02 00:34:00  yy34.segm1.org     {}       gxyzinf    -101  a12yy34  0.002652
6   [] 2013-12-02 00:34:00  yy15.segm1.org     {}     deletemfg       1  a15yy15  0.004371
7   [] 2013-12-02 00:34:00  yy15.segm1.org     {}       gxyzinf    -101  a15yy15  0.000602

[8 rows x 8 columns]

But it now infers correctly:

In [32]: df.apply(lambda x: pd.lib.infer_dtype(x.values))
Out[32]:
args            string
date        datetime64
host            string
kwargs          string
operation       string
status         integer
thingy          string
time          floating
dtype: object
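Pulled together, the fix can be wrapped in a small helper. This is a sketch, not from the answer: coerce_unicode_columns is my own name, it uses the modern pandas.api.types.infer_dtype spelling, and on Python 3 the loop is effectively a no-op because infer_dtype never reports 'unicode' there:

```python
import pandas as pd
from pandas.api.types import infer_dtype  # modern home of pd.lib.infer_dtype

def coerce_unicode_columns(df):
    """Cast every column whose values infer as 'unicode' to plain str,
    so the frame can be appended to an HDFStore table on Python 2."""
    types = df.apply(lambda s: infer_dtype(s))
    for col in types[types == "unicode"].index:
        df[col] = df[col].astype(str)
    return df
```

Calling coerce_unicode_columns(df) right after read_json() and before HDFStore.append() would keep the rest of the pipeline unchanged.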

10 Comments

THANK YOU! This works. Although this seems to be a strange omission, given Python's JSON decoder logic: "json.loads(s) ... Deserialize s (a str or unicode instance containing a JSON document) to a Python object using this conversion table." Conversion table: docs.python.org/2/library/json.html#json-to-py-table . This means that even in Python 2.7 the only string-type objects that come out of JSON decoding are unicode.
This does not use the Python json decoder, but rather a custom decoder based on ujson. Haven't seen this before.
In your env that is giving a problem, print out pd.get_option('display.encoding'); it should be the same as in IPython.
I think that is the default on some systems IIRC. I think you could just set this as soon as you import Python; it's in sys somewhere (you can also just set it with pandas, via set_option(...)). That might fix it for you.
In recent versions of pandas, this solution yields: "FutureWarning: pandas.lib is deprecated and will be removed in a future version. You can access infer_dtype as pandas.api.types.infer_dtype". Replace df.apply(lambda x: pd.lib.infer_dtype(x.values)) with df.apply(lambda x: pd.api.types.infer_dtype(x.values)).

The above solution may cause errors with unicode special characters. Here is a similar way to convert unicode to string that will not get hung up on them:

for col in types[types=='unicode'].index:
    df[col] = df[col].apply(lambda x: x.encode('utf-8').strip())

This is due in part to how Python handles unicode. More info on that in the Python Unicode HOWTO.
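One caveat worth noting about the encode approach (my illustration, not part of the answer): on Python 2, u.encode('utf-8') yields a byte str, which is what HDFStore wants; on Python 3 it yields bytes, which is usually not what you want sitting in a DataFrame column:

```python
s = u"caf\xe9"                 # a unicode string ("café")
b = s.encode("utf-8")          # Python 2: str (bytes); Python 3: bytes
print(type(b).__name__)        # 'str' on Python 2, 'bytes' on Python 3
assert b.decode("utf-8") == s  # lossless roundtrip either way
```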
