Converting pandas.DataFrame to bytes

Question

I need to convert the data stored in a pandas.DataFrame into a byte string, where each column can have a separate data type (integer or floating point). Here is a simple set of data:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame([10, 15, 20], dtype='u1', columns=['a'])
    df['b'] = np.array([np.iinfo('u8').max, 230498234019, 32094812309], dtype='u8')
    df['c'] = np.array([1.324e10, 3.14159, 234.1341], dtype='f8')

and df looks something like this:

        a                     b             c
    0  10  18446744073709551615  1.324000e+10
    1  15          230498234019  3.141590e+00
    2  20           32094812309  2.341341e+02

The DataFrame knows the type of each column (df.dtypes), so I'd like to do something like this:

    data_to_pack = [tuple(record) for _, record in df.iterrows()]
    data_array = np.array(data_to_pack, dtype=zip(df.columns, df.dtypes))
    data_bytes = data_array.tostring()

This typically works fine, but in this case, because of the maximum value stored in df['b'][0], the second line above (converting the list of tuples to an np.array with a given set of types) raises the following error:

OverflowError: Python int too large to convert to C long

The error originates (I believe) in the first line, which extracts each record as a Series with a single data type (it defaults to float64). The float64 representation chosen for the maximum uint64 value is not directly convertible back to uint64.
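The upcasting described above can be checked directly. The following is a minimal sketch, assuming a reasonably recent pandas and NumPy; it rebuilds the same data and inspects the first row returned by iterrows(). The variable names first_row and as_int are illustrative only.

```python
import numpy as np
import pandas as pd

# Same data as in the question, built column by column.
df = pd.DataFrame({
    "a": np.array([10, 15, 20], dtype="u1"),
    "b": np.array([np.iinfo("u8").max, 230498234019, 32094812309], dtype="u8"),
    "c": np.array([1.324e10, 3.14159, 234.1341], dtype="f8"),
})

# iterrows() yields each row as a Series with one common dtype,
# so the uint8, uint64 and float64 values are all stored as float64.
_, first_row = next(df.iterrows())
print(first_row.dtype)  # float64

# float64 cannot represent 2**64 - 1 exactly; it rounds up to 2**64,
# which is one past the largest value a uint64 can hold.
as_int = int(first_row["b"])
print(as_int == 2**64)
print(as_int > np.iinfo("u8").max)
```

Because the rounded value no longer fits in a uint64, converting the row tuples back to the original column types can overflow, which is consistent with the error shown above.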

1) Since the DataFrame already knows the type of each column, is there a way to avoid building a list of row tuples as input to the typed numpy.array constructor? Or is there a better way than the one outlined above to preserve the type information in such a conversion?

2) Is there a way to go directly from a DataFrame to a byte string representing the data, using the type information for each column?

ali_m · Accepted Answer · 2016-01-08 00:21:02Z

You can use df.to_records() to convert your dataframe to a numpy recarray, then call .tostring() to convert this to a string of bytes:

    rec = df.to_records(index=False)

    print(repr(rec))
    # rec.array([(10, 18446744073709551615, 13240000000.0), (15, 230498234019, 3.14159),
    #            (20, 32094812309, 234.1341)],
    #           dtype=[('a', '|u1'), ('b', '<u8'), ('c', '<f8')])

    s = rec.tostring()
    rec2 = np.fromstring(s, rec.dtype)

    print(np.all(rec2 == rec))
    # True
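Newer NumPy releases deprecate tostring() and the binary mode of fromstring() in favor of tobytes() and frombuffer(). The sketch below shows the same round trip with those calls, assuming a recent NumPy and pandas; the names payload, restored and roundtrip_df are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": np.array([10, 15, 20], dtype="u1"),
    "b": np.array([np.iinfo("u8").max, 230498234019, 32094812309], dtype="u8"),
    "c": np.array([1.324e10, 3.14159, 234.1341], dtype="f8"),
})

# Column dtypes carry over into the record dtype; no per-row tuples are built.
rec = df.to_records(index=False)

# Serialize the records to raw bytes.
payload = rec.tobytes()
print(len(payload), rec.dtype.itemsize * len(df))

# Decode with the same dtype. frombuffer returns a read-only view of the
# bytes, so copy it if a writable array is needed.
restored = np.frombuffer(payload, dtype=rec.dtype).copy()

# Check every field, including the uint64 maximum, survived the round trip.
for name in rec.dtype.names:
    assert np.array_equal(restored[name], rec[name])

# Optionally rebuild a DataFrame from the decoded records.
roundtrip_df = pd.DataFrame.from_records(restored)
print(roundtrip_df)
```

The byte string carries no description of its own layout, so the reader has to know rec.dtype (field names, types and byte order) to decode it.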

Zeno · Accepted Answer · 2022-04-27 18:33:01Z

    import pandas as pd

    df = pd.DataFrame([10, 15, 20], dtype='u1', columns=['a'])
    df_byte = df.to_json().encode()
    print(df_byte)
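To make the trade-off of the JSON approach concrete, here is a small sketch that compares the JSON bytes with the fixed-width record bytes for the same single-column frame. It assumes the default to_json() orientation; the names raw and restored are illustrative.

```python
import io

import pandas as pd

df = pd.DataFrame([10, 15, 20], dtype='u1', columns=['a'])

# JSON encoding: a UTF-8 text description of the frame, not fixed-width binary.
df_byte = df.to_json().encode()
print(df_byte)  # b'{"a":{"0":10,"1":15,"2":20}}'

# Fixed-width encoding of the same column: one byte per uint8 value.
raw = df.to_records(index=False).tobytes()
print(len(df_byte), len(raw))

# Reading the JSON back recovers the values, but the uint8 dtype is not
# recorded in the text, so read_json has to infer a dtype on its own.
restored = pd.read_json(io.StringIO(df_byte.decode()))
print(df['a'].dtype, restored['a'].dtype)
```

The JSON form is self-describing and easy to exchange, but it is larger and does not store the per-column binary types that the question asks to preserve.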

1 Comment

user17242583 Over a year ago

While this code may solve the question, including an explanation of how and why this solves the problem would really help to improve the quality of your post. Remember that you are answering the question for readers in the future, not just the person asking now. Please edit your answer to add explanations and give an indication of what limitations and assumptions apply.
