17

I need to convert the data stored in a pandas.DataFrame into a byte string where each column can have a separate data type (integer or floating point). Here is a simple set of data:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame([10, 15, 20], dtype='u1', columns=['a'])
    df['b'] = np.array([np.iinfo('u8').max, 230498234019, 32094812309], dtype='u8')
    df['c'] = np.array([1.324e10, 3.14159, 234.1341], dtype='f8')

and df looks something like this:

        a                     b             c
    0  10  18446744073709551615  1.324000e+10
    1  15          230498234019  3.141590e+00
    2  20           32094812309  2.341341e+02

The DataFrame knows the type of each column (df.dtypes), so I'd like to do something like this:

    data_to_pack = [tuple(record) for _, record in df.iterrows()]
    data_array = np.array(data_to_pack, dtype=zip(df.columns, df.dtypes))
    data_bytes = data_array.tostring()

This typically works fine, but in this case it fails because of the maximum value stored in df['b'][0]. The second line above, which converts the list of tuples to an np.array with the given set of types, raises the following error:

OverflowError: Python int too large to convert to C long 

The error results (I believe) from the first line, which extracts each record as a Series with a single data type (float64 in this case). The float64 representation of the maximum uint64 value is not directly convertible back to uint64.
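The suspected cause can be checked directly. The following is a minimal sketch, assuming the same df as above; the variable names (first_row, b_as_float, exceeds_uint64) are introduced here only for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([10, 15, 20], dtype='u1', columns=['a'])
df['b'] = np.array([np.iinfo('u8').max, 230498234019, 32094812309], dtype='u8')
df['c'] = np.array([1.324e10, 3.14159, 234.1341], dtype='f8')

# iterrows() returns each row as a Series, which has a single dtype,
# so the mixed uint8/uint64/float64 row is upcast to float64.
_, first_row = next(df.iterrows())
row_dtype = first_row.dtype

# float64 cannot represent 2**64 - 1 exactly; it rounds up to 2**64,
# which is one past the largest value a uint64 can hold.
b_as_float = first_row['b']
exceeds_uint64 = int(b_as_float) > np.iinfo('u8').max

print(row_dtype, b_as_float, exceeds_uint64)
```

If exceeds_uint64 is true, the value can no longer be stored in a uint64 field, which would be consistent with the OverflowError above.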

1) Since the DataFrame already knows the type of each column, is there a way to avoid creating a list of row tuples as input to the typed numpy.array constructor? Or is there a better way than the one outlined above to preserve the type information in such a conversion?

2) Is there a way to go directly from a DataFrame to a byte string representing the data, using the type information for each column?

2 Answers

14

You can use df.to_records() to convert your DataFrame to a NumPy recarray, then call .tobytes() (named .tostring() in older NumPy releases) to convert it to a string of bytes:

    rec = df.to_records(index=False)

    print(repr(rec))
    # rec.array([(10, 18446744073709551615, 13240000000.0), (15, 230498234019, 3.14159),
    #  (20, 32094812309, 234.1341)],
    #           dtype=[('a', '|u1'), ('b', '<u8'), ('c', '<f8')])

    s = rec.tobytes()                         # older NumPy: rec.tostring()
    rec2 = np.frombuffer(s, dtype=rec.dtype)  # older NumPy: np.fromstring(s, rec.dtype)

    print(np.all(rec2 == rec))
    # True
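To get from the bytes back to a DataFrame, something along these lines should work. This is a sketch rather than part of the original answer: it assumes the receiver already knows the record dtype (it is not stored in the bytes), and the names payload, decoded and restored are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([10, 15, 20], dtype='u1', columns=['a'])
df['b'] = np.array([np.iinfo('u8').max, 230498234019, 32094812309], dtype='u8')
df['c'] = np.array([1.324e10, 3.14159, 234.1341], dtype='f8')

rec = df.to_records(index=False)
payload = rec.tobytes()

# Size of one row in bytes. Assumption: the fields are packed without
# padding, in which case this is 1 (u1) + 8 (u8) + 8 (f8).
row_size = rec.dtype.itemsize

# frombuffer returns a read-only view of the bytes; copy() makes it writable.
decoded = np.frombuffer(payload, dtype=rec.dtype).copy()
restored = pd.DataFrame.from_records(decoded)

print(row_size, len(payload))
print(restored.dtypes)
```

Since the dtype travels separately from the payload, it has to be agreed on in advance or sent alongside the bytes (for example as the list in rec.dtype.descr).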
2
    import pandas as pd

    df = pd.DataFrame([10, 15, 20], dtype='u1', columns=['a'])
    df_byte = df.to_json().encode()
    print(df_byte)
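For context on what this produces: the result is UTF-8 encoded JSON text, not the raw binary values of each column. A small sketch, using only the standard json module to inspect the payload (the names decoded and value_types are illustrative):

```python
import json
import pandas as pd

df = pd.DataFrame([10, 15, 20], dtype='u1', columns=['a'])
df_byte = df.to_json().encode()

# The payload is JSON text keyed by column name, then by index label.
decoded = json.loads(df_byte.decode())

# The values come back as plain Python ints; the 'u1' dtype is not recorded.
value_types = {type(v).__name__ for v in decoded['a'].values()}

print(decoded, value_types)
```

This may be fine when a portable text format is wanted, but the per-column type information asked about in the question has to be restored separately.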

1 Comment

While this code may solve the question, including an explanation of how and why this solves the problem would really help to improve the quality of your post. Remember that you are answering the question for readers in the future, not just the person asking now. Please edit your answer to add explanations and give an indication of what limitations and assumptions apply.
