deleted 5 characters in body
JohnnyM

Efficient and elegant:

    import pandas as pd

    tf = pd.DataFrame([
        {'id': 1, 'nested': {'a': 1, 'b': 2}},
        {'id': 2, 'nested': {'a': 2, 'b': 4}},
        {'id': 3, 'nested': {'a': 3, 'b': 6}},
        {'id': 4, 'nested': {'a': 4}},
    ])

    def unpack(df, column, fillna=None):
        # Expand each dict in `column` into one new column per key.
        # Series.items() replaces iteritems(), which was removed in pandas 2.0.
        expanded = pd.DataFrame(d for idx, d in df[column].items())
        if fillna is not None:
            # Keys missing from some dicts become NaN; replace them if asked.
            expanded = expanded.fillna(fillna)
        ret = pd.concat([df, expanded], axis=1)
        del ret[column]
        return ret

    unpack(tf, 'nested', 0)

will yield

       id  a  b
    0   1  1  2
    1   2  2  4
    2   3  3  6
    3   4  4  0

and seems to be quite efficient:

    tf = pd.DataFrame([
        {'id': i, 'nested': {'a': i, 'b': i*2}}
        for i in range(100000)  # range replaces Python 2's xrange
    ])
    %timeit unpack(tf, 'nested')
    # gives: 10 loops, best of 3: 95 ms per loop

If you want to create a projection (select a subset of the keys in the nested dict), you can use apply before unpack or a column projection on the DataFrame created inside unpack.
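Both projection routes might look roughly like the sketch below. The `keep` list is a hypothetical name introduced here for illustration, and `unpack` is restated in a compact form so the snippet runs on its own. Note that the second option selects columns on the result of unpack rather than inside it, which is a simplification of what the paragraph above describes.

```python
import pandas as pd

def unpack(df, column, fillna=None):
    # Compact restatement of unpack: expand each dict into its own columns.
    expanded = pd.DataFrame(list(df[column]))
    if fillna is not None:
        expanded = expanded.fillna(fillna)
    return pd.concat([df.drop(columns=column), expanded], axis=1)

tf = pd.DataFrame([
    {'id': 1, 'nested': {'a': 1, 'b': 2}},
    {'id': 2, 'nested': {'a': 2, 'b': 4}},
    {'id': 3, 'nested': {'a': 3, 'b': 6}},
    {'id': 4, 'nested': {'a': 4}},
])

keep = ['a']  # hypothetical: the subset of nested keys to keep

# Option 1: trim each dict with apply, then unpack the trimmed column.
projected = tf.assign(
    nested=tf['nested'].apply(lambda d: {k: v for k, v in d.items() if k in keep})
)
option1 = unpack(projected, 'nested')

# Option 2: unpack everything, then select the wanted columns.
option2 = unpack(tf, 'nested')[['id'] + keep]
```

Option 1 avoids building columns that are thrown away immediately, which may matter when the nested dicts have many keys; option 2 is shorter to write.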

The main advantages of the above solution are:

  • it is much more generic - does not depend on the keys in your nested document
  • it is efficient - uses (presumably optimized) pandas methods wherever possible, along with generators/iterators
  • handles keys that exist in only some of the nested documents and lets you specify how they should be handled (a fillna value or NaN)
  • can be converted to a one-liner for the sake of brevity
  • does not reinvent anything
  • uses naming consistent with other libraries (for example, the SFrame.unpack method in Dato's GraphLab Create)
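As a rough illustration of the one-liner and missing-key points in the list above, here is a sketch that assumes the DataFrame has the default 0..n-1 index (otherwise the axis=1 concat would align rows by label rather than by position). The names `with_nan` and `with_zero` are made up for this example.

```python
import pandas as pd

tf = pd.DataFrame([
    {'id': 1, 'nested': {'a': 1, 'b': 2}},
    {'id': 2, 'nested': {'a': 2, 'b': 4}},
    {'id': 3, 'nested': {'a': 3, 'b': 6}},
    {'id': 4, 'nested': {'a': 4}},
])

# One-liner without fillna: the missing 'b' in the last row stays NaN.
with_nan = pd.concat([tf.drop(columns='nested'), pd.DataFrame(tf['nested'].tolist())], axis=1)

# One-liner with a fill value: the missing 'b' becomes 0.
with_zero = pd.concat([tf.drop(columns='nested'), pd.DataFrame(tf['nested'].tolist()).fillna(0)], axis=1)

print(with_nan)
print(with_zero)
```

Because column `b` has to hold NaN before it is filled, it ends up as a float column in both results.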
