Efficient and elegant:

    import pandas as pd

    tf = pd.DataFrame([
        {'id': 1, 'nested': {'a': 1, 'b': 2}},
        {'id': 2, 'nested': {'a': 2, 'b': 4}},
        {'id': 3, 'nested': {'a': 3, 'b': 6}},
        {'id': 4, 'nested': {'a': 4}},
    ])

    def unpack(df, column, fillna=None):
        # One new column per key found in the nested dicts; keys missing from
        # a row become NaN. Reusing df.index keeps concat aligned row by row.
        # (Series.iteritems() was removed in pandas 2.0; items() replaces it.)
        expanded = pd.DataFrame((d for _, d in df[column].items()), index=df.index)
        if fillna is not None:
            expanded = expanded.fillna(fillna)
        ret = pd.concat([df, expanded], axis=1)
        del ret[column]  # drop the original nested column
        return ret

    unpack(tf, 'nested', 0)

will yield

       id  a  b
    0   1  1  2
    1   2  2  4
    2   3  3  6
    3   4  4  0

and it seems to be quite efficient:

    tf = pd.DataFrame([
        {'id': i, 'nested': {'a': i, 'b': i*2}}
        for i in range(100000)
    ])

    %timeit unpack(tf, 'nested')  # gives
    10 loops, best of 3: 95 ms per loop

If you want a projection (only a subset of the keys in the nested dict), you can either use `apply` on the column before calling `unpack`, or select columns on the DataFrame that is created inside `unpack`.
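A rough sketch of both projection routes, in case it helps. `unpack_keys` is a hypothetical helper name (it is not part of pandas), and `unpack` is repeated in a compact form only so the snippet runs on its own:

```python
import pandas as pd

def unpack(df, column, fillna=None):
    # Compact restatement of the helper, included so this sketch is standalone.
    expanded = pd.DataFrame(list(df[column]), index=df.index)
    if fillna is not None:
        expanded = expanded.fillna(fillna)
    return pd.concat([df.drop(columns=column), expanded], axis=1)

def unpack_keys(df, column, keys, fillna=None):
    # Hypothetical helper: trim every nested dict down to `keys` with apply,
    # then unpack what is left.
    trimmed = df[column].apply(lambda d: {k: d[k] for k in keys if k in d})
    return unpack(df.assign(**{column: trimmed}), column, fillna)

tf = pd.DataFrame([
    {'id': 1, 'nested': {'a': 1, 'b': 2}},
    {'id': 2, 'nested': {'a': 2, 'b': 4}},
    {'id': 4, 'nested': {'a': 4}},
])

# Route 1: project with apply before unpacking.
only_a = unpack_keys(tf, 'nested', ['a'])

# Route 2: unpack everything, then select the columns you need.
also_a = unpack(tf, 'nested')[['id', 'a']]
```

Route 1 avoids building columns you are going to throw away, which may matter when the nested dicts are wide; route 2 is shorter if they are small.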
The main advantages of the above solution are:
- it is much more generic: it does not depend on the keys in your nested documents
- it is efficient: it uses (presumably optimized) pandas methods wherever possible, together with generators/iterators
- it handles keys that are missing from some nested documents and lets you choose how to treat them (a `fillna` value or NaN)
- it can be converted to a one-liner for the sake of brevity
- it does not reinvent anything
- its naming is consistent with other libraries (the `SFrame.unpack` method in Dato's GraphLab Create)
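For the one-liner mentioned in the list, something along these lines should work. This is only a sketch: it hard-codes the column name `'nested'` and a fill value of `0`, and uses `join` instead of `concat`, which is my substitution rather than part of the function above:

```python
import pandas as pd

tf = pd.DataFrame([
    {'id': 1, 'nested': {'a': 1, 'b': 2}},
    {'id': 4, 'nested': {'a': 4}},
])

# Expand the dicts into columns, fill the gaps with 0, and join back on the index.
flat = tf.drop(columns='nested').join(pd.DataFrame(tf['nested'].tolist(), index=tf.index).fillna(0))
```

Note that a column that had missing keys ends up as a float column after `fillna`, since pandas represents the gaps as NaN first.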