14

I have a large dataframe and I am storing a lot of redundant values that are making it hard to handle my data. I have a dataframe of the form:

import pandas as pd df = pd.DataFrame([["a","g","n1","y1"], ["a","g","n2","y2"], ["b","h","n1","y3"], ["b","h","n2","y4"]], columns=["meta1", "meta2", "name", "data"]) >>> df meta1 meta2 name data a g n1 y1 a g n2 y2 b h n1 y3 b h n2 y4 

where I have the names of the new columns I would like in name and the respective data in data.

I would like to produce a dataframe of the form:

df = pd.DataFrame([["a","g","y1","y2"], ["b","h","y3","y4"]], columns=["meta1", "meta2", "n1", "n2"]) >>> df meta1 meta2 n1 n2 a g y1 y2 b h y3 y4 

The columns called meta are around 15+ other columns that contain most of the data, and I don't think are particularly well suited to for indexing. The idea is that I have a lot of repeated/redundant data stored in meta at the moment and I would like to produce the more compact dataframe presented.

I have found some similar Qs but can't pinpoint what sort of operations I need to do: pivot, re-index, stack or unstack, etc.?

PS - the original index values are unimportant for my purposes.

Any help would be much appreciated.

Question I think is related:

I think the following Q is related to what I am trying to do, but I can't see how to apply it, as I don't want to produce more indexes.

3 Answers 3

18

If you group your meta columns into a list then you can do this:

metas = ['meta1', 'meta2'] new_df = df.set_index(['name'] + metas).unstack('name') print new_df data name n1 n2 meta1 meta2 a g y1 y2 b h y3 y4 

Which gets you most of the way there. Additional tailoring can get you the rest of the way.

print new_df.data.rename_axis([None], axis=1).reset_index() meta1 meta2 n1 n2 0 a g y1 y2 1 b h y3 y4 
Sign up to request clarification or add additional context in comments.

Comments

14

You can use pivot_table with reset_index and rename_axis (new in pandas 0.18.0):

print (df.pivot_table(index=['meta1','meta2'], columns='name', values='data', aggfunc='first') .reset_index() .rename_axis(None, axis=1)) meta1 meta2 n1 n2 0 a g y1 y2 1 b h y3 y4 

But better is use aggfunc join:

print (df.pivot_table(index=['meta1','meta2'], columns='name', values='data', aggfunc=', '.join) .reset_index() .rename_axis(None, axis=1)) meta1 meta2 n1 n2 0 a g y1 y2 1 b h y3 y4 

Explanation, why join is generally better as first:

If use first, you can lost all data which are not first in each group by index, but join concanecate them:

import pandas as pd df = pd.DataFrame([["a","g","n1","y1"], ["a","g","n2","y2"], ["a","g","n1","y3"], ["b","h","n2","y4"]], columns=["meta1", "meta2", "name", "data"]) print (df) meta1 meta2 name data 0 a g n1 y1 1 a g n2 y2 2 a g n1 y3 3 b h n2 y4 print (df.pivot_table(index=['meta1','meta2'], columns='name', values='data', aggfunc='first') .reset_index() .rename_axis(None, axis=1)) meta1 meta2 n1 n2 0 a g y1 y2 1 b h None y4 print (df.pivot_table(index=['meta1','meta2'], columns='name', values='data', aggfunc=', '.join) .reset_index() .rename_axis(None, axis=1)) meta1 meta2 n1 n2 0 a g y1, y3 y2 1 b h None y4 

5 Comments

what I have called meta is actually multiple other columns.
Sorry, column meta is not one, but more columns like meta1, meta2... ?
Great answer as always @jezrael
@piRSquared - Thank you.
aggfunc='first' is brilliant for text data!
0

You can also use DataFrame.pivot:

new_df = ( # Actual pivoting. df.pivot( index=['meta1', 'meta2'], columns='name', values='data' ) # Remove the column name that pandas adds. .rename_axis(None, axis=1) # Put back the new index as columns. .reset_index() ) print(new_df) 

Output:

 meta1 meta2 n1 n2 0 a g y1 y2 1 b h y3 y4 

I'm using pandas version 1.5.3.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.