I am trying to figure out how to use the Pandas to find the number of unique values in a DataFrame that has been grouped by two of its columns. My sample dataset looks like this:
df = pd.DataFrame( {"Header1" : [0,1,2,3,0,1,2,3], "Header2" : [0,1,0,1,0,1,0,1], "values" : [1,2,3,4,1,3,2,1]} ) I would love to be able to transform this to return a DataFrame that looks like this
output = pd.DataFrame( {"Header1" : [0,1,2,3], "Header2" : [0,1,0,1], "unique values" : [1,2,2,2]} ) So far what I have tried is using groupby and nunique:
pd_series = df.groupby(['Header1', 'Header2'])['values'].nunique() This returns the right answer but in a multi-indexed series data format that is very tricky to convert to a usable DataFrame. I've spent quite a lot of time trying to figure out how to correctly format the output with no luck. Instead of generating a DataFrame with the correct set of columns, pd_series.to_frame() produces a DataFrame with a single column named "values" with one row that contains the entire series object.
So far I am resorting to copy-pasting the results from nunique() into a new DataFrame. There must be a better way. Does anyone have any suggestions for how to do better with this?
as_index=False, i.e.df.groupby(['Header1', 'Header2'], as_index=False)['values'].nunique(), no?nameattribute of Series.reset_index as in this answer or this answernew_df = df.groupby(['Header1', 'Header2'])['values'].nunique().reset_index(name="unique values")