Grouping boxplots in seaborn when input is a DataFrame

Question

I want to plot multiple columns of a pandas DataFrame, all grouped by another column, using the groupby option of seaborn.boxplot. There is a nice answer to a similar problem in matplotlib (matplotlib: Group boxplots), but given that seaborn.boxplot comes with a groupby option, I thought this would be much easier to do in seaborn.

Here we go with a reproducible example that fails:

    import seaborn as sns
    import pandas as pd

    df = pd.DataFrame([[2, 4, 5, 6, 1], [4, 5, 6, 7, 2], [5, 4, 5, 5, 1],
                       [10, 4, 7, 8, 2], [9, 3, 4, 6, 2], [3, 3, 4, 4, 1]],
                      columns=['a1', 'a2', 'a3', 'a4', 'b'])

    # display(df)
    #    a1  a2  a3  a4  b
    # 0   2   4   5   6  1
    # 1   4   5   6   7  2
    # 2   5   4   5   5  1
    # 3  10   4   7   8  2
    # 4   9   3   4   6  2
    # 5   3   3   4   4  1

    # Plotting by seaborn
    sns.boxplot(df[['a1','a2', 'a3', 'a4']], groupby=df.b)

What I get completely ignores the groupby option:

[Image: Failed groupby]

If I do this with a single column, however, it works, thanks to another SO question (Seaborn groupby pandas Series):

sns.boxplot(df.a1, groupby=df.b)

[Image: seaborn that does not fail]

So I would like to get all my columns in one plot (all columns come in a similar scale).

EDIT:

The above SO question was edited and now includes a 'not clean' answer to this problem, but it would be nice if someone has a better idea for this problem.

mwaskom · Accepted Answer · 2014-08-13 14:47:35Z

As the other answers note, the boxplot function is limited to plotting a single "layer" of boxplots, and the groupby parameter only has an effect when the input is a Series and you have a second variable you want to use to bin the observations into each box.

However, you can accomplish what I think you're hoping for with the factorplot function (renamed catplot in seaborn 0.9), using kind="box". First, though, you'll have to "melt" the sample dataframe into what is called long-form or "tidy" format, where each column is a variable and each row is an observation:

df_long = pd.melt(df, "b", var_name="a", value_name="c")
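To make the reshaping concrete, here is a minimal sketch that applies the same melt call to the sample frame from the question and prints the shape of the result. Nothing here goes beyond the question's data; the print calls are only there for illustration.

```python
import pandas as pd

# Sample frame from the question
df = pd.DataFrame(
    [[2, 4, 5, 6, 1], [4, 5, 6, 7, 2], [5, 4, 5, 5, 1],
     [10, 4, 7, 8, 2], [9, 3, 4, 6, 2], [3, 3, 4, 4, 1]],
    columns=["a1", "a2", "a3", "a4", "b"],
)

# Keep "b" as the identifier and stack a1..a4 into name/value pairs
df_long = pd.melt(df, id_vars="b", var_name="a", value_name="c")

print(df_long.shape)          # 6 rows x 4 value columns -> 24 rows, 3 columns
print(list(df_long.columns))  # ['b', 'a', 'c']
```

Each original row now appears four times, once per melted column, which is what lets a single hue variable split every box.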

Then it's very simple to plot:

sns.factorplot("a", hue="b", y="c", data=df_long, kind="box")


This gets occasional upvotes, but FWIW nested boxplots have been possible in sns.boxplot since 0.6.

Trenton McKinney · Accepted Answer · 2023-11-11 23:23:18Z

You can directly use sns.boxplot, an axes-level function, or sns.catplot with kind='box', a figure-level function. See "Figure-level vs. axes-level functions" in the seaborn documentation for further details.

sns.catplot has the col and row parameters, which create subplots (facets) split by another variable.
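As a rough sketch of faceting, assuming the sample data from the question, the following uses col="b" instead of hue="b", so each value of b gets its own subplot. The Agg backend and the height and aspect values are illustrative choices, not part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless

import pandas as pd
import seaborn as sns

df = pd.DataFrame(
    [[2, 4, 5, 6, 1], [4, 5, 6, 7, 2], [5, 4, 5, 5, 1],
     [10, 4, 7, 8, 2], [9, 3, 4, 6, 2], [3, 3, 4, 4, 1]],
    columns=["a1", "a2", "a3", "a4", "b"],
)
df_long = pd.melt(df, id_vars="b", var_name="a", value_name="c")

# One subplot per unique value of "b", each showing boxes for a1..a4
g = sns.catplot(kind="box", data=df_long, x="a", y="c", col="b", height=4, aspect=1)

print(g.axes.shape)  # grid of subplots: one row, one column per value of b
```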

The default palette is determined by the type of variable, continuous (numeric) or categorical, passed to hue.

As explained by @mwaskom, you have to melt the sample dataframe into its "long-form" where each column is a variable and each row is an observation.

Tested in python 3.12.0, pandas 2.1.2, matplotlib 3.8.1, seaborn 0.13.0

    df_long = pd.melt(df, "b", var_name="a", value_name="c")

    # display(df_long.head())
    #    b   a   c
    # 0  1  a1   2
    # 1  2  a1   4
    # 2  1  a1   5
    # 3  2  a1  10
    # 4  2  a1   9

`sns.boxplot`

    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(5, 5))
    sns.boxplot(x="a", hue="b", y="c", data=df_long, ax=ax)

    # remove the top and right spines and move the legend outside the axes
    ax.spines[['top', 'right']].set_visible(False)
    sns.move_legend(ax, bbox_to_anchor=(1, 0.5), loc='center left', frameon=False)

`sns.catplot`

Create the same plot as sns.boxplot with fewer lines of code.

g = sns.catplot(kind='box', data=df_long, x='a', y='c', hue='b', height=5, aspect=1)

Resulting Plot

jrjc · Accepted Answer · 2014-08-13 14:35:35Z

The groupby parameter of seaborn's boxplot takes a Series, not a DataFrame, which is why it's not working.

As a workaround, you can do this:

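    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(1, 2, sharey=True)
    # groupby yields (key, sub-frame) pairs: one sub-frame, and one subplot, per value of b
    for i, grp in enumerate(df.filter(regex="a").groupby(by=df.b)):
        sns.boxplot(grp[1], ax=ax[i])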

It gives:

[Image: sns]

Note that df.filter(regex="a") is equivalent to df[['a1','a2', 'a3', 'a4']]:

       a1  a2  a3  a4
    0   2   4   5   6
    1   4   5   6   7
    2   5   4   5   5
    3  10   4   7   8
    4   9   3   4   6
    5   3   3   4   4
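A quick way to check that equivalence on the question's sample frame is sketched below. It holds here only because "b" is the sole column whose name does not contain "a"; with other column names the regex could match more than intended.

```python
import pandas as pd

df = pd.DataFrame(
    [[2, 4, 5, 6, 1], [4, 5, 6, 7, 2], [5, 4, 5, 5, 1],
     [10, 4, 7, 8, 2], [9, 3, 4, 6, 2], [3, 3, 4, 4, 1]],
    columns=["a1", "a2", "a3", "a4", "b"],
)

by_regex = df.filter(regex="a")         # columns whose name matches the regex "a"
by_name = df[["a1", "a2", "a3", "a4"]]  # explicit selection by name

print(by_regex.equals(by_name))  # True
```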

Hope this helps

chrisb · Accepted Answer · 2014-08-13 11:55:26Z

It isn't really any better than the answer you linked, but I think the way to achieve this in seaborn is using the FacetGrid feature, as the groupby parameter is only defined for Series passed to the boxplot function.

Here's some code - the pd.melt is necessary because (as best I can tell) the facet mapping can only take individual columns as parameters, so the data need to be turned into a 'long' format.

    g = sns.FacetGrid(pd.melt(df, id_vars='b'), col='b')
    g.map(sns.boxplot, 'value', 'variable')

faceted seaborn boxplot
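The names 'value' and 'variable' passed to g.map are the default column names that pd.melt produces when var_name and value_name are not given. A small sketch on the question's sample data shows this:

```python
import pandas as pd

df = pd.DataFrame(
    [[2, 4, 5, 6, 1], [4, 5, 6, 7, 2], [5, 4, 5, 5, 1],
     [10, 4, 7, 8, 2], [9, 3, 4, 6, 2], [3, 3, 4, 4, 1]],
    columns=["a1", "a2", "a3", "a4", "b"],
)

# Without var_name/value_name, melt falls back to "variable" and "value"
long_default = pd.melt(df, id_vars="b")
print(list(long_default.columns))  # ['b', 'variable', 'value']

# Rows available to each facet: 3 original rows per value of b, times 4 melted columns
rows_per_facet = long_default.groupby("b").size()
print(rows_per_facet.to_dict())  # {1: 12, 2: 12}
```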

It's actually not necessary to use FacetGrid directly if you want this kind of plot; you can use factorplot here too, with col="b". (This isn't wrong, it's just more work than necessary.)

ouflak · Accepted Answer · 2021-09-22 16:46:44Z

It's not adding a lot to this conversation, but after struggling with this for longer than warranted (the actual clusters are unusable), I thought I would add my implementation as another example. It's got a superimposed scatterplot (because of how annoying my dataset is), shows melt using indices, and some aesthetic tweaks. I hope this is useful for someone.

[Image: output_graph]

Here it is without using column headers (I saw a different thread that wanted to know how to do this using indices):

    from typing import List

    import numpy as np
    import pandas as pd
    from numpy import ndarray
    from pandas import DataFrame

    # dbscan_output and outcome_variable_names come from my own clustering code
    combined_array: ndarray = np.concatenate([dbscan_output.data, dbscan_output.labels.reshape(-1, 1)], axis=1)
    cluster_data_df: DataFrame = DataFrame(combined_array)

    # if you want to use labelled columns:
    column_names: List[str] = list(outcome_variable_names)
    column_names.append('cluster')
    # set_axis returns a new frame (its inplace argument was removed in pandas 2.0)
    cluster_data_df = cluster_data_df.set_axis(column_names, axis='columns')

    graph_data: DataFrame = pd.melt(
        frame=cluster_data_df,
        id_vars=['cluster'],
        # value_vars is an optional param - by default it uses columns except the id vars, but I've included it as an example
        # value_vars=['outcome_var_1', 'outcome_var_2', 'outcome_var_3', 'outcome_var_4', 'outcome_var_5', 'outcome_var_6']
        var_name='psychometric_test',
        value_name='standard deviations from the mean'
    )

The resulting dataframe (rows = sample_n x variable_n (in my case 1626 x 6 = 9756)):

index	psychometric_test	standard deviations from the mean
0	outcome_var_1	-1.276182
1	outcome_var_1	-1.118813
2	outcome_var_1	-1.276182
9754	outcome_var_6	0.892548
9755	outcome_var_6	1.420480

If you want to use indices with melt:

    graph_data: DataFrame = pd.melt(
        frame=cluster_data_df,
        id_vars=cluster_data_df.columns[-1],
        # value_vars=cluster_data_df.columns[:-1],
        var_name='psychometric_test',
        value_name='standard deviations from the mean'
    )
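For anyone who wants to try the positional version without a clustering result at hand, here is a small self-contained sketch. The data and labels arrays are made-up stand-ins for dbscan_output.data and dbscan_output.labels, and the "test" and "score" names are placeholders. The id column is wrapped in a list here, which is an assumption on my part to keep the call explicit.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the clustering output: 4 samples, 2 outcome variables
data = np.array([[0.1, 1.2], [0.3, -0.4], [-1.0, 0.5], [0.7, 0.0]])
labels = np.array([0, 1, 0, -1])

# Without explicit names the columns are the integers 0, 1, 2; the last one holds the label
frame = pd.DataFrame(np.concatenate([data, labels.reshape(-1, 1)], axis=1))

long_df = pd.melt(
    frame,
    id_vars=[frame.columns[-1]],  # last column, selected by position
    var_name="test",
    value_name="score",
)

print(long_df.shape)          # 4 samples x 2 variables -> 8 rows, 3 columns
print(list(long_df.columns))  # [2, 'test', 'score']
```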

And here's the graphing code: (Done with column headings - just note that y-axis=value_name, x-axis = var_name, hue = id_vars):

    # plot graph grouped by cluster
    sns.set_theme(style="ticks")
    # the next two calls were written as fig.set(...) and fig.set_style(...); a matplotlib
    # Figure has neither option, so they are assumed to be the seaborn functions
    sns.set(font_scale=1.2)
    sns.set_style("white")
    fig = plt.figure(figsize=(10, 10))

    # create boxplot
    ax = sns.boxplot(y='standard deviations from the mean', x='psychometric_test', hue='cluster',
                     showfliers=False, data=graph_data)

    # set box alpha
    # (with newer matplotlib/seaborn versions the boxes may live in ax.patches instead of ax.artists)
    for patch in ax.artists:
        r, g, b, a = patch.get_facecolor()
        patch.set_facecolor((r, g, b, .2))

    # create scatterplot
    ax = sns.stripplot(y='standard deviations from the mean', x='psychometric_test', hue='cluster',
                       data=graph_data, dodge=True, alpha=.25, zorder=1)

    # customise legend:
    cluster_n: int = dbscan_output.n_clusters

    ## create list with legend text
    i = 0
    cluster_info: Dict[int, int] = dbscan_output.cluster_sizes  # custom method
    legend_labels: List[str] = []
    while i < cluster_n:
        label: str = f"cluster {i+1}, n = {cluster_info[i]}"
        legend_labels.append(label)
        i += 1
    if -1 in cluster_info.keys():
        cluster_n += 1
        label: str = f"Unclustered, n = {cluster_info[-1]}"
        legend_labels.insert(0, label)

    ## fetch existing handles and legends (each tuple will have 2*cluster number -> 1 for each
    ## boxplot cluster, 1 for each scatterplot cluster, so I will remove the first half)
    handles, labels = ax.get_legend_handles_labels()
    index: int = int(cluster_n*(-1))
    labels = legend_labels
    plt.legend(handles[index:], labels[0:])
    plt.xticks(rotation=45)
    plt.show()

Just a note: Most of my time was spent debugging the melt function. I predominantly got the error "*only integer scalar arrays can be converted to a scalar index with 1D numpy indices array*". My output required me to concatenate my outcome variable value table and the clusters (DBSCAN), and I'd put extra square brackets around the cluster array in the concat method. So I had a column where each value was an invisible List[int], rather than a plain int. It's pretty niche, but maybe it'll help someone.
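As a hedged illustration of the kind of shape check that might catch this earlier: the arrays below are made up, not the actual DBSCAN output, and the sketch does not reproduce the exact error above. It only shows how an extra pair of brackets changes the shape of the label array before concatenation.

```python
import numpy as np

# Hypothetical stand-ins: 4 samples with 2 outcome variables, plus one cluster label per sample
data = np.array([[0.1, 1.2], [0.3, -0.4], [-1.0, 0.5], [0.7, 0.0]])
labels = np.array([0, 1, 0, -1])

column = labels.reshape(-1, 1)  # shape (4, 1): one label per row, ready to append as a column
wrapped = np.array([labels])    # extra brackets -> shape (1, 4): a single row of labels

combined = np.concatenate([data, column], axis=1)

print(column.shape, wrapped.shape, combined.shape)
```

Printing the shapes (and the dtypes of the resulting DataFrame) before calling melt makes a stray nesting level visible straight away.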
