26

I intend to plot multiple columns of a pandas dataframe, all grouped by another column, using the groupby option of seaborn.boxplot. There is a nice answer for a similar problem in matplotlib ("matplotlib: Group boxplots"), but given that seaborn.boxplot comes with a groupby option, I thought it could be much easier to do this in seaborn.

Here we go with a reproducible example that fails:

import seaborn as sns
import pandas as pd

df = pd.DataFrame([[2, 4, 5, 6, 1], [4, 5, 6, 7, 2], [5, 4, 5, 5, 1], [10, 4, 7, 8, 2], [9, 3, 4, 6, 2], [3, 3, 4, 4, 1]], columns=['a1', 'a2', 'a3', 'a4', 'b'])

# display(df)
   a1  a2  a3  a4  b
0   2   4   5   6  1
1   4   5   6   7  2
2   5   4   5   5  1
3  10   4   7   8  2
4   9   3   4   6  2
5   3   3   4   4  1

# Plotting by seaborn
sns.boxplot(df[['a1','a2', 'a3', 'a4']], groupby=df.b)

What I get is something that completely ignores the groupby option:

Failed groupby

Whereas if I do this with one column, it works, thanks to another SO question ("Seaborn groupby pandas Series"):

sns.boxplot(df.a1, groupby=df.b) 

seaborn that does not fail

So I would like to get all my columns in one plot (all columns come in a similar scale).

EDIT:

The above SO question was edited and now includes a 'not clean' answer to this problem, but it would be nice if someone has a better idea for this problem.

5 Answers

27

As the other answers note, the boxplot function is limited to plotting a single "layer" of boxplots, and the groupby parameter only has an effect when the input is a Series and you have a second variable you want to use to bin the observations into each box.

However, you can accomplish what I think you're hoping for with the factorplot function, using kind="box". But, you'll first have to "melt" the sample dataframe into what is called long-form or "tidy" format where each column is a variable and each row is an observation:

df_long = pd.melt(df, "b", var_name="a", value_name="c") 
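To make the reshaping concrete, here is a minimal sketch that applies the same pd.melt call to the dataframe from the question and checks the result. The names n_rows and levels are only illustrative; they are not part of the original answer.

```python
import pandas as pd

# Wide-form sample data from the question: four measurement columns plus group column "b"
df = pd.DataFrame(
    [[2, 4, 5, 6, 1], [4, 5, 6, 7, 2], [5, 4, 5, 5, 1],
     [10, 4, 7, 8, 2], [9, 3, 4, 6, 2], [3, 3, 4, 4, 1]],
    columns=["a1", "a2", "a3", "a4", "b"],
)

# Long form: "b" is kept as an identifier, the former column names go into "a",
# and the cell values go into "c"
df_long = pd.melt(df, id_vars="b", var_name="a", value_name="c")

n_rows = len(df_long)                      # 6 rows x 4 melted columns
levels = sorted(df_long["a"].unique())     # the former column names
print(df_long.head())
```

Each original row contributes one long-form row per melted column, so the long table has 6 x 4 = 24 rows, and "a" can then serve as the x variable with "b" as the hue.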

Then it's very simple to plot:

sns.factorplot("a", hue="b", y="c", data=df_long, kind="box") 


1 Comment

This gets occasional upvotes, but FWIW nested boxplots have been possible in sns.boxplot since 0.6.
11

You can directly use sns.boxplot, an axes-level function, or sns.catplot with kind='box', a figure-level function. See "Figure-level vs. axes-level functions" for further details.

sns.catplot has the col and row parameters, which create subplots / facets split by another variable.
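As a rough illustration of the col parameter, the sketch below facets the question's data by b instead of using it as hue. It assumes seaborn and matplotlib are installed; the Agg backend and the variable n_facets are my own additions so the example can run without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed

import pandas as pd
import seaborn as sns

df = pd.DataFrame(
    [[2, 4, 5, 6, 1], [4, 5, 6, 7, 2], [5, 4, 5, 5, 1],
     [10, 4, 7, 8, 2], [9, 3, 4, 6, 2], [3, 3, 4, 4, 1]],
    columns=["a1", "a2", "a3", "a4", "b"],
)
df_long = pd.melt(df, "b", var_name="a", value_name="c")

# One subplot per value of "b"; each subplot holds one box per former column
g = sns.catplot(kind="box", data=df_long, x="a", y="c", col="b", height=4, aspect=1)

n_facets = g.axes.size  # number of subplots in the returned FacetGrid
```

With hue="b" the groups sit side by side in one Axes; with col="b" each group gets its own panel, which can be easier to read when there are many boxes.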

The default palette is determined by the type of variable, continuous (numeric) or categorical, passed to hue.

As explained by @mwaskom, you have to melt the sample dataframe into its "long-form" where each column is a variable and each row is an observation.

Tested in python 3.12.0, pandas 2.1.2, matplotlib 3.8.1, seaborn 0.13.0

df_long = pd.melt(df, "b", var_name="a", value_name="c")

# display(df_long.head())
   b   a   c
0  1  a1   2
1  2  a1   4
2  1  a1   5
3  2  a1  10
4  2  a1   9

sns.boxplot

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(5, 5))
sns.boxplot(x="a", hue="b", y="c", data=df_long, ax=ax)
# hide the top and right spines and move the legend outside the plot area
ax.spines[['top', 'right']].set_visible(False)
sns.move_legend(ax, bbox_to_anchor=(1, 0.5), loc='center left', frameon=False)

sns.catplot

Create the same plot as sns.boxplot with fewer lines of code.

g = sns.catplot(kind='box', data=df_long, x='a', y='c', hue='b', height=5, aspect=1) 

Resulting Plot


8

The groupby parameter of sns.boxplot takes a Series, not a DataFrame; that's why it's not working.

As a workaround, you can do this:

fig, ax = plt.subplots(1, 2, sharey=True)
for i, grp in enumerate(df.filter(regex="a").groupby(by=df.b)):
    sns.boxplot(grp[1], ax=ax[i])

It gives:

sns

Note that, for this dataframe, df.filter(regex="a") is equivalent to df[['a1','a2', 'a3', 'a4']]:

   a1  a2  a3  a4
0   2   4   5   6
1   4   5   6   7
2   5   4   5   5
3  10   4   7   8
4   9   3   4   6
5   3   3   4   4
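The equivalence can be checked quickly. The sketch below uses the question's dataframe; the anchored pattern is my own suggestion, not part of the original answer, for cases where other column names might also contain the letter "a".

```python
import pandas as pd

df = pd.DataFrame(
    [[2, 4, 5, 6, 1], [4, 5, 6, 7, 2], [5, 4, 5, 5, 1],
     [10, 4, 7, 8, 2], [9, 3, 4, 6, 2], [3, 3, 4, 4, 1]],
    columns=["a1", "a2", "a3", "a4", "b"],
)

by_regex = df.filter(regex="a")              # every column whose name contains "a"
by_name = df[["a1", "a2", "a3", "a4"]]       # explicit column selection
same = by_regex.equals(by_name)

# A stricter pattern: only names made of "a" followed by digits
anchored = df.filter(regex=r"^a\d+$")
```

The plain regex="a" only works here because "b" is the sole column without an "a" in its name; the anchored pattern is a bit more defensive.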

Hope this helps


5

It isn't really any better than the answer you linked, but I think the way to achieve this in seaborn is using the FacetGrid feature, as the groupby parameter is only defined for Series passed to the boxplot function.

Here's some code - the pd.melt is necessary because (as best I can tell) the facet mapping can only take individual columns as parameters, so the data need to be turned into a 'long' format.

g = sns.FacetGrid(pd.melt(df, id_vars='b'), col='b')
g.map(sns.boxplot, 'value', 'variable')
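The 'value' and 'variable' names passed to g.map come from pd.melt's defaults when var_name and value_name are not given. A small sketch with the question's dataframe (the names melted and cols are only illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    [[2, 4, 5, 6, 1], [4, 5, 6, 7, 2], [5, 4, 5, 5, 1],
     [10, 4, 7, 8, 2], [9, 3, 4, 6, 2], [3, 3, 4, 4, 1]],
    columns=["a1", "a2", "a3", "a4", "b"],
)

# Without var_name / value_name, melt falls back to "variable" and "value"
melted = pd.melt(df, id_vars="b")
cols = list(melted.columns)
print(cols)
```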

faceted seaborn boxplot

1 Comment

It's actually not necessary to use FacetGrid directly if you want this kind of plot; you can use factorplot here too, with col="b". (This isn't wrong, it's just more work than necessary.)
1

It's not adding a lot to this conversation, but after struggling with this for longer than warranted (the actual clusters are unusable), I thought I would add my implementation as another example. It's got a superimposed scatterplot (because of how annoying my dataset is), shows melt using indices, and some aesthetic tweaks. I hope this is useful for someone.

output_graph

Here it is without using column headers (I saw a different thread that wanted to know how to do this using indices):

import numpy as np
import pandas as pd
from numpy import ndarray
from pandas import DataFrame
from typing import List

# dbscan_output and outcome_variable_names come from my own clustering code
combined_array: ndarray = np.concatenate([dbscan_output.data, dbscan_output.labels.reshape(-1, 1)], axis=1)
cluster_data_df: DataFrame = DataFrame(combined_array)

If you want to use labelled columns:

column_names: List[str] = list(outcome_variable_names)
column_names.append('cluster')
# newer pandas versions no longer accept inplace=True here, so reassign the result
cluster_data_df = cluster_data_df.set_axis(column_names, axis='columns')

graph_data: DataFrame = pd.melt(
    frame=cluster_data_df,
    id_vars=['cluster'],
    # value_vars is an optional param - by default it uses columns except the id vars, but I've included it as an example
    # value_vars=['outcome_var_1', 'outcome_var_2', 'outcome_var_3', 'outcome_var_4', 'outcome_var_5', 'outcome_var_6'],
    var_name='psychometric_test',
    value_name='standard deviations from the mean'
)

The resulting dataframe (rows = sample_n x variable_n (in my case 1626 x 6 = 9756)):

index cluster psychometric_test standard deviations from the mean
0 0.0 outcome_var_1 -1.276182
1 0.0 outcome_var_1 -1.118813
2 0.0 outcome_var_1 -1.276182
9754 0.0 outcome_var_6 0.892548
9755 0.0 outcome_var_6 1.420480

If you want to use indices with melt:

graph_data: DataFrame = pd.melt(
    frame=cluster_data_df,
    id_vars=cluster_data_df.columns[-1],
    # value_vars=cluster_data_df.columns[:-1],
    var_name='psychometric_test',
    value_name='standard deviations from the mean'
)

And here's the graphing code (done with column headings; just note that y-axis = value_name, x-axis = var_name, hue = id_vars):

import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List

# plot graph grouped by cluster
sns.set_theme(style="ticks")
fig = plt.figure(figsize=(10, 10))
# font scale and style are seaborn-level settings, not Figure methods
sns.set(font_scale=1.2)
sns.set_style("white")

# create boxplot
ax = sns.boxplot(y='standard deviations from the mean', x='psychometric_test', hue='cluster', showfliers=False, data=graph_data)

# set box alpha (depending on the matplotlib version, the boxes may be in ax.patches rather than ax.artists)
for patch in ax.artists:
    r, g, b, a = patch.get_facecolor()
    patch.set_facecolor((r, g, b, .2))

# create scatterplot
ax = sns.stripplot(y='standard deviations from the mean', x='psychometric_test', hue='cluster', data=graph_data, dodge=True, alpha=.25, zorder=1)

# customise legend:
cluster_n: int = dbscan_output.n_clusters

## create list with legend text
i = 0
cluster_info: Dict[int, int] = dbscan_output.cluster_sizes  # custom method
legend_labels: List[str] = []
while i < cluster_n:
    label: str = f"cluster {i+1}, n = {cluster_info[i]}"
    legend_labels.append(label)
    i += 1
if -1 in cluster_info.keys():
    cluster_n += 1
    label: str = f"Unclustered, n = {cluster_info[-1]}"
    legend_labels.insert(0, label)

## fetch existing handles and legends (each tuple will have 2*cluster number -> 1 for each boxplot cluster, 1 for each scatterplot cluster, so I will remove the first half)
handles, labels = ax.get_legend_handles_labels()
index: int = int(cluster_n*(-1))
labels = legend_labels
plt.legend(handles[index:], labels[0:])
plt.xticks(rotation=45)
plt.show()

Just a note: most of my time was spent debugging the melt function. I predominantly got the error "only integer scalar arrays can be converted to a scalar index with 1D numpy indices array". My output required me to concatenate my outcome variable value table and the clusters (DBSCAN), and I'd put extra square brackets around the cluster array in the concat method. So I had a column where each value was an invisible List[int], rather than a plain int. It's pretty niche, but maybe it'll help someone.
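The sketch below does not reproduce that exact error; it only shows, on made-up data, how an extra pair of brackets changes the shape of the label array, and one way to sanity-check the combined table before melting. The arrays and the helper check_table are hypothetical stand-ins, not my real DBSCAN output.

```python
import numpy as np

# Hypothetical stand-ins for the outcome table and the cluster labels
data = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
labels = np.array([0, 0, -1])

column = labels.reshape(-1, 1)   # shape (3, 1): one label per sample row
wrapped = np.array([labels])     # extra brackets give shape (1, 3) instead

combined = np.concatenate([data, column], axis=1)


def check_table(arr, n_samples, n_columns):
    """Return True if arr is a plain 2-D numeric table of the expected shape."""
    return (
        arr.ndim == 2
        and arr.shape == (n_samples, n_columns)
        and np.issubdtype(arr.dtype, np.number)
    )


ok = check_table(combined, n_samples=3, n_columns=3)
```

Running a check like this right after the concatenation would probably have pointed me at the stray brackets much sooner than the melt error did.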
