How to create boxplots by group for all dataframe columns

Question

I have a dataframe with 150 columns and 800 rows. Each row represents a sample, which belongs to one of 5 classes. Therefore all samples are pre-classified. I need to create 150 boxplot charts, one for each column (variable), showing the distribution of the data between the classes, for that variable.

I managed to write code that generates the graphs, but I have to adjust each of the 150 lines by hand to indicate the location of the graph, which follows the sequence [0,0], [0,1], [0,2], [1,0], [1,1], [1,2], etc., as well as the y variable, which could come from a list, but I don't know how to do this.

Below is an example of what it looks like. I did the first 9 by hand, but doing the remaining 141 would be a lot of work. It should be possible to automate this, I think, but I don't know how. Does anyone have an idea?

    fig, axes = plt.subplots(3, 3, figsize=(18, 12))
    fig.suptitle('SAPIENS BOXPLOTS')
    sns.boxplot(ax=axes[0, 0], data=sapiens, x='classe', y='meanB0')
    sns.boxplot(ax=axes[0, 1], data=sapiens, x='classe', y='meanB1')
    sns.boxplot(ax=axes[0, 2], data=sapiens, x='classe', y='meanB2')
    sns.boxplot(ax=axes[1, 0], data=sapiens, x='classe', y='meanB3')
    sns.boxplot(ax=axes[1, 1], data=sapiens, x='classe', y='meanB4')
    sns.boxplot(ax=axes[1, 2], data=sapiens, x='classe', y='varB0')
    sns.boxplot(ax=axes[2, 0], data=sapiens, x='classe', y='varB1')
    sns.boxplot(ax=axes[2, 1], data=sapiens, x='classe', y='varB2')
    sns.boxplot(ax=axes[2, 2], data=sapiens, x='classe', y='varB3')

[Image: BOXPLOTS_SAPIENS]

Trenton McKinney · Accepted Answer · 2021-06-09 01:43:43Z

Use seaborn.catplot with kind='box'
This requires converting the data from a wide to tidy (long) format using pandas.DataFrame.melt, as shown below. pandas.DataFrame.stack can also be used.
Tested with pandas v1.2.4, matplotlib v3.4.2, and seaborn v0.11.1
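As a rough illustration of the wide-to-long conversion mentioned above, the sketch below reshapes a tiny frame with both .melt and .stack and ends up with the same rows. The frame df and its values are made up for the example, and the level name 'row' is only a placeholder for the original row index.

```python
import pandas as pd

# hypothetical miniature frame: one class column plus two value columns
df = pd.DataFrame({
    'classe': [1, 2, 1],
    'mB0': [0.1, 0.2, 0.3],
    'mB1': [1.0, 2.0, 3.0],
})

# melt: 'classe' is kept as an identifier, the other columns become rows
long_melt = df.melt(id_vars='classe')

# stack: move 'classe' into the index next to the row number,
# stack the remaining columns, then name the index levels
stacked = df.set_index('classe', append=True).stack()
stacked.index.names = ['row', 'classe', 'variable']

# turn the index levels back into columns and drop the row number
long_stack = stacked.reset_index(name='value').drop(columns='row')
```

Both results have the columns classe, variable and value; only the row order differs (melt groups by variable, stack groups by original row).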

Imports & Test DataFrame

    import pandas as pd
    import seaborn as sns
    import numpy as np  # for sample data

    # set seed for reproducibility
    np.random.seed(1)

    # create arrays of random sample data
    cl = np.random.choice(range(1, 6), size=(100, 1))
    d = np.random.random_sample(size=(100, 6))

    # combine the two arrays
    data = np.concatenate([cl, d], axis=1)

    # create a dataframe
    sapiens = pd.DataFrame(data, columns=['classe', 'mB0', 'mB1', 'mB2', 'vB0', 'vB1', 'vB2'])

First rows of sapiens:

       classe       mB0       mB1       mB2       vB0       vB1       vB2
    0     4.0  0.647749  0.353939  0.763233  0.356532  0.752788  0.881342
    1     5.0  0.011669  0.498109  0.073792  0.786951  0.064067  0.355310
    2     1.0  0.941837  0.379803  0.762920  0.771595  0.301360  0.772739

Melt and Plot

If there are extra columns that don't need to be plotted, some options are:
- use the value_vars parameter in .melt() to specify the columns to use.
  - value_vars=['mB0', 'mB1', 'mB2', 'vB0', 'vB1', 'vB2']
- use pandas.DataFrame.loc or pandas.DataFrame.iloc to select the desired columns before using .melt.
- use pandas.DataFrame.drop to remove unnecessary columns before using .melt.

For data that needs to be scaled differently, use the sharey=False parameter:
- sns.catplot(..., sharey=False)
- However, the issue with this is that it visually obscures the difference between the distributions, because each subplot gets its own y-axis range.
  - Alternatively, try p.set(yscale='log') or p.set(yscale='symlog') after the line creating the plot.

p.set_xticklabels(visible=True) should work to show xtick labels on all axes, but it adds labels to both the top and bottom, so an alternate option is provided in the code below.

    # convert from wide format to tidy format
    sm = sapiens.melt(id_vars='classe')

First rows of sm:

       classe variable     value
    0     4.0      mB0  0.647749
    1     5.0      mB0  0.011669
    2     1.0      mB0  0.941837
    3     2.0      mB0  0.152930
    4     4.0      mB0  0.467393

    # plot
    p = sns.catplot(kind='box', data=sm, x='classe', y='value', col='variable', col_wrap=3, height=4)

    # add figure level title
    p.fig.subplots_adjust(top=0.9)
    p.fig.suptitle('Sapiens', size=16)

    # enable tick labels for xticks on all axes
    for ax in p.axes.flat:
        ax.tick_params(labelbottom=True)

    p.tight_layout()
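The value_vars and sharey=False options described in this answer could be combined roughly as follows. This is only a sketch under assumptions: the frame df and the column names small, big and extra are invented for illustration, and the Agg backend line is there just so the snippet runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; remove in a notebook or interactive session
import numpy as np
import pandas as pd
import seaborn as sns

# hypothetical data: 'extra' should not be plotted, 'big' has a much larger scale
rng = np.random.default_rng(1)
df = pd.DataFrame({
    'classe': rng.integers(1, 6, size=100),
    'small': rng.random(100),
    'big': rng.random(100) * 1000,
    'extra': rng.random(100),
})

# value_vars restricts the melt to the columns that should be plotted
sm = df.melt(id_vars='classe', value_vars=['small', 'big'])

# sharey=False gives each subplot its own y-axis range
p = sns.catplot(kind='box', data=sm, x='classe', y='value',
                col='variable', col_wrap=2, height=4, sharey=False)
```

With a shared y-axis the boxes for small would be squashed flat next to big; the trade-off, as noted in the answer, is that the subplots can no longer be compared by eye on a common scale.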

Gusti Adli · Accepted Answer · 2021-05-23 19:00:25Z

First, select the columns of sapiens that will be the y variable for each boxplot. Assuming that your first column is classe and you want to plot every column after it, this is how you do it:

    # get y values
    y_labels = sapiens.columns[1:]

Next, decide on figsize, nrows, and ncols for plt.subplots, and finally draw the plots in a loop.

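    import math
    import matplotlib.pyplot as plt
    import seaborn as sns

    # calculate figure size
    ncols = 3
    nrows = math.ceil(len(y_labels) / ncols)
    figsize = (ncols * 6, nrows * 4)

    # assign fig and axes; squeeze=False keeps axes 2D even for a single row
    fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=figsize, squeeze=False)
    fig.suptitle('SAPIENS BOXPLOTS')

    # draw one boxplot per column; zip stops once y_labels is exhausted
    for ax, y_label in zip(axes.flat, y_labels):
        sns.boxplot(ax=ax, data=sapiens, x='classe', y=y_label)

    # remove the unused axes in the last row, if any
    for ax in axes.ravel()[len(y_labels):]:
        ax.remove()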

a_guest · Accepted Answer · 2021-05-23 18:36:03Z

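You can use a loop with divmod to determine the axes. divmod(i, n_cols) returns the (row, column) pair for the i-th plot, where n_cols is the number of columns in the subplot grid: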

    for i, y in enumerate(y_labels):
        sns.boxplot(ax=axes[divmod(i, n_cols)], data=sapiens, x='classe', y=y)
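For completeness, here is a self-contained sketch of how the divmod loop might fit into a full script. The sample frame, the n_rows calculation, squeeze=False, and the hiding of unused grid cells are assumptions added for the example and are not part of the answer above; the Agg backend line is only there so the snippet runs without a display.

```python
import math
import matplotlib
matplotlib.use('Agg')  # headless backend; remove in a notebook or interactive session
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# hypothetical sample data: a class column followed by four value columns
rng = np.random.default_rng(0)
sapiens = pd.DataFrame(rng.random((50, 4)), columns=['mB0', 'mB1', 'vB0', 'vB1'])
sapiens.insert(0, 'classe', rng.integers(1, 6, size=50))

y_labels = sapiens.columns[1:]
n_cols = 3
n_rows = math.ceil(len(y_labels) / n_cols)

# squeeze=False keeps axes two-dimensional even when n_rows == 1
fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols * 6, n_rows * 4), squeeze=False)

for i, y in enumerate(y_labels):
    row, col = divmod(i, n_cols)  # e.g. i=4 with n_cols=3 gives (1, 1)
    sns.boxplot(ax=axes[row, col], data=sapiens, x='classe', y=y)

# hide grid cells that have no column to show
for j in range(len(y_labels), n_rows * n_cols):
    axes[divmod(j, n_cols)].set_visible(False)
```

Indexing axes with the tuple returned by divmod works because axes is a 2D NumPy array, which is why squeeze=False is used here.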
