
I would like to plot boxplots for several datasets based on a criterion. Imagine a dataframe similar to the example below:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'Group': [1, 1, 1, 2, 3, 2, 2, 3, 1, 3],
                       'M': np.random.rand(10),
                       'F': np.random.rand(10)})
    df = df[['Group', 'M', 'F']]

       Group         M         F
    0      1  0.465636  0.537723
    1      1  0.560537  0.727238
    2      1  0.268154  0.648927
    3      2  0.722644  0.115550
    4      3  0.586346  0.042896
    5      2  0.562881  0.369686
    6      2  0.395236  0.672477
    7      3  0.577949  0.358801
    8      1  0.764069  0.642724
    9      3  0.731076  0.302369

In this case I have three groups, so I would like to make a boxplot for each group, and for M and F separately, with the groups on the Y axis and the M and F columns colour-coded. This answer is very close to what I want to achieve, but I would prefer something more robust that is applicable to larger dataframes with a greater number of groups. I feel that groupby is the way to go, but I am not familiar with groupby objects and I am failing to even slice them. The desired output would look something like this: [image: desired output]

It looks like someone had the same problem years ago but got no answers :( See "Having a boxplot as a graphical representation of the describe function of groupby".

My questions are:

  1. How do I use groupby to feed the desired data into the boxplot?
  2. What is the correct syntax for the boxplot if I want to control what is displayed rather than just use the default settings? (I don't even know what the defaults are; I find the documentation rather vague.) To be specific, can I have the box cover the mean +/- standard deviation and keep the vertical line at the median value?
  • Did you try some code? What kind of problems / errors did you get? Commented May 24, 2017 at 7:57
  • import matplotlib.pyplot as plt and then df.boxplot(['M','F'],'Group') Commented May 24, 2017 at 8:18
  • This will generate 2 separate plots, one for male and one for female, each split by group. Commented May 24, 2017 at 8:19
  • As you said, this generates separate subplots; it does not plot them together. Plus, it does not address point number 2. But thanks, for a simpler case it is good to know how easily it can be done. Commented May 24, 2017 at 8:29
  • Please try df.boxplot(by='Group',vert=False); it will give you the quartiles on your x axis. It would be difficult to get all the variables in a single plot, because the groupby operation is applied at the same time, but you can get multiple plots, one per variable, each split by the grouping variable. Commented May 24, 2017 at 9:30

2 Answers


I think you should use the Seaborn library, which makes it easy to create this type of customized plot. In your case, I first melted your dataframe to convert it into the proper format and then created the boxplot of your choice.

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Reshape to long format: one row per (Group, sex) value
    dd = pd.melt(df, id_vars=['Group'], value_vars=['M', 'F'], var_name='sex')
    sns.boxplot(y='Group', x='value', data=dd, orient="h", hue='sex')

The plot looks similar to your required plot. [image: resulting plot]
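To make the melt step above more concrete, here is a minimal sketch of what the reshaped frame looks like. The data is random, as in the question, and the seeded generator `rng` is an assumption added only for reproducibility.

```python
import numpy as np
import pandas as pd

# Same layout as the question's example; rng and its seed are illustrative only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Group': [1, 1, 1, 2, 3, 2, 2, 3, 1, 3],
    'M': rng.random(10),
    'F': rng.random(10),
})

# Wide -> long: every M and F measurement becomes its own row,
# with the original column name stored in 'sex' and the number in 'value'.
dd = pd.melt(df, id_vars=['Group'], value_vars=['M', 'F'], var_name='sex')

print(dd.shape)    # 10 rows x 2 value columns -> 20 rows, 3 columns
print(dd.head())
```

Because seaborn groups by the values found in the 'Group' and 'sex' columns, the same two lines should carry over to a dataframe with many more groups without listing them by hand.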


Finally, I found a solution by slightly modifying this answer. It does not use a groupby object, so it is more tedious to prepare the data, but so far it looks like the best solution to me. Here it is:

    import numpy as np
    import matplotlib.pyplot as plt

    # Prepare the data: group it manually and store it in lists
    Groups = [1, 2, 3]
    Columns = df.columns.tolist()[1:]
    print(Columns)

    Mgroups = []
    Fgroups = []
    for g in Groups:
        dfgc = df[df['Group'] == g]
        m = dfgc['M'].dropna()
        f = dfgc['F'].dropna()
        Mgroups.append(m.tolist())
        Fgroups.append(f.tolist())

    fig = plt.figure()
    ax = plt.axes()

    def setBoxColors(bp, cl):
        plt.setp(bp['boxes'], color=cl, linewidth=2.)
        plt.setp(bp['whiskers'], color=cl, linewidth=2.5)
        plt.setp(bp['caps'], color=cl, linewidth=2)
        plt.setp(bp['medians'], color=cl, linewidth=3.5)

    # whis=(0, 100) extends the whiskers to the min/max of the data
    # (older Matplotlib versions spelled this whis='range')
    bpl = plt.boxplot(Mgroups, positions=np.array(range(len(Mgroups))) * 3.0 - 0.4,
                      vert=False, whis=(0, 100), sym='', widths=0.6)
    bpr = plt.boxplot(Fgroups, positions=np.array(range(len(Fgroups))) * 3.0 + 0.4,
                      vert=False, whis=(0, 100), sym='', widths=0.6)
    setBoxColors(bpr, '#D7191C')  # colors are from http://colorbrewer2.org/
    setBoxColors(bpl, '#2C7BB6')

    # draw temporary red and blue lines and use them to create a legend
    plt.plot([], c='#D7191C', label='F')
    plt.plot([], c='#2C7BB6', label='M')
    plt.legend()

    plt.yticks(range(0, len(Groups) * 3, 3), Groups)
    plt.ylim(-3, len(Groups) * 3)
    # plt.xlim(0, 8)
    plt.show()

Resulting plot.
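For larger numbers of groups, the manual preparation step could probably be replaced by iterating over a groupby object. The following is only a sketch of that idea: the names `grouped`, `groups` and `data` are illustrative, and the seeded random data stands in for the question's dataframe.

```python
import numpy as np
import pandas as pd

# Stand-in for the question's dataframe; the seed is illustrative only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Group': [1, 1, 1, 2, 3, 2, 2, 3, 1, 3],
    'M': rng.random(10),
    'F': rng.random(10),
})

grouped = df.groupby('Group')

# Iterating over a groupby object yields (key, sub-DataFrame) pairs,
# sorted by key by default, so no hard-coded list of groups is needed.
groups = []
data = {'M': [], 'F': []}
for key, sub in grouped:
    groups.append(key)
    for col in data:
        data[col].append(sub[col].dropna().tolist())

# A single group can also be sliced out directly by its key.
group_2 = grouped.get_group(2)

print(groups)
print([len(values) for values in data['M']])
```

Here `data['M']` and `data['F']` play the role of Mgroups and Fgroups, and `groups` the role of Groups, so they can be passed to plt.boxplot in the same way.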

The result looks mostly like what I wanted (as far as I have been able to find, the box always ranges from first to third quartile, so it is not possible to set it to +/- standard deviation). So I am a bit disappointed there is no one-line solution, but I am glad it is possible. However, for hundreds of groups this would not be good enough...
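The quartile limitation applies to plt.boxplot, which computes the box statistics itself. One possible workaround, sketched here and not tested beyond this toy data, is Matplotlib's lower-level Axes.bxp, which draws boxes from precomputed statistics. The helper `mean_std_stats` is hypothetical: it places mean -/+ standard deviation in the 'q1' and 'q3' slots while keeping the median line, and uses min/max for the whiskers.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in for the question's dataframe; the seed is illustrative only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Group': [1, 1, 1, 2, 3, 2, 2, 3, 1, 3],
    'M': rng.random(10),
    'F': rng.random(10),
})

def mean_std_stats(values):
    """Hypothetical helper: stats dict for Axes.bxp with box = mean +/- std."""
    v = np.asarray(values, dtype=float)
    mean, std = v.mean(), v.std(ddof=1)  # sample standard deviation
    return {
        'med': float(np.median(v)),   # line inside the box stays at the median
        'q1': float(mean - std),      # box edge, NOT the first quartile
        'q3': float(mean + std),      # box edge, NOT the third quartile
        'whislo': float(v.min()),     # whiskers reach the min and max; for small
        'whishi': float(v.max()),     # samples they may end inside the box
        'fliers': [],
    }

colors = {'M': '#2C7BB6', 'F': '#D7191C'}
offsets = {'M': -0.4, 'F': 0.4}
keys = [key for key, _ in df.groupby('Group')]

fig, ax = plt.subplots()
for col in ('M', 'F'):
    stats = [mean_std_stats(sub[col].dropna()) for _, sub in df.groupby('Group')]
    positions = [3.0 * i + offsets[col] for i in range(len(stats))]
    # vert=False as in the answer above; newer Matplotlib releases
    # offer orientation='horizontal' for the same purpose.
    bp = ax.bxp(stats, positions=positions, widths=0.6, vert=False,
                showfliers=False, manage_ticks=False)
    for part in ('boxes', 'whiskers', 'caps', 'medians'):
        plt.setp(bp[part], color=colors[col], linewidth=2)
    ax.plot([], c=colors[col], label=col)  # empty line used only for the legend

ax.set_yticks([3.0 * i for i in range(len(keys))])
ax.set_yticklabels([str(k) for k in keys])
ax.set_ylim(-3, len(keys) * 3)
ax.legend()
# Call plt.show() or fig.savefig(...) to view the result.
```

Since the boxes no longer show quartiles, a plot like this should be labelled clearly so that readers do not mistake it for a standard boxplot.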
