
I'm trying to make a nested boxplot like in this SO-answer for matplotlib, but I have trouble figuring out how to create my dataframe.

The goal is a sensitivity analysis of a PCA model representing object positions (in 3D): I want to see how well the model can represent an arch-like distribution, depending on the number of PCA components I use.

So I have an array of shape (n_pca_components, n_samples, n_objects) containing the distances of the objects to their 'ideal' position on an arch. What I am able to boxplot is this (example showing random data):

[Image: non-nested boxplot]

This is, I assume, an aggregated boxplot (statistics gathered over the first two axes of my array). I want to create a boxplot with the same x- and y-axes, but for each 'obj_..' I want one box for each value along the first axis of my data (n_pca_components), i.e. something like this (where the days correspond to my 'obj_i's, 'total_bill' to my stored distances, and 'smoker' to each entry along the first axis of my array):

[Image: nested boxplot]

I read around but got lost in the concepts of pandas' multi-indexing, groupby, (un)stack, reset_index, ... All the examples I see have a different data structure, and I think that's where the problem lies: I haven't yet made the mental 'click' and am thinking in the wrong data structures.

What I have so far is this (using random/example data):

    import numpy as np
    import pandas as pd
    import seaborn as sns

    # Let's say I want to make this analysis for using 3, 6, 9, 12, 15 PCA components
    n_pca_components = 5
    n_objects = 14   # 14 objects per sample
    n_samples = 100  # 100 samples

    # Create random data
    mses = np.random.rand(n_pca_components, n_samples, n_objects)  # Simulated errors

    # Create column names
    n_comps = [f'{(i+1) * 3}' for i in range(n_pca_components)]
    object_ids = [f'obj_{i}' for i in range(n_objects)]
    samples = [f'sample_{i}' for i in range(n_samples)]

    # Create pandas dataframe: one row per (n_comps, sample), one column per object
    mses_pd = mses.reshape(-1, n_objects)
    midx = pd.MultiIndex.from_product([n_comps, samples], names=['n_comps', 'samples'])
    mses_frame = pd.DataFrame(data=mses_pd, index=midx, columns=object_ids)

    # Make a nested boxplot with `object_ids` on the 'large' X-axis and `n_comps`
    # on each 'nested' X-axis; and the box-statistics about the mses stored in
    # `mses_frame` on the y-axis.

    # Things I tried (yes, I'm a complete pandas-newbie). I've been reading a lot
    # of SO-posts and documentation but cannot seem to figure out how to do what I want.
    sns.boxplot(data=mses_frame, hue='n_comps')
    # ValueError: Cannot use `hue` without `x` and `y`
    sns.boxplot(data=mses_frame, hue='n_comps', x='object_ids')
    # ValueError: Could not interpret input 'object_ids'
    sns.boxplot(data=mses_frame, hue='n_comps', x=object_ids)
    # ValueError: Could not interpret input 'n_comps'
    sns.boxplot(data=mses_frame, hue=n_comps, x=object_ids)
    # ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

1 Answer


Is this what you want?

[Image: resulting boxplot]

While I think seaborn can handle wide data, I personally find it easier to work with "tidy data" (or long data). To convert your dataframe from "wide" to "long" you can use DataFrame.melt; pass ignore_index=False so that your existing (n_comps, samples) index is preserved.
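To make the wide-to-long step concrete, here is a minimal sketch on a toy frame. The values and the names `wide` and `long` are made up for illustration; only the index layout mirrors the question's `mses_frame`.

```python
import pandas as pd

# Toy wide frame: one row per (n_comps, sample), one column per object.
idx = pd.MultiIndex.from_product(
    [["3", "6"], ["sample_0", "sample_1"]], names=["n_comps", "samples"]
)
wide = pd.DataFrame(
    {"obj_0": [0.1, 0.2, 0.3, 0.4], "obj_1": [0.5, 0.6, 0.7, 0.8]}, index=idx
)

# ignore_index=False keeps the (n_comps, samples) index on every melted row;
# the old column labels end up in a column called 'variable'.
long = wide.melt(ignore_index=False)

print(long.shape)          # (8, 2): 4 rows x 2 object columns -> 8 rows
print(list(long.columns))  # ['variable', 'value']
```

Each original cell becomes its own row, which is the layout seaborn's `x`/`y`/`hue` arguments expect.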

So

    >>> mses_frame.melt(ignore_index=False)
                      variable     value
    n_comps samples
    3       sample_0     obj_0  0.424960
            sample_1     obj_0  0.758884
            sample_2     obj_0  0.408663
            sample_3     obj_0  0.440811
            sample_4     obj_0  0.112798
    ...                    ...       ...
    15      sample_95   obj_13  0.172044
            sample_96   obj_13  0.381045
            sample_97   obj_13  0.364024
            sample_98   obj_13  0.737742
            sample_99   obj_13  0.762252

    [7000 rows x 2 columns]

Again, seaborn can probably work with this somehow (maybe someone else can comment on this), but I find it easier to reset the index so that your multi-index levels become columns:

    >>> mses_frame.melt(ignore_index=False).reset_index()
          n_comps    samples  variable     value
    0           3   sample_0     obj_0  0.424960
    1           3   sample_1     obj_0  0.758884
    2           3   sample_2     obj_0  0.408663
    3           3   sample_3     obj_0  0.440811
    4           3   sample_4     obj_0  0.112798
    ...       ...        ...       ...       ...
    6995       15  sample_95    obj_13  0.172044
    6996       15  sample_96    obj_13  0.381045
    6997       15  sample_97    obj_13  0.364024
    6998       15  sample_98    obj_13  0.737742
    6999       15  sample_99    obj_13  0.762252

    [7000 rows x 4 columns]
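As an aside, since the question mentions thinking in the wrong data structures: a long table with the same content could probably also be built straight from the 3D array, skipping the wide frame entirely. This is only a sketch, assuming the array is C-ordered with axes (n_pca_components, n_samples, n_objects); the column names `object_id` and `mse` and the name `long_frame` are my own choice, not something from the question.

```python
import numpy as np
import pandas as pd

n_pca_components, n_samples, n_objects = 5, 100, 14
rng = np.random.default_rng(0)
mses = rng.random((n_pca_components, n_samples, n_objects))  # stand-in data

n_comps = [f"{(i + 1) * 3}" for i in range(n_pca_components)]
samples = [f"sample_{i}" for i in range(n_samples)]
object_ids = [f"obj_{i}" for i in range(n_objects)]

# from_product varies the last level fastest, which is the same order in
# which ravel() walks a C-ordered array, so labels and values line up.
midx = pd.MultiIndex.from_product(
    [n_comps, samples, object_ids],
    names=["n_comps", "samples", "object_id"],
)

# One row per array element; reset_index turns the three levels into columns.
long_frame = pd.Series(mses.ravel(), index=midx, name="mse").reset_index()
```

The row order differs from the melted frame (objects vary fastest here), but that does not matter for plotting.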

Now you can decide what you want to plot. I think you are saying you want:

    sns.boxplot(x="variable", y="value", hue="n_comps",
                data=mses_frame.melt(ignore_index=False).reset_index())
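Putting the pieces together, a self-contained version might look like the sketch below. It uses random stand-in data in the question's layout. The `var_name`/`value_name` arguments to `melt`, the column names `object_id` and `mse`, and the figure size are my additions, there only to give the axes more descriptive labels.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Random stand-in data: 5 component counts x 100 samples x 14 objects.
rng = np.random.default_rng(0)
n_comps = [str(3 * (i + 1)) for i in range(5)]
samples = [f"sample_{i}" for i in range(100)]
object_ids = [f"obj_{i}" for i in range(14)]
midx = pd.MultiIndex.from_product([n_comps, samples], names=["n_comps", "samples"])
mses_frame = pd.DataFrame(rng.random((500, 14)), index=midx, columns=object_ids)

# var_name / value_name replace the default 'variable' / 'value' column names.
long_frame = mses_frame.melt(
    var_name="object_id", value_name="mse", ignore_index=False
).reset_index()

# One group per object on the x-axis, one box per n_comps value inside it.
fig, ax = plt.subplots(figsize=(14, 5))
sns.boxplot(data=long_frame, x="object_id", y="mse", hue="n_comps", ax=ax)
# Call plt.show() or fig.savefig(...) afterwards to view the result.
```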

Let me know if I've misunderstood something


1 Comment

That's exactly what I wanted! Thanks a bunch!
