I would like to create a histogram plot comparing three groups. However, I'd like to normalize each histogram by the number of counts within its own group, not by the total number of counts across all groups. Here is the code that I have.

    library(ggplot2)
    library(reshape2)

    # Creates dataset
    set.seed(9)
    df <- data.frame(values = c(runif(400,20,50), runif(300,40,80), runif(600,0,30)),
                     labels = c(rep("med",400), rep("high",300), rep("low",600)))
    levs <- c("low", "med", "high")
    df$labels <- factor(df$labels, levels = levs)

    ggplot(df, aes(x=values, fill=labels)) +
      geom_histogram(aes(y=..density..), breaks= seq(0, 80, by = 2), alpha=0.2, position="identity")

This generates a histogram that appears to be normalized by density.

However, I decided to cross check this density plot against my manual validation of that density. To do that I used the below code:

    # Separates the low medium and high groups
    df1 <- df[df$labels == "low",]
    df2 <- df[df$labels == "med",]
    df3 <- df[df$labels == "high",]

    # creates histogram for each group that is normalized by the total number of counts
    hist_temp <- hist(df1$values, breaks=seq(0,80, by=2))
    tdf <- data.frame(hist_temp$breaks[2:length(hist_temp$breaks)],hist_temp$counts)
    colnames(tdf) <- c("bins","counts")
    tdf$norm <- tdf$counts/(sum(tdf$counts))
    low1 <- tdf

    hist_temp <- hist(df2$values, breaks=seq(0,80, by=2))
    tdf <- data.frame(hist_temp$breaks[2:length(hist_temp$breaks)],hist_temp$counts)
    colnames(tdf) <- c("bins","counts")
    tdf$norm <- tdf$counts/(sum(tdf$counts))
    med1 <- tdf

    hist_temp <- hist(df3$values, breaks=seq(0,80, by=2))
    tdf <- data.frame(hist_temp$breaks[2:length(hist_temp$breaks)],hist_temp$counts)
    colnames(tdf) <- c("bins","counts")
    tdf$norm <- tdf$counts/(sum(tdf$counts))
    high1 <- tdf

    # Combines normalized histograms for each data frame and melts them into a single vector for plotting
    Tdata <- data.frame(low1$bins,low1$norm,med1$norm,high1$norm)
    colnames(Tdata) <- c("bin","low", "med", "high")
    Tdata <- melt(Tdata,id = "bin")
    levs <- c("low", "med", "high")
    Tdata$variable <- factor(Tdata$variable, levels = levs)

    # Plot the data
    ggplot(Tdata, aes(group=variable, colour= variable)) +
      geom_line(aes(x = bin, y = value))

Which generates a line plot of the normalized counts for each group (plot omitted).

As you can see those are quite different from each other and I can't figure out why. The Y axis should be the same for both of them but it's not. So, assuming I didn't do some stupid math error, I believe I want the histogram to look like the line plot and I can't figure out a way to make that happen. Any help is appreciated and thank you in advance.
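A possible explanation for the mismatch, sketched in base R: a density-scaled histogram divides each bin's share of the group by the bin width so that the bar areas sum to 1, whereas the manual `norm` column makes the bar heights sum to 1. With 2-unit bins the two would then differ by a factor of 2. The sketch below uses `hist()` rather than ggplot2, and the variable names are only illustrative; the sample has the same distribution as the "low" group but is not the identical draw.

```r
set.seed(9)
low    <- runif(600, 0, 30)             # same distribution as the "low" group
breaks <- seq(0, 80, by = 2)
h      <- hist(low, breaks = breaks, plot = FALSE)

proportion <- h$counts / sum(h$counts)  # what the manual "norm" column holds
bin_width  <- diff(h$breaks)            # 2 for every bin here

# density is the proportion divided by the bin width
all.equal(h$density, proportion / bin_width)

sum(proportion)             # bar heights sum to 1
sum(h$density * bin_width)  # bar areas sum to 1
```

If that reading is right, the density histogram and the line plot show the same shapes on y axes that differ only by the bin width.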


Edited to add further examples of what doesn't work:

I have also tried using the ..count../(sum(..count..)) approach with this code:

    # Histogram where each histogram is divided by the total count of all groups
    ggplot(df, aes(x=values, fill=labels, group=labels)) +
      geom_histogram(aes(y=(..count../sum(..count..))), breaks= seq(0, 80, by = 2), alpha=0.2, position="identity")

with this result (plot omitted):

Which just normalizes to the total count of all histograms. This also does not reflect what I see in the line plot. Also, I've tried substituting ..ncount.. for ..count.. (in the numerator, denominator, and numerator and denominator) and that also does not recreate the results shown in the line graph.
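To make the difference between the two denominators concrete, here is a small base R sketch (not ggplot2 itself) that computes the bin counts per group and then divides them once by the pooled total and once by each group's own total. The names `counts`, `pooled` and `per_group` are illustrative only.

```r
set.seed(9)
df <- data.frame(
  values = c(runif(400, 20, 50), runif(300, 40, 80), runif(600, 0, 30)),
  labels = factor(rep(c("med", "high", "low"), times = c(400, 300, 600)),
                  levels = c("low", "med", "high"))
)
breaks <- seq(0, 80, by = 2)

# One column of bin counts per group (columns: low, med, high)
counts <- sapply(split(df$values, df$labels), function(v) {
  hist(v, breaks = breaks, plot = FALSE)$counts
})

pooled    <- counts / sum(counts)                   # divide by all 1300 observations
per_group <- sweep(counts, 2, colSums(counts), "/") # divide by each group's own total

colSums(pooled)     # 600/1300, 400/1300, 300/1300: each group's bars are too low
colSums(per_group)  # 1 for every group, matching the line plot
```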

Additionally, I've tried using "position=stack" rather than identity using the below code:

    ggplot(df, aes(x=values, fill=labels, group=labels)) +
      geom_histogram(aes(y=..density..), breaks= seq(0, 80, by = 2), alpha=0.2, position="stack")

and got this result (plot omitted):

Which also does not reflect the values shown in the line graph.


Progress made! Using the approach outlined at this post by Joran I can now generate the histogram that is the same as the line graph. Below is the code:

    # Plot where each histogram is normalized by its own counts.
    ggplot(df,aes(x=values, fill=labels, group=labels)) +
      geom_histogram(data=subset(df, labels == 'high'),
                     aes(y=(..count../sum(..count..))),
                     breaks= seq(0, 80, by = 2), alpha = 0.2) +
      geom_histogram(data=subset(df, labels == 'med'),
                     aes(y=(..count../sum(..count..))),
                     breaks= seq(0, 80, by = 2), alpha = 0.2) +
      geom_histogram(data=subset(df, labels == 'low'),
                     aes(y=(..count../sum(..count..))),
                     breaks= seq(0, 80, by = 2), alpha = 0.2) +
      scale_fill_manual(values = c("blue","red","green"))

Which produces this graph (plot omitted).
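A single-layer alternative may also be worth trying. This is a sketch under assumptions, not a tested replacement for the code above: it relies on `stat_bin` computing `density` and `width` separately for each group, so that `density * width` is the within-group proportion, and it uses `after_stat()`, which assumes ggplot2 3.3.0 or later (the `..density..` spelling is the older equivalent).

```r
library(ggplot2)

set.seed(9)
df <- data.frame(
  values = c(runif(400, 20, 50), runif(300, 40, 80), runif(600, 0, 30)),
  labels = factor(rep(c("med", "high", "low"), times = c(400, 300, 600)),
                  levels = c("low", "med", "high"))
)

# density * width turns the per-group density back into a per-group proportion
p <- ggplot(df, aes(x = values, fill = labels)) +
  geom_histogram(aes(y = after_stat(density * width)),
                 breaks = seq(0, 80, by = 2),
                 alpha = 0.2, position = "identity") +
  ylab("proportion within group")

# Sanity check: the bar heights of every group should sum to 1
bars <- layer_data(p)
tapply(bars$y, bars$group, sum)
```

Because everything stays in one layer, the fill scale sees the full factor, so the legend would be expected to follow the factor levels (low, med, high) without further work.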

However, I am STILL having trouble re-ordering the data so that the legend reads "low" then "med" then "high", instead of alphabetical order. I've already set the levels of the factors. (See first block of code). Any thoughts?
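One way to control the legend order in the three-layer version might be to pass `breaks` to `scale_fill_manual()`. The sketch below makes a few assumptions that are not in the original code: the colours are given as a named vector (high = blue, low = red, med = green, mirroring what the unnamed vector would give under alphabetical ordering), the three layers are built with `lapply()` instead of being written out, and `after_stat()` is used, which assumes ggplot2 3.3.0 or later.

```r
library(ggplot2)

set.seed(9)
df <- data.frame(
  values = c(runif(400, 20, 50), runif(300, 40, 80), runif(600, 0, 30)),
  labels = factor(rep(c("med", "high", "low"), times = c(400, 300, 600)),
                  levels = c("low", "med", "high"))
)
levs <- levels(df$labels)                             # "low", "med", "high"
cols <- c(high = "blue", low = "red", med = "green")  # assumed colour mapping

# One histogram layer per group; within a layer, sum(count) covers that group only
layers <- lapply(levs, function(l) {
  geom_histogram(
    data = df[df$labels == l, ],
    aes(y = after_stat(count / sum(count))),
    breaks = seq(0, 80, by = 2), alpha = 0.2
  )
})

# breaks = levs asks the legend to list the keys in that order
p <- ggplot(df, aes(x = values, fill = labels)) +
  layers +
  scale_fill_manual(values = cols, breaks = levs)
```

Naming the colour vector keeps each colour tied to its group regardless of the order in which the legend lists them.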

  • In the first chunk you used y = ..density.., which I am guessing accounts for the probability density. Try to add group = labels and use y = ..count../sum(..count..) instead. Commented Feb 22, 2018 at 8:51
  • Thank you for the reply. Unfortunately, that doesn't seem to create the line plot I'm looking for. I've added the results after the line above to show what that produces. Basically, ..count../sum(..count..) only seems to work if you have a single histogram. When you have multiples the sum(..count..) divides by the total sum of all histograms and gives you fractions that are too low. Commented Feb 22, 2018 at 15:51
  • If that is the case, I have a solution that lacks elegance but should work. Filter your data frame per label and use 3 layers instead of 1. Commented Feb 23, 2018 at 16:23
  • @VitorBianchiLanzetta Thanks for the comment. That does seem to work. I'd still like something a little more... elegant if it were possible. It doesn't seem like too strange of a plot. Maybe I'll spend the time to edit the ggplot2 source and submit it to the developers for them to include. Commented Feb 27, 2018 at 4:47
  • I got a feeling that maybe foreach is the way to go but I don't have the time to tackle it right now. You should try it :D Commented Mar 3, 2018 at 1:55

1 Answer

To use counts for each category, maybe position="stack"?

    ggplot(df, aes(x=values, fill=labels)) +
      geom_histogram(aes(y=..density..), breaks= seq(0, 80, by = 2), alpha=0.4, position="stack") +
      geom_density(alpha=.2, position="stack")

It gives me this distribution, but still seems different than your second plot.

1 Comment

Thanks for the reply. I've added the code to my post as another method that doesn't seem to work. After sleeping on it and looking at the line plot code again I still don't see what I could be doing wrong. It also seems crazy that I can produce half a dozen different histograms, all normalized in different ways, and none of them are the way that makes the most sense to me.
