
I'm having some trouble using factors in functions, or even in basic calculations. I have a data frame something like this (but with as many as 6000 different factor levels).

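    df <- data.frame(
      p  = runif(20) * 100,
      q  = sample(1:100, 20, replace = TRUE),
      tt = c("e","e","f","f","f","i","h","e","i","i","f","f","j","j","h","h","h","e","j","i"),
      ta = c("a","a","a","b","b","b","a","a","c","c","a","b","a","a","c","c","b","a","c","b"),
      stringsAsFactors = TRUE  # keep tt and ta as factors on R >= 4.0
    )
    # Note: this renames the third column to "ta" and the fourth to "tt"
    colnames(df) <- c("p", "q", "ta", "tt")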

Here price = p and quantity = q are my variables, and tt and ta are different factors.

First, I would like to find the average price per unit of q for each level of the factor tt:

(p*q ) / sum(q) by tt 

In this case that would give me 3 different results, one each for a, b and c (I have 6000 different levels, so I need to do it smartly :) ).

I have tried using split to make lists, and that way I can get one list holding the prices for each tt level and another holding the quantities, but I can't seem to get from there to, for example, an average. I've also tried tapply, but again I can't see how to incorporate the factors.
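For reference, both approaches mentioned above can be made to work in base R. The following is only a minimal sketch: it assumes the a/b/c column is named `tt` directly (skipping the `colnames` step), and the seed and the names `avg_tapply` and `avg_split` are illustrative additions, not part of the original data.

```r
set.seed(1)  # assumption: seed added only so the sketch is reproducible

df <- data.frame(
  p  = runif(20) * 100,
  q  = sample(1:100, 20, replace = TRUE),
  tt = factor(c("a","a","a","b","b","b","a","a","c","c",
                "a","b","a","a","c","c","b","a","c","b"))
)

# tapply: total of p*q within each level of tt, divided by total q per level
avg_tapply <- tapply(df$p * df$q, df$tt, sum) / tapply(df$q, df$tt, sum)

# split: one sub-data-frame per level, then the same ratio for each piece
avg_split <- sapply(split(df, df$tt), function(g) sum(g$p * g$q) / sum(g$q))

print(avg_tapply)
print(avg_split)
```

Both results hold one value per level of `tt`, so the same code should scale to a factor with thousands of levels without any per-level handling.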

EDIT: I can see I need to clarify:

I need to find 3 values: the average price per unit of q for each factor level. In this simplified case it would be:

a: sum of p*q for rows (1, 2, 3, 7, 8, 11, 13, 14, 18) / sum of q for rows (1, 2, 3, 7, 8, 11, 13, 14, 18)

So the result should be the average price for a, b and c, which is just 3 values.
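The quantity described here is a quantity-weighted mean of p within each group. A small hand-checkable sketch (the data frame `toy` and its numbers are made up purely for illustration) shows that base R's `weighted.mean` gives the same thing as sum(p*q) / sum(q):

```r
# Hypothetical toy data, small enough to verify by hand
toy <- data.frame(
  p  = c(10, 20, 30, 40),
  q  = c(1, 3, 2, 2),
  tt = c("a", "a", "b", "b")
)

# a: (10*1 + 20*3) / (1 + 3) = 70 / 4  = 17.5
# b: (30*2 + 40*2) / (2 + 2) = 140 / 4 = 35
res <- sapply(split(toy, toy$tt), function(g) weighted.mean(g$p, g$q))

print(res)
```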

2 Answers


I'd use plyr to do this:

    library(plyr)
    ddply(df, .(tt), mutate, new_col = (p*q) / sum(q))

              p  q ta tt     new_col
    1  73.92499 70  e  a 11.29857879
    2  58.49011 60  e  a  7.66245932
    3  17.23246 27  f  a  1.01588711
    4  64.74637 42  h  a  5.93743967
    5  55.89372 45  e  a  5.49174103
    6  25.87318 83  f  a  4.68880732
    7  12.35469 23  j  a  0.62043207
    8   1.19060 83  j  a  0.21576367
    9  84.18467 25  e  a  4.59523322
    10 73.59459 66  f  b 10.07726727
    11 26.12099 99  f  b  5.36509998
    12 25.63809 80  i  b  4.25528535
    13 54.74334 90  f  b 10.22178577
    14 69.45430 50  h  b  7.20480246
    15 52.71006 97  i  b 10.60762667
    16 17.78591 54  i  c  5.16365066
    17  0.15036 41  i  c  0.03314388
    18 85.57796 30  h  c 13.80289670
    19 54.38938 44  h  c 12.86630433
    20 44.50439 17  j  c  4.06760541

plyr does have a reputation for being slow; data.table provides similar functionality with much higher performance.
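As a rough sketch of what the data.table version might look like (assuming the data.table package is installed, and again assuming the a/b/c column is called `tt`; the seed and the column name `avg_price` are illustrative choices), grouping with `by` and aggregating inside `j` returns one row per level rather than one per input row:

```r
library(data.table)

set.seed(1)  # assumption: seed added only so the sketch is reproducible

dt <- data.table(
  p  = runif(20) * 100,
  q  = sample(1:100, 20, replace = TRUE),
  tt = c("a","a","a","b","b","b","a","a","c","c",
         "a","b","a","a","c","c","b","a","c","b")
)

# One row per level of tt: quantity-weighted average price
res <- dt[, .(avg_price = sum(p * q) / sum(q)), by = tt]

print(res)
```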


1 Comment

I need to divide the sum of p*q by the total sum of q over all the rows that belong to factor level a, then b, then c, and so forth. So in my example I'm looking for 3 values. I guess I should try to clarify.

If I understood your problem correctly, this should be the answer. Give it a try and let me know, so that I can adjust it if needed.

    # Quantity-weighted average price for each level passed in `tt`
    myRes <- function(tt) {
      out <- NULL
      for (var in tt) {
        index <- which(df[, "tt"] == var)  # rows belonging to this level
        out <- c(out, sum(df[index, "p"] * df[index, "q"]) / sum(df[index, "q"]))
      }
      names(out) <- tt
      return(out)
    }

    threeValue <- myRes(levels(factor(df[, "tt"])))

4 Comments

Have you checked this solution?
Yes, I might be missing something, but I'm looking for 3 values, not the 20 I get from this function. I've updated the question.
I get an error at the very last line: Error in `[.data.frame`(df, index, "q") : object 'index' not found
Sorry, I was in a hurry. It should be fine now!
