R table manipulation

Question

I have a data.frame as below

    PRODUCT=c(rep("A",4),rep("B",2))
    ww1=c(201438,201440,201444,201446,201411,201412)
    ww2=ww1-6
    DIFF=rep(6,6)
    DEMAND=rep(100,6)
    df=data.frame(PRODUCT,ww1,ww2,DIFF,DEMAND)
    df<- df[with(df,order(PRODUCT, ww1)),]
    df
      PRODUCT    ww1    ww2 DIFF DEMAND
    1       A 201438 201432    6    100
    2       A 201440 201434    6    100
    3       A 201444 201438    6    100
    4       A 201446 201440    6    100
    5       B 201411 201405    6    100
    6       B 201412 201406    6    100

I want to add rows to it based upon the conditions below.

For any row in the data, if the product on the previous row is the same as the product on the current row, but ww1 on the previous row is not equal to ww1 - 1 on the current row (that is, the ww1 difference is not 1), then add a new row.

For the newly added row:

- Product will be the same as the product on the earlier row
- ww1 will be ww1 on the earlier row + 1
- ww2 will be ww2 on the earlier row + 1
- ww_diff will be 6
- demand will be 0

The final output that I need is something like below:

    PRODUCT    ww1    ww2 WW_DIFF DEMAND
    A       201438 201432       6    100
    A       201439 201433       6      0
    A       201440 201434       6    100
    A       201441 201435       6      0
    A       201442 201436       6    100
    A       201443 201437       6      0
    A       201444 201438       6    100
    A       201445 201439       6      0
    A       201446 201440       6    100
    B       201411 201405       6    100
    B       201412 201406       6    100

As of now I am thinking of writing a macro in Excel, but it will be very slow, so I would prefer an R solution.

update1===============================

How could I add a column seq? That column is 1 for the earliest ww1 entry of each product and then increments by 1.

    PRODUCT    ww1    ww2 WW_DIFF DEMAND seq
    A       201438 201432       6    100   1
    A       201439 201433       6      0   2
    A       201440 201434       6    100   3
    A       201441 201435       6      0   4
    A       201442 201436       6    100   5
    A       201443 201437       6      0   6
    A       201444 201438       6    100   7
    A       201445 201439       6      0   8
    A       201446 201440       6    100   9
    B       201411 201405       6    100   1
    B       201412 201406       6    100   2

update2=======================================================

I am posting the question again. I un-accepted the previously accepted answer by alistaire because it does not work on my original data; it only works on the small sample of data :(

In the solution below by user alistaire, df3 <- right_join(df, data.frame(ww1=ww1big)) seems to be causing the issue.

In a final solution, I would also prefer if columns are specified by their names. That way I won't be forced to arrange columns in a particular order.

Is there a reason you aren't filtering the data rather than adding the data to a df?
– C Ried, Commented May 29, 2015 at 17:59
I didn't get you. I want to add those rows because they are not currently available. How would filtering solve my issue?
– user2543622, Commented May 29, 2015 at 18:09
Is the table df ordered first by PRODUCT and then by ww1?
– Stibu, Commented May 29, 2015 at 22:19
Just out of curiosity, is my answer valid for your purposes or does it have some issue I have not contemplated? In my opinion, it is impolite that after I took the time to answer your question, you do not even reply to tell me that something is wrong or that it does not properly fit your needs, especially when you have posted other comments. I think this kind of behaviour discourages people from answering.
– Jon Nagra, Commented Jun 5, 2015 at 9:20

David Arenburg · Accepted Answer · 2015-06-05 08:42:50Z

Here's a very similar data.table solution that I assume should be more efficient as I'm minimizing the calculations per group and using binary join instead.

    library(data.table)
    setkey(setDT(df), PRODUCT, ww1) ## Sorting by `PRODUCT` and `ww1`
    indx <- setkey(df[, list(ww1 = seq.int(ww1[1L], ww1[.N], by = 1L)), by = PRODUCT]) ## running `seq.int` on `ww1` per group
    res <- df[indx][is.na(ww2), `:=`(ww2 = ww1 - 6L, DIFF = 6L, DEMAND = 0L)] ## filling the missing values
    res[, seq := seq_len(.N), by = PRODUCT] # Adding index
    res
    #     PRODUCT    ww1    ww2 DIFF DEMAND seq
    #  1:       A 201438 201432    6    100   1
    #  2:       A 201439 201433    6      0   2
    #  3:       A 201440 201434    6    100   3
    #  4:       A 201441 201435    6      0   4
    #  5:       A 201442 201436    6      0   5
    #  6:       A 201443 201437    6      0   6
    #  7:       A 201444 201438    6    100   7
    #  8:       A 201445 201439    6      0   8
    #  9:       A 201446 201440    6    100   9
    # 10:       B 201411 201405    6    100   1
    # 11:       B 201412 201406    6    100   2

alistaire · Accepted Answer · 2015-05-30 00:20:50Z

Based on the instructions, you'd still have gaps in ww1 if there is more than one missing value in a row. Nevertheless, you can follow the stated logic exactly like this:

    require(dplyr)
    df2 <- rbind(df,
                 unique(do.call(rbind,
                                lapply(seq(nrow(df)), function(x){
                                    toAdd <- filter(df[1:x-1,], PRODUCT == df[x, 'PRODUCT'], ww1 != df[x,'ww1']-1)
                                    if(nrow(toAdd) > 0){
                                        toAdd$ww1 <- toAdd$ww1+1
                                        toAdd$ww2 <- toAdd$ww2+1
                                        toAdd$DEMAND <- 0
                                        toAdd
                                    }
                                })))
    )

which returns

    > df2
      PRODUCT    ww1    ww2 DIFF DEMAND
    1       A 201438 201432    6    100
    2       A 201439 201433    6      0
    3       A 201440 201434    6    100
    4       A 201441 201435    6      0
    5       A 201444 201438    6    100
    6       A 201445 201439    6      0
    7       A 201446 201440    6    100
    8       B 201411 201405    6    100
    9       B 201412 201406    6    100
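The gap that the literal rule leaves behind can be checked directly. The following is a minimal base R sketch, not part of the original answer: `df2_check` only re-creates the PRODUCT and ww1 columns of the output above (sorted), and `gap_after` is a hypothetical helper column.

```r
# Re-create the PRODUCT/ww1 columns of df2 from the output above (sorted)
df2_check <- data.frame(
  PRODUCT = c(rep("A", 7), rep("B", 2)),
  ww1 = c(201438, 201439, 201440, 201441, 201444, 201445, 201446,
          201411, 201412),
  stringsAsFactors = FALSE
)

# Difference to the next ww1 within the same product (NA on the last row
# of each product); anything larger than 1 is a gap that is still open
df2_check$gap_after <- ave(
  df2_check$ww1, df2_check$PRODUCT,
  FUN = function(w) c(diff(w), NA)
)

still_open <- df2_check[!is.na(df2_check$gap_after) &
                          df2_check$gap_after > 1, ]
print(still_open)
```

On this sample the only remaining gap is the one after ww1 = 201441, where weeks 201442 and 201443 are still missing for product A.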

If, on the other hand, you want rows for every value of ww1 between the min and max for each product, this will work:

    require(dplyr)
    df <- group_by(df, PRODUCT)
    extremes <- summarise(df, maxw=max(ww1), minw=min(ww1))
    ww1big <- do.call(c, lapply(seq(nrow(extremes)), function(x){
        seq(extremes[[x, 3]], extremes[[x, 2]])
    }))
    df3 <- right_join(df, data.frame(ww1=ww1big))
    nullindex <- seq_along(df3$PRODUCT)[is.na(df3$PRODUCT)]
    # the `for` and `while` loops will be slow if the dataset is REALLY huge, but they're pretty simple
    nullreplace <- nullindex
    for(i in 1:length(nullreplace)){
        while(is.na(df3[nullreplace[i], 1])){
            nullreplace[i]<-nullreplace[i]-1
        }
    }
    df3[nullindex, c(1, 4)] <- df3[nullreplace, c(1, 4)]
    df3[nullindex, 5] <- 0
    df3[nullindex, 3] <- df3[nullreplace, 3] + (nullindex-nullreplace)

which returns:

    > df3
    Source: local data frame [11 x 5]
    Groups: PRODUCT

       PRODUCT    ww1    ww2 DIFF DEMAND
    1        A 201438 201432    6    100
    2        A 201439 201433    6      0
    3        A 201440 201434    6    100
    4        A 201441 201435    6      0
    5        A 201442 201436    6      0
    6        A 201443 201437    6      0
    7        A 201444 201438    6    100
    8        A 201445 201439    6      0
    9        A 201446 201440    6    100
    10       B 201411 201405    6    100
    11       B 201412 201406    6    100

Both solutions make use of the dplyr package, and neither is terribly elegant. They should both be fast, though, aside from the one for/while loop in the second solution, which is relatively simple. It could probably be rewritten with an *apply function if necessary, though it would be less readable. Both can handle additional products easily.
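One loop-free way to compute `nullreplace` is sketched below. It uses `cummax()` rather than an *apply function, so it is a swapped-in technique and not the answer's own code. The `product` vector is a hypothetical stand-in for `df3$PRODUCT` after the join, and the sketch assumes the first entry is never NA (the loop version needs the same assumption).

```r
# Hypothetical PRODUCT column after the right join: NA marks an inserted row
product <- c("A", NA, "A", NA, NA, "A", "B", "B")

nullindex <- which(is.na(product))

# For every position, the index of the most recent non-NA entry:
# non-NA positions keep their own index, NA positions contribute 0,
# and cummax() carries the last real index forward
last_filled <- cummax(ifelse(is.na(product), 0L, seq_along(product)))

# Nearest earlier complete row for each NA row
nullreplace <- last_filled[nullindex]
print(nullreplace)
```

Each NA position is mapped to the closest preceding row that has a product, which is what the for/while loop finds one step at a time.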

edit 1=========================

It's super easy, actually, because the data.frame is already grouped by product by dplyr, so all you need is

df3 <- mutate(df3, seq=seq_along(PRODUCT))

and you get

    > df3
    Source: local data frame [11 x 6]
    Groups: PRODUCT

       PRODUCT    ww1    ww2 DIFF DEMAND seq
    1        A 201438 201432    6    100   1
    2        A 201439 201433    6      0   2
    3        A 201440 201434    6    100   3
    4        A 201441 201435    6      0   4
    5        A 201442 201436    6      0   5
    6        A 201443 201437    6      0   6
    7        A 201444 201438    6    100   7
    8        A 201445 201439    6      0   8
    9        A 201446 201440    6    100   9
    10       B 201411 201405    6    100   1
    11       B 201412 201406    6    100   2
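If the data is a plain, ungrouped data.frame, the same per-product counter can be sketched in base R with `ave()`. The small `df3_base` frame is a hypothetical stand-in, and the sketch assumes the rows are already sorted by ww1 within each product.

```r
# Hypothetical, already sorted stand-in for the filled data
df3_base <- data.frame(
  PRODUCT = c("A", "A", "A", "B", "B"),
  ww1 = c(201438, 201439, 201440, 201411, 201412),
  stringsAsFactors = FALSE
)

# ave() applies seq_along() within each PRODUCT group and returns the
# results in the original row order
df3_base$seq <- ave(seq_along(df3_base$PRODUCT), df3_base$PRODUCT,
                    FUN = seq_along)
print(df3_base)
```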

Thanks :). For my actual data, I am planning to name the columns as in the original question. I will also sort my data by PRODUCT and then, within each product, by ww1. Would that be sufficient to run your scripts? Are you making any other assumptions? Do I have to maintain the sequence of columns?
Again, sorting is easy with dplyr; just run df3 <- arrange(df3, PRODUCT, ww1). Sequence of columns doesn't matter, but you do need to make sure you have dplyr installed first. install.packages("dplyr") should do the trick, as long as R knows where a CRAN mirror is. Hadley wrote a nice vignette if you need to do more.
Oops, actually the second solution does depend on column order. To make it order-independent, just replace the integers in the subsets with the name of the column in quotation marks (or use $ notation).
I didn't get you :(. Would it be possible to update your second solution? That is the only one I use...
@alistaire I am adding a new column to my data based upon the post stackoverflow.com/questions/30553282/r-text-manipulation. Instead of column ww1, I want df$newww1 to be the column on which all of the calculations above are done, but I am getting an error. Would it be possible to amend your solution to do the same? I get the error in the 2nd statement below:
    extremes <- summarise(df, maxw=max(newww1), minw=min(newww1))
    ww1big <- do.call(c, lapply(seq(nrow(extremes)), function(x){ seq(extremes[[x, 3]], extremes[[x, 2]]) }))
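The name-based indexing suggested in the comments can be sketched as follows. This is base R on a plain data.frame, not the grouped dplyr table from the answer; `df3_named` is a hypothetical stand-in for `df3` right after the join, `nullreplace` is written out by hand for this sample, and `week_col` is an assumed variable that would be set to "newww1" if that is the column to work on.

```r
week_col <- "ww1"  # assumption: swap in "newww1" if that column is used

# Small stand-in for df3 right after the join: NA marks an inserted week
df3_named <- data.frame(
  PRODUCT = c("A", NA, "A", NA, NA, "A"),
  ww1     = 201438:201443,
  ww2     = c(201432, NA, 201434, NA, NA, 201437),
  DIFF    = c(6, NA, 6, NA, NA, 6),
  DEMAND  = c(100, NA, 100, NA, NA, 100),
  stringsAsFactors = FALSE
)

nullindex   <- which(is.na(df3_named$PRODUCT))
nullreplace <- c(1, 3, 3)  # nearest earlier complete row for each NA row

# Same three assignments as in the answer, with names instead of positions
df3_named[nullindex, c("PRODUCT", "DIFF")] <-
  df3_named[nullreplace, c("PRODUCT", "DIFF")]
df3_named[nullindex, "DEMAND"] <- 0
df3_named[nullindex, "ww2"] <-
  df3_named[nullreplace, "ww2"] + (nullindex - nullreplace)

# The week range can be taken by name as well
week_range <- range(df3_named[[week_col]])
print(df3_named)
```

Because every column is addressed by name, reordering the columns of the input does not change the result.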

Arun · Accepted Answer · 2015-06-06 08:55:07Z

I have recently had to work with big tables and have become a great fan of the data.table package (it is really fast and can add new variables by reference, without copying the table).

With it the solution would be as follows:

    library(data.table)
    # convert to data.table
    dtable = as.data.table(df)
    # create the variables grouped by PRODUCT
    dtransf <- dtable[, .(ww1 = seq(min(ww1), max(ww1), 1L),
                          ww2 = seq(min(ww2), max(ww2), 1L),
                          DIFF = 6L,
                          DEMAND = as.integer(seq(min(ww1), max(ww1),1L) %in% unique(ww1)) * 100),
                      by = PRODUCT]
    #add the incremental counter
    dtransf[,seq := seq_len(.N), by = PRODUCT]

The code is a bit case-specific (especially the DEMAND calculation, which assumes every existing row has a demand of 100); in a more complex situation you will probably need a join to fill in the right demand. Also, bear in mind that if there is an error in the dataset (for instance, ww1 and ww2 not having the same difference on every row), the code will fail.
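A join-based variant that keeps each row's own demand might look like the sketch below. It is base R with `merge()` rather than data.table, so it is a swapped-in technique; `grid`, `filled` and `missing_rows` are hypothetical names, and the fixed offset of 6 between ww1 and ww2 is taken from the sample data.

```r
# Sample data from the question
df <- data.frame(
  PRODUCT = c(rep("A", 4), rep("B", 2)),
  ww1 = c(201438, 201440, 201444, 201446, 201411, 201412),
  stringsAsFactors = FALSE
)
df$ww2 <- df$ww1 - 6
df$DIFF <- 6
df$DEMAND <- 100

# Guard against the failure mode mentioned above: the ww1/ww2 offset must
# match DIFF on every row
stopifnot(all(df$ww1 - df$ww2 == df$DIFF))

# One row per product and week between that product's first and last ww1
grid <- do.call(rbind, lapply(split(df, df$PRODUCT), function(d) {
  data.frame(PRODUCT = d$PRODUCT[1],
             ww1 = seq(min(d$ww1), max(d$ww1), by = 1),
             stringsAsFactors = FALSE)
}))

# Left join: weeks without a match get NA in ww2, DIFF and DEMAND
filled <- merge(grid, df, by = c("PRODUCT", "ww1"), all.x = TRUE)
missing_rows <- is.na(filled$DEMAND)
filled$ww2[missing_rows] <- filled$ww1[missing_rows] - 6  # assumed offset
filled$DIFF[missing_rows] <- 6
filled$DEMAND[missing_rows] <- 0
filled <- filled[order(filled$PRODUCT, filled$ww1), ]
print(filled)
```

Because the demand comes from the join rather than from a hard-coded 100, rows with differing demand values would be carried over unchanged.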

nathanesau · Accepted Answer · 2015-06-03 15:13:17Z

    # NEW SOLUTION
    nrows = length(df[,1])
    newdf = df[1,]
    myseq = 1
    for(i in 2:nrows) {
      currentRow = df[i,]
      tmpRow = df[i-1,]
      if(tmpRow$ww1 < (currentRow$ww1 - 1)) {
        tmp = (tmpRow$ww1+1):(currentRow$ww1-1)
        tmp.length = length(tmp)
        tmp.last = ifelse(length(myseq)==0, 1, tail(myseq,1)+1)
        myseq = c(myseq, tmp.last:(tmp.last + tmp.length))
        tmpdf = data.frame(PRODUCT=rep(tmpRow$PRODUCT, tmp.length), ww1=tmp, ww2=tmp-6, DIFF=rep(6,tmp.length),DEMAND=rep(0,tmp.length))
        newdf = rbind(newdf,tmpdf,currentRow)
      } else {
        if(tmpRow$ww1==currentRow$ww1-1) {
          myseq = c(myseq, tail(myseq,1)+1)
        } else {
          myseq = c(myseq,1)
        }
        newdf = rbind(newdf,currentRow)
      }
    }
    newdf = cbind(newdf, myseq)
    nrows = length(newdf[,1])
    row.names(newdf) = 1:nrows

    # OLD SOLUTION
    nrows = length(df[,1])
    newdf = df[1,]
    for(i in 2:nrows) {
      previousRow = df[i-1,]
      currentRow = df[i,]
      tmpRow = df[i-1,]
      if(tmpRow$ww1 < currentRow$ww1) {
        while(tmpRow$ww1 + 1 != currentRow$ww1) {
          tmpRow$ww1 = tmpRow$ww1 + 1
          tmpRow$ww2 = tmpRow$ww2 + 1
          # diff doesn't change
          tmpRow$DEMAND = 0
          # rbind current row
          newdf=rbind(newdf,tmpRow)
        }
      }
      newdf=rbind(newdf,currentRow)
    }
    nrows = length(newdf[,1])
    row.names(newdf) = 1:nrows

Just realized that you wanted a column "seq" - I'll update my code.
New solution adds on the new "seq" column. Also it is much faster - rather than calling rbind for every row, you can calculate a group of rows (stored in tmpdf) and bind these to the newdf
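The batching idea can be taken one step further by building one block of rows per product and calling rbind only once. The sketch below is a swapped-in variant, not the loop from this answer: `pieces` is a hypothetical name, the offset of 6 between ww1 and ww2 is taken from the sample data, and the input is assumed to hold at most one row per product and week.

```r
# Sample data from the question
df <- data.frame(
  PRODUCT = c(rep("A", 4), rep("B", 2)),
  ww1 = c(201438, 201440, 201444, 201446, 201411, 201412),
  stringsAsFactors = FALSE
)
df$ww2 <- df$ww1 - 6
df$DIFF <- 6
df$DEMAND <- 100

# Build one complete block of rows per product
pieces <- lapply(split(df, df$PRODUCT), function(d) {
  weeks <- seq(min(d$ww1), max(d$ww1), by = 1)
  present <- weeks %in% d$ww1
  data.frame(
    PRODUCT = d$PRODUCT[1],
    ww1 = weeks,
    ww2 = weeks - 6,  # assumed fixed offset
    DIFF = 6,
    # existing weeks keep their demand, inserted weeks get 0
    DEMAND = ifelse(present, d$DEMAND[match(weeks, d$ww1)], 0),
    seq = seq_along(weeks),
    stringsAsFactors = FALSE
  )
})

# Bind all blocks in a single call instead of growing newdf row by row
newdf <- do.call(rbind, pieces)
row.names(newdf) <- NULL
print(newdf)
```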
