Create individual rows based on sum value for fake dataset

Question

I am creating a fake dataset, and would like to essentially disaggregate a sum to create dummy rows that I can populate with random dates.

For example, my df might look like this:

    id    orders skips
    joe        3     0
    mary       2     1
    jack       5     1

What I want to produce is a data.frame or data.table that looks like this, where a successful order is 1 and a skip is 0:

    id    order
    joe       1
    joe       1
    joe       1
    mary      1
    mary      0
    mary      1
    jack      1
    jack      1
    jack      1
    jack      1
    jack      0
    jack      1

ADDITION: Ideally, the 0 values would be randomly mixed/sandwiched between 1 values if possible. This is due to a quirk of what the dataset will be used for in a problem set.

In a perfect world, I'd then assign a random start_date from a given range to each order within id, such that:

    id    order date
    joe       1 1/2/2016
    joe       1 1/3/2016
    joe       1 1/8/2016
    mary      1 1/10/2016
    mary      0 1/3/2016
    mary      1 1/5/2016
    jack      1 1/7/2016
    jack      1 1/2/2016
    jack      1 1/1/2016
    jack      1 1/10/2016
    jack      0 1/12/2016
    jack      1 1/15/2016

I initially thought that I could use a combination of dcast and reshape to trick R into making the dataset, e.g. dcast(df, id ~ orders, fun.aggregate = length), but this took me down the wrong path.

But, one must walk before they crawl. Anyone able to help?

@josliber I've added a few of my ideas (dcast and reshape) but didn't want to send anyone down a rabbit hole that I knew to be wrong. Hopefully this helps!
– roody, Commented Feb 29, 2016 at 0:14
x <- Vectorize(rep)(setNames(rep(1:0, nrow(df)), rep(df[, 1], each = 2)), (t(df[, -1]))); data.frame(id = names(x), order = x)
– rawr, Commented Feb 29, 2016 at 0:23

nrussell · Accepted Answer · 2016-02-29 00:32:36Z

Here's one approach with data.table:

    dt[, .(order = rep(c(1, 0), c(orders, skips))), by = "id"]
    #       id order
    #  1:  joe     1
    #  2:  joe     1
    #  3:  joe     1
    #  4: mary     1
    #  5: mary     1
    #  6: mary     0
    #  7: jack     1
    #  8: jack     1
    #  9: jack     1
    # 10: jack     1
    # 11: jack     1
    # 12: jack     0

Data:

    library(data.table)
    dt <- fread(
    "id orders skips
    joe 3 0
    mary 2 1
    jack 5 1"
    )

Not in my question now (I'll go back and edit), but do you have any thoughts on how I could make it so that the 0's are in middle rows when possible? e.g., row #6 would be row #5. This is a quirk of the problem that the fake dataset will be used for.
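One way to address the comment above is to wrap the repeated vector in sample(), which shuffles the 1s and 0s within each id. This is only a sketch: the seed and the name shuffled are illustrative assumptions, and a random shuffle mixes the 0s in but does not guarantee they land strictly between 1s.

```r
library(data.table)

dt <- fread("id orders skips
joe 3 0
mary 2 1
jack 5 1")

set.seed(42)  # assumption: fixed seed, only so the shuffle is reproducible

# rep() builds the 1s and 0s for each id; sample() returns them in random order
shuffled <- dt[, .(order = sample(rep(c(1, 0), c(orders, skips)))), by = "id"]
```

The row counts and the number of 1s per id are unchanged; only the position of the 0s within each id varies from run to run.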

alistaire · Accepted Answer · 2016-02-29 01:04:15Z

You can do it in base R using tapply (or split and lapply, if you prefer) and then rbinding everything back together:

    df2 <- do.call(rbind, tapply(df, df$id, function(x){
        data.frame(id = rep(x$id, sum(x$orders, x$skips)),
                   order = sample(rep(c(1, 0), c(x$orders, x$skips))))
    }))
    rownames(df2) <- NULL

where tapply runs the anonymous function across groups of df$id, and do.call(rbind, ...) binds the resulting list back into a single data.frame. The anonymous function makes a data.frame by repeating id the necessary number of times and using sample to shuffle 1 and 0 repeated orders and skips times, respectively.

One hiccup, which should be fixable: rbind automatically creates row names, which are ugly and unnecessary. There is an argument to turn this off, but I can't get it arranged in the do.call structure properly, so the above just erases them in a second line. (If you know the right place to stick make.row.names = FALSE, comment and I'll edit.)
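For what it's worth, one place make.row.names = FALSE can go is inside the list handed to do.call, since do.call treats named list elements as named arguments. The sketch below uses the split and lapply variant mentioned earlier; the df definition and the name pieces are assumptions added so the example runs on its own.

```r
# Assumed input, matching the example in the question
df <- data.frame(
  id     = c("joe", "mary", "jack"),
  orders = c(3, 2, 5),
  skips  = c(0, 1, 1)
)

# One data.frame per id, built the same way as in the answer
pieces <- lapply(split(df, df$id), function(x) {
  data.frame(
    id    = rep(x$id, x$orders + x$skips),
    order = sample(rep(c(1, 0), c(x$orders, x$skips)))
  )
})

# c() appends a named element to the list, and do.call passes it to
# rbind as the argument make.row.names = FALSE
df2 <- do.call(rbind, c(pieces, make.row.names = FALSE))
```

With this, the row names come out as plain 1 to 12 and the separate rownames(df2) <- NULL line is not needed.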

The result:

    > df2
         id order
    1  jack     0
    2  jack     1
    3  jack     1
    4  jack     1
    5  jack     1
    6  jack     1
    7   joe     1
    8   joe     1
    9   joe     1
    10 mary     1
    11 mary     0
    12 mary     1
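The question also asks for a random date per row. A minimal sketch of that step, assuming the range is January 2016 (the question does not state one; its example dates just happen to fall there) and using a hard-coded stand-in for df2 so the example is self-contained:

```r
# Stand-in for the expanded result above (one row per order or skip)
df2 <- data.frame(
  id    = rep(c("jack", "joe", "mary"), times = c(6, 3, 3)),
  order = c(0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1)
)

# Assumed date range; swap in whatever range the dataset needs
date_pool <- seq(as.Date("2016-01-01"), as.Date("2016-01-31"), by = "day")

set.seed(1)  # assumption: fixed seed, only so the draw is reproducible

# Draw one date per row, with replacement, so dates may repeat
df2$date <- sample(date_pool, nrow(df2), replace = TRUE)
```

The dates are drawn independently of id, so they are not sorted within each person, which matches the unsorted example in the question.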
