
I am currently having performance issues with one of my scripts. I wrote the script as a result of this question, but I have been unable to improve its performance and figured that improving it is a different question from actually writing the code.

I wrote the code to generate a dummy webshop dataset with a hidden pattern that can be found with clustering, as an example for one of my courses. However, it does not let me go beyond ~40,000 transactions within a reasonable runtime (i.e. a few hours).

The issue is as follows: using these parameters I will build transaction, customer and product tables:

set.seed(1) # Set seed to make reproducible

Parameters <- data.frame(
  CustomerType     = c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseeker"),
  PropCustTypes    = c(.10, .45, .30, .15),    # Probability for being in each group.
  BySearchEngine   = c(0.10, .40, 0.50, 0.6),  # Probability for each group
  ByDirectCustomer = c(0.60, .30, 0.15, 0.05), # of coming through channel X
  ByPartnerBlog    = c(0.30, .30, 0.35, 0.35), #
  Timeliness       = c(1, 6, 12, 12),          # Average # of months between purchase & release date.
  Discount         = c(0, 0, 0.05, 0.10),      # Average discount incurred when purchasing.
  stringsAsFactors = FALSE)

# Some other parameters for later use.
NumDays <- 1000
NumTransactions <- 100000  # Note that more than these will be made, it's a starting point
                           # (excluding annual growth, weekend increases etc.)
SalesMultiplierWeekends <- 1.5  # For example, I want more sales in weekends
StartDate <- as.Date("2009-01-04")
NumProducts <- 150
AnnualGrowth <- .1  # I also want an annual growth trend

I start with a 'days' data frame, dividing the total number of transactions almost equally over all days.

days <- data.frame( # Define the days
  day            = StartDate + 1:NumDays,
  DaysSinceStart = StartDate + 1:NumDays - StartDate, # Used to compute a rising trend
  CustomerRate   = NumTransactions / NumDays)

days$nPurchases <- rpois(NumDays, days$CustomerRate)

# Increase sales in weekends
days$nPurchases[as.POSIXlt(days$day)$wday %in% c(0, 6)] <-
  as.integer(days$nPurchases[as.POSIXlt(days$day)$wday %in% c(0, 6)] * SalesMultiplierWeekends)

days$nPurchases <- as.integer(days$nPurchases +
  days$nPurchases * (days$DaysSinceStart / 365) * AnnualGrowth)

Next I generate the transactions using this table:

Transactions <- data.frame(
  ID           = 1:sum(days$nPurchases),
  Date         = rep(days$day, times = days$nPurchases),
  CustomerType = sample(Parameters$CustomerType, sum(days$nPurchases),
                        replace = TRUE, prob = Parameters$PropCustTypes),
  NewCustomer  = sample(c(0, 1), sum(days$nPurchases),
                        replace = TRUE, prob = c(.8, .2)), # .8 existing / .2 new customers
  CustomerID   = NA, # Will be assigned later
  ProductID    = NA, # Will be assigned later
  ReferredBy   = NA) # Will be assigned later
Transactions$CustomerType <- as.character(Transactions$CustomerType)

Now I'd like to dynamically assign products and customers to each transaction in order to make my pattern recognizable in the transaction dataset. I first make a product table to choose from, with convenient release dates so that I can select a product for each transaction based on this information.

StartProductRelease <- StartDate - (365 * 2 * max(Parameters$Timeliness) / 12)
ReleaseRange <- StartProductRelease + c(1:(StartDate + NumDays - StartProductRelease))
Upper <- max(ReleaseRange)
Lower <- min(ReleaseRange)

Products <- data.frame(
  ID             = 1:NumProducts,
  DateReleased   = as.Date(StartProductRelease +
                     c(seq(as.numeric(Upper - Lower) / NumProducts,
                           as.numeric(Upper - Lower),
                           as.numeric(Upper - Lower) / NumProducts))),
  SuggestedPrice = rnorm(NumProducts, 100, 50))

# Floor the price: anything under 10$ is set to 15$
Products[Products$SuggestedPrice < 10, ]$SuggestedPrice <- 15

Next I build a table of customers, derived from the number of 'new customers' in the transaction dataset.

Customers <- data.frame(
  ID           = 1:sum(Transactions$NewCustomer),
  CustomerType = sample(Parameters$CustomerType, size = sum(Transactions$NewCustomer),
                        replace = TRUE, prob = Parameters$PropCustTypes))
Customers$CustomerType <- as.character(Customers$CustomerType)

I want to dynamically assign customers and products to each transaction, sampled from the 'Customers' and 'Products' data frames, in order to maintain the overall parameters defined above. I'd like to vectorize this, but I have no idea how (I've already moved as much as I could out of the for loop). The part outside of the for loop:

ReferredByOptions <- c("BySearchEngine", "Direct Customer", "Partner Blog")

# Parameters are now assigned to each transaction
Transactions <- merge(Transactions, Parameters, by = "CustomerType")
Transactions$Discount <- rnorm(length(Transactions$ID),
                               Transactions$Discount, Transactions$Discount / 20)
Transactions$Timeliness <- rnorm(length(Transactions$ID),
                                 Transactions$Timeliness, Transactions$Timeliness / 6)

Now the performance issues start to arise, in the for loop:

for (i in 1:nrow(Transactions)) {
  # Only sample customers which share the same 'CustomerType' as the transaction
  Transactions[i, ]$CustomerID <- sample(
    Customers[Customers$CustomerType == Transactions[i, ]$CustomerType, ]$ID,
    1, replace = FALSE)

  # Sample the 'ReferredBy' based upon the proportions described in 'Parameters'
  Transactions[i, ]$ReferredBy <- sample(
    ReferredByOptions, 1, replace = FALSE,
    prob = Transactions[i, c("BySearchEngine", "ByDirectCustomer", "ByPartnerBlog")])

  # Only sample products in the required range to maintain the 'Timeliness' parameter
  CenteredAround <- as.Date(Transactions[i, ]$Date - Transactions[i, ]$Timeliness * 30)
  ProductReleaseRange <- as.Date(CenteredAround + c(-15:15))
  Transactions[i, ]$ProductID <- sample(
    Products[as.character(Products$DateReleased) %in% as.character(ProductReleaseRange), ]$ID,
    1, replace = FALSE)
}

This brings me to my final question: how would I vectorize this last part? I've been able to munge millions of rows with data.table in seconds, so it seems weird that such a relatively simple task runs this slowly. Some timings:

  • For loop / filling 100 rows: ~ 18 Seconds
  • For loop / filling 200 rows: ~ 37 Seconds
  • For loop / filling 1000 rows: ~ 3 minutes
  • For loop / filling 300000 rows: No idea, can't get that far?

Why is it running so slow and how can I solve this? Any help is greatly appreciated.

  • One guess: try to preallocate things like Transactions[1:N,]$ReferredBy so that you're not re-allocating every time through the loop (adding a new element each time). Similarly, can you generate either a pile of samples or at least a vector of random numbers (to use as indices) outside the loop rather than calling sample a zillion times? Commented Apr 18, 2014 at 12:55
  • I understand what you're saying and how it would be better. I'm having difficulty converting it to R code, though; would you mind providing a bit more info on how I would generate the pile of samples/vector of random numbers while maintaining the same functionality? Commented Apr 18, 2014 at 13:01
  • You could also try upgrading to R 3.1, or avoiding the data frame and just using vectors. Commented Apr 18, 2014 at 13:16
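
A rough sketch of what these comments suggest (illustrative only, not from the thread): preallocate the result vectors and draw the random numbers once, outside the loop, then fill by index. The names n, pool and u are mine, and this only covers the first two assignments; the product sampling would still need its own treatment, as the answers below show.

n      <- nrow(Transactions)
custId <- integer(n)    # preallocated result vectors, filled by index below
refBy  <- character(n)
u      <- runif(n)      # one vector of random draws, generated up front

for (i in seq_len(n)) {
  # Pick a customer of the matching type, using a pre-drawn random number as index
  pool      <- Customers$ID[Customers$CustomerType == Transactions$CustomerType[i]]
  custId[i] <- pool[ceiling(u[i] * length(pool))]
  # Referral channel, using the per-row probabilities merged in from Parameters
  refBy[i]  <- sample(ReferredByOptions, 1,
                      prob = unlist(Transactions[i, c("BySearchEngine",
                                                      "ByDirectCustomer",
                                                      "ByPartnerBlog")]))
}
Transactions$CustomerID <- custId
Transactions$ReferredBy <- refBy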

2 Answers


Below is how you would do the first part using data.table, adding CustomerID to the Transactions table. I have changed some names and dropped the placeholder columns as they will be added through the data.table joins.

library(data.table)

Tr <- data.table(Transactions)
Tr[, CustomerID := NULL]
Tr[, ProductID  := NULL]
Tr[, ReferredBy := NULL]  ## see @Arun's comment for a more compact way to do this

Cs <- data.table(Customers)
setnames(Cs, 'ID', 'CustomerID')  ## So we avoid a duplicate name with Tr

## Add CustomerID, matching customer types
setkey(Tr, CustomerType)
setkey(Cs, CustomerType)

# Make an index Transaction ID -> Customer ID.
# A large interim matrix should not be formed, but I am not sure.
TrID2CustID <- Cs[Tr, allow.cartesian = TRUE][, list(CustomerID = sample(CustomerID, 1)), by = ID]
setkey(TrID2CustID, ID)
setkey(Tr, ID)
Tr <- Tr[TrID2CustID]

There is a large matrix that is the cartesian product of your Transactions and Customers tables (about 15M rows), which would exhaust memory if it were explicitly computed. Judging by the fact that this takes about a second, I'd say it is not, but I am not sure.
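
As a quick, optional sanity check (not part of the original answer), you can verify that every sampled CustomerID indeed shares its transaction's CustomerType:

chk <- merge(Tr[, .(ID, CustomerID, CustomerType)], Cs, by = "CustomerID")
stopifnot(all(chk$CustomerType.x == chk$CustomerType.y))  # types must match after the join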

I will work on the rest and edit the answer if I come up with the solutions quickly, but this ought to show you how to do this using data.table.

UPDATE 1: adding ReferredBy

Since the referral probabilities only vary by CustomerType, you can generate the referrals in blocks with replacement (much faster than by individual ID):

setkey(Tr, CustomerType)
Tr[, ReferredBy := sample(ReferredByOptions, replace = TRUE, size = .N,
                          prob = c(BySearchEngine[1], ByDirectCustomer[1], ByPartnerBlog[1])),
   by = CustomerType]
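
As another optional check (again not from the original answer), the realised referral mix per CustomerType should be close to the probabilities in Parameters:

# Counts and proportions of each referral channel within each customer type
Tr[, .N, by = .(CustomerType, ReferredBy)][, Prop := N / sum(N), by = CustomerType][]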

UPDATE 2: adding ProductID

This is proving trickier to do in a neat cartesian-product sort of way. I cannot think of an elegant way to generate the 31 dates (-15:15) for each purchase (a melted matrix would probably be too big). The code below works as intended but is not as fast as the previous two:

Pr <- data.table(Products)
setnames(Pr, 'ID', 'ProductID')  ## not necessary here, but good practice

CenteredAround <- as.Date(Tr$Date - 30 * Tr$Timeliness)
setkey(Tr, ID)
Tr[, ProductID := sample(Pr[abs(Pr$DateReleased - CenteredAround[.I]) <= 15, ProductID], 1),
   by = ID]
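
A possible alternative, sketched here only as an idea and not something the original answer uses: data.table versions 1.9.8 and later support non-equi joins, which can express the ±15-day band directly instead of expanding it into 31 dates. The WindowStart/WindowEnd columns below are hypothetical helpers, not part of the original code.

# Hypothetical sketch: band join of products to transactions on release date
Tr[, `:=`(WindowStart = as.Date(Date - 30 * Timeliness) - 15,
          WindowEnd   = as.Date(Date - 30 * Timeliness) + 15)]
matches <- Pr[Tr, on = .(DateReleased >= WindowStart, DateReleased <= WindowEnd),
              nomatch = 0L, allow.cartesian = TRUE]
# One product per transaction; ProductID[sample(.N, 1)] avoids sample()'s
# scalar behaviour when only a single product falls in the window
TrID2ProdID <- matches[, .(ProductID = ProductID[sample(.N, 1)]), by = ID]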

2 Comments

Note that you can do: dt[, c("a", "b", "c") := NULL] or dt[, `:=`(a=NULL, b=NULL, c=NULL)].
@Arun thanks. I left it like that for clarity, in case MV1990 is not familiar with the syntax. Do you have a suggestion on how the last part can be sped up (without explicitly expanding the index of Pr)? Is there a way to have non-exact matches on a join, say, a band of values not too far from the indexing key?

A very simple optimization is to avoid modifying the data frame in the loop, as others have suggested. At least prior to R3.1, modifying a data frame is really expensive, so that's the last thing you want to be doing in a loop. Also, based on Hadley's comments and release notes for R3.1, it may be the case that modifying data frames is not as expensive with R3.1, but I haven't tested.

Here we get around the data frame modification by storing interim results in vectors, and then only inserting into the data frame after the loop. Consider:

system.time({
  custId    <- Transactions$CustomerID
  refBy     <- Transactions$ReferredBy
  productID <- Transactions$ProductID

  for (i in 1:100) {
    # Only sample customers which share the same 'CustomerType' as the transaction
    custId[i] <- sample(
      Customers[Customers$CustomerType == Transactions[i, ]$CustomerType, ]$ID,
      1, replace = FALSE)

    # Sample the 'ReferredBy' based upon the proportions described in 'Parameters'
    refBy[i] <- sample(
      ReferredByOptions, 1, replace = FALSE,
      prob = Transactions[i, c("BySearchEngine", "ByDirectCustomer", "ByPartnerBlog")])

    # Only sample products in the required range to maintain the 'Timeliness' parameter
    CenteredAround <- as.Date(Transactions[i, ]$Date - Transactions[i, ]$Timeliness * 30)
    ProductReleaseRange <- as.Date(CenteredAround + c(-15:15))
    productID[i] <- sample(
      Products[as.character(Products$DateReleased) %in% as.character(ProductReleaseRange), ]$ID,
      1, replace = FALSE)
  }

  Transactions$CustomerID <- custId
  Transactions$ReferredBy <- refBy
  Transactions$ProductID  <- productID
})

This comes in at:

   user  system elapsed
   0.66    0.06    0.71

The corresponding time with your original code is:

   user  system elapsed
   5.01    1.78    6.79

So, close to a 10x improvement from a minor change (avoiding repeatedly modifying the data frame).

I'm sure you can get further improvements, but this is real low-hanging fruit that's easy to implement.
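
As one example of such a further improvement (a sketch under the same setup, not something the answer claims): the repeated subsetting of Customers inside the loop can be hoisted out by splitting the customer IDs by type once.

# Build the per-type ID pools once, outside the loop
idsByType <- split(Customers$ID, Customers$CustomerType)
# ...then inside the loop a named lookup replaces the repeated subsetting:
#   custId[i] <- sample(idsByType[[Transactions$CustomerType[i]]], 1)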

