I am currently having issues with performance in one of my scripts. I made the script as a result of this question, but I have been unable to increase its performance and figured increasing its performance is a different question than actually writing the code.
I wrote the code to generate a dummy webshop dataset with a hidden pattern hat can be found with clustering as an example in one of my courses. It, however, does not allow me to go beyond ~ 40,000 transactions with a reasonable runtime (i.e. a few hours).
This issue is as follows, using these parameters I will build a transaction/customer/product table:
set.seed(1) # Set seed to make reproducible Parameters <- data.frame( CustomerType = c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseeker"), PropCustTypes = c(.10, .45, .30, .15), # Probability for being in each group. BySearchEngine = c(0.10, .40, 0.50, 0.6), # Probability for each group ByDirectCustomer = c(0.60, .30, 0.15, 0.05), # of coming through channel X ByPartnerBlog = c(0.30, .30, 0.35, 0.35), # Timeliness = c(1,6,12,12), # Average # of months between purchase & releasedate. Discount = c(0,0,0.05,0.10), # Average Discount incurred when purchasing. stringsAsFactors=FALSE) # Some other parameters for later use. NumDays = 1000 NumTransactions = 100000 # Note that more than these will be made, it's a starting point (excluding annual growth, weekend increases etc.) SalesMultiplierWeekends = 1.5 # For example, I want more in weekends StartDate <- as.Date("2009-01-04") NumProducts <- 150 AnnualGrowth <- .1 # I also want an annual growth trend I start with a 'Days' dataframe along with an almost equal division of total transactions over all days.
days <- data.frame( # Define the days day = StartDate+1:NumDays, DaysSinceStart = StartDate+1:NumDays - StartDate, # Used to compute a rising trend CustomerRate = NumTransactions/NumDays) days$nPurchases <- rpois(NumDays, days$CustomerRate) days$nPurchases[as.POSIXlt(days$day)$wday %in% c(0,6)] <- # Increase sales in weekends as.integer(days$nPurchases[as.POSIXlt(days$day)$wday %in% c(0,6)]*SalesMultiplierWeekends) days$nPurchases <- as.integer(days$nPurchases+days$nPurchases * (days$DaysSinceStart/365)*AnnualGrowth) Next I generate the transactions using this table:
Transactions <- data.frame( ID = 1:sum(days$nPurchases), Date = rep(days$day, times=days$nPurchases), CustomerType = sample(Parameters$CustomerType, sum(days$nPurchases), replace=TRUE, prob=Parameters$PropCustTypes), NewCustomer = sample(c(0,1), sum(days$nPurchases),replace=TRUE, prob=c(.8,.2)), CustomerID = NA, # Will be assigned later, NewCustomer: 0.8 and .2 ProductID = NA, # insinuate new/existing customers above ReferredBy = NA) Transactions$CustomerType <- as.character(Transactions$CustomerType) Now I'd like to dynamically assign products and customers to each transaction in order to make my pattern recognizable in the transaction dataset. I first make a product table from which I can choose, having convenient release dates so that I will be able to select a product for each transaction based on this info.
StartProductRelease <- StartDate-(365*2*max(Parameters$Timeliness)/12) ReleaseRange <- StartProductRelease + c(1:(StartDate+NumDays-StartProductRelease)) Upper <- max(ReleaseRange) Lower <- min(ReleaseRange) Products <- data.frame( ID = 1:NumProducts, DateReleased = as.Date(StartProductRelease+c(seq(as.numeric(Upper-Lower)/NumProducts, as.numeric(Upper-Lower), as.numeric(Upper-Lower)/NumProducts))), SuggestedPrice = rnorm(NumProducts, 100, 50)) Products[Products$SuggestedPrice<10,]$SuggestedPrice <- 15 # Cap ProductPrice at 10$ Next I build a table of customers, deriving from the number of 'new customers' in the transaction dataset.
Customers <- data.frame( ID=(1:sum(Transactions$NewCustomer)), CustomerType = sample(Parameters$CustomerType, size=sum(Transactions$NewCustomer), replace=TRUE, prob=Parameters$PropCustTypes) ); Customers$CustomerType <- as.character(Customers$CustomerType) I want to dynamically assign Customers and Products to each transaction, sampled from the 'Products' and 'Customers' dataframe in order to maintain the overall parameters I have defined above. I'd like to vectorize this, but I have no idea on how I would do so (I've already excluded as much as I could from the for loop). The part outside of the for loop:
ReferredByOptions <- c("BySearchEngine", "Direct Customer", "Partner Blog") Transactions <- merge(Transactions,Parameters, by="CustomerType") # Parameters are now Transactions$Discount <- rnorm(length(Transactions$ID), # assigned to each transaction Transactions$Discount,Transactions$Discount/20) Transactions$Timeliness <- rnorm(length(Transactions$ID), Transactions$Timeliness, Transactions$Timeliness/6) Now the performance issues start to arise, the for loop:
for (i in 1:nrow(Transactions)){ # Only sample customers which share the same 'CustomerType' as the transaction Transactions[i,]$CustomerID <- sample(Customers[Customers$CustomerType==Transactions[i,]$CustomerType,]$ID, 1,replace=FALSE) # Sample the 'ReferredBy' based upon the proportions described in 'Parameters' Transactions[i,]$ReferredBy <- sample(ReferredByOptions,1,replace=FALSE, prob=Transactions[i,c("BySearchEngine", "ByDirectCustomer", "ByPartnerBlog")]) # Only sample products in the required range to maintain the 'timeliness' parameter. CenteredAround <- as.Date(Transactions[i,]$Date - Transactions[i,]$Timeliness*30) ProductReleaseRange <- as.Date(CenteredAround+c(-15:15)) Transactions[i,]$ProductID <- sample(Products[as.character(Products$DateReleased) %in% as.character(ProductReleaseRange),]$ID,1,replace=FALSE) } This concludes to my final question: how would I vectorize the last part here? I've been able to munge millions of rows with data.table in seconds, it just seems weird that I'm unable to conduct such a relatively simple task so slow.
- For loop / filling 100 rows: ~ 18 Seconds
- For loop / filling 200 rows: ~ 37 Seconds
- For loop / filling 1000 rows: ~ 3 minutes
- For loop / filling 300000 rows: No idea, can't get that far?
Why is it running so slow and how can I solve this? Any help is greatly appreciated.
Transactions[1:N,]$ReferredByso that you're not re-allocating every time thru the loop (adding a new element each time). Similarly, can you generate either a pile of samples or at least a vector of random numbers (to use as indices) outside the loop rather than callingsamplea zillion times?