I am running sparklyr in local mode from RStudio in Windows 10:

    spark_version <- "2.1.0"
    sc <- spark_connect(master = "local", version = spark_version)

    df <- data.frame(id = c(1, 1, 2, 2), county_code = c(1, 20, 321, 2))
    sprintf("%03d", as.numeric(df$county_code))

    df_tbl = copy_to(sc, df, "df_tbl", overwrite = TRUE)

    df_tbl %>% summarise(sum = sum(county_code)) %>% collect() ## this works

    ## this does not:
    df_tbl %>%
      spark_apply(function(e) data.frame(sprintf("%03d", as.numeric(e$county_code), e), names = c('county_code_fips', colnames(e))))

The last line returns the following error:

    Error in file(con, "r") : cannot open the connection
    In addition: Warning message:
    In file(con, "r") :
      cannot open file 'C:\Users\janni\AppData\Local\Temp\RtmpELRVxu\file4ab817055ccc_spark.log': Permission denied

This happens on both my laptop and my desktop. I tried running RStudio as an administrator, but that did not change anything.

1 Answer

The problem seems to come from the way names is specified for spark_apply.

One option is to drop the names argument and name the new column directly inside data.frame.

    df_tbl %>%
      spark_apply(function(e) data.frame(county_code_fips = sprintf("%03d", as.numeric(e$county_code)), e))
    ## Source: spark<?> [?? x 3]
    #  county_code_fips    id county_code
    #  <chr>            <dbl>       <dbl>
    #1 001                  1           1
    #2 020                  1          20
    #3 321                  2         321
    #4 002                  2           2

If you do want the names argument, note that it is evaluated outside the function passed to spark_apply, so it has no access to e. You have to take the column names from the tbl instead.

    df_tbl %>%
      spark_apply(function(e) data.frame(sprintf("%03d", as.numeric(e$county_code)), e),
                  names = c('county_code_fips', colnames(df_tbl)))
    ## Source: spark<?> [?? x 3]
    #  county_code_fips    id county_code
    #  <chr>            <dbl>       <dbl>
    #1 001                  1           1
    #2 020                  1          20
    #3 321                  2         321
    #4 002                  2           2
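
The scoping point can be reproduced in plain R, without Spark. The sketch below is only an illustration: apply_with_names is a hypothetical stand-in written for this example, not sparklyr's implementation, and it assumes nothing about how spark_apply handles names internally. It shows that an argument supplied next to an anonymous function is evaluated in the caller's environment, where e does not exist.

```r
# Hypothetical stand-in for spark_apply(): applies f to a data frame,
# then optionally renames the columns of the result.
apply_with_names <- function(x, f, names = NULL) {
  out <- f(x)
  if (!is.null(names)) colnames(out) <- names
  out
}

df <- data.frame(id = c(1, 1, 2, 2), county_code = c(1, 20, 321, 2))

# `e` exists only inside the anonymous function. Referring to it in `names`
# raises an error, because `names` is evaluated where the call is written.
res_bad <- tryCatch(
  apply_with_names(
    df,
    function(e) data.frame(sprintf("%03d", as.numeric(e$county_code)), e),
    names = c("county_code_fips", colnames(e))
  ),
  error = function(err) conditionMessage(err)
)
print(res_bad)  # an error message, not a data frame

# Taking the column names from an object visible to the caller works.
res_ok <- apply_with_names(
  df,
  function(e) data.frame(sprintf("%03d", as.numeric(e$county_code)), e),
  names = c("county_code_fips", colnames(df))
)
print(res_ok)
```

This assumes no object called e is defined in the calling environment. If one were, the first call would silently pick up its column names instead of failing.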

1 Comment

Thanks! I initially tried to understand what output you wanted, and then saw what was wrong with your code. There is no names argument for data.frame. There is one for spark_apply; however, the scope of e does not extend to it, so you have to use colnames(df_tbl). Hope that helps.
