Say I have the below table
```r
library(jsonify)
library(data.table)

df <- data.table(col = c('{"geo":"USA","class":"A","score":"99"}'
                         , '{"geo":"Hawaii","class":"B","score":"83"}'
                         )
                 ); df

                                          col
                                       <char>
1:    {"geo":"USA","class":"A","score":"99"}
2: {"geo":"Hawaii","class":"B","score":"83"}
```

and I wanted to extract certain fields into columns:
```r
# fields to extract
x <- c('geo', 'class')

# extract
df[, (x) := {
       y = lapply(col, \(j) from_json(j)) |> lapply(`[`, x)
       a = sapply(y, `[`, x[1])
       b = sapply(y, `[`, x[2])
       .(a, b)
     }
   ]
```

The above works, but there are two problems:
- Manual: if I had, say, 40 fields to extract, I would have to write out many more lines of code
- Neither memory-efficient nor fast
The real data set has millions of rows, and each row has about 50 fields. Currently it takes about 15 minutes to parse 2 fields for 1e5 rows. Furthermore, even after completion, my PC is lethargic: Ctrl + Alt + Del confirms a lot of memory is still taken up, and gc() does not solve it either.
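For reference, the kind of generic approach I was hoping to scale up to is sketched below: parse each row once, bind the parsed results into one wide table, then copy only the requested fields. This is only a sketch, assuming every row parses to a flat named list/vector, and it does not by itself address the speed or memory problem:

```r
# Sketch only: one parse per row, then keep the wanted fields
# (assumes jsonify::from_json returns a flat named vector/list for each row)
x <- c('geo', 'class')

parsed <- lapply(df$col, from_json)                        # parse each JSON string once
wide   <- rbindlist(lapply(parsed, as.list), fill = TRUE)  # one column per JSON field
df[, (x) := wide[, ..x]]                                   # copy only the requested fields
```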
Below is code to create a larger sample data set:
```r
# Function to generate sample data
generate_sample_data <- \(n_rows = 10) {

  # Generate random data for each field
  geo_options   <- c("USA", "Hawaii", "Canada", "Mexico", "UK")
  class_options <- c("A", "B", "C", "D")

  # Create the data table of JSON strings
  df <- data.table(col = replicate(n_rows
                                   , paste0('{"geo":"', sample(geo_options, 1), '",'
                                            , '"class":"', sample(class_options, 1), '",'
                                            , '"score":"', sample(60:100, 1), '",'
                                            , '"extra_field":"', sample(LETTERS, 1), '",'
                                            , '"timestamp":"', as.character(Sys.time() + sample(0:100000, 1)), '"}'
                                            )
                                   )
                   )
  return(df)
}

# Generate a data table with 1e6 rows
df <- generate_sample_data(1e6); df[1:3]
```
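For completeness, the timing can be checked with base R's system.time() on a 1e5-row sample of the generated data, reusing the extraction code from above (nothing new is assumed beyond the objects already defined):

```r
# Rough benchmark: extract 2 fields from 1e5 rows with the original approach
x     <- c('geo', 'class')
small <- generate_sample_data(1e5)

system.time(
  small[, (x) := {
           y = lapply(col, \(j) from_json(j)) |> lapply(`[`, x)
           a = sapply(y, `[`, x[1])
           b = sapply(y, `[`, x[2])
           .(a, b)
         }
       ]
)
```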