
Say I have the table below

    library(jsonify)
    library(data.table)

    df <- data.table(col = c('{"geo":"USA","class":"A","score":"99"}',
                             '{"geo":"Hawaii","class":"B","score":"83"}'))
    df
                                           col
                                        <char>
    1:   {"geo":"USA","class":"A","score":"99"}
    2: {"geo":"Hawaii","class":"B","score":"83"}

and I want to extract certain fields into columns

    # fields to extract
    x <- c('geo', 'class')

    # extract
    df[, (x) := {
      y = lapply(col, \(j) from_json(j)) |> lapply(`[`, x)
      a = sapply(y, `[`, x[1])
      b = sapply(y, `[`, x[2])
      .(a, b)
    }]

The above works, but it has 2 problems:

  1. It is manual. If I had, say, 40 fields to extract, I would have to write out many lines of code.
  2. It is neither memory-efficient nor fast.

The real data set has millions of rows, and each row has about 50 fields. Currently it takes about 15 minutes to parse 2 fields for 1e5 rows. Furthermore, even after completion, my PC is sluggish: Ctrl + Alt + Del confirms a lot of memory is still taken up, and gc() does not release it.
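For reference, the per-row approach above can at least be written generically for any number of fields. This is only a sketch of that idea (it keeps the row-by-row parsing, so it does not address the speed or memory issue):

    # sketch: parse each row once, then pull every requested field by name
    # (assumes from_json returns a named vector/list per row, as above)
    x <- c("geo", "class", "score")
    parsed <- lapply(df$col, jsonify::from_json)
    df[, (x) := lapply(x, \(f) sapply(parsed, `[[`, f))]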

Below is code to create a larger sample data set

    # Function to generate sample data
    generate_sample_data <- \(n_rows = 10) {
      # Generate random data for each field
      geo_options   <- c("USA", "Hawaii", "Canada", "Mexico", "UK")
      class_options <- c("A", "B", "C", "D")

      # Create the data table
      df <- data.table(col = replicate(n_rows,
        paste0('{"geo":"', sample(geo_options, 1), '",',
               '"class":"', sample(class_options, 1), '",',
               '"score":"', sample(60:100, 1), '",',
               '"extra_field":"', sample(LETTERS, 1), '",',
               '"timestamp":"', as.character(Sys.time() + sample(0:100000, 1)), '"}')))
      return(df)
    }

    # Generate a data table with 1e6 rows
    df <- generate_sample_data(1e6)
    df[1:3]

1 Answer


Try using the jsonlite::fromJSON function:

    jsonlite::fromJSON(sprintf("[%s]", toString(df[[1]])))
          geo class score extra_field                   timestamp
    1  Canada     C    63           R   2025-01-21 02:56:42.58203
    2      UK     A    71           Z  2025-01-21 12:15:20.582483
    3  Hawaii     D    79           K  2025-01-21 14:43:41.582883
    4  Canada     C    68           M  2025-01-21 01:40:20.583218
    5  Mexico     B    92           X  2025-01-21 05:26:35.583423
    6  Hawaii     D    71           R  2025-01-21 05:09:51.583672
    7  Canada     D    95           F  2025-01-21 10:36:04.583794
    8      UK     B    84           O  2025-01-21 13:11:29.583986
    9  Mexico     A    92           W   2025-01-21 15:21:13.58412
    10     UK     A    75           Z  2025-01-21 12:47:03.584297
    11 Hawaii     B    70           T  2025-01-20 13:31:34.584423
    12     UK     A    88           Q  2025-01-21 09:46:32.584641
    13 Mexico     B    64           K  2025-01-21 11:03:20.584838
    14 Hawaii     D    63           J  2025-01-21 13:53:28.585002
    15     UK     D    83           B  2025-01-21 04:03:55.585202
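If only some of the parsed fields are needed afterwards, one option (a sketch, not something fromJSON supports directly) is to parse everything and then attach just the wanted columns back onto the data.table; here x is assumed to be the vector of wanted field names:

    # parse everything, then keep only the fields of interest
    x <- c("geo", "class")
    parsed <- jsonlite::fromJSON(sprintf("[%s]", toString(df[[1]])))
    df[, (x) := parsed[, x]]   # attach the wanted columns to the original table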

5 Comments

A far cry (in run time) from what I had. Impressive, thank you. Is there a way to parse only specified columns (to further improve run time and reduce memory use)? Currently, I am selecting columns AFTER parsing.
@SweepyDodo the implementation of fromJSON does not support that. Though it takes about 10 seconds to parse the data you posted, so it's quite efficient.
Yes, I looked at ?fromJSON and did not see such an argument. The only other possibility I can think of is using regex to extract the relevant parts first, but that would defeat the whole point (1. it could be slow; 2. if going down this route, why not use regex for the entire task). Anyhow, good to confirm. Thank you.
You could squeeze out a little speed by using read.dcf after translating the string into the relevant format, but it's not worth it. The difference seems to be only 3 seconds, and that's because read.dcf with given fields returns a matrix rather than a data frame, meaning everything is of type character if one of them is a character. Doing the type conversion and transforming to a data frame makes the 3 seconds not worth it.
If need be, I will use parallel processing. Thank you, Onyambu
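For completeness, a minimal sketch of the read.dcf idea mentioned in the comments, assuming the values contain no embedded quotes, commas, or newlines (which is part of why it is fragile and probably not worth it):

    # rewrite each JSON object as a DCF record ("key: value" lines,
    # records separated by a blank line), then read only selected fields
    dcf <- vapply(df[[1]], \(s) {
      s <- gsub('^\\{"|"\\}$', '', s)          # drop the outer {" and "}
      s <- gsub('","', '\n', s, fixed = TRUE)  # one field per line
      gsub('":"', ': ', s, fixed = TRUE)       # turn "key":"value" into key: value
    }, character(1), USE.NAMES = FALSE)
    m <- read.dcf(textConnection(paste(dcf, collapse = "\n\n")),
                  fields = c("geo", "class"))

As noted in the comment above, the result is a character matrix, so type conversion is still needed afterwards.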
