
Say I have the table below

    library(jsonify)
    library(data.table)

    df <- data.table(col = c('{"geo":"USA","class":"A","score":"99"}',
                             '{"geo":"Hawaii","class":"B","score":"83"}'))
    df
                                           col
                                        <char>
    1:   {"geo":"USA","class":"A","score":"99"}
    2: {"geo":"Hawaii","class":"B","score":"83"}

and I want to extract certain fields into columns

    # fields to extract
    x <- c('geo', 'class')

    # extract
    df[, (x) := {
      y = lapply(col, \(j) from_json(j)) |> lapply(`[`, x)
      a = sapply(y, `[`, x[1])
      b = sapply(y, `[`, x[2])
      .(a, b)
    }]

The above works, but it has 2 problems:

  1. It is manual. If I had, say, 40 fields to extract, I would have to write out many lines of code.
  2. It is neither memory-efficient nor fast.

The real data set has millions of rows, and each row has about 50 fields. Currently it takes about 15 minutes to parse 2 fields for 1e5 rows. Furthermore, even after completion, my PC is sluggish: Ctrl + Alt + Del confirms a lot of memory is still taken up, and gc() does not release it.
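For reference, the per-row approach above can at least be written generically for any number of fields. This is only a sketch of that idea (it keeps the row-by-row parsing, so it does not address the speed or memory issue):

    # sketch: parse each row once, then pull every requested field by name
    # (assumes from_json returns a named vector/list per row, as above)
    x <- c("geo", "class", "score")
    parsed <- lapply(df$col, jsonify::from_json)
    df[, (x) := lapply(x, \(f) sapply(parsed, `[[`, f))]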

Below is code to create a larger sample data set

    # Function to generate sample data
    generate_sample_data <- \(n_rows = 10) {
      # Generate random data for each field
      geo_options   <- c("USA", "Hawaii", "Canada", "Mexico", "UK")
      class_options <- c("A", "B", "C", "D")

      # Create the data table
      df <- data.table(col = replicate(n_rows,
        paste0('{"geo":"', sample(geo_options, 1), '",',
               '"class":"', sample(class_options, 1), '",',
               '"score":"', sample(60:100, 1), '",',
               '"extra_field":"', sample(LETTERS, 1), '",',
               '"timestamp":"', as.character(Sys.time() + sample(0:100000, 1)), '"}')))
      return(df)
    }

    # Generate a data table with 1e6 rows
    df <- generate_sample_data(1e6)
    df[1:3]

1 Answer


Try using the jsonlite::fromJSON function:

    jsonlite::fromJSON(sprintf("[%s]", toString(df[[1]])))
          geo class score extra_field                   timestamp
    1  Canada     C    63           R   2025-01-21 02:56:42.58203
    2      UK     A    71           Z  2025-01-21 12:15:20.582483
    3  Hawaii     D    79           K  2025-01-21 14:43:41.582883
    4  Canada     C    68           M  2025-01-21 01:40:20.583218
    5  Mexico     B    92           X  2025-01-21 05:26:35.583423
    6  Hawaii     D    71           R  2025-01-21 05:09:51.583672
    7  Canada     D    95           F  2025-01-21 10:36:04.583794
    8      UK     B    84           O  2025-01-21 13:11:29.583986
    9  Mexico     A    92           W   2025-01-21 15:21:13.58412
    10     UK     A    75           Z  2025-01-21 12:47:03.584297
    11 Hawaii     B    70           T  2025-01-20 13:31:34.584423
    12     UK     A    88           Q  2025-01-21 09:46:32.584641
    13 Mexico     B    64           K  2025-01-21 11:03:20.584838
    14 Hawaii     D    63           J  2025-01-21 13:53:28.585002
    15     UK     D    83           B  2025-01-21 04:03:55.585202
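If only some of the parsed fields are needed afterwards, one option (a sketch, not something fromJSON supports directly) is to parse everything and then attach just the wanted columns back onto the data.table; here x is assumed to be the vector of wanted field names:

    # parse everything, then keep only the fields of interest
    x <- c("geo", "class")
    parsed <- jsonlite::fromJSON(sprintf("[%s]", toString(df[[1]])))
    df[, (x) := parsed[, x]]   # attach the wanted columns to the original table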

5 Comments

A far cry (in run time) from what I had. Impressive, thank you. Is there a way to parse only specified columns (to further improve run time and reduce memory use)? Currently, I am selecting columns AFTER parsing.
@SweepyDodo the implementation of fromJSON does not support that. Though it takes about 10 seconds to parse the data you posted, so it's quite efficient.
Yes, I looked at ?fromJSON and did not see such an argument. The only other possibility I can think of is using regex to extract the relevant parts first, but that would defeat the whole point (1. it could be slow; 2. if going down this route, why not use regex for the entire task). Anyhow, good to confirm. Thank you.
You could squeeze out a little speed by using read.dcf after translating the string into the relevant format, but it's not worth it. The difference seems to be only 3 seconds, and that's because read.dcf with given fields returns a matrix rather than a data frame, meaning everything is of type character if one of them is a character. Doing the type conversion and transforming to a data frame makes the 3 seconds not worth it.
If need be, I will use parallel processing. Thank you, Onyambu
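For completeness, a minimal sketch of the read.dcf idea mentioned in the comments, assuming the values contain no embedded quotes, commas, or newlines (which is part of why it is fragile and probably not worth it):

    # rewrite each JSON object as a DCF record ("key: value" lines,
    # records separated by a blank line), then read only selected fields
    dcf <- vapply(df[[1]], \(s) {
      s <- gsub('^\\{"|"\\}$', '', s)          # drop the outer {" and "}
      s <- gsub('","', '\n', s, fixed = TRUE)  # one field per line
      gsub('":"', ': ', s, fixed = TRUE)       # turn "key":"value" into key: value
    }, character(1), USE.NAMES = FALSE)
    m <- read.dcf(textConnection(paste(dcf, collapse = "\n\n")),
                  fields = c("geo", "class"))

As noted in the comment above, the result is a character matrix, so type conversion is still needed afterwards.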
