
I have a large (450MB / 250 million rows) flat file of 1s and 0s that looks like this...

1
0
0
1
0
1
0
etc...

I am using the following method to read it into R...

dat <- as.numeric(readLines("my_large_file")) 

I am getting the desired data structure but it takes a long time. Any suggestions for a quicker method to achieve the same result?

NB. The order of the 1s and 0s must be preserved. I would consider options in either Python or the unix command line, but the final data structure is required in R for plotting a graph.

1 Comment

fread in data.table is pretty good at reading large files relatively fast. – baptiste (Jul 21, 2014)

2 Answers


You might do better with scan for numeric files where you just want a vector returned.

scan("my_large_file", what = integer()) 

The what argument will speed up the reading of your file even more (as opposed to leaving it out), since you are effectively telling R that it will be reading integer values. scan also has many other arguments that come in handy with large numeric files (e.g. skip, nlines, etc.)
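If the whole file is too big to hold comfortably as raw text, nlines also lets you read it in pieces. A minimal sketch (not from the original answer; the chunk size is arbitrary) that uses an open connection so each call to scan continues where the previous one stopped:

con <- file("my_large_file", open = "r")  # file name as in the question
chunks <- list()
repeat {
  chunk <- scan(con, what = integer(), nlines = 5e6, quiet = TRUE)
  if (length(chunk) == 0L) break
  chunks[[length(chunks) + 1L]] <- chunk
}
close(con)
dat <- unlist(chunks, use.names = FALSE)  # one integer vector, order preserved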

In addition, as mentioned by @baptiste in the comments,

library(data.table)
fread("my_large_file")

blows both readLines and scan away (on my machine).
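One caveat worth adding (my note, not the original answer's): fread returns a data.table rather than an atomic vector, so to get the same structure the as.numeric(readLines(...)) approach produced, extract the single column. The header = FALSE below is an assumption that the file has no header row; fread's auto-detection usually handles this anyway.

library(data.table)
dat <- fread("my_large_file", header = FALSE)$V1  # V1 is the default column name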

NOTE: Probably a typo, but in your original post, I think readlines should be readLines.


2 Comments

If it's just 1s and 0s, specifying integer as the type will be faster, since integers require half the memory of doubles (see the quick check after these comments).
Thanks, @RichardScriven, fread completes the job in seconds.
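A quick check of the memory claim in the first comment (a sketch added here, not part of the original thread):

object.size(integer(1e6))  # ~4 MB: 4 bytes per integer element
object.size(numeric(1e6))  # ~8 MB: 8 bytes per double element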

Timings comparing a couple of options. First, some data.

set.seed(21)
x <- sample.int(2, 25e6, TRUE) - 1L
writeLines(as.character(x), "data")

Now, some benchmarks (each run from a new R session to avoid the file being cached).

> system.time(r <- as.numeric(readLines("data")))
   user  system elapsed
  5.235   0.447   5.681
> system.time(r <- scan("data",what=numeric()))
Read 25000000 items
   user  system elapsed
  4.199   0.286   4.483
> system.time(r <- scan("data",what=integer()))
Read 25000000 items
   user  system elapsed
  3.134   0.081   3.214
> require(data.table)
> system.time(r <- fread("data")$V1)
   user  system elapsed
  0.412   0.026   0.439

And verification:

> num <- as.numeric(readLines("data"))
> int <- as.integer(readLines("data"))
> sn <- scan("data",what=numeric())
Read 25000000 items
> si <- scan("data",what=integer())
Read 25000000 items
> dti <- fread("data")$V1
> identical(num,sn)
[1] TRUE
> identical(int,si)
[1] TRUE
> identical(int,dti)
[1] TRUE

5 Comments

Thanks, I just learned three new things from this answer.
@RichardScriven: I hope they were not only new, but useful. ;)
I find it depressing that a computer would take "a while" to generate a text file with a few million 0s and 1s.
@baptiste: write is a wrapper around cat, which doesn't seem to be optimized for this sort of thing. writeLines is faster (see the sketch after these comments).
something else doesn't seem to be optimised for this: I've been unable to use my computer for the past 10 minutes, it froze when I tried to produce a black-and-white raster image with that number of pixels. Bad idea, next time I'll use my camera and shoot at the sky for randomness.
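To illustrate the write vs writeLines point above, a sketch (added here; timings will vary by machine):

x <- sample.int(2, 25e6, TRUE) - 1L               # same data as the benchmark
system.time(write(x, "data", ncolumns = 1))       # write() funnels everything through cat()
system.time(writeLines(as.character(x), "data"))  # noticeably faster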
