
I have a large (450MB / 250 million rows) flat file of 1s and 0s that looks like this...

1
0
0
1
0
1
0
etc...

I am using the following method to read it into R...

dat <- as.numeric(readLines("my_large_file")) 

I am getting the desired data structure but it takes a long time. Any suggestions for a quicker method to achieve the same result?

NB. The order of the 1s and 0s must be preserved. I would consider options in either Python or the unix command line, but the final data structure is required in R for plotting a graph.

1 Comment

fread in data.table is pretty good at reading large files relatively fast. – baptiste (Jul 21, 2014)

2 Answers


You might do better with scan for numeric files where you just want a vector returned.

scan("my_large_file", what = integer()) 

The what argument will speed up the reading of your file even more (as opposed to leaving it out), since you are effectively telling R that it will be reading integer values. scan also has many other arguments that come in handy with large numeric files (e.g. skip, nlines, etc.)
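If the whole file is too big to hold comfortably as raw text, nlines also lets you read it in pieces. A minimal sketch (not from the original answer; the chunk size is arbitrary) that uses an open connection so each call to scan continues where the previous one stopped:

con <- file("my_large_file", open = "r")  # file name as in the question
chunks <- list()
repeat {
  chunk <- scan(con, what = integer(), nlines = 5e6, quiet = TRUE)
  if (length(chunk) == 0L) break
  chunks[[length(chunks) + 1L]] <- chunk
}
close(con)
dat <- unlist(chunks, use.names = FALSE)  # one integer vector, order preserved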

In addition, as mentioned by @baptiste in the comments,

library(data.table)
fread("my_large_file")

blows both readLines and scan away (on my machine).
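One caveat worth adding (my note, not the original answer's): fread returns a data.table rather than an atomic vector, so to get the same structure the as.numeric(readLines(...)) approach produced, extract the single column. The header = FALSE below is an assumption that the file has no header row; fread's auto-detection usually handles this anyway.

library(data.table)
dat <- fread("my_large_file", header = FALSE)$V1  # V1 is the default column name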

NOTE: Probably a typo, but in your original post, I think readlines should be readLines.


2 Comments

If it's just 1s and 0s, specifying integer as the type will be faster, since integers require half the memory of doubles (see the quick check after these comments).
Thanks, @RichardScriven, fread completes the job in seconds.
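A quick check of the memory claim in the first comment (a sketch added here, not part of the original thread):

object.size(integer(1e6))  # ~4 MB: 4 bytes per integer element
object.size(numeric(1e6))  # ~8 MB: 8 bytes per double element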

Timings comparing a couple of options. First, some data.

set.seed(21)
x <- sample.int(2, 25e6, TRUE) - 1L
writeLines(as.character(x), "data")

Now, some benchmarks (each run from a new R session to avoid the file being cached).

> system.time(r <- as.numeric(readLines("data")))
   user  system elapsed
  5.235   0.447   5.681
> system.time(r <- scan("data",what=numeric()))
Read 25000000 items
   user  system elapsed
  4.199   0.286   4.483
> system.time(r <- scan("data",what=integer()))
Read 25000000 items
   user  system elapsed
  3.134   0.081   3.214
> require(data.table)
> system.time(r <- fread("data")$V1)
   user  system elapsed
  0.412   0.026   0.439

And verification:

> num <- as.numeric(readLines("data"))
> int <- as.integer(readLines("data"))
> sn <- scan("data",what=numeric())
Read 25000000 items
> si <- scan("data",what=integer())
Read 25000000 items
> dti <- fread("data")$V1
> identical(num,sn)
[1] TRUE
> identical(int,si)
[1] TRUE
> identical(int,dti)
[1] TRUE

5 Comments

Thanks, I just learned three new things from this answer.
@RichardScriven: I hope they were not only new, but useful. ;)
I find it depressing that a computer would take "a while" to generate a text file with a few million 0s and 1s.
@baptiste: write is a wrapper around cat, which doesn't seem to be optimized for this sort of thing. writeLines is faster (see the sketch after these comments).
something else doesn't seem to be optimised for this: I've been unable to use my computer for the past 10 minutes, it froze when I tried to produce a black-and-white raster image with that number of pixels. Bad idea, next time I'll use my camera and shoot at the sky for randomness.
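To illustrate the write vs writeLines point above, a sketch (added here; timings will vary by machine):

x <- sample.int(2, 25e6, TRUE) - 1L               # same data as the benchmark
system.time(write(x, "data", ncolumns = 1))       # write() funnels everything through cat()
system.time(writeLines(as.character(x), "data"))  # noticeably faster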
