I am trying to read a single column of a CSV file into R as quickly as possible. I am hoping to beat the standard methods by a factor of 10 in the time it takes to get the column into RAM.
What is my motivation? I have two files: one called Main.csv, which has 300000 rows and 500 columns, and one called Second.csv, which has 300000 rows and 5 columns. If I system.time() the command read.csv("Second.csv"), it takes 2.2 seconds. Now if I use either of the two methods below to read the first column of Main.csv (which is 20% the size of Second.csv, since it is 1 column instead of 5), it takes over 40 seconds. That is the same amount of time it takes to read the whole 600 megabyte file -- clearly unacceptable.
Method 1
colClasses <- rep('NULL', 500)
colClasses[1] <- NA
system.time(read.csv("Main.csv", colClasses = colClasses))  # 40+ seconds, unacceptable

Method 2
read.table(pipe("cut -f1 Main.csv"))  # 40+ seconds, unacceptable
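As an aside, Method 2 has a delimiter pitfall that is easy to miss: cut splits on tabs by default, so on a comma-separated file -f1 returns each entire line unchanged. A minimal shell sketch (demo.csv is a tiny hypothetical stand-in for Main.csv):

```shell
# demo.csv: a tiny hypothetical stand-in for Main.csv
printf 'id,x,y\n1,2,3\n4,5,6\n' > demo.csv

# cut's default field delimiter is TAB, so -f1 on a CSV
# returns each whole line -- nothing is actually cut:
cut -f1 demo.csv

# -d, sets the delimiter to a comma; now only column 1 comes back:
cut -f1 -d, demo.csv
```

So the pipe would need cut -f1 -d, to extract only the first column, as one of the comments below also notes.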
How can I reduce this time? I am hoping for an R solution.
Comments

- (fragment, original link lost) ...io with all of: Python, Java, R.
- require(data.table); fread("path/to/file/Main.csv") will give you an instant speed improvement.
- You can save the data.frame in a special binary format that can be used later for reading only a few variables at a time. Basically it's a wrapper around save/readRDS, writing the columns to separate files, etc. More details: stackoverflow.com/questions/4756989/…
- scan(pipe("cut -f1 -d, Main.csv")) might be worth a try.
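Building on the cut suggestion in the comments, one hedged variant (not from the original post) is to pre-extract the column outside R with awk, which makes the comma delimiter explicit, then read the resulting one-column file. This assumes the CSV has no quoted fields containing embedded commas; filenames are illustrative:

```shell
# Main_demo.csv: tiny hypothetical stand-in for the 600 MB Main.csv
printf 'id,x,y\n1,2,3\n4,5,6\n' > Main_demo.csv

# Write just the first field of every row to a small file;
# -F, makes awk split on commas (safe only without quoted commas).
awk -F, '{print $1}' Main_demo.csv > col1.csv

# col1.csv now holds one value per line: the header, then the data.
cat col1.csv
```

The small file can then be read in R with scan("col1.csv", what = "", skip = 1) or read.csv("col1.csv"). Alternatively, data.table::fread("Main.csv", select = 1) reads a single column directly, which is likely the simplest pure-R route.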