Comparing speed of fread vs. read.table for reading the first 1M rows out of 100M

Question

I have a 14 GB data.txt file. I compared the speed of fread and read.table by reading only the first 1M rows. fread appears to be much slower, which is not what I expected, and there is a noticeable wait before the percentage counter shows up.

What could be the reason? I thought fread was supposed to be very fast. I am on Windows.

Define "much slower" - if it's measured in microseconds then I wouldn't be losing sleep. Also, without example code no one can verify what you're doing.
– thelatemail, Commented Aug 28, 2015 at 5:04
@thelatemail: My data table has 100M rows and 60 columns, which comes to 14 GB. When I read the first 1M rows with fread, it takes 1.5-2 minutes (there is a wait before the percentage count shows), whereas read.table takes less than a minute. Irrespective of this comparison, I have heard from others that fread reads their 4 GB table in 40 seconds. Something is wrong that I can't figure out.
– KTY, Commented Aug 28, 2015 at 5:13
This is the code I use:
data <- read.table('data.txt', sep = ',', nrows = 1000000, header = TRUE, stringsAsFactors = FALSE)
data <- fread('data.txt', sep = ',', nrows = 1000000)
– KTY, Commented Aug 28, 2015 at 5:14

Community · Accepted Answer · 2017-05-23 12:33:15Z

9

fread memory-maps (mmaps) the file. This takes some time, and it maps the whole file, not just the rows you ask for. It also means subsequent "read-ins" will be faster.

read.table does not mmap the whole file. It can read in the file line by line [and stop at line 1000000].
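To illustrate the "read line by line and stop" idea, here is a minimal base R sketch. It is not how read.table is implemented internally; it only shows that a connection-based reader can stop once it has the rows it needs. The file, its contents, and the row counts are made up for the example.

```r
# Hypothetical small stand-in for a large delimited file
tmp <- tempfile(fileext = ".txt")
writeLines(c("id,value", paste(1:1000, (1:1000) * 2, sep = ",")), tmp)

con <- file(tmp, open = "r")
header     <- readLines(con, n = 1)    # header line
first_rows <- readLines(con, n = 100)  # stop after 100 data lines
close(con)                             # the remaining 900 lines are never parsed

# Parse only the lines that were actually read
df <- read.csv(text = c(header, first_rows))
```

The cost here depends on the number of lines requested, not on the total size of the file.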

You can see some background on mmap in the question "mmap() vs. reading blocks".

The examples in the help for fread highlight this behaviour.
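A rough way to check this on your own machine is to time read.table once and fread twice on the same file. The sketch below assumes the data.table package is installed and uses a small temporary file as a stand-in for the 14 GB one, so the timings will not resemble those in the question; the file contents and sizes are hypothetical.

```r
library(data.table)  # assumption: data.table is installed

# Hypothetical stand-in data, written with base R
set.seed(1)
n   <- 200000L
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = seq_len(n), x = rnorm(n)), tmp, row.names = FALSE)

n_first <- 10000L  # number of rows to read from the top of the file

t_base   <- system.time(a  <- read.table(tmp, sep = ",", header = TRUE,
                                         nrows = n_first,
                                         stringsAsFactors = FALSE))
t_fread1 <- system.time(b  <- fread(tmp, sep = ",", nrows = n_first))
t_fread2 <- system.time(b2 <- fread(tmp, sep = ",", nrows = n_first))

# Elapsed seconds for each call; values vary by machine, OS and file cache
print(c(read.table  = t_base[["elapsed"]],
        fread_first = t_fread1[["elapsed"]],
        fread_again = t_fread2[["elapsed"]]))

unlink(tmp)
```

On a file this small the differences may be negligible; the point is only the shape of the comparison (first read versus repeated read, partial read versus full read).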

edited May 23, 2017 at 12:33

answered Aug 28, 2015 at 5:17

mnel

4 Comments

KTY Over a year ago

So if I will read the file only once, can we say that using fread won't give much of an advantage?

mnel Over a year ago

@KTY, if you are only trying to read in the first million lines, and only once, then you may have found a case where fread won't give an advantage. If you want to read the whole file, or read the rest of the lines in subsequently, then fread should almost certainly be faster.

KTY Over a year ago

Yes, it seems the main difference shows up when reading big files. Now that I am reading the whole 14 GB file, fread is very fast compared to read.table. Thanks for the information on mmap.

Matt Dowle Over a year ago

@KTY We could speed up reading the first N rows. Just wasn't a priority as normally you want to read the whole file. I filed a feature request #1300.
