I have the following line of code:
train <- read.csv("avito_train.tsv", sep='\t', stringsAsFactors = F)

The training file is around 3 GB, and it takes a really long time to load all of that data.
My question is, would a proper data scientist load all of the data or only use a subset? I notice I could use the nrows parameter to specify a maximum number of rows to read.
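For example, I assume something like this would read only a subset (the 100,000-row cutoff is just a placeholder):

# Read only the first 100,000 rows as a quick subset for prototyping
train_sample <- read.csv("avito_train.tsv", sep = '\t',
                         stringsAsFactors = FALSE, nrows = 100000)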
I also believe that loading all of this data into a corpus (as I have to do) will probably be very time-consuming. Is there a general consensus on the recommended strategy for writing machine learning programs with large training and testing data?
Instead of the read.csv() method, have a look at fread() from the package data.table. It is much faster than read.csv(). Also, you can try to keep as much of your data as possible in a binary format that can be loaded into R faster, using the save() and load() functionality within R. That should be helpful, too.
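A minimal sketch of both suggestions, assuming the same avito_train.tsv file (object and file names are just placeholders):

library(data.table)

# fread() auto-detects the tab separator and is much faster than read.csv();
# it returns a data.table, which can also be used like a data.frame
train <- fread("avito_train.tsv", stringsAsFactors = FALSE)

# Cache the parsed data in R's binary format so later sessions skip the slow parse
save(train, file = "avito_train.RData")

# In a later session, reload the binary file instead of re-reading the TSV
load("avito_train.RData")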