I have the following line of code:
train <- read.csv("avito_train.tsv", sep='\t', stringsAsFactors = F)

The training file is around 3 GB, and it takes a really long time to load all of that data.
My question is, would a proper data scientist load all of the data or only use a subset? I notice I could use the nrows parameter to specify a maximum number of rows to read.
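For example, I assume something like this would read only a subset (the 100,000-row cutoff is just a placeholder):

# Read only the first 100,000 rows as a quick subset for prototyping
train_sample <- read.csv("avito_train.tsv", sep = '\t',
                         stringsAsFactors = FALSE, nrows = 100000)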
I also believe that loading all of this data into a corpus (as I have to do) will probably be very time-consuming. Is there a general consensus on the recommended strategy for writing machine learning programs with large training and testing data?
Instead of the read.csv() method, have a look at fread() from the package data.table. It is much faster than read.csv(). Also, you can try to keep as much of your data as possible in a binary format that can be loaded into R faster, using the save() and load() functionality within R. That should be helpful, too.
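A minimal sketch of both suggestions, assuming the same avito_train.tsv file (object and file names are just placeholders):

library(data.table)

# fread() auto-detects the tab separator and is much faster than read.csv();
# it returns a data.table, which can also be used like a data.frame
train <- fread("avito_train.tsv", stringsAsFactors = FALSE)

# Cache the parsed data in R's binary format so later sessions skip the slow parse
save(train, file = "avito_train.RData")

# In a later session, reload the binary file instead of re-reading the TSV
load("avito_train.RData")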