7

I'm looking for a way of reading only a few columns from a csv file into R using shell() or pipe. I found this thread that explains how to accomplish that on Linux: Quicker way to read single column of CSV file

On Linux this works adding the what argument:

a <-as.data.frame(scan(pipe("cut -f1,2 -d, Main.csv"), what=list("character","character"),sep= ",")) 

However this doesn't seem to work on Windows. When using pipe("cut -f1 -d, Main.csv") the connection gets opened but it doesn't return anything.

What would be the functions/syntax I need to use in order to make this work on Windows.

Is it possible to accomplish this by using shell()?

2 Answers 2

7

Make sure that cut is on your path - its in Rtools. This works for me:

# check that cut is availble Sys.which("cut") # create test data Lines <- "a,b,c 1,2,3 4,5,6" cat(Lines, file = "in.csv") # read it DF <- read.csv(pipe("cut -f1,2 -d, in.csv")) 

Added

Rtools is now Rtools40 and cut is at C:\Rtools40\usr\bin\cut.exe .

Sign up to request clarification or add additional context in comments.

5 Comments

Thank you G. I think I don't have installed RTools in my windows because it uses Cygwin and I had troubles in the past with Ruby's active record library. I'll have to double-check that when I get back to my Windows machine, but if that works for you probably that's the problem. Thanks again.
You can install Rtools without putting it on your path. In that case its unlikely there would be any conflicts. Once installed that way try: DF <- read.csv(pipe("\\Rtools\\bin\\cut -f1,2 -d, in.csv"))
Nice G. It works well. Below you can see the results of my tests. For a reason I don't understand yet, the number of rows is not accurate with most of the implementations. Only fread() was accurate and faster than the default read.csv(). Do you know the reason, or how I can improve that behaviour?
Try cutting down the example data until you get it to produce a not-intended number of rows with only a few rows of input.
Thank you G. Not sure if I'm going to explore this, but certainly if I do it I'll post back. Thanks again
3
> system.time(a <- read.csv("in.csv")) user system elapsed 1.24 0.04 1.26 > dim(a) [1] 4706 46 > system.time(b <-read.csv(pipe("C:/Rtools/bin/cut -f1,2 -d, in.csv"))) user system elapsed 0.22 1.27 2.37 Warning message: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : EOF within quoted string > dim(b) [1] 2726 2 > system.time(d <-as.data.frame(scan(pipe("C:/Rtools/bin/cut -f1,2 -d, in.csv"), + what=list("character","character"),sep= ","))) Read 1715 records user system elapsed 0.31 1.19 2.47 Warning message: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : EOF within quoted string > dim(d) [1] 1715 2 > library(data.table) data.table 1.9.2 For help type: help("data.table") Warning message: closing unused connection 3 (C:\Windows\system32\cmd.exe /c C:/Rtools/bin/cut -f1,2 -d, in.csv) > system.time(e <-fread("C:/Rtools/bin/cut -f1,2 -d, in.csv")) user system elapsed 0.02 0.01 0.80 > dim(e) [1] 4706 2 

2 Comments

It is more clean to use library(rbenchmark). Then you can show all functions at a time. Example: benchmark(f1(x), f2(x), f3(x), f4(x), columns = c("test", "replications", "elapsed", "relative"), order = "relative", relative = "elapsed", replications = 10). (In this case substitue f1(x) by fread("C:/Rtools/bin/cut -f1,2 -d, in.csv")), and so on. Anyway, data.table wins. +1
Thank you Luis. Didn't know that library!

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.