
I know how to use pandas to read CSV files. When reading a large file I get an out-of-memory error. The file has 3.8 million rows and 6.4 million columns; it mostly contains genome data for large populations.

How can I overcome this problem, what is standard practice, and how do I select the appropriate tool for this? Can I process a file this big with pandas, or is there another tool?

  • Do you need to read the whole file? You can pass the chunksize parameter to read_csv and process the file in chunks; see the sketch after these comments. Commented Nov 13, 2015 at 9:56
  • Maybe this question helps. Commented Nov 13, 2015 at 10:07
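
A minimal sketch of that chunked approach, assuming a comma-separated file named genomes.csv (hypothetical) and a hypothetical process() function that does the per-chunk work:

    import pandas as pd

    # Read the CSV in pieces instead of loading everything at once;
    # chunksize is the number of rows per piece, tune it to your memory budget.
    for chunk in pd.read_csv("genomes.csv", chunksize=10000):
        # Each chunk is an ordinary DataFrame with up to 10000 rows.
        process(chunk)  # hypothetical per-chunk processing function

With 6.4 million columns, even a modest number of rows is large, so it usually makes sense to combine this with read_csv's usecols parameter to load only the columns you actually need.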

2 Answers


You can use Apache Spark to distribute the in-memory processing of CSV files across a cluster (see https://github.com/databricks/spark-csv). Also take a look at ADAM's approach to distributed genomic data processing.
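
As a rough illustration (not from the answer itself): the linked spark-csv package has since been folded into Spark 2.x+, where the CSV reader is built in, so a minimal PySpark sketch could look like the following. The file path and column name are assumptions.

    from pyspark.sql import SparkSession

    # Start a local Spark session; on a cluster this comes from spark-submit.
    spark = SparkSession.builder.appName("genome-csv").getOrCreate()

    # Spark reads the CSV in partitions across executors, so the whole
    # file never has to fit into a single machine's memory.
    df = spark.read.csv("genomes.csv", header=True)  # hypothetical path

    df.select("sample_id").show(5)  # "sample_id" is a hypothetical column

    spark.stop()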


You can use Python's built-in csv module to stream the file row by row:

    import csv

    with open(filename, "r") as csvfile:
        datareader = csv.reader(csvfile)
        for row in datareader:
            # process each row here; only one row is held in memory
            # at a time, instead of thousands of lines
            pass
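
This keeps memory use roughly constant in the number of rows, since only one row is held at a time. With 6.4 million columns, though, even a single row is a long list of strings, so inside the loop it helps to pick out only the fields you actually need.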

