
I know how to use pandas to read CSV files. When reading a large file I get an out-of-memory error. The file has 3.8 million rows and 6.4 million columns; it mostly contains genome data for large populations.

How can I overcome this problem, what is standard practice, and how do I select the appropriate tool for this? Can I process a file this big with pandas, or is there another tool?

  • Do you need to read the whole file? You can pass the chunksize parameter to read_csv and process the file in chunks; see the sketch after these comments. Commented Nov 13, 2015 at 9:56
  • Maybe this question helps. Commented Nov 13, 2015 at 10:07
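
A minimal sketch of that chunked approach, assuming a comma-separated file named genomes.csv (hypothetical) and a hypothetical process() function that does the per-chunk work:

    import pandas as pd

    # Read the CSV in pieces instead of loading everything at once;
    # chunksize is the number of rows per piece, tune it to your memory budget.
    for chunk in pd.read_csv("genomes.csv", chunksize=10000):
        # Each chunk is an ordinary DataFrame with up to 10000 rows.
        process(chunk)  # hypothetical per-chunk processing function

With 6.4 million columns, even a modest number of rows is large, so it usually makes sense to combine this with read_csv's usecols parameter to load only the columns you actually need.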

2 Answers


You can use Apache Spark to distribute the in-memory processing of CSV files across a cluster (see https://github.com/databricks/spark-csv). Also take a look at ADAM's approach to distributed genomic data processing.
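
As a rough illustration (not from the answer itself): the linked spark-csv package has since been folded into Spark 2.x+, where the CSV reader is built in, so a minimal PySpark sketch could look like the following. The file path and column name are assumptions.

    from pyspark.sql import SparkSession

    # Start a local Spark session; on a cluster this comes from spark-submit.
    spark = SparkSession.builder.appName("genome-csv").getOrCreate()

    # Spark reads the CSV in partitions across executors, so the whole
    # file never has to fit into a single machine's memory.
    df = spark.read.csv("genomes.csv", header=True)  # hypothetical path

    df.select("sample_id").show(5)  # "sample_id" is a hypothetical column

    spark.stop()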


You can use Python's built-in csv module to stream the file row by row:

    import csv

    with open(filename, "r") as csvfile:
        datareader = csv.reader(csvfile)
        for row in datareader:
            # process each row here; only one row is held in memory
            # at a time, instead of thousands of lines
            pass
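
This keeps memory use roughly constant in the number of rows, since only one row is held at a time. With 6.4 million columns, though, even a single row is a long list of strings, so inside the loop it helps to pick out only the fields you actually need.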

