Pandas error while reading 14 GB csv file on 200 GB RAM workstation

Question

This is my code to split the data into one file per home ID, so that I can then analyze each home separately.

    import numpy as np
    import pandas as pd

    data = pd.read_csv("110homes.csv")

    # Write one CSV per home, named after its dataid
    for i in np.unique(data['dataid']):
        print(i)
        d1 = pd.DataFrame(data[data['dataid'] == i])
        k = str(i)
        d1.to_csv(k + ".csv")

However, I am getting the error below. The machine has 200 GB of RAM, yet it still raises a MemoryError:

    data = pd.read_csv("110homes.csv")
  File "/usr/lib/python2.7/site-packages/pandas/io/parsers.py", line 474, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib/python2.7/site-packages/pandas/io/parsers.py", line 260, in _read
    return parser.read()
  File "/usr/lib/python2.7/site-packages/pandas/io/parsers.py", line 721, in read
    ret = self._engine.read(nrows)
  File "/usr/lib/python2.7/site-packages/pandas/io/parsers.py", line 1170, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 769, in pandas.parser.TextReader.read (pandas/parser.c:7544)
  File "pandas/parser.pyx", line 819, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:8137)
  File "pandas/parser.pyx", line 1833, in pandas.parser._concatenate_chunks (pandas/parser.c:22383)
MemoryError

Nothing to do with your MemoryError, but have a look at the df.groupby method. It would make your code after the read more elegant.
– MaxNoe, Commented Feb 20, 2016 at 18:06
Are you using 32-bit Python? It has a 4 GB limit; see pandas-memory-error.
– RootTwo, Commented Feb 20, 2016 at 18:57
I ran import struct; print struct.calcsize("P") * 8 and it shows 64, so I guess that is not the problem.
– dsl1990, Commented Feb 20, 2016 at 19:43
Why is this question downvoted? If you can't answer the question, then at least don't downvote it. It is ridiculous for someone to downvote a question without giving any reason. It's a legitimate question.
– dsl1990, Commented Feb 20, 2016 at 20:13

SRobertJames · Accepted Answer · 2016-12-29 04:24:55Z

Data in RAM can take a lot more space than on disk. Without seeing your 110homes.csv file, it's impossible to know the details, but imagine that it consists of 10 floating-point numbers per line, like: 0.0,1.0,2.0,.... In the CSV, each number takes 3 bytes plus 1 byte for the delimiter. In Python, each takes 8 bytes (on a 64-bit machine) for the float, plus 2 bytes per Unicode character (another 8 bytes), plus 8 bytes for the string length, plus 8 bytes per pointer, plus further bytes per row, and so on.

Think about it like this: on a 64-bit machine, the minimum size for a pointer, a native int, or a native float is 8 bytes. You need several of those per field, and several more per row. There's nothing unusual about data taking 15x as much space in RAM as on disk.
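To make those sizes concrete, here is a minimal sketch that compares one CSV field on disk with a few in-memory representations of the same value. It uses only the standard library; the variable names are illustrative, and the object sizes it prints are specific to the interpreter it runs on. How much a real column costs depends on whether the values end up packed as raw 8-byte numbers or held as separate Python objects.

```python
import struct
import sys

# One value plus its delimiter, as it would appear in the CSV example above.
field_on_disk = "1.0,"
csv_bytes = len(field_on_disk.encode("ascii"))

packed_bytes = struct.calcsize("d")   # a raw 64-bit float, with no object overhead
pointer_bytes = struct.calcsize("P")  # one pointer on this interpreter
boxed_bytes = sys.getsizeof(1.0)      # a full Python float object
string_bytes = sys.getsizeof("1.0")   # the same field kept as a Python str

print("on disk:       %d bytes" % csv_bytes)
print("packed float:  %d bytes" % packed_bytes)
print("pointer:       %d bytes" % pointer_bytes)
print("Python float:  %d bytes" % boxed_bytes)
print("Python str:    %d bytes" % string_bytes)
```

A field held as a Python object costs at least its object size plus one pointer to it, which is where a multiple of the on-disk size comes from.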

Do a simple test: take the first 10% of the lines of your file, and monitor Python via top as it processes them. See how much RAM it uses. Does it use at least 20 GB? If it does, the full file would need roughly the machine's entire 200 GB or more.
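A rough way to run a similar test from inside Python is sketched below. The function name ram_to_disk_ratio and its sample_rows default are made up for illustration, and it assumes one record per physical line (no quoted fields containing newlines). It measures the finished DataFrame only, so it is a lower bound: the parser's peak usage while reading, which is what top would show, can be noticeably higher.

```python
import itertools

import pandas as pd


def ram_to_disk_ratio(path, sample_rows=100000):
    """Estimate how many bytes of RAM each byte of CSV turns into.

    Hypothetical helper: assumes one record per physical line.
    """
    # Bytes occupied on disk by the header line plus the first sample_rows records.
    with open(path, "rb") as f:
        disk_bytes = sum(len(line) for line in itertools.islice(f, sample_rows + 1))

    # Parse the same rows and measure the resulting DataFrame.
    # deep=True also counts the Python objects behind object (string) columns.
    sample = pd.read_csv(path, nrows=sample_rows)
    ram_bytes = int(sample.memory_usage(deep=True).sum())

    return ram_bytes / float(disk_bytes)
```

Calling ram_to_disk_ratio("110homes.csv") and multiplying the result by the 14 GB file size gives a rough size for the loaded data, which can then be compared with the 200 GB available.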
