
This is my code to generate one file per home id. I will then analyze each home separately.

    import numpy as np
    import pandas as pd

    data = pd.read_csv("110homes.csv")

    # Write one CSV file per home id.
    for i in np.unique(data['dataid']):
        print(i)
        d1 = data[data['dataid'] == i]
        d1.to_csv(str(i) + ".csv")

However, I am getting the error below. The machine has 200 GB of RAM, yet it still raises a MemoryError:

        data = pd.read_csv("110homes.csv")
      File "/usr/lib/python2.7/site-packages/pandas/io/parsers.py", line 474, in parser_f
        return _read(filepath_or_buffer, kwds)
      File "/usr/lib/python2.7/site-packages/pandas/io/parsers.py", line 260, in _read
        return parser.read()
      File "/usr/lib/python2.7/site-packages/pandas/io/parsers.py", line 721, in read
        ret = self._engine.read(nrows)
      File "/usr/lib/python2.7/site-packages/pandas/io/parsers.py", line 1170, in read
        data = self._reader.read(nrows)
      File "pandas/parser.pyx", line 769, in pandas.parser.TextReader.read (pandas/parser.c:7544)
      File "pandas/parser.pyx", line 819, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:8137)
      File "pandas/parser.pyx", line 1833, in pandas.parser._concatenate_chunks (pandas/parser.c:22383)
    MemoryError
  • Nothing to do with your MemoryError, but have a look at the df.groupby method. It would make your code after the read more elegant. Commented Feb 20, 2016 at 18:06
  • Are you using 32-bit Python? It has a 4 GB limit -- see pandas-memory-error Commented Feb 20, 2016 at 18:57
  • I ran import struct; print struct.calcsize("P") * 8 and it shows 64, so I guess that is not the problem. Commented Feb 20, 2016 at 19:43
  • Why is this question downvoted? If you can't answer the question, then at least don't downvote it. It is ridiculous for someone to downvote a question without giving any reason. It's a legit question. Commented Feb 20, 2016 at 20:13
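The groupby suggestion from the first comment could look roughly like the sketch below. It does not address the MemoryError itself; it only replaces the manual loop over unique ids. The small DataFrame, the `use` column, and the temporary output directory are made-up stand-ins for illustration; only the `dataid` column name comes from the question.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the data read from 110homes.csv.
# Only the 'dataid' column name is taken from the question.
data = pd.DataFrame({
    "dataid": [26, 26, 59, 59, 59],
    "use": [0.5, 0.7, 1.2, 1.1, 0.9],
})

out_dir = tempfile.mkdtemp()  # illustrative output directory

# groupby yields one (key, sub-frame) pair per distinct home id,
# so no boolean mask has to be built for each id by hand.
written = []
for home_id, home_rows in data.groupby("dataid"):
    path = os.path.join(out_dir, str(home_id) + ".csv")
    home_rows.to_csv(path, index=False)
    written.append(path)
```

Each entry in `written` is the path of one per-home file, named after its id as in the original loop.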

1 Answer


Data in RAM can take a lot more space than on disk. Without seeing your 110homes.csv file, it's impossible to know the details, but imagine that it consists of 10 floating point numbers per line, like: 0.0,1.0,2.0,.... In the CSV, each number takes 3 bytes + 1 byte for the delimiter. In Python, each takes 8 bytes (on a 64-bit machine) for the float, plus 2 bytes per Unicode char (another 8 bytes), plus 8 bytes for the string length, plus 8 bytes per pointer, plus bytes per row, etc.

Think about it like this: on a 64-bit machine, the minimum size for a pointer, a native int, or a native float is 8 bytes. You need several of those per field, and several more per row. There's nothing unusual about taking 15x in RAM versus disk.
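A rough way to see this effect on a small scale is to compare the CSV text size of a frame with its in-memory size. The sketch below is only an illustration under assumptions: the 1000 x 10 frame of values 0.0 to 9.0 is made up, and it measures finished DataFrames, not the parser's intermediate buffers. It contrasts compact float64 storage with object storage, where every value is a separate Python object reached through a pointer.

```python
import numpy as np
import pandas as pd

# Made-up data: 1000 rows of "0.0,1.0,...,9.0", as in the example above.
n_rows, n_cols = 1000, 10
values = np.tile(np.arange(n_cols, dtype="float64"), (n_rows, 1))
df = pd.DataFrame(values)

# Size of the same data as CSV text (what would sit on disk).
csv_bytes = len(df.to_csv(index=False, header=False).encode("utf-8"))

# Size in RAM as native float64 columns (8 bytes per value).
ram_bytes = int(df.memory_usage(index=False, deep=True).sum())

# Size in RAM when every value is a separate Python object:
# one pointer per value plus the float object itself.
as_objects = df.astype(object)
object_bytes = int(as_objects.memory_usage(index=False, deep=True).sum())

print("csv bytes:    %d" % csv_bytes)
print("float64 RAM:  %d" % ram_bytes)
print("object RAM:   %d" % object_bytes)
```

The exact ratios depend on the data, the dtypes pandas infers, and the interpreter, so the printed numbers are best read as orders of magnitude.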

Do a simple test: take the first 10% of the lines of your file, and monitor python via top as it processes them. See how much RAM it uses. Does it use at least 20 GB? If so, the full file would need roughly ten times that, which is all of the machine's 200 GB.
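The same test can be approximated from inside Python, as a complement to watching top. The sketch below assumes the total row count is known and uses a small generated file in place of 110homes.csv; the file name, the `use` column, and the row counts are hypothetical. Note that it measures the finished DataFrame, so the parser's peak usage (which top would show) can be higher.

```python
import os
import tempfile

import pandas as pd

# Hypothetical small file standing in for 110homes.csv.
tmp_path = os.path.join(tempfile.mkdtemp(), "sample_homes.csv")
pd.DataFrame({
    "dataid": [i % 5 for i in range(1000)],
    "use": [float(i) for i in range(1000)],
}).to_csv(tmp_path, index=False)

total_rows = 1000               # assumed known, e.g. from a line count
sample_rows = total_rows // 10  # the first 10% of the lines

# nrows limits how many data rows read_csv parses.
sample = pd.read_csv(tmp_path, nrows=sample_rows)
sample_bytes = int(sample.memory_usage(deep=True).sum())

# Linear extrapolation from the sample to the whole file.
estimated_full_bytes = sample_bytes * total_rows // sample_rows

print("sample uses %d bytes" % sample_bytes)
print("full file estimated at %d bytes" % estimated_full_bytes)
```

If the estimate for the real file comes out near or above the available RAM, that would be consistent with the MemoryError.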
