30

I have a large csv file, about 600mb with 11 million rows and I want to create statistical data like pivots, histograms, graphs etc. Obviously trying to just to read it normally:

df = pd.read_csv('Check400_900.csv', sep='\t') 

Doesn't work so I found iterate and chunksize in a similar post so I used:

df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000) 

All good, i can for example print df.get_chunk(5) and search the whole file with just:

for chunk in df: print chunk 

My problem is I don't know how to use stuff like these below for the whole df and not for just one chunk.

plt.plot() print df.head() print df.describe() print df.dtypes customer_group3 = df.groupby('UserID') y3 = customer_group.size() 

How can i work with the df without having to load it into memory all at once?

1

4 Answers 4

37

Solution, if need create one big DataFrame if need processes all data at once (what is possible, but not recommended):

Then use concat for all chunks to df, because type of output of function:

df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000) 

isn't dataframe, but pandas.io.parsers.TextFileReader - source.

tp = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000) print tp #<pandas.io.parsers.TextFileReader object at 0x00000000150E0048> df = pd.concat(tp, ignore_index=True) 

I think is necessary add parameter ignore index to function concat, because avoiding duplicity of indexes.

EDIT:

But if want working with large data like aggregating, much better is use dask, because it provides advanced parallelism.

Sign up to request clarification or add additional context in comments.

8 Comments

He can just do tp.read() to get a dataframe but it probably won't fit into his memory so he asks if you could work without loading it fully.
I have tried that but i run out of memory again, 'MemoryError'
The size of my memory is 6gb
I get error NotImplementedError: 'nrows' and 'chunksize' can not be used together yet. Should i remove chunksize?
This is a terrible idea, for exactly the reason @hellpanderr suggested in the first comment. This has the same effect as just calling read_csv without using chunksize, except that it takes twice as much memory (because you now have to hold not only the giant DataFrame, but also all the chunks that add up to that DataFrame at the same time). This cannot possibly solve the OP's problem, it only makes it worse.
|
21

You do not need concat here. It's exactly like writing sum(map(list, grouper(tup, 1000))) instead of list(tup). The only thing iterator and chunksize=1000 does is to give you a reader object that iterates 1000-row DataFrames instead of reading the whole thing. If you want the whole thing at once, just don't use those parameters.

But if reading the whole file into memory at once is too expensive (e.g., takes so much memory that you get a MemoryError, or slow your system to a crawl by throwing it into swap hell), that's exactly what chunksize is for.

The problem is that you named the resulting iterator df, and then tried to use it as a DataFrame. It's not a DataFrame; it's an iterator that gives you 1000-row DataFrames one by one.

When you say this:

My problem is I don't know how to use stuff like these below for the whole df and not for just one chunk

The answer is that you can't. If you can't load the whole thing into one giant DataFrame, you can't use one giant DataFrame. You have to rewrite your code around chunks.

Instead of this:

df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000) print df.dtypes customer_group3 = df.groupby('UserID') 

… you have to do things like this:

for df in pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000): print df.dtypes customer_group3 = df.groupby('UserID') 

Often, what you need to do is aggregate some data—reduce each chunk down to something much smaller with only the parts you need. For example, if you want to sum the entire file by groups, you can groupby each chunk, then sum the chunk by groups, and store a series/array/list/dict of running totals for each group.

Of course it's slightly more complicated than just summing a giant series all at once, but there's no way around that. (Except to buy more RAM and/or switch to 64 bits.) That's how iterator and chunksize solve the problem: by allowing you to make this tradeoff when you need to.

1 Comment

Slow compared to what? Calling groupby on a 1000-row chunk takes exactly as much time as calling groupby on a 1000-row whole-file DataFrame. Which takes under 1ms on my laptop. If you have to do that 1000 times, it takes almost a second. Is that too slow to process a file that you otherwise couldn't process at all?
6

You need to concatenate the chucks. For example:

df2 = pd.concat([chunk for chunk in df]) 

And then run your commands on df2

2 Comments

Thanks for your response but again i get ''MemoryError''
This is just a slightly worse version of jezrael's answer.
2

This might not reply directly to the question, but when you have to load a big dataset it is a good practice, to covert the dtypes of your columns while reading the dataset. Also if you know which columns you need, use the usecols argument to load only those.

df = pd.read_csv("data.csv", usecols=['A', 'B', 'C', 'Date'], dtype={'A':'uint32', 'B':'uint8', 'C':'uint8' }, parse_dates=['Date'], # convert to datetime64 sep='\t' ) 

2 Comments

If this doesn't answer the question, you can add this information as a comment.
I must have at least 50 reputation to comment :/. I thought this suggestion would be useful to be here, for anyone working on "big" datasets. (I can comment only to my own answer)

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.