12

I have an 8 GB csv file and I am not able to run the code below because it raises a memory error.

file = "./data.csv" df = pd.read_csv(file, sep="/", header=0, dtype=str) 

I would like to split the file into 8 smaller files (sorted by id) using Python, and finally have a loop so that the output file contains the output of all 8 files.

Or I would like to try parallel computing. The main goal is to process the 8 GB of data in Python with pandas. Thank you.

My csv file contains numerous rows and uses '/' as the column separator:

id   venue      time              code  value  ......
AAA  Paris      28/05/2016 09:10  PAR   45     ......
111  Budapest   14/08/2016 19:00  BUD   62     ......
AAA  Tokyo      05/11/2016 23:20  TYO   56     ......
111  LA         12/12/2016 05:55  LAX   05     ......
111  New York   08/01/2016 04:25  NYC   14     ......
AAA  Sydney     04/05/2016 21:40  SYD   2      ......
ABX  HongKong   28/03/2016 17:10  HKG   5      ......
ABX  London     25/07/2016 13:02  LON   22     ......
AAA  Dubai      01/04/2016 18:45  DXB   19     ......
.
.
.
4
  • 2
    Use itertools as the answer here explains stackoverflow.com/questions/16289859/… Commented Jul 6, 2017 at 10:26
  • do you actually need the 8 small files, or are you going to use only the final file? Commented Jul 6, 2017 at 12:19
  • only the final file Commented Jul 6, 2017 at 12:21
  • @Iris so essentially you want to sort your csv by id and save it to file? Commented Jul 14, 2017 at 8:50

5 Answers

9
+25
import numpy as np
from multiprocessing import Pool

def processor(df):
    # Some work
    df.sort_values('id', inplace=True)
    return df

size = 8
df_split = np.array_split(df, size)

cores = 8
pool = Pool(cores)
for n, frame in enumerate(pool.imap(processor, df_split), start=1):
    frame.to_csv('{}'.format(n))
pool.close()
pool.join()
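
Since the comments note that only the final combined file is needed, one way to finish is to append the per-part outputs into a single csv afterwards. A minimal sketch, assuming the loop above wrote files named '1' through '8' with pandas' default comma separator (the output file name here is just illustrative):

import pandas as pd

output = "final.csv"                         # hypothetical name for the combined file
part_files = [str(n) for n in range(1, 9)]   # the files written by the loop above

for i, part in enumerate(part_files):
    # index_col=0 restores the index column that to_csv wrote by default
    part_df = pd.read_csv(part, index_col=0, dtype=str)
    # Append to the final file, writing the header only for the first part
    part_df.to_csv(output, mode="a", index=False, header=(i == 0))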

10 Comments

Hey! This is cool!! I was looking for something similar! But I get this error from frame.to_csv(output, sep="^", index=False.format(n)): AttributeError: 'bool' object has no attribute 'format'
where output = "/file.csv"
frame.to_csv(output, sep="^", index=False)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 659, in next raise value IndexError: positional indexers are out-of-bounds
What's inside your processor function?
6

Use the chunksize parameter to read one chunk at a time and save the files to disk. This will split the original file into equal parts of 100000 rows each:

file = "./data.csv" chunks = pd.read_csv(file, sep="/", header=0, dtype=str, chunksize = 100000) for it, chunk in enumerate(chunks): chunk.to_csv('chunk_{}.csv'.format(it), sep="/") 

If you know the number of rows of the original file, you can calculate the exact chunksize to split the file into 8 equal parts (nrows/8).
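
For instance, a rough sketch of that calculation, assuming the row count is obtained by counting lines first (the header line is subtracted; file and variable names follow the snippet above):

import math
import pandas as pd

file = "./data.csv"

# Count data rows without loading the file into memory
with open(file) as f:
    nrows = sum(1 for _ in f) - 1   # subtract the header line

# Chunk size that splits the file into 8 roughly equal parts
chunksize = int(math.ceil(nrows / 8.0))

chunks = pd.read_csv(file, sep="/", header=0, dtype=str, chunksize=chunksize)
for it, chunk in enumerate(chunks):
    chunk.to_csv('chunk_{}.csv'.format(it), sep="/")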

2 Comments

Won't this still consume too much memory though, since the entire dataframe is loaded before iterating and saving?
No. The whole point of chunking is that it does not load the entire dataframe into memory. The variable chunks in my answer is an iterable object which occupies virtually no memory (read more here: pandas.pydata.org/pandas-docs/stable/io.html#io-chunking). Only when you iterate through chunks are you actually reading a chunk-sized portion of the file into memory.
5

pandas read_csv has two arguments that you could use to do what you want:

nrows : to specify the number of rows you want to read
skiprows : to specify the first row you want to read

Refer to documentation at: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
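
A minimal sketch of that idea, assuming the total number of data rows is known (the count used here is hypothetical) and the file is read in 8 slices; skiprows is given a range that skips the earlier data rows while keeping row 0 so the header is reused:

import pandas as pd

file = "./data.csv"
total_rows = 8000000            # hypothetical row count of the full file
parts = 8
part_size = total_rows // parts

for i in range(parts):
    start = i * part_size
    # Skip the data rows before this slice, but keep the header row (row 0)
    part = pd.read_csv(file, sep="/", dtype=str,
                       skiprows=range(1, start + 1), nrows=part_size)
    part.to_csv('part_{}.csv'.format(i), sep="/", index=False)
    # Note: if total_rows is not an exact multiple of parts, the last slice
    # can be read with nrows omitted to pick up the remaining rows.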

Comments

0

You might also want to use the dask framework and its built-in dask.dataframe. Essentially, the csv file is transformed into multiple pandas dataframes, each read in only when necessary. However, not every pandas command is available within dask.
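
A rough sketch of how that might look for this file (assuming dask is installed; set_index is used to sort by id, and the result is written as one csv per partition):

import dask.dataframe as dd

file = "./data.csv"

# Lazily read the csv as a collection of pandas partitions
df = dd.read_csv(file, sep="/", dtype=str)

# In dask, sorting by a column is done by setting it as the index
df = df.set_index('id')

# Write one csv file per partition, e.g. sorted-0.csv, sorted-1.csv, ...
df.to_csv('sorted-*.csv', sep="/")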

Comments

0

If you don't need all the columns, you may also use the usecols parameter:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

usecols : array-like or callable, default None
    Return a subset of the columns. [...] Using this parameter results in much faster parsing time and lower memory usage.
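
For example, a small sketch assuming only the id and value columns are actually needed (column names taken from the sample data in the question):

import pandas as pd

file = "./data.csv"

# Parse only the needed columns; the others are never materialised
df = pd.read_csv(file, sep="/", header=0, dtype=str, usecols=['id', 'value'])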

Comments