Python: How can I optimize this CSV parsing loop?

Question

I wrote this loop to parse a 1 million row .csv file. It works, but can only process about 7k lines/minute. Is there a reasonable way I can get this running faster?

The loop is currently transforming block of data into a row, and stripping out the extra characters, and writing each row to a new .csv file.

pattern = re.compile(r",{2,}") with open("OceanData.csv") as infile, open("OceanParsed.csv","w", newline="") as fout: outfile = csv.writer(fout) data =[] for line in infile: if line.startswith("#--------------------------------------------------------------------------------"): outfile.writerow(data) continue for ch in ["[","]","'"," ","\n"]: if ch in line: line = line.replace(ch,"") for i in line: line =re.sub(pattern,",", line) continue if not line: continue data.append(line)

Sample data: http://www.sharecsv.com/s/674dc42035c29eb4f250b5c2365c8dc6/OceanParseTest.csv

You are not parsing CSV. Nor are you really writing CSV (even though you are inserting commas between fields). Is there a specification for these data? — Sinan Ünür
– Sinan Ünür, Commented Jun 28, 2017 at 17:56
Can you give the link to download your file with 1M of lines ? — glegoux
– glegoux, Commented Jun 28, 2017 at 18:06
Your sample data does not appear to be valid CSV. Please edit your question to make it clear that your file does not conform to the CSV standard. (I would recommend not referencing CSV at all.) One thing I can recommend about solving the problem: strong consider changing your data format to something structured and more standard. Maybe an XML file where the sections are broken up into different elements or something. I dunno what exactly would work, but anything you can get into a more standard format is going to make dealing with it simpler. — jpmc26
– jpmc26, Commented Jun 28, 2017 at 18:08
@SinanÜnür Well crap, I guess I was trying to re-invent the wheel, through your link I found code in C to reformat everything to .csv. data.nodc.noaa.gov/woa/WOD13/PROGRAMS/csvfromwod.c — Mako212
– Mako212, Commented Jun 28, 2017 at 18:17

glegoux · Accepted Answer · 2017-06-28 21:29:29Z

Don't reinvent the wheel to read csv file.

You can use pandas.

import pandas as pd df = pd.read_csv('file.csv')

Or use csv standard library also.

To read a big csv file, if these methods above don't work. You can split your file into small files, create a process to read each file.

Your data sample.

I think your format file isn't a csv file. Then suppose that you have one section like this:

#--------------------------------------------------------------------------------,,,,,, CAST ,,9285001,WOD Unique Cast Number,WOD code,, NODC Cruise ID ,,US-10209 ,,,, Originators Station ID ,,82,,,integer, Originators Cruise ID ,, ,,,, Latitude ,,-76.477,decimal degrees,,, Longitude ,,166.3137,decimal degrees,,, Year ,,1997,,,, Month ,,1,,,, Day ,,1,,,, Time ,,3.9931,decimal hours (UT),,, METADATA,,,,,, Country ,, US,NODC code,UNITED STATES,, Accession Number ,,520,NODC code,,, Project ,,406,NODC code,RESEARCH ON OCEAN ATMOSPHERE VARIABILITY & ECOSYSTEM RESPON SE IN ROSS SEA,, Platform ,,3596,OCL code,NATHANIEL B. PALMER (Icebr.;c.s.WBP3210;built 03.1992;old c .s.KUS1475;IMO900725,, Institute ,,431,NODC code,US DOC NOAA NESDIS,, Cast/Tow Number ,,1,,,, High resolution CTD - Bottle,,9182488,,,, probe_type ,,7,OCL_code,bottle/rossette/net,, scale ,Temperature,103,WOD code,Temperature: ITS-90,, Instrument ,Temperature,411,WOD code,CTD: SBE 911plus (Sea-Bird Electronics, Inc.), VARIABLES ,Depth ,F,O,Temperatur ,F,O UNITS ,m , , ,degrees C ,, Prof-Flag , ,0, , ,0, 1,0,0, ,-1.591,0, 2,5,0, ,-1.668,0, 3,10,0, ,-1.702,0, 4,15,0, ,-1.733,0, 5,20,0, ,-1.746,0, 6,25,0, ,-1.76,0, 7,30,0, ,-1.773,0, 8,35,0, ,-1.785,0, 9,40,0, ,-1.796,0, 10,45,0, ,-1.805,0, 11,50,0, ,-1.813,0, 12,55,0, ,-1.823,0, 13,60,0, ,-1.832,0, 14,65,0, ,-1.84,0, 15,70,0, ,-1.848,0, 16,75,0, ,-1.855,0, 17,80,0, ,-1.861,0, 18,85,0, ,-1.867,0, 19,90,0, ,-1.873,0, 20,95,0, ,-1.878,0, 21,100,0, ,-1.882,0, 22,125,0, ,-1.892,0, 23,150,0, , ---0---,0, 24,175,0, , ---0---,0, 25,200,0, , ---0---,0, 26,225,0, , ---0---,0, 27,250,0, , ---0---,0, 28,275,0, , ---0---,0, 29,300,0, , ---0---,0, 30,325,0, , ---0---,0, 31,350,0, , ---0---,0, 32,375,0, , ---0---,0, 33,400,0, , ---0---,0, 34,425,0, , ---0---,0, 35,450,0, , ---0---,0, 36,475,0, , ---0---,0, 37,500,0, , ---0---,0, 38,550,0, ,-1.898,0, END OF VARIABLES SECTION,,,,,,

Clean this section with :

format.sh:

#!/usr/bin/env bash # use : bash format.sh pathname cat "$1" | \ grep -v '^#\|^END' | \ sed 's/,/ /g' | tr -s " " | sed 's/ /,/'

To get :

CAST,9285001 WOD Unique Cast Number WOD code NODC,Cruise ID US-10209 Originators,Station ID 82 integer Originators,Cruise ID Latitude,-76.477 decimal degrees Longitude,166.3137 decimal degrees Year,1997 Month,1 Day,1 Time,3.9931 decimal hours (UT) METADATA, Country,US NODC code UNITED STATES Accession,Number 520 NODC code Project,406 NODC code RESEARCH ON OCEAN ATMOSPHERE VARIABILITY & ECOSYSTEM RESPONSE IN ROSS SEA Platform,3596 OCL code NATHANIEL B. PALMER (Icebr.;c.s.WBP3210;built 03.1992;old c.s.KUS1475;IMO900725 Institute,431 NODC code US DOC NOAA NESDIS Cast/Tow,Number 1 High,resolution CTD - Bottle 9182488 probe_type,7 OCL_code bottle/rossette/net scale,Temperature 103 WOD code Temperature: ITS-90 Instrument,Temperature 411 WOD code CTD: SBE 911plus (Sea-Bird Electronics Inc.) VARIABLES,Depth F O Temperatur F O UNITS,m degrees C Prof-Flag,0 0 1,0 0 -1.591 0 2,5 0 -1.668 0 3,10 0 -1.702 0 4,15 0 -1.733 0 5,20 0 -1.746 0 6,25 0 -1.76 0 7,30 0 -1.773 0 8,35 0 -1.785 0 9,40 0 -1.796 0 10,45 0 -1.805 0 11,50 0 -1.813 0 12,55 0 -1.823 0 13,60 0 -1.832 0 14,65 0 -1.84 0 15,70 0 -1.848 0 16,75 0 -1.855 0 17,80 0 -1.861 0 18,85 0 -1.867 0 19,90 0 -1.873 0 20,95 0 -1.878 0 21,100 0 -1.882 0 22,125 0 -1.892 0 23,150 0 ---0--- 0 24,175 0 ---0--- 0 25,200 0 ---0--- 0 26,225 0 ---0--- 0 27,250 0 ---0--- 0 28,275 0 ---0--- 0 29,300 0 ---0--- 0 30,325 0 ---0--- 0 31,350 0 ---0--- 0 32,375 0 ---0--- 0 33,400 0 ---0--- 0 34,425 0 ---0--- 0 35,450 0 ---0--- 0 36,475 0 ---0--- 0 37,500 0 ---0--- 0 38,550 0 -1.898 0

If you are 1M of lines I suppose that you have around 15 000 sections.

I get that with :

for _ in `seq 1 15000`; do cat one_section.txt >> data.txt; done

Check:

grep -n ^# data.txt | cut -d : -f1 | wc -l wc -l data.txt ls -sh data.txt

Give well 15 000 sections, 960000 lines, and 34MB.

....

Collectives™ on Stack Overflow

Python: How can I optimize this CSV parsing loop?

1 Answer 1

Comments

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Related