Don't reinvent the wheel to read a CSV file.
You can use pandas.
import pandas as pd
df = pd.read_csv('file.csv')
Or use the csv module from the standard library.
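A minimal sketch of the standard-library route. The helper name `read_rows` is only an illustration, and it loads every row into memory, so it suits files that fit in RAM:

```python
import csv


def read_rows(path):
    """Return every row of a CSV file as a list of lists of strings."""
    # newline="" is what the csv documentation recommends when opening
    # a file that will be handed to csv.reader.
    with open(path, newline="") as handle:
        return list(csv.reader(handle))
```

For a larger file you would probably iterate over `csv.reader(handle)` directly instead of building the full list.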
If the methods above don't work on a big CSV file, you can split it into smaller files and create one process to read each of them.
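The split-then-read idea could be sketched roughly as follows. All the names here (`split_file`, `count_lines`, `process_parts`) are hypothetical, and `count_lines` only stands in for whatever real work you do on each part:

```python
import itertools
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def split_file(path, lines_per_part, out_dir):
    """Write `path` out as numbered part files of at most `lines_per_part` lines."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    with open(path) as source:
        for index in itertools.count():
            # islice pulls the next block of lines without loading the whole file
            chunk = list(itertools.islice(source, lines_per_part))
            if not chunk:
                break
            part = out_dir / f"part_{index:05d}.txt"
            part.write_text("".join(chunk))
            parts.append(part)
    return parts


def count_lines(path):
    """Placeholder for the per-file work: count the lines of one part."""
    with open(path) as handle:
        return sum(1 for _ in handle)


def process_parts(parts, workers=4):
    """Run count_lines on every part in a pool of worker processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_lines, parts))
```

You would call something like `process_parts(split_file('data.txt', 100_000, 'parts'))` from under an `if __name__ == '__main__':` guard, which process pools need on platforms that spawn workers. Note that this splits on a fixed line count; if the file is made of multi-line sections, you would want to cut on the section boundaries instead.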
Your data sample.
I don't think your file is really a CSV file. Suppose that you have one section like this:
#--------------------------------------------------------------------------------,,,,,,
CAST ,,9285001,WOD Unique Cast Number,WOD code,,
NODC Cruise ID ,,US-10209 ,,,,
Originators Station ID ,,82,,,integer,
Originators Cruise ID ,, ,,,,
Latitude ,,-76.477,decimal degrees,,,
Longitude ,,166.3137,decimal degrees,,,
Year ,,1997,,,,
Month ,,1,,,,
Day ,,1,,,,
Time ,,3.9931,decimal hours (UT),,,
METADATA,,,,,,
Country ,, US,NODC code,UNITED STATES,,
Accession Number ,,520,NODC code,,,
Project ,,406,NODC code,RESEARCH ON OCEAN ATMOSPHERE VARIABILITY & ECOSYSTEM RESPONSE IN ROSS SEA,,
Platform ,,3596,OCL code,NATHANIEL B. PALMER (Icebr.;c.s.WBP3210;built 03.1992;old c.s.KUS1475;IMO900725,,
Institute ,,431,NODC code,US DOC NOAA NESDIS,,
Cast/Tow Number ,,1,,,,
High resolution CTD - Bottle,,9182488,,,,
probe_type ,,7,OCL_code,bottle/rossette/net,,
scale ,Temperature,103,WOD code,Temperature: ITS-90,,
Instrument ,Temperature,411,WOD code,CTD: SBE 911plus (Sea-Bird Electronics, Inc.),
VARIABLES ,Depth ,F,O,Temperatur ,F,O
UNITS ,m , , ,degrees C ,,
Prof-Flag , ,0, , ,0,
1,0,0, ,-1.591,0,
2,5,0, ,-1.668,0,
3,10,0, ,-1.702,0,
4,15,0, ,-1.733,0,
5,20,0, ,-1.746,0,
6,25,0, ,-1.76,0,
7,30,0, ,-1.773,0,
8,35,0, ,-1.785,0,
9,40,0, ,-1.796,0,
10,45,0, ,-1.805,0,
11,50,0, ,-1.813,0,
12,55,0, ,-1.823,0,
13,60,0, ,-1.832,0,
14,65,0, ,-1.84,0,
15,70,0, ,-1.848,0,
16,75,0, ,-1.855,0,
17,80,0, ,-1.861,0,
18,85,0, ,-1.867,0,
19,90,0, ,-1.873,0,
20,95,0, ,-1.878,0,
21,100,0, ,-1.882,0,
22,125,0, ,-1.892,0,
23,150,0, , ---0---,0,
24,175,0, , ---0---,0,
25,200,0, , ---0---,0,
26,225,0, , ---0---,0,
27,250,0, , ---0---,0,
28,275,0, , ---0---,0,
29,300,0, , ---0---,0,
30,325,0, , ---0---,0,
31,350,0, , ---0---,0,
32,375,0, , ---0---,0,
33,400,0, , ---0---,0,
34,425,0, , ---0---,0,
35,450,0, , ---0---,0,
36,475,0, , ---0---,0,
37,500,0, , ---0---,0,
38,550,0, ,-1.898,0,
END OF VARIABLES SECTION,,,,,,
Clean this section with:
format.sh:
#!/usr/bin/env bash
# use : bash format.sh pathname
cat "$1" | \
    grep -v '^#\|^END' | \
    sed 's/,/ /g' | tr -s " " | sed 's/ /,/'
To get:
CAST,9285001 WOD Unique Cast Number WOD code
NODC,Cruise ID US-10209
Originators,Station ID 82 integer
Originators,Cruise ID
Latitude,-76.477 decimal degrees
Longitude,166.3137 decimal degrees
Year,1997
Month,1
Day,1
Time,3.9931 decimal hours (UT)
METADATA,
Country,US NODC code UNITED STATES
Accession,Number 520 NODC code
Project,406 NODC code RESEARCH ON OCEAN ATMOSPHERE VARIABILITY & ECOSYSTEM RESPONSE IN ROSS SEA
Platform,3596 OCL code NATHANIEL B. PALMER (Icebr.;c.s.WBP3210;built 03.1992;old c.s.KUS1475;IMO900725
Institute,431 NODC code US DOC NOAA NESDIS
Cast/Tow,Number 1
High,resolution CTD - Bottle 9182488
probe_type,7 OCL_code bottle/rossette/net
scale,Temperature 103 WOD code Temperature: ITS-90
Instrument,Temperature 411 WOD code CTD: SBE 911plus (Sea-Bird Electronics Inc.)
VARIABLES,Depth F O Temperatur F O
UNITS,m degrees C
Prof-Flag,0 0
1,0 0 -1.591 0
2,5 0 -1.668 0
3,10 0 -1.702 0
4,15 0 -1.733 0
5,20 0 -1.746 0
6,25 0 -1.76 0
7,30 0 -1.773 0
8,35 0 -1.785 0
9,40 0 -1.796 0
10,45 0 -1.805 0
11,50 0 -1.813 0
12,55 0 -1.823 0
13,60 0 -1.832 0
14,65 0 -1.84 0
15,70 0 -1.848 0
16,75 0 -1.855 0
17,80 0 -1.861 0
18,85 0 -1.867 0
19,90 0 -1.873 0
20,95 0 -1.878 0
21,100 0 -1.882 0
22,125 0 -1.892 0
23,150 0 ---0--- 0
24,175 0 ---0--- 0
25,200 0 ---0--- 0
26,225 0 ---0--- 0
27,250 0 ---0--- 0
28,275 0 ---0--- 0
29,300 0 ---0--- 0
30,325 0 ---0--- 0
31,350 0 ---0--- 0
32,375 0 ---0--- 0
33,400 0 ---0--- 0
34,425 0 ---0--- 0
35,450 0 ---0--- 0
36,475 0 ---0--- 0
37,500 0 ---0--- 0
38,550 0 -1.898 0
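If you would rather stay in Python, the same cleaning steps can be approximated like this. This is a sketch that mirrors the grep/sed/tr pipeline, not a different algorithm, and the names `clean_line` and `clean_lines` are made up for the example:

```python
import re


def clean_line(line):
    """Apply the sed/tr steps to one line: commas to spaces, squeeze, first space to comma."""
    line = line.rstrip("\n").replace(",", " ")   # sed 's/,/ /g'
    line = re.sub(r" +", " ", line)              # tr -s " "
    return line.replace(" ", ",", 1)             # sed 's/ /,/'


def clean_lines(lines):
    """Yield cleaned lines, skipping those that start with '#' or 'END' (the grep -v step)."""
    for line in lines:
        if line.startswith(("#", "END")):
            continue
        yield clean_line(line)
```

Like the shell version, this keeps the trailing space left over from the empty fields at the end of each line, so you may want to add a `.rstrip()` if that bothers you.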
If you have 1M lines, I suppose that you have around 15 000 sections.
I simulated that with:
for _ in `seq 1 15000`; do cat one_section.txt >> data.txt; done
Check:
grep -n ^# data.txt | cut -d : -f1 | wc -l
wc -l data.txt
ls -sh data.txt
This does give 15 000 sections, 960 000 lines, and 34MB.
....