I have a data dump that is a "messed up" CSV. (About 100 files, each with about 1000 lines of actual CSV data.)
The dump has some other text in addition to CSV. How can I extract the CSV part separately, programmatically?
For example, a data file looks something like this:
    Session:1
    Data collection date: 09-09-2016
    Related questions:
    Question 1: parta, partb, partc,
    Question 2: parta, partb, partc
    "field1","field2","field3","field4"
    "data11","data12","data13","data14"
    "data21","data22","data23","data24"
    "data31","data32","data33","data34"
    "data41","data42","data43","data44"
    "data51","data52","data53","data54"

I need to extract the CSV part.
Caveats:
- The text at the beginning is NOT limited to 4-5 lines.
- The additional text is NOT only at the beginning of the file.
I saw this post that suggests using re.split and/or csv.Sniffer, but my attempt was not fruitful:
    import csv
    import re

    with open("untitled.csv") as csvfile:
        dialect = csv.Sniffer().sniff(csvfile.read(1024))
        csvfile.seek(0)
        print(dialect.__dict__)
        csvstarts = False
        csvdump = []
        for ln in csvfile.readlines():
            toks = re.split(r'[,]', ln)
            print(toks)
            if toks[0] == '"field1"' and not csvstarts:  # identify by the header line
                csvstarts = True
                continue
            if csvstarts:
                if toks[0] == '"field1"':  # identify the start of subsequent csv data
                    csvstarts = False
                    continue
                csvdump.append(ln)  # record the current line
        print(csvdump)

For now I am able to identify the CSV lines accurately ONLY if there is one block of data.
Is there anything better I can do?
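One direction that might work, as a rough sketch rather than a tested solution: instead of tracking a start/stop flag, keep only the lines that look like fully quoted CSV rows and let the csv module parse them. This assumes every CSV row consists solely of double-quoted, comma-separated fields (as in the sample), that no quoted field spans multiple lines, and that the header always starts with "field1". The names `QUOTED_ROW`, `extract_csv_rows`, and `header_first_field` are made up for illustration.

```python
import csv
import re

# Assumption: a CSV row is one or more double-quoted fields separated by commas.
# A doubled quote ("") inside a field is accepted as an escaped quote.
QUOTED_ROW = re.compile(r'^\s*"(?:[^"]|"")*"(?:\s*,\s*"(?:[^"]|"")*")*\s*$')


def extract_csv_rows(lines, header_first_field="field1"):
    """Return (header, rows) from the lines that look like quoted CSV rows."""
    # Drop every line that is not made up entirely of quoted fields.
    candidates = [ln for ln in lines if QUOTED_ROW.match(ln)]

    header = None
    rows = []
    for record in csv.reader(candidates):
        if record[0] == header_first_field:
            if header is None:
                header = record  # keep the first header only
            continue  # skip headers repeated by later blocks
        # Keep only rows whose field count matches the header.
        if header is not None and len(record) == len(header):
            rows.append(record)
    return header, rows
```

Since an open file is iterable line by line, this could be called as `header, rows = extract_csv_rows(csvfile)` for each of the files. Free text that happens to consist entirely of quoted fields with the same field count as the header would still slip through, so the result is worth spot-checking.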