0

I have a data dump that is a "messed up" CSV. (About 100 files, each with about 1000 lines of actual CSV data.)
The dump has some other text in addition to CSV. How can I extract the CSV part separately, programmatically?

As an example the data file looks like something like this

Session:1 Data collection date: 09-09-2016 Related questions: Question 1: parta, partb, partc, Question 2: parta, partb, partc "field1","field2","field3","field4" "data11","data12","data13","data14" "data21","data22","data23","data24" "data31","data32","data33","data34" "data41","data42","data43","data44" "data51","data52","data53","data54" 

I need to extract the csv part.

Caveats,
the text in the beginning is NOT limited to 4 - 5 lines.
the additional text is NOT just in the beginning of the file

I saw this post that suggests using re.split and/or csv.Sniffer, however my attempt was not fruitful.

with open("untitled.csv") as csvfile: dialect = csv.Sniffer().sniff(csvfile.read(1024)) csvfile.seek(0) print(dialect.__dict__) csvstarts = False csvdump = [] for ln in csvfile.readlines(): toks = re.split(r'[,]', ln) print(toks) if toks[0] == '"field1"' and not csvstarts: # identify by the header line csvstarts = True continue if csvstarts: if toks[0] == '"field1"': # identify the start of subsequent csv data csvstarts = False continue csvdump.append(ln) # record the current line print(csvdump) 

For now I am able to identify the csv lines accurately ONLY if there is one bunch of data.

Is there anything better I can do?

2
  • What separates the lines with data and lines with text then? Can a text line start with "? Commented Sep 22, 2016 at 2:48
  • so, that is where the problem lies, there is nothing that separates csv data from text, unless you consider blank lines as separator. Commented Sep 22, 2016 at 3:13

3 Answers 3

1

How about this:

import re my_pattern = re.compile("(\"[\w]+\",)+") with open('<your_file>', 'rb') as fi: for f in fi: result = my_pattern.match(f) if result: print f 

Assuming the csv data can be differentiated from the rest by having no special characters in them (we only accept each element to have letters or numbers surrounded by double quotes and a comma to separate from the next element)

Sign up to request clarification or add additional context in comments.

4 Comments

Note to self: must learn how to use regular expressions...much better answer!
Happy to help. You could accept the answer if it solves your problem.
Sorry about that.
this is about the closest that does the job. a minor fix in the regex is to check if that is the start and end if line as suggested by @Mrd05d below. hence accepting. i will update if i manage to do better
0

If your csv lines and only those lines start with \", then you can do this:

import csv data = list(csv.reader(open("test.csv", 'rb'), quotechar='¬')) # for quotechar - use something that won't turn up in data def importCSV(data): # outputs list of list with required data # works on the assumption that all required data starts with \" # and that no text starts with \" out = [] for line in data: if (line != []) and (line[0][0] == "\""): line = [el.replace("\"", "") for el in line] out.append(line) return out useful = importCSV(data) 

Comments

0

Can you not read each line and do a regex to see weather or not to pull the data? Maybe something like:

^(["][\w]["][,])+["][\w]["]$

My regex is not the best and there may likely be a better way but that seemed to work for me.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.