3

We have an integration with another system that relies on passing CSV files back and forth (really old school).

The structure is generally:

    ID, Name, PhoneNumber, comments, fathersname
    1, tom, 555-1234, just some random text, bill
    2, jill smith, 555-4234, other random text, richard

Every so often we see this:

    3, jacked up, 999-1231, here be
    dragons amongst us, ted

The primary problem I care about is detecting a line break (\n) that occurs in the middle of a record, given that \n is also the record terminator.

Is there any way I can preprocess the file to reliably fix this?

Note that we have zero control over what the other system emits.

4
  • There are plenty of CSV readers out there. I have used this one successfully in the past, and it is really fast: codeproject.com/Articles/9258/A-Fast-CSV-Reader. You can set up rules and tweak it. Commented Nov 15, 2012 at 22:06
  • Find whoever wrote the code that generates the invalid format and slap them, then just have your code throw new FormatException(). I don't think most formatters will be able to handle this without quotes around the field; you'll need to roll your own. Commented Nov 15, 2012 at 22:07
  • I suppose you can count the number of unescaped , characters on a new line; if it is 0, then it is not actually a new record. Commented Nov 15, 2012 at 22:12
  • Send them the standard, creativyst.com/Doc/Articles/CSV/CSV01.htm#EmbedBRs, which states that fields can embed a newline but must then be surrounded by quotes. You could also dig into libraries that may already handle this; codeproject.com/Articles/25133/LINQ-to-CSV-library may be one. Commented Nov 15, 2012 at 22:14
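
To illustrate the quoting rule mentioned in the last comment: once a broken record has been stitched back together, it could be re-emitted as valid CSV so that a standard parser accepts it. The following is only a minimal sketch; `CsvQuoting`, `QuoteField`, and `ToCsvLine` are hypothetical names, not part of any library mentioned in this thread.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class CsvQuoting
{
    // Wraps a field in double quotes when it contains a comma, a quote,
    // or a line break, doubling any embedded quotes (the usual CSV rule).
    public static string QuoteField(string field)
    {
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        return needsQuotes
            ? "\"" + field.Replace("\"", "\"\"") + "\""
            : field;
    }

    // Joins already-separated fields into one CSV record.
    public static string ToCsvLine(IEnumerable<string> fields)
    {
        return string.Join(",", fields.Select(QuoteField));
    }
}
```

A field such as "here be\ndragons amongst us" would come out wrapped in quotes, while plain fields are left untouched.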

3 Answers

1

So you should be able to do something more or less like this:

    for (int i = 0; i < lines.Count; i++)
    {
        var fields = lines[i].Split(',').ToList();
        while (fields.Count < numFields) // here be dragons amongst us
        {
            i++; // include the next line in this record
            // make sure we haven't run out of lines
            if (i >= lines.Count)
                throw new FormatException("File ended in the middle of a record.");
            // combine the end of the previous field with the start of the next one,
            // and add the line break back in
            var innerFields = lines[i].Split(',');
            fields[fields.Count - 1] += "\n" + innerFields[0];
            fields.AddRange(innerFields.Skip(1));
        }
        // we now know we have a "real" full line
        processFields(fields);
    }

(For simplicity I assumed all lines were read in at the start; I assume you could alter it to lazily fetch each line easily enough.)
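
A lazy variant of the loop above might look roughly like this. It is a sketch under the same assumptions (fields never contain commas, and the field count is known up front); `CsvRepair` and `ReadRecords` are hypothetical names introduced only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class CsvRepair
{
    // Lazily yields one complete record at a time as a list of fields.
    // A physical line with too few fields is treated as the start of a
    // record that was split by a stray line break.
    public static IEnumerable<List<string>> ReadRecords(TextReader reader, int numFields)
    {
        for (var line = reader.ReadLine(); line != null; line = reader.ReadLine())
        {
            if (line.Length == 0) continue; // ignore blank lines between records

            var fields = line.Split(',').ToList();
            while (fields.Count < numFields)
            {
                var next = reader.ReadLine();
                if (next == null)
                    throw new FormatException("Input ended in the middle of a record.");

                // Glue the continuation onto the last field, restoring the line break.
                var innerFields = next.Split(',');
                fields[fields.Count - 1] += "\n" + innerFields[0];
                fields.AddRange(innerFields.Skip(1));
            }
            yield return fields;
        }
    }
}
```

Usage would be something like `foreach (var fields in CsvRepair.ReadRecords(File.OpenText(path), 5)) processFields(fields);`. One limitation of counting fields: a line break inside the last field goes undetected, because the record already looks complete at that point.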


1 Comment

I like this. Will give it a go.
0

Let me start by saying that the CSV file in your example is invalid. If a line break occurs inside a field, the field should be wrapped in double quote characters.

Now for the answer: in order to parse this invalid CSV format you must make some assumptions. In this case I made two: 1) the ID column must be numeric; 2) the comment field cannot contain digits.

Based on these assumptions, you can check the first character after the line break. If it is a digit, you assume it starts a new record. If not, you treat the line as a continuation of the comment field.

I don't know whether the second assumption is valid. If it is not, you can extend the logic so that it covers the business rules of the system.
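
As a rough sketch of that check (it is only as reliable as the two assumptions, and the names `DigitHeuristic` and `MergeContinuationLines` are made up for illustration):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class DigitHeuristic
{
    // Yields logical records: a physical line that does not start with a
    // digit is appended to the previous record, with the line break restored.
    // The very first line (typically the header) always starts a record.
    public static IEnumerable<string> MergeContinuationLines(TextReader reader)
    {
        var current = new StringBuilder();
        bool hasRecord = false;

        for (var line = reader.ReadLine(); line != null; line = reader.ReadLine())
        {
            bool startsNewRecord = line.Length > 0 && char.IsDigit(line[0]);
            if (startsNewRecord || !hasRecord)
            {
                if (hasRecord) yield return current.ToString();
                current.Clear();
                current.Append(line);
                hasRecord = true;
            }
            else
            {
                current.Append('\n').Append(line);
            }
        }
        if (hasRecord) yield return current.ToString();
    }
}
```

Each yielded string can then be split on commas as usual. A continuation line that happens to begin with a digit would be misread as a new record, so this only works when the data really obeys the assumptions.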

Good Luck!

1 Comment

You're absolutely right about it being invalid. However, the big mega corp that produced the garbage has been promising to fix it for 3 years now; so I'm not holding my breath. Unfortunately, we can't guarantee 1 and the comment field might very well start with numbers.
0

First, I would recommend using a library to manage reading and writing your CSV files. I use the FileHelpers library, which is great.

You can essentially define typed records, and it will do the validation and such for you. It is worth the effort.

As for your question, perhaps you can preprocess the file and use a Regex to replace any stray line breaks with a space?

I do something similar (though not with files); try:

    // strings are immutable, so keep the result of Replace
    line = line.Replace(Environment.NewLine, " ");

With FileHelpers you could write a custom converter to do this during processing, or hook into the BeforeRead event.

1 Comment

We're already using FileHelpers. However, it blows up on lines that don't meet the spec, so we set it to ignore those and move on. If you have details on writing a custom converter to handle it, I'd be interested...
