
I have the following CSV file with this header:

AccountOwnerEmail PartnerName EnrollmentID Customer LicensingProgram Country Culture Issue

with lines like this:

[email protected],"HEY"? Tester, 12345789,"Catalysis", LLC., Enterprise 6 TEST, etc,etc ,etc

I have a method to separate the lines into the corresponding columns:

    var columns = columnsRegex.Matches(line)
        .Cast<Match>()
        .Select(m => m.Value.Trim('\"', '\'', ' ', '\t'))
        .ToList();

Here's the definition for columnsRegex:

private static Regex columnsRegex = new Regex("\"[^\"]*\"|'[^']*'|[^,;]+"); 

My problem is that some values are split incorrectly. For example, the PartnerName content is separated into 3 columns: "", "Hey" and "?Tester".

I know that CSV escapes a double quote with another double quote. I have already checked other similar posts that recommend adding a reference to Microsoft.VisualBasic, but that is not working for me. Is there any other approach I can take to process the CSV content correctly?

  • There are multiple CSV parsers for .NET out there (see a NuGet Search). Commented Sep 18, 2018 at 13:20
  • This is NOT a CSV. To parse it with a regex is right, field name is ^\w+: and everything else .* is its value. Ignore the header (if records are somehow separated, for example by a full blank line; otherwise just use it to count the number of fields in each record: header.Split(' ').Length). Alternatively a simple line.IndexOf(":") might also work pretty well in this case... Commented Sep 18, 2018 at 13:20
  • "with lines like this" That isn't CSV. Commented Sep 18, 2018 at 13:21
  • @Richard If it was an actual CSV file then yes... This is essentially a text file. Commented Sep 18, 2018 at 13:21
  • I have edited the text to show the real CSV line. Commented Sep 18, 2018 at 13:38

2 Answers

I use CsvHelper for it. It's a very nice library to parse CSV documents. Use nuget package:

Install-Package CsvHelper 

Documentation can be found here.

    // Note: newer CsvHelper releases also expect a CultureInfo as a second constructor argument.
    var csv = new CsvReader(textReader);
    var records = csv.GetRecords<MyCsvRecord>();

Where MyCsvRecord is your CSV row e.g.:

    public class MyCsvRecord
    {
        public string AccountOwnerEmail { get; set; }
        public string PartnerName { get; set; }
        // etc.
    }
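As a rough end-to-end illustration of the snippet above, here is a minimal sketch that reads CSV text held in a string. It assumes a recent CsvHelper release (where `CsvReader` takes a `CultureInfo`), and the names `PartnerRecord` and `CsvHelperSketch` are hypothetical, reduced to two columns for brevity. Note that the reader is given a `TextReader` over the content; passing the CSV content itself to something that expects a file path (such as `new StreamReader(string)`) may be one source of the "illegal character in path" error mentioned in the comments.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;
using CsvHelper;

// Hypothetical record type: property names must match the CSV header names.
public class PartnerRecord
{
    public string AccountOwnerEmail { get; set; }
    public string PartnerName { get; set; }
}

public static class CsvHelperSketch
{
    // Parses CSV text (header line + data lines) into records.
    public static List<PartnerRecord> ReadAll(string csvText)
    {
        // StringReader wraps the content itself; no file path is involved.
        using (var reader = new StringReader(csvText))
        using (var csv = new CsvReader(reader, CultureInfo.InvariantCulture))
        {
            // GetRecords is lazy, so materialize before the reader is disposed.
            return csv.GetRecords<PartnerRecord>().ToList();
        }
    }
}
```

With a well-formed input such as a header `AccountOwnerEmail,PartnerName` followed by `owner@example.com,"""HEY""? Tester"`, the doubled quotes should be unescaped by the library, so `PartnerName` would come back as `"HEY"? Tester`.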

7 Comments

I'll try this and let you know
My problem with this is that it returns an "illegal character in path" error because of the double quotes. And I cannot delete them; I need to show the exact content of the file.
@pedrodotnet can you open your document in Excel? I feel like your "CSV" document is not well formatted.
Yes, it's an Excel file; I upload the Excel file to my solution to process it.
I used CsvHelper and it is really nice. I solved all of my problems with simple customizations: I fixed \r\n characters in a column, parsed one column into multiple columns, etc. @pedrodotnet go this way, it will be the best solution.

EDIT: Added another parser method, fixed line and test parsing output.

I would say that your regular expression pattern is wrong. It does not allow a (doubled) " character inside a value that starts with ". The same problem applies to '.

    using System;
    using System.Text.RegularExpressions;

    internal static class Program
    {
        // Line as posted in the question (quotes inside the field are not escaped).
        private const string wrongLine = "[email protected],\"HEY\"? Tester, 12345789,\"Catalysis\", LLC., Enterprise 6 TEST, etc,etc ,etc";
        // Same line with the PartnerName field quoted and its inner quotes doubled.
        private const string fixedLine = "[email protected],\"\"\"HEY\"\"? Tester\", 12345789,\"Catalysis\", LLC., Enterprise 6 TEST, etc,etc ,etc";

        private static readonly Regex wrongPattern = new Regex("\"[^\"]*\"|'[^']*'|[^,;]+");
        private static readonly Regex fixedPattern = new Regex("((?:\"((?:[^\"]|\"\")*)\")|(?:'((?:[^']|'')*)')|([^,;]*))(?:[,;]|$)");

        private static void Main()
        {
            Console.WriteLine("*** Wrong line: ***");
            Console.WriteLine();
            Parse(wrongLine);
            Console.WriteLine();
            Console.WriteLine();
            Console.WriteLine("*** Fixed line: ***");
            Console.WriteLine();
            Parse(fixedLine);
        }

        // Prints the fields produced by the original regex, the fixed regex and the manual parser.
        private static void Parse(string line)
        {
            Console.WriteLine("--- [Original Regex] ---");
            var matches = wrongPattern.Matches(line);
            for (int i = 0; i < matches.Count; i++)
            {
                Console.WriteLine("'" + matches[i].Value + "'");
            }

            Console.WriteLine();
            Console.WriteLine("--- [Fixed Regex] ---");
            Console.WriteLine();
            matches = fixedPattern.Matches(line);
            for (int i = 0; i < matches.Count; i++)
            {
                Console.WriteLine("'" + GetValue(matches[i]) + "'");
            }

            Console.WriteLine();
            Console.WriteLine("--- [Correct(?) parser] ---");
            Console.WriteLine();
            var position = 0;
            while (position < line.Length)
            {
                var value = GetValue(line, ref position);
                Console.WriteLine("'" + value + "'");
            }
        }

        // Group 2: content of a "..." value, group 3: content of a '...' value, group 4: unquoted value.
        private static string GetValue(Match match)
        {
            if (!string.IsNullOrEmpty(match.Groups[2].Value))
            {
                return (match.Groups[2].Value.Replace("\"\"", "\""));
            }
            if (!string.IsNullOrEmpty(match.Groups[3].Value))
            {
                return (match.Groups[3].Value.Replace("''", "'"));
            }
            return (match.Groups[4].Value.Replace("''", "'"));
        }

        // Reads one value starting at 'position' and advances 'position' past its delimiter.
        private static string GetValue(string line, ref int position)
        {
            string value;
            char? prefix;
            string endWith;
            switch (line[position])
            {
                case '\'':
                case '\"':
                    // Quoted value: it ends at the quote character followed by a comma.
                    prefix = line[position];
                    endWith = prefix + ",";
                    position++;
                    break;
                default:
                    prefix = null;
                    endWith = ",";
                    break;
            }

            var endPosition = line.IndexOf(endWith, position);
            if (endPosition < 0 && prefix.HasValue)
            {
                // Quoted value at the very end of the line.
                if (line[line.Length - 1] == prefix.Value)
                {
                    value = line.Substring(position, line.Length - 1 - position);
                    position = line.Length;
                    return Fixprefix(value, prefix.Value.ToString());
                }
                // No closing quote found: step back and fall through to a plain comma search.
                position--;
                endPosition = line.IndexOf(',', position);
            }
            if (endPosition < 0)
            {
                // Last value of the line.
                value = line.Substring(position);
                position = line.Length;
                return value;
            }
            if (prefix.HasValue)
            {
                value = line.Substring(position, endPosition - position);
                position = endPosition + endWith.Length;
                return Fixprefix(value, prefix.Value.ToString());
            }
            value = line.Substring(position, endPosition - position);
            position = endPosition + endWith.Length;
            return value;
        }

        // Replaces doubled quote characters with a single one.
        private static string Fixprefix(string value, string prefix) => value.Replace(prefix + prefix, prefix);
    }

The 'fixed Regex pattern' still has a bug, but I am leaving it in its current state...

(Figure out for yourself how to fix this parsing.)

Parser test

Output window:

    *** Wrong line: ***

    --- [Original Regex] ---
    '[email protected]'
    '"HEY"'
    '? Tester'
    ' 12345789'
    '"Catalysis"'
    ' LLC.'
    ' Enterprise 6 TEST'
    ' etc'
    'etc '
    'etc'

    --- [Fixed Regex] ---

    '[email protected]'
    '"HEY"? Tester'
    ' 12345789'
    'Catalysis'
    ' LLC.'
    ' Enterprise 6 TEST'
    ' etc'
    'etc '
    'etc'
    ''

    --- [Correct(?) parser] ---

    '[email protected]'
    'HEY"? Tester, 12345789,"Catalysis'
    ' LLC.'
    ' Enterprise 6 TEST'
    ' etc'
    'etc '
    'etc'


    *** Fixed line: ***

    --- [Original Regex] ---
    '[email protected]'
    '""'
    '"HEY"'
    '"? Tester"'
    ' 12345789'
    '"Catalysis"'
    ' LLC.'
    ' Enterprise 6 TEST'
    ' etc'
    'etc '
    'etc'

    --- [Fixed Regex] ---

    '[email protected]'
    '"HEY"? Tester'
    ' 12345789'
    'Catalysis'
    ' LLC.'
    ' Enterprise 6 TEST'
    ' etc'
    'etc '
    'etc'
    ''

    --- [Correct(?) parser] ---

    '[email protected]'
    '"HEY"? Tester'
    ' 12345789'
    'Catalysis'
    ' LLC.'
    ' Enterprise 6 TEST'
    ' etc'
    'etc '
    'etc'
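For comparison, the doubled-quote rule can also be handled without a regex by walking the line character by character. The following is only a minimal sketch under stated assumptions, not the method used in the answer above: the names `CsvLineSketch` and `SplitFields` are hypothetical, the input is a single record with no embedded line breaks, only double quotes are treated as quoting characters, and a quote opens a quoted field only when it is the first character of that field.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: splits one CSV record into fields, treating ""
// inside a quoted field as one literal quote character.
public static class CsvLineSketch
{
    public static List<string> SplitFields(string line, char delimiter = ',')
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c != '"')
                {
                    current.Append(c);      // delimiters are literal inside quotes
                }
                else if (i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"');    // doubled quote -> one literal quote
                    i++;                    // skip the second quote of the pair
                }
                else
                {
                    inQuotes = false;       // closing quote
                }
            }
            else if (c == '"' && current.Length == 0)
            {
                inQuotes = true;            // opening quote at the start of a field
            }
            else if (c == delimiter)
            {
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        fields.Add(current.ToString());     // last field has no trailing delimiter
        return fields;
    }
}
```

Calling `CsvLineSketch.SplitFields("a,\"\"\"HEY\"\"? Tester\",b")` yields the three fields `a`, `"HEY"? Tester` and `b`. On the malformed line from the question the sketch is lenient rather than exact: `"HEY"? Tester` comes back as `HEY? Tester`, because unescaped quotes carry no recoverable information about where the field really ends.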

7 Comments

I don't think the problem is the regex. The problem is that the CSV adds double quotes to escape the double quotes, so C# reads the value as """HEY""?TEST"
@pedrodotnet: The real problem (or one of them) is the regex. It does not allow a " character inside a value that starts with " (e.g. "the value is ""a""" will not be parsed correctly). Even in my example there are two possible parse results, and this is only one of them (IMHO, the incorrect one). But it parses the (incorrect) string the way it should be parsed according to the request.
The new regex helped a little: it no longer splits the content into different columns, but "Hey?Tester" is now parsed as "Hey\"\?Tester\"
Even in the corrected example? What is the incorrect line?
For example, with this line: [email protected],"HEY"? Tester, 12345789,"Catalysis", LLC., Enterprise 6 TEST, etc,etc ,etc the value "HEY"? Tester is returned as \"Hey\"\?Tester"