Merge CSV files with field delimiters also occuring inside quotes

Question

I have 3 csv files I want to join by first column (id column)

Each file has the same 3 columns.

Row example :

id | timestamp | Name 3792318, 2014-07-15 00:00:00, "A, B"

When I join the 3 csv files with

join -t, <(join -t, csv1 csv2) csv3 > out.csv

The out.csv file doesn't have the same number of columns for each row, probably because the delimiter is a comma and some rows (like in the example above) have commas in the contents of the cell.

The header is not part of your actual file right? Also, are non-delimiter commas always within quotes? — terdon
– terdon ♦, Commented Jul 15, 2014 at 12:06
Can we safely assume that 1) the 1st field will never contain a comma and 2) You only want the ids that are present in the 1st file (if an id is in file2 and not in file1, you ignore it)? — terdon
– terdon ♦, Commented Jul 15, 2014 at 12:17
CSV is not a simple format, although there are some simple CSV files. You are much better of using a proper CSV parsing library (available e.g for Python and Perl) that transparently take care of (non-)quoted/multi-line/embedded-seperator etc. values. — Anthon
– Anthon, Commented Jul 15, 2014 at 12:22
terdon, the 1st column never contain a comma, all the rest might contain, and yes I ignore it (its an extreme case) — James W.
– James W., Commented Jul 15, 2014 at 12:33

terdon · Accepted Answer · 2014-07-15 13:01:38Z

Obviously, using a csv parser would be better but if we can safely assume that

The 1st field will never contain a comma;
You only want the ids that are present in the 1st file (if an id is in file2 or file3 and not in file1 of you ignore it);
The files are small enough to fit in your RAM.

Then this Perl approach should work:

#!/usr/bin/env perl use strict; my %f; ## Read the files while (<>) { ## remove trailing newlines chomp; ## Replace any commas within quotes with '|'. ## I am using a while loop to deal with multiple commas. while (s/\"([^"]*?),([^"]*?)\"/"$1|$2"/){} ## match the id and the rest. /^(.+?)(,.+)/; ## The keys of the %f hash are the ids ## each line with the same id is appended to ## the current value of the key in the hash. $f{$1}.=$2; } ## Print the lines foreach my $id (keys(%f)) { print "$id$f{$id}\n"; }

Save the script above as foo.pl and run it like this:

perl foo.pl file1.csv file2.csv file3.csv

The script above can also be written as a one-liner:

perl -lne 'while(s/\"([^"]*?),([^"]*)\"/"$1|$2"/){} /^(.+?)(,.+)/; $k{$1}.=$2; END{print "$_$k{$_}" for keys(%k)}' file1 file2 file3

Thank you, is there a way to remove comma inside qoutes ? or replace with other symbol ? — James W.
– James W., Commented Jul 15, 2014 at 12:47
@JimWest sure, see updated answer. I must stress however, that you would be much better off using a parser designed for this type of thing. — terdon
– terdon ♦, Commented Jul 15, 2014 at 13:01

Kaz · Accepted Answer · 2014-07-15 17:44:02Z

TXR language:

@(do (defun csv-parse (str) (let ((toks (tok-str str #/[^\s,][^,]+[^\s,]|"[^"]*"|[^\s,]/))) [mapcar (do let ((l (match-regex @1 #/".*"/))) (if (eql l (length @1)) [@1 1..-1] @1)) toks])) (defun csv-format (list) (cat-str (mapcar (do if (find #\, @1) `"@1"` @1) list) ", ")) (defun join-recs (recs-left recs-right) (append-each ((l recs-left)) (collect-each ((r recs-right)) (append l r)))) (let ((hashes (collect-each ((arg *args*)) (let ((stream (open-file arg))) [group-by first [mapcar csv-parse (gun (get-line stream))] :equal-based])))) (when hashes (let ((joined (reduce-left (op hash-isec @1 @2 join-recs) hashes))) (dohash (key recs joined) (each ((rec recs)) (put-line (csv-format rec))))))))

Sample data.

Note: the key 3792318 occurs twice third file, so we expect two rows in the join output for that key.

Note: The data is not required to be sorted; hashing is used for the join.

$ for x in csv* ; do echo "File $x:" ; cat $x ; done File csv1: 3792318, 2014-07-15 00:00:00, "A, B" 3792319, 2014-07-16 00:00:01, "B, C" 3792320, 2014-07-17 00:00:02, "D, E" File csv2: 3792319, 2014-07-15 00:02:00, "X, Y" 3792320, 2014-07-11 00:03:00, "S, T" 3792318, 2014-07-16 00:02:01, "W, Z" File csv3: 3792319, 2014-07-10 00:04:00, "M" 3792320, 2014-07-09 00:06:00, "N" 3792318, 2014-07-05 00:07:01, "P" 3792318, 2014-07-16 00:08:01, "Q"

Run:

$ txr join.txr csv1 csv2 csv3 3792319, 2014-07-16 00:00:01, "B, C", 3792319, 2014-07-15 00:02:00, "X, Y", 3792319, 2014-07-10 00:04:00, M 3792318, 2014-07-15 00:00:00, "A, B", 3792318, 2014-07-16 00:02:01, "W, Z", 3792318, 2014-07-05 00:07:01, P 3792318, 2014-07-15 00:00:00, "A, B", 3792318, 2014-07-16 00:02:01, "W, Z", 3792318, 2014-07-16 00:08:01, Q 3792320, 2014-07-17 00:00:02, "D, E", 3792320, 2014-07-11 00:03:00, "S, T", 3792320, 2014-07-09 00:06:00, N

A more "correct" csv-parse function is:

 ;; Include the comma separators as tokens; then parse the token ;; list, recognizing consecutive comma tokens as an empty field, ;; and stripping leading/trailing whitespace and quotes. (defun csv-parse (str) (labels ((clean (str) (set str (trim-str str)) (if (and (= [str 0] #\") (= [str -1] #\")) [str 1..-1] str)) (post-process (tokens) (tree-case tokens ((tok sep . rest) (if (equal tok ",") ^("" ,*(post-process (cons sep rest))) ^(,(clean tok) ,*(post-process rest)))) ((tok . rest) (if (equal tok ",") '("") ^(,(clean tok))))))) (post-process (tok-str str #/[^,]+|"[^"]*"|,/))))

Stack Exchange Network

Merge CSV files with field delimiters also occuring inside quotes

2 Answers 2

You must log in to answer this question.

Hot Network Questions

Merge CSV files with field delimiters also occuring inside quotes

2 Answers 2

You must log in to answer this question.

Related

Hot Network Questions