1

I have 3 csv files I want to join by first column (id column)

Each file has the same 3 columns.

Row example :

id | timestamp | Name 3792318, 2014-07-15 00:00:00, "A, B" 

When I join the 3 csv files with

join -t, <(join -t, csv1 csv2) csv3 > out.csv 

The out.csv file doesn't have the same number of columns for each row, probably because the delimiter is a comma and some rows (like in the example above) have commas in the contents of the cell.

6
  • The header is not part of your actual file right? Also, are non-delimiter commas always within quotes? Commented Jul 15, 2014 at 12:06
  • The header is for show only, & yes Commented Jul 15, 2014 at 12:16
  • Can we safely assume that 1) the 1st field will never contain a comma and 2) You only want the ids that are present in the 1st file (if an id is in file2 and not in file1, you ignore it)? Commented Jul 15, 2014 at 12:17
  • 1
    CSV is not a simple format, although there are some simple CSV files. You are much better of using a proper CSV parsing library (available e.g for Python and Perl) that transparently take care of (non-)quoted/multi-line/embedded-seperator etc. values. Commented Jul 15, 2014 at 12:22
  • terdon, the 1st column never contain a comma, all the rest might contain, and yes I ignore it (its an extreme case) Commented Jul 15, 2014 at 12:33

2 Answers 2

1

Obviously, using a csv parser would be better but if we can safely assume that

  1. The 1st field will never contain a comma;
  2. You only want the ids that are present in the 1st file (if an id is in file2 or file3 and not in file1 of you ignore it);
  3. The files are small enough to fit in your RAM.

Then this Perl approach should work:

#!/usr/bin/env perl use strict; my %f; ## Read the files while (<>) { ## remove trailing newlines chomp; ## Replace any commas within quotes with '|'. ## I am using a while loop to deal with multiple commas. while (s/\"([^"]*?),([^"]*?)\"/"$1|$2"/){} ## match the id and the rest. /^(.+?)(,.+)/; ## The keys of the %f hash are the ids ## each line with the same id is appended to ## the current value of the key in the hash. $f{$1}.=$2; } ## Print the lines foreach my $id (keys(%f)) { print "$id$f{$id}\n"; } 

Save the script above as foo.pl and run it like this:

perl foo.pl file1.csv file2.csv file3.csv 

The script above can also be written as a one-liner:

perl -lne 'while(s/\"([^"]*?),([^"]*)\"/"$1|$2"/){} /^(.+?)(,.+)/; $k{$1}.=$2; END{print "$_$k{$_}" for keys(%k)}' file1 file2 file3 
2
  • Thank you, is there a way to remove comma inside qoutes ? or replace with other symbol ? Commented Jul 15, 2014 at 12:47
  • @JimWest sure, see updated answer. I must stress however, that you would be much better off using a parser designed for this type of thing. Commented Jul 15, 2014 at 13:01
1

TXR language:

@(do (defun csv-parse (str) (let ((toks (tok-str str #/[^\s,][^,]+[^\s,]|"[^"]*"|[^\s,]/))) [mapcar (do let ((l (match-regex @1 #/".*"/))) (if (eql l (length @1)) [@1 1..-1] @1)) toks])) (defun csv-format (list) (cat-str (mapcar (do if (find #\, @1) `"@1"` @1) list) ", ")) (defun join-recs (recs-left recs-right) (append-each ((l recs-left)) (collect-each ((r recs-right)) (append l r)))) (let ((hashes (collect-each ((arg *args*)) (let ((stream (open-file arg))) [group-by first [mapcar csv-parse (gun (get-line stream))] :equal-based])))) (when hashes (let ((joined (reduce-left (op hash-isec @1 @2 join-recs) hashes))) (dohash (key recs joined) (each ((rec recs)) (put-line (csv-format rec)))))))) 

Sample data.

Note: the key 3792318 occurs twice third file, so we expect two rows in the join output for that key.

Note: The data is not required to be sorted; hashing is used for the join.

$ for x in csv* ; do echo "File $x:" ; cat $x ; done File csv1: 3792318, 2014-07-15 00:00:00, "A, B" 3792319, 2014-07-16 00:00:01, "B, C" 3792320, 2014-07-17 00:00:02, "D, E" File csv2: 3792319, 2014-07-15 00:02:00, "X, Y" 3792320, 2014-07-11 00:03:00, "S, T" 3792318, 2014-07-16 00:02:01, "W, Z" File csv3: 3792319, 2014-07-10 00:04:00, "M" 3792320, 2014-07-09 00:06:00, "N" 3792318, 2014-07-05 00:07:01, "P" 3792318, 2014-07-16 00:08:01, "Q" 

Run:

$ txr join.txr csv1 csv2 csv3 3792319, 2014-07-16 00:00:01, "B, C", 3792319, 2014-07-15 00:02:00, "X, Y", 3792319, 2014-07-10 00:04:00, M 3792318, 2014-07-15 00:00:00, "A, B", 3792318, 2014-07-16 00:02:01, "W, Z", 3792318, 2014-07-05 00:07:01, P 3792318, 2014-07-15 00:00:00, "A, B", 3792318, 2014-07-16 00:02:01, "W, Z", 3792318, 2014-07-16 00:08:01, Q 3792320, 2014-07-17 00:00:02, "D, E", 3792320, 2014-07-11 00:03:00, "S, T", 3792320, 2014-07-09 00:06:00, N 

A more "correct" csv-parse function is:

 ;; Include the comma separators as tokens; then parse the token ;; list, recognizing consecutive comma tokens as an empty field, ;; and stripping leading/trailing whitespace and quotes. (defun csv-parse (str) (labels ((clean (str) (set str (trim-str str)) (if (and (= [str 0] #\") (= [str -1] #\")) [str 1..-1] str)) (post-process (tokens) (tree-case tokens ((tok sep . rest) (if (equal tok ",") ^("" ,*(post-process (cons sep rest))) ^(,(clean tok) ,*(post-process rest)))) ((tok . rest) (if (equal tok ",") '("") ^(,(clean tok))))))) (post-process (tok-str str #/[^,]+|"[^"]*"|,/)))) 

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.