TXR language:
@(do
   ;; Split a line into fields; strip the quotes from fully quoted fields.
   (defun csv-parse (str)
     (let ((toks (tok-str str #/[^\s,][^,]+[^\s,]|"[^"]*"|[^\s,]/)))
       [mapcar (do let ((l (match-regex @1 #/".*"/)))
                     (if (eql l (length @1))
                       [@1 1..-1] @1)) toks]))

   ;; Join fields with ", ", quoting any field that contains a comma.
   (defun csv-format (list)
     (cat-str (mapcar (do if (find #\, @1) `"@1"` @1) list) ", "))

   ;; Pair every left record with every right record sharing a key.
   (defun join-recs (recs-left recs-right)
     (append-each ((l recs-left))
       (collect-each ((r recs-right))
         (append l r))))

   ;; Build one hash per file, keyed on the first field, then
   ;; intersect the hashes, combining record lists with join-recs.
   (let ((hashes (collect-each ((arg *args*))
                   (let ((stream (open-file arg)))
                     [group-by first [mapcar csv-parse (gun (get-line stream))] :equal-based]))))
     (when hashes
       (let ((joined (reduce-left (op hash-isec @1 @2 join-recs) hashes)))
         (dohash (key recs joined)
           (each ((rec recs))
             (put-line (csv-format rec))))))))
Sample data.
Note: the key 3792318 occurs twice in the third file, so we expect two rows in the join output for that key.
Note: The data is not required to be sorted; hashing is used for the join.
$ for x in csv* ; do echo "File $x:" ; cat $x ; done
File csv1:
3792318, 2014-07-15 00:00:00, "A, B"
3792319, 2014-07-16 00:00:01, "B, C"
3792320, 2014-07-17 00:00:02, "D, E"
File csv2:
3792319, 2014-07-15 00:02:00, "X, Y"
3792320, 2014-07-11 00:03:00, "S, T"
3792318, 2014-07-16 00:02:01, "W, Z"
File csv3:
3792319, 2014-07-10 00:04:00, "M"
3792320, 2014-07-09 00:06:00, "N"
3792318, 2014-07-05 00:07:01, "P"
3792318, 2014-07-16 00:08:01, "Q"
Run:
$ txr join.txr csv1 csv2 csv3
3792319, 2014-07-16 00:00:01, "B, C", 3792319, 2014-07-15 00:02:00, "X, Y", 3792319, 2014-07-10 00:04:00, M
3792318, 2014-07-15 00:00:00, "A, B", 3792318, 2014-07-16 00:02:01, "W, Z", 3792318, 2014-07-05 00:07:01, P
3792318, 2014-07-15 00:00:00, "A, B", 3792318, 2014-07-16 00:02:01, "W, Z", 3792318, 2014-07-16 00:08:01, Q
3792320, 2014-07-17 00:00:02, "D, E", 3792320, 2014-07-11 00:03:00, "S, T", 3792320, 2014-07-09 00:06:00, N
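The hash-based join and the duplicate-key behavior can be seen in isolation with a small in-memory sketch. It reuses csv-format and join-recs from the program, together with the same group-by and hash-isec calls; the two record lists are hypothetical stand-ins for parsed files, not part of the sample data.

```txr
@(do
   ;; csv-format and join-recs are copied from the program above.
   (defun csv-format (list)
     (cat-str (mapcar (do if (find #\, @1) `"@1"` @1) list) ", "))

   (defun join-recs (recs-left recs-right)
     (append-each ((l recs-left))
       (collect-each ((r recs-right))
         (append l r))))

   ;; Hypothetical in-memory records standing in for two parsed files.
   ;; Key "1" occurs twice on the right; keys "2" and "3" have no partner.
   (let* ((left '(("1" "A, B") ("2" "C")))
          (right '(("1" "P") ("1" "Q") ("3" "R")))
          (hl [group-by first left :equal-based])
          (hr [group-by first right :equal-based])
          ;; Keep only the keys present in both hashes; join-recs
          ;; combines the two record lists stored under each such key.
          (joined (hash-isec hl hr join-recs)))
     (dohash (key recs joined)
       (each ((rec recs))
         (put-line (csv-format rec))))))
```

Under these assumptions the sketch should print two rows, `1, "A, B", 1, P` and `1, "A, B", 1, Q`: one per right-hand record with key "1". Keys "2" and "3" drop out because each appears on only one side. The single-letter fields come out unquoted for the same reason M, P, Q and N do in the output above: csv-format only quotes fields that contain a comma.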
A more "correct" csv-parse function is:
;; Include the comma separators as tokens; then parse the token
;; list, recognizing consecutive comma tokens as an empty field,
;; and stripping leading/trailing whitespace and quotes.
(defun csv-parse (str)
  (labels ((clean (str)
             (set str (trim-str str))
             (if (and (= [str 0] #\")
                      (= [str -1] #\"))
               [str 1..-1]
               str))
           (post-process (tokens)
             (tree-case tokens
               ((tok sep . rest)
                (if (equal tok ",")
                  ^("" ,*(post-process (cons sep rest)))
                  ^(,(clean tok) ,*(post-process rest))))
               ((tok . rest)
                (if (equal tok ",")
                  '("")
                  ^(,(clean tok)))))))
    (post-process (tok-str str #/[^,]+|"[^"]*"|,/))))
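To see what post-process receives, the tokenizing step can be run on its own. The line below is a minimal sketch using a hypothetical input string with one empty field; like the program above, it would sit inside `@(do ...)` in a .txr file.

```txr
;; Tokenize a hypothetical line with the same regex as the
;; csv-parse above; comma separators are kept as tokens.
(prinl (tok-str "a,,\"b, c\", d" #/[^,]+|"[^"]*"|,/))
```

If tokenizing behaves as the comments describe, the two adjacent commas show up as two consecutive "," tokens, which post-process turns into an empty field. The quoted field should arrive as a single token with its quotes still attached, and the last token should keep its leading space; clean removes both.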