theory
This problem is known as partitioning a set into equivalence classes, with input file listing pairwise equivalences. It can be solved with the help of a disjoint-set data structure.
Less abstract example is e.g. partitioning words into groups of synonyms given pairs of synonyms:
large big big great great vast small little little tiny
becomes:
large big great vast small little tiny
ruby solution
Disjoint set isn't available in ruby standard library, so I emulate it using a ruby Hash (known elsewhere as "associative array", "dictionary", "map").
#!/usr/bin/env ruby # new elements end up in "singleton subsets" subsets = Hash.new { |subsets, element| subsets[element] = [element] } ARGF.each do |line| x, y = line.scan(/[^\s,]/) # these two emulate disjoint-set's "find" operation x_set = subsets[x] y_set = subsets[y] # and this loop implements disjoint-set's "union" y_set.each do |element, _| subsets[element] = x_set << element end unless x_set == y_set end puts subsets.values.uniq.map{|set| set.join(" ")}
usage
this expects filenames on the command line or data on stdin:
$ ruby so-162730.rb input.txt A B C D E F G $ ruby so-162730.rb < input.txt A B C D E F G
awk solution
Perhaps more appropriate for this site.
Here I use a slightly different implementation of disjoint-set: each subset is represented by one of its elements ("leader"). This makes union operation slower, but is easier to implement with awk's simple data types.
{ union(find($1), find($2)); } END { format_subsets(); for(i in subsets) print subsets[i]; } function find(element) { if (!leaders[element]) leaders[element] = element; return leaders[element]; } function union(leader_1, leader_2) { for(i in leaders) if (leaders[i] == leader_2) leaders[i] = leader_1; } function format_subsets() { for(element in leaders) { leader = leaders[element] subsets[leader] = (subset = subsets[leader]) ? (subset OFS element) : element; } }
usage
$ awk -f so-162730.awk < input.txt A B C D E F G
Or for whitespace or comma separated input:
$ awk -f so-162730.awk -F '[[:space:]]+|,' input.txt A B C D E F G
A B D C Ealso correct for the first line?DandEwill only be in the first row after having been added by processing the previous rows. What if you haveE Gfurther down? How should that be dealt with? Also, which is it, space or comma?