3

I have a space or comma separated table with two columns, each row representing equivalence of the two words.

A B B C B D C E F G 

What I want is a table with each row listing all mutually equivalent words.

A B C D E F G 

That is if two words ever occur on the same row of input they must end up in the same row of the output.

Any tool would do.

6
  • Is A B D C E also correct for the first line? Commented Oct 17, 2014 at 14:24
  • I guess you need this done recursively, right? I mean, D and E will only be in the first row after having been added by processing the previous rows. What if you have E G further down? How should that be dealt with? Also, which is it, space or comma? Commented Oct 17, 2014 at 14:30
  • 1
    why aren't F and G in the first row? Commented Oct 17, 2014 at 16:16
  • @terdon yes it has to be done recursively. Commented Oct 20, 2014 at 11:12
  • @artm Because A = B, B = C, C = D, and D = E so these all should come in one line. F & G are different so this goes to new line or another line. Kindly help guys. Commented Oct 20, 2014 at 11:13

2 Answers 2

3

In python, start with the input file as an argument:

import sys res = [] # list of lists for line in open(sys.argv[1]): try: x, y = line.split() # split on space except ValueError: line = line.rstrip() x, y = line.split(',') # retry with comma for l in res: if x in l: if y not in l: l.append(y) break else: res.append([x, y]) for line in res: print ' '.join(line) 

The test if y not in l: skips adding the same value twice, I am not sure if that is wanted, or if the source has such anomalies. You can leave out the test and always execute l.append(y).

The code tries to split on space first, then retries comma. This assumes comma separated lines have no space in them (i.e. are not A, B).

The nested for loop uses (AFAIK) a python particularity: the else is only executed if the for loop ends through exhaustion, that is not through the break statement. This means if x is not found, the pair is appended as new list to res.

6
  • yes...I am new to Stackexchange. I do not know how to say accept the answer and close the thread. Commented Oct 18, 2014 at 6:45
  • @Vinay No problem. To click just click on the smaller v like sign below the down arrow next to the number (currently 1) next to this question. Commented Oct 19, 2014 at 19:32
  • Hey @Anthon... I am in little trouble again. Seems like you script is not working properly or it is not giving the expected result. May be I was not able explain the problem properly. Should I start a new thread or can I post it here again? Commented Oct 20, 2014 at 10:41
  • @Vinay just extend/refine your question. Show clearly with what input the program goes wrong ( we need at least the input and the expected output, the erroneous output I can generate with the program ;-) Then leave me short comment here, because I get notified of that comment but not from when you edit your question, and I will look at it. Commented Oct 20, 2014 at 11:57
  • @Anthon, your script will leave a newline in the last word of a line split on commas. It also suffers from the problem my previous version had (not merging sets once created) Commented Oct 20, 2014 at 14:22
2

theory

This problem is known as partitioning a set into equivalence classes, with input file listing pairwise equivalences. It can be solved with the help of a disjoint-set data structure.

Less abstract example is e.g. partitioning words into groups of synonyms given pairs of synonyms:

large big big great great vast small little little tiny 

becomes:

large big great vast small little tiny 

ruby solution

Disjoint set isn't available in ruby standard library, so I emulate it using a ruby Hash (known elsewhere as "associative array", "dictionary", "map").

#!/usr/bin/env ruby # new elements end up in "singleton subsets" subsets = Hash.new { |subsets, element| subsets[element] = [element] } ARGF.each do |line| x, y = line.scan(/[^\s,]/) # these two emulate disjoint-set's "find" operation x_set = subsets[x] y_set = subsets[y] # and this loop implements disjoint-set's "union" y_set.each do |element, _| subsets[element] = x_set << element end unless x_set == y_set end puts subsets.values.uniq.map{|set| set.join(" ")} 

usage

this expects filenames on the command line or data on stdin:

$ ruby so-162730.rb input.txt A B C D E F G $ ruby so-162730.rb < input.txt A B C D E F G 

awk solution

Perhaps more appropriate for this site.

Here I use a slightly different implementation of disjoint-set: each subset is represented by one of its elements ("leader"). This makes union operation slower, but is easier to implement with awk's simple data types.

{ union(find($1), find($2)); } END { format_subsets(); for(i in subsets) print subsets[i]; } function find(element) { if (!leaders[element]) leaders[element] = element; return leaders[element]; } function union(leader_1, leader_2) { for(i in leaders) if (leaders[i] == leader_2) leaders[i] = leader_1; } function format_subsets() { for(element in leaders) { leader = leaders[element] subsets[leader] = (subset = subsets[leader]) ? (subset OFS element) : element; } } 

usage

$ awk -f so-162730.awk < input.txt A B C D E F G 

Or for whitespace or comma separated input:

$ awk -f so-162730.awk -F '[[:space:]]+|,' input.txt A B C D E F G 

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.