How can I partition a set of words given pairs that must end up together?

Question

I have a space or comma separated table with two columns, each row representing equivalence of the two words.

A B B C B D C E F G

What I want is a table with each row listing all mutually equivalent words.

A B C D E F G

That is if two words ever occur on the same row of input they must end up in the same row of the output.

Any tool would do.

I guess you need this done recursively, right? I mean, D and E will only be in the first row after having been added by processing the previous rows. What if you have E G further down? How should that be dealt with? Also, which is it, space or comma? — terdon
– terdon ♦, Commented Oct 17, 2014 at 14:30
@artm Because A = B, B = C, C = D, and D = E so these all should come in one line. F & G are different so this goes to new line or another line. Kindly help guys. — Vinay
– Vinay, Commented Oct 20, 2014 at 11:13

Jeff Schaller · Accepted Answer · 2018-03-06 22:53:22Z

In python, start with the input file as an argument:

import sys res = [] # list of lists for line in open(sys.argv[1]): try: x, y = line.split() # split on space except ValueError: line = line.rstrip() x, y = line.split(',') # retry with comma for l in res: if x in l: if y not in l: l.append(y) break else: res.append([x, y]) for line in res: print ' '.join(line)

The test if y not in l: skips adding the same value twice, I am not sure if that is wanted, or if the source has such anomalies. You can leave out the test and always execute l.append(y).

The code tries to split on space first, then retries comma. This assumes comma separated lines have no space in them (i.e. are not A, B).

The nested for loop uses (AFAIK) a python particularity: the else is only executed if the for loop ends through exhaustion, that is not through the break statement. This means if x is not found, the pair is appended as new list to res.

yes...I am new to Stackexchange. I do not know how to say accept the answer and close the thread. — Vinay
– Vinay, Commented Oct 18, 2014 at 6:45
@Vinay No problem. To click just click on the smaller v like sign below the down arrow next to the number (currently 1) next to this question. — Anthon
– Anthon, Commented Oct 19, 2014 at 19:32
Hey @Anthon... I am in little trouble again. Seems like you script is not working properly or it is not giving the expected result. May be I was not able explain the problem properly. Should I start a new thread or can I post it here again? — Vinay
– Vinay, Commented Oct 20, 2014 at 10:41
@Vinay just extend/refine your question. Show clearly with what input the program goes wrong ( we need at least the input and the expected output, the erroneous output I can generate with the program ;-) Then leave me short comment here, because I get notified of that comment but not from when you edit your question, and I will look at it. — Anthon
– Anthon, Commented Oct 20, 2014 at 11:57
@Anthon, your script will leave a newline in the last word of a line split on commas. It also suffers from the problem my previous version had (not merging sets once created) — artm
– artm, Commented Oct 20, 2014 at 14:22

artm · Accepted Answer · 2014-10-21 06:53:49Z

theory

This problem is known as partitioning a set into equivalence classes, with input file listing pairwise equivalences. It can be solved with the help of a disjoint-set data structure.

Less abstract example is e.g. partitioning words into groups of synonyms given pairs of synonyms:

large big big great great vast small little little tiny

becomes:

large big great vast small little tiny

ruby solution

Disjoint set isn't available in ruby standard library, so I emulate it using a ruby Hash (known elsewhere as "associative array", "dictionary", "map").

#!/usr/bin/env ruby # new elements end up in "singleton subsets" subsets = Hash.new { |subsets, element| subsets[element] = [element] } ARGF.each do |line| x, y = line.scan(/[^\s,]/) # these two emulate disjoint-set's "find" operation x_set = subsets[x] y_set = subsets[y] # and this loop implements disjoint-set's "union" y_set.each do |element, _| subsets[element] = x_set << element end unless x_set == y_set end puts subsets.values.uniq.map{|set| set.join(" ")}

usage

this expects filenames on the command line or data on stdin:

$ ruby so-162730.rb input.txt A B C D E F G $ ruby so-162730.rb < input.txt A B C D E F G

awk solution

Perhaps more appropriate for this site.

Here I use a slightly different implementation of disjoint-set: each subset is represented by one of its elements ("leader"). This makes union operation slower, but is easier to implement with awk's simple data types.

{ union(find($1), find($2)); } END { format_subsets(); for(i in subsets) print subsets[i]; } function find(element) { if (!leaders[element]) leaders[element] = element; return leaders[element]; } function union(leader_1, leader_2) { for(i in leaders) if (leaders[i] == leader_2) leaders[i] = leader_1; } function format_subsets() { for(element in leaders) { leader = leaders[element] subsets[leader] = (subset = subsets[leader]) ? (subset OFS element) : element; } }

usage

$ awk -f so-162730.awk < input.txt A B C D E F G

Or for whitespace or comma separated input:

$ awk -f so-162730.awk -F '[[:space:]]+|,' input.txt A B C D E F G

Stack Exchange Network

How can I partition a set of words given pairs that must end up together?

2 Answers 2

theory

ruby solution

usage

awk solution

usage

You must log in to answer this question.

Hot Network Questions

How can I partition a set of words given pairs that must end up together?

2 Answers 2

theory

ruby solution

usage

awk solution

usage

You must log in to answer this question.

Related

Hot Network Questions