I am looking for a library that implements a pairwise ranking algorithm. For example, if I have 200 writing samples from 100 people (two samples from each individual) and I want to identify which samples belong together (i.e., were written by the same person), what library could I use?
- Do you have details about the number of samples written by a single person? Is it 200 in total or 200 each? – Hima Varsha, Jul 13, 2016 at 8:51
- It is 200 in total (i.e., two samples per person). – You_got_it, Jul 13, 2016 at 12:16
- Do you just want to match handwriting to a person? Or a ranking that gives the highest priority to the closest matches? – Hima Varsha, Jul 13, 2016 at 12:51
- Just a match. E.g., if I have person_1_writing_sample_1, person_1_writing_sample_2, person_2_writing_sample_1, and person_2_writing_sample_2, I want to match the first two and the last two. – You_got_it, Jul 13, 2016 at 13:04
- Try k-means with 100 clusters. You should be able to find a library for it in every language. – Emre, Jul 13, 2016 at 18:50
1 Answer
If you can transform those writing samples into numeric vectors (e.g., a bag-of-words or tf-idf representation), you could use the k-means or hierarchical clustering functionality in Orange, a machine learning library for Python that also has a GUI.
It also has an add-on specifically for text mining, but I cannot vouch for it as I haven't tried it yet.
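To make the vectorize-then-cluster idea concrete, here is a minimal sketch in plain Python rather than Orange, so it makes no claims about Orange's API. The function names (`tfidf_vectors`, `cosine_distance`, `agglomerative_clusters`), the whitespace tokenization, the smoothed idf formula, average linkage, and the four toy samples are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn each document into a sparse {term: weight} tf-idf vector."""
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokenizer (assumption)
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed idf, so terms that occur in every document keep a small weight.
        vectors.append({t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
                        for t, c in tf.items()})
    return vectors


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def agglomerative_clusters(vectors, n_clusters):
    """Average-linkage hierarchical clustering; returns lists of sample indices."""
    clusters = [[i] for i in range(len(vectors))]
    dist = [[cosine_distance(a, b) for b in vectors] for a in vectors]

    def linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find and merge the two closest clusters.
        a, b = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy data: samples 0 and 2 share vocabulary, as do samples 1 and 3.
samples = [
    "the cat sat on the mat with the other cat",
    "stock prices fell sharply on monday morning",
    "a cat and another cat sat on a mat",
    "monday trading saw stock prices fall again",
]
clusters = agglomerative_clusters(tfidf_vectors(samples), n_clusters=2)
print(clusters)
```

Note that neither k-means nor hierarchical clustering forces every cluster to contain exactly two samples, so with 100 clusters over 200 samples some clusters may end up with one sample or with three or more.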
- Thanks. Ultimately, I decided to go with difference metrics (Jaccard, etc.). – You_got_it, Jul 26, 2016 at 19:44
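The comment above does not say how the difference metrics were applied, so the following is only one possible sketch: compute the Jaccard distance between the word sets of every two samples, then greedily pair the closest samples so that each sample is used once. The names `jaccard_distance` and `match_pairs`, the whitespace tokenization, the greedy strategy, and the toy samples are assumptions for illustration.

```python
from itertools import combinations


def jaccard_distance(text_a, text_b):
    """1 - |A ∩ B| / |A ∪ B| over the sets of lowercase word tokens."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    if not a and not b:
        return 0.0  # two empty samples are treated as identical
    return 1.0 - len(a & b) / len(a | b)


def match_pairs(samples):
    """Greedily pair samples, closest pair first; each sample is used once."""
    candidates = sorted(
        combinations(range(len(samples)), 2),
        key=lambda p: jaccard_distance(samples[p[0]], samples[p[1]]),
    )
    used, pairs = set(), []
    for i, j in candidates:
        if i not in used and j not in used:
            pairs.append((i, j))
            used.update((i, j))
    return pairs


# Toy data: samples 0 and 2 share vocabulary, as do samples 1 and 3.
samples = [
    "the cat sat on the mat with the other cat",
    "stock prices fell sharply on monday morning",
    "a cat and another cat sat on a mat",
    "monday trading saw stock prices fall again",
]
pairs = match_pairs(samples)
print(pairs)
```

Unlike clustering, this always yields pairs, which fits the two-samples-per-person setup. Greedy pairing is not guaranteed to minimize the total distance across all pairs; a minimum-weight matching algorithm would be the stricter alternative.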