11
$\begingroup$

I need to find how well several different lists match a reference list. I'm looking for a percentage or some kind of similarity score.

For example,

a = {"A278", "G279", "S280", "G281", "I282", "I283", "I284", "S285", "D286", "T287", "P288", "V289", "H290", "D291", "C292"};
b = {"S280", "G281", "I282", "I284"};
c = {"C275", "S276", "T277", "A278", "G279"};

How can I determine that b is a better match against a than c? a is the reference list.

Order matters.

After looking through the documentation, the only way I can think of doing this is to iterate through b and c and test if each element is MemberQ of a, tallying up the total and comparing the totals at the end. Is there a better approach?
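For reference, the tally described above might be sketched as follows. The helper name memberTally is hypothetical, and the count only tests membership, so it does not take order into account.

```mathematica
a = {"A278", "G279", "S280", "G281", "I282", "I283", "I284", "S285",
   "D286", "T287", "P288", "V289", "H290", "D291", "C292"};
b = {"S280", "G281", "I282", "I284"};
c = {"C275", "S276", "T277", "A278", "G279"};

(* hypothetical helper: count the elements of test that also occur in ref *)
memberTally[ref_List, test_List] := Count[test, x_ /; MemberQ[ref, x]]

memberTally[a, #] & /@ {b, c}
(* {4, 2} *)
```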

$\endgroup$
3
  • 2
    $\begingroup$ You might consider looking through the whole bunch of *Distance[]/*Dissimilarity[] functions available. SequenceAlignment[] might also be of use. $\endgroup$ Commented Oct 8, 2018 at 4:24
  • 1
    $\begingroup$ Testing as described in the last paragraph of the question does not take account of order. If order actually does not matter, consider Complement. $\endgroup$ Commented Oct 8, 2018 at 4:33
  • $\begingroup$ Working on Complement now, it seems promising @bbgodfrey $\endgroup$ Commented Oct 8, 2018 at 4:37
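
As one possible reading of the distance-function suggestion in the comments, EditDistance can be applied to the lists directly, on the assumption that a smaller edit distance to the reference means a closer, order-aware match. This is only a sketch; the choice of EditDistance over the other distance functions is an assumption, not something the commenters specified.

```mathematica
a = {"A278", "G279", "S280", "G281", "I282", "I283", "I284", "S285",
   "D286", "T287", "P288", "V289", "H290", "D291", "C292"};
b = {"S280", "G281", "I282", "I284"};
c = {"C275", "S276", "T277", "A278", "G279"};

(* number of one-element insertions, deletions and substitutions
   needed to turn the reference into each candidate *)
EditDistance[a, #] & /@ {b, c}

(* the candidate(s) with the smallest distance to the reference *)
MinimalBy[{b, c}, EditDistance[a, #] &]
```

Because the lists differ greatly in length, most of the distance comes from deletions, so it may be worth normalizing before comparing candidates of very different sizes.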

7 Answers

10
$\begingroup$

Maybe:

LongestCommonSequence[a, b] 

{"S280", "G281", "I282", "I284"}

LongestCommonSequence[a, c] 

{"A278", "G279"}

Length@LongestCommonSequence[a, #] & /@ {b, c} 

{4, 2}
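
To turn these lengths into a similarity score, one option is to divide by the length of the list being tested. The helper name lcsScore and the choice of denominator are assumptions; dividing by Length[ref] instead would reward longer matches.

```mathematica
a = {"A278", "G279", "S280", "G281", "I282", "I283", "I284", "S285",
   "D286", "T287", "P288", "V289", "H290", "D291", "C292"};
b = {"S280", "G281", "I282", "I284"};
c = {"C275", "S276", "T277", "A278", "G279"};

(* hypothetical helper: fraction of test that appears, in order, within ref *)
lcsScore[ref_List, test_List] :=
  Length[LongestCommonSequence[ref, test]]/Length[test]

lcsScore[a, #] & /@ {b, c}
(* {1, 2/5} *)

(* if gaps should not be allowed, the contiguous variant can be used instead *)
LongestCommonSubsequence[a, b]
(* {"S280", "G281", "I282"} *)
```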

$\endgroup$
9
$\begingroup$
MaximalBy[Length[a⋂#]&]@{b,c} 

{{"S280", "G281", "I282", "I284"}}

MinimalBy[Length@Complement[a,#]&]@{b,c} 

{{"S280", "G281", "I282", "I284"}}
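
If a numeric score is wanted rather than just the winner, the same set operations can be combined into a Jaccard-style ratio (size of the intersection over size of the union). The helper name jaccard is hypothetical, and like the expressions above it ignores order and duplicates.

```mathematica
a = {"A278", "G279", "S280", "G281", "I282", "I283", "I284", "S285",
   "D286", "T287", "P288", "V289", "H290", "D291", "C292"};
b = {"S280", "G281", "I282", "I284"};
c = {"C275", "S276", "T277", "A278", "G279"};

(* hypothetical helper: shared elements divided by all distinct elements *)
jaccard[ref_List, test_List] :=
  Length[Intersection[ref, test]]/Length[Union[ref, test]]

jaccard[a, #] & /@ {b, c}
(* {4/15, 1/9} *)
```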

$\endgroup$
2
  • $\begingroup$ +1 for conciseness, this is better than my answer $\endgroup$ Commented Oct 8, 2018 at 5:21
  • $\begingroup$ @briennakh, yours is probably faster. $\endgroup$ Commented Oct 8, 2018 at 5:22
6
$\begingroup$

Suppose I have the reference list a and a list otherLists containing all the other lists I want to compare against a:

otherLists[[Ordering[Length[#] & /@ (Complement[a, #] & /@ otherLists), 1]]] 

This returns a one-element list containing the list that best matches a, that is, the one that leaves the fewest elements of a unmatched.

$\endgroup$
6
  • 1
    $\begingroup$ you can use just Length instead of Length[#] &. $\endgroup$ Commented Oct 8, 2018 at 5:28
  • $\begingroup$ Please don't edit my answer @MarcoB — Write a comment. $\endgroup$ Commented Feb 20, 2019 at 18:18
  • $\begingroup$ I rolled back my changes. @kglr 's comment is suggesting the same change. Is there a reason you prefer to retain your version? $\endgroup$ Commented Feb 20, 2019 at 18:21
  • $\begingroup$ My answer works. If you want to optimize my answer, you can add your suggestion as a comment or upvote kglr's comment. Thank you. @MarcoB $\endgroup$ Commented Feb 20, 2019 at 18:23
  • $\begingroup$ @briennakh It certainly works, but the usage of Length[#]& where Length would suffice is unnecessary. This site is collaboratively edited, so I made a change that, in my opinion, improved this answer for future readers. Anyway, I will leave it as it was; perhaps you might consider making a change yourself. $\endgroup$ Commented Feb 20, 2019 at 18:33
5
$\begingroup$

Going off your percent similarity idea, maybe something like

listsim[ref_, test_] := {#, 100. (1 - Length@Complement[ref, #]/Length@ref)} & /@ test
listsim[a, {a, b, c}]

{{{A278,G279,S280,G281,I282,I283,I284,S285,D286,T287,P288,V289,H290,D291,C292},100.},
{{S280,G281,I282,I284},26.6667},
{{C275,S276,T277,A278,G279},13.3333}}

which ended up being in a similar vein to your answer.

$\endgroup$
1
  • $\begingroup$ I like this!! Thanks! $\endgroup$ Commented Oct 8, 2018 at 5:18
3
$\begingroup$

Using SequenceAlignment:

a = {"A278", "G279", "S280", "G281", "I282", "I283", "I284", "S285", "D286", "T287", "P288", "V289", "H290", "D291", "C292"};
b = {"S280", "G281", "I282", "I284"};
c = {"C275", "S276", "T277", "A278", "G279"};

SequenceAlignment[a, b] // Select[VectorQ] // MaximalBy[Length]

{{"S280", "G281", "I282"}}

SequenceAlignment[a, c] // Select[VectorQ] // MaximalBy[Length] 

{{"A278", "G279"}}

$\endgroup$
3
$\begingroup$
a = {"A278", "G279", "S280", "G281", "I282", "I283", "I284", "S285", "D286", "T287", "P288", "V289", "H290", "D291", "C292"};
b = {"S280", "G281", "I282", "I284"};
c = {"C275", "S276", "T277", "A278", "G279"};

Using DeleteCases:

DeleteCases[a, Except[Alternatives @@ b]] 

{"S280", "G281", "I282", "I284"}

DeleteCases[a, Except[Alternatives @@ c]] 

{"A278", "G279"}

Length@DeleteCases[a, Except[Alternatives @@ #]] & /@ {b, c} 

{4, 2}

$\endgroup$
2
$\begingroup$
a = {"A278", "G279", "S280", "G281", "I282", "I283", "I284", "S285", "D286", "T287", "P288", "V289", "H290", "D291", "C292"};
b = {"S280", "G281", "I282", "I284"};
c = {"C275", "S276", "T277", "A278", "G279"};

Using SymmetricDifference (new in 13.1) and TakeSmallestBy (new in 10.1)

Extract[{b, c}, TakeSmallestBy[ Map[SymmetricDifference[a, #] &, {b, c}] -> {"Index"}, Length, 1]] 

{{"S280", "G281", "I282", "I284"}}

$\endgroup$
