4
$\begingroup$

Given two strings $x$ and $y$ over the alphabet $\{A,C,G,T\}$, I'm trying to determine a shortest string $z$ such that $z$ is a subsequence of $x$ and not a subsequence of $y$.

Example: a shortest string that is a subsequence of AATGAAC but not a subsequence of ATCGATC is AAG.

$\endgroup$
2
  • 2
    $\begingroup$ I think the example you had made the question clearer. Why remove it? $\endgroup$ Commented May 17, 2013 at 0:19
  • $\begingroup$ Possible same on SO: stackoverflow.com/questions/12607512/… $\endgroup$ Commented Feb 23, 2015 at 6:57

1 Answer 1

5
$\begingroup$

A dynamic programming algorithm seems to work.

Given a prefix $X_k = x_1 x_2\dots x_k$ of string $X$ ($X$ is of length $s$, so $X = X_s$), a prefix $Y_m = y_1 y_2 \dots y_m$ of string $Y$ ($Y$ is of length $t$, so $Y_t = Y$), and a length $L$, we determine if there is a subsequence of $X_k$ which is also a subsequence of $Y_m$, such that the sub-sequence is of length exactly $L$ and ends at $x_k$. Call that (boolean) value: $\text{is_there}[k,m,L]$.

We will also need to know if there is a subsequence of $X_k$ which is also a subsequence of $Y_m$ such that the subsequence is of length exactly $L$ and ends at or before $x_k$. Call that boolean value $\text{is_there_any}[k,m,L]$.

These satisfy the recurrences:

$$\text{is_there}[k,m,L] = \text{is_there}[k, m-1, L] \vee (x_k = y_m \wedge \text{is_there_any}[k-1,m-1,L-1])$$

and

$$\text{is_there_any}[k, m, L] = \text{is_there}[k, m, L] \vee \text{is_there_any}[k, m-1, L]$$

The smallest $L$ such that $\text{is_there}[k, t, L] = false$ for all $k$ gives you the result. (Note that $t$ is the length of $Y$).

If the length of the shortest such subsequence is $S$, This can be implemented in $O(stS)$ time and $O(stS)$ space by a triply nested for-loop with $L$ on the outside, then $k$, then $m$.

Something like:

Compute isThere for L = 1. foreach L in (2 ... s) foreach k in (1 ... s) foreach m in (1 ... t) is_there[k,m,L] = blah is_there_any[k,m,L] = blah 
$\endgroup$
6
  • $\begingroup$ I believe you mean "the smallest $L$ for which, for all $k$, $is\_there[k, t, L] = false$. If you only require that it hold for "some" $k$, then $is\_there[k', t, L]$ could be true for some $k' \ne k$, which would break the requirements. $\endgroup$ Commented Sep 17, 2017 at 20:33
  • $\begingroup$ I think there is another problem: in the second branch of the $\land$, checking just $is\_there[k-1, m-1, L-1]$ forces $X_{m-1}$ into the subsequence, when we should allow the length-$(L-1)$ subsequence to end at any character at or before $X_{m-1}$. This could be fixed, without hurting the asymptotic running time, by defining $is\_there\_any[k, m, L] = is\_there[k, m, L] \lor is\_there\_any[k, m-1, L]$, and using $is\_there\_any[k-1, m-1, L-1]$ in the DP recurrence instead. $\endgroup$ Commented Sep 17, 2017 at 21:22
  • $\begingroup$ @j_random_hacker: I think you are right. Would you like to edit the answer as you please? $\endgroup$ Commented Sep 18, 2017 at 19:53
  • 1
    $\begingroup$ Done! (Pad, pad, pad.) $\endgroup$ Commented Sep 18, 2017 at 20:08
  • $\begingroup$ is_there does not have a halting condition. $\endgroup$ Commented Oct 29, 2017 at 19:33

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.