Finding longest common prefix

Question

I have been trying to solve a modification of the Longest Common Prefix problem. It is defined below.

Defining substring

For a string P with characters P_1, P_2, …, P_q, let us denote by P[i, j] the substring P_i, P_{i+1}, …, P_j.

Defining longest common prefix

LCP(S₁, S₂, …, S_k) is defined as the largest possible integer j such that S₁[1, j] = S₂[1, j] = … = S_k[1, j].

You are given an array of N strings, A_1, A_2, …, A_N, and an integer K. Count how many pairs of indices (i, j) exist such that 1 ≤ i ≤ j ≤ N and LCP(A_i, A_{i+1}, …, A_j) ≥ K. Print the required answer modulo 10⁹+7.

Note that K does not exceed the length of any of the N strings, i.e. K ≤ min(len(A_i)) over all i.

For example,

A = ["ab", "ac", "bc"] and K=1.
LCP(A[1, 1]) = LCP(A[2, 2]) = LCP(A[3, 3]) = 2
LCP(A[1, 2]) = LCP("ab", "ac") = 1
LCP(A[1, 3]) = LCP("ab", "ac", "bc") = 0
LCP(A[2, 3]) = LCP("ac", "bc") = 0
So the answer is 4. Return your answer modulo MOD = 1000000007.
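
To make the definition concrete, here is a minimal brute-force sketch in Python 3 that applies the definition of LCP directly to every range and reproduces the example above. The names `lcp_length` and `count_ranges_brute_force` are made up for illustration. It is far too slow for the stated constraints and is only meant as a reference to check faster solutions against.

```python
MOD = 10**9 + 7


def lcp_length(strings):
    """Return the length of the longest prefix shared by all of `strings`."""
    length = 0
    # zip(*strings) yields one tuple of characters per position and stops
    # at the end of the shortest string.
    for chars in zip(*strings):
        if len(set(chars)) != 1:
            break
        length += 1
    return length


def count_ranges_brute_force(strings, k):
    """Count pairs i <= j with LCP(strings[i..j]) >= k, modulo MOD."""
    n = len(strings)
    count = 0
    for i in range(n):
        for j in range(i, n):
            if lcp_length(strings[i:j + 1]) >= k:
                count += 1
    return count % MOD


print(count_ranges_brute_force(["ab", "ac", "bc"], 1))  # prints 4
```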

Constraints

1 ≤ sum of the lengths of all strings ≤ 5*10⁵. Strings consist of lowercase letters only.

Here is my approach:

    class Solution:
        # @param A : list of strings
        # @param B : integer
        # @return an integer
        def LCPrefix(self, A, B):
            res = 0
            for i in xrange(len(A)):
                prev = A[i]
                prevLCP = len(A[i])
                for j in xrange(i, len(A)):
                    prevLCP = self.getLCP(prev, A[j], prevLCP)
                    prev = A[j]
                    if prevLCP >= B:
                        res += 1
            return res % 1000000007

        def getLCP(self, A, B, upto):
            i = 0
            lim = min(upto, len(B))
            while i < lim:
                if A[i] != B[i]:
                    break
                i += 1
            return i

The time complexity of this algorithm is O(n^2*m), where n is the length of the list and m is the maximum length of a string.

The online judge (InterviewBit) does not accept this solution in terms of time complexity. Can anyone think of a way to improve it?

Hint: it's possible in O(n * K) without using any fancy data structures.
– Peter Taylor, Commented Jul 26, 2016 at 7:28

Peter Taylor · Accepted Answer · 2016-07-28 10:21:57Z

Note: although I hinted in a comment that there's an O(n * K) solution, I'm deliberately not going to give it to you (and I hope that by bumping the question to the front page I don't cause someone else to). In my opinion the point of challenge websites is for you to learn by doing, and you will learn more effectively by figuring it out yourself with hints than by getting the solution on a plate. However, I do have some comments on your code.

    class Solution:
        # @param A : list of strings
        # @param B : integer
        # @return an integer

It's useful to document the expected type, but it's more useful to document the meaning. Here B is the K of the problem specification, but I shouldn't have to work that out myself. The parameters would benefit from descriptive names, and the @return should outline what the integer returned means.

        def LCPrefix(self, A, B):
            res = 0
            for i in xrange(len(A)):
                prev = A[i]
                prevLCP = len(A[i])
                for j in xrange(i, len(A)):
                    prevLCP = self.getLCP(prev, A[j], prevLCP)
                    prev = A[j]

There's a possible minor optimisation here: when j == i you already know the LCP, but you calculate it again. In my opinion it would be worth duplicating the simple test if prevLCP >= B: res += 1 before the loop, and only considering j > i. As a bonus, you would no longer need to track prev because you would always be testing the common prefix of A[j-1] and A[j]. (Hint: does that give you any ideas for optimisation?)

Why the update to prevLCP? I can figure it out, but a comment (and a better name - because the value of prevLCP isn't the previous LCP) would be useful.

                    if prevLCP >= B:
                        res += 1

An else: break clause here would not change the asymptotic complexity of your solution, but it would surely make it much faster in many test cases.
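
As a sketch of how the points so far might fit together (descriptive names, counting the j == i case before the inner loop, comparing only neighbouring strings, and breaking early), here is one possible rewrite. It assumes Python 3 (so `range` rather than `xrange`), and the name `count_ranges` and its parameter names are hypothetical. It is still quadratic in the worst case, so it is not the O(n * K) solution hinted at.

```python
MOD = 10**9 + 7


def count_ranges(strings, min_prefix_len):
    """Count pairs i <= j whose strings strings[i..j] share a prefix of at
    least min_prefix_len characters; the result is reduced modulo MOD."""
    count = 0
    n = len(strings)
    for i in range(n):
        # The LCP of the one-element range strings[i..i] is the string's length.
        range_lcp = len(strings[i])
        if range_lcp < min_prefix_len:
            continue
        count += 1
        for j in range(i + 1, n):
            # The LCP of strings[i..j] cannot exceed that of strings[i..j-1],
            # so compare the two neighbours only up to the current bound.
            limit = min(range_lcp, len(strings[j]))
            matched = 0
            while matched < limit and strings[j - 1][matched] == strings[j][matched]:
                matched += 1
            range_lcp = matched
            if range_lcp < min_prefix_len:
                break  # every longer range starting at i fails as well
            count += 1
    return count % MOD
```

The early `break` plays the role of the `else: break` clause: once a range starting at `i` falls below the threshold, extending it cannot bring the LCP back up.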

            return res % 1000000007

I'll give you one more hint about improving the complexity: the fact that they want the answer modulo 10⁹+7 is a really big clue that you shouldn't compute it solely via res += 1.

        def getLCP(self, A, B, upto):

Document the parameters, and think about the names. getLCP implies to me that the return value should be the longest common prefix: i.e. a string. But it's actually an integer, and it might not even be the length of the LCP because of upto.

            i = 0
            lim = min(upto, len(B))
            while i < lim:
                if A[i] != B[i]:
                    break
                i += 1
            return i

In the other method you used xrange: why not here?

            lim = min(upto, len(B))
            for i in xrange(lim):
                if A[i] != B[i]:
                    return i
            return lim
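
For illustration, a documented standalone version of this helper might look like the sketch below. The name `common_prefix_length` and the parameter names are only suggestions, and it uses Python 3's `range` in place of `xrange`. Unlike the original, it also caps the limit at `len(a)`, an added safeguard so that it is safe to call on its own.

```python
def common_prefix_length(a, b, limit):
    """Return the length of the common prefix of a and b, capped at limit.

    a, b  -- the strings to compare
    limit -- upper bound on the result; characters at index >= limit are
             never inspected
    """
    # Assumption: capping at len(a) as well is an addition to the original.
    limit = min(limit, len(a), len(b))
    for i in range(limit):
        if a[i] != b[i]:
            return i
    return limit
```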

You could put it in a spoiler block. If you have the answer, give it (link to it or point to it), or better, remove that part of the answer so as not to provoke someone into posting it (if it really exists).
– MolbOrg, Commented Jul 28, 2016 at 11:17

Oscar Smith · Accepted Answer · 2016-07-26 03:30:58Z

My guess is that you will want to use a DAWG (directed acyclic word graph) or a similar structure. It will take a while to build up front, but it will greatly speed up prefix searches.

While a recommendation for a new way is always great, it isn't really a review. Please either review the code, optionally including this recommendation, or leave this recommendation as a comment on the question and delete this answer.
– zondo, Commented Jul 27, 2016 at 0:16
