8

What's the best way of checking if StringA = StringB with an another StringC inserted at some arbitrary point?

For example, given abcdef and abcXYZdef, I want to find that abcXYZdef is abcdef with XYZ inserted at position 4.

On the other hand, given abcdef and abRSTcdXYZef, I want to find that the first string cannot be turned into the second with only a single insertion.

I know I could go over StringA character by character, from both ends, and check if it covers the whole of StringB, but that would be rather tedious to write. It would also be rather slow to do this in Python (which i am working in) and I would rather not write a special C-extension just for this.

Are there any clever things I can do with Regex's or other standard string-manipulation functions that can do this for me?

edit: To clarify, StringC is completely unknown; There may not even be a valid StringC, and i will want to know if that is the case.

5
  • 3
    It would probably help if you made your sample string shorter and easier to comprehend. Commented Aug 2, 2011 at 20:49
  • Do you really think it would be that tedious to write? Python has the nice slicing stuff for checking substrings s1[:n]==s2[:n]. It is of course not blazingly efficient, but I think it wouldn't take long to code it. Commented Aug 2, 2011 at 20:50
  • I don't know why you reject the character-by-character solution out of hand. It doesn't seem like it would be more than a few lines of code, and it would be about as fast as pure Python can be. Commented Aug 2, 2011 at 20:59
  • @mark: mainly because i'll be handling text strings maybe 100kb in size; I want something faster than pure python =D. Commented Aug 2, 2011 at 21:58
  • If you need something faster, the C/C++ implementation of character-by-character comparison will likely be really fast. But first check out my Python implementation below and see if it's fast enough. Commented Aug 2, 2011 at 22:02

6 Answers 6

6

A very underappreciated gem in the standard lib is difflib...

>>> import difflib >>> s = difflib.SequenceMatcher(None, "GHSKWITNIFSI", "GHSKWAGDITNIFSI") >>> s.get_matching_blocks()[:-1] [(0, 0, 5), (5, 8, 7)] >>> s = difflib.SequenceMatcher(None, "GHSKWITNIFSI", "GHSKWITNIFSI") >>> s.get_matching_blocks()[:-1] [(0, 0, 12)] 
Sign up to request clarification or add additional context in comments.

4 Comments

+1 for making difflib known but explaining how to interpret results would help
@neurino — The tuples each represent a matching block; the first number is the starting offset in the first sequence, the second the starting offset in the second sequence, and the last the length of the matching block.
Nice! Never knew about that library
Wow... batteries included indeed!
2

This ... feels kludgy to a degree, and it's only probably half-way there, but it seems like it found the substring in your example and could probably be expanded a bit. I can revise it some in a minute with some more time to test, but it's an approach concept:

s1 = 'GHSKWITNIFSI' s2 = 'GHSKWAGDITNIFSI' l = len(s2) - len(s1) for i in range(len(s1)): if s2[0:i] + s2[i + l:] == s1: print i break 

I don't like the use of range(len()), but in this particular use scenario I think it's appropriate. It will print the index where an insertion took place if a single insertion will turn s1 into s2.

2 Comments

why don't you like range(len())?
@krs1 - it's just typically not "pythonic" ... you usually iterate directly over an iterable, or if you must have index values you use enumerate(iterable) to make them available. Like I said though, in this scenario it's probably appropriate.
0

I don't know, but you are trying to find the "edit distance". Checking Wikipedia:

http://en.wikipedia.org/wiki/Edit_distance

You might also look at Peter Norvig's spelling corrector:

http://norvig.com/spell-correct.html

I think you could adapt code from the spelling corrector to do what you need.

Good luck.

1 Comment

also 'longest common substring'
0
def GetInsertedString(StringA, StringB): lenA = len(StringA) lenB = len(StringB) if lenA > lenB: return None, None begincount = 0 while begincount < lenA and StringA[begincount] == StringB[begincount]: begincount += 1 endcount = 0 while endcount < (lenA - begincount) and StringA[lenA-endcount-1] == StringB[lenB-endcount-1]: endcount += 1 if begincount + endcount != lenA: return None, None return begincount, StringB[begincount:begincount+lenB-lenA] >>> GetInsertedString('GHSKWITNIFSI', 'GHSKWAGDITNIFSI') (5, 'AGD') >>> GetInsertedString('GHSKWITNIFSI', 'GHSKWAGDTNIFSI') (None, None) 

Comments

0
from itertools import dropwhile def get_inserted_substring(s1, s2): try: # diff is the first index at which the strings differ diff = dropwhile(lambda i: s1[i] == s2[i], xrange(len(s2))).next() if s2[diff:].endswith(s1[diff:]): return (diff, s2[diff:diff-len(s1)]) except (StopIteration, IndexError): # the strings are the same or only differ at the end if len(s1) <= len(s2): return (len(s1), s2[len(s1):]) return (None, None) 

And examples...

>>> get_inserted_substring('abcdef', 'abcXYZdef') (3, 'XYZ') >>> get_inserted_substring('abcdef', 'abRSTcdXYZef') (None, None) >>> get_inserted_substring('abcdef', 'abcdefXYZ') (6, 'XYZ') >>> get_inserted_substring('abcdef', 'XYZabcdef') (0, 'XYZ') >>> get_inserted_substring('abcdefXYZ', 'abcdef') (None, None) 

Comments

-2
strA='foor' strB='foobar' strC='ba' if strB.replace(strC,'') == strA: print strC,' at index ',len(strB.split(strC)[0]) 

Possibly? Testing right now...

1 Comment

I don't think strC is a known value - that's what he's attempting to determine. If it were known there'd be no reason for the question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.