Finding an Insertion in a String

Question

What's the best way of checking if StringA = StringB with an another StringC inserted at some arbitrary point?

For example, given abcdef and abcXYZdef, I want to find that abcXYZdef is abcdef with XYZ inserted at position 4.

On the other hand, given abcdef and abRSTcdXYZef, I want to find that the first string cannot be turned into the second with only a single insertion.

I know I could go over StringA character by character, from both ends, and check if it covers the whole of StringB, but that would be rather tedious to write. It would also be rather slow to do this in Python (which i am working in) and I would rather not write a special C-extension just for this.

Are there any clever things I can do with Regex's or other standard string-manipulation functions that can do this for me?

edit: To clarify, StringC is completely unknown; There may not even be a valid StringC, and i will want to know if that is the case.

It would probably help if you made your sample string shorter and easier to comprehend. — Paul Sasik
– Paul Sasik, Commented Aug 2, 2011 at 20:49
Do you really think it would be that tedious to write? Python has the nice slicing stuff for checking substrings s1[:n]==s2[:n]. It is of course not blazingly efficient, but I think it wouldn't take long to code it. — phimuemue
– phimuemue, Commented Aug 2, 2011 at 20:50
I don't know why you reject the character-by-character solution out of hand. It doesn't seem like it would be more than a few lines of code, and it would be about as fast as pure Python can be. — Mark Ransom
– Mark Ransom, Commented Aug 2, 2011 at 20:59
@mark: mainly because i'll be handling text strings maybe 100kb in size; I want something faster than pure python =D. — Li Haoyi
– Li Haoyi, Commented Aug 2, 2011 at 21:58
If you need something faster, the C/C++ implementation of character-by-character comparison will likely be really fast. But first check out my Python implementation below and see if it's fast enough. — Mark Ransom
– Mark Ransom, Commented Aug 2, 2011 at 22:02

pyroscope · Accepted Answer · 2011-08-02 20:58:41Z

6

A very underappreciated gem in the standard lib is difflib...

>>> import difflib >>> s = difflib.SequenceMatcher(None, "GHSKWITNIFSI", "GHSKWAGDITNIFSI") >>> s.get_matching_blocks()[:-1] [(0, 0, 5), (5, 8, 7)] >>> s = difflib.SequenceMatcher(None, "GHSKWITNIFSI", "GHSKWITNIFSI") >>> s.get_matching_blocks()[:-1] [(0, 0, 12)]

answered Aug 2, 2011 at 20:58

pyroscope

4,1581 gold badge20 silver badges13 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

neurino Over a year ago

+1 for making difflib known but explaining how to interpret results would help

Ben Blank Over a year ago

@neurino — The tuples each represent a matching block; the first number is the starting offset in the first sequence, the second the starting offset in the second sequence, and the last the length of the matching block.

Li Haoyi Over a year ago

Nice! Never knew about that library

steveha Over a year ago

Wow... batteries included indeed!

g.d.d.c · Accepted Answer · 2011-08-02 20:57:03Z

This ... feels kludgy to a degree, and it's only probably half-way there, but it seems like it found the substring in your example and could probably be expanded a bit. I can revise it some in a minute with some more time to test, but it's an approach concept:

s1 = 'GHSKWITNIFSI' s2 = 'GHSKWAGDITNIFSI' l = len(s2) - len(s1) for i in range(len(s1)): if s2[0:i] + s2[i + l:] == s1: print i break

I don't like the use of range(len()), but in this particular use scenario I think it's appropriate. It will print the index where an insertion took place if a single insertion will turn s1 into s2.

@krs1 - it's just typically not "pythonic" ... you usually iterate directly over an iterable, or if you must have index values you use enumerate(iterable) to make them available. Like I said though, in this scenario it's probably appropriate.

steveha · Accepted Answer · 2011-08-02 20:51:01Z

I don't know, but you are trying to find the "edit distance". Checking Wikipedia:

http://en.wikipedia.org/wiki/Edit_distance

You might also look at Peter Norvig's spelling corrector:

http://norvig.com/spell-correct.html

I think you could adapt code from the spelling corrector to do what you need.

Good luck.

Mark Ransom · Accepted Answer · 2011-08-02 21:22:24Z

def GetInsertedString(StringA, StringB): lenA = len(StringA) lenB = len(StringB) if lenA > lenB: return None, None begincount = 0 while begincount < lenA and StringA[begincount] == StringB[begincount]: begincount += 1 endcount = 0 while endcount < (lenA - begincount) and StringA[lenA-endcount-1] == StringB[lenB-endcount-1]: endcount += 1 if begincount + endcount != lenA: return None, None return begincount, StringB[begincount:begincount+lenB-lenA] >>> GetInsertedString('GHSKWITNIFSI', 'GHSKWAGDITNIFSI') (5, 'AGD') >>> GetInsertedString('GHSKWITNIFSI', 'GHSKWAGDTNIFSI') (None, None)

Andrew Clark · Accepted Answer · 2011-08-02 21:27:02Z

from itertools import dropwhile def get_inserted_substring(s1, s2): try: # diff is the first index at which the strings differ diff = dropwhile(lambda i: s1[i] == s2[i], xrange(len(s2))).next() if s2[diff:].endswith(s1[diff:]): return (diff, s2[diff:diff-len(s1)]) except (StopIteration, IndexError): # the strings are the same or only differ at the end if len(s1) <= len(s2): return (len(s1), s2[len(s1):]) return (None, None)

And examples...

>>> get_inserted_substring('abcdef', 'abcXYZdef') (3, 'XYZ') >>> get_inserted_substring('abcdef', 'abRSTcdXYZef') (None, None) >>> get_inserted_substring('abcdef', 'abcdefXYZ') (6, 'XYZ') >>> get_inserted_substring('abcdef', 'XYZabcdef') (0, 'XYZ') >>> get_inserted_substring('abcdefXYZ', 'abcdef') (None, None)

krs1 · Accepted Answer · 2011-08-02 20:55:40Z

-2

strA='foor' strB='foobar' strC='ba' if strB.replace(strC,'') == strA: print strC,' at index ',len(strB.split(strC)[0])

Possibly? Testing right now...

answered Aug 2, 2011 at 20:55

krs1

1,1357 silver badges16 bronze badges

1 Comment

g.d.d.c Over a year ago

I don't think strC is a known value - that's what he's attempting to determine. If it were known there'd be no reason for the question.

Collectives™ on Stack Overflow

Finding an Insertion in a String

6 Answers 6

4 Comments

2 Comments

1 Comment

Comments

Comments

1 Comment

Hot Network Questions