Longest substring with at most 2 distinct characters

Question

I think I solved the following problem. However, I was wondering if there was a faster or more efficient way to solve this.

I believe the the runtime is O(n) and space is O(1) (O(256) assuming ASCII)

# Longest substring with at most 2 distinct characters # # Given a string S, find the length of the longest substring T that contains at most 2 distinct characters # # For example, # Given S = "eceba" # T is "ece" which it's length is 3 def longest_substring(string): if string is None or len(string) == 1 or len(string) == 0: return 0 distinct_set = set() longest = 0 current = 0 for i in string: set_len = len(distinct_set) if set_len < 2: if i in distinct_set: current += 1 if i not in distinct_set: distinct_set.add(i) current += 1 if set_len == 2: if i in distinct_set: current += 1 if i not in distinct_set: distinct_set.clear() distinct_set.add(i) if current > longest: longest = current current = 1 if current > longest: longest = current return longest

Input:

test_cases = [None, "", "a", "ab", "aab", "bab", "babc", "bbbbcccac"]

Output:

0, 0, 0, 2, 3, 3, 3, 7

Why is the result 0 for "a"? The spec is "at most 2 distinct characters", not "exactly 2". — Janne Karila
– Janne Karila, Commented Nov 20, 2015 at 13:50

holroy · Accepted Answer · 2015-11-24 21:00:36Z

Two style issues before we continue on improving your algorithm:

Use docstrings to document functions, not comments in front of the function. In addition don't let the comment use one name for a variable, and the function another. Keep them in sync, as otherwise the comment/docstring will confuse more than help
Add blank linkes in longer code segments. This can greatly enhance the reading and understanding of your code. I tend to add a blank line before for, while and if (and elif and else) usually.

Other than these your variable and function names are OK.

Regarding improvement on your algorithm here are some other comments:

Allow texts with only one character, and length 1 – Both of these conform to at most two distinct characters, and should be included in the final set
Bug: When a character not in the set appears, you set length to 1 – You clear the set and set it to current character, and set length to one, but if your text is ceeeeebbbbbb (and you come to the b the length should really be 5 (or 6) and your distinct set should consist of e and b. You forget that the e could still be part of the longest substring, even though the combination c and e didnt't yield the longest one.
Use of set is kind of heavy – It kind of make sense to use a set to keep the two distinct character, and if it was larger subset I would argue that is wise, but as it is only two characters, it is faster to just keep track of the last two characters, and use that for length calculations.
Move increment of current out of loop – As you increment it in all the if statements except the last, you could move this to front of loop. Function will work the same, but look nicer. (Do still keep the current = 1 near the end)
Move i in distinct_set out a level – This is another block of code which is the same, and could be kept at an outer level. The point I'm trying to make is to try to avoid repeating code, and if multiple if blocks have the same code, move it out the outer level.

This leads to the following version of your code:

def longest_substring_mod(text): """Longest substring with at most 2 distinct characters.""" if text is None or len(text) == 0: return 0 distinct_characters = set() longest = 0 current = 0 for i in text: if i not in distinct_characters: if len(distinct_characters) == 2: distinct_characters.clear() if current > longest: longest = current # This should've been corrected to more than # one if previous character is repeated current = 1 distinct_characters.add(i) if i in distinct_characters: current += 1 if current > longest: longest = current return longest

This behaves mostly the same as your code, and the difference is that this also returns 1 when the text is only one character long. The other flaws/feature, is kept as is. Do notice how it has simpler logic, and is a little easier to read, whilst still doing the same stuff.

Alternate algorithm

In my first naive approach at a different algorithm, I used a loop going through the string once, and stored the position of where I first saw a given character. This worked rather nicely, until I considered the case "abbaaaaeeeeeeeeee", where the first occurence of "a" is at start, whilst the longest substring should start with the "a" at position 3. This made me change the length calculation from using positions, to stripping of the characters from end of string. This led to the following code:

def longest_substring(text): """Longest substring of text with at most 2 distinct characters. >>> ### Using doctests test cases >>> # Test cases from OP >>> [longest_substring(s) for s in [None, "", "a", "ab", "aab", ... "bab", "babc", "bbbbcccac"]] [0, 0, 1, 2, 3, 3, 3, 7] >>> # Test cases illustrating fail in OP code from 200_success >>> # See http://codereview.stackexchange.com/a/111277/78136 >>> [longest_substring(s) for s in ["eeeeebbbbb", "eeeeebbbbbc", ... "ceeeeebbbbb"]] [10, 10, 10] >>> # Single character tests >>> [longest_substring(s) for s in ["a", "aa", "aaaaaaaaaa"]] [1, 2, 10] >>> # Icky test case, as it needs to get length of last a-repeats >>> longest_substring("abbaaaaeeeeeeeeee") 14 """ if text is None or len(text) == 0: return 0 first_character = text[0] second_character = None max_length = 0 for idx, character in enumerate(text): # If character is either of the two last seen characters, continue if character == first_character or character == second_character: continue # ... else if we have no previous character, set it and continue elif second_character is None: second_character = character continue # Now we have a different character at idx, so compare length of # previous substring, and set max_length accordingly max_length = max(max_length, idx - len(text[:idx].rstrip(first_character + second_character))) # Shift the distinct set of characters first_character, second_character = text[idx-1:idx], character # Return max_length within text, or length of last subtext if second_character: return max(max_length, len(text) - len(text.rstrip(first_character + second_character))) else: # No prev_character is set, meaning we only have one character in text return len(text) def doctest(): """Do the doctests on this module.""" import doctest doctest.testmod() if __name__ == '__main__': doctest() print([longest_substring(test) for test in [None, "", "a", "ab", "aab", "bab", "babc", "bbbbcccac"]]) print([longest_substring(test) for test in ["eeeeebbbbb", "eeeeebbbbbc", "ceeeeebbbbb"]]) print([longest_substring(test) for test in ["aaaaa", "edfraaaaaa", "baacba", "babababababaaaabbbabz", "babc", "bbbbcccac"]]) print(longest_substring("abbaaaaeeeeeeeeee"))

Some extra comments related to my code:

I've implemented testing using the doctests module, as indicated by 200_success in a comment on another answer. In addition I've extended the test cases with some from other answers, and some of my own. Note that testing the OP's code with these test cases fails...
This implementation is faster in some cases, and similar in others. I do however think it is a slightly cleaner implementation, and it is a correct implementation.

It could probably be made even faster, if finding a way to avoid the rstrip() function, but that would also require some extra variables to keep track of both the first time the first character is seen, and the first time we've seen the current character. This should allow for a simple subtraction instead of doing text sliceing and calculation.

anon · Accepted Answer · 2015-11-20 01:43:28Z

Three small things:

Firstly, docstrings exist. Use them. Put the explanation for how to use your method in them. That means that instead of this:

# Longest substring with at most 2 distinct characters # # Given a string S, find the length of the longest substring T that contains at most 2 distinct characters # # For example, # Given S = "eceba" # T is "ece" which it's length is 3 def longest_substring(string):

you have this:

def longest_substring(string): """ Longest substring with at most 2 distinct characters Given a string S, find the length of the longest substring T that contains t most 2 distinct characters For example, Given S = "eceba" T is "ece" which it's length is 3"""

At least, if I've understood their purpose correctly.

Secondly, if someone passes in None, that should probably be an error, not a return 0 case. Same if someone passes in a number, Duck, or any other invalid type.

Thirdly, according to PEP8, the official style guide, none of your lines should be longer than 80 characters. The (new) second line violates this; it even did before. Just split it in half and you're good.

\$\begingroup\$ Even better, write the examples as doctests. \$\endgroup\$

200_success
– 200_success

2015-11-20 07:20:16 +00:00
Commented Nov 20, 2015 at 7:20 — 200_success
– 200_success, Commented Nov 20, 2015 at 7:20

200_success · Accepted Answer · 2015-11-20 05:00:30Z

Your algorithm is too simplistic. These cases work:

>>> longest_substring('eeeeeeeeeeebbbbbbbbbb') 21 >>> longest_substring('eeeeeeeeeeebbbbbbbbbbc') 21

This case fails:

>>> longest_substring('ceeeeeeeeeeebbbbbbbbbb') 12

Stack Exchange Network

Longest substring with at most 2 distinct characters

3 Answers 3

Alternate algorithm

You must log in to answer this question.

Hot Network Questions

Longest substring with at most 2 distinct characters

3 Answers 3

Alternate algorithm

You must log in to answer this question.

Related

Hot Network Questions