Two style issues before we continue on improving your algorithm:
- Use docstrings to document functions, not comments in front of the function. In addition don't let the comment use one name for a variable, and the function another. Keep them in sync, as otherwise the comment/docstring will confuse more than help
- Add blank linkes in longer code segments. This can greatly enhance the reading and understanding of your code. I tend to add a blank line before
for, while and if (and elif and else) usually.
Other than these your variable and function names are OK.
Regarding improvement on your algorithm here are some other comments:
- Allow texts with only one character, and length 1 – Both of these conform to at most two distinct characters, and should be included in the final set
- Bug: When a character not in the set appears, you set length to 1 – You clear the set and set it to current character, and set length to one, but if your text is
ceeeeebbbbbb (and you come to the b the length should really be 5 (or 6) and your distinct set should consist of e and b. You forget that the e could still be part of the longest substring, even though the combination c and e didnt't yield the longest one. - Use of set is kind of heavy – It kind of make sense to use a set to keep the two distinct character, and if it was larger subset I would argue that is wise, but as it is only two characters, it is faster to just keep track of the last two characters, and use that for length calculations.
- Move increment of
current out of loop – As you increment it in all the if statements except the last, you could move this to front of loop. Function will work the same, but look nicer. (Do still keep the current = 1 near the end) - Move
i in distinct_set out a level – This is another block of code which is the same, and could be kept at an outer level. The point I'm trying to make is to try to avoid repeating code, and if multiple if blocks have the same code, move it out the outer level.
This leads to the following version of your code:
def longest_substring_mod(text): """Longest substring with at most 2 distinct characters.""" if text is None or len(text) == 0: return 0 distinct_characters = set() longest = 0 current = 0 for i in text: if i not in distinct_characters: if len(distinct_characters) == 2: distinct_characters.clear() if current > longest: longest = current # This should've been corrected to more than # one if previous character is repeated current = 1 distinct_characters.add(i) if i in distinct_characters: current += 1 if current > longest: longest = current return longest
This behaves mostly the same as your code, and the difference is that this also returns 1 when the text is only one character long. The other flaws/feature, is kept as is. Do notice how it has simpler logic, and is a little easier to read, whilst still doing the same stuff.
Alternate algorithm
In my first naive approach at a different algorithm, I used a loop going through the string once, and stored the position of where I first saw a given character. This worked rather nicely, until I considered the case "abbaaaaeeeeeeeeee", where the first occurence of "a" is at start, whilst the longest substring should start with the "a" at position 3. This made me change the length calculation from using positions, to stripping of the characters from end of string. This led to the following code:
def longest_substring(text): """Longest substring of text with at most 2 distinct characters. >>> ### Using doctests test cases >>> # Test cases from OP >>> [longest_substring(s) for s in [None, "", "a", "ab", "aab", ... "bab", "babc", "bbbbcccac"]] [0, 0, 1, 2, 3, 3, 3, 7] >>> # Test cases illustrating fail in OP code from 200_success >>> # See http://codereview.stackexchange.com/a/111277/78136 >>> [longest_substring(s) for s in ["eeeeebbbbb", "eeeeebbbbbc", ... "ceeeeebbbbb"]] [10, 10, 10] >>> # Single character tests >>> [longest_substring(s) for s in ["a", "aa", "aaaaaaaaaa"]] [1, 2, 10] >>> # Icky test case, as it needs to get length of last a-repeats >>> longest_substring("abbaaaaeeeeeeeeee") 14 """ if text is None or len(text) == 0: return 0 first_character = text[0] second_character = None max_length = 0 for idx, character in enumerate(text): # If character is either of the two last seen characters, continue if character == first_character or character == second_character: continue # ... else if we have no previous character, set it and continue elif second_character is None: second_character = character continue # Now we have a different character at idx, so compare length of # previous substring, and set max_length accordingly max_length = max(max_length, idx - len(text[:idx].rstrip(first_character + second_character))) # Shift the distinct set of characters first_character, second_character = text[idx-1:idx], character # Return max_length within text, or length of last subtext if second_character: return max(max_length, len(text) - len(text.rstrip(first_character + second_character))) else: # No prev_character is set, meaning we only have one character in text return len(text) def doctest(): """Do the doctests on this module.""" import doctest doctest.testmod() if __name__ == '__main__': doctest() print([longest_substring(test) for test in [None, "", "a", "ab", "aab", "bab", "babc", "bbbbcccac"]]) print([longest_substring(test) for test in ["eeeeebbbbb", "eeeeebbbbbc", "ceeeeebbbbb"]]) print([longest_substring(test) for test in ["aaaaa", "edfraaaaaa", "baacba", "babababababaaaabbbabz", "babc", "bbbbcccac"]]) print(longest_substring("abbaaaaeeeeeeeeee"))
Some extra comments related to my code:
- I've implemented testing using the doctests module, as indicated by 200_success in a comment on another answer. In addition I've extended the test cases with some from other answers, and some of my own. Note that testing the OP's code with these test cases fails...
This implementation is faster in some cases, and similar in others. I do however think it is a slightly cleaner implementation, and it is a correct implementation.
It could probably be made even faster, if finding a way to avoid the rstrip() function, but that would also require some extra variables to keep track of both the first time the first character is seen, and the first time we've seen the current character. This should allow for a simple subtraction instead of doing text sliceing and calculation.
"a"? The spec is "at most 2 distinct characters", not "exactly 2". \$\endgroup\$