- Notifications
You must be signed in to change notification settings - Fork 2
feat(algorithms, sliding window): repeated dna sequences #97
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughAdds a new "Repeated DNA Sequences" sliding-window module: documentation, two implementations (naive and rolling-hash optimized), and unit tests validating various repeat scenarios and input validation. Changes
Sequence Diagram(s)sequenceDiagram participant Caller participant Module as RepeatedDNA_Module rect rgb(211,228,205) Note right of Module: Naive flow Caller->>Module: find_repeated_dna_sequences_naive(s) Module->>Module: iterate all 10-char windows (slice) Module->>Module: track seen set -> when repeat -> add to results Module->>Caller: return list(results) end rect rgb(224,235,255) Note right of Module: Optimized rolling-hash flow Caller->>Module: find_repeated_dna_sequences(s) Module->>Module: validate characters (A,C,G,T) or raise ValueError Module->>Module: encode first 10 chars -> compute initial hash loop slide windows Module->>Module: update rolling hash (remove leading, add trailing) Module->>Module: if hash seen before -> add substring to output set end Module->>Caller: return list(output) end Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (2)
algorithms/sliding_window/repeated_dna_sequences/test_repeated_dna_sequences.py (1)
5-40: Good test coverage for typical scenarios.The tests cover various important cases including overlapping repeats, homogeneous sequences, and patterns. The use of
sorted()for order-independent comparison in tests 1 and 5 is appropriate.Consider adding edge case tests for:
- Empty string or strings with length < 10 (should return empty list)
- String with exactly 10 characters (should return empty since no repeats possible)
- These would verify the early-return logic in the optimized implementation
Example:
def test_edge_case_short_sequence(self): """Test sequences shorter than 10 characters""" self.assertEqual([], find_repeated_dna_sequences("ACGT")) self.assertEqual([], find_repeated_dna_sequences("")) def test_edge_case_exact_length(self): """Test sequence of exactly 10 characters""" self.assertEqual([], find_repeated_dna_sequences("ACGTACGTAC"))algorithms/sliding_window/repeated_dna_sequences/__init__.py (1)
52-52: Remove duplicate comment marker.- # # Compute the initial hash using base-4 multiplication + # Compute the initial hash using base-4 multiplication
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (18)
algorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_example_one.pngis excluded by!**/*.pngalgorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_example_three.pngis excluded by!**/*.pngalgorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_example_two.pngis excluded by!**/*.pngalgorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_eight.pngis excluded by!**/*.pngalgorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_eleven.pngis excluded by!**/*.pngalgorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_fifteen.pngis excluded by!**/*.pngalgorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_five.pngis excluded by!**/*.pngalgorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_four.pngis excluded by!**/*.pngalgorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_fourteen.pngis excluded by!**/*.pngalgorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_nine.pngis excluded by!**/*.pngalgorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_one.pngis excluded by!**/*.pngalgorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_seven.pngis excluded by!**/*.pngalgorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_six.pngis excluded by!**/*.pngalgorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_ten.pngis excluded by!**/*.pngalgorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_thirteen.pngis excluded by!**/*.pngalgorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_three.pngis excluded by!**/*.pngalgorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_twelve.pngis excluded by!**/*.pngalgorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_two.pngis excluded by!**/*.png
📒 Files selected for processing (4)
DIRECTORY.md(1 hunks)algorithms/sliding_window/repeated_dna_sequences/README.md(1 hunks)algorithms/sliding_window/repeated_dna_sequences/__init__.py(1 hunks)algorithms/sliding_window/repeated_dna_sequences/test_repeated_dna_sequences.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
algorithms/sliding_window/repeated_dna_sequences/test_repeated_dna_sequences.py (1)
algorithms/sliding_window/repeated_dna_sequences/__init__.py (1)
find_repeated_dna_sequences(27-71)
🪛 LanguageTool
algorithms/sliding_window/repeated_dna_sequences/README.md
[grammar] ~240-~240: Use a hyphen to join words.
Context: ...t containing all the repeating 10-letter long sequences. ### Time Complexity Let...
(QB_NEW_EN_HYPHEN)
🪛 markdownlint-cli2 (0.18.1)
DIRECTORY.md
104-104: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
105-105: Unordered list indentation
Expected: 4; Actual: 6
(MD007, ul-indent)
🔇 Additional comments (3)
DIRECTORY.md (1)
104-105: LGTM - Directory entry follows repository conventions.The addition correctly documents the new Repeated DNA Sequences module under the Sliding Window category with proper link formatting.
algorithms/sliding_window/repeated_dna_sequences/README.md (1)
1-271: Excellent documentation with comprehensive explanations.The README provides clear explanations of both the naive and optimized approaches, detailed mathematical formulas for the rolling hash technique, step-by-step implementation guidance, and proper complexity analysis. The extensive illustrations and code examples will help readers understand the algorithm.
algorithms/sliding_window/repeated_dna_sequences/__init__.py (1)
27-71: Well-implemented rolling hash algorithm.The optimized implementation correctly uses a base-4 rolling hash with proper:
- Character encoding (A→0, C→1, G→2, T→3)
- Initial hash computation for the first window
- Efficient hash updates when sliding:
h' = h * 4 - old_char * 4^10 + new_char- Early return for sequences ≤ 10 characters
- Set-based deduplication of results
The time complexity is O(n) and space complexity is O(n), meeting the optimization goals described in the README.
Note on hash collisions: Theoretically, different 10-character sequences could produce the same hash value. Given the hash space (4^10 = 1,048,576) and problem constraints (s.length ≤ 10³), collisions are unlikely in practice. However, for production code with larger inputs, consider verifying actual substring equality when a hash collision is detected to eliminate false positives.
…fixes to naive implementation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (3)
algorithms/sliding_window/repeated_dna_sequences/__init__.py (2)
1-1: Remove unusedDictimport.The
Dicttype is imported but not used anywhere in this module.Apply this diff:
-from typing import List, Dict +from typing import List
57-57: Fix double hash in comment.The comment has an extra
#character.Apply this diff:
- # # Compute the initial hash using base-4 multiplication + # Compute the initial hash using base-4 multiplicationalgorithms/sliding_window/repeated_dna_sequences/test_repeated_dna_sequences.py (1)
1-80: Test coverage is solid but could benefit from edge cases.The test suite effectively validates both implementations with parallel test cases covering:
- Typical repeated sequences
- Homogeneous sequences (all same character)
- No repeats scenarios
- Overlapping patterns
- Complex repeating patterns
The use of
sorted()for order-independent comparisons is appropriate since the functions return results from sets.Consider adding edge case tests:
def test_empty_string(self): """Test empty string input.""" dna_sequence = "" expected = [] actual = find_repeated_dna_sequences(dna_sequence) self.assertEqual(expected, actual) def test_exact_length_10(self): """Test string of exactly length 10 (boundary case).""" dna_sequence = "AAAAAAAAAA" expected = [] actual = find_repeated_dna_sequences(dna_sequence) self.assertEqual(expected, actual) def test_length_11_with_repeat(self): """Test minimal length where repeat is possible.""" dna_sequence = "AAAAAAAAAAA" # 11 A's expected = ["AAAAAAAAAA"] actual = find_repeated_dna_sequences(dna_sequence) self.assertEqual(expected, actual)If input validation is added to handle invalid characters, also add:
def test_invalid_characters(self): """Test that invalid DNA characters raise appropriate error.""" dna_sequence = "AAAAXAAAAA" with self.assertRaises(ValueError): find_repeated_dna_sequences(dna_sequence)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
algorithms/sliding_window/repeated_dna_sequences/__init__.py(1 hunks)algorithms/sliding_window/repeated_dna_sequences/test_repeated_dna_sequences.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
algorithms/sliding_window/repeated_dna_sequences/test_repeated_dna_sequences.py (1)
algorithms/sliding_window/repeated_dna_sequences/__init__.py (2)
find_repeated_dna_sequences(32-76)find_repeated_dna_sequences_naive(4-30)
🔇 Additional comments (2)
algorithms/sliding_window/repeated_dna_sequences/__init__.py (2)
4-30: LGTM! Past review feedback fully addressed.The naive implementation now correctly:
- Validates input length before processing
- Uses a set for
seeninstead of a dictionary- Checks subsequence length to skip incomplete windows at the end
The logic for tracking repeated sequences is sound.
52-76: Rolling hash implementation looks solid.The rolling hash algorithm is correctly implemented:
- Base-4 encoding efficiently maps DNA bases to integers
- Initial hash computation properly accumulates values
- The precomputed
a_k(4^10) value enables O(1) window updates- The sliding window correctly removes the leftmost character and adds the new rightmost character
The algorithm achieves O(n) time complexity with O(n) space for tracking seen hashes, which is optimal for this problem.
Describe your change:
Adds and solves repeated DNA sequences uses both a naive approach and sliding window algorithm
Checklist:
Summary by CodeRabbit
New Features
Documentation
Tests
Bug Fixes