feat(algorithms, sliding window): repeated dna sequences #97

BrianLusina · 2025-11-05T12:09:59Z

Describe your change:

Adds and solves the repeated DNA sequences problem using both a naive approach and a sliding-window algorithm.

Add an algorithm?
Fix a bug or typo in an existing algorithm?
Documentation change?

Checklist:

I have read CONTRIBUTING.md.
This pull request is all my own work -- I have not plagiarized.
I know that pull requests will not be merged if they fail the automated tests.
This PR only changes one algorithm file. To ease review, please open separate PRs for separate algorithms.
All new Python files are placed inside an existing directory.
All filenames are in all lowercase characters with no spaces or dashes.
All functions and variable names follow Python naming conventions.
All function parameters and return values are annotated with Python type hints.
All functions have doctests that pass the automated testing.
All new algorithms have a URL in their comments that points to Wikipedia or another similar explanation.

Summary by CodeRabbit

New Features
- Added algorithms to detect repeated 10-letter DNA sequences using both naive and optimized approaches.
Documentation
- Added a comprehensive guide describing the problem, approaches, illustrations, and complexity analysis.
Tests
- Added unit tests covering typical repeats, overlapping patterns, long inputs, and no-repeat edge cases.
Bug Fixes
- Input validation now enforces valid DNA characters; invalid input raises an error.
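The validation described in the last bullet could look roughly like the sketch below. The helper name `validate_dna` and the error message are hypothetical; the PR itself only states that invalid DNA letters raise `ValueError`.

```python
VALID_BASES = frozenset("ACGT")


def validate_dna(dna_sequence: str) -> None:
    """Raise ValueError if the string contains anything other than A, C, G, T.

    Hypothetical helper for illustration; not taken from the PR.
    """
    invalid = sorted(set(dna_sequence) - VALID_BASES)
    if invalid:
        raise ValueError(f"Invalid DNA characters: {invalid}")
```

Collecting every offending character before raising makes the error message more useful than failing on the first bad letter, at the cost of one extra pass over the input.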

coderabbitai · 2025-11-05T12:10:13Z

Walkthrough

Adds a new "Repeated DNA Sequences" sliding-window module: documentation, two implementations (naive and rolling-hash optimized), and unit tests validating various repeat scenarios and input validation.

Changes

Cohort / File(s)	Summary
Directory index `DIRECTORY.md`	Added a new entry under the Sliding Window category for Repeated Dna Sequences with a test link.
Documentation / README `algorithms/sliding_window/repeated_dna_sequences/README.md`	New README covering problem statement, constraints, naive substring-enumeration approach, optimized sliding-window rolling-hash approach (base-4 encoding A/C/G/T → 0/1/2/3), step-by-step illustration, implementation outline, and complexity analysis.
Implementation `algorithms/sliding_window/repeated_dna_sequences/__init__.py`	Added two public functions: `find_repeated_dna_sequences_naive(dna_sequence: str) -> List[str]` and `find_repeated_dna_sequences(dna_sequence: str) -> List[str]` (optimized rolling-hash). Optimized function validates input characters and raises `ValueError` for invalid DNA letters.
Tests `algorithms/sliding_window/repeated_dna_sequences/test_repeated_dna_sequences.py`	New unit tests for both functions (parallel test classes) covering typical repeats, overlapping repeats, long repeats, no-repeat cases, and ordering-independent assertions.
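As a rough illustration of the base-4 encoding the README row mentions (A/C/G/T → 0/1/2/3), a window can be read as a base-4 number with the leftmost letter as the most significant digit. The function name `encode_window` is an assumption made for this sketch, not a name from the PR.

```python
BASE_TO_DIGIT = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_window(window: str) -> int:
    """Interpret a string of A/C/G/T as a base-4 integer (leftmost letter most significant)."""
    value = 0
    for base in window:
        value = value * 4 + BASE_TO_DIGIT[base]
    return value
```

For example, `encode_window("ACGT")` evaluates to 0·64 + 1·16 + 2·4 + 3 = 27, and a window of ten `T`s gives the largest ten-digit value, 4^10 − 1.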

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant Module as RepeatedDNA_Module
    rect rgb(211,228,205)
    Note right of Module: Naive flow
    Caller->>Module: find_repeated_dna_sequences_naive(s)
    Module->>Module: iterate all 10-char windows (slice)
    Module->>Module: track seen set -> when repeat -> add to results
    Module->>Caller: return list(results)
    end
    rect rgb(224,235,255)
    Note right of Module: Optimized rolling-hash flow
    Caller->>Module: find_repeated_dna_sequences(s)
    Module->>Module: validate characters (A,C,G,T) or raise ValueError
    Module->>Module: encode first 10 chars -> compute initial hash
    loop slide windows
    Module->>Module: update rolling hash (remove leading, add trailing)
    Module->>Module: if hash seen before -> add substring to output set
    end
    Module->>Caller: return list(output)
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Pay attention to:
- Rolling-hash update correctness and base-4 mapping.
- Collision handling: hashes vs. actual substring verification.
- Tests asserting equality independent of ordering and edge-case coverage (invalid input, short strings, overlapping repeats).
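On the collision-handling point, one way to make a hash-based check safe is to store the actual windows under each hash and compare substrings before reporting a repeat. The sketch below deliberately reduces the hash modulo a small number to force collisions. Both the modulus and the function name `find_repeats_verified` are assumptions for illustration; the PR's implementation may not use a modulus at all.

```python
from collections import defaultdict
from typing import Dict, List, Set

BASE_TO_DIGIT = {"A": 0, "C": 1, "G": 2, "T": 3}


def find_repeats_verified(dna_sequence: str, k: int = 10, modulus: int = 101) -> List[str]:
    """Rolling hash reduced modulo a small number, with substring verification."""
    n = len(dna_sequence)
    if n <= k:
        return []

    high = pow(4, k - 1, modulus)  # weight of the leading digit, reduced mod `modulus`
    seen: Dict[int, Set[str]] = defaultdict(set)  # hash -> windows that produced it
    repeats: Set[str] = set()

    # Hash of the first window.
    h = 0
    for base in dna_sequence[:k]:
        h = (h * 4 + BASE_TO_DIGIT[base]) % modulus

    for start in range(n - k + 1):
        if start > 0:
            leading = BASE_TO_DIGIT[dna_sequence[start - 1]]
            trailing = BASE_TO_DIGIT[dna_sequence[start + k - 1]]
            h = ((h - leading * high) * 4 + trailing) % modulus
        window = dna_sequence[start:start + k]
        # Equal hashes alone are not trusted; the window itself must match.
        if window in seen[h]:
            repeats.add(window)
        else:
            seen[h].add(window)
    return list(repeats)
```

Because the final decision rests on comparing the substring itself, the result stays correct even with an extreme setting such as `modulus=2`, where nearly every window collides.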

Poem

🐰 I hop through A, C, G, T in tune,
Ten-letter windows beneath the moon.
I nibble naive, I tally with hash,
Repeats uncovered in a quiet dash. ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The description provides a clear summary of changes and marks all required checklist items as complete, though one checkbox is problematic.	The 'Fix a bug or typo' checkbox should not be marked as it contradicts the actual changes (adding new code, not fixing bugs). Verify and correct this checkbox.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically describes the main change: adding a repeated DNA sequences algorithm to the sliding window category.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a93e069 and 3f3c9e9.

📒 Files selected for processing (1)

algorithms/sliding_window/repeated_dna_sequences/__init__.py (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

algorithms/sliding_window/repeated_dna_sequences/__init__.py

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

algorithms/sliding_window/repeated_dna_sequences/test_repeated_dna_sequences.py (1)
5-40: Good test coverage for typical scenarios.

The tests cover various important cases including overlapping repeats, homogeneous sequences, and patterns. The use of sorted() for order-independent comparison in tests 1 and 5 is appropriate.

Consider adding edge case tests for:

Empty string or strings with length < 10 (should return empty list)

String with exactly 10 characters (should return empty since no repeats possible)

These would verify the early-return logic in the optimized implementation

Example:
def test_edge_case_short_sequence(self):
    """Test sequences shorter than 10 characters"""
    self.assertEqual([], find_repeated_dna_sequences("ACGT"))
    self.assertEqual([], find_repeated_dna_sequences(""))

def test_edge_case_exact_length(self):
    """Test sequence of exactly 10 characters"""
    self.assertEqual([], find_repeated_dna_sequences("ACGTACGTAC"))
algorithms/sliding_window/repeated_dna_sequences/__init__.py (1)
52-52: Remove duplicate comment marker.
-    # # Compute the initial hash using base-4 multiplication
+    # Compute the initial hash using base-4 multiplication

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1aff4bb and 682dcdb.

⛔ Files ignored due to path filters (18)

algorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_example_one.png is excluded by !**/*.png
algorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_example_three.png is excluded by !**/*.png
algorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_example_two.png is excluded by !**/*.png
algorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_eight.png is excluded by !**/*.png
algorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_eleven.png is excluded by !**/*.png
algorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_fifteen.png is excluded by !**/*.png
algorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_five.png is excluded by !**/*.png
algorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_four.png is excluded by !**/*.png
algorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_fourteen.png is excluded by !**/*.png
algorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_nine.png is excluded by !**/*.png
algorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_one.png is excluded by !**/*.png
algorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_seven.png is excluded by !**/*.png
algorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_six.png is excluded by !**/*.png
algorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_ten.png is excluded by !**/*.png
algorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_thirteen.png is excluded by !**/*.png
algorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_three.png is excluded by !**/*.png
algorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_twelve.png is excluded by !**/*.png
algorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_two.png is excluded by !**/*.png

📒 Files selected for processing (4)

DIRECTORY.md (1 hunks)
algorithms/sliding_window/repeated_dna_sequences/README.md (1 hunks)
algorithms/sliding_window/repeated_dna_sequences/__init__.py (1 hunks)
algorithms/sliding_window/repeated_dna_sequences/test_repeated_dna_sequences.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

algorithms/sliding_window/repeated_dna_sequences/test_repeated_dna_sequences.py (1)

algorithms/sliding_window/repeated_dna_sequences/__init__.py (1)

find_repeated_dna_sequences (27-71)

🪛 LanguageTool

algorithms/sliding_window/repeated_dna_sequences/README.md

[grammar] ~240-~240: Use a hyphen to join words.
Context: ...t containing all the repeating 10-letter long sequences. ### Time Complexity Let...

(QB_NEW_EN_HYPHEN)

🪛 markdownlint-cli2 (0.18.1)

DIRECTORY.md

104-104: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)

105-105: Unordered list indentation
Expected: 4; Actual: 6

(MD007, ul-indent)

🔇 Additional comments (3)

DIRECTORY.md (1)

104-105: LGTM - Directory entry follows repository conventions.

The addition correctly documents the new Repeated DNA Sequences module under the Sliding Window category with proper link formatting.

algorithms/sliding_window/repeated_dna_sequences/README.md (1)

1-271: Excellent documentation with comprehensive explanations.

The README provides clear explanations of both the naive and optimized approaches, detailed mathematical formulas for the rolling hash technique, step-by-step implementation guidance, and proper complexity analysis. The extensive illustrations and code examples will help readers understand the algorithm.

algorithms/sliding_window/repeated_dna_sequences/__init__.py (1)

27-71: Well-implemented rolling hash algorithm.

The optimized implementation correctly uses a base-4 rolling hash with proper:

Character encoding (A→0, C→1, G→2, T→3)

Initial hash computation for the first window

Efficient hash updates when sliding: h' = h * 4 - old_char * 4^10 + new_char

Early return for sequences ≤ 10 characters

Set-based deduplication of results

The time complexity is O(n) and space complexity is O(n), meeting the optimization goals described in the README.
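The update formula listed above, h' = h * 4 - old_char * 4^10 + new_char, can be sanity-checked against a full recomputation of each window's hash. The sketch below assumes the hash is the plain base-4 value of the window; `initial_hash` and `slide` are hypothetical helper names, not functions from the PR.

```python
BASE_TO_DIGIT = {"A": 0, "C": 1, "G": 2, "T": 3}
K = 10
A_K = 4 ** K  # weight the outgoing letter carries once the hash is shifted left


def initial_hash(window: str) -> int:
    """Base-4 value of a window, computed from scratch in O(len(window))."""
    h = 0
    for base in window:
        h = h * 4 + BASE_TO_DIGIT[base]
    return h


def slide(h: int, old_char: str, new_char: str) -> int:
    """O(1) update: h' = h * 4 - old_char * 4^10 + new_char."""
    return h * 4 - BASE_TO_DIGIT[old_char] * A_K + BASE_TO_DIGIT[new_char]
```

Multiplying by 4 shifts every digit one place left, which pushes the outgoing letter into the 4^10 position; subtracting `old_char * 4^10` removes it, and adding `new_char` fills the vacated units digit.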

Note on hash collisions: Theoretically, different 10-character sequences could produce the same hash value. Given the hash space (4^10 = 1,048,576) and problem constraints (s.length ≤ 10³), collisions are unlikely in practice. However, for production code with larger inputs, consider verifying actual substring equality when a hash collision is detected to eliminate false positives.
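One detail may be worth checking against this note. If the hash is the plain base-4 value with no modulus applied (an assumption, since the PR code is not shown here), it is a positional numeral, so two different windows of the same length cannot share a value. The sketch below checks this exhaustively for a shorter window length (5 letters, 1,024 strings) to keep the run cheap; the helper names are hypothetical.

```python
from itertools import product

BASE_TO_DIGIT = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(window: str) -> int:
    """Plain base-4 value of the window, with no modulus applied."""
    value = 0
    for base in window:
        value = value * 4 + BASE_TO_DIGIT[base]
    return value


def count_distinct_hashes(k: int) -> int:
    """Number of distinct encodings over every length-k string of A/C/G/T."""
    return len({encode("".join(letters)) for letters in product("ACGT", repeat=k)})
```

If `count_distinct_hashes(k)` equals `4 ** k`, every window of that length has its own hash. Substring verification then matters mainly when the hash is reduced modulo some number or the alphabet mapping is not one-to-one.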

algorithms/sliding_window/repeated_dna_sequences/__init__.py

…fixes to naive implementation

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (3)

algorithms/sliding_window/repeated_dna_sequences/__init__.py (2)
1-1: Remove unused Dict import.

The Dict type is imported but not used anywhere in this module.

Apply this diff:
-from typing import List, Dict
+from typing import List
57-57: Fix double hash in comment.

The comment has an extra # character.

Apply this diff:
-    # # Compute the initial hash using base-4 multiplication
+    # Compute the initial hash using base-4 multiplication
algorithms/sliding_window/repeated_dna_sequences/test_repeated_dna_sequences.py (1)
1-80: Test coverage is solid but could benefit from edge cases.

The test suite effectively validates both implementations with parallel test cases covering:

Typical repeated sequences

Homogeneous sequences (all same character)

No repeats scenarios

Overlapping patterns

Complex repeating patterns

The use of sorted() for order-independent comparisons is appropriate since the functions return results from sets.
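For reference, two equivalent ways to write an order-independent assertion in `unittest` are sketched below. The test class and the stand-in `actual` value are made up for illustration and are not taken from the PR's test file.

```python
import unittest


class OrderIndependentExample(unittest.TestCase):
    def test_same_elements_any_order(self) -> None:
        expected = ["AAAAACCCCC", "CCCCCAAAAA"]
        # Stand-in for a result built from a set, whose order is not guaranteed.
        actual = list({"CCCCCAAAAA", "AAAAACCCCC"})

        # Option 1: sort both sides before comparing.
        self.assertEqual(sorted(expected), sorted(actual))
        # Option 2: compare as multisets without sorting.
        self.assertCountEqual(expected, actual)
```

`assertCountEqual` also reports which elements are missing or unexpected on failure, which can be easier to read than a diff of two sorted lists.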

Consider adding edge case tests:
def test_empty_string(self):
    """Test empty string input."""
    dna_sequence = ""
    expected = []
    actual = find_repeated_dna_sequences(dna_sequence)
    self.assertEqual(expected, actual)

def test_exact_length_10(self):
    """Test string of exactly length 10 (boundary case)."""
    dna_sequence = "AAAAAAAAAA"
    expected = []
    actual = find_repeated_dna_sequences(dna_sequence)
    self.assertEqual(expected, actual)

def test_length_11_with_repeat(self):
    """Test minimal length where repeat is possible."""
    dna_sequence = "AAAAAAAAAAA"  # 11 A's
    expected = ["AAAAAAAAAA"]
    actual = find_repeated_dna_sequences(dna_sequence)
    self.assertEqual(expected, actual)
If input validation is added to handle invalid characters, also add:
def test_invalid_characters(self):
    """Test that invalid DNA characters raise appropriate error."""
    dna_sequence = "AAAAXAAAAA"
    with self.assertRaises(ValueError):
        find_repeated_dna_sequences(dna_sequence)

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 682dcdb and a93e069.

📒 Files selected for processing (2)

algorithms/sliding_window/repeated_dna_sequences/__init__.py (1 hunks)
algorithms/sliding_window/repeated_dna_sequences/test_repeated_dna_sequences.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

algorithms/sliding_window/repeated_dna_sequences/test_repeated_dna_sequences.py (1)

algorithms/sliding_window/repeated_dna_sequences/__init__.py (2)

find_repeated_dna_sequences (32-76)

find_repeated_dna_sequences_naive (4-30)

🔇 Additional comments (2)

algorithms/sliding_window/repeated_dna_sequences/__init__.py (2)

4-30: LGTM! Past review feedback fully addressed.

The naive implementation now correctly:

Validates input length before processing

Uses a set for seen instead of a dictionary

Checks subsequence length to skip incomplete windows at the end

The logic for tracking repeated sequences is sound.
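A minimal sketch of a naive implementation with the three properties listed above (length check, `seen` set, skipping incomplete tail windows) might look like this. It is an approximation written for this review thread, with a hypothetical function name, and is not the code from the PR.

```python
from typing import List, Set


def find_repeated_naive_sketch(dna_sequence: str, k: int = 10) -> List[str]:
    """Collect every length-k substring that occurs more than once."""
    if len(dna_sequence) <= k:
        return []  # at least k + 1 letters are needed for two windows

    seen: Set[str] = set()
    repeated: Set[str] = set()
    for start in range(len(dna_sequence)):
        window = dna_sequence[start:start + k]
        if len(window) < k:
            break  # incomplete window at the tail; later ones are shorter still
        if window in seen:
            repeated.add(window)
        else:
            seen.add(window)
    return list(repeated)
```

Each slice costs O(k), so the overall work is O(n · k); with k fixed at 10 this is still linear in the input length, just with a larger constant than the rolling-hash version.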

52-76: Rolling hash implementation looks solid.

The rolling hash algorithm is correctly implemented:

Base-4 encoding efficiently maps DNA bases to integers

Initial hash computation properly accumulates values

The precomputed a_k (4^10) value enables O(1) window updates

The sliding window correctly removes the leftmost character and adds the new rightmost character

The algorithm achieves O(n) time complexity with O(n) space for tracking seen hashes, which is optimal for this problem.
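Putting the pieces described above together, an end-to-end rolling-hash version could be sketched as follows. The function name, the variable names, and the ordering of the early return and the validation are assumptions; the PR's actual `find_repeated_dna_sequences` may differ in these details.

```python
from typing import List, Set

BASE_TO_DIGIT = {"A": 0, "C": 1, "G": 2, "T": 3}


def find_repeated_rolling_sketch(dna_sequence: str, k: int = 10) -> List[str]:
    """Find repeated length-k windows using a base-4 rolling hash (no modulus)."""
    n = len(dna_sequence)
    if n <= k:
        return []
    if set(dna_sequence) - set(BASE_TO_DIGIT):
        raise ValueError("dna_sequence may only contain A, C, G, T")

    a_k = 4 ** k  # precomputed once so each slide is O(1)

    # Hash of the first window.
    h = 0
    for base in dna_sequence[:k]:
        h = h * 4 + BASE_TO_DIGIT[base]

    seen_hashes: Set[int] = {h}
    output: Set[str] = set()
    for start in range(1, n - k + 1):
        leading = BASE_TO_DIGIT[dna_sequence[start - 1]]
        trailing = BASE_TO_DIGIT[dna_sequence[start + k - 1]]
        h = h * 4 - leading * a_k + trailing
        if h in seen_hashes:
            output.add(dna_sequence[start:start + k])
        else:
            seen_hashes.add(h)
    return list(output)
```

The substring is only sliced when a repeat is found, so the per-window work is a constant number of integer operations plus one set lookup.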

algorithms/sliding_window/repeated_dna_sequences/__init__.py

BrianLusina and others added 2 commits November 5, 2025 14:00

feat(algorithms, sliding-window): repeated dna sequences

e9f6632

updating DIRECTORY.md

682dcdb

BrianLusina self-assigned this Nov 5, 2025

BrianLusina added the enhancement, Algorithm (Algorithm Problem), Documentation (Documentation Updates), and Sliding Window labels Nov 5, 2025

coderabbitai bot reviewed Nov 5, 2025

algorithms/sliding_window/repeated_dna_sequences/__init__.py

chore(algorithms, sliding-window, find-repeated-dna-sequences: minor …

a93e069

…fixes to naive implementation

coderabbitai bot reviewed Nov 5, 2025

algorithms/sliding_window/repeated_dna_sequences/__init__.py

chore: add validation for dna sequences

3f3c9e9

BrianLusina merged commit e84daec into main Nov 5, 2025
6 of 8 checks passed

BrianLusina deleted the feat/algorithms-sliding-window branch November 5, 2025 13:04
