
@BrianLusina
Owner

@BrianLusina BrianLusina commented Nov 5, 2025

Describe your change:

Adds the repeated DNA sequences problem and solves it using both a naive approach and a sliding-window algorithm.

  • Add an algorithm?
  • Fix a bug or typo in an existing algorithm?
  • Documentation change?

Checklist:

  • I have read CONTRIBUTING.md.
  • This pull request is all my own work -- I have not plagiarized.
  • I know that pull requests will not be merged if they fail the automated tests.
  • This PR only changes one algorithm file. To ease review, please open separate PRs for separate algorithms.
  • All new Python files are placed inside an existing directory.
  • All filenames are in all lowercase characters with no spaces or dashes.
  • All functions and variable names follow Python naming conventions.
  • All function parameters and return values are annotated with Python type hints.
  • All functions have doctests that pass the automated testing.
  • All new algorithms have a URL in its comments that points to Wikipedia or other similar explanation.

Summary by CodeRabbit

  • New Features

    • Added algorithms to detect repeated 10-letter DNA sequences using both naive and optimized approaches.
  • Documentation

    • Added a comprehensive guide describing the problem, approaches, illustrations, and complexity analysis.
  • Tests

    • Added unit tests covering typical repeats, overlapping patterns, long inputs, and no-repeat edge cases.
  • Bug Fixes

    • Input validation now enforces valid DNA characters; invalid input raises an error.
@BrianLusina BrianLusina self-assigned this Nov 5, 2025
@BrianLusina BrianLusina added enhancement Algorithm Algorithm Problem Documentation Documentation Updates Sliding Window labels Nov 5, 2025
@coderabbitai
Contributor

coderabbitai bot commented Nov 5, 2025

Walkthrough

Adds a new "Repeated DNA Sequences" sliding-window module: documentation, two implementations (naive and rolling-hash optimized), and unit tests validating various repeat scenarios and input validation.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Directory index: DIRECTORY.md | Added a new entry under the Sliding Window category for Repeated Dna Sequences with a test link. |
| Documentation / README: algorithms/sliding_window/repeated_dna_sequences/README.md | New README covering problem statement, constraints, naive substring-enumeration approach, optimized sliding-window rolling-hash approach (base-4 encoding A/C/G/T → 0/1/2/3), step-by-step illustration, implementation outline, and complexity analysis. |
| Implementation: algorithms/sliding_window/repeated_dna_sequences/__init__.py | Added two public functions: find_repeated_dna_sequences_naive(dna_sequence: str) -> List[str] and find_repeated_dna_sequences(dna_sequence: str) -> List[str] (optimized rolling-hash). Optimized function validates input characters and raises ValueError for invalid DNA letters. |
| Tests: algorithms/sliding_window/repeated_dna_sequences/test_repeated_dna_sequences.py | New unit tests for both functions (parallel test classes) covering typical repeats, overlapping repeats, long repeats, no-repeat cases, and ordering-independent assertions. |

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant Module as RepeatedDNA_Module
    rect rgb(211,228,205)
        Note right of Module: Naive flow
        Caller->>Module: find_repeated_dna_sequences_naive(s)
        Module->>Module: iterate all 10-char windows (slice)
        Module->>Module: track seen set -> when repeat -> add to results
        Module->>Caller: return list(results)
    end
    rect rgb(224,235,255)
        Note right of Module: Optimized rolling-hash flow
        Caller->>Module: find_repeated_dna_sequences(s)
        Module->>Module: validate characters (A,C,G,T) or raise ValueError
        Module->>Module: encode first 10 chars -> compute initial hash
        loop slide windows
            Module->>Module: update rolling hash (remove leading, add trailing)
            Module->>Module: if hash seen before -> add substring to output set
        end
        Module->>Caller: return list(output)
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Pay attention to:
    • Rolling-hash update correctness and base-4 mapping.
    • Collision handling: hashes vs. actual substring verification.
    • Tests asserting equality independent of ordering and edge-case coverage (invalid input, short strings, overlapping repeats).
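One way to check the rolling-hash update flagged above is to compare it against a from-scratch hash of every window. The following is a minimal sketch, not the PR's code: the names `ENCODING`, `K`, `window_hash`, and `rolled_hashes` are hypothetical, and it assumes the hash is the plain base-4 value of the window with no modulus.

```python
from typing import List

# Hypothetical names for illustration; the mapping follows the A/C/G/T -> 0/1/2/3
# encoding described in the walkthrough.
ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}
K = 10  # window length


def window_hash(window: str) -> int:
    """Hash a window from scratch by reading it as a base-4 number."""
    value = 0
    for char in window:
        value = value * 4 + ENCODING[char]
    return value


def rolled_hashes(sequence: str) -> List[int]:
    """Hash every K-length window using the O(1) rolling update."""
    if len(sequence) < K:
        return []
    top = 4 ** K  # weight of the outgoing character after the shift by 4
    current = window_hash(sequence[:K])
    hashes = [current]
    for end in range(K, len(sequence)):
        outgoing = ENCODING[sequence[end - K]]
        incoming = ENCODING[sequence[end]]
        # Shift left one base-4 digit, drop the leading digit, append the new one.
        current = current * 4 - outgoing * top + incoming
        hashes.append(current)
    return hashes


sequence = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
expected = [window_hash(sequence[i:i + K]) for i in range(len(sequence) - K + 1)]
assert rolled_hashes(sequence) == expected
```

If the two lists ever differ, the update formula or the character mapping is the likely culprit.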

Poem

🐰 I hop through A, C, G, T in tune,
Ten-letter windows beneath the moon.
I nibble naive, I tally with hash,
Repeats uncovered in a quiet dash. ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The description provides a clear summary of changes and marks all required checklist items as complete, though one checkbox is problematic. | The 'Fix a bug or typo' checkbox should not be marked as it contradicts the actual changes (adding new code, not fixing bugs). Verify and correct this checkbox. |
✅ Passed checks (1 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: adding a repeated DNA sequences algorithm to the sliding window category. |

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a93e069 and 3f3c9e9.

📒 Files selected for processing (1)
  • algorithms/sliding_window/repeated_dna_sequences/__init__.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • algorithms/sliding_window/repeated_dna_sequences/__init__.py


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
algorithms/sliding_window/repeated_dna_sequences/test_repeated_dna_sequences.py (1)

5-40: Good test coverage for typical scenarios.

The tests cover various important cases including overlapping repeats, homogeneous sequences, and patterns. The use of sorted() for order-independent comparison in tests 1 and 5 is appropriate.

Consider adding edge case tests for:

  • Empty string or strings with length < 10 (should return empty list)
  • String with exactly 10 characters (should return empty since no repeats possible)
  • These would verify the early-return logic in the optimized implementation

Example:

def test_edge_case_short_sequence(self):
    """Test sequences shorter than 10 characters"""
    self.assertEqual([], find_repeated_dna_sequences("ACGT"))
    self.assertEqual([], find_repeated_dna_sequences(""))

def test_edge_case_exact_length(self):
    """Test sequence of exactly 10 characters"""
    self.assertEqual([], find_repeated_dna_sequences("ACGTACGTAC"))

algorithms/sliding_window/repeated_dna_sequences/__init__.py (1)

52-52: Remove duplicate comment marker.

-    # # Compute the initial hash using base-4 multiplication
+    # Compute the initial hash using base-4 multiplication

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1aff4bb and 682dcdb.

⛔ Files ignored due to path filters (18)
  • algorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_example_one.png is excluded by !**/*.png
  • algorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_example_three.png is excluded by !**/*.png
  • algorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_example_two.png is excluded by !**/*.png
  • algorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_eight.png is excluded by !**/*.png
  • algorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_eleven.png is excluded by !**/*.png
  • algorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_fifteen.png is excluded by !**/*.png
  • algorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_five.png is excluded by !**/*.png
  • algorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_four.png is excluded by !**/*.png
  • algorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_fourteen.png is excluded by !**/*.png
  • algorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_nine.png is excluded by !**/*.png
  • algorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_one.png is excluded by !**/*.png
  • algorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_seven.png is excluded by !**/*.png
  • algorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_six.png is excluded by !**/*.png
  • algorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_ten.png is excluded by !**/*.png
  • algorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_thirteen.png is excluded by !**/*.png
  • algorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_three.png is excluded by !**/*.png
  • algorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_twelve.png is excluded by !**/*.png
  • algorithms/sliding_window/repeated_dna_sequences/images/repeated_dna_sequences_illustration_two.png is excluded by !**/*.png
📒 Files selected for processing (4)
  • DIRECTORY.md (1 hunks)
  • algorithms/sliding_window/repeated_dna_sequences/README.md (1 hunks)
  • algorithms/sliding_window/repeated_dna_sequences/__init__.py (1 hunks)
  • algorithms/sliding_window/repeated_dna_sequences/test_repeated_dna_sequences.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
algorithms/sliding_window/repeated_dna_sequences/test_repeated_dna_sequences.py (1)
algorithms/sliding_window/repeated_dna_sequences/__init__.py (1)
  • find_repeated_dna_sequences (27-71)
🪛 LanguageTool
algorithms/sliding_window/repeated_dna_sequences/README.md

[grammar] ~240-~240: Use a hyphen to join words.
Context: ...t containing all the repeating 10-letter long sequences. ### Time Complexity Let...

(QB_NEW_EN_HYPHEN)

🪛 markdownlint-cli2 (0.18.1)
DIRECTORY.md

104-104: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


105-105: Unordered list indentation
Expected: 4; Actual: 6

(MD007, ul-indent)

🔇 Additional comments (3)
DIRECTORY.md (1)

104-105: LGTM - Directory entry follows repository conventions.

The addition correctly documents the new Repeated DNA Sequences module under the Sliding Window category with proper link formatting.

algorithms/sliding_window/repeated_dna_sequences/README.md (1)

1-271: Excellent documentation with comprehensive explanations.

The README provides clear explanations of both the naive and optimized approaches, detailed mathematical formulas for the rolling hash technique, step-by-step implementation guidance, and proper complexity analysis. The extensive illustrations and code examples will help readers understand the algorithm.

algorithms/sliding_window/repeated_dna_sequences/__init__.py (1)

27-71: Well-implemented rolling hash algorithm.

The optimized implementation correctly uses a base-4 rolling hash with proper:

  • Character encoding (A→0, C→1, G→2, T→3)
  • Initial hash computation for the first window
  • Efficient hash updates when sliding: h' = h * 4 - old_char * 4^10 + new_char
  • Early return for sequences ≤ 10 characters
  • Set-based deduplication of results

The time complexity is O(n) and space complexity is O(n), meeting the optimization goals described in the README.
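The points listed above can be put together in a short sketch. This is an illustration under stated assumptions rather than the merged implementation: the function name `find_repeated_dna_sequences_sketch` and the parameter `k` are hypothetical, input validation is omitted for brevity (an invalid letter would surface as a `KeyError` here), and the hash is the unreduced base-4 value of the window.

```python
from typing import List, Set

ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}


def find_repeated_dna_sequences_sketch(dna_sequence: str, k: int = 10) -> List[str]:
    """Return every k-letter window that occurs more than once (order unspecified)."""
    # Early return: a repeat needs at least two windows, so more than k characters.
    if len(dna_sequence) <= k:
        return []

    top = 4 ** k  # weight removed when the leading character leaves the window

    # Initial hash of the first window, read as a base-4 number.
    current = 0
    for char in dna_sequence[:k]:
        current = current * 4 + ENCODING[char]

    seen: Set[int] = {current}
    repeated: Set[str] = set()

    for end in range(k, len(dna_sequence)):
        outgoing = ENCODING[dna_sequence[end - k]]
        incoming = ENCODING[dna_sequence[end]]
        current = current * 4 - outgoing * top + incoming  # h' = h*4 - old*4^k + new
        if current in seen:
            # Only slice the substring when a repeat is actually found.
            repeated.add(dna_sequence[end - k + 1:end + 1])
        else:
            seen.add(current)

    return list(repeated)
```

Because the result comes from a set, callers that care about order would need to sort it, for example `sorted(find_repeated_dna_sequences_sketch("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"))`.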

Note on hash collisions: If the hash is the plain base-4 value of the window with no modulus applied, each 10-character sequence maps to a unique integer in [0, 4^10) (4^10 = 1,048,576), so two different windows cannot produce the same hash, whatever the input length. Collisions become possible only if the hash is reduced modulo a smaller value. In that case, consider verifying actual substring equality whenever a hash matches to eliminate false positives.
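If the hash were ever reduced modulo some value (an assumption; nothing in this PR's summary mentions a modulus), the substring verification suggested above could be sketched roughly as follows. The name `find_repeated_verified` and the `modulus` parameter are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set

ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}


def find_repeated_verified(
    dna_sequence: str, k: int = 10, modulus: int = 1_000_003
) -> List[str]:
    """Rolling hash modulo `modulus`, with substring checks to rule out collisions."""
    if len(dna_sequence) <= k:
        return []

    top = pow(4, k, modulus)  # 4^k reduced modulo `modulus`

    current = 0
    for char in dna_sequence[:k]:
        current = (current * 4 + ENCODING[char]) % modulus

    # Map each hash to the actual windows that produced it.
    windows_by_hash: DefaultDict[int, Set[str]] = defaultdict(set)
    windows_by_hash[current].add(dna_sequence[:k])
    repeated: Set[str] = set()

    for end in range(k, len(dna_sequence)):
        outgoing = ENCODING[dna_sequence[end - k]]
        incoming = ENCODING[dna_sequence[end]]
        # Python's % returns a non-negative result for a positive modulus.
        current = (current * 4 - outgoing * top + incoming) % modulus
        window = dna_sequence[end - k + 1:end + 1]
        if window in windows_by_hash[current]:
            repeated.add(window)  # same hash and same text: a genuine repeat
        else:
            windows_by_hash[current].add(window)

    return list(repeated)
```

Slicing every window costs O(k) per step, so this variant trades the constant-time update for safety. With k fixed at 10 the overall cost stays linear in the input length.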

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
algorithms/sliding_window/repeated_dna_sequences/__init__.py (2)

1-1: Remove unused Dict import.

The Dict type is imported but not used anywhere in this module.

Apply this diff:

-from typing import List, Dict
+from typing import List

57-57: Fix double hash in comment.

The comment has an extra # character.

Apply this diff:

-    # # Compute the initial hash using base-4 multiplication
+    # Compute the initial hash using base-4 multiplication

algorithms/sliding_window/repeated_dna_sequences/test_repeated_dna_sequences.py (1)

1-80: Test coverage is solid but could benefit from edge cases.

The test suite effectively validates both implementations with parallel test cases covering:

  • Typical repeated sequences
  • Homogeneous sequences (all same character)
  • No repeats scenarios
  • Overlapping patterns
  • Complex repeating patterns

The use of sorted() for order-independent comparisons is appropriate since the functions return results from sets.

Consider adding edge case tests:

def test_empty_string(self):
    """Test empty string input."""
    dna_sequence = ""
    expected = []
    actual = find_repeated_dna_sequences(dna_sequence)
    self.assertEqual(expected, actual)

def test_exact_length_10(self):
    """Test string of exactly length 10 (boundary case)."""
    dna_sequence = "AAAAAAAAAA"
    expected = []
    actual = find_repeated_dna_sequences(dna_sequence)
    self.assertEqual(expected, actual)

def test_length_11_with_repeat(self):
    """Test minimal length where repeat is possible."""
    dna_sequence = "AAAAAAAAAAA"  # 11 A's
    expected = ["AAAAAAAAAA"]
    actual = find_repeated_dna_sequences(dna_sequence)
    self.assertEqual(expected, actual)

If input validation is added to handle invalid characters, also add:

def test_invalid_characters(self):
    """Test that invalid DNA characters raise appropriate error."""
    dna_sequence = "AAAAXAAAAA"
    with self.assertRaises(ValueError):
        find_repeated_dna_sequences(dna_sequence)

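For reference, the kind of validation this test exercises might look like the sketch below. The helper name `validate_dna` and the error message are assumptions for illustration; the PR summary only states that invalid DNA letters raise `ValueError`.

```python
VALID_BASES = frozenset("ACGT")


def validate_dna(dna_sequence: str) -> None:
    """Raise ValueError if the sequence contains anything other than A, C, G, or T."""
    for index, char in enumerate(dna_sequence):
        if char not in VALID_BASES:
            raise ValueError(
                f"Invalid DNA character {char!r} at index {index}; "
                "expected one of A, C, G, T"
            )
```

Reporting the offending character and its index makes a failure easier to trace in long inputs. Whether lowercase letters should be accepted is a design choice this sketch leaves strict.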
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 682dcdb and a93e069.

📒 Files selected for processing (2)
  • algorithms/sliding_window/repeated_dna_sequences/__init__.py (1 hunks)
  • algorithms/sliding_window/repeated_dna_sequences/test_repeated_dna_sequences.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
algorithms/sliding_window/repeated_dna_sequences/test_repeated_dna_sequences.py (1)
algorithms/sliding_window/repeated_dna_sequences/__init__.py (2)
  • find_repeated_dna_sequences (32-76)
  • find_repeated_dna_sequences_naive (4-30)
🔇 Additional comments (2)
algorithms/sliding_window/repeated_dna_sequences/__init__.py (2)

4-30: LGTM! Past review feedback fully addressed.

The naive implementation now correctly:

  • Validates input length before processing
  • Uses a set for seen instead of a dictionary
  • Checks subsequence length to skip incomplete windows at the end

The logic for tracking repeated sequences is sound.
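The naive approach described above can be sketched as follows. This is an illustrative version with hypothetical names (`find_repeated_naive_sketch`, `k`), not the PR's code; in particular, it bounds the loop range so that incomplete windows never arise, whereas the review notes that the PR checks the subsequence length instead.

```python
from typing import List, Set


def find_repeated_naive_sketch(dna_sequence: str, k: int = 10) -> List[str]:
    """Return every k-letter substring seen more than once (order unspecified)."""
    # Validate input length before processing: no repeat is possible otherwise.
    if len(dna_sequence) <= k:
        return []

    seen: Set[str] = set()      # windows encountered so far
    repeated: Set[str] = set()  # windows encountered at least twice

    # The range stops at the last start index that still yields a full window.
    for start in range(len(dna_sequence) - k + 1):
        window = dna_sequence[start:start + k]
        if window in seen:
            repeated.add(window)
        else:
            seen.add(window)

    return list(repeated)
```

Each slice and set lookup costs O(k), so the total work is O(n·k), which is the cost the rolling-hash version avoids.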


52-76: Rolling hash implementation looks solid.

The rolling hash algorithm is correctly implemented:

  • Base-4 encoding efficiently maps DNA bases to integers
  • Initial hash computation properly accumulates values
  • The precomputed a_k (4^10) value enables O(1) window updates
  • The sliding window correctly removes the leftmost character and adds the new rightmost character

The algorithm achieves O(n) time complexity with O(n) space for tracking seen hashes, which is optimal for this problem.

@BrianLusina BrianLusina merged commit e84daec into main Nov 5, 2025
6 of 8 checks passed
@BrianLusina BrianLusina deleted the feat/algorithms-sliding-window branch November 5, 2025 13:04

Labels

Algorithm Algorithm Problem Documentation Documentation Updates enhancement Sliding Window

2 participants