
Feature: Add semantic duplicate issue detection for GitHub issues #551

Open

vaibhavbaghel wants to merge 4 commits into AOSSIE-Org:main from vaibhavbaghel:backend

Conversation

@vaibhavbaghel vaibhavbaghel commented Mar 10, 2026

Summary

This PR introduces a semantic duplicate issue detection tool for GitHub issues.
The script analyzes newly created issues and compares them against existing issues using sentence embeddings to identify potential duplicates.

Related Issue: #396

Key Features

  • Uses sentence-transformers to generate semantic embeddings
  • Calculates cosine similarity between issues
  • Flags potential duplicates above a configurable similarity threshold
  • Posts a comment suggesting related issues for maintainers to review
  • Does not automatically close issues (maintainer decision required)
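The ranking core described above can be sketched with plain NumPy. In the actual PR the embeddings come from sentence-transformers (the review mentions the all-MiniLM-L6-v2 model); the toy vectors below merely stand in for model output, and the function name is illustrative:

```python
import numpy as np

def rank_duplicates(new_emb, existing_embs, threshold=0.75, limit=5):
    """Return (index, score) pairs for existing issues whose cosine
    similarity to the new issue meets the threshold, best first."""
    a = new_emb / np.linalg.norm(new_emb)
    b = existing_embs / np.linalg.norm(existing_embs, axis=1, keepdims=True)
    scores = b @ a  # cosine similarity of each existing issue vs. the new one
    hits = [(i, float(s)) for i, s in enumerate(scores) if s >= threshold]
    hits.sort(key=lambda p: p[1], reverse=True)
    return hits[:limit]

# Toy 3-dimensional "embeddings" standing in for model output.
new = np.array([1.0, 0.0, 0.0])
existing = np.array([[0.9, 0.1, 0.0],   # close to the new issue
                     [0.0, 1.0, 0.0]])  # unrelated
print(rank_duplicates(new, existing))
```

With the default 0.75 threshold only the first existing issue would be flagged here; lowering SIMILARITY_THRESHOLD widens the net at the cost of more false positives.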

Configuration

The behavior can be configured using environment variables:

| Variable | Description | Default |
| --- | --- | --- |
| SIMILARITY_THRESHOLD | Similarity score threshold for duplicate detection | 0.75 |
| LIMIT_RESULTS | Maximum number of suggested issues | 5 |
| EXCLUDE_CLOSED | Whether to exclude closed issues | false |
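Reading these variables with the documented defaults might look like the following sketch (the helper name and dict keys are illustrative, not taken from the script):

```python
import os

def env_config():
    """Read the tool's settings from the environment, falling back to the
    defaults documented in the PR description."""
    return {
        "threshold": float(os.environ.get("SIMILARITY_THRESHOLD", "0.75")),
        "limit": int(os.environ.get("LIMIT_RESULTS", "5")),
        "exclude_closed": os.environ.get("EXCLUDE_CLOSED", "false").lower() == "true",
    }

print(env_config())
```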

Workflow

  1. Fetch existing issues from the repository
  2. Generate embeddings for the current issue
  3. Compute semantic similarity with existing issues
  4. Identify potential duplicates
  5. Post a comment suggesting related issues
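Step 5 boils down to rendering a markdown body for the GitHub comment API. A minimal sketch of that formatting, with hypothetical field names and wording that may differ from the real script's comment:

```python
def format_comment(matches):
    """Render the maintainer-facing comment body from a list of matches.
    Each match dict here uses illustrative keys: number, title, state, score."""
    lines = ["Possible duplicate issues found:", ""]
    for m in matches:
        status = "open" if m["state"] == "open" else "closed"
        lines.append(f"- #{m['number']} {m['title']} ({status}, {m['score']:.0%} similar)")
    lines.append("")
    lines.append("_This is only a suggestion; a maintainer decides whether to close._")
    return "\n".join(lines)

print(format_comment([{"number": 42, "title": "App crashes on start",
                       "state": "open", "score": 0.91}]))
```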

Benefits

  • Helps reduce duplicate issues
  • Improves maintainers' triage efficiency
  • Maintains a cleaner issue tracker
  • Provides suggestions without automatic enforcement

Future Improvements

  • Embedding caching for faster comparisons
  • Support for issue labels filtering
  • Incremental similarity search

Summary by CodeRabbit

  • New Features

    • Automated duplicate-issue detection: analyzes new/edited issues, ranks similar existing issues by semantic similarity, and posts a formatted comment listing matches with status indicators and links.
  • Chores

    • Added a background workflow to run detection on issue events and a pinned dependency manifest to support the automation.
  • Style

    • Improved visual spacing on quantity adjustment buttons.

coderabbitai bot commented Mar 10, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 456da1a5-afd6-4cde-b930-91e45bb7c773

📥 Commits

Reviewing files that changed from the base of the PR and between a2704d1 and c24b12a.

📒 Files selected for processing (1)
  • .github/workflows/duplicate-detection.yml
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/duplicate-detection.yml

📝 Walkthrough

Walkthrough

This PR adds an automated duplicate-issue detection pipeline: a new Python checker script, a pinned requirements file, and a GitHub Actions workflow that runs the checker on issue opened/edited events. Also includes a small UI padding tweak in a React component.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Duplicate detection script `.github/scripts/check-duplicates.py` | New script that fetches up to 500 repo issues (paginated, optionally excluding closed ones), prepares the title plus the first 500 characters of the body, computes embeddings with SentenceTransformer, calculates cosine similarities, ranks and filters results above a configurable threshold, formats a markdown comment listing similar issues, and posts it to the current issue via the GitHub API. |
| Dependencies `.github/scripts/requirements.txt` | New requirements file with floating lower bounds: sentence-transformers>=2.2.2, scikit-learn>=1.3.0, requests>=2.31.0, numpy>=1.24.0, torch>=2.0.0. |
| CI workflow `.github/workflows/duplicate-detection.yml` | New GitHub Actions workflow triggered on issue opened/edited; checks out the repo, sets up Python 3.11, installs deps from the new requirements file, and runs the duplicate-check script with ISSUE_NUMBER, ISSUE_TITLE, ISSUE_BODY, REPO, and GITHUB_TOKEN provided via env. |
| UI styling `eduaid_web/src/pages/Text_Input.jsx` | Minor visual change: adds pb-1 bottom padding to the +/- quantity buttons (styling only). |

Sequence Diagram

```mermaid
sequenceDiagram
    actor GitHub
    participant Workflow as GitHub Actions
    participant Script as check-duplicates.py
    participant GitHubAPI as GitHub API
    participant Model as SentenceTransformer
    GitHub->>Workflow: issue opened/edited event
    Workflow->>Workflow: checkout, setup Python, install deps
    Workflow->>Script: run with ISSUE_NUMBER, TITLE, BODY, REPO, GITHUB_TOKEN
    Script->>GitHubAPI: GET /issues (paginated, exclude closed optional, max 500)
    GitHubAPI-->>Script: issues list
    Script->>Model: prepare texts (title + 500 chars), compute embeddings
    Model-->>Script: embeddings
    Script->>Script: compute cosine similarities, filter & sort by threshold
    Script->>Script: format markdown comment
    Script->>GitHubAPI: POST comment to current issue (if matches)
    GitHubAPI-->>Script: comment response
    Script-->>Workflow: exit with status
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~28 minutes

Poem

🐰 I hopped through titles, sniffed the thread,
Found echoes where new and old were led,
Embeddings whispered "these are near",
I left a note so folks can steer,
Now issue burrows are tidy and fed.

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main feature being added: semantic duplicate issue detection for GitHub issues. It aligns with the primary objective of the PR and is specific enough for teammates to understand the change. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%. |


@vaibhavbaghel changed the title from "Feature: AI-powered duplicate issue detection using sentence embeddings" to "Feature: Add semantic duplicate issue detection for GitHub issues" on Mar 10, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (2)
.github/workflows/duplicate-detection.yml (1)

24-36: Cache the sentence-transformer model between runs.

The workflow caches pip dependencies but not the Hugging Face model that check-duplicates.py loads at line 103 (SentenceTransformer("all-MiniLM-L6-v2")). On cold runner starts, this model will be redownloaded for every issue event, adding unnecessary latency and flakiness to triage. Add a caching step for the Hugging Face model directory using actions/cache and HF_HOME environment variable to persist the model across workflow runs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/duplicate-detection.yml around lines 24 - 36, The workflow currently installs pip deps but doesn't persist the Hugging Face model used in check-duplicates.py (SentenceTransformer("all-MiniLM-L6-v2")), so add an actions/cache step that caches the model directory and set HF_HOME to that cache path in the job environment; specifically, define HF_HOME (e.g., an env like HF_HOME: ${{ runner.temp }}/hf or similar), add an actions/cache entry that keys and restores the HF_HOME directory (or the Hugging Face cache path) before running pip and the python script, and ensure check-duplicates.py continues to load the model from SentenceTransformer using the HF_HOME-backed cache. 
.github/scripts/requirements.txt (1)

1-5: Use pinned dependency versions for the duplicate-detection workflow.

The workflow installs .github/scripts/requirements.txt with floating >= specs on every issue event, making duplicate detection behavior non-reproducible across runs and over time. The root requirements.txt already demonstrates pinned versions are maintained for this project (e.g., torch==2.5.1, scikit-learn==1.5.2). Update .github/scripts/requirements.txt to use pinned versions (==) from the root file or generate a proper lock file.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/scripts/requirements.txt around lines 1 - 5, Update the .github/scripts/requirements.txt to use pinned versions instead of floating >= specs so duplicate-detection runs are reproducible: replace each dependency (sentence-transformers, scikit-learn, requests, numpy, torch) with exact versions (==) matching the project's root requirements.txt or freeze them into a lock file and reference that; ensure the file uses the same pinned versions as root (e.g., the root's torch and scikit-learn pins) or generate a pinned requirements.lock and update the workflow to install that lock file. 
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:

In @.github/scripts/check-duplicates.py:

  • Around lines 141-145: The try/except around calculate_similarity is swallowing errors and converting failures into an empty duplicate list; instead propagate the failure so the job fails or returns an explicit error state—remove the blanket except or re-raise the caught exception (or return a distinct error value) from where calculate_similarity is called, and update main() to treat that propagated exception/value as a fatal error (exit non‑zero or surface an error message) rather than treating it as “no duplicates found.” Ensure you reference calculate_similarity and main() when making the change so callers handle real failures instead of returning [].
  • Around lines 189-205: The post_comment function currently always posts a new comment; change it to upsert a single detector comment by first listing existing comments for the issue (GET to f"{GITHUB_API_URL}/repos/{REPO}/issues/{ISSUE_NUMBER}/comments" with HEADERS), search for a prior bot comment using a unique marker string you add to the comment body (e.g., "<!-- duplicate-detector -->") or by matching the bot's user, and if found use PATCH to update that comment (PATCH to f"{GITHUB_API_URL}/repos/{REPO}/issues/comments/{comment_id}") otherwise POST as before; keep using HEADERS, timeout and response.raise_for_status() and return True/False as currently implemented in post_comment to preserve behavior.
  • Around lines 21-27: The dynamic runtime pip-install fallback around the SentenceTransformer import must be removed so missing dependencies fail fast; delete the try/except block that catches ImportError and runs subprocess.check_call, and replace it with a single top-level import from sentence_transformers import SentenceTransformer so an ImportError surfaces immediately (this touches the import of SentenceTransformer in the script).

In @.github/workflows/duplicate-detection.yml:

  • Around lines 18-22: Update the GitHub Actions step that currently uses actions/setup-python@v4 to a supported major version (actions/setup-python@v5 or `@v6`) to avoid Node16 runner removal; specifically modify the step referencing "actions/setup-python@v4" to use at least "actions/setup-python@v5" (or "@v6"), keep the existing inputs (python-version: '3.11' and cache: 'pip') unchanged, then run the workflow to verify the runner no longer errors due to the deprecated Node runtime.

Nitpick comments:

In @.github/scripts/requirements.txt:

  • Around lines 1-5: Update the .github/scripts/requirements.txt to use pinned versions instead of floating >= specs so duplicate-detection runs are reproducible: replace each dependency (sentence-transformers, scikit-learn, requests, numpy, torch) with exact versions (==) matching the project's root requirements.txt or freeze them into a lock file and reference that; ensure the file uses the same pinned versions as root (e.g., the root's torch and scikit-learn pins) or generate a pinned requirements.lock and update the workflow to install that lock file.

In @.github/workflows/duplicate-detection.yml:

  • Around lines 24-36: The workflow currently installs pip deps but doesn't persist the Hugging Face model used in check-duplicates.py (SentenceTransformer("all-MiniLM-L6-v2")), so add an actions/cache step that caches the model directory and set HF_HOME to that cache path in the job environment; specifically, define HF_HOME (e.g., an env like HF_HOME: ${{ runner.temp }}/hf or similar), add an actions/cache entry that keys and restores the HF_HOME directory (or the Hugging Face cache path) before running pip and the python script, and ensure check-duplicates.py continues to load the model from SentenceTransformer using the HF_HOME-backed cache.
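The upsert-comment suggestion hinges on finding a previous detector comment by a hidden marker. The marker-matching piece can be sketched as a pure function over the JSON list returned by the comments endpoint (the marker string is the one the review proposes; the function name is illustrative):

```python
MARKER = "<!-- duplicate-detector -->"  # hidden tag identifying the bot's own comment

def find_marker_comment(comments, marker=MARKER):
    """Given the parsed JSON list from GET .../issues/{n}/comments, return
    the id of an earlier detector comment to PATCH, or None to POST afresh."""
    for c in comments:
        if marker in (c.get("body") or ""):  # body can be null in the API response
            return c["id"]
    return None
```

With an id in hand the caller would PATCH /repos/{repo}/issues/comments/{id}; with None it would POST to /repos/{repo}/issues/{n}/comments as the script does today.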

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 376fea32-ba12-447c-bb31-636e7976cfd8

📥 Commits

Reviewing files that changed from the base of the PR and between fc3bf1a and ad77624.

📒 Files selected for processing (4)
  • .github/scripts/check-duplicates.py
  • .github/scripts/requirements.txt
  • .github/workflows/duplicate-detection.yml
  • eduaid_web/src/pages/Text_Input.jsx
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:

In @.github/scripts/check-duplicates.py:

  • Around lines 62-80: The current pagination loop in check-duplicates.py swallows RequestException and breaks, returning a partial or empty issues list; change the except requests.exceptions.RequestException block to propagate the failure instead of breaking (e.g., re-raise the caught exception or call sys.exit(1)) so that callers of the function (the workflow) fail fast on API/auth/network errors; update the handler around the requests.get/response.raise_for_status section (the try/except surrounding page_issues, issues.extend and pagination logic) to surface the original exception and include the error context in the raised error/logging.
  • Around lines 240-242: The current main() ignores post_comment()'s return value so comment-post failures don't fail the job; update the block that calls post_comment(comment) to check its boolean result and exit non‑zero on failure (e.g., if not post_comment(comment): log an error and sys.exit(1) or raise SystemExit) so permission/API failures cause the script to fail; reference the post_comment function and the main() flow where the comment is printed and posted.
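The fail-fast pagination shape the review asks for can be sketched with the HTTP call abstracted behind a caller-supplied `get_page` function (an illustrative stand-in for the script's requests.get loop), so that any exception it raises propagates instead of being silently converted into "no duplicates":

```python
def fetch_all_issues(get_page, max_issues=500):
    """Paginate through issues via get_page(page) -> list of issue dicts.
    Exceptions from get_page (e.g. requests.RequestException on auth or
    network errors) propagate, so the workflow fails fast rather than
    continuing with a partial or empty issue list."""
    issues, page = [], 1
    while len(issues) < max_issues:
        batch = get_page(page)  # no try/except: API errors surface to the caller
        if not batch:
            break  # an empty page means we've read everything
        issues.extend(batch)
        page += 1
    return issues[:max_issues]
```

The workflow-level effect is that a 401 or rate-limit error marks the job red, which is far easier to notice than a bot that quietly stops commenting.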

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3eb61906-b169-4909-b104-8d3abe5eb2c7

📥 Commits

Reviewing files that changed from the base of the PR and between ad77624 and a2704d1.

📒 Files selected for processing (1)
  • .github/scripts/check-duplicates.py
