Feature: Add semantic duplicate issue detection for GitHub issues #551
vaibhavbaghel wants to merge 4 commits into AOSSIE-Org:main
📝 Walkthrough
This PR adds an automated duplicate-issue detection pipeline: a new Python checker script, a pinned requirements file, and a GitHub Actions workflow that runs the checker on issue opened/edited events. It also includes a small UI padding tweak in a React component.
Sequence Diagram
Participants: GitHub, GitHub Actions (Workflow), check-duplicates.py (Script), GitHub API, SentenceTransformer (Model)
1. GitHub → Workflow: issue opened/edited event
2. Workflow → Workflow: checkout, setup Python, install deps
3. Workflow → Script: run with ISSUE_NUMBER, TITLE, BODY, REPO, GITHUB_TOKEN
4. Script → GitHub API: GET /issues (paginated, exclude closed optional, max 500)
5. GitHub API → Script: issues list
6. Script → Model: prepare texts (title + 500 chars), compute embeddings
7. Model → Script: embeddings
8. Script → Script: compute cosine similarities, filter & sort by threshold
9. Script → Script: format markdown comment
10. Script → GitHub API: POST comment to current issue (if matches)
11. GitHub API → Script: comment response
12. Script → Workflow: exit with status
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~28 minutes
Actionable comments posted: 4
🧹 Nitpick comments (2)
.github/workflows/duplicate-detection.yml (1)
24-36: Cache the sentence-transformer model between runs.
The workflow caches pip dependencies but not the Hugging Face model that check-duplicates.py loads at line 103 (SentenceTransformer("all-MiniLM-L6-v2")). On cold runner starts, this model will be redownloaded for every issue event, adding unnecessary latency and flakiness to triage. Add a caching step for the Hugging Face model directory using actions/cache and the HF_HOME environment variable to persist the model across workflow runs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/duplicate-detection.yml around lines 24 - 36, The workflow currently installs pip deps but doesn't persist the Hugging Face model used in check-duplicates.py (SentenceTransformer("all-MiniLM-L6-v2")), so add an actions/cache step that caches the model directory and set HF_HOME to that cache path in the job environment; specifically, define HF_HOME (e.g., an env like HF_HOME: ${{ runner.temp }}/hf or similar), add an actions/cache entry that keys and restores the HF_HOME directory (or the Hugging Face cache path) before running pip and the python script, and ensure check-duplicates.py continues to load the model from SentenceTransformer using the HF_HOME-backed cache.
.github/scripts/requirements.txt (1)
1-5: Use pinned dependency versions for the duplicate-detection workflow.
The workflow installs .github/scripts/requirements.txt with floating >= specs on every issue event, making duplicate detection behavior non-reproducible across runs and over time. The root requirements.txt already demonstrates that pinned versions are maintained for this project (e.g., torch==2.5.1, scikit-learn==1.5.2). Update .github/scripts/requirements.txt to use pinned versions (==) from the root file or generate a proper lock file.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/scripts/requirements.txt around lines 1 - 5, Update the .github/scripts/requirements.txt to use pinned versions instead of floating >= specs so duplicate-detection runs are reproducible: replace each dependency (sentence-transformers, scikit-learn, requests, numpy, torch) with exact versions (==) matching the project's root requirements.txt or freeze them into a lock file and reference that; ensure the file uses the same pinned versions as root (e.g., the root's torch and scikit-learn pins) or generate a pinned requirements.lock and update the workflow to install that lock file.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/scripts/check-duplicates.py:
- Around line 141-145: The try/except around calculate_similarity is swallowing errors and converting failures into an empty duplicate list; instead propagate the failure so the job fails or returns an explicit error state. Remove the blanket except or re-raise the caught exception (or return a distinct error value) from where calculate_similarity is called, and update main() to treat that propagated exception/value as a fatal error (exit non-zero or surface an error message) rather than treating it as "no duplicates found." Ensure you reference calculate_similarity and main() when making the change so callers handle real failures instead of returning [].
- Around line 189-205: The post_comment function currently always posts a new comment; change it to upsert a single detector comment by first listing existing comments for the issue (GET to f"{GITHUB_API_URL}/repos/{REPO}/issues/{ISSUE_NUMBER}/comments" with HEADERS), search for a prior bot comment using a unique marker string you add to the comment body (e.g., "<!-- duplicate-detector -->") or by matching the bot's user, and if found use PATCH to update that comment (PATCH to f"{GITHUB_API_URL}/repos/{REPO}/issues/comments/{comment_id}"), otherwise POST as before; keep using HEADERS, timeout and response.raise_for_status() and return True/False as currently implemented in post_comment to preserve behavior.
- Around line 21-27: The dynamic runtime pip-install fallback around the SentenceTransformer import must be removed so missing dependencies fail fast; delete the try/except block that catches ImportError and runs subprocess.check_call, and replace it with a single top-level import (from sentence_transformers import SentenceTransformer) so an ImportError surfaces immediately (this touches the import of SentenceTransformer in the script).
In @.github/workflows/duplicate-detection.yml:
- Around line 18-22: Update the GitHub Actions step that currently uses actions/setup-python@v4 to a supported major version (actions/setup-python@v5 or @v6) to avoid Node16 runner removal; specifically modify the step referencing "actions/setup-python@v4" to use at least "actions/setup-python@v5" (or "@v6"), keep the existing inputs (python-version: '3.11' and cache: 'pip') unchanged, then run the workflow to verify the runner no longer errors due to the deprecated Node runtime.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 376fea32-ba12-447c-bb31-636e7976cfd8
📒 Files selected for processing (4)
.github/scripts/check-duplicates.py
.github/scripts/requirements.txt
.github/workflows/duplicate-detection.yml
eduaid_web/src/pages/Text_Input.jsx
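The upsert behavior requested for post_comment can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: `upsert_comment` and its parameters are hypothetical names, `session` is assumed to behave like a `requests.Session` with the auth headers already set, and only the first page of comments is searched. Unlike the original post_comment, it returns the API response instead of True/False.

```python
MARKER = "<!-- duplicate-detector -->"  # marker string suggested in the review


def upsert_comment(session, api_url, repo, issue_number, body, timeout=30):
    """Create or update the single detector comment on an issue.

    `session` is assumed to expose get/post/patch like requests.Session,
    returning objects with raise_for_status() and json().
    """
    marked_body = f"{MARKER}\n{body}"
    list_url = f"{api_url}/repos/{repo}/issues/{issue_number}/comments"

    # Look for a previous detector comment (first page only in this sketch).
    response = session.get(list_url, params={"per_page": 100}, timeout=timeout)
    response.raise_for_status()
    existing = next(
        (c for c in response.json() if MARKER in (c.get("body") or "")),
        None,
    )

    if existing is not None:
        # Update the earlier comment in place instead of adding a new one.
        url = f"{api_url}/repos/{repo}/issues/comments/{existing['id']}"
        response = session.patch(url, json={"body": marked_body}, timeout=timeout)
    else:
        response = session.post(list_url, json={"body": marked_body}, timeout=timeout)

    response.raise_for_status()  # let API or permission errors surface
    return response.json()
```

Because the marker is an HTML comment, it stays invisible in the rendered issue while still letting later runs find and edit the same comment on each issue edit.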
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/scripts/check-duplicates.py:
- Around line 62-80: The current pagination loop in check-duplicates.py swallows RequestException and breaks, returning a partial or empty issues list; change the except requests.exceptions.RequestException block to propagate the failure instead of breaking (e.g., re-raise the caught exception or call sys.exit(1)) so that callers of the function (the workflow) fail fast on API/auth/network errors; update the handler around the requests.get/response.raise_for_status section (the try/except surrounding page_issues, issues.extend and pagination logic) to surface the original exception and include the error context in the raised error/logging.
- Around line 240-242: The current main() ignores post_comment()'s return value, so comment-post failures don't fail the job; update the block that calls post_comment(comment) to check its boolean result and exit non-zero on failure (e.g., if not post_comment(comment): log an error and sys.exit(1) or raise SystemExit) so permission/API failures cause the script to fail; reference the post_comment function and the main() flow where the comment is printed and posted.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 3eb61906-b169-4909-b104-8d3abe5eb2c7
📒 Files selected for processing (1)
.github/scripts/check-duplicates.py
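The two fail-fast findings above could look roughly like this. It is a sketch under stated assumptions: `fetch_issues`, `run`, `get_page`, and `format_comment` are hypothetical names standing in for the script's real functions, the 500-issue cap comes from the sequence diagram, and the page size of 100 is assumed. In the real script the caught exception would be requests.exceptions.RequestException.

```python
import sys


def fetch_issues(get_page, max_issues=500, per_page=100):
    """Collect issues page by page; any exception from get_page propagates."""
    issues = []
    page = 1
    while len(issues) < max_issues:
        page_issues = get_page(page, per_page)  # may raise; deliberately not caught
        if not page_issues:
            break
        issues.extend(page_issues)
        if len(page_issues) < per_page:
            break  # a short page means this was the last one
        page += 1
    return issues[:max_issues]


def run(get_page, post_comment, format_comment):
    """Return a process exit code instead of hiding failures."""
    try:
        issues = fetch_issues(get_page)
    except Exception as exc:  # stand-in for requests.exceptions.RequestException
        print(f"Failed to fetch issues: {exc}", file=sys.stderr)
        return 1

    comment = format_comment(issues)
    # post_comment is assumed to return True on success, False on failure.
    if comment and not post_comment(comment):
        print("Failed to post comment", file=sys.stderr)
        return 1
    return 0
```

A caller would end with something like `sys.exit(run(...))`, so an API, auth, or permission failure marks the workflow run as failed rather than looking like "no duplicates found".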
Summary
This PR introduces a semantic duplicate issue detection tool for GitHub issues.
The script analyzes newly created issues and compares them against existing issues using sentence embeddings to identify potential duplicates.
Related Issue: #396
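To make the comparison step concrete, here is a minimal sketch of the similarity filtering, using plain Python and toy vectors in place of real sentence embeddings. The function names and the 0.75 threshold are illustrative assumptions, not values taken from the PR; the actual script obtains embeddings from SentenceTransformer("all-MiniLM-L6-v2"), and the 500-character body limit follows the sequence diagram.

```python
import math


def prepare_text(title, body, max_body_chars=500):
    """Combine the title with the first part of the body."""
    return f"{title} {(body or '')[:max_body_chars]}".strip()


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_duplicates(new_embedding, existing, threshold=0.75):
    """Return (issue_number, score) pairs at or above threshold, best first.

    `existing` is a list of (issue_number, embedding) pairs.
    """
    scored = [(num, cosine_similarity(new_embedding, emb)) for num, emb in existing]
    matches = [(num, score) for num, score in scored if score >= threshold]
    return sorted(matches, key=lambda m: m[1], reverse=True)
```

With real embeddings the same filtering applies unchanged; only the vectors come from the model rather than being written by hand.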
Key Features
Uses sentence-transformers to generate semantic embeddings
Configuration
The behavior can be configured using environment variables:
Workflow
Benefits
Future Improvements
Summary by CodeRabbit
New Features
Chores
Style