
Feature: Add semantic duplicate issue detection for GitHub issues #551

Open

vaibhavbaghel wants to merge 4 commits into AOSSIE-Org:main from vaibhavbaghel:backend

Conversation

@vaibhavbaghel vaibhavbaghel commented Mar 10, 2026

Summary

This PR introduces a semantic duplicate issue detection tool for GitHub issues.
The script analyzes newly created issues and compares them against existing issues using sentence embeddings to identify potential duplicates.

Related Issue: #396

Key Features

  • Uses sentence-transformers to generate semantic embeddings
  • Calculates cosine similarity between issues
  • Flags potential duplicates above a configurable similarity threshold
  • Posts a comment suggesting related issues for maintainers to review
  • Does not automatically close issues (maintainer decision required)
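The ranking core described above can be sketched with plain NumPy. In the actual PR the embeddings come from sentence-transformers (the review mentions the all-MiniLM-L6-v2 model); the toy vectors below merely stand in for model output, and the function name is illustrative:

```python
import numpy as np

def rank_duplicates(new_emb, existing_embs, threshold=0.75, limit=5):
    """Return (index, score) pairs for existing issues whose cosine
    similarity to the new issue meets the threshold, best first."""
    a = new_emb / np.linalg.norm(new_emb)
    b = existing_embs / np.linalg.norm(existing_embs, axis=1, keepdims=True)
    scores = b @ a  # cosine similarity of each existing issue vs. the new one
    hits = [(i, float(s)) for i, s in enumerate(scores) if s >= threshold]
    hits.sort(key=lambda p: p[1], reverse=True)
    return hits[:limit]

# Toy 3-dimensional "embeddings" standing in for model output.
new = np.array([1.0, 0.0, 0.0])
existing = np.array([[0.9, 0.1, 0.0],   # close to the new issue
                     [0.0, 1.0, 0.0]])  # unrelated
print(rank_duplicates(new, existing))
```

With the default 0.75 threshold only the first existing issue would be flagged here; lowering SIMILARITY_THRESHOLD widens the net at the cost of more false positives.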

Configuration

The behavior can be configured using environment variables:

| Variable | Description | Default |
| --- | --- | --- |
| SIMILARITY_THRESHOLD | Similarity score threshold for duplicate detection | 0.75 |
| LIMIT_RESULTS | Maximum number of suggested issues | 5 |
| EXCLUDE_CLOSED | Whether to exclude closed issues | false |
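Reading these variables with the documented defaults might look like the following sketch (the helper name and dict keys are illustrative, not taken from the script):

```python
import os

def env_config():
    """Read the tool's settings from the environment, falling back to the
    defaults documented in the PR description."""
    return {
        "threshold": float(os.environ.get("SIMILARITY_THRESHOLD", "0.75")),
        "limit": int(os.environ.get("LIMIT_RESULTS", "5")),
        "exclude_closed": os.environ.get("EXCLUDE_CLOSED", "false").lower() == "true",
    }

print(env_config())
```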

Workflow

  1. Fetch existing issues from the repository
  2. Generate embeddings for the current issue
  3. Compute semantic similarity with existing issues
  4. Identify potential duplicates
  5. Post a comment suggesting related issues
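Step 5 boils down to rendering a markdown body for the GitHub comment API. A minimal sketch of that formatting, with hypothetical field names and wording that may differ from the real script's comment:

```python
def format_comment(matches):
    """Render the maintainer-facing comment body from a list of matches.
    Each match dict here uses illustrative keys: number, title, state, score."""
    lines = ["Possible duplicate issues found:", ""]
    for m in matches:
        status = "open" if m["state"] == "open" else "closed"
        lines.append(f"- #{m['number']} {m['title']} ({status}, {m['score']:.0%} similar)")
    lines.append("")
    lines.append("_This is only a suggestion; a maintainer decides whether to close._")
    return "\n".join(lines)

print(format_comment([{"number": 42, "title": "App crashes on start",
                       "state": "open", "score": 0.91}]))
```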

Benefits

  • Helps reduce duplicate issues
  • Improves maintainers' triage efficiency
  • Maintains a cleaner issue tracker
  • Provides suggestions without automatic enforcement

Future Improvements

  • Embedding caching for faster comparisons
  • Support for issue labels filtering
  • Incremental similarity search

Summary by CodeRabbit

  • New Features

    • Automated duplicate-issue detection: analyzes new/edited issues, ranks similar existing issues by semantic similarity, and posts a formatted comment listing matches with status indicators and links.
  • Chores

    • Added a background workflow to run detection on issue events and a pinned dependency manifest to support the automation.
  • Style

    • Improved visual spacing on quantity adjustment buttons.

coderabbitai bot commented Mar 10, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 456da1a5-afd6-4cde-b930-91e45bb7c773

📥 Commits

Reviewing files that changed from the base of the PR and between a2704d1 and c24b12a.

📒 Files selected for processing (1)
  • .github/workflows/duplicate-detection.yml
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/duplicate-detection.yml

📝 Walkthrough

Walkthrough

This PR adds an automated duplicate-issue detection pipeline: a new Python checker script, a pinned requirements file, and a GitHub Actions workflow that runs the checker on issue opened/edited events. Also includes a small UI padding tweak in a React component.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Duplicate detection script `.github/scripts/check-duplicates.py` | New script that fetches up to 500 repo issues (paginated, optionally excluding closed ones), prepares the title plus the first 500 characters of the body, computes embeddings with SentenceTransformer, calculates cosine similarities, ranks and filters results above a configurable threshold, formats a markdown comment listing similar issues, and posts it to the current issue via the GitHub API. |
| Dependencies `.github/scripts/requirements.txt` | New requirements file with floating lower bounds: sentence-transformers>=2.2.2, scikit-learn>=1.3.0, requests>=2.31.0, numpy>=1.24.0, torch>=2.0.0. |
| CI workflow `.github/workflows/duplicate-detection.yml` | New GitHub Actions workflow triggered on issue opened/edited; checks out the repo, sets up Python 3.11, installs deps from the new requirements file, and runs the duplicate-check script with ISSUE_NUMBER, ISSUE_TITLE, ISSUE_BODY, REPO, and GITHUB_TOKEN provided via env. |
| UI styling `eduaid_web/src/pages/Text_Input.jsx` | Minor visual change: adds pb-1 bottom padding to the +/- quantity buttons (styling only). |

Sequence Diagram

```mermaid
sequenceDiagram
    actor GitHub
    participant Workflow as GitHub Actions
    participant Script as check-duplicates.py
    participant GitHubAPI as GitHub API
    participant Model as SentenceTransformer
    GitHub->>Workflow: issue opened/edited event
    Workflow->>Workflow: checkout, setup Python, install deps
    Workflow->>Script: run with ISSUE_NUMBER, TITLE, BODY, REPO, GITHUB_TOKEN
    Script->>GitHubAPI: GET /issues (paginated, exclude closed optional, max 500)
    GitHubAPI-->>Script: issues list
    Script->>Model: prepare texts (title + 500 chars), compute embeddings
    Model-->>Script: embeddings
    Script->>Script: compute cosine similarities, filter & sort by threshold
    Script->>Script: format markdown comment
    Script->>GitHubAPI: POST comment to current issue (if matches)
    GitHubAPI-->>Script: comment response
    Script-->>Workflow: exit with status
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~28 minutes

Poem

🐰 I hopped through titles, sniffed the thread,
Found echoes where new and old were led,
Embeddings whispered "these are near",
I left a note so folks can steer,
Now issue burrows are tidy and fed.

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main feature being added: semantic duplicate issue detection for GitHub issues. It aligns with the primary objective of the PR and is specific enough for teammates to understand the change. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%. |


@vaibhavbaghel changed the title from "Feature: AI-powered duplicate issue detection using sentence embeddings" to "Feature: Add semantic duplicate issue detection for GitHub issues" on Mar 10, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (2)
.github/workflows/duplicate-detection.yml (1)

24-36: Cache the sentence-transformer model between runs.

The workflow caches pip dependencies but not the Hugging Face model that check-duplicates.py loads at line 103 (SentenceTransformer("all-MiniLM-L6-v2")). On cold runner starts, this model will be redownloaded for every issue event, adding unnecessary latency and flakiness to triage. Add a caching step for the Hugging Face model directory using actions/cache and HF_HOME environment variable to persist the model across workflow runs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/duplicate-detection.yml around lines 24 - 36, The workflow currently installs pip deps but doesn't persist the Hugging Face model used in check-duplicates.py (SentenceTransformer("all-MiniLM-L6-v2")), so add an actions/cache step that caches the model directory and set HF_HOME to that cache path in the job environment; specifically, define HF_HOME (e.g., an env like HF_HOME: ${{ runner.temp }}/hf or similar), add an actions/cache entry that keys and restores the HF_HOME directory (or the Hugging Face cache path) before running pip and the python script, and ensure check-duplicates.py continues to load the model from SentenceTransformer using the HF_HOME-backed cache. 
.github/scripts/requirements.txt (1)

1-5: Use pinned dependency versions for the duplicate-detection workflow.

The workflow installs .github/scripts/requirements.txt with floating >= specs on every issue event, making duplicate detection behavior non-reproducible across runs and over time. The root requirements.txt already demonstrates pinned versions are maintained for this project (e.g., torch==2.5.1, scikit-learn==1.5.2). Update .github/scripts/requirements.txt to use pinned versions (==) from the root file or generate a proper lock file.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/scripts/requirements.txt around lines 1 - 5, Update the .github/scripts/requirements.txt to use pinned versions instead of floating >= specs so duplicate-detection runs are reproducible: replace each dependency (sentence-transformers, scikit-learn, requests, numpy, torch) with exact versions (==) matching the project's root requirements.txt or freeze them into a lock file and reference that; ensure the file uses the same pinned versions as root (e.g., the root's torch and scikit-learn pins) or generate a pinned requirements.lock and update the workflow to install that lock file. 
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:

In @.github/scripts/check-duplicates.py:

  • Around lines 141-145: The try/except around calculate_similarity is swallowing errors and converting failures into an empty duplicate list; instead propagate the failure so the job fails or returns an explicit error state—remove the blanket except or re-raise the caught exception (or return a distinct error value) from where calculate_similarity is called, and update main() to treat that propagated exception/value as a fatal error (exit non‑zero or surface an error message) rather than treating it as “no duplicates found.” Ensure you reference calculate_similarity and main() when making the change so callers handle real failures instead of returning [].
  • Around lines 189-205: The post_comment function currently always posts a new comment; change it to upsert a single detector comment by first listing existing comments for the issue (GET to f"{GITHUB_API_URL}/repos/{REPO}/issues/{ISSUE_NUMBER}/comments" with HEADERS), search for a prior bot comment using a unique marker string you add to the comment body (e.g., "<!-- duplicate-detector -->") or by matching the bot's user, and if found use PATCH to update that comment (PATCH to f"{GITHUB_API_URL}/repos/{REPO}/issues/comments/{comment_id}") otherwise POST as before; keep using HEADERS, timeout and response.raise_for_status() and return True/False as currently implemented in post_comment to preserve behavior.
  • Around lines 21-27: The dynamic runtime pip-install fallback around the SentenceTransformer import must be removed so missing dependencies fail fast; delete the try/except block that catches ImportError and runs subprocess.check_call, and replace it with a single top-level import from sentence_transformers import SentenceTransformer so an ImportError surfaces immediately (this touches the import of SentenceTransformer in the script).

In @.github/workflows/duplicate-detection.yml:

  • Around lines 18-22: Update the GitHub Actions step that currently uses actions/setup-python@v4 to a supported major version (actions/setup-python@v5 or `@v6`) to avoid Node16 runner removal; specifically modify the step referencing "actions/setup-python@v4" to use at least "actions/setup-python@v5" (or "@v6"), keep the existing inputs (python-version: '3.11' and cache: 'pip') unchanged, then run the workflow to verify the runner no longer errors due to the deprecated Node runtime.

Nitpick comments:

In @.github/scripts/requirements.txt:

  • Around lines 1-5: Update the .github/scripts/requirements.txt to use pinned versions instead of floating >= specs so duplicate-detection runs are reproducible: replace each dependency (sentence-transformers, scikit-learn, requests, numpy, torch) with exact versions (==) matching the project's root requirements.txt or freeze them into a lock file and reference that; ensure the file uses the same pinned versions as root (e.g., the root's torch and scikit-learn pins) or generate a pinned requirements.lock and update the workflow to install that lock file.

In @.github/workflows/duplicate-detection.yml:

  • Around lines 24-36: The workflow currently installs pip deps but doesn't persist the Hugging Face model used in check-duplicates.py (SentenceTransformer("all-MiniLM-L6-v2")), so add an actions/cache step that caches the model directory and set HF_HOME to that cache path in the job environment; specifically, define HF_HOME (e.g., an env like HF_HOME: ${{ runner.temp }}/hf or similar), add an actions/cache entry that keys and restores the HF_HOME directory (or the Hugging Face cache path) before running pip and the python script, and ensure check-duplicates.py continues to load the model from SentenceTransformer using the HF_HOME-backed cache.
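The upsert-comment suggestion hinges on finding a previous detector comment by a hidden marker. The marker-matching piece can be sketched as a pure function over the JSON list returned by the comments endpoint (the marker string is the one the review proposes; the function name is illustrative):

```python
MARKER = "<!-- duplicate-detector -->"  # hidden tag identifying the bot's own comment

def find_marker_comment(comments, marker=MARKER):
    """Given the parsed JSON list from GET .../issues/{n}/comments, return
    the id of an earlier detector comment to PATCH, or None to POST afresh."""
    for c in comments:
        if marker in (c.get("body") or ""):  # body can be null in the API response
            return c["id"]
    return None
```

With an id in hand the caller would PATCH /repos/{repo}/issues/comments/{id}; with None it would POST to /repos/{repo}/issues/{n}/comments as the script does today.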

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 376fea32-ba12-447c-bb31-636e7976cfd8

📥 Commits

Reviewing files that changed from the base of the PR and between fc3bf1a and ad77624.

📒 Files selected for processing (4)
  • .github/scripts/check-duplicates.py
  • .github/scripts/requirements.txt
  • .github/workflows/duplicate-detection.yml
  • eduaid_web/src/pages/Text_Input.jsx
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:

In @.github/scripts/check-duplicates.py:

  • Around lines 62-80: The current pagination loop in check-duplicates.py swallows RequestException and breaks, returning a partial or empty issues list; change the except requests.exceptions.RequestException block to propagate the failure instead of breaking (e.g., re-raise the caught exception or call sys.exit(1)) so that callers of the function (the workflow) fail fast on API/auth/network errors; update the handler around the requests.get/response.raise_for_status section (the try/except surrounding page_issues, issues.extend and pagination logic) to surface the original exception and include the error context in the raised error/logging.
  • Around lines 240-242: The current main() ignores post_comment()'s return value so comment-post failures don't fail the job; update the block that calls post_comment(comment) to check its boolean result and exit non‑zero on failure (e.g., if not post_comment(comment): log an error and sys.exit(1) or raise SystemExit) so permission/API failures cause the script to fail; reference the post_comment function and the main() flow where the comment is printed and posted.
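The fail-fast pagination shape the review asks for can be sketched with the HTTP call abstracted behind a caller-supplied `get_page` function (an illustrative stand-in for the script's requests.get loop), so that any exception it raises propagates instead of being silently converted into "no duplicates":

```python
def fetch_all_issues(get_page, max_issues=500):
    """Paginate through issues via get_page(page) -> list of issue dicts.
    Exceptions from get_page (e.g. requests.RequestException on auth or
    network errors) propagate, so the workflow fails fast rather than
    continuing with a partial or empty issue list."""
    issues, page = [], 1
    while len(issues) < max_issues:
        batch = get_page(page)  # no try/except: API errors surface to the caller
        if not batch:
            break  # an empty page means we've read everything
        issues.extend(batch)
        page += 1
    return issues[:max_issues]
```

The workflow-level effect is that a 401 or rate-limit error marks the job red, which is far easier to notice than a bot that quietly stops commenting.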

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3eb61906-b169-4909-b104-8d3abe5eb2c7

📥 Commits

Reviewing files that changed from the base of the PR and between ad77624 and a2704d1.

📒 Files selected for processing (1)
  • .github/scripts/check-duplicates.py
