A curated list of awesome AI guardrails.
If you find this list helpful, give it a ⭐ on GitHub, share it, and contribute by submitting a pull request or issue!
| Name | Description |
|---|---|
security-and-privacy | Security and privacy guardrails ensure content remains safe, ethical, and devoid of offensive material |
response-and-relevance | Ensures model responses are accurate, focused, and aligned with user intent |
language-quality | Ensures high standards of readability, coherence, and clarity |
content-validation | Ensures factual correctness and logical coherence of content |
logic-validation | Ensures logical and functional correctness of generated code and data |
| Sub Category | Description |
|---|---|
inappropriate-content | Detects and filters inappropriate or explicit content |
offensive-language | Identifies and filters profane or offensive language |
prompt-injection | Prevents manipulation attempts through malicious prompts |
sensitive-content | Flags culturally, politically, or socially sensitive topics |
deepfake-detection | Detects and filters deepfake content |
pii | Identifies and filters personally identifiable information |
Models in security-and-privacy
| Sub Category | Description |
|---|---|
relevance | Validates semantic relevance between input and output |
prompt-address | Confirms response correctly addresses user's prompt |
url-validation | Verifies validity of generated URLs |
factuality | Cross-references content with external knowledge sources |
refusal | Refuses to answer questions that are not appropriate or relevant |
Models in response-and-relevance
| Name | Size | Task |
|---|---|---|
| protectai/distilroberta-base-rejection-v1 | 0.0821B | text-classification |
| s-nlp/E5-EverGreen-Multilingual-Small | 0.118B | text-classification |
| lytang/MiniCheck-RoBERTa-Large | 0.4B | text-classification |
| lytang/MiniCheck-Flan-T5-Large | 0.8B | text-classification |
| ibm-granite/granite-guardian-3.1-2b | 2B | text-classification |
| bespokelabs/Bespoke-MiniCheck-7B | 7B | text-classification |
| nvidia/prompt-task-and-complexity-classifier | 0.184B | text-classification |
| PatronusAI/glider | 3.8B | text-classification |
| flowaicom/Flow-Judge-v0.1 | 3.8B | text-classification |
| Sub Category | Description |
|---|---|
quality | Assesses structure, relevance, and coherence of output |
translation-accuracy | Ensures contextually correct and linguistically accurate translations |
duplicate-elimination | Detects and removes redundant content |
readability | Evaluates text complexity for target audience |
Models in language-quality
| Name | Size | Task |
|---|---|---|
| HuggingFaceFW/fineweb-edu-classifier | 0.109B | text-classification |
| nvidia/quality-classifier-deberta | 0.184B | text-classification |
| facebook/nllb-200-distilled-600M | 0.6B | text-to-text-generation |
| nvidia/prompt-task-and-complexity-classifier | 0.184B | text-classification |
| PatronusAI/glider | 3.8B | text-classification |
| flowaicom/Flow-Judge-v0.1 | 3.8B | text-classification |
| Sub Category | Description |
|---|---|
competitor-blocking | Screens for mentions of rival brands or companies |
price-validation | Validates price-related data against verified sources |
source-verification | Verifies accuracy of external quotes and references |
gibberish-filter | Identifies and filters nonsensical or incoherent outputs |
Models in content-validation
| Name | Size | Task |
|---|---|---|
| s-nlp/mdistilbert-base-formality-ranker | 0.142B | text-classification |
| d4data/bias-detection-model | 0.3B | text-classification |
| NousResearch/Minos-v1 | 0.4B | text-classification |
| osmosis-ai/Osmosis-Structure-0.6B | 0.6B | token-classification |
| gliner-community/gliner_small-v2.5 | 0.7B | token-classification |
| Sub Category | Description |
|---|---|
sql-validation | Validates SQL queries for syntax and security |
api-validation | Ensures API calls conform to OpenAPI standards |
json-validation | Validates JSON structure and schema |
logical-consistency | Checks for contradictory or illogical statements |
| Name | Size | Category | Sub Category |
|---|---|---|---|
| s-nlp/mdistilbert-base-formality-ranker | 0.142B | content-validation | quality |
| d4data/bias-detection-model | 0.3B | content-validation | bias |
| NousResearch/Minos-v1 | 0.4B | content-validation | refusal |
| HuggingFaceFW/fineweb-edu-classifier | 0.109B | language-quality | quality |
| nvidia/quality-classifier-deberta | 0.184B | language-quality | quality |
| protectai/distilroberta-base-rejection-v1 | 0.0821B | response-and-relevance | rejection |
| s-nlp/E5-EverGreen-Multilingual-Small | 0.118B | response-and-relevance | factuality |
| lytang/MiniCheck-RoBERTa-Large | 0.4B | response-and-relevance | factuality, logical-consistency, relevance |
| lytang/MiniCheck-Flan-T5-Large | 0.8B | response-and-relevance | factuality, logical-consistency, relevance |
| ibm-granite/granite-guardian-3.1-2b | 2B | response-and-relevance | factuality, logical-consistency, relevance |
| bespokelabs/Bespoke-MiniCheck-7B | 7B | response-and-relevance | factuality, logical-consistency, relevance |
| nvidia/prompt-task-and-complexity-classifier | 0.184B | response-and-relevance, language-quality | relevance, quality |
| PatronusAI/glider | 3.8B | response-and-relevance, language-quality | factuality, logical-consistency, relevance, quality |
| flowaicom/Flow-Judge-v0.1 | 3.8B | response-and-relevance, language-quality | factuality, logical-consistency, relevance, quality |
| meta-llama/Llama-Prompt-Guard-2-22M | 0.022B | security-and-privacy | prompt-injection, jailbreaks |
| eliasalbouzidi/distilbert-nsfw-text-classifier | 0.068B | security-and-privacy | inappropriate-content |
| meta-llama/Llama-Prompt-Guard-2-86M | 0.086B | security-and-privacy | prompt-injection, jailbreaks |
| ibm-granite/granite-guardian-hap-125m | 0.125B | security-and-privacy | toxicity, hallucination |
| ibm-granite/granite-guardian-hap-125m | 0.125B | security-and-privacy | toxicity, hallucination |
| protectai/deberta-v3-small-prompt-injection-v2 | 0.142B | security-and-privacy | prompt-injection |
| protectai/deberta-v3-base-prompt-injection-v2 | 0.182B | security-and-privacy | prompt-injection |
| TostAI/nsfw-text-detection-large | 0.355B | security-and-privacy | inappropriate-content |
| MoritzLaurer/ModernBERT-large-zeroshot-v2.0 | 0.4B | security-and-privacy | inappropriate-content, offensive-language, prompt-injection, sensitive-content |
| madhurjindal/Jailbreak-Detector-2-XL | 0.5B | security-and-privacy | jailbreaks |
| google/shieldgemma-2b | 2B | security-and-privacy | inappropriate-content, offensive-language, prompt-injection, sensitive-content |
| Name | Size | Category | Sub Category |
|---|---|---|---|
| osmosis-ai/Osmosis-Structure-0.6B | 0.6B | content-validation, security-and-privacy | pii, competitor-blocking |
| gliner-community/gliner_small-v2.5 | 0.7B | content-validation, security-and-privacy | pii, competitor-blocking |
| ai4privacy/llama-ai4privacy-multilingual-categorical-anonymiser-openpii | 0.15B | security-and-privacy | pii |
| Name | Size | Category | Sub Category |
|---|---|---|---|
| facebook/nllb-200-distilled-600M | 0.6B | language-quality | translation-accuracy |
| meta-llama/Llama-3.2-1B-Instruct | 1B | security-and-privacy | inappropriate-content, offensive-language, prompt-injection, sensitive-content |
| Name | Size | Category | Sub Category |
|---|---|---|---|
| Marqo/nsfw-image-detection-384 | 0.006B | security-and-privacy | inappropriate-content |
| Freepik/nsfw_image_detector | 0.086B | security-and-privacy | inappropriate-content |
| Organika/sdxl-detector | 0.086B | security-and-privacy | deepfake-detection |
| prithivMLmods/Deep-Fake-Detector-v2-Model | 0.086B | security-and-privacy | deepfake-detection |
| TostAI/nsfw-image-detection-large | 0.0871B | security-and-privacy | inappropriate-content |
| Ateeqq/nsfw-image-detection | 0.092B | security-and-privacy | inappropriate-content |
| Falconsai/nsfw_image_detection | 0.1B | security-and-privacy | inappropriate-content |
| OpenSafetyLab/ImageGuard | na | security-and-privacy | inappropriate-content |
| Name | Size | Category | Sub Category |
|---|---|---|---|
| meta-llama/Llama-Guard-4-12B | 12B | security-and-privacy | inappropriate-content, offensive-language, prompt-injection, sensitive-content |
| Name | Category | Description |
|---|---|---|
| guardrails | all | Adding guardrails to large language models. |
| NeMo-Guardrails | all | NeMo Guardrails is an open-source toolkit for easily adding programmable guardrails to LLM-based conversational systems. |
| uqlm | hallucination | UQLM: Uncertainty Quantification for Language Models, is a Python package for UQ-based LLM hallucination detection. |
| llm-guard | all | The Security Toolkit for LLM Interactions. |
| Name | Category | Description |
|---|---|---|
| Lakera | all | Lakera is a company that provides a range of AI services. |
| Guardrails AI Pro | all | Guardrails AI Pro is a commercial version of guardrails that provides additional features and support. |
| Name | Category | Description |
|---|---|---|
| lytang/LLM-AggreFact | factuality | Bias in Bios is a dataset of 100000 bios of people with different biases. |
| Entreprise PII Masking | pii | Entreprise PII Masking are datasets for enterprise PII masking focused on location, work, health, digital and financial information. |
| prithivMLmods/OpenDeepfake-Preview | deepfake-detection | OpenDeepfake-Preview is a dataset of 20K deepfake images. |
| eliasalbouzidi/NSFW-Safe-Dataset | nsfw | NSFW-Safe-Dataset is a dataset for NSFW content detection. |
| lmsys/toxic-chat | toxic-chat | Toxic-Chat is a dataset for toxic chat detection. |
| Name | Category | Description |
|---|---|---|
| Uncertainty Quantification for Language Models: A Suite of Black-Box, White-Box, LLM Judge, and Ensemble Scorers | hallucination | Uncertainty Quantification for Language Models: A Suite of Black-Box, White-Box, LLM Judge, and Ensemble Scorers |
| RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models | factuality | RAGTruth is a dataset of 100000 bios of people with different biases. |
| MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents | factuality | how to build small fact-checking models that have GPT-4-level performance but for 400x lower cost. |
| A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions | hallucination | A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions |
| Granite Guardian: A Guardrail Framework for Large Language Models | all | Granite Guardian is a guardrail framework for large language models. |
| "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models | prompt-injection | "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models |
| "Tiny-Toxic-Detector: A compact transformer-based model for toxic content detection | toxic-chat | "Tiny-Toxic-Detector: A compact transformer-based model for toxic content detection |
| T2ISafety: Benchmark for Assessing Fairness, Toxicity, and Privacy in Image Generation | toxic-chat | T2ISafety is a benchmark for assessing fairness, toxicity, and privacy in image generation. |