Hallucination detection for Elixir, powered by Vectara's HHEM (Hallucination Evaluation Model).
Given a premise and a hypothesis, Hallmark scores how consistent the hypothesis is with the premise on a scale from 0 (hallucinated) to 1 (consistent). Useful for checking whether LLM-generated text is actually grounded in the source material.
HHEM is a fine-tuned FLAN-T5-base (184M params) that runs locally via Bumblebee. No API keys, no external calls after the initial model download.
def deps do [ {:hallmark, "~> 0.1.0"}, {:exla, "~> 0.9"} ] endYou need a compiler like EXLA or EMLX. Without one, the pure Elixir evaluator runs each tensor op individually and a single score call takes 10+ minutes. With EXLA, it's ~170ms.
EXLA works on all platforms (CPU on Mac, CUDA on Linux with a GPU). If you're on Apple Silicon and want Metal GPU acceleration, use EMLX instead:
{:emlx, "~> 0.2"}# Load the model (downloads ~440MB on first run, cached after that) {:ok, model} = Hallmark.load(compiler: EXLA) # Score a single pair {:ok, score} = Hallmark.score(model, "I am in California", "I am in United States.") # => {:ok, 0.65} # Hallucination detected {:ok, score} = Hallmark.score(model, "The capital of France is Berlin.", "The capital of France is Paris.") # => {:ok, 0.01} # Get a label instead of a score {:ok, :consistent} = Hallmark.evaluate(model, "I am in California", "I am in United States.") {:ok, :hallucinated} = Hallmark.evaluate(model, "The capital of France is Berlin.", "The capital of France is Paris.") # Custom threshold (default is 0.5) {:ok, label} = Hallmark.evaluate(model, premise, hypothesis, threshold: 0.8) # Batch scoring {:ok, scores} = Hallmark.score_batch(model, [ {"I am in California", "I am in United States."}, {"I am in United States", "I am in California."}, {"Mark Wahlberg was a fan of Manny.", "Manny was a fan of Mark Wahlberg."} ])HHEM checks logical entailment, not factual accuracy. It answers "does the hypothesis follow from the premise?" rather than "is the hypothesis true?" So a factually correct statement can still be flagged as hallucinated if it doesn't follow from the given premise.
Under the hood, Hallmark loads the T5 encoder from HHEM's fine-tuned weights, runs the input through the encoder, and applies a 2-class classifier head on the pad token embedding. The softmax probability of the "consistent" class is the score.
MIT