Worka PII

Worka PII is a Rust library for detecting and anonymizing personally identifiable information (PII). It provides deterministic, capability-aware NLP pipelines designed to run in CPU-only environments, with explicit auditability and controlled degradation when language features are unavailable.

This crate was extracted from the Worka internal monorepo to become a standalone, reusable component. The APIs and the RFCs are maintained here to support independent development and external adoption.

Features

  • Deterministic PII detection with stable byte offsets
  • Regex, validator, dictionary, and NER-backed recognizers
  • Capability-aware pipeline (tokenization, lemma, POS, NER)
  • Configurable anonymization operators (redact, mask, replace, hash)
  • Optional Candle-based NER via the candle-ner feature

Examples

cargo run --example redact
cargo run --example extract

Redaction Example

use pii::anonymize::{AnonymizeConfig, Anonymizer};
use pii::nlp::SimpleNlpEngine;
use pii::presets::default_recognizers;
use pii::{Analyzer, PolicyConfig};
use pii::types::Language;
use std::collections::HashMap;

let analyzer = Analyzer::new(
    Box::new(SimpleNlpEngine::default()),
    default_recognizers(),
    Vec::new(),
    PolicyConfig::default(),
);

let text = "Contact Jane at jane@example.com or +1 415-555-1212.";
let result = analyzer.analyze(text, &Language::from("en")).unwrap();

let mut config = AnonymizeConfig::default();
let mut per_entity = HashMap::new();
per_entity.insert(
    "Email".to_string(),
    pii::anonymize::Operator::Replace { with: "<EMAIL>".into() },
);
per_entity.insert(
    "Phone".to_string(),
    pii::anonymize::Operator::Mask { ch: '*', from_end: 4 },
);
config.per_entity = per_entity;

let redacted = Anonymizer::anonymize(text, &result.entities, &config).unwrap();
assert!(redacted.text.contains("<EMAIL>"));

Span Extraction Example

This example keeps the input text intact and uses the detected spans directly.

use pii::nlp::SimpleNlpEngine;
use pii::presets::default_recognizers;
use pii::{Analyzer, PolicyConfig};
use pii::types::Language;

let analyzer = Analyzer::new(
    Box::new(SimpleNlpEngine::default()),
    default_recognizers(),
    Vec::new(),
    PolicyConfig::default(),
);

let text = "Reach me at jane@example.com from 10.0.0.5.";
let result = analyzer.analyze(text, &Language::from("en")).unwrap();

for detection in &result.entities {
    let span = &text[detection.start..detection.end];
    println!(
        "type={} start={} end={} value={}",
        detection.entity_type.as_str(),
        detection.start,
        detection.end,
        span
    );
}
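Because detections are reported as byte offsets, slicing the original &str works directly, but the numbers differ from character positions as soon as the text contains multi-byte UTF-8. The following is a minimal, std-only sketch of that distinction. The Span type and both helper functions are hypothetical names used for illustration and are not part of the crate's API.

```rust
/// Hypothetical span type for illustration; not the crate's own type.
#[derive(Debug, PartialEq)]
struct Span {
    start: usize, // byte offset, inclusive
    end: usize,   // byte offset, exclusive
}

/// Locate the first occurrence of `needle` and report it as a byte span.
fn find_span(text: &str, needle: &str) -> Option<Span> {
    text.find(needle).map(|start| Span {
        start,
        end: start + needle.len(),
    })
}

/// Convert a byte span into character (Unicode scalar value) offsets,
/// e.g. for a UI that counts characters rather than bytes.
/// Panics if the span does not fall on UTF-8 character boundaries.
fn to_char_offsets(text: &str, span: &Span) -> (usize, usize) {
    let start = text[..span.start].chars().count();
    let end = start + text[span.start..span.end].chars().count();
    (start, end)
}
```

If a consumer of the results needs character positions, converting at the boundary as above keeps the stored offsets in bytes, which is what Rust string slicing expects.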

Custom Operators + Audit Log Example

This example applies per-entity operators and emits a simple audit log that records the original value alongside the replacement.

use pii::anonymize::{AnonymizeConfig, Anonymizer, Operator};
use pii::nlp::SimpleNlpEngine;
use pii::presets::default_recognizers;
use pii::{Analyzer, PolicyConfig};
use pii::types::Language;
use std::collections::HashMap;

let analyzer = Analyzer::new(
    Box::new(SimpleNlpEngine::default()),
    default_recognizers(),
    Vec::new(),
    PolicyConfig::default(),
);

let text = "Email jane@example.com or call +1 415-555-1212.";
let result = analyzer.analyze(text, &Language::from("en")).unwrap();

let mut config = AnonymizeConfig::default();
let mut per_entity = HashMap::new();
per_entity.insert("Email".to_string(), Operator::Replace { with: "<EMAIL>".into() });
per_entity.insert("Phone".to_string(), Operator::Mask { ch: '*', from_end: 4 });
config.per_entity = per_entity;

let anonymized = Anonymizer::anonymize(text, &result.entities, &config).unwrap();

for item in &anonymized.items {
    let original = &text[item.entity.start..item.entity.end];
    println!(
        "type={} value={} replacement={}",
        item.entity.entity_type.as_str(),
        original,
        item.replacement
    );
}
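If you work with spans directly and want to build the output text yourself, one common approach is a single forward pass over the spans sorted by start offset. The sketch below is std-only and illustrative: the Replacement struct and apply_replacements function are hypothetical names, and this is not a description of how Anonymizer is implemented internally.

```rust
/// Hypothetical replacement record for illustration; not the crate's type.
struct Replacement {
    start: usize, // byte offset, inclusive
    end: usize,   // byte offset, exclusive
    with: String, // text to substitute for the span
}

/// Rebuild `text` with each span replaced, in one left-to-right pass.
/// Spans that overlap an earlier span, are inverted, or do not fall on
/// UTF-8 character boundaries are skipped rather than causing a panic.
fn apply_replacements(text: &str, mut items: Vec<Replacement>) -> String {
    items.sort_by_key(|r| r.start);
    let mut out = String::with_capacity(text.len());
    let mut cursor = 0;
    for r in &items {
        let valid = r.start >= cursor
            && r.start <= r.end
            && text.is_char_boundary(r.start)
            && text.is_char_boundary(r.end);
        if !valid {
            continue;
        }
        out.push_str(&text[cursor..r.start]); // untouched text before the span
        out.push_str(&r.with);                // the replacement itself
        cursor = r.end;
    }
    out.push_str(&text[cursor..]); // remainder after the last span
    out
}
```

Copying from the original text into a new buffer means offsets never have to be adjusted after an earlier replacement changes the length of the string.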

Supported Entity Types (Built-in)

The following entity types are supported out of the box via built-in recognizers:

  • Email
  • Phone
  • IpAddress (IPv4)
  • Ipv6
  • CreditCard
  • Iban
  • Ssn
  • Itin
  • TaxId
  • Passport
  • DriverLicense
  • BankAccount
  • RoutingNumber
  • CryptoAddress
  • MacAddress
  • Uuid
  • Vin
  • Imei
  • Url
  • Domain
  • Hostname

The following types are supported when a NER engine is enabled:

  • Person
  • Location
  • Organization

Custom Entities and Recognizers

You can add custom entities and recognizers to the pipeline.

use pii::nlp::SimpleNlpEngine;
use pii::presets::default_recognizers;
use pii::recognizers::regex::RegexRecognizer;
use pii::types::EntityType;
use pii::{Analyzer, PolicyConfig};

// Start from the built-in recognizers and append a custom one.
let mut recognizers = default_recognizers();

let employee_id = RegexRecognizer::new(
    "regex_employee_id",
    EntityType::Custom("EmployeeId".to_string()),
    r"\bEMP-\d{4}\b",
    0.7,
    "employee_id",
).unwrap();
recognizers.push(Box::new(employee_id));

let analyzer = Analyzer::new(
    Box::new(SimpleNlpEngine::default()),
    recognizers,
    Vec::new(),
    PolicyConfig::default(),
);

Custom Pipeline

The pipeline is fully customizable: you can supply your own NLP engine, recognizers, and context enhancers.

  • Implement NlpEngine if you want custom tokenization, lemma/POS, or NER.
  • Add domain-specific recognizers and context enhancers for tuned detection.
  • Swap the default recognizers with your own curated set for strict control.
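To make the idea of a capability-aware pipeline concrete, here is a small std-only sketch in which each recognizer declares the NLP capabilities it needs and the pipeline runs only those the engine can satisfy. All names here (Capabilities, RecognizerSpec, runnable, and the recognizer names in the usage note) are hypothetical; the crate's actual capability reporting is defined in the specification and may differ.

```rust
/// Hypothetical capability flags; the crate's own model may differ.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct Capabilities {
    tokens: bool,
    lemma: bool,
    pos: bool,
    ner: bool,
}

impl Capabilities {
    /// True if every capability in `required` is also provided by `self`.
    fn satisfies(self, required: Capabilities) -> bool {
        (self.tokens || !required.tokens)
            && (self.lemma || !required.lemma)
            && (self.pos || !required.pos)
            && (self.ner || !required.ner)
    }
}

/// Hypothetical description of a recognizer and what it needs to run.
struct RecognizerSpec {
    name: &'static str,
    requires: Capabilities,
}

/// Names of the recognizers that can run on an engine with the given
/// capabilities, in their original order. The rest are skipped.
fn runnable(engine: Capabilities, specs: &[RecognizerSpec]) -> Vec<&'static str> {
    specs
        .iter()
        .filter(|s| engine.satisfies(s.requires))
        .map(|s| s.name)
        .collect()
}
```

With a tokenizer-only engine, a spec named "regex_email" that requires nothing would be kept, while a spec named "ner_person" that requires ner would be skipped, which is one way to get predictable degradation instead of a runtime failure.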

Language Support and Degradation

The default SimpleNlpEngine is language-agnostic and provides tokenization plus sentence splitting for any language tag. For EN/DE/ES, you can provide richer language profiles and context terms to improve recall.

For unsupported languages:

  • Regex and validator recognizers still work (language-neutral).
  • Lemma/POS/NER capabilities will be absent unless your NlpEngine provides them.
  • Context enhancement falls back to surface terms when lemma is unavailable.
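The surface-term fallback in the last point can be sketched as follows. This is a std-only illustration under assumed names (Token and has_context are hypothetical) and does not reproduce the crate's context enhancer: when a token carries a lemma it is compared against the context terms, otherwise the lowercased surface form is used.

```rust
/// Hypothetical token for illustration: the lemma is optional because
/// not every engine provides lemmatization.
struct Token<'a> {
    surface: &'a str,
    lemma: Option<&'a str>,
}

/// True if any token matches one of the lowercase `context_terms`.
/// Prefers the lemma when present and falls back to the surface form.
fn has_context(tokens: &[Token<'_>], context_terms: &[&str]) -> bool {
    tokens.iter().any(|t| {
        let key = t.lemma.unwrap_or(t.surface).to_lowercase();
        context_terms.iter().any(|term| *term == key.as_str())
    })
}
```

Without lemmas, an inflected form such as "calling" no longer matches the context term "call", so a context list for a language without lemmatization generally needs to include the inflected surface forms explicitly.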

Adding Languages

To add a new language with higher fidelity:

  1. Implement or integrate an NlpEngine that can emit token offsets, lemmas, POS tags, and/or NER.
  2. Provide a LanguageProfile with context terms for that language.
  3. Attach those to the analyzer via your pipeline configuration.

Specification

The full specification is in docs/rfc-1200-pii.md and defines the data model, pipeline behavior, capability reporting, and conformance requirements.

Tests

cargo test

Benchmarks

cargo bench

Candle NER tests are ignored by default and require --features candle-ner plus a model:

PII_CANDLE_MODEL_DIR=/path/to/model \
  cargo test --features candle-ner --test candle_ner -- --ignored

You can also set PII_CANDLE_MODEL_ID to download a model via hf-hub.

License

Licensed under either of:

  • Apache License, Version 2.0
  • MIT license
