Worka PII is a Rust library for detecting and anonymizing personally identifiable information (PII). It provides deterministic, capability-aware NLP pipelines designed to run on CPU-only environments with explicit auditability and controlled degradation when language features are unavailable.
This crate was extracted from the Worka internal monorepo to become a standalone, reusable component. The APIs and the RFCs are maintained here to support independent development and external adoption.
- Deterministic PII detection with stable byte offsets
- Regex, validator, dictionary, and NER-backed recognizers
- Capability-aware pipeline (tokenization, lemma, POS, NER)
- Configurable anonymization operators (redact, mask, replace, hash)
- Optional Candle-based NER via the `candle-ner` feature
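To make the "stable byte offsets" point concrete, here is a minimal sketch in plain standard-library Rust. It does not use the crate; `find_span` is a hypothetical helper that stands in for a recognizer. It shows why byte offsets can slice the original `&str` directly, while character indexes diverge as soon as the text contains multi-byte UTF-8.

```rust
// Illustrative only: standard-library Rust, not the crate's detection code.

/// Hypothetical helper: byte range of the first occurrence of `needle` in `text`.
fn find_span(text: &str, needle: &str) -> Option<(usize, usize)> {
    let start = text.find(needle)?; // `str::find` reports a byte index
    Some((start, start + needle.len()))
}

fn main() {
    // "é" occupies two bytes in UTF-8, so byte and character positions differ after it.
    let text = "Zoé: zoe@example.com";
    let (start, end) = find_span(text, "zoe@example.com").expect("needle is present");

    // Byte offsets slice the original string without any conversion.
    assert_eq!(&text[start..end], "zoe@example.com");

    // The character index of the same position is smaller than the byte index.
    let char_start = text[..start].chars().count();
    assert_eq!(start, 6);
    assert_eq!(char_start, 5);
    println!("bytes {start}..{end}, char start {char_start}");
}
```

Consumers that work in character or UTF-16 units (for example, a browser front end) would need to convert these offsets before highlighting.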
```sh
cargo run --example redact
cargo run --example extract
```

```rust
use pii::anonymize::{AnonymizeConfig, Anonymizer};
use pii::nlp::SimpleNlpEngine;
use pii::presets::default_recognizers;
use pii::{Analyzer, PolicyConfig};
use pii::types::Language;
use std::collections::HashMap;

let analyzer = Analyzer::new(
    Box::new(SimpleNlpEngine::default()),
    default_recognizers(),
    Vec::new(),
    PolicyConfig::default(),
);

let text = "Contact Jane at jane@example.com or +1 415-555-1212.";
let result = analyzer.analyze(text, &Language::from("en")).unwrap();

let mut config = AnonymizeConfig::default();
let mut per_entity = HashMap::new();
per_entity.insert("Email".to_string(), pii::anonymize::Operator::Replace { with: "<EMAIL>".into() });
per_entity.insert("Phone".to_string(), pii::anonymize::Operator::Mask { ch: '*', from_end: 4 });
config.per_entity = per_entity;

let redacted = Anonymizer::anonymize(text, &result.entities, &config).unwrap();
assert!(redacted.text.contains("<EMAIL>"));
```

This example keeps the input text intact and uses the detected spans directly.
```rust
use pii::nlp::SimpleNlpEngine;
use pii::presets::default_recognizers;
use pii::{Analyzer, PolicyConfig};
use pii::types::Language;

let analyzer = Analyzer::new(
    Box::new(SimpleNlpEngine::default()),
    default_recognizers(),
    Vec::new(),
    PolicyConfig::default(),
);

let text = "Reach me at jane@example.com from 10.0.0.5.";
let result = analyzer.analyze(text, &Language::from("en")).unwrap();

for detection in &result.entities {
    let span = &text[detection.start..detection.end];
    println!(
        "type={} start={} end={} value={}",
        detection.entity_type.as_str(),
        detection.start,
        detection.end,
        span
    );
}
```

This example applies per-entity operators and emits a simple audit log that records the original value alongside the replacement.
```rust
use pii::anonymize::{AnonymizeConfig, Anonymizer, Operator};
use pii::nlp::SimpleNlpEngine;
use pii::presets::default_recognizers;
use pii::{Analyzer, PolicyConfig};
use pii::types::Language;
use std::collections::HashMap;

let analyzer = Analyzer::new(
    Box::new(SimpleNlpEngine::default()),
    default_recognizers(),
    Vec::new(),
    PolicyConfig::default(),
);

let text = "Email jane@example.com or call +1 415-555-1212.";
let result = analyzer.analyze(text, &Language::from("en")).unwrap();

let mut config = AnonymizeConfig::default();
let mut per_entity = HashMap::new();
per_entity.insert("Email".to_string(), Operator::Replace { with: "<EMAIL>".into() });
per_entity.insert("Phone".to_string(), Operator::Mask { ch: '*', from_end: 4 });
config.per_entity = per_entity;

let anonymized = Anonymizer::anonymize(text, &result.entities, &config).unwrap();

for item in &anonymized.items {
    let original = &text[item.entity.start..item.entity.end];
    println!(
        "type={} value={} replacement={}",
        item.entity.entity_type.as_str(),
        original,
        item.replacement
    );
}
```

The following entity types are supported out of the box via built-in recognizers:
- Email
- Phone
- IpAddress (IPv4)
- Ipv6
- CreditCard
- Iban
- Ssn
- Itin
- TaxId
- Passport
- DriverLicense
- BankAccount
- RoutingNumber
- CryptoAddress
- MacAddress
- Uuid
- Vin
- Imei
- Url
- Domain
- Hostname
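
As an illustration of what a validator adds on top of a regex match for a type such as CreditCard, the sketch below implements the standard Luhn checksum in plain Rust. This is an assumption about the kind of check a validator-backed recognizer might run, not a description of this crate's implementation; `luhn_valid` is a hypothetical name.

```rust
/// Hypothetical validator: returns true if the digits in `candidate` pass the Luhn check.
/// Spaces and hyphens are skipped; any other non-digit makes the candidate invalid.
fn luhn_valid(candidate: &str) -> bool {
    let mut digits: Vec<u32> = Vec::new();
    for c in candidate.chars() {
        match c {
            ' ' | '-' => continue,
            _ => match c.to_digit(10) {
                Some(d) => digits.push(d),
                None => return false,
            },
        }
    }
    if digits.len() < 2 {
        return false;
    }
    // Walk from the rightmost digit, doubling every second one.
    let sum: u32 = digits
        .iter()
        .rev()
        .enumerate()
        .map(|(i, &d)| {
            if i % 2 == 1 {
                let doubled = d * 2;
                if doubled > 9 { doubled - 9 } else { doubled }
            } else {
                d
            }
        })
        .sum();
    sum % 10 == 0
}

fn main() {
    // A widely used test card number that satisfies the checksum.
    assert!(luhn_valid("4111 1111 1111 1111"));
    // Changing the last digit breaks the checksum.
    assert!(!luhn_valid("4111 1111 1111 1112"));
}
```

A check like this lets a pipeline discard digit runs that merely look like card numbers, which reduces false positives compared with pattern matching alone.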
The following types are supported when a NER engine is enabled:
- Person
- Location
- Organization
You can add custom entities and recognizers to the pipeline.
```rust
use pii::nlp::SimpleNlpEngine;
use pii::presets::default_recognizers;
use pii::recognizers::regex::RegexRecognizer;
use pii::types::EntityType;
use pii::{Analyzer, PolicyConfig};

// Start from the built-in recognizers and append a custom one.
let mut recognizers = default_recognizers();
let employee_id = RegexRecognizer::new(
    "regex_employee_id",
    EntityType::Custom("EmployeeId".to_string()),
    r"\bEMP-\d{4}\b", // matches IDs such as EMP-1234
    0.7,
    "employee_id",
)
.unwrap();
recognizers.push(Box::new(employee_id));

let analyzer = Analyzer::new(
    Box::new(SimpleNlpEngine::default()),
    recognizers,
    Vec::new(),
    PolicyConfig::default(),
);
```

The pipeline is fully customizable: you can supply your own NLP engine, recognizers, and context enhancers.
- Implement `NlpEngine` if you want custom tokenization, lemma/POS, or NER.
- Add domain-specific recognizers and context enhancers for tuned detection.
- Swap the default recognizers with your own curated set for strict control.
The default SimpleNlpEngine is language-agnostic and provides tokenization plus sentence splitting for any language tag. For EN/DE/ES, you can provide richer language profiles and context terms to improve recall.
For unsupported languages:
- Regex and validator recognizers still work (language-neutral).
- Lemma/POS/NER capabilities will be absent unless your `NlpEngine` provides them.
- Context enhancement falls back to surface terms when lemma is unavailable.
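
The surface-term fallback can be pictured with the sketch below. `Token` and `has_context` are hypothetical and defined here only for illustration; they are not the crate's actual types or functions. The idea is that context matching compares lemmas when the engine supplies them and lowercased surface forms otherwise.

```rust
// Hypothetical types for illustration; not the crate's actual definitions.
struct Token<'a> {
    surface: &'a str,
    lemma: Option<&'a str>, // None when the engine has no lemmatizer
}

/// Returns true if any token matches a context term, comparing lemmas when
/// present and lowercased surface forms otherwise.
fn has_context(tokens: &[Token<'_>], context_terms: &[&str]) -> bool {
    tokens.iter().any(|t| {
        let key = match t.lemma {
            Some(lemma) => lemma.to_lowercase(),
            None => t.surface.to_lowercase(),
        };
        context_terms.iter().any(|term| *term == key.as_str())
    })
}

fn main() {
    let terms = ["call", "phone"];

    // With a lemma, the inflected form "called" matches the term "call".
    let with_lemma = [Token { surface: "called", lemma: Some("call") }];
    assert!(has_context(&with_lemma, &terms));

    // Without a lemma, only the surface form is compared, so "called" is missed.
    let no_lemma = [Token { surface: "called", lemma: None }];
    assert!(!has_context(&no_lemma, &terms));

    // Exact surface matches still work without lemmas.
    let surface_hit = [Token { surface: "Phone", lemma: None }];
    assert!(has_context(&surface_hit, &terms));
}
```

Under this assumption, listing common inflected forms among the context terms is one way to recover some recall for languages without lemma support.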
To add a new language with higher fidelity:
- Implement or integrate an `NlpEngine` that can emit token offsets, lemmas, POS tags, and/or NER.
- Provide a `LanguageProfile` with context terms for that language.
- Attach those to the analyzer via your pipeline configuration.
The full specification is in docs/rfc-1200-pii.md and defines the data model, pipeline behavior, capability reporting, and conformance requirements.
```sh
cargo test
cargo bench
```

Candle NER tests are ignored by default and require `--features candle-ner` plus a model:
```sh
PII_CANDLE_MODEL_DIR=/path/to/model \
  cargo test --features candle-ner --test candle_ner -- --ignored
```

You can also set `PII_CANDLE_MODEL_ID` to download a model via hf-hub.
Licensed under either of:
- Apache License, Version 2.0
- MIT license