Skip to content

Introduction to PII Eraser

Secure, High-Performance PII Detection & Anonymization

PII Eraser is a containerized REST API designed to detect, redact, mask, or hash Personally Identifiable Information (PII), Payment Card Industry (PCI) data, and other sensitive entities in text and chat logs.

It serves as a self-hosted alternative to hyperscaler cloud services, offering lower latency & costs, and complete data sovereignty — your data never leaves your infrastructure.

Core Capabilities

PII Eraser supports four transformation operators that control how detected entities are handled. Each operator can be set globally via config.yaml or per-request via the API.

Operator Description Input Output
redact Replaces entities with a semantic type tag. Call Stefan Müller Call <NAME>
mask Replaces characters with a configurable symbol. ID: 181/815/08155 ID: #############
hash Replaces entities with a deterministic SHA-256 hash. Stefan Müller a8b92f1c...
redact_constant Replaces all entities with the same static string. Call Stefan Müller Call <REDACTED>

In addition to transformation, PII Eraser offers detection-only endpoints that return entity types, positions, and confidence scores without modifying the text — ideal for analytics, compliance audits, and NER workflows.

Both text and chat endpoints support all four operators, along with customizable entity types, allow & block lists, and confidence thresholds.

Quick Start

You can be up and running in minutes using Docker.

1. Run the container:

docker run -p 8000:8000 --read-only --tmpfs /tmp \
  <path to your container repo>

2. Send a request:

curl -X 'POST' \
    'http://localhost:8000/text/transform' \
    -H 'Content-Type: application/json' \
    -d '{
    "text": ["Hello Max Mustermann"],
    "operator": "redact"
}'
import json
import requests

response = requests.post(
    "http://localhost:8000/text/transform",
    json={
        "text": ["Hello Max Mustermann"],
        "operator": "redact"
    }
)
print(json.dumps(response.json(), indent=4))

3. Response:

{
    "text": [
        "Hello <NAME>"
    ],
    "entities": [
        [
            {
                "entity_type": "NAME",
                "output_start": 6,
                "output_end": 12
            }
        ]
    ],
    "stats": {
        "total_tokens": 7,
        "tps": 4718.14
    }
}

Why PII Eraser?

Global & Europe-First Localization

Unlike many US-centric solutions, PII Eraser is built with native, deep support for Western European languages and data formats alongside comprehensive English-language coverage:

Region Countries
DACH Germany, Austria, Switzerland
France & Benelux France, Belgium, Netherlands, Luxembourg
UK & Ireland United Kingdom, Ireland
Southern Europe Italy, Spain
North America United States, Canada
Oceania Australia

Country-specific identifiers — such as the German Steuer-Identifikationsnummer, the French Numéro de sécurité sociale, or the Australian Medicare Number — are detected out of the box. No language codes or country codes are required; PII Eraser handles multilingual and mixed-language input automatically.

See Supported Languages and Supported Entity Types for the complete coverage matrix.

Industry-Leading Accuracy

PII Eraser uses the latest transformer technology to detect sensitive entities. This delivers higher accuracy than legacy regex or rule-based detectors — particularly on real-world data that doesn't fit rigid formats or contain explicit PII type descriptors.

Consider the difference when processing natural conversation:

"Yeah, you can reach me at four nine five five four seven four three."

Pattern-based systems often miss PII expressed in natural language. PII Eraser's transformer models understand context and semantics, catching entities that regex-based approaches cannot. PII Eraser is also optimized for long inputs and numerical entity types such as PCI and identification numbers, areas where transformer models usually perform poorly.

LLM & GenAI Ready

PII Eraser provides dedicated /chat/* endpoints that accept and return messages in the standard OpenAI Chat Completions format. Sanitize conversations before they leave your infrastructure and forward the output directly to any compatible LLM provider — no custom parsing or reconstruction logic required.

The chat endpoints leverage full conversational context for improved detection accuracy, support selective role processing (e.g., anonymize only user messages), and offer incremental processing for latency-sensitive real-time applications.

Massive Context Window

PII Eraser supports up to 1 million tokens per API request and features special optimizations to maintain accuracy on larger inputs. Process entire documents, call transcripts, or database exports in a single call — no chunking, no splitting, no reassembly logic. The limit can be raised further via the max_tokens configuration parameter.

Drop-In Presidio Replacement

PII Eraser provides full compatibility endpoints for Microsoft Presidio Analyzer, allowing you to upgrade your detection accuracy and performance without rewriting your application logic and to continue using Presidio Analyzer.

Enterprise-Grade Security

PII Eraser is built for regulated environments:

  • Air-Gapped by Design: PII Eraser deploys as a single, stateless container that runs entirely offline. No telemetry, no usage analytics, no external API calls — ever.
  • CPU-Only Inference: No GPU or CUDA dependencies, eliminating the management overhead and persistent patching cycles associated with the large software stack required to use GPUs.
  • Minimal Attack Surface: Built on Chainguard distroless base images with a minimal dependency tree, targeting zero known CVEs at build time.
  • Hardened Runtime: Read-only filesystem, all Linux capabilities dropped, no root access.

For the full security model, see Security. For support channels, response targets, and vulnerability reporting, see Support.

Optimized Compute Performance

Highly optimized for modern x86 architectures with AVX-512 VNNI and AMX instruction sets. A single c8a.xlarge AWS instance (4 vCPUs) delivers over 3,500 tokens/second, scaling to over 5,800 tokens/second on a c8a.2xlarge. See Benchmarks & Hardware Selection for full results, including Fargate serverless benchmarks. PII Eraser also runs natively on serverless platforms like AWS Fargate and Azure Container Instances without specialized instance provisioning.

Documentation Overview

Explore the full documentation to get the most out of PII Eraser.

User Guide

Section Description
Processing Text Detect, redact, mask, and hash PII in text strings.
Processing LLM Chats Anonymize and detect PII in OpenAI-format conversations before sending them to LLM providers. Covers conversational context, selective role processing, and incremental processing.
Supported Entity Types Full reference of general and country-specific entity types, including PCI data, government IDs, and financial identifiers.
Supported Languages Supported languages and countries.
Customization Customize detection via allow lists, block lists and more.
Presidio Compatibility Drop-in migration guide for teams currently using Microsoft Presidio Analyzer.
Performance Tuning Concurrency, batching, connection pooling, and CPU selection for maximum throughput.

Deployment & Installation

Section Description
Getting Started Prerequisites and general deployment guidance.
Running with Docker Local and single-host container setup.
AWS Deployment Production-grade CloudFormation reference implementation with ECS Fargate and EC2 support.
Other Platforms Guidelines for Kubernetes and other orchestrators.
Benchmarks & Hardware Selection Hardware selection guide and processing throughput by AWS instance type for EC2 and Fargate.
Security Container hardening, network isolation, and compliance considerations.

Reference

Section Description
API Reference Interactive OpenAPI documentation for all endpoints.
Config File Reference Complete reference for all config.yaml parameters.
Third-Party Licenses Open-source attribution and license notices.