Customization

PII Eraser features powerful ML models that work out-of-the-box for a wide range of use cases. For more control, you can customize detection behavior using allow & block lists, specify which entity types to detect, and adjust detection confidence thresholds.

Matching against allow and block lists is case-insensitive.

Customization Methods

PII Eraser supports two complementary methods for configuring detection and transformation behavior:

REST API

Some parameters, such as entity_types and operator, can be set directly in the JSON body of each API request. This is ideal for experimentation and for applications that need different settings for different inputs.

{
  "text": ["Call me at +49 89 123456"],
  "entity_types": ["PHONE", "EMAIL"],
  "operator": "mask"
}
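As a rough illustration, the request body above could be assembled from Python as follows. This is a minimal sketch using only the standard library: the helper name build_request and the URL path are hypothetical placeholders, and only the port (8000) and the field names come from this page.

```python
import json
import urllib.request


def build_request(url, text, entity_types=None, operator=None):
    """Build a POST request whose JSON body carries per-request settings."""
    body = {"text": text}
    # Only include optional parameters that were set, so anything omitted
    # can fall back to the defaults from config.yaml.
    if entity_types is not None:
        body["entity_types"] = entity_types
    if operator is not None:
        body["operator"] = operator
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical URL: replace the path with the real endpoint of your deployment.
request = build_request(
    "http://localhost:8000/your-endpoint",
    ["Call me at +49 89 123456"],
    entity_types=["PHONE", "EMAIL"],
    operator="mask",
)
print(request.data.decode("utf-8"))
```

Against a running container, the request could then be sent with urllib.request.urlopen(request). The sketch stops short of that so it runs without a server.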

Configuration File

The config.yaml file sets global defaults that apply to every request. It also supports parameters that are not available via the REST API, such as allow_list, block_list, and enable_presidio_aliases. See the Config File Reference for the complete list of parameters and their descriptions. The configuration file can be provided in either of the two ways described below.

Volume Mount

Mount the file directly into the container at /app/config.yaml:

docker run -p 8000:8000 \
  -v <path to config.yaml>:/app/config.yaml:ro \
  <container-image>

The :ro flag mounts the file as read-only, which is recommended for security.

Environment Variable

Base64-encode the file contents and pass them via the CONFIG_B64 environment variable. This approach can be preferable in deployments where environment variables are easier to manage than config files. The AWS CloudFormation reference implementation also uses this method.

Base64 encoding example:

# Encode the config.yaml file (removing newlines to ensure a continuous string)
CONFIG_B64=$(cat config.yaml | base64 | tr -d '\n')

# Verify (decode and inspect)
echo "$CONFIG_B64" | base64 -d
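If the value is produced by a deployment script rather than a shell, the same encoding can be done in Python. This is a minimal sketch: the helper names encode_config and decode_config and the sample YAML are illustrative only, and it assumes the config file is UTF-8 text.

```python
import base64


def encode_config(yaml_text: str) -> str:
    """Return the config as a single-line Base64 string."""
    # b64encode never inserts line breaks, so no newline stripping is needed.
    return base64.b64encode(yaml_text.encode("utf-8")).decode("ascii")


def decode_config(config_b64: str) -> str:
    """Decode the value back to YAML text, e.g. to verify it."""
    return base64.b64decode(config_b64).decode("utf-8")


# Illustrative config contents; in practice read them from config.yaml.
sample = 'score_threshold: 0.9\nallow_list:\n  - "technova"\n'
encoded = encode_config(sample)
print(encoded)
print(decode_config(encoded))
```

The encoded string is what would be passed as CONFIG_B64. Decoding it again is a quick check that nothing was lost along the way.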

Run with Docker using the environment variable:

CONFIG_B64=$(cat config.yaml | base64 | tr -d '\n')
docker run -p 8000:8000 \
  -e CONFIG_B64="$CONFIG_B64" \
  <container-image>

Precedence

When a parameter is set in both config.yaml and the REST API request, the REST API value takes precedence. This lets you set sensible defaults in the config file and override them on a per-request basis when needed.

This suggests the following workflow:

  1. Experiment with REST API parameters to find the right settings for your use case.
  2. Consolidate your final settings into config.yaml so that your API requests stay clean and minimal.
  3. Override only when specific requests need different behavior (e.g., a different operator for a specific pipeline).
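The precedence rule can be pictured as a dictionary merge in which request values overwrite config defaults. The sketch below only models the documented behavior; it is not PII Eraser's actual implementation, and the function name effective_settings is hypothetical.

```python
def effective_settings(config_defaults, request_params):
    """Merge settings so that per-request values win over config.yaml."""
    merged = dict(config_defaults)
    # Assumption: a parameter that is absent (or None) in the request
    # leaves the config.yaml default untouched.
    for key, value in request_params.items():
        if value is not None:
            merged[key] = value
    return merged


# Defaults as they might appear in config.yaml.
defaults = {"entity_types": ["NAME", "EMAIL"], "operator": "mask"}

# This request overrides entity_types only.
settings = effective_settings(defaults, {"entity_types": ["PHONE"]})
print(settings)
```

Here entity_types comes from the request while operator keeps its config.yaml value.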

The GitHub repository contains ready-to-use example configurations and a guided walkthrough template that describes each parameter and gives example values.

Allow List

The allow list (also known as a pass list or white list) prevents specific terms from being flagged as PII. Any detected entity whose text matches an allow list entry exactly (case-insensitive) is silently discarded.

Common use cases include:

  • Your own company name, brand names and public contact details
  • Department or team names within your organization
  • Names of public institutions that should not be redacted in your use case

The allow list must be specified in the config.yaml file:

allow_list:
  - "techNova gmbh"
  - "technova"
  - "Amazon"  # Entries may use any casing; matching is case-insensitive, so casing has no effect
  - "tribunal judiciaire"
  - "support@technova.de"

Example: With the allow list above, the only entity detected in the text Contact TechNova GmbH at info@technova.de is info@technova.de (as EMAIL).

  • The company name TechNova GmbH would usually be detected as ORGANIZATION, but is allowed through because it matches the entry techNova gmbh.
  • info@technova.de is still detected, even though technova is in the allow list. The whole entity text must match an entry exactly, and only support@technova.de is listed.
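The matching rule in this example can be sketched in a few lines of Python. This only models the semantics described above (exact, case-insensitive comparison of the whole entity text). The entity dictionaries and the function name apply_allow_list are hypothetical and do not reflect PII Eraser's internal data structures.

```python
def apply_allow_list(entities, allow_list):
    """Drop detected entities whose full text matches an allow list entry."""
    # casefold() gives a case-insensitive comparison key.
    allowed = {term.casefold() for term in allow_list}
    return [e for e in entities if e["text"].casefold() not in allowed]


allow_list = [
    "techNova gmbh",
    "technova",
    "Amazon",
    "tribunal judiciaire",
    "support@technova.de",
]

# Hypothetical detections for: Contact TechNova GmbH at info@technova.de
detected = [
    {"text": "TechNova GmbH", "entity_type": "ORGANIZATION"},
    {"text": "info@technova.de", "entity_type": "EMAIL"},
]

remaining = apply_allow_list(detected, allow_list)
print(remaining)
```

Because the comparison uses the whole entity text, the entry technova does not shield info@technova.de, which stays in the result.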

See the example configurations for real-world allow list usage, including the German Call Centre and French Legal Tech examples.

Block List & Custom Entity Types

The block list (also known as a deny list) lets you force specific terms to always be detected as a given entity type. It has two primary use cases, described below.

Guaranteeing Detection of Known Entities

If you have a list of terms that must always be caught—such as the names of important clients or partners—you can add them under an existing entity type:

block_list:
  ORGANIZATION:
    - "acme corp"
    - "acme corporation"
    - "globex inc"
  NAME:
    - "John Doe"  # Similar to the allow list, casing has no effect

Creating Entirely New Entity Types

You can define custom entity types that don't exist in PII Eraser's built-in set. This is useful for protecting internal project codenames, SKUs, competitor names, or any company-specific internal identifiers:

entity_types:
  - NAME
  - EMAIL
  # Custom types defined in block_list must always be explicitly enabled
  - PROJECT_CODENAME
  - COMPETITOR

block_list:
  PROJECT_CODENAME:
    - "project apollo"
    - "operation titan"
  COMPETITOR:
    - "Acme Corp"
    - "Initech"

Custom entity types must be added to entity_types

When you define a new entity type via the block list, you must also include it in the entity_types parameter (either in config.yaml or the API request) for it to be detected. If entity_types is omitted entirely, all built-in types are detected automatically—but custom types must always be listed explicitly.

As with the allow list, block list matching is exact and case-insensitive.
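A configuration check along the following lines could catch custom types that are defined in block_list but never enabled. It is a sketch under stated assumptions: the function name is hypothetical, the built-in set lists only type names that appear on this page (the real set is larger), and it assumes that an explicit entity_types list restricts detection to exactly the listed types.

```python
def undetected_block_list_types(entity_types, block_list, builtin_types):
    """Return block list types that the given settings would never detect."""
    if entity_types is None:
        # entity_types omitted: only built-in types are enabled.
        enabled = set(builtin_types)
    else:
        # Assumption: an explicit list enables exactly the listed types.
        enabled = set(entity_types)
    return sorted(t for t in block_list if t not in enabled)


# Partial, illustrative built-in set: only names mentioned on this page.
BUILTIN_TYPES = {"NAME", "EMAIL", "PHONE", "ORGANIZATION"}

block_list = {
    "PROJECT_CODENAME": ["project apollo", "operation titan"],
    "COMPETITOR": ["Acme Corp", "Initech"],
}

# COMPETITOR is defined in block_list but missing from entity_types.
missing = undetected_block_list_types(
    ["NAME", "EMAIL", "PROJECT_CODENAME"], block_list, BUILTIN_TYPES
)
print(missing)
```

An empty result means every block list type is enabled. With entity_types omitted, both custom types in this example would be reported.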

See the UK M&A Deal Room example configuration for a comprehensive block list in a real-world financial scenario.

Detection Confidence Threshold

Custom Confidence Thresholds

In most cases, specifying a custom confidence threshold is not recommended. PII Eraser's ML models are already calibrated and tested across multiple languages and entity types for optimal precision-recall balance.

Every detected entity carries a confidence score in the range (0, 1). PII Eraser applies automatic thresholding by default, but you can override this with the score_threshold parameter:

  • Lowering the threshold increases recall (catches more PII) but may increase false positives.
  • Raising the threshold reduces false positives but may miss less obvious PII instances.

The threshold can be set globally in config.yaml:

score_threshold: 0.9

Or per-request in the API body:

{
  "text": ["Call me at 555-0123"],
  "score_threshold": 0.9
}

When specified in both places, the API request value takes precedence.
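The trade-off between the two directions can be illustrated with a small filter. The scores and entity dictionaries below are invented for illustration, the function name is hypothetical, and the inclusive comparison (>=) is an assumption. The sketch models only an explicit score_threshold, not the automatic thresholding that applies by default.

```python
def filter_by_threshold(entities, score_threshold):
    """Keep entities whose confidence score reaches the threshold."""
    # Assumption: a score equal to the threshold counts as a detection.
    return [e for e in entities if e["score"] >= score_threshold]


# Invented scores for illustration only.
candidates = [
    {"text": "555-0123", "entity_type": "PHONE", "score": 0.95},
    {"text": "Apollo", "entity_type": "NAME", "score": 0.60},
]

strict = filter_by_threshold(candidates, 0.9)   # fewer false positives
lenient = filter_by_threshold(candidates, 0.5)  # higher recall
print(len(strict), len(lenient))
```

With the higher threshold only the high-confidence phone number survives. The lower threshold also keeps the uncertain candidate, which may or may not be real PII.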