Benchmarks & Hardware Selection
This section covers hardware selection guidance and processing throughput benchmarks to help you choose the right infrastructure for your PII Eraser deployment.
For client-side optimizations (concurrency, batching, connection pooling), see Performance Tuning.
CPU Architecture & Instruction Sets
The choice of CPU has a dramatic impact on throughput — up to a 5× difference between older and current-generation processors. This is because the latest x86 CPUs include dedicated instructions for neural network workloads, which PII Eraser's inference engine automatically detects and utilizes:
| Instruction Set | CPUs | Approximate Speedup |
|---|---|---|
| AVX-512 VNNI | Intel Ice Lake+, AMD Zen 4+ | 2–3× over baseline |
| AMX (Advanced Matrix Extensions) | Intel Sapphire Rapids+ (AWS c7i, c8i) | 3–5× over baseline |
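PII Eraser's inference engine detects these instruction sets automatically, but it can be useful to confirm what a given host actually exposes before benchmarking. A minimal sketch, assuming a Linux host: the flag names (`avx512_vnni`, `amx_tile`, `amx_int8`) are the kernel's `/proc/cpuinfo` names, while the `accel_tier` helper and its tier mapping are illustrative, not part of the product.

```python
# Check which neural-network instruction sets a Linux host exposes.
# Flag names come from the kernel's /proc/cpuinfo; the tier mapping
# below is an illustration of the table above, not a product API.

def cpu_flags(path="/proc/cpuinfo"):
    """Return the CPU feature flags reported by the kernel."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass  # non-Linux host or unreadable path
    return set()

def accel_tier(flags):
    """Map feature flags to the approximate speedup tiers above."""
    if {"amx_tile", "amx_int8"} <= flags:
        return "AMX"
    if "avx512_vnni" in flags:
        return "AVX-512 VNNI"
    return "baseline"
```

Running `accel_tier(cpu_flags())` on a c8i instance should report `"AMX"`, and on a pre-Ice Lake instance such as c5 it falls back to `"baseline"`.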
Choosing the right CPU generation is often the single biggest performance lever available to you. When selecting an instance type, prioritize CPU generation over instance size — a 4-vCPU c8a.xlarge significantly outperforms an 8-vCPU c5.2xlarge.
Serverless Compatibility
Because PII Eraser runs entirely on CPUs with no GPU or CUDA dependencies, it is natively compatible with serverless container platforms such as AWS Fargate, Azure Container Instances, and Google Cloud Run. This eliminates the need for GPU instance provisioning, driver management, and CUDA compatibility matrices — you simply deploy a container and it runs.
See the Fargate benchmarks below and the AWS Deployment guide for a production-ready Fargate configuration.
Processing Throughput Benchmarks
The benchmarks below were conducted against the service endpoint created by the CloudFormation reference implementation, using a single EC2 instance added to the test Lambda security group to act as a load generator. The stack was configured to run a single instance of the specified type with no other workloads on the host.
Test Methodology
- All benchmarks used 1 text per request. Sending multiple texts per API request had a very minor impact on processing throughput.
- Each instance was the sole PII Eraser container running on its host, consistent with the resource isolation requirements.
- Throughput values represent sustained processing speed measured in tokens per second (tok/s).
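The sustained tok/s figure can be reproduced with a simple wall-clock measurement: total tokens processed divided by elapsed time, with requests issued at fixed concurrency. A hedged sketch — the `send_request` callable is a stand-in for your PII Eraser client call, not part of the product API:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def measure_throughput(send_request, texts, concurrency=4):
    """Sustained tok/s: total tokens processed divided by wall time.

    `send_request` is a placeholder for a client call that sends one
    text to the service and returns its token count.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        total_tokens = sum(pool.map(send_request, texts))
    return total_tokens / (time.perf_counter() - start)
```

Use a corpus large enough that warm-up effects wash out; a few hundred requests per concurrency level is usually sufficient to get a stable sustained figure.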
EC2
| Instance Type | 1 Concurrent Req (tok/s) | 4 Concurrent Reqs (tok/s) |
|---|---|---|
| c7a.xlarge | 1739 | 1634 |
| c7i.xlarge | 2000 | 2190 |
| c8a.xlarge | 3515 | 3430 |
| c8i.xlarge | 2204 | 2456 |
| m8i.xlarge | 2157 | 2415 |
| c5.2xlarge | 747 | 805 |
| c7a.2xlarge | 2875 | 2932 |
| c7i.2xlarge | 3130 | 3823 |
| c8a.2xlarge | 5676 | 5837 |
| c8i.2xlarge | 3497 | 4444 |
| c7a.4xlarge | 2327 | 3064 |
| c7i.4xlarge | 4543 | 6648 |
| c8a.4xlarge | 3549 | 4615 |
| c8i.4xlarge | 4833 | 7545 |
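To compare these instances on cost efficiency, divide sustained tok/s by the instance's hourly price. A small sketch; the prices below are illustrative placeholders, not current AWS rates:

```python
def tokens_per_dollar(tok_per_s, usd_per_hour):
    """Tokens processed per dollar of instance time."""
    return tok_per_s * 3600 / usd_per_hour

# Illustrative placeholder prices -- substitute current on-demand
# rates for your region before drawing conclusions.
candidates = {
    "c8a.xlarge": (3430, 0.25),   # tok/s at 4 concurrent reqs, $/hr
    "c8i.4xlarge": (7545, 1.00),
}
best = max(candidates, key=lambda name: tokens_per_dollar(*candidates[name]))
```

With these placeholder prices, the smaller `c8a.xlarge` comes out ahead per dollar even though the `c8i.4xlarge` has more than twice its raw throughput — which is why the recommendations below favor scaling out with smaller instances.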
Fargate
| CPU Units | 1 Concurrent Req (tok/s) | 4 Concurrent Reqs (tok/s) |
|---|---|---|
| 2048 | 674 | 676 |
| 4096 | 971 | 900 |
| 8192 | 1917 | 2638 |
| 16384 | 3096 | 4455 |
Key Observations
Instance families and RAM
- `m` (general purpose) and `c` (compute optimized) instance types of the same generation (e.g., `c8i` and `m8i`) have nearly identical throughput. Only `c` benchmarks are shown for most sizes.
- The amount of RAM provisioned has a minimal impact on throughput for both EC2 and Fargate, so long as the minimum requirement (7 GB) is met.
Scaling behavior
- Bigger instances are not always better. For `c8a` instances, the `4xlarge` is slower than the `2xlarge`. This is likely due to the workload being split across multiple CCDs (Core Complex Dies) on AMD processors, introducing cross-die latency. For most workloads, running multiple `xlarge` or `2xlarge` instances behind a load balancer is more cost-effective than using fewer, larger instances.
- Intel instances with AMX (`c7i`, `c8i`) scale better to larger sizes because AMX benefits from more cores working on matrix operations.
Fargate considerations
- Fargate performance varies considerably depending on the underlying CPU type assigned by AWS, which is not user-selectable. Results may differ between deployments and regions.
- Despite this variability, Fargate remains an attractive option for teams that prefer zero infrastructure management. Because PII Eraser is CPU-only, it works on Fargate without any specialized configuration — see Serverless Compatibility above.
ARM support
- PII Eraser currently does not support ARM instance types such as AWS Graviton. An ARM build is in development; however, throughput on current ARM server CPUs is inferior to the latest x86 CPUs, particularly `c8a` instances, which provide 1 physical core per vCPU (no SMT overhead). ARM support will be revisited with alternative inference engines once AWS Graviton 5 CPUs are generally available.
Recommendations
The optimal instance choice depends on whether you are optimizing for cost efficiency, single-request latency, or maximum throughput under concurrent load.
| Use Case | Recommended Instance | Rationale |
|---|---|---|
| Best throughput per dollar | c8a.xlarge | Delivers 3,500 tok/s at the lowest cost per token on a single instance. For higher total throughput, scale horizontally with multiple c8a.xlarge instances behind a load balancer. |
| Latency-sensitive applications | c8a.2xlarge | Best single-instance throughput at 5,800 tok/s under concurrency. Ideal for real-time LLM chat pipelines where per-request latency matters. |
| Maximum concurrent throughput | c8i.4xlarge | Highest throughput under concurrent load (7,500+ tok/s) thanks to Intel AMX. Best when saturated with multiple parallel requests. |
| Serverless / Variable workloads | Fargate 16384 CPU | Delivers 4,400+ tok/s under concurrency with zero infrastructure management. Ideal if you prefer operational simplicity over raw performance, as throughput varies based on the underlying CPU AWS provisions. |
Scale Out, Not Up
For most workloads, running multiple xlarge or 2xlarge instances behind a load balancer delivers better cost efficiency and availability than a single large instance. See Performance Tuning for client-side optimizations that maximize throughput.
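The scale-out advice above reduces to a simple capacity calculation: pick a per-instance throughput from the benchmarks, apply a utilization headroom, and round up. A sketch — the 70% headroom default is an assumption for illustration, not a product recommendation:

```python
import math

def instances_needed(target_tok_s, per_instance_tok_s, headroom=0.7):
    """Instances required to sustain a target throughput when each
    instance is run at a fraction of its benchmarked tok/s to leave
    room for traffic spikes (70% is an assumed default).
    """
    return math.ceil(target_tok_s / (per_instance_tok_s * headroom))
```

For example, sustaining 10,000 tok/s on `c8a.xlarge` instances (about 3,430 tok/s each under concurrency) would take five instances at 70% utilization.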