Benchmarks & Hardware Selection
This section covers hardware selection guidance and processing throughput benchmarks to help you choose the right infrastructure for your PII Eraser deployment.
For client-side optimizations (concurrency, batching, connection pooling), see Performance Tuning.
CPU Architecture & Instruction Sets
The choice of CPU has a dramatic impact on throughput — up to a 5× difference between older and current-generation processors. This is because the latest x86 CPUs include dedicated instructions for neural network workloads, which PII Eraser's inference engine automatically detects and utilizes:
| Instruction Set | CPUs | Approximate Speedup |
|---|---|---|
| AVX-512 VNNI | Intel Ice Lake+, AMD Zen 4+ | 2–3× over baseline |
| AMX (Advanced Matrix Extensions) | Intel Sapphire Rapids+ (AWS c7i, c8i) | 3–5× over baseline |
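PII Eraser's inference engine detects these instruction sets automatically, but it can be useful to confirm what a given host actually exposes before benchmarking. A minimal sketch, assuming a Linux host: the flag names (`avx512_vnni`, `amx_tile`, `amx_int8`) are the kernel's `/proc/cpuinfo` names, while the `accel_tier` helper and its tier mapping are illustrative, not part of the product.

```python
# Check which neural-network instruction sets a Linux host exposes.
# Flag names come from the kernel's /proc/cpuinfo; the tier mapping
# below is an illustration of the table above, not a product API.

def cpu_flags(path="/proc/cpuinfo"):
    """Return the CPU feature flags reported by the kernel."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass  # non-Linux host or unreadable path
    return set()

def accel_tier(flags):
    """Map feature flags to the approximate speedup tiers above."""
    if {"amx_tile", "amx_int8"} <= flags:
        return "AMX"
    if "avx512_vnni" in flags:
        return "AVX-512 VNNI"
    return "baseline"
```

Running `accel_tier(cpu_flags())` on a c8i instance should report `"AMX"`, and on a pre-Ice Lake instance such as c5 it falls back to `"baseline"`.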
Choosing the right CPU generation is often the single biggest performance lever available to you. When selecting an instance type, prioritize CPU generation over instance size — a 4-vCPU c8a.xlarge significantly outperforms an 8-vCPU c5.2xlarge.
Serverless Compatibility
Because PII Eraser runs entirely on CPUs with no GPU or CUDA dependencies, it is natively compatible with serverless container platforms such as AWS Fargate, Azure Container Instances, and Google Cloud Run. This eliminates the need for GPU instance provisioning, driver management, and CUDA compatibility matrices — you simply deploy a container and it runs.
See the Fargate benchmarks below and the AWS Deployment guide for a production-ready Fargate configuration.
Processing Throughput Benchmarks
The benchmarks below were conducted against the service endpoint created by the CloudFormation reference implementation, using a single EC2 instance added to the test Lambda security group to act as a load generator. The stack was configured to run a single instance of the specified type with no other workloads on the host.
Test Methodology
- All benchmarks used 1 text per request. Sending multiple texts per API request had a very minor impact on processing throughput.
- Each instance was the sole PII Eraser container running on its host, consistent with the resource isolation requirements.
- Throughput values represent sustained processing speed measured in tokens per second (tok/s).
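The sustained tok/s figure can be reproduced with a simple wall-clock measurement: total tokens processed divided by elapsed time, with requests issued at fixed concurrency. A hedged sketch — the `send_request` callable is a stand-in for your PII Eraser client call, not part of the product API:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def measure_throughput(send_request, texts, concurrency=4):
    """Sustained tok/s: total tokens processed divided by wall time.

    `send_request` is a placeholder for a client call that sends one
    text to the service and returns its token count.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        total_tokens = sum(pool.map(send_request, texts))
    return total_tokens / (time.perf_counter() - start)
```

Use a corpus large enough that warm-up effects wash out; a few hundred requests per concurrency level is usually sufficient to get a stable sustained figure.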
EC2
| Instance Type | 1 Concurrent Req (tok/s) | 4 Concurrent Reqs (tok/s) |
|---|---|---|
| c7a.xlarge | 1739 | 1634 |
| c7i.xlarge | 2000 | 2190 |
| c8a.xlarge | 3515 | 3430 |
| c8i.xlarge | 2204 | 2456 |
| m8i.xlarge | 2157 | 2415 |
| c5.2xlarge | 747 | 805 |
| c7a.2xlarge | 2875 | 2932 |
| c7i.2xlarge | 3130 | 3823 |
| c8a.2xlarge | 5676 | 5837 |
| c8i.2xlarge | 3497 | 4444 |
| c7a.4xlarge | 2327 | 3064 |
| c7i.4xlarge | 4543 | 6648 |
| c8a.4xlarge | 3549 | 4615 |
| c8i.4xlarge | 4833 | 7545 |
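To compare these instances on cost efficiency, divide sustained tok/s by the instance's hourly price. A small sketch; the prices below are illustrative placeholders, not current AWS rates:

```python
def tokens_per_dollar(tok_per_s, usd_per_hour):
    """Tokens processed per dollar of instance time."""
    return tok_per_s * 3600 / usd_per_hour

# Illustrative placeholder prices -- substitute current on-demand
# rates for your region before drawing conclusions.
candidates = {
    "c8a.xlarge": (3430, 0.25),   # tok/s at 4 concurrent reqs, $/hr
    "c8i.4xlarge": (7545, 1.00),
}
best = max(candidates, key=lambda name: tokens_per_dollar(*candidates[name]))
```

With these placeholder prices, the smaller `c8a.xlarge` comes out ahead per dollar even though the `c8i.4xlarge` has more than twice its raw throughput — which is why the recommendations below favor scaling out with smaller instances.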
Fargate
| CPU Units | 1 Concurrent Req (tok/s) | 4 Concurrent Reqs (tok/s) |
|---|---|---|
| 2048 | 674 | 676 |
| 4096 | 971 | 900 |
| 8192 | 1917 | 2638 |
| 16384 | 3096 | 4455 |
Key Observations
Instance families and RAM
- `m` (general purpose) and `c` (compute optimized) instance types of the same generation (e.g., `c8i` and `m8i`) have nearly identical throughput. Only `c` benchmarks are shown for most sizes.
- The amount of RAM provisioned has a minimal impact on throughput for both EC2 and Fargate, so long as the minimum requirement (7 GB) is met.
Scaling behavior
- Bigger instances are not always better. For `c8a` instances, the `4xlarge` is slower than the `2xlarge`. This is likely due to the workload being split across multiple CCDs (Core Complex Dies) on AMD processors, introducing cross-die latency. For most workloads, running multiple `xlarge` or `2xlarge` instances behind a load balancer is more cost-effective than using fewer, larger instances.
- Intel instances with AMX (`c7i`, `c8i`) scale better to larger sizes because AMX benefits from more cores working on matrix operations.
Fargate considerations
- Fargate performance varies considerably depending on the underlying CPU type assigned by AWS, which is not user-selectable. Results may differ between deployments and regions.
- Despite this variability, Fargate remains an attractive option for teams that prefer zero infrastructure management. Because PII Eraser is CPU-only, it works on Fargate without any specialized configuration — see Serverless Compatibility above.
ARM support
- PII Eraser currently does not support ARM instance types such as AWS Graviton. An ARM build is in development; however, throughput on current ARM server CPUs is inferior to the latest x86 CPUs, particularly `c8a` instances, which provide 1 physical core per vCPU (no SMT overhead). ARM support will be revisited with alternative inference engines once AWS Graviton 5 CPUs are generally available.
Recommendations
The optimal instance choice depends on whether you are optimizing for cost efficiency, single-request latency, or maximum throughput under concurrent load.
| Use Case | Recommended Instance | Rationale |
|---|---|---|
| Best throughput per dollar | c8a.xlarge | Delivers 3,500 tok/s at the lowest cost per token on a single instance. For higher total throughput, scale horizontally with multiple c8a.xlarge instances behind a load balancer. |
| Latency-sensitive applications | c8a.2xlarge | Best single-instance throughput at 5,800 tok/s under concurrency. Ideal for real-time LLM chat pipelines where per-request latency matters. |
| Maximum concurrent throughput | c8i.4xlarge | Highest throughput under concurrent load (7,500+ tok/s) thanks to Intel AMX. Best when saturated with multiple parallel requests. |
| Serverless / Variable workloads | Fargate 16384 CPU | Delivers 4,400+ tok/s under concurrency with zero infrastructure management. Ideal if you prefer operational simplicity over raw performance, as throughput varies based on the underlying CPU AWS provisions. |
Scale Out, Not Up
For most workloads, running multiple xlarge or 2xlarge instances behind a load balancer delivers better cost efficiency and availability than a single large instance. See Performance Tuning for client-side optimizations that maximize throughput.
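The scale-out advice above reduces to a simple capacity calculation: pick a per-instance throughput from the benchmarks, apply a utilization headroom, and round up. A sketch — the 70% headroom default is an assumption for illustration, not a product recommendation:

```python
import math

def instances_needed(target_tok_s, per_instance_tok_s, headroom=0.7):
    """Instances required to sustain a target throughput when each
    instance is run at a fraction of its benchmarked tok/s to leave
    room for traffic spikes (70% is an assumed default).
    """
    return math.ceil(target_tok_s / (per_instance_tok_s * headroom))
```

For example, sustaining 10,000 tok/s on `c8a.xlarge` instances (about 3,430 tok/s each under concurrency) would take five instances at 70% utilization.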