Scaling OpenTelemetry Kafka ingestion by 150%: 192K → 480K EPS

We recently worked with a customer whose log ingestion pipeline (OpenTelemetry Collector + Kafka) was falling behind fast. Throughput was capped at ~12K EPS per partition (192K EPS total across 16 partitions), well short of their production volume. Consumer lag kept growing, and the backlog was weeks deep.

We spent several weeks digging into the Kafka receiver and testing different configurations. Key findings:

Kafka client swap → Opted into the franz-go client (via feature gate), yielding a ~25% throughput improvement.
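
The opt-in is a startup flag rather than a config key. Roughly what it looks like (the gate name here is from memory, so check the kafkareceiver README for your collector version; broker/topic names are illustrative):

    # Kafka receiver config stays the same; the collector is started with the
    # franz-go feature gate enabled, e.g.:
    #   otelcol-contrib --config config.yaml --feature-gates=receiver.kafkareceiver.UseFranzGo
    receivers:
      kafka:
        brokers: ["kafka-1:9092", "kafka-2:9092"]   # illustrative brokers
        topic: app-logs                              # illustrative topic
        group_id: otel-logs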

Encoding mismatch → The receiver was decoding with OTLP/JSON even though the topic carried raw application JSON; switching the receiver encoding to plain JSON nearly tripled throughput.
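
Roughly what that change looks like (encoding values are the ones I recall from the kafkareceiver docs; confirm them for your version):

    receivers:
      kafka:
        topic: app-logs          # illustrative topic
        # encoding: otlp_json    # before: expects OTLP-shaped JSON payloads
        encoding: json           # after: raw application JSON goes straight into the log body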

Export protocol choice → Exporting over gRPC introduced extra overhead from the JSON→protobuf conversion; OTLP over HTTPS was ~3K EPS faster.
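
Sketch of the exporter swap (the endpoint is illustrative, and the ~3K EPS delta was specific to this workload):

    exporters:
      # otlp:                                          # before: OTLP over gRPC
      #   endpoint: otel-gateway.example.com:4317
      otlphttp:                                        # after: OTLP over HTTP(S)
        endpoint: https://otel-gateway.example.com:4318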

Batching strategies → Where the batch processor sat in the pipeline changed CPU and memory efficiency, and the best placement depended on the workload.
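
For illustration, one such placement (batch sizes here are placeholders, not tuned values):

    processors:
      batch:
        send_batch_size: 8192        # placeholder; tune against CPU/memory headroom
        send_batch_max_size: 16384
        timeout: 200ms
    service:
      pipelines:
        logs:
          receivers: [kafka]
          processors: [batch]
          exporters: [otlphttp]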

After applying all fixes, we sustained 30K EPS per partition / 480K EPS total — a 150% improvement — and cleared the backlog in under 48 hours.

Full use case (with benchmarks, config details, and lessons learned): https://bindplane.com/blog/kafka-performance-crisis-how-we-scaled-opentelemetry-log-ingestion-by-150

Has anyone else hit scaling limits with the OTel Kafka receiver? Curious what approaches others have tried.
