Scaling OpenTelemetry Kafka ingestion by 150% from 192K → 480K EPS
We spent several weeks digging into the Kafka receiver and testing different configurations. Key findings:
Kafka client swap → Opted into the Franz-Go client (via feature gate), yielding a 25% improvement.
Encoding mismatch → Discovered OTLP JSON was being used for raw app logs; switching to plain JSON nearly tripled throughput.
Export protocol choice → gRPC introduced extra overhead from the JSON→protobuf conversion; OTLP over HTTP was ~3K EPS faster.
Batching strategies → Placement of the batch processor changed CPU/memory efficiency depending on workload.
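For anyone wanting to experiment, a minimal collector config sketch combining these ideas might look like the following. This is illustrative only, not the exact config from the post: broker/endpoint values are placeholders, the batch sizes are guesses, and it assumes the contrib Kafka receiver's `encoding` setting and the `otlphttp` exporter.

```yaml
receivers:
  kafka:
    brokers: ["kafka:9092"]        # placeholder broker
    topic: app-logs                # placeholder topic
    # Plain JSON for raw app logs instead of otlp_json,
    # which nearly tripled throughput in our tests.
    encoding: json

processors:
  batch:
    # Batch processor placement and sizing changed CPU/memory
    # efficiency depending on workload; values are illustrative.
    send_batch_size: 8192
    timeout: 200ms

exporters:
  # OTLP over HTTP avoided the extra JSON→protobuf conversion
  # overhead we observed with gRPC.
  otlphttp:
    endpoint: https://backend.example.com:4318   # placeholder endpoint

service:
  pipelines:
    logs:
      receivers: [kafka]
      processors: [batch]
      exporters: [otlphttp]
```

The Franz-Go client is opted into via a feature gate at collector startup, e.g. `--feature-gates=receiver.kafkareceiver.UseFranzGo` (gate name assumed here; check the Kafka receiver README for your collector version).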
After applying all fixes, we sustained 30K EPS per partition / 480K EPS total — a 150% improvement — and cleared the backlog in under 48 hours.
Full use case (with benchmarks, config details, and lessons learned): https://bindplane.com/blog/kafka-performance-crisis-how-we-scaled-opentelemetry-log-ingestion-by-150
Has anyone else hit scaling limits with the OTel Kafka receiver? Curious what approaches others have tried.