Show HN: GPU Profiling That's Useful in 60 Seconds

Posted by technoabsurdist on 8/9/2025 · keysandcaches.com
Hey HN! We're building a profiler for ML inference that actually shows what's happening at the hardware level, without manually parsing flame graphs or setting up nsys and ncu.

The problem: Current ML profilers either dump too much data (torch.profiler) or abstract away the details you need. You can't see why your model is actually slow - is it memory bandwidth? Kernel launch overhead? Cache misses?
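For contrast, here's the status quo with torch.profiler (standard PyTorch API; the model and inputs are placeholders): you get hundreds of rows of op timings, but no direct answer to which hardware resource is the bottleneck.

    import torch
    from torch.profiler import profile, ProfilerActivity

    # Placeholder workload: any model and input batch will do.
    model = torch.nn.Linear(4096, 4096).cuda().eval()
    inputs = torch.randn(64, 4096, device="cuda")

    # Standard torch.profiler usage: op-level CPU/CUDA timings only.
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        with torch.no_grad():
            model(inputs)

    # Dumps a large table sorted by GPU time; interpreting it is on you.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))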

Our approach: we're reverse-engineering GPU execution to trace from Python ops down to PTX instructions. One decorator gives you the full execution graph with the actual bottlenecks highlighted.
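A minimal sketch of that workflow (the import and decorator name here are illustrative, not the real API; see the docs below for actual usage):

    import torch
    # Illustrative only: the real package lives at github.com/Herdora/kandc,
    # and the `trace` name/signature here is a stand-in, not the actual API.
    from kandc import trace

    model = torch.nn.Linear(4096, 4096).cuda().eval()   # placeholder model
    inputs = torch.randn(64, 4096, device="cuda")

    @trace   # one decorator: record Python ops -> CUDA kernels -> PTX
    def run_inference(batch):
        with torch.no_grad():
            return model(batch)

    run_inference(inputs)   # trace data is captured on the decorated call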

Technical details:

- Traces Python → CUDA kernels → PTX with timing breakdowns
- Shows memory access patterns and bandwidth utilization (see the sketch after this list)
- Kernel occupancy and scheduling analysis
- Works with PyTorch/JAX; TensorFlow support coming
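To make the bandwidth point concrete, this is the kind of manual back-of-envelope check the tool replaces, written in plain PyTorch with CUDA events (sizes and the 2x bytes multiplier are illustrative for an elementwise op):

    import torch

    # Time a memory-bound op and compare achieved bytes/s to the GPU's peak.
    x = torch.randn(1 << 26, device="cuda")    # ~256 MiB of fp32
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    torch.cuda.synchronize()
    start.record()
    y = x * 2.0                                # elementwise: one read, one write
    end.record()
    torch.cuda.synchronize()

    ms = start.elapsed_time(end)                       # milliseconds
    bytes_moved = 2 * x.numel() * x.element_size()     # read x + write y
    print(f"achieved ~{bytes_moved / (ms / 1e3) / 1e9:.0f} GB/s "
          f"(compare against your GPU's peak memory bandwidth)")

If the achieved number is close to the hardware peak, the op is bandwidth-bound and no amount of kernel tuning will help; if it's far below, something else (launch overhead, occupancy, access patterns) is the bottleneck.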

We used this to optimize Llama inference and found bottlenecks we couldn't see before, getting a 50%+ speedup: https://www.herdora.com/blog/the-overlooked-gpu

Free beta with 10 hours of profiling: https://keysandcaches.com
GitHub: https://github.com/Herdora/kandc
Docs: https://www.keysandcaches.com/docs

Curious what inference bottlenecks others are hitting that current tools can't diagnose. What's your experience with existing profilers? We'd love to hear thoughts from the community :)
