Benchmark Scores Aren't Enough: A/B Testing AI in Production

Comments (1)

royalfig · 3d ago

Goodhart's Law states, "When a measure becomes a target, it ceases to be a good measure."

This applies to our favorite LLMs, too: as they are optimized to score high on benchmarks, how do we know those scores also translate into real-world performance on things like accuracy, latency, or user engagement?

A/B testing AI models in production gives you real feedback on how your LLM and its configuration are performing.
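As a rough illustration of what that can look like, here is a minimal sketch, not the article's method. It assumes a binary success metric (say, a thumbs-up on a response) and two model configurations labeled "A" and "B". The function names and the experiment key are hypothetical, and the significance check is a plain two-proportion z-test.

```python
import hashlib
import math


def assign_variant(user_id: str, experiment: str = "llm-config-test") -> str:
    """Deterministically bucket a user into 'A' or 'B' (roughly 50/50).

    Hashing the experiment name together with the user id keeps each user
    on the same variant across requests without storing any state.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"


def two_proportion_z_test(success_a: int, total_a: int,
                          success_b: int, total_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in success rates.

    Assumes both groups are non-empty and the pooled rate is strictly
    between 0 and 1; otherwise the standard error is zero.
    """
    p_a = success_a / total_a
    p_b = success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts purely for illustration: 100/1000 vs. 150/1000 successes.
z, p = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

The same split works for latency or engagement, though continuous metrics would need a different test than the one sketched here.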
