Show HN: Pi Co-pilot – Evaluation of AI apps made easy
Our original idea [1] was to help software engineers build high-quality LLM applications by integrating their domain knowledge into a scoring system, which could then drive everything from prompt tuning to fine-tuning, RL, and data filtering. But what we quickly learned (with the help of HN – thank you!) is that most people aren’t optimizing as their first, second, or even third step — they’re just trying to ship something reasonable using system prompts and off-the-shelf models.
In looking to build a product that’s useful to a wider audience, we found one piece of the original product that most people _did_ notice and want: the ability to check that the outputs of their AI apps look good. Whether you’re tweaking a prompt, switching models, or just testing a feature, you still need a way to catch regressions and evaluate your changes. Beyond basic correctness, developers also wanted to measure more subtle qualities — like whether a response feels friendly.
So we rebuilt the product around this single use case: helping developers define and apply subjective, nuanced evals to their LLM outputs. We call it Pi Co-pilot.
You can start with any or all of the following:
- a few good/bad examples
- a system prompt, or app description
- an old eval prompt you wrote
The co-pilot helps you turn that into a scoring spec — a set of ~10–20 concrete questions that probe the output against dimensions of quality you care about (e.g. “is it verbose?”, “does it have a professional tone?”, etc). For each question, it selects either:
- a fast encoder-based model (trained for scoring) – Pi scorer. See our original post [1] for more details on why this is a good fit for scoring compared to the “LLM as a judge” pattern.
- or a generated Python function when that makes more sense (word count, regex, etc.) – there's a rough sketch of such a check just below.
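To make the second case concrete, here is a hypothetical sketch (not the code the co-pilot actually emits) of what generated Python checks for questions like "is it verbose?" might look like:

    import re

    def check_concise(output: str, max_words: int = 150) -> float:
        # Hypothetical check for "is it verbose?": full credit under a word
        # budget, degrading linearly to 0 as the response overshoots it.
        words = len(re.findall(r"\S+", output))
        if words <= max_words:
            return 1.0
        return max(0.0, 1.0 - (words - max_words) / max_words)

    def check_no_hype(output: str) -> float:
        # Hypothetical regex check: penalize obvious marketing filler.
        hype = re.compile(r"\b(cutting[- ]edge|revolutionary|game[- ]chang\w*)\b", re.I)
        return 0.0 if hype.search(output) else 1.0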
You iterate over examples, tweak questions, adjust scoring behavior, and quickly reach a spec that reflects your actual taste — not some generic benchmark or off-the-shelf metrics. Then you can plug the scoring system into your own workflow: Python, TypeScript, Promptfoo, Langfuse, Spreadsheets, whatever. We provide easy integrations with these systems.
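As a rough illustration of the integration shape (the names below are placeholders, not our actual SDK), applying a scoring spec inside your own test loop could look something like this:

    # Hypothetical sketch of plugging a scoring spec into a test loop.
    # pi_score is a stub standing in for the hosted Pi scorer; the real
    # SDK call and question wording will differ.
    from statistics import mean

    def pi_score(question: str, output: str) -> float:
        raise NotImplementedError("call the Pi scoring API here")

    SCORING_SPEC = [
        ("Is the response concise?",
         lambda out: 1.0 if len(out.split()) <= 150 else 0.0),              # Python check
        ("Does it have a professional tone?",
         lambda out: pi_score("Does it have a professional tone?", out)),   # Pi scorer
    ]

    def score_output(output: str) -> float:
        # Average the per-question scores; a CI gate might fail below a threshold.
        return mean(scorer(output) for _, scorer in SCORING_SPEC)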
We took inspiration from tools like v0 and Bolt: natural language on the left, structured artifacts on the right. That pattern felt intuitive — explore conversationally, and let the underlying system crystallize it into things you can inspect and use (scoring spec, examples and code). Here's a Loom demo [2].
We’d appreciate feedback from the community on whether this second iteration of our product feels more useful. We are offering $10 of free credits (about 25M input tokens), so you can try out the Pi co-pilot for your use cases. No sign-in required to start exploring: https://withpi.ai
Overall stack: the co-pilot is Next.js and Vercel on GCP. Models: 4o on Azure; fine-tuned Llama and ModernBERT on GCP. Training: Runpod and SFCompute.
– Achint (co-founder, Pi Labs)
[1] https://news.ycombinator.com/item?id=43362535
[2] https://www.loom.com/share/82c2e7b511854a818e8a1f4eabb1a8c2