Show HN: Pi Co-pilot – Evaluation of AI apps made easy
Our original idea was to help software engineers build high-quality LLM applications by integrating their domain knowledge into a scoring system, which could then drive everything from prompt tuning to fine-tuning, RL, and data filtering. But what we quickly learned (with the help of HN – thank you!) is that most people aren’t optimizing as their first, second, or even third step — they’re just trying to ship something reasonable using system prompts and off-the-shelf models.
In looking to build a product that’s useful to a wider audience, we found one piece of the original product that most people _did_ notice and want: the ability to check that the outputs of their AI apps look good. Whether you’re tweaking a prompt, switching models, or just testing a feature, you still need a way to catch regressions and evaluate your changes. Beyond basic correctness, developers also wanted to measure more subtle qualities — like whether a response feels friendly.
So we rebuilt the product around this single use case: helping developers define and apply subjective, nuanced evals to their LLM outputs. We call it Pi Co-pilot.
You can start with any or all of the following:
- a few good/bad examples
- a system prompt, or app description
- an old eval prompt you wrote
The co-pilot helps you turn that into a scoring spec — a set of ~10–20 concrete questions that probe the output against dimensions of quality you care about (e.g. “is it verbose?”, “does it have a professional tone?”, etc). For each question, it selects either:
- a fast encoder-based model (trained for scoring) – Pi scorer. See our original post [1] for more details on why this is a good fit for scoring compared to the “LLM as a judge” pattern.
- or a generated Python function when that makes more sense (word count, regex, etc.) – see the sketch below
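To make that concrete, here's a minimal sketch of how a spec like this can shake out. Everything in it is an illustrative assumption on my part (the field names, questions, and routing are not Pi's actual schema): subjective questions get routed to the Pi scorer, deterministic checks become plain Python.

```python
import re

# Illustrative sketch only: the structure below is an assumption, not Pi's
# actual spec format. Subjective questions go to the Pi scorer; deterministic
# checks are plain Python functions.

def under_word_limit(output: str, limit: int = 150) -> float:
    """Return 1.0 if the response stays under `limit` words, else 0.0."""
    return 1.0 if len(output.split()) <= limit else 0.0

def avoids_exclamations(output: str) -> float:
    """Regex-style check: reward outputs with no exclamation marks."""
    return 0.0 if re.search(r"!", output) else 1.0

scoring_spec = [
    {"question": "Is the response overly verbose?",    "scorer": "pi"},  # subjective -> Pi scorer
    {"question": "Does it have a professional tone?",  "scorer": "pi"},  # subjective -> Pi scorer
    {"question": "Is it under 150 words?",              "scorer": under_word_limit},
    {"question": "Does it avoid exclamation marks?",    "scorer": avoids_exclamations},
]
```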
You iterate over examples, tweak questions, adjust scoring behavior, and quickly reach a spec that reflects your actual taste — not some generic benchmark or off-the-shelf metrics. Then you can plug the scoring system into your own workflow: Python, TypeScript, Promptfoo, Langfuse, Spreadsheets, whatever. We provide easy integrations with these systems.
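As a rough illustration of that hand-off (not the actual Pi SDK, Promptfoo, or Langfuse integration), here is what wiring the spec sketched above into an existing test loop could look like; `call_pi_scorer` is a placeholder you'd swap for whichever client you actually use:

```python
from statistics import mean

def call_pi_scorer(question: str, output: str) -> float:
    # Placeholder: swap in your real scorer call (Python/TypeScript SDK,
    # Promptfoo assertion, spreadsheet formula, ...). Returns a score in [0, 1].
    return 0.5

def evaluate(output: str, spec) -> float:
    """Average the per-question scores for one model output."""
    scores = []
    for item in spec:
        if item["scorer"] == "pi":
            scores.append(call_pi_scorer(item["question"], output))
        else:
            scores.append(item["scorer"](output))  # local Python check
    return mean(scores)

# e.g. a simple regression gate after a prompt or model change:
# assert evaluate(candidate_output, scoring_spec) >= evaluate(baseline_output, scoring_spec)
```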
We took inspiration from tools like v0 and Bolt: natural language on the left, structured artifacts on the right. That pattern felt intuitive — explore conversationally, and let the underlying system crystallize it into things you can inspect and use (scoring spec, examples and code). Here is a loom demo of this: https://www.loom.com/share/82c2e7b511854a818e8a1f4eabb1a8c2
We’d appreciate feedback from the community on whether this second iteration of our product feels more useful. We are offering $10 of free credits (about 25M input tokens), so you can try out the Pi co-pilot for your use cases. No sign-in required to start exploring: https://withpi.ai
Overall stack: the co-pilot runs on Next.js and Vercel on GCP. Models: GPT-4o on Azure, fine-tuned Llama and ModernBERT on GCP. Training: RunPod and SFCompute.
– Achint (co-founder, Pi Labs)
1. If you're going to record a demo, invest $100 in a real microphone. The sound quality of the Loom demo is really off-putting. It might also be over-compression, but listening to this kind of sound gives me a headache.
2. The demo has left me more confused. Rather than going step by step, you take the Blue Peter approach of "Here's one I made earlier" and suddenly you're switching tabs to something different. Show me the product in action.
I guess I'm not in the market for this, but it feels UI-heavy for something that's evaluating agents / infrastructure-as-code. I'd have thought that if I was going to not just automate something but also automate the evaluation of that automation, I'd want a pipeline / process for that, not to scan down the criteria trying to work out which blog posts are which and how the scores relate.
We have a spreadsheet integration (which I might post as a comment) for the use case you mentioned. The scorer is quite lightweight, so it's easy to integrate into your existing pipelines instead of building yet another pipeline/framework. The co-pilot is specifically for triangulating the right set of metrics (which are subjective and a matter of your taste), and that does require looking at examples a few at a time and making judgement calls. But I agree that once you're done with that, you want to quickly transition off of it to either code or other frameworks like sheets, Promptfoo, etc.
I joined Pi 3 months ago after a decade at Google. It was partly the HN community that inspired me to make the switch to a smaller company where I could have more direct impact. Working at a start-up has been quite an adjustment: while the work is extremely rewarding and fun, the pre-product/market-fit phase is challenging in ways I've never experienced before in my career.
That's why I asked the team to post here, and am excited to show off this launch to see whether it meets a need that developers have (or learn why if not!)
Just for fun, I spent the last 5 minutes making an evaluation system for Show HN posts, so you can look at a real example if you'd rather not make your own [1]. If you sign in, you can fork and modify it, but you can also go straight to the homepage and try your hand at it without any sign-in.
[1] https://withpi.ai/project/Xxyhrg2UR8kZHeNmbdV3