CCPS: Calibrating LLM Confidence via Perturbation Stability

Comments (1)

erfan_mhi · 7h ago

Author here. Our paper “Calibrating LLM Confidence by Probing Perturbed Representation Stability” was accepted to EMNLP 2025 Main Conference (top 15%) with a final rating of 9 (strong accept).

High-level summary: We apply slight perturbations to the LLM's hidden states and check whether the answer stays stable: a stable answer signals confidence, an unstable one signals uncertainty. This lightweight method reduces calibration error by more than 50% (down to ~4.5%) across LLaMA, Mistral, and Qwen on MMLU and MMLU-Pro, with no LLM fine-tuning.
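To make the stability intuition concrete, here is a toy sketch in plain Python. It is not the CCPS pipeline: the Gaussian noise, the hypothetical linear `readout`, and the names `stability_score`, `noise_scale`, and `n_trials` are all assumptions made for illustration. The actual procedure is in the linked repository.

```python
import random

def stability_score(hidden, readout, noise_scale=0.05, n_trials=200, seed=0):
    """Fraction of noisy copies of `hidden` whose predicted answer
    matches the prediction for the unperturbed vector."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible

    def predict(h):
        # Hypothetical linear readout: one score per answer option,
        # then pick the index of the highest score.
        scores = [sum(w * x for w, x in zip(row, h)) for row in readout]
        return max(range(len(scores)), key=scores.__getitem__)

    base = predict(hidden)
    same = 0
    for _ in range(n_trials):
        # Assumption: isotropic Gaussian noise stands in for the perturbation.
        noisy = [x + rng.gauss(0.0, noise_scale) for x in hidden]
        if predict(noisy) == base:
            same += 1
    return same / n_trials

# Two answer options over a 2-d stand-in for a hidden state.
readout = [[1.0, 0.0], [0.0, 1.0]]
confident = stability_score([1.0, 0.0], readout)   # far from the decision boundary
uncertain = stability_score([0.5, 0.49], readout)  # nearly tied between options
print(confident, uncertain)
```

In this toy setup the vector far from the decision boundary keeps its answer under noise, while the nearly tied one flips often, so its score is lower. A score like this would still need to be mapped to a calibrated probability, which this sketch does not attempt.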

Results, code, and dataset are available at:
- Code: https://github.com/ledengary/CCPS
- Data: https://huggingface.co/datasets/ledengary/CCPS

Happy to discuss technical details or calibration deployment strategies.
