Show HN: Improving search ranking with chess Elo scores
I'm Ghita, co-founder of ZeroEntropy (YC W25). We build high-accuracy search infrastructure for RAG and AI agents.
We just released two new state-of-the-art rerankers, zerank-1 and zerank-1-small. One of them is fully open-source under Apache 2.0.
We trained these models using a novel Elo-inspired pipeline, which we describe in detail in the attached blog post. In a nutshell, here is an outline of the training steps:

* Collect soft preferences between pairs of documents using an ensemble of LLMs.
* Fit an Elo-style rating system (Bradley-Terry) to turn the pairwise comparisons into absolute per-document scores.
* Normalize relevance scores across queries using a bias-correction step, modeled with cross-query comparisons and solved with MLE.
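To make the Bradley-Terry step concrete, here is a minimal sketch (illustrative only, not our production pipeline; the toy `pairs` data and the scipy-based optimizer are assumptions) of recovering per-document scores from soft pairwise preferences via maximum likelihood:

```python
# Illustrative sketch of the Bradley-Terry step: turning soft pairwise
# preferences into per-document scores via maximum likelihood.
# The data shape and optimizer choice here are assumptions, not our actual code.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # numerically stable sigmoid

# (doc_i, doc_j, p) where p in [0, 1] is the ensemble's soft preference
# that doc_i is more relevant than doc_j for the same query.
pairs = [(0, 1, 0.9), (1, 2, 0.7), (0, 2, 0.95), (2, 1, 0.4)]
n_docs = 3

def neg_log_likelihood(s):
    nll = 0.0
    for i, j, p in pairs:
        # Bradley-Terry: P(doc_i beats doc_j) = sigmoid(s_i - s_j)
        q = expit(s[i] - s[j])
        # Soft targets: cross-entropy against the observed preference p
        nll -= p * np.log(q + 1e-12) + (1 - p) * np.log(1 - q + 1e-12)
    return nll

result = minimize(neg_log_likelihood, x0=np.zeros(n_docs), method="L-BFGS-B")
scores = result.x - result.x.mean()  # scores are identifiable only up to a shift
print(scores)  # higher score = more relevant according to the ensemble
```

Because the fitted scores are only identifiable up to a per-query shift, the cross-query normalization step above is what makes them comparable across queries.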
You can try the models either through our API (https://docs.zeroentropy.dev/models), or via HuggingFace (https://huggingface.co/zeroentropy/zerank-1-small).
We would love this community's feedback on the models and the training approach. A full technical report will also be released soon.
Thank you!
https://github.com/pfmonville/whole_history_rating
For each document, there is a secret hidden score "s", the "fundamental relevance according to the LLM". Then, when we sample a judgement for (q, d1, d2) from the LLM, the LLM's output follows these statistical properties:
- The "fundamental hidden preference" is `pref = s_{d1} - s_{d2}`, usually ranging between -4 and 4.
- The LLM samples from a normal distribution around `pref` with stddev ~0.2, which is some "inner noise" that the LLM experiences before coming to a judgement.
- The noisy preference passes through a sigmoid to get a `sampled_score` \in [0, 1].
- There is an additional 2% uniform noise, i.e. `0.98 * sampled_score + 0.02 * random.random()`.
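Here is a small simulation sketch of that generative model (illustrative only; the hidden scores and helper names are made up, the constants are the ones above):

```python
# Illustrative simulation of the sampling model described above (not the
# actual pipeline code): hidden per-document scores, Gaussian "inner noise",
# a sigmoid squash, and 2% uniform noise on top.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_preference(s_d1, s_d2, inner_noise=0.2, mix=0.02):
    # Fundamental hidden preference, usually in roughly [-4, 4]
    pref = s_d1 - s_d2
    # "Inner noise" the LLM experiences before committing to a judgement
    noisy_pref = pref + rng.normal(0.0, inner_noise)
    # Squash to [0, 1]
    sampled_score = sigmoid(noisy_pref)
    # Additional 2% uniform noise
    return (1 - mix) * sampled_score + mix * rng.random()

# Example: d1 is moderately preferred over d2 (hidden scores are made up)
hidden_scores = {"d1": 1.5, "d2": 0.3}
draws = [sample_preference(hidden_scores["d1"], hidden_scores["d2"]) for _ in range(5)]
print(draws)
```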
When we use Maximum Likelihood Estimation to find the most likely hidden scores \hat{s} for each document, and then sample pairwise matrices according to `0.98 * sigmoid( \hat{s}_{d1} - \hat{s}_{d2} + N(0, 0.02) ) + 0.02 * Uniform(0, 1)`, we get pairwise matrices with virtually identical statistical properties to the observed ones.
It's not cheap and it's not fast, but it definitely works pretty well!
Kind of a lot of work compared to just dumping the text of 2 profiles into a context window along with a vague description of what I want, and having the LLM make the binary judgment.
It was actually done to counter Elo-based approaches, so there are some references in the readme on how to prove who's better. I haven't run this code in 5 years and haven't developed on it in maybe 6, but I can probably fix any issues that come up. My co-author's version looks to have diverged a bit; I haven't checked out his code: https://github.com/FrankWSamuelson/merge-sort . There may also be a fork by the FDA itself, I'm not sure. This work was done for the FDA's medical imaging device evaluation division.
Elo is a person's name (Arpad Elo), not an acronym. So don't say it "E.L.O." (unless you're talking about the band, I guess), say "ee-low".
My questions: what languages do your models currently support? Did you perform multilingual benchmarks? Couldn't find an answer on the website
For a slightly different take using a similar intuition, see our ACL 2024 paper on ranking LLMs (https://arxiv.org/abs/2402.14860), which may be of interest.
Our HuggingFace space has some examples: https://huggingface.co/spaces/ibm/llm-rank-themselves
I like that it works with `sentence_transformers`
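For anyone else wanting to try it, something like the following should work (a minimal sketch assuming the model loads as a standard `sentence_transformers` CrossEncoder; check the HuggingFace model card for the authoritative usage):

```python
# Minimal sketch: scoring query/document pairs with zerank-1-small via
# sentence_transformers. Assumes CrossEncoder compatibility; see the model
# card on HuggingFace for the exact recommended usage.
from sentence_transformers import CrossEncoder

model = CrossEncoder("zeroentropy/zerank-1-small")

query = "What is the capital of France?"
documents = [
    "Paris is the capital and most populous city of France.",
    "The Eiffel Tower was completed in 1889.",
]

# Higher score = more relevant to the query
scores = model.predict([(query, doc) for doc in documents])
for doc, score in sorted(zip(documents, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {doc}")
```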
Edit: ok, done. Submitted title was "Show HN: Improving RAG with chess Elo scores".