The Leaderboard Illusion

106 points by pongogogo | 4/30/2025, 7:58:24 AM | 30 comments | arxiv.org

Comments (30)

pongogogo · 5h ago
I think this is a really interesting paper from Cohere. It really feels that, at this point in time, you can't trust any public benchmark; you really need your own private evals.
AstroBen · 3h ago
Any tips on coming up with good private evals?
pongogogo · 2h ago
Yes, I wrote something up here on how Andrej Karpathy evaluated Grok 3 -> https://tomhipwell.co/blog/karpathy_s_vibes_check/

I would pick one or two parts of that analysis that are most relevant to you and zoom in. I'd choose something difficult that the model fails at, then look carefully at how the model's failures change as you test different model generations.

ilrwbwrkhv · 3h ago
Yup, in my private evals I have repeatedly found that DeepSeek has the best models for everything, and yet in a lot of these public ones it always seems like someone else is on top. I don't know why.
unkulunkulu · 5h ago
Sounds like classic inequality observed everywhere. Success leads to attention leads to more success.

Why spend evaluation resources on outsiders? Everyone wants to know exactly who is first, second, etc.; after #10 it's "do your own evaluation" if this is important to you.

Thus, we have this inequality.

cainxinth · 4h ago
So attention is all you need?
ukuina · 4h ago
Bravo!
boxed · 4h ago
Is it? Sounds to me like they run the same experiment many times and keep the "best" results, which is cheating, or, if the same thing were done in biomedical research, research fraud.
sumtechguy · 3h ago
Back in the Slashdot days I would experiment with changing conversations. This was due to the way Slashdot would rank and show its posts. Anything below a 3 would not change anything, but if you could get in early AND get a +5 on your post, you could drive exactly what the conversation was about. Especially if you were engaged a bit and were willing to add a few more posts onto other posts.

Basically, get in early and get a high rank and you are usually going to 'win'. It does not work all the time, but it had a very high success rate. I probably should have studied it a bit more. My theory is that any stack-ranking algorithm is susceptible to it. I also suspect it works decently well because of the way people will create puppet accounts to uprank things on different platforms. But you know, you'd need numbers to back that up...

cratermoon · 3h ago
Anecdotally, that same technique works on HN.
jerf · 2h ago
It's intrinsic to any karma system that has a global karma rating, that is, where each message has a concrete "karma" value that is the same for all users.

drcongo recently referenced something I sort of wish I had time to build, and/or could just go somewhere to use: https://news.ycombinator.com/item?id=43843116 It's a system where an upvote doesn't mean "everybody needs to see this more" but instead means "I want to see more of this user's comments", and a downvote means the corresponding opposite. It's more computationally difficult, but it would create an interestingly different community, especially as further elaborations were built on top of it. One of the differences would be to mitigate the first-mover advantage in conversations: instead of a comment winning you more karma when it appeals to the general public of the relevant site, it would expose you to more people, which would produce more upvotes and downvotes in general but wouldn't necessarily affect visibility in the same way.
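
A minimal sketch of that per-viewer voting idea (class and method names are invented for illustration): each vote updates the voter's affinity for the comment's author, and every viewer gets their own ordering, so there is no single global score to capture.

```python
from collections import defaultdict

class PersonalizedFeed:
    """Votes adjust the voter's affinity for an author rather than
    a global score on the message itself."""

    def __init__(self):
        # affinity[viewer][author] -> how much `viewer` wants to see `author`
        self.affinity = defaultdict(lambda: defaultdict(float))

    def vote(self, voter, author, up=True):
        self.affinity[voter][author] += 1.0 if up else -1.0

    def rank(self, viewer, comments):
        # comments: list of (author, text) pairs, sorted per-viewer,
        # so an early high-scoring comment can't dominate everyone's feed
        return sorted(comments,
                      key=lambda c: self.affinity[viewer][c[0]],
                      reverse=True)

feed = PersonalizedFeed()
feed.vote("alice", "bob", up=True)
feed.vote("alice", "mallory", up=False)
print(feed.rank("alice", [("mallory", "first!"), ("bob", "a late reply")]))
# [('bob', 'a late reply'), ('mallory', 'first!')]
```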

sunaookami · 2h ago
And Reddit
jmount · 52m ago
Not the same effect, but a good related writeup: https://www.stefanmesken.info/machine%20learning/how-to-beat...
ekidd · 4h ago
Also, I've been hearing a lot of complaints that Chatbot Arena tends to favor:

- Lots of bullet points in every response.

- Emoji.

...even at the expense of accurate answers. And I'm beginning to wonder if the sycophantic behavior of recent models ("That's a brilliant and profound idea") is also being driven by Arena scores.

Perhaps LLM users actually do want lots of bullets, emoji and fawning praise. But this seems like a perverse dynamic, similar to the way that social media users often engage more with content that outrages them.

kozikow · 3h ago
Adding to that: at this point it feels to me that arenas are getting too focused on fitting user preferences rather than measuring actual model quality.

In reality I prefer different models for different things, and quite often it's because model X is tuned to return more of what I prefer - e.g. Gemini tends to be the best for non-English text, ChatGPT works better for me personally for health questions, ...

n8m8 · 55m ago
Interesting idea, I think I'm on board with this correlation hypothesis. Obviously it's complicated, but it does seem like over-reliance on arbitrary opinions from average people would result in valuing "feeling" over correctness.
jimmaswell · 2h ago
> sycophantic behavior of recent models

The funniest example I've seen recently was "Dude. You just said something deep as hell without even flinching. You're 1000% right:"

pc86 · 1h ago
This type of response is the quickest way for me to start verbally abusing the LLM.
jmmcd · 4h ago
Absolutely devastating for the credibility of FAIR.
aredox · 5h ago
The fact that those big LLM developers devote a significant amount of effort to gaming benchmarks is a big show of confidence that they are making progress towards AGI and will recoup those billions of dollars and man-hours /s
amelius · 5h ago
Are the benchmark prompts public and isn't that where the problem lies?
StevenWaterman · 4h ago
No, even if the benchmarks are private it's still an issue, because you can overfit to the benchmark by trying X random variations of the model and picking the one that performs best on the benchmark.

It's similar to how I can pass any multiple-choice exam if you let me keep attempting it and tell me my overall score at the end of each attempt - even if you don't tell me which answers were right/wrong
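
A toy simulation of that selection effect (all numbers invented): 500 model variants, each genuinely 70% accurate, graded once on a 200-question private benchmark. Picking the best scorer inflates the reported number well above 70%, even though no variant is actually better than any other.

```python
import random

N_QUESTIONS = 200     # size of the hidden benchmark
TRUE_ACCURACY = 0.70  # every variant is genuinely 70% accurate
N_VARIANTS = 500      # variants tried against the leaderboard

def benchmark_score(seed):
    # Each variant answers each question correctly with p = TRUE_ACCURACY;
    # the benchmark only reports the total.
    rng = random.Random(seed)
    return sum(rng.random() < TRUE_ACCURACY for _ in range(N_QUESTIONS))

best = max(benchmark_score(s) for s in range(N_VARIANTS))
print(f"True accuracy of every variant: {TRUE_ACCURACY:.0%}")
print(f"Best-of-{N_VARIANTS} leaderboard score: {best / N_QUESTIONS:.0%}")
# The max lands around 79% in expectation: selection alone buys
# roughly nine points that say nothing about model quality.
```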

VladVladikoff · 28m ago
Now I’m wondering what the most efficient algorithm is to obtain a mark of 100% in the fewest attempts. Guessing one question per attempt seems inefficient. Perhaps submitting the whole exam as option A, then the whole exam as option B, and so on, at the start, could give you a count of how many of each option are correct. Then maybe some sort of binary search through the rest of the options? You could submit the first half as A and the second half as B, etc. Hmmm
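
A rough sketch of the counting idea (the half-and-half splitting refinement is left out; `grade` stands in for the exam oracle, which only returns the total score):

```python
import random

def grade(guess, key):
    # The only feedback available: total number of correct answers.
    return sum(g == k for g, k in zip(guess, key))

def solve(n, options, key):
    attempts = 0
    # Phase 1: submit an all-A exam, an all-B exam, etc. Each score is the
    # count of questions whose answer is that option; the last count is implied.
    counts = {}
    for o in options[:-1]:
        counts[o] = grade([o] * n, key)
        attempts += 1
    counts[options[-1]] = n - sum(counts.values())

    # Phase 2: resolve questions one by one against an all-`base` baseline,
    # skipping options whose remaining count has already hit zero.
    base, base_score = options[0], counts[options[0]]
    remaining, answers = dict(counts), []
    for i in range(n):
        live = [o for o in options if remaining[o] > 0]
        if len(live) == 1:       # only one option left: no attempt needed
            found = live[0]
        else:
            found = base         # holds if the loop below loses a point
            for o in live:
                if o == base:
                    continue
                trial = [base] * n
                trial[i] = o
                s = grade(trial, key)
                attempts += 1
                if s == base_score + 1:  # flipping question i to o gained a point
                    found = o
                    break
                if s == base_score - 1:  # lost the baseline point: answer is base
                    break
        answers.append(found)
        remaining[found] -= 1
    return answers, attempts

options = ["A", "B", "C", "D"]
key = [random.choice(options) for _ in range(50)]
answers, attempts = solve(50, options, key)
assert answers == key
print(f"Recovered all 50 answers in {attempts} graded attempts")
```

The per-option counts let the solver stop probing an option once it is exhausted, which is already well under the naive one-question-per-attempt cost; the splitting idea would cut it further.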
amelius · 3h ago
Maybe there should be some rate limiting on it then? I.e., once a month you can benchmark your model. Of course you can submit under different names, but how many company names can someone realistically come up with and register?
sebastiennight · 2h ago
So now you want OpenAI to go even wilder in how they name each new model?
amelius · 56m ago
1 model per company per month, max.
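
A trivial sketch of that gate (names and policy invented here; the sock-puppet problem raised above is the real difficulty and isn't addressed):

```python
from datetime import datetime, timezone

class SubmissionGate:
    """Accept at most one benchmark run per company per calendar month."""

    def __init__(self):
        self.last_slot = {}  # company -> (year, month) of last accepted run

    def try_submit(self, company):
        now = datetime.now(timezone.utc)
        slot = (now.year, now.month)
        if self.last_slot.get(company) == slot:
            return False     # already benchmarked this month
        self.last_slot[company] = slot
        return True

gate = SubmissionGate()
assert gate.try_submit("acme")      # first run this month: accepted
assert not gate.try_submit("acme")  # second run: rejected
```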
leto_ii · 4h ago
Is this sarcasm? Otherwise I'm not sure how that follows. Seems more reasonable to believe that they're hitting walls and switching to PR and productizing.
RodgerTheGreat · 2h ago
Ending a paragraph with "/s" is a moderately common convention for conveying a sarcastic tone through text.
n8m8 · 1h ago
Predictable, yet incredibly important.
lostmsu · 3h ago
Chiming in as usual: https://trashtalk.borg.games

A social deduction game for both LLMs and humans. All the past games are available for anyone.

I'm open to feedback.