It's pretty good for summaries etc., and can even make simple index.html sites if you're teaching students, but it can't really vibecode in my opinion. However, for local automation tasks like summarizing your emails, home automation, or whatever, it is excellent.

It's crazy that we're at this point now.

film42 · 1h ago

Is there a crowd-sourced sentiment score for models? I know all these scores are juiced like crazy. I stopped taking them at face value months ago. What I want to know is if other folks out there actually use them or if they are unreliable.

hnfong · 44m ago

Besides the LM Arena Leaderboard mentioned by a sibling comment, if you go to the r/LocalLlama/ subreddit, you can very unscientifically get a rough sentiment of the performance of the models by reading the comments (and maybe even checking the upvotes). I think the crowd's knee-jerk reaction is unreliable though, but that's what you asked for.

nurettin · 1h ago

This has been around for a while: https://lmarena.ai/leaderboard/text/coding

klohto · 1h ago

openrouter usage stats

setsewerd · 18m ago

Since the ranking is based on token usage, wouldn't this ranking be skewed by the fact that small models' APIs are often used for consumer products, especially free ones? Meanwhile reasoning models skew it in the opposite direction, but to what extent I don't know.

It's an interesting proxy, but idk how reliable it'd be.

esafak · 1h ago

https://openrouter.ai/rankings

The new qwen3 model is not out yet.

svnt · 9m ago

It is interesting to think about how they are achieving these scores. The evals are rated by GPT-4.1. Beyond just overfitting to benchmarks, is it possible the models are internalizing how to manipulate the ratings model/agent? Is anyone manually auditing these performance tables?

esafak · 2h ago

This one should work on personal computers! I'm thankful for Chinese companies raising the floor.

frontsideair · 2h ago

According to the benchmarks, this one improves on the previous version in every one of them, and in some it beats 30B-A3B. Definitely worth a try, it’ll easily fit into memory and token generation speed will be pleasantly fast.

GaggiX · 1h ago

There is a new Qwen3-30B-A3B, you are comparing it to the old one. https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507

gok · 2h ago

So this 4B dense model gets very similar performance to the 30B MoE variant with 7.5x smaller footprint.
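For what it's worth, the 7.5x figure is just the ratio of total parameter counts. A minimal sketch of that arithmetic, assuming the nominal counts from the model names (4B and 30B total, roughly 3B active for the "A3B" MoE) and counting weights only; KV cache, activations, and runtime overhead are ignored, and the bit widths are illustrative choices, not anything the model cards prescribe:

```python
# Rough weight-memory estimate: parameters * bits per weight / 8 bytes.
# Assumption: nominal parameter counts taken from the model names,
# not the exact counts from the model cards.

def weight_gib(params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GiB."""
    return params * bits_per_weight / 8 / 2**30

DENSE_4B = 4e9          # dense: every parameter is used for every token
MOE_30B_TOTAL = 30e9    # MoE: all experts still have to sit in memory
MOE_30B_ACTIVE = 3e9    # "A3B": roughly 3B parameters active per token

# Memory footprint scales with total parameters, hence the 7.5x.
footprint_ratio = MOE_30B_TOTAL / DENSE_4B

# Per-token compute scales (roughly) with active parameters instead.
active_ratio = MOE_30B_ACTIVE / DENSE_4B

if __name__ == "__main__":
    print(f"footprint ratio (MoE total / dense): {footprint_ratio}")
    print(f"active-parameter ratio (MoE / dense): {active_ratio}")
    for bits in (16, 8, 4):  # illustrative precisions
        print(
            f"{bits:>2}-bit weights: "
            f"4B = {weight_gib(DENSE_4B, bits):.1f} GiB, "
            f"30B = {weight_gib(MOE_30B_TOTAL, bits):.1f} GiB"
        )
```

So the saving is in memory, not necessarily in speed: under these assumptions the MoE touches fewer parameters per token than the dense 4B does.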

smallerize · 1h ago

It gets similar performance to the old version of the 30B MoE model, but not the updated version. https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507

Imustaskforhelp · 7m ago

I still think that it's very commendable though.

I am running this beast on my dumb pc with no gpu, now we are talking!

jampa · 1h ago

Am I reading this right, is this model way better than Gemma 3n[1]? (Only for the benchmarks that are common among the models)

===== LiveCodeBench

E4B IT: 13.2

Qwen: 55.2

===== AIME25

E4B IT: 11.6

Qwen: 81.3

[1]: https://huggingface.co/google/gemma-3n-E4B

meatmanek · 6m ago

Reasoning models do a lot better at AIME than non-reasoning models, with o3 mini getting 85% and 4o-mini getting 11%. It makes some sense that this would apply to small models as well.

tolerance · 2h ago

Is there like a leaderboard or power rankings sort of thing that tracks these small open models and assigns ratings or grades to them based on particular use cases?

esafak · 2h ago

https://artificialanalysis.ai/leaderboards/models?open_weigh...

cowpig · 1h ago

Compare these rankings to actual usage: https://openrouter.ai/rankings

Claude is not cheap, so why is it far and away the most popular if it's not top 10 in performance?

Qwen3 235b ranks highest on these benchmarks among open models, but I have never met someone who prefers its output over Deepseek R1. It's extremely wordy and often gets caught in thought loops.

My interpretation is that the models at the top of ArtificialAnalysis are focusing the most on public benchmarks in their training. Note I am not saying xAI is necessarily doing this nefariously; it could just be that they decided it's better bang for the buck to rely on public benchmarks than to try to focus on building their own evaluation systems.

But Grok is not very good compared to the Anthropic, OpenAI, or Google models despite ranking so highly in benchmarks.

threeducks · 12m ago

OpenRouter rankings conflate many factors like price, popularity, output quality and legal concerns. They cannot tell us whether a model is popular because it is free, because many people have heard about it, because it is genuinely good, or because the lawyers trust the provider.

byefruit · 1h ago

The openrouter rankings can be biased.

For example, Google's inexplicable design decisions around libraries and APIs mean it's often worth the 5% premium to just use OpenRouter to access their models. In other cases it's about which models particular agents default to.

Sonnet 4 is extremely good for agentic setups with tool use though, something I have found other models struggle to do over a long context.

ImageXav · 1h ago

Thanks for sharing that. Interesting that the leaderboard is dominated by Anthropic, Google and DeepSeek. OpenAI doesn't even register.

reilly3000 · 32m ago

OpenAI has a lot of share that simply doesn’t exist via OpenRouter. Typical enterprise chat bot apps use it directly without paying a tax and may use litellm with another vendor for fallback.

esafak · 56m ago

I shared a link to small, open source models; Claude is neither.

GaggiX · 1h ago

Claude Opus is in the top 10. Also, people on OpenRouter mostly use these models for coding, and Claude models are particularly good at this; the benchmark doesn't only account for coding capabilities tho.

whimsicalism · 48m ago

grok is not bad, i think 4 is better than claude for most things other than tool calling.

of course, this is a politically charged subject now so fair assessments might be hard to come by - as evidenced by the downvotes i've already gotten on this comment