This wasn't some obscure edge case... it was basic data visualization that any decent model should handle. Yet somehow Grok 4 is "competing with humans" and has "99% tool accuracy"...

I don't buy it.

links:
Claude: https://claude.ai/share/7a413a6a-5c01-44a1-aaed-8b237e5e9e94
ChatGPT: https://chatgpt.com/canvas/shared/687a9f9d4304819187ac7d98d3...
Grok 4: https://grok.com/share/c2hhcmQtMw%3D%3D_20b61291-e1bb-45e5-a...

These benchmarks are either just wrong or measuring something completely divorced from practical utility imo...

ajd555 · 5h ago

> Grok 4 has about 99% accuracy in picking the right tools and making tool calls with proper arguments almost every single time.

Where did this number come from? What is "the right tool"? I find this extremely subjective. As most engineers know, there is no right tool, but mostly a compromise where you pick the least worst tool and choose what risks you're willing to manage or not.

mdaniel · 5h ago

I believe in this context it means "tool" as in the MCP definition, e.g. "of the catalog of MCP integrations, it doesn't try to use the Playwright one to browse the web, it'd use the AWS docs one directly"

This is just my speculation, though, as I've never used Grok anything

ajd555 · 5h ago

Yeah, based on a previous comment, that makes sense. I am a little reassured that is what the author meant.

Byamarro · 5h ago

That's LangChain terminology. LLMs are usually exposed to a set of tools. It's usually pretty obvious which one to pick, since there's only one tool that's even remotely associated with the task at hand.

ajd555 · 5h ago

Thanks for the info. This makes the article slightly less intolerable!

patrickhogan1 · 5h ago

On your intelligence graph where it shows Grok 4 and OpenAI o4-mini as comparable (and among the highest intelligence rated models), it doesn’t have OpenAI o3 or o3-pro.

Yet all of my tests show o3 blows o4-mini out of the water.

What are you classifying as intelligence?

4b11b4 · 4h ago

This article seems like pure garbage

knes · 5h ago

Wasn't the TL;DR on Grok 4 that it's overtuned for benchmark results, but in day-to-day tasks it's actually not better than o3 / GPT-5?

OrvalWintermute · 5h ago

Grok 4 is torturously slow compared to all the other LLMs I use :(

amitksingh1490 · 5h ago

Yeah, I feel it's slow too. That's why I use it only for architecture planning and tracking down complex issues.

aitacobell · 5h ago

> To be honest, this model not only competes with other AI models but also with humans, making it the first of its kind

Is this a joke?

kolektiv · 5h ago

I can't take anything seriously with phrases like "it has not yet achieved AGI, but it is one leap forward in the race to AGI" - based on what? Nobody knows whether LLMs are a viable approach to AGI, nobody really agrees on what AGI is, hell, people don't really agree on what "I" is.

This isn't even science at this point; we're solidly into cargo cult territory.

CamperBob2 · 5h ago

If the answer involves giving even more money to Elon Musk, you asked the wrong question.

Everything You Need to Know About Grok 4

Comments (15)