Activeloop (YC S18) Is Hiring Member of Technical Staff – Back End Engineering (careers.activeloop.ai)

So I vibe-code a lot these days and recently i decided to give the same prompt to several llms, then get their codes and later give each code to every single one of them to ask which one they think is the most useful without telling them that they or the other 2 llms wrote it. The overall consensus is: gpt5. True I only compared gpt5 vs claude 4.1 vs qwen 230bn. OSS 120b, gemini and grok 4 were excluded since well i don't have the time. And obvious failures like amazon nova or anything from meta weren't even planned. Deepseek (both) seem a bit underperforming . Personally I'd say it's a close call between claude opus 4.1 vs both gpt4 and gpt5 (ironically gpt5 sometimes performs worse than 4, i think this has been addressed by many people already). That's just my personal experience, i know HumanEwal or SWE or whatever give various performance but idk, Musk used the benchmarks as "proof" to hype Grok and in my experience grok 4 is between LLAMA4 and obviously behind gpt4 or some variations of qwen.

Again this is coding only: Python and C. For physics, chemistry, scifi novels or whatever the case may be very different. Another kudos to OSS 120bn btw: it's very generous on tokens...like it will write a small programming book if it takes to in one reply, unless of course you tell it to be more limited, this is a huge plus for me since the code I demand should be complex and not some 20 lines nova "pro" joke.

Comments (5)

zahlman · 3h ago

> recently i decided to give the same prompt to several llms, then get their codes and later give each code to every single one of them to ask which one they think is the most useful without telling them that they or the other 2 llms wrote it.

The fact that you expect the result of this experiment to be useful, is more interesting than the actual result.

adinhitlore · 3h ago

vibe-coding is the future, drop conservatism....'free palestine' i mean you get the idea: be progressive and open minded.

pavel_lishin · 3h ago

Those seem like completely orthogonal concepts.

incomingpain · 3h ago

all ive done with gpt5 for coding was a major db refactor. i had run out of gemini limit for the day.

certainly got the job done. I doubt my gpt 20b or ~30b local llm would have been as capable. Overall it was about ~2000 lines of code to change, probably only 100,000 context.

gpt5 didnt one shot it. there were many steps inbetween. At the end, few hours, i had >50 linter warnings from tripled imports, loads of dead code that wouldnt be touched and for some reason gpt5 just couldnt fix any of this. Ended up increasing the warnings and added an error. My expectation is that any of the big guys could immediately fix it. Even restarted fresh context and gpt just wasnt having any of it. im certain even gpt 20b would have completed it in a minute. Curious.

I went to gemini flash, very generic prompt about linter warnings and it fixed it in 30 seconds.

Just kind of weirdness that benchmarks will never be able to catch. It's also going to be very dependent. A rust programmer might have a favourite, whereas python programmer benefits from another model. There can never be a best.

adinhitlore · 3h ago

I had similar experience, usually I'd ignore Gemini be it flash or pro but on several occasions it fixed complex errors like it's nothing. Yet when it comes to codegen it is "cheap" on tokens and struggles outputting complex logic. As a great bonus: their easy to setup API is freemium but a generous freemium (google AI studio I mean). My "ecosystem" atm will be something like: gpt5, claude 4.1 - if they both fail: try to fix with gemini. I'd skip Grok for privacy issues mostly not that I completely ignore its capabilities, qwen is good but sometimes 'overengineered' i don't need 400bn , given the large params maybe it will work for non-coding like if you ask it some exotic questions about science: casimir effect, acoustic levitation, ununennium etc etc you name it.

SigNoz (YC W21, Open Source Datadog) Is Hiring Platform Engineers (Remote) (jobs.ashbyhq.com)

Motion (YC W20) Is Hiring Principal Software Engineers (jobs.ashbyhq.com)

Bild AI (YC W25) Is Hiring an Applied AI Engineer (workatastartup.com)

Text.ai (YC X25) Is Hiring Founding Full-Stack Engineer (ycombinator.com)

Cua (YC X25) is hiring design engineers in SF (ycombinator.com)

Activeloop (YC S18) Is Hiring Member of Technical Staff – Back End Engineering (careers.activeloop.ai)

Coris (YC S22) Is Hiring (ycombinator.com)

14.ai (YC W24) is hiring engineers in SF to build an AI-native Zendesk (14.ai)

Spice Data (YC S19) Is Hiring a Product Associate (New Grad) (ycombinator.com)

Ashby (YC W19) Is Hiring Design Engineers in AMER and EMEA (ashbyhq.com)

EasyPost (YC S13) Is Hiring (easypost.com)

Tesorio (YC S15) Is Hiring a Senior GenAI Engineer (100% Remote) (tesorio.com)

OneSignal (YC S11) Is Hiring Engineers (onesignal.com)

Axle (YC S22) is hiring product engineers (ycombinator.com)

Mbodi AI (YC X25) Is Hiring a Founding Research Engineer (Robotics) (ycombinator.com)

ReadMe (YC W15) Is Hiring a Developer Experience PM (readme.com)

Weave (YC W25) is hiring a founding AI engineer (ycombinator.com)

Depot (YC W23) Is Hiring a Community and Events Manager (Remote) (ycombinator.com)

CoLoop (YC S21) Is Hiring AI Engineers in London

Trellis (YC W24) Is Hiring: Automate Prior Auth in Healthcare (ycombinator.com)

Type (YC W23) is hiring a founding engineer to build an AI-native doc editor (ycombinator.com)

Foundry (YC F24) is hiring staff-level product engineers (ycombinator.com)

GoGoGrandparent (YC S16) Is Hiring Back End and Full-Stack Engineers

Kyber (YC W23) is hiring enterprise account executives (ycombinator.com)

Converge (YC S23) well-capitalized New York startup seeks product developers (runconverge.com)

Great Question (YC W21) Is Hiring a VP of Engineering (Remote) (ycombinator.com)

Coverage Cat (YC S22) Is Hiring a Senior, Staff, or Principal Engineer (coveragecat.com)

Kaizen (YC X25) is hiring engineers to build browser agents that work (kaizenautomation.com)

Infracost (YC W21) hiring first PM to shift $600B cloud spend to proactive (ycombinator.com)

Sei (YC W22) Is Hiring a Full Stack Engineer in Chennai, India (ycombinator.com)

Artie (YC S23) Is Hiring Founding AEs (ycombinator.com)

GPT5 is the best coding LLM because other LLMs admit it?

Comments (5)