We Built an AI Data Team with Pydantic AI (pydantic.dev)

Cool project! I have a couple of questions that would be nice to see answered in the writeup:
* How did you generate your example problems? Did you take an existing benchmark, or did you have LLMs generate the problems?
* Have you given any thought to adding a second "base programming language" to alter? I'm not sure there's enough variation as it stands. (Another thought would be to generate 4 or 5 new languages, each quite different, and then run the benchmark on each of them. I'm not sure how much it matters that the language is randomly generated each time.)

But overall, a clever idea!

chromaton · 13h ago

Generating the problems: I just thought up a few simple things that the computer might be able to do. In the future, I hope to expand to more complex problems, based upon common business situations: reading CSVs, parsing data, etc. I'll probably add new tests once I get multi-shot and reliability working correctly.

New base programming languages would be great, but what would be even better is some sort of meta-language where many features can be turned on or off, rather than just scrambling the keywords like I do now.

I did some vibe testing with a current frontier model, and it gets quite confused and keeps insisting that there's a control structure that definitely doesn't exist in the TiānshūBench language with seed=1.
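
For anyone unfamiliar with the idea, seeded keyword scrambling can be sketched roughly like this. This is only an illustration of the general technique, not the actual TiānshūBench code: the keyword list, `make_nonsense_word`, `scramble_keywords`, and `translate` are all hypothetical names made up for the example.

```python
import random
import re

# Hypothetical keyword list; the real base language's keywords may differ.
BASE_KEYWORDS = ["if", "else", "while", "for", "func", "return"]


def make_nonsense_word(rng, length=6):
    """Build a nonsense word by alternating consonants and vowels."""
    consonants = "bdfgklmnprstvz"
    vowels = "aeiou"
    return "".join(
        rng.choice(consonants if i % 2 == 0 else vowels) for i in range(length)
    )


def scramble_keywords(seed, keywords=BASE_KEYWORDS):
    """Return a keyword -> nonsense-word mapping that is fixed for a given seed."""
    rng = random.Random(seed)  # private RNG, so the global random state is untouched
    mapping = {}
    used = set()
    for kw in keywords:
        word = make_nonsense_word(rng)
        # Retry on collisions with earlier replacements or real keywords.
        while word in used or word in keywords:
            word = make_nonsense_word(rng)
        used.add(word)
        mapping[kw] = word
    return mapping


def translate(source, mapping):
    """Replace whole-word keywords in source with their scrambled forms.

    Simplification: this also rewrites keywords inside string literals
    and comments, which a real implementation would want to avoid.
    """
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], source)
```

Calling `scramble_keywords(1)` twice gives the same mapping, which is what makes a run with a given seed reproducible. A feature-toggle meta-language would need to change the grammar itself, not just rename tokens like this does.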

JSR_FDED · 1d ago

Would it be useful to generate Procedural, OOP and Functional variations of the problems?

chromaton · 13h ago

Yes, it would be fantastic to have more languages to test against. I picked Mamba as the base language because it was easy to modify and integrate into Python.

Introducing TiānshūBench (天书Bench)

Comments (4)