The Illusion of Thinking: Strengths and limitations of reasoning models [pdf] (ml-site.cdn-apple.com)

Hey HN! We’re Vaibhav and Marcello. We’re building Plexe (https://github.com/plexe-ai/plexe), an open-source agent that turns natural language task descriptions into trained ML models. Here’s a video walkthrough: https://www.youtube.com/watch?v=bUwCSglhcXY.

There are all kinds of uses for ML models that never get realized because the process of making them is messy and convoluted. You can spend months trying to find the data, clean it, experiment with models and deploy to production, only to find out that your project has been binned for taking so long. There are many tools for “automating” ML, but it still takes teams of ML experts to actually productionize something of value. And we can’t keep throwing LLMs at every ML problem. Why use a generic 10B parameter language model, if a logistic regression trained on your data could do the job better?

Our light-bulb moment was that we could use LLMs to generate task-specific ML models that would be trained on one’s own data. Thanks to the emergent reasoning ability of LLMs, it is now possible to create an agentic system that might automate most of the ML lifecycle.

A couple of months ago, we started developing a Python library that would let you define ML models on structured data using a description of the expected behaviour. Our initial implementation arranged potential solutions into a graph, using LLMs to write plans, implement them as code, and run the resulting training script. Using simple search algorithms, the system traversed the solution space to identify and package the best model.

However, we ran into several limitations, as the algorithm proved brittle under edge cases, and we kept having to put patches for every minor issue in the training process. We decided to rethink the approach, throw everything out, and rebuild the tool using an agentic approach prioritising generality and flexibility. What started as a single ML engineering agent turned into an agentic ML "team", with all experiments tracked and logged using MLFlow.

Our current implementation uses the smolagents library to define an agent hierarchy. We mapped the functionality of our previous implementation to a set of specialized agents, such as an “ML scientist” that proposes solution plans, and so on. Each agent has specialized tools, instructions, and prompt templates. To facilitate cross-agent communication, we implemented a shared memory that enables objects (datasets, code snippets, etc) to be passed across agents indirectly by referencing keys in a registry. You can find a detailed write-up on how it works here: https://github.com/plexe-ai/plexe/blob/main/docs/architectur...

Plexe’s early release is focused on predictive problems over structured data, and can be used to build models such as forecasting player injury risk in high-intensity sports, product recommendations for an e-commerce marketplace, or predicting technical indicators for algorithmic trading. Here are some examples to get you started: https://github.com/plexe-ai/plexe/tree/main/examples

To get it working on your data, you can dump any CSV, parquet, etc and Plexe uses what it needs from your dataset to figure out what features it should use. In the open-source tool, it only supports adding files right now but in our platform version, we'll have support for integrating with Postgres where it pulls all available data based on an SQL query and dumps it into a parquet file for the agent to build models.

Next up, we’ll be tackling more of the ML project lifecycle: we’re currently working on adding a “feature engineering agent” that focuses on the complex data transformations that are often required for data to be ready for model training. If you're interested, check Plexe out and let us know your thoughts!

Comments (49)

Stiopa · 33d ago

Awesome work.

Only watched demo, but judging from the fact there are several agent-decided steps in the whole model generation process, I think it'd be useful for Plexe to ask the user in-between if they're happy with the plan for the next steps, so it's more interactive and not just a single, large one-shot.

E.g. telling the user what features the model plans to use, and the user being able to request any changes before that step is executed.

Also wanted to ask how you plan to scale to more advanced (case-specific) models? I see this as a quick and easy way to get the more trivial models working especially for less ML-experienced people, but am curious what would change for more complicated models or demanding users?

impresburger · 33d ago

Agree. We've designed a mechanism to enable any of the agents to ask for input from the user, but we haven't implemented it yet. Especially for more complex use cases, or use cases where the datasets are large and training runs are long, being able to interrupt (or guide) the agents' work would really help avoid "wasted" one-shot runs.

Regarding more complicated models and demanding users, I think we'd need:

1. More visibility into the training runs; log more metrics to MLFlow, visualise the state of the multi-agent system so the user knows "who is doing what", etc. 2. Give the user more control over the process, both before the building starts and during. Let the user override decisions made by the agents. This will require the mechanism I mentioned for letting both the user and the agents send each other messages during the build process. 3. Run model experiments in parallel. Currently the whole thing is "single thread", but with better parallelism (and potentially launching the training jobs on a separate Ray cluster, which we've started working on) you could throw more compute at the problem.

I'm sure there are many more things that would help here, but these are the first that come to mind off the top of my head.

What are your thoughts? Anything in particular that you think a demanding user would want/need?

Stiopa · 30d ago

Those sound like great next steps! Even without parallel runs, just by having the ability to tweak things during model builds would be super valuable; sort of like how many of the AI IDEs today (Cursor) make changes in increments and, crucially, always ask you whether to proceed or not (and how). At least that's what comes first to mind!

vaibhavdubey97 · 29d ago

Absolutely! We started off thinking that we wanted to automate the whole way in one go and then add restrictions and interruptions based on areas where users face issues. This is great feedback so thank you!

thefourthchime · 33d ago

This is a really interesting idea! I'll be honest, it took me a minute to really get what it was doing. The GitHub page video doesn't play with any audio, so it's not clear what's happening.

Once I watched the video, I think I have a better understanding. One thing I would like to see is more of a breakdown of how this solves a problem that just a big model itself wouldn't.

vaibhavdubey97 · 33d ago

Thank you!

Yeah we rushed to create a "Plexe in action" video for our Readme. We'll put a link to the YouTube video on the Readme so it's easier.

Using large generative models enables fast prototyping, but runs into several issues: generic LLMs have high latency and cost, and fine-tuning/distilling doesn’t address the fundamental size issue. Given these pain points, we realized the solution isn’t bigger generic models (fine-tuned or not), but rather automating the creation, deployment, and management of lightweight models built on domain-specific data. An LLM can detect if an email is malicious, but a classifier built specifically for detecting malicious emails is orders of magnitude smaller and more efficient. Plus, it's easier to retrain with more data.

Oras · 33d ago

I like the idea of trying multiple solutions.

Does it decide based on data if it should make its own ML model or fine-tune a relevant one?

Also, does it detect issues with the training data? When I was doing NLP ML models before LLMs, the tasks that took all my time were related to data cleaning, not the training or choosing the right approach.

impresburger · 33d ago

Currently it decides whether to make its own model or fine-tune a relevant one based primarily on the problem description. The agent's ability to analyse the data when making decisions is pretty limited right now, and something we're currently working on (i.e. let the agent look at the data whenever relevant, etc).

I guess that kind of answers your second question, too: it does not currently detect issues with the training data. But it will after the next few pull requests we have lined up!

And yes, completely agree about data cleaning vs. model building. We started from model building as that's the "easier" problem, but our aim is to add more agents to the system to also handle reviewing the data, reasoning about it, creating feature engineering jobs, etc.

dweinus · 33d ago

I don't want to hate, what you built is really cool and should save time in a data scientist's workflow, but... we did this. It won't "automate most of the ML lifecycle." Back in ~2018 "autoML" was all the rage. It failed because creating boilerplate and training models are not the hard parts of ML. The hard parts are evaluating data quality, seeking out new data, designing features, making appropriate choices to prevent leakage, designing evaluation appropriate to the business problem, and knowing how this will all interact with the model design choices.

impresburger · 33d ago

Hey, one of the authors here! I completely agree with your comment. Training ML models on a clean dataset is the "easy" and fun part of an ML engineer's job.

While we do think our approach might have some advantages compared to "2018-style" AutoML (more flexibility, easier to use, potentially more intelligence solution space exploration), we know it suffers from the issue you highlighted. For the time being, this is aimed primarily at engineers who don't have ML expertise: someone who understands the business context, knows how to build data processing pipelines and web services, but might not know how to build the models.

Our next focus area is trying to apply the same agentic approach to the "data exploration" and "feature ETL engineering" part of the ML project lifecycle. Think a "data analyst agent" or "data engineering agent", with the ability to run and deploy feature processing jobs. I know it's a grand vision, and it won't happen overnight, but it's what we'd like to accomplish!

Would love to hear your thoughts :)

lamename · 32d ago

> this is aimed primarily at engineers who don't have ML expertise: someone who understands the business context, knows how to build data processing pipelines and web services, but might not know how to build the models.

I respect software engineers a lot, however ANYONE who "doesn't know how to build models" also doesn't know what data leakage is, how to evaluate a model more deeply than simple metrics/loss, and can easily trick themselves into building a "great" model that ends up falling on its face in prod. So apologies if I'm highly skeptical of the admittedly very very cool thing you have built. I'd love to hear your thoughts.

impresburger · 32d ago

I think you're probably right. As an example of this challenge, I've noticed that engineers who don't have a background in ML often lack the "mental models" to understand how to think about testing ML models (i.e. statistical testing as opposed to the kind of pass/fail test cases that are used to test code).

The way I look at this is that plexe can be useful even if it doesn't solve this fundamental problem. When a team doesn't have ML expertise, their choices are A) don't use ML B) acquire ML expertise C) use ChatGPT as your predictor. Option C suffers of the same problem you mentioned, in addition to latency/scalability/cost and the model not being trained on your data etc. So something like Plexe could be an improvement on option C by at least addressing the latter pain points.

Plus: we can keep throwing more compute at the agentic model building process, doing more analysis, more planning, more evaluation, more testing, etc. It still won't solve the problem you bring up, but hopefully it gets us closer to the point of "good enough to not matter" :)

Would love to hear your thoughts on this.

janalsncm · 33d ago

Just a thought, but maybe a good angle would be to interview data analysts and ask them what the most annoying parts of their jobs are, to figure out how to automate the drudge work. If you can make their lives easier, they’ll sell the product for you.

vaibhavdubey97 · 33d ago

Absolutely! When we started building this out, we knew that we had to build an agent to perform data cleaning and feature transformations. After speaking to data analysts, PMs and engineers over the last few weeks, we've received strong feedback about adding this capability to Plexe and we're actively working on it. We've already added some features related to this and hopefully will roll out the whole agent very soon!

janalsncm · 33d ago

Yes, this is the issue. In any reasonably-sized enterprise you’re not going to have a clean CSV to plug in to a model generator. You’re either going to have 1) 50 different excel spreadsheets to wrangle and combine somehow or 2) 50+ terabytes of messy logs to process.

Creating something that can grok MNIST is certainly cool, but it’s kind of the equivalent of saying leetcode is equivalent to software engineering.

Second, and more practically speaking, you are automating (what I think of as) the most fun part of ML: the creativity of framing a problem and designing a model to solve that problem.

vaibhavdubey97 · 32d ago

Agree completely. We built Plexe with that first scenario in mind - the messy spreadsheet problem that's so common in enterprise. You can connect multiple data sources, and Plexe will identify what it needs based on the problem description. We're also gradually developing support for handling terabyte-scale data, though we're not there yet. We started by validating our approach on well-defined problems with clean datasets, but we've been systematically adding capabilities to handle increasingly complex scenarios.

On your second point about automating the "fun part", we see Plexe as amplifying that creativity rather than automating it. We're trying to make it easier to design the experiments and evaluating results. But would love to hear your feedback on this!

fzysingularity · 33d ago

Is there a benchmark or eval for why this might be a better approach than actually modeling the problem? If you're selling this a non-ML person, I get the draw. But you'd still have to show why using these LLMs would be better than training it with something simpler / more lightweight.

That said, it's likely that you'll get good zero-shot performance, so the model building phase could benefit from fine-tuning the prompt given the dataset - instead of training the underlying model itself.

impresburger · 33d ago

Just to clarify, we're not directly using the LLMs as the "predictor" models for the task. We're making the LLMs do the modeling work for you.

For example, take the classic "house price prediction" problem. We don't use an LLM to make the predictions, we use LLMs to model the problem and write code that trains an ML models to predict house prices. This would most likely end up being an xgboost regressor or something like that.

As to your point about evals, great question! We've done some testing but haven't yet carried out a systematic eval. We intend to run this on OpenAI's MLE-Bench to quantify how well it actually does as creating models.

Hope I didn't misunderstand your comment!

marinr1 · 32d ago

hey this is very cool I work at a bank and we are starting to look at something like this mainly to automate boilerplate code for experimentation and model training however we are a GCP shop, I might play with this over the weekend to see if i can add support for vertex.ai experiments.

Have you thought about extending this to cover the model development lifecycle and perhaps having agents to help with EDA, model selection, explanation and feature engineering? this is where we are seeing a lot of demand from users as well but we are starting out with experiment/ pipeline / serving boilerplate.

vaibhavdubey97 · 32d ago

Hey! That sounds great. Happy to help in case you face any issues while adding support for vertex.ai.

We’ve added a tool which does EDA and when the model package is created, it contains a file called metadata.json which has detailed explanations for why a model was chosen, preprocessing steps and technical strengths & limitations. We’re working on adding an agent for performing feature engineering and should be out soon!

vessenes · 33d ago

I like this a lot, thank you for building it.

Any review of smolagent? This combination of agents approach seems likely to be really useful in a lot of places, and I’m wondering if you liked it, loved it, hated it, …

impresburger · 33d ago

Hey, I'm one of the authors of Plexe. Overall, I'd say we like smolagents: it's simple, easy to understand, and you can get a project set up very quickly. It also has some neat features, such as the "step callbacks" (functions that are executed after every step the agent takes).

However, the library does feel somewhat immature, and has some drawbacks that hinder building a production application. Some of the issues we've ran into include:

1. It's not easy to customise the agents' system prompts. You have to "patch" the smolagents library's YAML templates in a hacky way. 2. There is no "shared memory" abstraction out of the box to help you manage communication between agents. We had to implement an "ObjectRegistry" class into which the agents can register objects, so that another agent can retrieve the object just by knowing the object's key string. As we scale, we will need to build more complex communication abstractions (tasks queues etc). Given that communication is a key element of multi-agent systems, I would have expected a popular library like smolagents to have some kind of built-in support for it. 3. No "structured response" where you can pass a Pydantic BaseModel (or similar) to specify what structure the agent response should have. 4. "Managed agents" are always executed synchronously. If you have a hierarchy of managed agents, only one agent will ever be working at any given time. So we'll have to build an async execution mechanism ourselves.

I think we've run into some other limitations as well, but these are the first that come to my mind :) hope this helps!

vessenes · 33d ago

Thanks - super helpful. Passing state around to agents feels like a big pain point right now. That said just getting simple state transition libraries working with agents is a bit of a pain point as well.

Feels like there might be a good infra company in there for someone to build.

mikeqq2024 · 29d ago

Could you elaborate usage about state transition libraries with agents further?

vaibhavdubey97 · 33d ago

Thank you!

Smolagents works great for us but we did run into some limitations. For example, it lacks structured output enforcement, parallel execution, and in-built shared memory, which are crucial features for orchestrating a multi-layer agent hierarchy beyond simple chatbots. We've also been playing around with Pydantic AI due to its benefits with validation and type enforcement but haven't shifted yet.

revskill · 33d ago

Instead of "Attention is all we need", i expect an "Intention is all we need".

vaibhavdubey97 · 33d ago

Absolutely! And hopefully an input/output schema for the model :)

drlobster · 33d ago

That's great. Is there anyway to make it part of a scikit-learn compatible pipeline.?

impresburger · 33d ago

Do you mean being able to wrap the created model in a scikit-learn Pipeline? This isn't something we've thought about and we haven't explicitly built support for it, though we could.

As of now, I think you could relatively easily wrap the plexe model, which has a `predict()` method, in a scikit-learn Estimator. You could then plug it into a Pipeline.

What do you have in mind? How would you want to use this with scikit-learn pipelines?

drlobster · 33d ago

I think what I'm after is being able to put these in pipeline.

I.e. if I already have some data cleaning/normalisation, some dimensional reduction and then some fitting, being able to drop the Agent in place with an appropriate description and task.

Cleaning: Feed it a data frame and have it figure out what needs imputing etc.

The rest: Could either be separate tasks or one big task for the Agent..

impresburger · 33d ago

Interesting! We don't currently support this explicitly.

You could wrap the Plexe-built model in a scikit-learn Estimator like I mentioned, and you can specify the desired input/output schema of the model when you start building it, so it will fit into your Pipeline.

This is an interesting requirement for us to think about though. Maybe we'll build proper support for the "I want to use this in a Pipeline" use case :)

srameshc · 33d ago

I am just trying to understand and an honest question: Are we getting a fine tuned model from the dataset ?

vaibhavdubey97 · 33d ago

Plexe analyzes your data and task description, then builds custom ML models using standard Python libraries (like scikit-learn, XGBoost, etc.). If your problem is best solved by a regression model, it will build that. If classification is more appropriate, it will implement that instead.

Fine-tuning existing language models is also an option in Plexe's toolkit. For example, when we needed to classify prompt injections for LLMs, Plexe determined fine-tuning RoBERTa was the best approach. But for most structured data problems (like forecasting or recommendations), Plexe typically builds lightweight models from scratch that are trained directly on your dataset.

throwaway314155 · 33d ago

So just to be clear, you aren't building _deep_ learning models, or even NN-based models automatically?

vaibhavdubey97 · 33d ago

Sorry I think I explained poorly. Plexe does build deep learning models automatically. When it gets a dataset and a problem description, it automatically evaluates various model architectures (NNs being one of them).

Plexe experiments with multiple approaches - from traditional algorithms like gradient boosting to deep neural networks. It runs the training jobs and compares performance metrics across different architectures to identify which solution best fits your specific data and problem constraints.

throwaway314155 · 33d ago

Oh okay! In that case, my faith is restored. Sounds like a cool project.

vaibhavdubey97 · 33d ago

phew that was close. I'm glad your faith is restored :)

impresburger · 33d ago

No, not by default. In fact, the default installation of plexe doesn't include deep learning libraries.

Plexe _can_ build deep learning models using `torch` and `transformers`, and often the experimentation process will include some NN-based solutions as well, but that's just one of the ML frameworks available to the agent. It can also build models using xgboost, scikit-learn, and several others.

You can also explicitly tell Plexe not to use neural nets, if that's a requirement.

throwaway314155 · 33d ago

Indeed your colleague explained similarly. Seems like a great project.

MarcoDewey · 33d ago

I love that you all are doing real old school machine learning and not just LLM transformer based work!

vaibhavdubey97 · 32d ago

Thank you! Hopefully more complicated transformers are coming soon too :)

yu3zhou4 · 33d ago

Nice execution! I built a simpler version of it a year ago https://github.com/jmaczan/csv-to-ml I hope you succeed with the product and push the automl forward

impresburger · 33d ago

Hey, this is super cool! We found a few projects working on similar things to Plexe, but were not aware of yours. Thanks for sharing, will check it out!

vaibhavdubey97 · 33d ago

Very cool, thanks for sharing! :)

ratatoskrt · 33d ago

In my experience, humans are really bad at statistics and LLMs are even worse because they basically just mimic all the typical mistakes people make.

vaibhavdubey97 · 33d ago

You're right. We've seen the "garbage in, garbage out" problem firsthand.

We've seen the models hit typical statistical pitfalls like overfitting and data leakage during testing. We've improved by implementing strict validation protocols and guardrails around data handling. While we've fixed the agents getting stuck in recursive debugging loops, statistical validity remains an ongoing challenge. We're actively working on better detection of these issues, but ultimately, we rely on domain expertise from users for evaluating model performance.

ratatoskrt · 30d ago

Yeah, I don't mean to indicate that the model is bad. It's just that statistics are notoriously complicated (both in terms of the mathematics involved but also intuitively understanding the impact of non-perfect data and potential interactions is really hard), and most people really, really suck at it. Once you move from the maths to actually modelling data, you have to rely a lot on (niche) domain knowledge and experience.

I'm a bit biased because I work in a space with actual statisticians, but I'd wager that it's almost impossible for current LLMs to distinguish between good and bad examples in their training data. After all, even very smart humans fail to do that.

gitroom · 32d ago

well i actually like when folks push old school ML instead of just LLM stuff everywhere. makes me feel like we're not losing the basics

vaibhavdubey97 · 32d ago

Thank you! Making sure that basics don't going anywhere :)

My AI skeptic friends are all nuts (fly.io)

Self-hosting your own media considered harmful according to YouTube (jeffgeerling.com)

Bill Atkinson has died (daringfireball.net)

The time bomb in the tax code that's fueling mass tech layoffs (qz.com)

OpenAI slams court order to save all ChatGPT logs, including deleted chats (arstechnica.com)

Cloudlflare builds OAuth with Claude and publishes all the prompts (github.com)

FFmpeg merges WebRTC support (git.ffmpeg.org)

If you are useful, it doesn't mean you are valued (betterthanrandom.substack.com)

IRS Direct File on GitHub (chrisgiven.com)

A proposal to restrict sites from accessing a users’ local network (github.com)

Quarkdown: A modern Markdown-based typesetting system (github.com)

How to post when no one is reading (jeetmehta.com)

The last six months in LLMs, illustrated by pelicans on bicycles (simonwillison.net)

Merlin Bird ID (merlin.allaboutbirds.org)

Cursor 1.0 (cursor.com)

Why I wrote the BEAM book (happihacking.com)

Deep learning gets the glory, deep fact checking gets ignored (rachel.fast.ai)

How we decreased GitLab repo backup times from 48 hours to 41 minutes (about.gitlab.com)

The impossible predicament of the death newts (crookedtimber.org)

Covert web-to-app tracking via localhost on Android (localmess.github.io)

Meta: Shut down your invasive AI Discover feed (mozillafoundation.org)

Show HN: Kan.bn – An open-source alterative to Trello (github.com)

Tesla seeks to guard crash data from public disclosure (reuters.com)

EU Commission refuses to disclose authors behind its mass surveillance proposal (old.reddit.com)

Show HN: Air Lab – A portable and open air quality measuring device (networkedartifacts.com)

Prompt engineering playbook for programmers (addyo.substack.com)

Falsehoods programmers believe about aviation (flightaware.engineering)

Washington Post's Privacy Tip: Stop Using Chrome, Delete Meta Apps (and Yandex) (tech.slashdot.org)

My experiment living in a tent in Hong Kong's jungle (corentin.trebaol.com)

The Right to Repair Is Law in Washington State (eff.org)

Researchers develop ‘transparent paper’ as alternative to plastics (japannews.yomiuri.co.jp)

Google restricts Android sideloading (puri.sm)

(On | No) Syntactic Support for Error Handling (go.dev)

Convert photos to Atkinson dithering (gazs.github.io)

Ask HN: How do I learn robotics in 2025?

Getting Past Procrastination (spectrum.ieee.org)

The Illusion of Thinking: Strengths and limitations of reasoning models [pdf] (ml-site.cdn-apple.com)

Self-Host and Tech Independence: The Joy of Building Your Own (ssp.sh)

Joining Apple Computer (2018) (folklore.org)

A year of funded FreeBSD development (daemonology.net)

DiffX – Next-Generation Extensible Diff Format (diffx.org)

Ask HN: Who is hiring? (June 2025)

Builder.ai Collapses: $1.5B 'AI' Startup Exposed as 'Indians'? (ibtimes.co.uk)

Cockatoos have learned to operate drinking fountains in Australia (science.org)

The iPhone 15 Pro’s Depth Maps (tech.marksblogg.com)

Autonomous drone defeats human champions in racing first (tudelft.nl)

How Ukraine’s killer drones are beating Russian jamming (spectrum.ieee.org)

Gemini-2.5-pro-preview-06-05 (deepmind.google)

Apple Notes Will Gain Markdown Export at WWDC, and, I Have Thoughts (daringfireball.net)

Cloud Run GPUs, now GA, makes running AI workloads easier for everyone (cloud.google.com)

Show HN: Plexe – ML Models from a Prompt

Comments (49)