Show HN: Plexe – ML Models from a Prompt

69 vaibhavdubey97 33 5/6/2025, 3:38:04 PM github.com ↗
Hey HN! We’re Vaibhav and Marcello. We’re building Plexe (https://github.com/plexe-ai/plexe), an open-source agent that turns natural language task descriptions into trained ML models. Here’s a video walkthrough: https://www.youtube.com/watch?v=bUwCSglhcXY.

There are all kinds of uses for ML models that never get realized because the process of making them is messy and convoluted. You can spend months trying to find the data, clean it, experiment with models and deploy to production, only to find out that your project has been binned for taking so long. There are many tools for “automating” ML, but it still takes teams of ML experts to actually productionize something of value. And we can’t keep throwing LLMs at every ML problem. Why use a generic 10B parameter language model, if a logistic regression trained on your data could do the job better?

Our light-bulb moment was that we could use LLMs to generate task-specific ML models that would be trained on one’s own data. Thanks to the emergent reasoning ability of LLMs, it is now possible to create an agentic system that might automate most of the ML lifecycle.

A couple of months ago, we started developing a Python library that would let you define ML models on structured data using a description of the expected behaviour. Our initial implementation arranged potential solutions into a graph, using LLMs to write plans, implement them as code, and run the resulting training script. Using simple search algorithms, the system traversed the solution space to identify and package the best model.

However, we ran into several limitations, as the algorithm proved brittle under edge cases, and we kept having to put patches for every minor issue in the training process. We decided to rethink the approach, throw everything out, and rebuild the tool using an agentic approach prioritising generality and flexibility. What started as a single ML engineering agent turned into an agentic ML "team", with all experiments tracked and logged using MLFlow.

Our current implementation uses the smolagents library to define an agent hierarchy. We mapped the functionality of our previous implementation to a set of specialized agents, such as an “ML scientist” that proposes solution plans, and so on. Each agent has specialized tools, instructions, and prompt templates. To facilitate cross-agent communication, we implemented a shared memory that enables objects (datasets, code snippets, etc) to be passed across agents indirectly by referencing keys in a registry. You can find a detailed write-up on how it works here: https://github.com/plexe-ai/plexe/blob/main/docs/architectur...

Plexe’s early release is focused on predictive problems over structured data, and can be used to build models such as forecasting player injury risk in high-intensity sports, product recommendations for an e-commerce marketplace, or predicting technical indicators for algorithmic trading. Here are some examples to get you started: https://github.com/plexe-ai/plexe/tree/main/examples

To get it working on your data, you can dump any CSV, parquet, etc and Plexe uses what it needs from your dataset to figure out what features it should use. In the open-source tool, it only supports adding files right now but in our platform version, we'll have support for integrating with Postgres where it pulls all available data based on an SQL query and dumps it into a parquet file for the agent to build models.

Next up, we’ll be tackling more of the ML project lifecycle: we’re currently working on adding a “feature engineering agent” that focuses on the complex data transformations that are often required for data to be ready for model training. If you're interested, check Plexe out and let us know your thoughts!

Comments (33)

Stiopa · 3h ago
Awesome work.

Only watched demo, but judging from the fact there are several agent-decided steps in the whole model generation process, I think it'd be useful for Plexe to ask the user in-between if they're happy with the plan for the next steps, so it's more interactive and not just a single, large one-shot.

E.g. telling the user what features the model plans to use, and the user being able to request any changes before that step is executed.

Also wanted to ask how you plan to scale to more advanced (case-specific) models? I see this as a quick and easy way to get the more trivial models working especially for less ML-experienced people, but am curious what would change for more complicated models or demanding users?

impresburger · 3h ago
Agree. We've designed a mechanism to enable any of the agents to ask for input from the user, but we haven't implemented it yet. Especially for more complex use cases, or use cases where the datasets are large and training runs are long, being able to interrupt (or guide) the agents' work would really help avoid "wasted" one-shot runs.

Regarding more complicated models and demanding users, I think we'd need:

1. More visibility into the training runs; log more metrics to MLFlow, visualise the state of the multi-agent system so the user knows "who is doing what", etc. 2. Give the user more control over the process, both before the building starts and during. Let the user override decisions made by the agents. This will require the mechanism I mentioned for letting both the user and the agents send each other messages during the build process. 3. Run model experiments in parallel. Currently the whole thing is "single thread", but with better parallelism (and potentially launching the training jobs on a separate Ray cluster, which we've started working on) you could throw more compute at the problem.

I'm sure there are many more things that would help here, but these are the first that come to mind off the top of my head.

What are your thoughts? Anything in particular that you think a demanding user would want/need?

thefourthchime · 5h ago
This is a really interesting idea! I'll be honest, it took me a minute to really get what it was doing. The GitHub page video doesn't play with any audio, so it's not clear what's happening.

Once I watched the video, I think I have a better understanding. One thing I would like to see is more of a breakdown of how this solves a problem that just a big model itself wouldn't.

vaibhavdubey97 · 5h ago
Thank you!

Yeah we rushed to create a "Plexe in action" video for our Readme. We'll put a link to the YouTube video on the Readme so it's easier.

Using large generative models enables fast prototyping, but runs into several issues: generic LLMs have high latency and cost, and fine-tuning/distilling doesn’t address the fundamental size issue. Given these pain points, we realized the solution isn’t bigger generic models (fine-tuned or not), but rather automating the creation, deployment, and management of lightweight models built on domain-specific data. An LLM can detect if an email is malicious, but a classifier built specifically for detecting malicious emails is orders of magnitude smaller and more efficient. Plus, it's easier to retrain with more data.

Oras · 4h ago
I like the idea of trying multiple solutions.

Does it decide based on data if it should make its own ML model or fine-tune a relevant one?

Also, does it detect issues with the training data? When I was doing NLP ML models before LLMs, the tasks that took all my time were related to data cleaning, not the training or choosing the right approach.

impresburger · 4h ago
Currently it decides whether to make its own model or fine-tune a relevant one based primarily on the problem description. The agent's ability to analyse the data when making decisions is pretty limited right now, and something we're currently working on (i.e. let the agent look at the data whenever relevant, etc).

I guess that kind of answers your second question, too: it does not currently detect issues with the training data. But it will after the next few pull requests we have lined up!

And yes, completely agree about data cleaning vs. model building. We started from model building as that's the "easier" problem, but our aim is to add more agents to the system to also handle reviewing the data, reasoning about it, creating feature engineering jobs, etc.

fzysingularity · 4h ago
Is there a benchmark or eval for why this might be a better approach than actually modeling the problem? If you're selling this a non-ML person, I get the draw. But you'd still have to show why using these LLMs would be better than training it with something simpler / more lightweight.

That said, it's likely that you'll get good zero-shot performance, so the model building phase could benefit from fine-tuning the prompt given the dataset - instead of training the underlying model itself.

impresburger · 3h ago
Just to clarify, we're not directly using the LLMs as the "predictor" models for the task. We're making the LLMs do the modeling work for you.

For example, take the classic "house price prediction" problem. We don't use an LLM to make the predictions, we use LLMs to model the problem and write code that trains an ML models to predict house prices. This would most likely end up being an xgboost regressor or something like that.

As to your point about evals, great question! We've done some testing but haven't yet carried out a systematic eval. We intend to run this on OpenAI's MLE-Bench to quantify how well it actually does as creating models.

Hope I didn't misunderstand your comment!

dweinus · 4h ago
I don't want to hate, what you built is really cool and should save time in a data scientist's workflow, but... we did this. It won't "automate most of the ML lifecycle." Back in ~2018 "autoML" was all the rage. It failed because creating boilerplate and training models are not the hard parts of ML. The hard parts are evaluating data quality, seeking out new data, designing features, making appropriate choices to prevent leakage, designing evaluation appropriate to the business problem, and knowing how this will all interact with the model design choices.
janalsncm · 11m ago
Yes, this is the issue. In any reasonably-sized enterprise you’re not going to have a clean CSV to plug in to a model generator. You’re either going to have 1) 50 different excel spreadsheets to wrangle and combine somehow or 2) 50+ terabytes of messy logs to process.

Creating something that can grok MNIST is certainly cool, but it’s kind of the equivalent of saying leetcode is equivalent to software engineering.

Second, and more practically speaking, you are automating (what I think of as) the most fun part of ML: the creativity of framing a problem and designing a model to solve that problem.

impresburger · 4h ago
Hey, one of the authors here! I completely agree with your comment. Training ML models on a clean dataset is the "easy" and fun part of an ML engineer's job.

While we do think our approach might have some advantages compared to "2018-style" AutoML (more flexibility, easier to use, potentially more intelligence solution space exploration), we know it suffers from the issue you highlighted. For the time being, this is aimed primarily at engineers who don't have ML expertise: someone who understands the business context, knows how to build data processing pipelines and web services, but might not know how to build the models.

Our next focus area is trying to apply the same agentic approach to the "data exploration" and "feature ETL engineering" part of the ML project lifecycle. Think a "data analyst agent" or "data engineering agent", with the ability to run and deploy feature processing jobs. I know it's a grand vision, and it won't happen overnight, but it's what we'd like to accomplish!

Would love to hear your thoughts :)

janalsncm · 8m ago
Just a thought, but maybe a good angle would be to interview data analysts and ask them what the most annoying parts of their jobs are, to figure out how to automate the drudge work. If you can make their lives easier, they’ll sell the product for you.
srameshc · 1h ago
I am just trying to understand and an honest question: Are we getting a fine tuned model from the dataset ?
vaibhavdubey97 · 51m ago
Plexe analyzes your data and task description, then builds custom ML models using standard Python libraries (like scikit-learn, XGBoost, etc.). If your problem is best solved by a regression model, it will build that. If classification is more appropriate, it will implement that instead.

Fine-tuning existing language models is also an option in Plexe's toolkit. For example, when we needed to classify prompt injections for LLMs, Plexe determined fine-tuning RoBERTa was the best approach. But for most structured data problems (like forecasting or recommendations), Plexe typically builds lightweight models from scratch that are trained directly on your dataset.

throwaway314155 · 17m ago
So just to be clear, you aren't building _deep_ learning models, or even NN-based models automatically?
impresburger · 4m ago
No, not by default. In fact, the default installation of plexe doesn't include deep learning libraries.

Plexe _can_ build deep learning models using `torch` and `transformers`, and often the experimentation process will include some NN-based solutions as well, but that's just one of the ML frameworks available to the agent. It can also build models using xgboost, scikit-learn, and several others.

You can also explicitly tell Plexe not to use neural nets, if that's a requirement.

throwaway314155 · 2m ago
Indeed your colleague explained similarly. Seems like a great project.
vaibhavdubey97 · 8m ago
Sorry I think I explained poorly. Plexe does build deep learning models automatically. When it gets a dataset and a problem description, it automatically evaluates various model architectures (NNs being one of them).

Plexe experiments with multiple approaches - from traditional algorithms like gradient boosting to deep neural networks. It runs the training jobs and compares performance metrics across different architectures to identify which solution best fits your specific data and problem constraints.

throwaway314155 · 5m ago
Oh okay! In that case, my faith is restored. Sounds like a cool project.
drlobster · 2h ago
That's great. Is there anyway to make it part of a scikit-learn compatible pipeline.?
impresburger · 1h ago
Do you mean being able to wrap the created model in a scikit-learn Pipeline? This isn't something we've thought about and we haven't explicitly built support for it, though we could.

As of now, I think you could relatively easily wrap the plexe model, which has a `predict()` method, in a scikit-learn Estimator. You could then plug it into a Pipeline.

What do you have in mind? How would you want to use this with scikit-learn pipelines?

drlobster · 4m ago
I think what I'm after is being able to put these in pipeline.

I.e. if I already have some data cleaning/normalisation, some dimensional reduction and then some fitting, being able to drop the Agent in place with an appropriate description and task.

Cleaning: Feed it a data frame and have it figure out what needs imputing etc.

The rest: Could either be separate tasks or one big task for the Agent..

vessenes · 4h ago
I like this a lot, thank you for building it.

Any review of smolagent? This combination of agents approach seems likely to be really useful in a lot of places, and I’m wondering if you liked it, loved it, hated it, …

impresburger · 4h ago
Hey, I'm one of the authors of Plexe. Overall, I'd say we like smolagents: it's simple, easy to understand, and you can get a project set up very quickly. It also has some neat features, such as the "step callbacks" (functions that are executed after every step the agent takes).

However, the library does feel somewhat immature, and has some drawbacks that hinder building a production application. Some of the issues we've ran into include:

1. It's not easy to customise the agents' system prompts. You have to "patch" the smolagents library's YAML templates in a hacky way. 2. There is no "shared memory" abstraction out of the box to help you manage communication between agents. We had to implement an "ObjectRegistry" class into which the agents can register objects, so that another agent can retrieve the object just by knowing the object's key string. As we scale, we will need to build more complex communication abstractions (tasks queues etc). Given that communication is a key element of multi-agent systems, I would have expected a popular library like smolagents to have some kind of built-in support for it. 3. No "structured response" where you can pass a Pydantic BaseModel (or similar) to specify what structure the agent response should have. 4. "Managed agents" are always executed synchronously. If you have a hierarchy of managed agents, only one agent will ever be working at any given time. So we'll have to build an async execution mechanism ourselves.

I think we've run into some other limitations as well, but these are the first that come to my mind :) hope this helps!

vessenes · 3h ago
Thanks - super helpful. Passing state around to agents feels like a big pain point right now. That said just getting simple state transition libraries working with agents is a bit of a pain point as well.

Feels like there might be a good infra company in there for someone to build.

vaibhavdubey97 · 4h ago
Thank you!

Smolagents works great for us but we did run into some limitations. For example, it lacks structured output enforcement, parallel execution, and in-built shared memory, which are crucial features for orchestrating a multi-layer agent hierarchy beyond simple chatbots. We've also been playing around with Pydantic AI due to its benefits with validation and type enforcement but haven't shifted yet.

yu3zhou4 · 4h ago
Nice execution! I built a simpler version of it a year ago https://github.com/jmaczan/csv-to-ml I hope you succeed with the product and push the automl forward
impresburger · 4h ago
Hey, this is super cool! We found a few projects working on similar things to Plexe, but were not aware of yours. Thanks for sharing, will check it out!
vaibhavdubey97 · 4h ago
Very cool, thanks for sharing! :)
ratatoskrt · 5h ago
In my experience, humans are really bad at statistics and LLMs are even worse because they basically just mimic all the typical mistakes people make.
vaibhavdubey97 · 4h ago
You're right. We've seen the "garbage in, garbage out" problem firsthand.

We've seen the models hit typical statistical pitfalls like overfitting and data leakage during testing. We've improved by implementing strict validation protocols and guardrails around data handling. While we've fixed the agents getting stuck in recursive debugging loops, statistical validity remains an ongoing challenge. We're actively working on better detection of these issues, but ultimately, we rely on domain expertise from users for evaluating model performance.

revskill · 4h ago
Instead of "Attention is all we need", i expect an "Intention is all we need".
vaibhavdubey97 · 4h ago
Absolutely! And hopefully an input/output schema for the model :)