Show HN: Plexe – ML Models from a Prompt
There are all kinds of uses for ML models that never get realized because the process of making them is messy and convoluted. You can spend months trying to find the data, clean it, experiment with models, and deploy to production, only to find out that your project has been binned for taking so long. There are many tools for “automating” ML, but it still takes teams of ML experts to actually productionize something of value. And we can’t keep throwing LLMs at every ML problem. Why use a generic 10B-parameter language model if a logistic regression trained on your data could do the job better?
Our light-bulb moment was that we could use LLMs to generate task-specific ML models that would be trained on one’s own data. Thanks to the emergent reasoning ability of LLMs, it is now possible to create an agentic system that might automate most of the ML lifecycle.
A couple of months ago, we started developing a Python library that would let you define ML models on structured data using a description of the expected behaviour. Our initial implementation arranged potential solutions into a graph, using LLMs to write plans, implement them as code, and run the resulting training script. Using simple search algorithms, the system traversed the solution space to identify and package the best model.
However, we ran into several limitations: the algorithm proved brittle under edge cases, and we kept having to patch it for every minor issue in the training process. We decided to rethink the approach, throw everything out, and rebuild the tool using an agentic approach prioritising generality and flexibility. What started as a single ML engineering agent turned into an agentic ML "team", with all experiments tracked and logged using MLflow.
Our current implementation uses the smolagents library to define an agent hierarchy. We mapped the functionality of our previous implementation to a set of specialized agents, such as an “ML scientist” that proposes solution plans, and so on. Each agent has specialized tools, instructions, and prompt templates. To facilitate cross-agent communication, we implemented a shared memory that enables objects (datasets, code snippets, etc) to be passed across agents indirectly by referencing keys in a registry. You can find a detailed write-up on how it works here: https://github.com/plexe-ai/plexe/blob/main/docs/architectur...
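To make the shared-memory idea concrete, here's a minimal sketch of a key-based registry. It's simplified relative to what's in the repo, and the method names are just illustrative:

```python
# Minimal sketch of the shared-memory idea described above (simplified and
# illustrative; method names are assumptions, not the actual Plexe code).
from typing import Any, Dict


class ObjectRegistry:
    """Key-value store that lets agents exchange objects by reference."""

    def __init__(self) -> None:
        self._objects: Dict[str, Any] = {}

    def register(self, key: str, obj: Any) -> str:
        # One agent stores a dataset, code snippet, etc. and passes only the key on.
        self._objects[key] = obj
        return key

    def get(self, key: str) -> Any:
        # Another agent retrieves the object just by knowing its key.
        return self._objects[key]


registry = ObjectRegistry()
registry.register("dataset:train", {"rows": 1000, "path": "train.parquet"})
print(registry.get("dataset:train"))
```

Passing keys rather than the objects themselves keeps each agent's prompt small, since large datasets never have to be serialised into the conversation.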
Plexe’s early release is focused on predictive problems over structured data. It can be used to build models for tasks such as forecasting player injury risk in high-intensity sports, recommending products for an e-commerce marketplace, or predicting technical indicators for algorithmic trading. Here are some examples to get you started: https://github.com/plexe-ai/plexe/tree/main/examples
To get it working on your data, you can dump in any CSV, Parquet, etc., and Plexe uses what it needs from your dataset to figure out which features to use. The open-source tool only supports adding files right now, but our platform version will support integrating with Postgres: it pulls all available data based on a SQL query and dumps it into a Parquet file for the agent to build models from.
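For a rough idea of that Postgres-to-Parquet flow, here's a sketch using pandas and SQLAlchemy. This isn't Plexe code; the connection string, query, and file path are placeholders:

```python
# Rough sketch of the Postgres -> Parquet flow described above (not Plexe code;
# the connection string, query, and file path are placeholders).
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/shop")

# Pull all the data the model might need with a plain SQL query...
df = pd.read_sql_query("SELECT * FROM orders WHERE created_at > '2024-01-01'", engine)

# ...and dump it to a Parquet file that the agent can build models from.
df.to_parquet("orders.parquet", index=False)
```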
Next up, we’ll be tackling more of the ML project lifecycle: we’re currently working on adding a “feature engineering agent” that focuses on the complex data transformations that are often required for data to be ready for model training. If you're interested, check Plexe out and let us know your thoughts!
I've only watched the demo, but judging from the fact that there are several agent-decided steps in the whole model generation process, I think it'd be useful for Plexe to ask the user in between whether they're happy with the plan for the next steps, so it's more interactive and not just a single, large one-shot.
E.g. telling the user what features the model plans to use, and the user being able to request any changes before that step is executed.
Also wanted to ask how you plan to scale to more advanced (case-specific) models? I see this as a quick and easy way to get the more trivial models working especially for less ML-experienced people, but am curious what would change for more complicated models or demanding users?
Regarding more complicated models and demanding users, I think we'd need:
1. More visibility into the training runs: log more metrics to MLflow, visualise the state of the multi-agent system so the user knows "who is doing what", etc.

2. Give the user more control over the process, both before the building starts and during. Let the user override decisions made by the agents. This will require the mechanism I mentioned for letting both the user and the agents send each other messages during the build process.

3. Run model experiments in parallel. Currently the whole thing is "single-threaded", but with better parallelism (and potentially launching the training jobs on a separate Ray cluster, which we've started working on) you could throw more compute at the problem; there's a rough sketch of what that could look like just below.
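Very roughly, parallel experiments on Ray could look something like this. The train_candidate function and the configs are placeholders, not our actual implementation:

```python
# Rough sketch of running model experiments in parallel on Ray (illustrative;
# train_candidate and the configs are placeholders, not Plexe's implementation).
import ray

ray.init()


@ray.remote
def train_candidate(config: dict) -> dict:
    # In the real system this would run a generated training script; here we
    # just return a fake validation score for the given config.
    score = 1.0 / (1 + len(str(config)))  # placeholder "metric"
    return {"config": config, "val_score": score}


configs = [{"model": "xgboost"}, {"model": "mlp"}, {"model": "linear"}]
futures = [train_candidate.remote(c) for c in configs]  # launched concurrently
results = ray.get(futures)

best = max(results, key=lambda r: r["val_score"])
print("best candidate:", best["config"])
```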
I'm sure there are many more things that would help here, but these are the first that come to mind off the top of my head.
What are your thoughts? Anything in particular that you think a demanding user would want/need?
Now that I've watched the video, I think I have a better understanding. One thing I would like to see is more of a breakdown of how this solves a problem that a big model by itself wouldn't.
Yeah, we rushed to create a "Plexe in action" video for our README. We'll put a link to the YouTube video in the README so it's easier to find.
Using large generative models enables fast prototyping, but runs into several issues: generic LLMs have high latency and cost, and fine-tuning/distilling doesn’t address the fundamental size issue. Given these pain points, we realized the solution isn’t bigger generic models (fine-tuned or not), but rather automating the creation, deployment, and management of lightweight models built on domain-specific data. An LLM can detect if an email is malicious, but a classifier built specifically for detecting malicious emails is orders of magnitude smaller and more efficient. Plus, it's easier to retrain with more data.
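To make the malicious-email example concrete, this is the kind of lightweight model we mean: a TF-IDF plus logistic regression classifier in scikit-learn. It's a toy sketch with made-up data, not the code Plexe generates for any particular problem:

```python
# Toy sketch of a lightweight, task-specific classifier (made-up data;
# not the code Plexe generates for any particular problem).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

emails = [
    "Urgent: verify your account now by clicking this link",
    "Quarterly report attached, see section 3 for revenue",
    "You have won a prize, send your bank details to claim it",
    "Lunch tomorrow at 12:30? The usual place works for me",
]
labels = [1, 0, 1, 0]  # 1 = malicious, 0 = benign

# Thousands of parameters instead of billions, and trivially retrainable.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(emails, labels)

print(clf.predict(["Click here to reset your password immediately"]))
```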
Does it decide based on data if it should make its own ML model or fine-tune a relevant one?
Also, does it detect issues with the training data? When I was doing NLP ML models before LLMs, the tasks that took all my time were related to data cleaning, not the training or choosing the right approach.
I guess that kind of answers your second question, too: it does not currently detect issues with the training data. But it will after the next few pull requests we have lined up!
And yes, completely agree about data cleaning vs. model building. We started from model building as that's the "easier" problem, but our aim is to add more agents to the system to also handle reviewing the data, reasoning about it, creating feature engineering jobs, etc.
That said, it's likely that you'll get good zero-shot performance, so the model building phase could benefit from fine-tuning the prompt given the dataset - instead of training the underlying model itself.
For example, take the classic "house price prediction" problem. We don't use an LLM to make the predictions; we use LLMs to model the problem and write code that trains an ML model to predict house prices. This would most likely end up being an XGBoost regressor or something like that.
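For illustration, the generated training script might end up looking something like this simplified sketch with synthetic data (not actual Plexe output):

```python
# Simplified sketch of the kind of training script the agents might write for
# house price prediction (synthetic data; not actual Plexe output).
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))  # e.g. size, rooms, age, distance to city
y = 50_000 + 30_000 * X[:, 0] - 5_000 * X[:, 2] + rng.normal(0, 10_000, 500)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)

print("validation R^2:", r2_score(y_val, model.predict(X_val)))
```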
As to your point about evals, great question! We've done some testing but haven't yet carried out a systematic eval. We intend to run this on OpenAI's MLE-bench to quantify how well it actually does at creating models.
Hope I didn't misunderstand your comment!
While we do think our approach might have some advantages compared to "2018-style" AutoML (more flexibility, easier to use, potentially more intelligent solution-space exploration), we know it suffers from the issue you highlighted. For the time being, this is aimed primarily at engineers who don't have ML expertise: someone who understands the business context and knows how to build data processing pipelines and web services, but might not know how to build the models.
Our next focus area is trying to apply the same agentic approach to the "data exploration" and "feature ETL engineering" part of the ML project lifecycle. Think a "data analyst agent" or "data engineering agent", with the ability to run and deploy feature processing jobs. I know it's a grand vision, and it won't happen overnight, but it's what we'd like to accomplish!
Would love to hear your thoughts :)
Creating something that can grok MNIST is certainly cool, but it's a bit like saying leetcode is equivalent to software engineering.
Second, and more practically speaking, you are automating (what I think of as) the most fun part of ML: the creativity of framing a problem and designing a model to solve that problem.
Any review of smolagents? This combination-of-agents approach seems likely to be really useful in a lot of places, and I'm wondering if you liked it, loved it, hated it, …
However, the library does feel somewhat immature, and it has some drawbacks that hinder building a production application. Some of the issues we've run into include:
1. It's not easy to customise the agents' system prompts. You have to "patch" the smolagents library's YAML templates in a hacky way.

2. There is no "shared memory" abstraction out of the box to help you manage communication between agents. We had to implement an "ObjectRegistry" class into which the agents can register objects, so that another agent can retrieve an object just by knowing its key string. As we scale, we will need to build more complex communication abstractions (task queues, etc.). Given that communication is a key element of multi-agent systems, I would have expected a popular library like smolagents to have some kind of built-in support for it.

3. No "structured response" where you can pass a Pydantic BaseModel (or similar) to specify what structure the agent response should have, so you end up validating the raw output by hand (see the sketch after this list).

4. "Managed agents" are always executed synchronously. If you have a hierarchy of managed agents, only one agent will ever be working at any given time. So we'll have to build an async execution mechanism ourselves.
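For point 3, here's the kind of by-hand validation I mean, sketched with Pydantic v2. The schema and the raw output string are made up for illustration:

```python
# Sketch of validating an agent's free-form output by hand with Pydantic v2
# (the schema and the raw JSON string below are made up for illustration).
from pydantic import BaseModel, ValidationError


class SolutionPlan(BaseModel):
    model_type: str
    features: list[str]
    rationale: str


raw_agent_output = '{"model_type": "xgboost", "features": ["age", "income"], "rationale": "tabular data"}'

try:
    plan = SolutionPlan.model_validate_json(raw_agent_output)
    print(plan.model_type, plan.features)
except ValidationError as err:
    # Without built-in structured responses, malformed output has to be caught
    # and retried at the application level.
    print("agent returned an invalid plan:", err)
```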
I think we've run into some other limitations as well, but these are the first that come to my mind :) hope this helps!
Feels like there might be a good infra company in there for someone to build.
Smolagents works great for us, but we did run into some limitations. For example, it lacks structured output enforcement, parallel execution, and built-in shared memory, which are crucial features for orchestrating a multi-layer agent hierarchy beyond simple chatbots. We've also been playing around with Pydantic AI because of its benefits around validation and type enforcement, but we haven't switched yet.
As of now, I think you could relatively easily wrap the plexe model, which has a `predict()` method, in a scikit-learn Estimator. You could then plug it into a Pipeline.
What do you have in mind? How would you want to use this with scikit-learn pipelines?
I.e. if I already have some data cleaning/normalisation, some dimensionality reduction, and then some fitting, I'd like to be able to drop the Agent in place with an appropriate description and task.
Cleaning: Feed it a data frame and have it figure out what needs imputing etc.
The rest: could either be separate tasks or one big task for the Agent.
You could wrap the Plexe-built model in a scikit-learn Estimator like I mentioned, and you can specify the desired input/output schema of the model when you start building it, so it will fit into your Pipeline.
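Roughly, something like this thin wrapper is what I have in mind. The predict() call signature is simplified here, so treat it as an illustrative sketch rather than tested code:

```python
# Rough sketch of wrapping a Plexe-built model for use in a scikit-learn
# Pipeline (the predict() signature is simplified; treat as illustrative).
from sklearn.base import BaseEstimator, RegressorMixin


class PlexeEstimator(BaseEstimator, RegressorMixin):
    """Exposes an already-built Plexe model through the sklearn Estimator API."""

    def __init__(self, plexe_model):
        self.plexe_model = plexe_model  # an already-built Plexe model

    def fit(self, X, y=None):
        # The Plexe model is already trained, so fit is a no-op.
        return self

    def predict(self, X):
        # Delegate to the Plexe model's predict() method.
        return self.plexe_model.predict(X)
```

You'd then drop it into your Pipeline after the cleaning and dimensionality-reduction steps.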
This is an interesting requirement for us to think about though. Maybe we'll build proper support for the "I want to use this in a Pipeline" use case :)
Fine-tuning existing language models is also an option in Plexe's toolkit. For example, when we needed to classify prompt injections for LLMs, Plexe determined fine-tuning RoBERTa was the best approach. But for most structured data problems (like forecasting or recommendations), Plexe typically builds lightweight models from scratch that are trained directly on your dataset.
Plexe experiments with multiple approaches - from traditional algorithms like gradient boosting to deep neural networks. It runs the training jobs and compares performance metrics across different architectures to identify which solution best fits your specific data and problem constraints.
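As a loose analogy for that comparison step (not the actual agent code), it's conceptually similar to cross-validating a few candidate estimators and keeping the best:

```python
# Conceptual analogy of the "try several approaches and compare" step
# (synthetic data and candidate list; not the actual agent code).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

candidates = {
    "gradient_boosting": GradientBoostingClassifier(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "mlp": MLPClassifier(max_iter=500),
}

# Compare mean cross-validated accuracy across candidate architectures.
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "-> best:", best)
```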
Plexe _can_ build deep learning models using `torch` and `transformers`, and often the experimentation process will include some NN-based solutions as well, but that's just one of the ML frameworks available to the agent. It can also build models using xgboost, scikit-learn, and several others.
You can also explicitly tell Plexe not to use neural nets, if that's a requirement.
We've seen the models hit typical statistical pitfalls like overfitting and data leakage during testing. We've improved this by implementing strict validation protocols and guardrails around data handling. While we've fixed the agents getting stuck in recursive debugging loops, statistical validity remains an ongoing challenge. We're actively working on better detection of these issues, but ultimately we rely on users' domain expertise for evaluating model performance.
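One simple example of the kind of guardrail we mean (illustrative only, with a made-up threshold and noise data; not our exact implementation) is flagging a candidate model whose training score is far above its held-out score:

```python
# Illustrative guardrail: flag a suspicious train/holdout score gap
# (threshold and data are made up; not our exact implementation).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 20))
y = rng.integers(0, 2, size=400)  # pure noise, so overfitting is guaranteed

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestClassifier().fit(X_train, y_train)

gap = model.score(X_train, y_train) - model.score(X_test, y_test)
if gap > 0.2:  # arbitrary threshold for the sketch
    print(f"warning: train/holdout gap of {gap:.2f} suggests overfitting or leakage")
```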