Show HN: Muscle-Mem, a behavior cache for AI agents
Muscle Mem is an SDK that records your agent's tool-calling patterns as it solves tasks, and will deterministically replay those learned trajectories whenever the task is encountered again, falling back to agent mode if edge cases are detected. Like a JIT compiler, for behaviors.
At Pig, we built computer-use agents for automating legacy Windows applications (healthcare, lending, manufacturing, etc).
A recurring theme we ran into was that businesses already had RPA (pure-software scripts), and it worked for them in most cases. The pull toward agents as an RPA alternative wasn't the infinitely flexible "AI Employee" that tech Twitter/X may want you to believe in, but simply that their RPA breaks under occasional edge cases and agents can gracefully handle those cases.
Using a pure-agent approach proved to be highly wasteful. Windows' accessibility APIs are poor, so you're generally stuck using pure-vision agents, which can run around $40/hr in token costs and take 5x longer than a human to perform a workflow. At that point, you're better off hiring a human.
The goal of Muscle-Mem is to get LLMs out of the hot path of repetitive automations, intelligently swapping between script-based execution for repeat cases, and agent-based automations for discovery and self-healing.
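For a rough picture of that hot-path swap, here is an illustrative sketch of the control flow; the names below are made up for illustration and are not the actual SDK surface:

```python
# Illustrative sketch of the hot-path swap, not the actual Muscle Mem API.
def run_task(task, cache, agent, env):
    trajectory = cache.lookup(task)              # have we solved this task before?
    if trajectory is not None:
        for step in trajectory:
            # Before replaying a step, validate that the environment still
            # looks like it did when the step was recorded.
            if not step.check.compare(step.check.capture(env), step.snapshot):
                break                            # drift detected: stop replaying
            step.replay(env)                     # deterministic, no LLM call
        else:
            return                               # full cache hit, LLM never invoked
    # Cache miss or mid-trajectory drift: hand off to the agent (in practice,
    # from wherever replay stopped) and record the new steps for next time.
    cache.record(task, agent.solve(task, env))
```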
While inspired by computer-use environments, Muscle Mem is designed to generalize to any automation performing discrete tasks in dynamic environments. It took a great deal of thought to figure out an API that generalizes, which I cover more deeply in this blog: https://erikdunteman.com/blog/muscle-mem/
Check out the repo, consider giving it a star, or dive deeper into the above blog. I look forward to your feedback!
There are definitely a lot of potential solutions, and I'm curious where this goes. IMO an embeddings approach won't be enough. I'm more than happy to discuss what we did internally to achieve a decent rate, though; the space is super promising for sure.
To clarify, the use of CLIP embeddings in the CUA example is an implementation decision for that example, not core to the engine itself.
This was very intentional in the design of Check as a pair: Capture() -> T and Compare(current: T, candidate: T) -> bool. T can be any data type that can be serialized to the DB, and the comparison is user-defined to operate on that generic type T.
A more complete CUA example would store features like OCR'ed text, Accessibility Tree data, etc.
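As a rough sketch of what that richer pair could look like (ScreenSnapshot, env.ocr(), and env.a11y_hash() are made up for illustration; only the Capture/Compare shape is the real contract):

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass
class ScreenSnapshot:
    # T can be any type that serializes to the DB; these fields are examples.
    ocr_text: str        # text pulled off the screen by your OCR of choice
    a11y_tree_hash: str  # hash of whatever accessibility data you can get

def capture(env) -> ScreenSnapshot:
    # env stands in for your agent's environment; ocr() and a11y_hash() are
    # placeholders for code you'd already have in a CUA setup.
    return ScreenSnapshot(ocr_text=env.ocr(), a11y_tree_hash=env.a11y_hash())

def compare(current: ScreenSnapshot, candidate: ScreenSnapshot) -> bool:
    # User-defined notion of "same enough to safely replay this step".
    text_similarity = SequenceMatcher(None, current.ocr_text, candidate.ocr_text).ratio()
    return current.a11y_tree_hash == candidate.a11y_tree_hash and text_similarity > 0.9
```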
I'll use this as an opportunity to call out a few outstanding questions that I don't yet have answers for:
- Parameterization. Rather than caching and reusing strict coordinates, what happens when the arguments of a tool call are derived from the top-level prompt, or, even more challenging, from the result of a previous tool call? In the case of computer use, perhaps a very specific element XPath is needed, but that element is not "compile-time known"; rather, it's derived mid-trajectory.
- What would it look like to stack compare filters? I.e., if a user wanted to first filter by cosine distance, and then apply stricter checks on OCR contents (see the sketch after this list).
- As you mentioned, how can you store some knowledge of environment features where change *is* expected? The datetime in the bottom right is the perfect example of this.
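On the stacking question, the simplest shape I can picture is composing compare functions, cheap to strict. A sketch (none of this exists in the SDK today; cosine_distance and the snapshot fields are hypothetical):

```python
def stack(*filters):
    # Compose compare functions: run cheap filters first and bail on the first
    # mismatch, so strict/expensive checks only run when needed.
    def compare(current, candidate) -> bool:
        return all(f(current, candidate) for f in filters)
    return compare

# Hypothetical usage: coarse embedding check first, strict OCR check second.
compare = stack(
    lambda cur, cand: cosine_distance(cur.embedding, cand.embedding) < 0.1,
    lambda cur, cand: cur.ocr_text == cand.ocr_text,
)
```

Expected-change regions like the clock could plausibly be handled the same way: one filter in the stack that masks or crops that region before comparing.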
For me, I want to reduce the friction on repeatable tasks. For example, I often need to create a new GraphQL query, which also requires updating the example query collection, creating a basic new integration test, etc. If I had an MCP-accessible prompt, I'd hope the agent would realize I have a set of instructions on how to handle this request when I make it.
One form factor I toyed with was the idea of a dynamically evolving list of tool specs given to a model or presented from an MCP server, but I wasn't thrilled by:
- that'd still require a model in the loop to choose the tool, albeit just once for the whole trajectory vs every step
- it misses the sneaky challenge of Muscle Memory systems, which is continuous cache validation. An environment can change unexpectedly mid-trajectory and a system would need to adapt, so no matter what, it needs something that looks like Muscle Mem's Check abstraction for pre/post step cache validation
If I understand correctly, the engine caches trajectories in the simplest way possible, so if you have a cached trajectory a-b-c, and you encounter c-b-d, there's no way to get a "partial" cache hit, right? As I'm wrapping my head around this, I'm thinking that the engine would have to be a great deal more complicated to be able to judge when it's a safe partial hit.
Basically, I'm trying to imagine how applicable this approach could be to a significantly noisier environment.
I definitely want to support sub-trajectories. In fact, I believe an absolutely killer feature for this system would be decomposing a large trajectory into smaller, more repeated sub-trajectories.
Jeff from trychroma.com often talks about agent engineering as being more like industrial engineering than software eng, and I'd agree.
One part of the original spec I wrote for this included a component I call the "Compactor", which would be a background agent process to modify and compress learned skills, similar to Letta's sleep-time agents:
https://docs.letta.com/guides/agents/sleep-time-agents
My fear with this is that it goes against the `No hidden nondeterminism` design value I stated in the launch blog. There are plenty of things we could throw background agents at, from the Compactor to parameterizing trajectories, but that's risky territory from an observability and debuggability standpoint.
For simplicity, I just decided to treat every trajectory as distinct, even if portions of it are redundant. If a cached trajectory fails a check halfway through, the agent proceeding from there just makes its own partial trajectory. It's still unclear whether we call that a trajectory for the same named task, or annotate it as a task recovery.
We can always increase the cache-hit rate over time; worst case, the agent just does redundant work, which is the status quo anyway.
I wish v0, Lovable, Bolt, et al. did something like this with their suggested prompts.
It's such a poor user experience to pick a template, wait 3-5 minutes for generation, and then be dumped randomly onto either an incredibly refined prototype or a ton of code that just errors out, and in either case have no clue what to do next.
How do you justify this vs. fixing the software to enable scripting? That seems both cheaper and easier to achieve, with far higher yields. Assume market-rate servicing, of course.
Plus: how do you force an "agent" to correct its behavior?
Legacy software is frequently useful but difficult to install and access.
Oh, you're correct. This business is built on sand.
I think the problem will be defining whether there is a cache hit or not, since "agents" are loosely defined and the tasks include basically anything.
If you boil it down, for a generic enough task and environment, the engine is just a database of previous environments and a user-provided filter function for cache validation.
What does that analog look like in more traditional computer use?
Currently, a user would have to explicitly make their @engine.tool function take an element ID as an argument, e.g. click_element_by_name(id), in order for it to be reused. This works, but it would lead to the codebase getting littered with hyper-specific functions that exist just to differentiate tools for Muscle Mem, which goes against the agent-agnostic thesis of the project.
Still figuring out how to do this.
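Concretely, the workaround today looks something like this (a sketch; desktop.find_element is a placeholder for whatever UI lookup you already have):

```python
# Parameters you want reused across runs have to be surfaced as explicit tool
# arguments so the cache can store and replay them.
@engine.tool
def click_element_by_name(element_id: str):
    element = desktop.find_element(element_id)  # resolve fresh at replay time,
    element.click()                             # rather than caching raw coordinates

# ...which tends to multiply into click_submit_button(), click_patient_tab(),
# etc.: hyper-specific wrappers that exist only to make steps cacheable.
```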
Talking with users who write their own RPA, the most-loved API for doing so was consistently https://github.com/asweigart/pyautogui. Windows does offer A11y APIs, but they're messy enough that many of the teams I talked to used pyautogui's locateOnScreen('button.png') fuzzy image matching instead.
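For anyone who hasn't used it, that fuzzy-matching flow is roughly:

```python
import pyautogui

# Template-match a reference screenshot of the button anywhere on screen.
# The confidence kwarg needs opencv-python; depending on the pyautogui version,
# a miss either returns None or raises ImageNotFoundException.
try:
    point = pyautogui.locateCenterOnScreen("button.png", confidence=0.9)
except pyautogui.ImageNotFoundException:
    point = None

if point:
    pyautogui.click(point)  # click the center of the matched region
else:
    # No good match on screen: exactly where you'd want to hand control back
    # to an agent instead of failing the whole script.
    print("button.png not found")
```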
For what that may look like, I'll reuse a brainstorm on this topic that a friend gave me recently:
"Instead of relying on an LLM to understand where to click, the click area itself is the token. And the click is a token and the objective is a token and the output is whatever. Such that, click paths aren't "stored", they're embedded within the training of the LAM/LLM"
Whatever it ends up looking like, as long as it gets the job done and remains debuggable and extensible enough to not immediately eject a user once they hit any level of complexity, I'd be happy for it to be a part of Muscle Mem.
Curious how effective this is.
I keep imagining an Agent that writes a bunch of custom tools when it needs them and "saves" them for later use, creating pipelines in code/config that it can reuse instead of solving from zero each time.
Essentially, I want to use LLMs for what they're good at (edge cases, fuzzy instructions/data) and have the agent turn around and write reusable tools, so that next time it doesn't have to run the full LLM; a tiny LLM router up front can determine whether a tool for this already exists. I'm not talking about MCP (though that is cool); this would use MCP tools, but it could make new ones from the existing ones.
Here is an example.
Imagine I have an Agent with MCP tools to read/write to my email, calendar, ticketing system, and Slack. I can ask the LLM to slack me every morning with an overview of my events for the day and anything outstanding I need to address. Maybe the first pass uses a frontier model to determine which tools to use, and it accomplishes the task. Once I'm happy with the output, the Agent feeds the conversation/tool calls into another LLM to distill it into a Python/Node/Bash/whatever script. That script would call the MCP tools to do the same thing and use small LLMs to glue the results together, and then a cron (or similar) entry gets created to have it run every morning.
I feel like this would remove a ton of the uncertainty when it comes to which tools an LLM uses without requiring humans to write custom flows with limited tools available for each task.
So the first pass would be the frontier model with the full tool list: a chain of tool calls against the calendar, email, and ticketing tools, ending with a final "Please use the following data" prompt that produces the Slack summary.
Then a fresh LLM reviews that and writes a script that does all the tool calls and jumps straight to the last "Please use the following data" prompt, which can be reused (cron'd or just called when it makes sense).

I might be way off base, and I don't work in the space (I just play around the edges), but this feels like a way to let agents "learn" and grow. I've just found that in practice you don't get good results from throwing all your tools at one big LLM with your prompt; you're better off limiting the tools and even creating compound tools for certain jobs you do over and over. I've found that lots of little tool calls add up and take a long time, so a way for the agent to dynamically create tools by combining other tools seems like a huge win.
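Roughly what I picture the distilled script looking like; everything here (the tool names, call_tool, small_llm) is made up purely for illustration:

```python
# Hypothetical output of the "distill to a script" step. call_tool() stands in
# for however you invoke your MCP tools, small_llm() for a cheap glue model.
def morning_briefing():
    events = call_tool("calendar", "list_events", {"range": "today"})
    unread = call_tool("email", "search", {"query": "unread since yesterday"})
    tickets = call_tool("tickets", "assigned_to_me", {"status": "open"})

    # The only model left in the loop is a small one that formats the summary.
    summary = small_llm(
        "Please use the following data to write a short morning briefing:",
        {"events": events, "email": unread, "tickets": tickets},
    )
    call_tool("slack", "post_message", {"channel": "@me", "text": summary})

if __name__ == "__main__":
    morning_briefing()  # wired to cron instead of re-running the frontier agent
```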