Show HN: Muscle-Mem, a behavior cache for AI agents

226 edunteman 51 5/14/2025, 7:38:26 PM github.com ↗

Hi HN! Erik here from Pig.dev, and today I'd like to share a new project we've just open sourced:

Muscle Mem is an SDK that records your agent's tool-calling patterns as it solves tasks, and will deterministically replay those learned trajectories whenever the task is encountered again, falling back to agent mode if edge cases are detected. Like a JIT compiler, for behaviors.

At Pig, we built computer-use agents for automating legacy Windows applications (healthcare, lending, manufacturing, etc).

A recurring theme we ran into was that businesses already had RPA (pure-software scripts), and it worked for them in most cases. The pull to agents as an RPA alternative was not to have an infinitely flexible "AI Employees" as tech Twitter/X may want you to think, but simply because their RPA breaks under occasional edge-cases and agents can gracefully handle those cases.

Using a pure-agent approach proved to be highly wasteful. Window's accessibility APIs are poor, so you're generally stuck using pure-vision agents, which can run around $40/hr in token costs and take 5x longer than a human to perform a workflow. At this point, you're better off hiring a human.

The goal of Muscle-Mem is to get LLMs out of the hot path of repetitive automations, intelligently swapping between script-based execution for repeat cases, and agent-based automations for discovery and self-healing.

While inspired by computer-use environments, Muscle Mem is designed to generalize to any automation performing discrete tasks in dynamic environments. It took a great deal of thought to figure out an API that generalizes, which I cover more deeply in this blog: https://erikdunteman.com/blog/muscle-mem/

Check out the repo, consider giving it a star, or dive deeper into the above blog. I look forward to your feedback!

Comments (51)

mindwok · 124d ago

It's becoming increasingly clear that memory and context are the bottlenecks in advancing usage of AI. I can't help but feel there needs to be a general, perhaps even built into the model, solution for this - everyone seems to be building something on top that is roughly the same thing.

ramoz · 124d ago

Karpathy had a similar interesting take the other day

https://x.com/karpathy/status/1921368644069765486

hnuser123456 · 123d ago

I'm starting up experiments with having agents write system prompts for sub-agents. Specifically, have the LLM build, test, and validate a small, simple tool, and once validated, add it to its own system prompt listing available tools.

Anyone else experimenting with letting LLMs generate their own or sub-agent system prompts?

FisherKK · 123d ago

Skill Library!

pacjam · 123d ago

check out Letta - the OSS codebase (https://github.com/letta-ai/letta) is basically focused on solving the memory/context problem in a generalized way (via "agentic context management"). if you're more interested in papers, we also worked on MemGPT and more recently sleep-time compute (https://arxiv.org/abs/2504.13171)

edunteman · 123d ago

Love your sleep time stuff! It's an inspiration for Muscle Mem

hnuser123456 · 123d ago

Fine tuning should be combined with inference in some way. However this requires keeping the model loaded at high enough precision for backprop to work.

Instead of hundreds of thousands of us downloading the latest and greatest model that won't fundamentally update one bit until we're graced with the next one, I would think we should all be able to fine-tune the weights so that it can naturally memorize new additional info and preferences without using up context length.

tom_m · 123d ago

Absolutely. The "intelligence" isn't complete without a memory. In fact there's a whole lot more to it than that. The LLM is one component, a logic factory, but there's so much more than the LLM and the memory.

In fact, systems should be LLM agnostic or use different models for different needs.

I don't believe building something into the model will ever be the solution though. It is interesting what Google is trying to do with model caching but at the end of the day I believe the strength of agents here will rely heavily upon modularity.

deepdarkforest · 124d ago

Not sure if this can work. We played around with something similar too for computer use, but comparing embeddings to cache validate the starting position is super gray, no clear threshold. For example, the datetime on the bottom right changes. Or if it's an app with a database etc, it can change the embeddings arbitrarily. Also, you must do this in every step, because as you said, things might break at any point. I just don't see how you can reliably validate. If anything, if models are cheap, you could use another cheaper llm call to compare screenshots, or adjust the playwright/api script on the fly. We ended up writing up a quite different approach that worked surprisingly well.

There are definitely a lot of potential solutions, I'm curious where this goes. IMO an embeddings approach won't be enough. I'm more than happy to discuss what we did internally to achieve a decent rate though, the space is super promising for sure.

edunteman · 124d ago

Thanks for sharing your experience! I'd love to chat about what you did to make this work, if I may use it to inform the design of this system. I'm at erik [at] pig.dev

To clarify, the use of CLIP embeddings in the CUA example is an implementation decision for the CUA example, not core to the engine itself.

This was very intentional in the design of Check being a pair of Capture() -> T and Compare(current: T, candidate: T) -> bool. T can be any data type that can serialize to a DB, and the comparison is user-defined to operate on that generic type T.

A more complete CUA example would store features like OCR'ed text, Accessibility Tree data, etc.

I'll use now to call out a few outstanding questions that I don't yet have answers for:

- Parameterization. Rather than caching and reusing strict coordinates, what happens when the arguments of a tool call are derived from the top level prompt, or even more challenging, as the result of a previous tool call. In the case of computer use, perhaps a very specific element x-path is needed, but that element is not "compile time known", rather derived mid-trajectory.

- What would it look like to stack compare filters? IE, if a user wanted to first filter by cosine distance, and then later apply more strict checks on OCR contents.

- As you mentioned, how can you store some knowledge of environment features where change *is* expected. Datetime in the bottom right is the perfect example of this.

arathis · 124d ago

Hey, working on a personal project. Would love to dig into how you approached this.

web-cowboy · 124d ago

This seems like a much more powerful version of what I wanted MCP "prompts" to be - and I'm curious to know if I have that right.

For me, I want to reduce the friction on repeatable tasks. For example, I often need to create a new GraphQL query, which also requires updating the example query collection, creating a basic new integration test, etc. If I had a MCP-accessible prompt, I hoped the agent would realize I have a set of instructions on how to handle this request when I make it.

edunteman · 124d ago

In a way, a Muscle Mem trajectory is just a new "meta tool" that combines sequential use of other tools, with parameters that flow through it all.

One form factor I toyed with was the idea of a dynamically evolving list of tool specs given to a model or presented from an MCP server, but I wasn't thrilled by:

- that'd still require a model in the loop to choose the tool, albeit just once for the whole trajectory vs every step

- it misses the sneaky challenge of Muscle Memory systems, which is continuous cache validation. An environment can change unexpectedly mid-trajectory and a system would need to adapt, so no matter what, it needs something that looks like Muscle Mem's Check abstraction for pre/post step cache validation

parsabg · 124d ago

I've been thinking about this (funnily, also while building a browser use type agent [1]) and I think this is a solid direction to explore. My implementation stores tuples of (context, task, tool_sequence) after a successful task completion, e.g.: (Instagram.com, "check the user's notifications", [browser_click, browser_read_text, ...]).

One can imagine an agent-to-agent marketplace where agents publish and consume such memories, standardized by canonical references to the MCP tools they've used, and possibly put a price on it based on how much work it would take another agent to "discover" that useful computational path. Then the consuming agents can make a "build vs buy" decision.

The core issue is in creating meaningful notions of "context" across the universe of tasks and environments. I'm skeptical of embeddings for that reason, and I think reducing false positives/negatives for cache hits is more important than efficiency in the short term, so perhaps a rich textual description of the context is a good short term compromise.

[1] https://github.com/parsaghaffari/browserbee

swairshah · 124d ago

The tricky question is what env state do you compare in order to decide if its a cache hit or miss? just screen status, screen status + all open apps, that + all running processes etc. you know what i mean. I think its a solvable problem and very interesting one too. One now needs to think about what would a human consider while using a muscle memory and that varies based on the action i.e. "rm -rf ." requires knowing what directory i am in vs "click close + don't save" requires knowing I don't want the recent change.

dmos62 · 124d ago

I love the minimal approach and general-use focus.

If I understand correctly, the engine caches trajectories in the simplest way possible, so if you have a cached trajectory a-b-c, and you encounter c-b-d, there's no way to get a "partial" cache hit, right? As I'm wrapping my head around this, I'm thinking that the engine would have to be a great deal more complicated to be able to judge when it's a safe partial hit.

Basically, I'm trying to imagine how applicable this approach could be to a significantly noisier environment.

edunteman · 124d ago

I struggled with this one for a while in the design, and didn't want to be hasty in making any decisions that lock us into a direction.

I definitely want to support sub-trajectories. In fact, I believe an absolutely killer feature for this system would be decomposing a large trajectory into smaller, more repeated sub-trajectories.

Jeff from trychroma.com often talks about agent engineering as being more like industrial engineering than software eng, and I'd agree.

One part of the original spec I wrote for this included a component I call the "Compactor", which would be a background agent process to modify and compress learned skills, is similar to Letta's sleep time agents:

https://docs.letta.com/guides/agents/sleep-time-agents

My fear with this is it goes against the `No hidden nondeterminism` design value I stated in the launch blog. There's plenty of things we can throw background agents at, from the Compactor to parameterizing trajectories, but that's risky territory from an observability and debugability stance.

For simplicity, I just decided treat every trajectory as distinct, even if portions of it are redundant. If a cached trajectory fails a check halfway through, the agent proceeding from there just makes its own partial trajectory. Still unclear if we call that a trajectory for the same named task, or if we annotated it as a task recovery.

We can always increase cache-hit rate over time, worst case is the agent just does redundant work which is the status quo anyway.

dmos62 · 124d ago

It occurred to me that the cache could be indexed not only by environment state but also by intent. A second agent could subdivide trajectories into steps, upgrading trajectories into ordered lists of sub-trajectories. Each trajectory and list would have an intent attached and would be aware of parent list's (i.e. parent "super-trajectory's") intent. And therefore could be embedded and looked up by an agent given its own intent. Not sure if this train of thought is easy to follow.

That's more auto-magical than you might care for. I've been designing an IDE where you program with intent statements and the generated code is a second-class-citizen, so I might be biased in suggesting this.

edunteman · 124d ago

totally follows! thanks for sharing, will noodle on it

dbish · 124d ago

Do you see these trajectories being used to fine tune a model automatically in some way rather then just replay, that way similar workflows might be improved too?

edunteman · 124d ago

I believe explicit trajectories for learned behavior are significantly easier for humans to grok and debug, in contrast to reinforcement learning methods like deep Q-learning, so avoiding the use of models is ideal, but I imagine they'll have their place.

For what that may look like, I'll reuse a brainstorm on this topic that a friend gave me recently:

"Instead of relying on an LLM to understand where to click, the click area itself is the token. And the click is a token and the objective is a token and the output is whatever. Such that, click paths aren't "stored", they're embedded within the training of the LAM/LLM"

Whatever it ends up looking like, as long as it gets the job done and remains debuggable and extensible enough to not immediately eject a user once they hit any level of complexity, I'd be happy for it to be a part of Muscle Mem.

huevosabio · 124d ago

I love the idea!

I think the problem will be defining wether there is a cache-hit or not, since "agents" are loosely defined and the tasks include basically anything.

edunteman · 124d ago

I agree, Cache Validation is the singular concern of Muscle Mem.

If you boil it down, for a generic enough task and environment, the engine is just a database of previous environments and a user-provided filter function for cache validation

allmathl · 124d ago

> At Pig, we built computer-use agents for automating legacy Windows applications (healthcare, lending, manufacturing, etc).

How do you justify this vs fixing the software to enable scripting? That seems both cheaper and easier to achieve and with far higher yields. Assume market rate servicing of course.

Plus; how do you force an "agent" to correct its behavior?

nawgz · 124d ago

Sorry, am I missing something? They obviously do not control source for these applications, but are able to gain access to whatever benefit the software originally had - reporting, tracking, API, whatever - by automating data entry tasks with AI.

Legacy software is frequently useful but difficult to install and access.

No comments yet

nico · 124d ago

Very cool concept!

I wish v0, lovable, bolt et al did something like this with their suggested prompts

It’s such a poor user experience to pick a template, wait 3-5min to generate and then be dumped randomly either on an incredibly refined prototype, or a ton of code that just errors out. And in either case, having no clue what to do next

fkjadoon94 · 123d ago

Excited to check it out!

Erik is a true hustler and ML infra builder at the core, not in this for the antics. Known him since the Booste days.

edunteman · 123d ago

Great to see you here :) thanks for the kind words!

android521 · 124d ago

Hi, Erik, any plan to release a typescript library? I think you could reach more developers with a typescript version

edunteman · 124d ago

Definitely, but there's lots of API surface area to figure out first, so will stick strictly to python for now. I can see a future where this ends up with a native storage/query core and with python/js bindings on top.

craftedid · 124d ago

Would love to see a TypeScript version too — feels like it could unlock some interesting use cases beyond backend agents.

lherron · 124d ago

Feels kinda like JIT compiling your agent prompts into code. Awesome concept, hope it pans out.

zombiwoof · 124d ago

I love how we are back to “it could have just been a bash script” territory already

ramesh31 · 123d ago

Sure, but who wrote the script? How did they know what to write? The idea is that your agent now does this 100 times faster than a person.

DrNosferatu · 124d ago

Wrote something similar into my rules - obtained mixed results.

Curious how effective this is.

joshstrange · 124d ago

This is a neat idea and is similar to something I've been turning over in my head. LLMs are very powerful for taking a bunch of disparate tools/information/etc and generating good results but the speed is a big issue as well as reproducibility.

I keep imagining an Agent that writes a bunch of custom tools when it needs it and "saves" them for later use. Creating pipelines in code/config that it can reuse instead of solving from 0 each time.

Essentially, I want to use LLM for what they are good for (edge cases, fuzzy instructions/data) and have it turn around to write reusable tools so that the next time it doesn't have to run the full LLM, it can use a tiny LLM router up front to determine if there exists a tool to do this already. I'm not talking about MCP (though that is cool), this would use MCP tools but it could make new ones from the existing.

Here is an example.

Imagine I have an Agent with MCP tools to read/write to my email, calendar, ticketing system, and slack. I can ask the LLM to slack me every morning with an overview of my events for the day and anything outstanding I need to address. Maybe the first pass uses a frontier model to determine which tools to use and it accomplishes this task. Once I'm happy with the output then the Agent feeds the conversation/tool calls into another LLM to distill it to a Python/Node/Bash/whatever script. That script would call the MCP tools to do the same thing and use small LLMs to glue the results together and then it creates a cron (or similar) entry to have that run every morning.

I feel like this would remove a ton of the uncertainty when it comes to which tools an LLM uses without requiring humans to write custom flows with limited tools available for each task.

So the first pass would be:

    User: Please check my email, calendar, and slack for what I need to focus on today.
    
    LLM: Tool Call: Read Unread Email
    
    LLM: Tool Call: Read last 7 days of emails the user replied to
    
    LLM: Tool Call: Read this week's events from calendar
    
    LLM: Tool Call: Read unread slack messages

    LLM: Tool Call: Read tickets in this sprint

    LLM: Tool Call: Read unread comments on tickets assigned to me
    
    LLM: Tool Call: Read slack messages conversations from yesterday
    
    LLM: Please use the following data to determine what the user needs to focus on today: <Inject context from tool calls>
    
    LLM: It looks like have 3 meetings today at.....

Then a fresh LLM reviews that and writes a script to do all the tool calls and jump to the last "Please use the following data" prompt which can be reused (cron'd or just called when it makes sense).

I might be way off-base and I don't work in the space (I just play around the edges) but this feels like a way to let agents "learn" and grow. I've just found that in practice you don't get good results from throwing all your tools at 1 big LLM with your prompt, you're better off limiting the tools and even creating compound tools for certain jobs you do over and over. I've found that lots of little tool calls add up and take a long time so a way for the agent to dynamically create tools from combining other tools seems like a huge win.

edunteman · 124d ago

This is very similar to what Voyager did https://arxiv.org/abs/2305.16291

Their implementation uses actual code, JS scripts in their case, as the stored trajectories, which has the neat feature of parameterization built in so trajectories are more reusable.

I experimented with this for a bit for Muscle Mem, but a trajectory being performed by just-in-time generated scripts felt too magical and wild west. An explicit goal of Muscle Mem is to be a deterministic system, more like a DB, on which you as a user can layer as much nondeterminism as you feel comfortable with.

ivanovm · 124d ago

would just caching llm responses work here?

ramesh31 · 123d ago

Which responses? For how long? With what level of detail? Those are the questions we are all trying to figure out now, and the performance of your agent is highly dependent on the answer to that.

edunteman · 124d ago

you mean like https://www.anthropic.com/news/prompt-caching

or just saving LLM chat message history?

If the latter, saving chat history is useless without some snapshot of the environment in which it was performed. Muscle Mem is an environment cache more than it is an LLM cache.

revskill · 124d ago

Promp is everything. I do not trust ai intelligencd anymore. The worst case is ai is a bug generator at an expensive cost without any of quality.

hackgician · 124d ago

accessibility (a11y) trees are super helpful for LLMs; we use them extensively in stagehand! the context is nice for browsers, since you have existing frameworks like selenium/playwright/puppeteer for actually acting on nodes in the a11y tree.

what does that analog look like in more traditional computer use?

ctoth · 124d ago

There are a variety of accessibility frameworks from MSAA (old, windows-only) IA2, JAB, UIA (newer). NVDA from NV Access has an abstraction over these APIs to standardize gathering roles and other information from the matrix of a11y providers, though note the GPL license depending on how you want to use it.

edunteman · 124d ago

Our experience working with A11y apis like above is that data is frequently missing, and the APIs can be shockingly slow to read from. The highest performing agents in WindowsArena use a mixture of A11y and yolo-like grounding models such as Omniparser, with A11y seeming shifting out of vogue in favor of computer vision, due to it giving incomplete context.

Talking with users who just write their own RPA, they most loved APIs for doing so was consistently https://github.com/asweigart/pyautogui, which does offer A11y APIs but they're messy enough that many of the teams I talked to used the pyautogui.locateOnScreen('button.png') fuzzy image matching feature.

edunteman · 124d ago

another currently unanswered question in Muscle Mem is how to more cleanly express the targeting of named entities.

Currently, a user would have to explicitly have their @engine.tool call take an element ID as an argument to a click_element_by_name(id) in order for it to be reused. This works, but for Muscle Mem would lead to the codebase getting littered with hyper-specific functions that are there just to differentiate tools for Muscle Mem, which goes against the agent agnostic thesis of the project.

Still figuring out how to do this.

adchurch · 124d ago

now that's a beautiful api!

edunteman · 124d ago

Thank you!

Show HN: Pyproc – Call Python from Go Without CGO or Microservices (github.com)

Show HN: Daffodil – Open-Source Ecommerce Framework to connect to any platform (github.com)

Show HN: AI-powered web service combining FastAPI, Pydantic-AI, and MCP servers (github.com)

Show HN: A store that generates products from anything you type in search (anycrap.shop)

Show HN: Semlib – Semantic Data Processing (github.com)

Show HN: Datadef.io – Canvas for data lineage and metadata management (datadef.io)

Show HN: Omarchy on CachyOS (github.com)

Show HN: MCP Server Installation Instructions Generator (hyprmcp.com)

Show HN: HN Term – browse HN using the terminal (github.com)

Show HN: Dagger.js – A buildless, runtime-only JavaScript micro-framework (daggerjs.org)

Show HN: I reverse engineered macOS to allow custom Lock Screen wallpapers (cindori.com)

Show HN: InfiniteTalk AI – AI Lip-Sync Video Generator for Long Videos (infinitetalk.net)

Show HN: Blocks – Dream work apps and AI agents in minutes (blocks.diy)

Show HN: I built an app store for open-source financial plans (on spreadsheets) (finfam.app)

Show HN: Open Line Protocol – a minimal wire for AI agents (MIT) (github.com)

Show HN: Ruminate – AI reading tool for understanding hard things (tryruminate.com)

Show HN: A tool to make a bootable USB installer out of macOS, or download it (macdaddy.io)

Show HN: Small Transfers – charge from 0.000001 USD per request for your SaaS (smalltransfers.com)

Show HN: Labspace Directory – Biotech resource for lab space (labspacedirectory.com)

Show HN: Pooshit – Sync local code to remote Docker containers

Show HN: Vicinae – A native, Raycast-compatible launcher for Linux (github.com)

Show HN: Httpjail – HTTP(s) request filter for processes (github.com)

Show HN: A Daily Typing Challenge in the TUI (github.com)

Show HN: Helios, an open-source distributed AI network using idle community GPUs (github.com)

Show HN: Allzonefiles.io – download 307M registered domain names (allzonefiles.io)

Show HN: I made a generative online drum machine with ClojureScript (dopeloop.ai)

Show HN: Ultraplot – A succint wrapper for matplotlib (github.com)

Show HN: Building an open-source agentic terminal (davehudson.io)

Show HN: Term.everything – Run any GUI app in the terminal (github.com)

Show HN: CLAVIER-36 – A programming environment for generative music (clavier36.com)

Show HN: Interactive news headline generator compatible with i3/sway (github.com)

Show HN: Building a Deep Research Agent Using MCP-Agent (thealliance.ai)

Show HN: PaperSync, making ArXiv papers collaborative (hackcmu25.vercel.app)

Show HN: Demochain, a toy blockchain network that runs on the browser (github.com)

Show HN: TailGuard – Bridge your WireGuard router into Tailscale via a container (github.com)

Show HN: Bottlefire – Build single-executable microVMs from Docker images (bottlefire.dev)

Show HN: Making a cross-platform game in Go using WebRTC Datachannels (pion.ly)

Show HN: GitHub repo with 180 tools for investing (github.com)

Show HN: C++ Compiler Support Page (cppstat.dev)

Show HN: An MCP Gateway to block the lethal trifecta (github.com)

Show HN: Aris – a free AI-powered answer engine for kids (aris.chat)

Show HN: Haystack – Review pull requests like you wrote them yourself (haystackeditor.com)

Show HN: I made pgdbtemplate to cut PostgreSQL test time by 1.5x using templates (github.com)

Show HN: YC Startup Map – A Map Visualization of the YC Startup Directory (ycstartupmap.com)

Show HN: EpicPSA – Create PSA's for any message (epicpsa.com)

Show HN: DWS OS, a Plan 9 Inspired Web “OS” (dws.rip)

Show HN: Worried about your pet? Health assessments with instant answers (petcheckai.com)

Show HN: wcwidth-o1 – Find Unicode text cell width in no time for JavaScript/TS (github.com)

Show HN: From selling AI to QA teams to building a smooth test-management app (tester.desplega.ai)

Show HN: Swimming in Tech Debt (helpthisbook.com)

Show HN: Muscle-Mem, a behavior cache for AI agents

Comments (51)