Launch HN: Morph (YC S23) – Apply AI code edits at 4,500 tokens/sec
Here's a demo video: https://www.youtube.com/watch?v=LdT8epGHJPk.
Why? AI spits out code that can’t reliably be inserted into existing code. Full-file rewrites and brittle search-and-replace hacks are too slow, expensive, or error-prone.
Morph's approach:
- Your agent outputs edits “lazily”, referencing unmodified lines in the existing file (ex: // ...existing code...)
- Morph instantly applies these edits to a file using our Fast Apply model + speculative decoding against the original file, making AI patches fast, reliable, and production-ready.
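Roughly, an apply call looks like the sketch below, using the openai Python client. The base URL, model name, and <code>/<update> tag format shown here are illustrative; the quickstart linked below is the source of truth.

    # Minimal sketch of the Fast Apply flow. Base URL, model name, and tag
    # format are assumptions here - check the docs before integrating.
    from openai import OpenAI

    client = OpenAI(
        api_key="YOUR_MORPH_API_KEY",            # placeholder
        base_url="https://api.morphllm.com/v1",  # assumed OpenAI-compatible endpoint
    )

    original = open("src/server.js").read()

    # The agent's "lazy" edit: only the changed lines, with untouched regions
    # collapsed into the marker comment described above.
    lazy_edit = """\
    // ...existing code...
    app.post('/login', rateLimit({ max: 5 }), async (req, res) => {
      // ...existing code...
    });
    // ...existing code...
    """

    resp = client.chat.completions.create(
        model="morph-v3-fast",
        messages=[{
            "role": "user",
            "content": f"<code>{original}</code>\n<update>{lazy_edit}</update>",
        }],
    )

    merged = resp.choices[0].message.content  # full file with the edit applied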
This approach was pioneered by Cursor last year, but their models aren’t available as APIs—so we built Morph for developers everywhere (with a large free tier!)
Live demo (no signup): https://morphllm.com/dashboard and docs: https://docs.morphllm.com/quickstart
We have 2 Fast Apply models: morph-v3-fast - 4500+ tok/sec, and morph-v3-large - 2500+ tok/sec. These models power Fast Apply at create.xyz, databutton, continue.dev, and more!
We also provide retrieval models for embedding + reranking. Next up: an Inline Edit model (Cmd-K) for extremely fast inline edits that keep you in dev flow state, and the Morph Tab API, a Next Edit Prediction model that guesses your next code edit + action with sub-500ms latency. The Tab API is currently in private beta, but you can request early access here: https://morphllm.com/tab
Hot takes:
1) Raw inference speed matters more than incremental accuracy gains for dev UX—agree or disagree?
2) Full-file rewrites by frontier models are legacy—Fast Apply edits win on speed, cost, reliability.
3) As benchmarks on narrow tasks saturate to 99%+, complexity is shifting from single frontier models to specialized, inference-optimized models. As frontier models move upmarket, they'll leave the simple tasks behind and be reserved for the work only they can do.
We’d love to hear your ideas and experiences with coding agents!
I know you are trying to generate some controversy/visibility, but I think if we are being transparent here, you know this is wrong. People prefer using larger (or reasoning) models, with a much bigger difference in tok/sec, just for quality in coding; quality comes first. Even if I have a big edit to apply, like 5k tokens, 200-300ms of difference in edit time is nothing. Edit speed is definitely not a bottleneck for dev UX, quality is. A dev who wants to save 200ms on every code change at the expense of quality is someone I, well, cannot relate to. If I'm using 1-2 agents in parallel, most of the time the edits are already applied while I'm reviewing code from the other agents. But again, maybe that's just me.
Speaking of quality, how do you measure it? Do you have any benchmarks? How big is the difference in error rate between the fast and large model?
There's definitely a tipping point though. If the accuracy gains are so high that I can check its work less carefully or less often, the benefits of inference speed are effectively nil.
But I honestly feel like the task of smartly applying edits falls somewhat within traditional coding tasks. What about it is so difficult it could not be done with a smart diffing algorithm?
It may take a bit of explaining and that's OK. But the big question is: as someone building my enterprise microservice who isn't heavy into AI, why do I switch to you?
Quality is measured in 2 main ways:
1) End-to-end: user query -> task resolution. These are aider-style benchmarks answering the question of actual task completion.
2) Apply quality: syntax correctness, character-level diff accuracy, etc.
The error rate for large vs fast is around 2%. If you're doing code edits that are extremely complex or on obscure languages - large is the better option. There's also an auto option to route to the model we think is best for a task
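To make the apply-quality side concrete, here's a rough sketch of the kind of check we mean. It's illustrative only, not our internal benchmark harness; the function and its inputs are assumptions.

    # Illustrative apply-quality check (not Morph's internal harness):
    # score a merged file against a hand-written ground truth on two axes.
    import ast
    import difflib

    def apply_quality(merged: str, expected: str) -> dict:
        # 1) Syntax correctness: does the merged file still parse?
        #    (assumes the target file is Python)
        try:
            ast.parse(merged)
            syntax_ok = True
        except SyntaxError:
            syntax_ok = False

        # 2) Character-level similarity to the expected result (1.0 = identical).
        char_similarity = difflib.SequenceMatcher(None, merged, expected).ratio()

        return {"syntax_ok": syntax_ok, "char_similarity": char_similarity}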
https://inference.cerebras.ai/ and https://groq.com/ and https://deepmind.google/models/gemini-diffusion/ (waitlisted) are all 10 to 100x faster than regular models, which really does have a meaningful impact on how I interact with them because I don't have to disengage for 15+ seconds while I wait for a response.
I have video demos of a few of those: https://simonwillison.net/2024/Oct/25/llm-cerebras/ and https://simonwillison.net/2024/Oct/31/cerebras-coder/ and https://simonwillison.net/2025/May/21/gemini-diffusion/
fuck. THAT.
Personally I work on multiple repos at a time to solve for this
- Multiple repos or independent changes in monorepo
- First round of changes idgaf about anything beyond public interface and unit tests
Now it's dying in the same place. Thankfully I got to spend the brunt of my career working through the fun, intermediate years.
However, it's kind of a trope for me at this point that people assume a negative opinion of using generative AI in the development process is due to a lack of experience using it.
Personally, I find flow state hard to achieve when I constantly have to switch modes to debugging LLM output or an edit error that I missed.
When the majority of time is spent waiting for the main LLM to think, I will always wait a few extra seconds for a better edit than risk having to spend multiple cycles playing find-the-bug because something didn't get applied correctly somewhere.
I don't know if the quality and speed are linearly related, though.
I imagine the speed difference might not matter so much if you are performing seismic updates across a codebase though.
Intensely sticky user experience
Many tasks work better with iteration/supervision and Sonnet makes that feasible.
If these "hot takes" extend into Morph's own development philosophy, then I can be glad to not be a user.
I don't mean to be rude, but I can't imagine you're selling a product on-par with Claude 3.7. Some level of performance tradeoff has to be acceptable if you prioritize latency this hard.
Our whole thesis is that Claude and Gemini are extremely good at reasoning/coding - so you should let them do that, and pass it to Morph Fast Apply to merge changes in.
Whatever LLM you're using will have a baseline error rate a lot higher than 2%, so you're going to be reviewing all the code it outputs regardless.
Request: please provide a system prompt in the docs to help the llm generate the diff format that performs best w/ your models. LLMs frequently change the way they present diffs on upgrades and I don't want to be guessing which format is best.
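Something like this is what I have in mind, just a sketch of the kind of snippet I'd want pinned in the docs, not Morph's actual recommended wording:

    # Sketch of an "edit format" system prompt - purely illustrative,
    # not Morph's published guidance.
    FAST_APPLY_EDIT_PROMPT = """\
    When you modify a file, output only the code that changes.
    Collapse every unchanged region into a single marker comment:
      // ...existing code...
    Never renumber lines, reorder unchanged functions, or restate
    code that is not part of the edit.
    """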
EDIT: Please clarify your privacy policy. If my interpretation is correct, paying users will have their data retained and trained on? Is there any way to pay to use the service (w/o picking up the phone) and not have my data trained on?
[0] https://morphllm.com/privacy
Morph via OpenRouter is always zero data retention.
I used the provided HTML example on https://morphllm.com/dashboard/playground/apply. Without editing anything at all, I pressed apply.
Your model added a bunch of CSS even though that wasn't in the update instructions at all. It also added a contact section, which again, wasn't in the update instructions that your demo provided.
> 1) Raw inference speed matters [most] for dev UX—agree or disagree?
Or maybe incremental content-assist and full-file problem-solving are two significantly different uses, though they're both dev UX use cases.
Because they're confusingly similar, comparing them (and denigrating full-file solutions) wastes time/energy. You muddy your own message.
Just concentrate on showing the value of what you do where and when. To wit...
In the inference case, you're really using context to provide affordances -- next steps. In the full-file case, you're starting instead from a goal statement, with context providing constraints.
I think where you want to go is to show when the tool anticipates where you *should* go; i.e., the extent to which it can lead junior developers to the next step, and senior developers to the next constraint/issue they're ignoring.
I believe just as "attention is all you need" surprised people, this kind of bottom-up approach has more legs than people expect.
I understand the naked probability model is trained on world code corpus; what would interest me is whether you can also create a model that learns the developer's biases.
Then the work is to see the issues in the context, but address them in the order and manner that the developer would. Lock-in would occur because, well, the system understands me. And it would be particularly nice when Programmer A wants to code like Programmer B. If your assistant has a model of Programmer B, the assistant could guide Programmer A in that direction.
Now if you meant one step further, i.e. the literal single developer, that's probably best served in context - albeit with a model that has learned developer biases.
Would love to chat about integrating the models into Kilo Code if you’re interested
You can contact me at brendan [at] kilocode [dot] ai
Morph is a tool for integrating the output of other LLMs and not an LLM itself? It doesn't generate 4500 tok/sec, it can edit 4500 tok/sec?
Google wrote AKYNIA. OpenAI wrote ChatGPT.
Also, are there any benchmarks comparing your fast apply models to others like Relace or even Llama via Cerebras? I’m particularly interested in output accuracy.
[0] https://www.relace.ai/
I'm also really curious about the XML tool calls in the documentation. I have not heard of this being the norm for tools like Cursor. Is that still the case? I feel like I'm pretty in the know about this stuff but must have missed that trend.
https://aider.chat/2024/08/14/code-in-json.html
https://fireworks.ai/blog/cursor
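The escaping issue those two posts circle around is easy to see with a toy example: asking a model to emit code inside JSON means every quote and backslash gets escaped, while XML-style tags carry the code verbatim. Sketch below, nothing Morph-specific:

    # Why wrapping code in JSON is harder on the model than XML-style tags:
    # JSON requires escaping quotes and backslashes, XML tags pass code through as-is.
    import json

    snippet = 'print("hello\\nworld")'

    as_json = json.dumps({"code": snippet})
    # -> {"code": "print(\"hello\\nworld\")"}   every quote/backslash escaped

    as_xml_tags = f"<code>\n{snippet}\n</code>"
    # -> the code appears exactly as written, no escaping needed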
Now I can be wrong, faster!
First I added their models to my ~/Library/Application Support/io.datasette.llm/extra-openai-models.yaml file and set the API key.
Then I saved an LLM template with their prompting pattern so I can run apply operations directly. The -t option is the template I named when I ran --save. The -p name value options then set the content for the template $code and $update variables. Example transcript here: https://gist.github.com/simonw/de67818603d448a3fee788ace2976...
One thing that worries me: since it's using XML-style tags <code> and <update>, if my own source code contains those tags I expect it may get confused.
That is a horrifying answer.
https://docs.anthropic.com/en/docs/claude-code/hooks
Actually - that's what this company should do. It should be an MCP server so anyone could plug it into any agent with a url and an API key.
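Roughly what I'm picturing, sketched with the MCP Python SDK's FastMCP helper; the Morph endpoint, model name, and tag format are assumptions, not confirmed details:

    # Sketch of wrapping Fast Apply as an MCP tool. Endpoint/model names are
    # assumptions; the server side uses the official MCP Python SDK's FastMCP.
    from mcp.server.fastmcp import FastMCP
    from openai import OpenAI

    mcp = FastMCP("morph-fast-apply")
    morph = OpenAI(api_key="YOUR_MORPH_API_KEY",
                   base_url="https://api.morphllm.com/v1")  # assumed base URL

    @mcp.tool()
    def apply_edit(original_code: str, lazy_edit: str) -> str:
        """Merge a lazy edit (with // ...existing code... markers) into the file."""
        resp = morph.chat.completions.create(
            model="morph-v3-fast",
            messages=[{"role": "user",
                       "content": f"<code>{original_code}</code>\n<update>{lazy_edit}</update>"}],
        )
        return resp.choices[0].message.content

    if __name__ == "__main__":
        mcp.run()  # stdio transport by default; a hosted version would expose HTTP instead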
Edit: I'd be particularly interested if there's a way to run a sort of comparison mode for a while, so I can get a sense of how much accuracy I'm losing, if any. Even at the cost of initial performance.
My read is that despite Claude moving upmarket in what it can do, they are keen on clinging to all the (token heavy) tasks they're leaving behind
Yeah, I love reviewing and debugging thousands of lines of buggy and dirty AI-generated code. Who wouldn't love it?
https://man7.org/linux/man-pages/man1/patch.1.html
https://github.com/x1xhlol/system-prompts-and-models-of-ai-t...
Sending code externally is meh, especially for companies with tight security rules. We do self-hosting for them in their infra.
Because Ruby needs no correcting. It works.