Show HN: Magnitude – open-source, AI-native test framework for web apps
We know there's a lot of noise about different browser agents. If you've tried any of them, you know they're slow, expensive, and inconsistent. That's why we built an agent specifically for running test cases and optimized it just for that:
- Pure vision instead of the error-prone "set-of-marks" system (the colorful boxes you see in browser-use, for example)
- Use a tiny VLM (Moondream) instead of OpenAI/Anthropic computer use for dramatically faster and cheaper execution
- Use two agents: one for planning and adapting test cases, and one for executing them quickly and consistently.
The idea is the planner builds up a general plan which the executor runs. We can save this plan and re-run it with only the executor for quick, cheap, and consistent runs. When something goes wrong, it can kick back out to the planner agent and re-adjust the test.
It’s completely open source. Would love to have more people try it out and tell us how we can make it great.
I've recently been thinking about testing/QA with VLMs + LLMs. One area I haven't seen explored (but should 100% be feasible) is to have the first run be LLM + VLM, and then have the LLM(s?) write repeatable "cheap" tests with traditional libraries (Playwright, Puppeteer, etc.). On every run you do the "cheap" traditional checks; if any fail, you go back to the LLM + VLM to see what broke, and only fail the test if both fail. Makes sense?
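A minimal sketch of that control flow, assuming hypothetical `runCheapChecks` / `runAgentCheck` helpers that wrap your generated traditional tests and your LLM + VLM agent respectively:

```typescript
// Hybrid strategy sketch: deterministic checks first, agent only as a fallback.
// runCheapChecks / runAgentCheck are hypothetical stand-ins for your own code.

async function hybridTest(
  runCheapChecks: () => Promise<boolean>, // e.g. a generated Playwright/Puppeteer script
  runAgentCheck: () => Promise<boolean>   // e.g. re-run the natural-language test with LLM + VLM
): Promise<"pass" | "fail"> {
  if (await runCheapChecks()) return "pass"; // fast, cheap path on every run
  // Cheap check broke: let the agent look at the page and decide what actually happened.
  const agentPassed = await runAgentCheck();
  return agentPassed ? "pass" : "fail"; // only fail the test if both fail
}
```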
Instead of caching actual code, we cache a "plan" of specific web actions that are still described in natural language.
For example, a cached "typing" action might look like: { variant: 'type'; target: string; content: string; }
The target is a natural language description. The content is what to type. Moondream's job is simply to find the target, and then we will click into that target and type whatever content. This means it can be full vision and not rely on the DOM at all, while still being very consistent. Moondream is also trivially cheap to run since it's only a 2B model. If it can't find the target or its confidence changes significantly (using token probabilities), that's an indication that the action/plan requires adjustment, and we can dynamically swap in the planner LLM to decide how to adjust the test from there.
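Roughly what executing one of those cached actions could look like (a sketch, not Magnitude's actual code; `locateWithVLM` is a hypothetical wrapper around a Moondream pointing query):

```typescript
import type { Page } from "playwright";

// Shape of a cached "typing" action, as described above.
interface CachedAction {
  variant: "type";
  target: string;   // natural-language description, e.g. "the search box in the header"
  content: string;  // what to type
}

// Stub: in the real flow this would send the screenshot plus the target description
// to Moondream and return pixel coordinates with a confidence score derived from
// token probabilities. Hard-coded here for illustration only.
async function locateWithVLM(
  screenshot: Buffer,
  target: string
): Promise<{ x: number; y: number; confidence: number }> {
  return { x: 640, y: 360, confidence: 0.9 };
}

// Execute one cached action: locate the target visually, click it, type the content.
async function executeType(page: Page, action: CachedAction): Promise<boolean> {
  const screenshot = await page.screenshot();
  const hit = await locateWithVLM(screenshot, action.target);
  if (hit.confidence < 0.5) return false;   // signal that the plan needs adjusting
  await page.mouse.click(hit.x, hit.y);     // click into the located target
  await page.keyboard.type(action.content); // then type the cached content
  return true;
}
```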
We have multiple fallbacks to prevent flakes: the "cheap" command, a description of the intended step, and the original prompt.
If any step fails, we fall back to the next source.
1. https://docs.testdriver.ai/reference/test-steps
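As I read that fallback order, it could be expressed as something like this (a generic sketch, not TestDriver's actual API; the runner shape is made up):

```typescript
// Try the cheapest source first, only escalate when it fails.
type FallbackSource = "cheap command" | "step description" | "original prompt";

async function runStepWithFallbacks(
  runners: Array<[FallbackSource, () => Promise<boolean>]>
): Promise<FallbackSource | "failed"> {
  for (const [source, run] of runners) {
    if (await run()) return source; // report which source ended up succeeding
  }
  return "failed"; // every fallback failed -> a real failure, not a flake
}
```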
However, I do not see a big advantage over Cypress tests.
The article mentions shortcomings of Cypress (and Playwright):
> They start a dev server with bootstrapping code to load the component and/or setup code you want, which limits their ability to handle complex enterprise applications that might have OAuth or a complex build pipeline.
The simple solution is to containerise the whole application (including whatever OAuth provider is used), which then allows you to simply launch the whole thing and run the tests against it. Most apps (especially in enterprise) should already be containerised anyway, so most of the time we can just go ahead and run any tests against them.
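For example, Playwright's `webServer` config option can bring up the containerised stack before the tests run (a sketch; the compose command, port, and timeout are assumptions about your setup):

```typescript
// playwright.config.ts
import { defineConfig } from "@playwright/test";

export default defineConfig({
  use: {
    baseURL: "http://localhost:8080", // wherever the containerised app is exposed
  },
  webServer: {
    // Bring up the whole stack (app + OAuth provider + dependencies) before testing.
    command: "docker compose up --build",
    url: "http://localhost:8080",
    timeout: 180_000,
    reuseExistingServer: true,
  },
});
```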
How is SafeTest better than that when my goal is to test my application in a real world scenario?
I'll need a way to extract data as part of the tests, like screenshots and page content. This would allow supplementing the tests with non-Magnitude features, as well as adding things that are a bit more deterministic: assert that the added todo item exactly matches what was used as input data, screenshot diffs when the planner fallback came into play, execution log data, etc.
This isn't currently possible from what I can see in the docs, but maybe I'm wrong?
It'd also be ideal if it had an LLM-free executor mode to reduce costs and increase speed (caching outputs, or maybe use accessibility tree instead of VLM), and also fit requirements when the planner should not automatically kick in.
We plan to (very soon) enable mixing standard Playwright or other code in between Magnitude steps, which should enable doing exact assertions or anything else you want to do.
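For the todo example above, the kind of exact assertion this would let you drop in between steps is just plain Playwright (nothing here relies on Magnitude; the URL and locators are placeholders for illustration):

```typescript
import { test, expect } from "@playwright/test";

test("added todo item exactly matches the input data", async ({ page }) => {
  const todoText = "Buy oat milk";
  await page.goto("http://localhost:8080/todos"); // assumed app URL
  await page.getByPlaceholder("What needs to be done?").fill(todoText);
  await page.keyboard.press("Enter");
  // Deterministic check: the rendered item must match the input exactly.
  await expect(page.getByRole("listitem").last()).toHaveText(todoText);
  // Screenshot to diff later, e.g. when the planner fallback kicked in.
  await page.screenshot({ path: "todo-after-add.png" });
});
```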
We definitely understand the need to reduce costs and increase speed, which we think will mainly be addressed by our plan-caching system, with cached plans executed by Moondream (a 2B model). Moondream is very fast and also has self-hosted options. However, there's no reason we couldn't potentially offer an option to generate pure Playwright for people who would prefer that instead.
We have a discord as well if you'd like to easily stay in touch about contributing: https://discord.gg/VcdpMh9tTy
Of course that would be even more valuable for testing your MCP or A2A services, but could be useful for UI as well. Or it could be useless. It would be interesting to see if the same UI changes affect both human and AI success rate in the same way.
And if not, could an AI be trained to correlate more closely with human behavior? That could be a good selling point if possible.
But what determines that the UI has changed for a specific URL? Is it your software, independent of the planner LLM, or do you require the visual LLM to make a determination of change?
You should also stop saying 100% open source when test plan generation and execution depend on non-open source AI components. It just doesn’t make sense.
We say 100% open source because all of our code (test runner and AI agents) is completely open source. It's also entirely possible to run a fully OSS stack, because you can configure it with an open-source planner LLM, and Moondream is open source. You could even run it all locally if you have solid hardware.
1. https://netflixtechblog.com/introducing-safetest-a-novel-app...
To test this, you need an OpenAI API key and to add it in the settings (it will be stored in your browser's localStorage). After that, you can use the microphone icon in the ribbon menu (press it once to start recording, press it again to stop, and the processing begins).
You can also test most things via text input in this app, but for example, I have another app for kids that supports audio input only. There, the kid can say 'I want to learn about apple trees' and the system creates apple tree content ;-) However, it also has some content filters to allow only content suited for certain age levels. That is something you might want to include in automated tests.
[1]: https://critically.app
Where it gets interesting is that we can save the execution plan that the big model comes up with and run with ONLY Moondream if the plan is specific enough, then switch back out to the big model if some action path requires adjustment. This means we can run repeated tests much more efficiently and consistently.
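A rough sketch of that loop under hypothetical interfaces (not Magnitude's actual internals): replay each cached action with the small model, and only wake the big model up when an action can't be completed.

```typescript
// Hypothetical shapes for the two agents.
interface PlannedAction {
  variant: string;  // e.g. 'click', 'type'
  target: string;   // natural-language target description
  content?: string;
}

interface Executor {
  run(action: PlannedAction): Promise<{ ok: boolean }>; // Moondream-backed: cheap, fast
}

interface Planner {
  revise(plan: PlannedAction[], failedIndex: number): Promise<PlannedAction[]>; // big LLM
}

// Replay a cached plan with only the executor; fall back to the planner on failure.
async function runCachedPlan(
  plan: PlannedAction[],
  executor: Executor,
  planner: Planner,
  maxReplans = 3 // cap fallbacks so a genuinely broken app still fails the test
): Promise<boolean> {
  let replans = 0;
  for (let i = 0; i < plan.length; i++) {
    const { ok } = await executor.run(plan[i]);
    if (ok) continue;
    if (replans++ >= maxReplans) return false;
    // Let the big model rewrite the remaining plan, then retry from the same index.
    plan = await planner.revise(plan, i);
    i--;
  }
  return true;
}
```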
There's also https://github.com/lm-sys/RouteLLM and other similar projects.
I guess your system is not as oriented toward open-ended tasks, so you can just build workflows deciding which model to use at each step; these routing mechanisms are more useful for open-ended tasks that don't fit into a workflow so well (maybe?)
One benefit of not using pure vision is that it's a strong signal to developers to make pages accessible. This would let them off the hook.
Perhaps testing both paths separately would be more appropriate. I could imagine a different AI agent attempting to navigate the page through accessibility landmarks. Or even different agents that simulate different types of disabilities.
For instance, I just discovered there are a ton of high-quality scans of film and slides available at the Library of Congress website, but I don't really enjoy their interface. I could build a scraping tool and get too much info, or suffer through clicking around their search UI. Or I could ask my browser-tool-wielding LLM agent to automate the boring stuff and provide a map of the subjects I'd be interested in, giving me a different way to discover things. I've just discovered the whole browser automation thing, and I'm having fun letting my LLM go "research" for a few minutes while I go do something else.
What this means for developers writing tests is that you don't really have to worry about it. A "step" in Magnitude can map to any number of web actions dynamically, based on the description, and the agents will figure out how to do it repeatably.
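Concretely, a single natural-language step could expand into several cached actions of the shape shown earlier in the thread (a sketch; any fields beyond variant/target/content, and the action list itself, are my assumption):

```typescript
type WebAction =
  | { variant: "click"; target: string }
  | { variant: "type"; target: string; content: string };

// One high-level step...
const step = "Log in as the test user";

// ...might expand into several concrete actions the executor can replay.
const cachedActions: WebAction[] = [
  { variant: "type", target: "the email field on the login form", content: "test@example.com" },
  { variant: "type", target: "the password field", content: "hunter2" },
  { variant: "click", target: "the 'Sign in' button" },
];
```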
edit: tracking here https://github.com/magnitudedev/magnitude/issues/6