Pure-vision browser agent scores 94% on WebVoyager (SOTA)
anerli | 7/7/2025, 6:07:39 PM | github.com
You can view the entire run here: https://magnitude-webvoyager.vercel.app/
The original WebVoyager benchmark was designed to demonstrate a new technique for interacting with the browser by annotating the DOM. Since then, vision models have come a long way in accuracy and visual understanding. Our pure-vision approach, built on our framework and today's models, surpasses the hybrid DOM strategies used by the original WebVoyager paper and by other agents like browser-use.
So why does pure-vision beat hybrid DOM approaches?
- Generalizes far better: it elegantly handles canvas elements, iframes, drag-and-drop, precise text selection, and many other scenarios where hybrid DOM approaches struggle and need case-specific hacks.
- Easier for the LLM: we think LLM performance is roughly proportional to prompt clarity. Compare a crowded screenshot covered in colored boxes plus a long list of element labels, with the model asked to pick one, against a clean screenshot and the simple question "where do you want to click?". The latter seems far easier. A sketch of this loop follows the list.
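
To make the interaction paradigm concrete, here is a minimal sketch of a pure-vision loop. This is not Magnitude's actual API: it drives the browser with plain Playwright, askModel is a hypothetical helper standing in for a vision-LLM call, and the URL is a placeholder.

    // Minimal pure-vision loop: clean screenshot in, pixel action out.
    import { chromium } from "playwright";

    // Hypothetical helper (an assumption, not a real API): sends the task
    // and a clean screenshot to a vision LLM, parses the reply into an action.
    declare function askModel(
      task: string,
      screenshot: Buffer
    ): Promise<
      | { kind: "click"; x: number; y: number }
      | { kind: "type"; text: string }
      | { kind: "done" }
    >;

    async function run(task: string) {
      const browser = await chromium.launch();
      const page = await browser.newPage();
      await page.goto("https://example.com"); // placeholder start page

      for (let step = 0; step < 30; step++) {
        // No bounding boxes, no element labels: just the raw screenshot.
        const screenshot = await page.screenshot();
        const action = await askModel(task, screenshot);
        if (action.kind === "done") break;
        if (action.kind === "click") await page.mouse.click(action.x, action.y);
        else await page.keyboard.type(action.text);
      }
      await browser.close();
    }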
We believe another reason for our success is that we can still hook into the browser as needed: we use browser-native actions like tab switching, watch network traffic to know when a page is ready, and read the DOM for other purposes like data extraction. Computer-use agents like Operator or Claude Computer Use, by contrast, are limited to generic mouse and keyboard controls. A sketch of these hooks is below.
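
For illustration, those hooks look roughly like the following in plain Playwright (again an assumption-level sketch, not our framework's API; URLs are placeholders):

    import { chromium } from "playwright";

    async function demoHooks() {
      const browser = await chromium.launch();
      const context = await browser.newContext();
      const page = await context.newPage();

      // Watch network traffic to know when the page has settled,
      // instead of guessing readiness from pixels alone.
      await page.goto("https://example.com", { waitUntil: "networkidle" });

      // Read the DOM for data extraction, even though actions stay vision-driven.
      const headings = await page.$$eval("h2", (els) =>
        els.map((el) => el.textContent)
      );
      console.log(headings);

      // Browser-native tab handling: open a second tab, then switch back.
      // Generic mouse/keyboard agents have no direct equivalent of this.
      const secondTab = await context.newPage();
      await secondTab.goto("https://example.org");
      await page.bringToFront();

      await browser.close();
    }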
It's worth mentioning that WebVoyager is a strange and flawed benchmark. It contains many tasks that depend on the current date (and need their dates updated), tasks that depend on the time of day, and some tasks that are impossible or too ambiguous to evaluate properly. In the repo we detail exactly which patches we made to the original benchmark so that each task is at least theoretically possible.
Why does this all matter? People are trying to adopt agents for real use cases, but those agents often fail to make it to production. We want to enable developers to build with production-ready browser agents, which is why getting the fundamental interaction paradigm right matters. We think this benchmark is a step in the right direction, showing that pure vision has best-in-class performance in the browser domain. Curious to hear what others think about this; we'd love your feedback!