I had no idea WebVoyager only spanned 15 websites lol... the 452 figure you have still seems a little low though - do you have plans to expand it? It seems like you'd want as many sites as possible to improve the real-world accuracy of agents, given the long-tail nature of website traffic.

suchintan · 1d ago

We definitely plan to expand it. I want to get to ~10,000 for a reasonable benchmark.

15 blew my mind -- it's too easy to overfit that dataset

helsinki · 18h ago

Does anyone use Skyvern to build their websites? I’m wondering how I might benefit from using an agentic browser workflow instead of a playwright MCP server for building a web UI?

vasusen · 19h ago

Thank you so much for creating this, folks! A browser navigation agent is a key part of our AI QA setup at Donobu (https://donobu.com/). We found the WebVoyager benchmarks severely lacking for complex e2e test cases like logged-in dashboards, onboarding forms, etc.

While the extraction/2fa flows aren't super relevant to us, this saves us time from building our own set of benchmarks. Really appreciate it and hope we can contribute to make this a really large set.

suchintan · 16h ago

That would be amazing!!

gitmagic · 20h ago

Would love to see how Nelly [0] performs on this benchmark.

[0] https://nelly.is

suchintan · 19h ago

Very cool. The benchmark can be found here if you want to take a look at it: https://github.com/Halluminate/WebBench

pants2 · 17h ago

Great work! Big fan of Skyvern.

Looking forward to the benchmarks on Claude 4 (and o3 CUA when that's released)

wm2 · 1d ago

super cool!

Web Bench: a new way to compare AI browser agents

Comments (9)