DeepCodeBench: Real-World Codebase Understanding by Q&A Benchmarking

Comments (1)

four_fifths · 8m ago

If you dig into most of the popular benchmarks that the big labs report on, you'll see pretty quickly that they have almost zero correlation with real-world tasks.

The approach they're taking here, working backwards from an open-source repo pull request and reverse engineering a question from it, is unusually well thought out for a benchmark.

I haven't dug into more of the dataset questions yet, but the example in the blog post, a question generated from the Hugging Face Transformers repo, gives me hope that this could actually be a solid benchmark:

> How do the fast image and video processor base classes prevent shared mutable state when instantiating multiple instances?
