> We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%.
I'm posting this because it seems very important both for users of LLMs and for developers integrating LLMs into their own products.
The fall-off in accuracy is far faster and steeper than I had imagined.
Someone should really make this an ongoing benchmark that evaluates new models as they are released. Better yet, this information should be included in every model's system card.