I am an AI interpretability researcher and have a new proposal for measuring the per-token contribution of each attention head and neuron in LLMs.
I found that the normalisation applied in every LLM is largely ignored by modern attribution methods, despite having a large impact on the model's computation.
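To illustrate the general point (this is a minimal sketch I wrote, not the method from the paper, and all names and numbers in it are made up for illustration): a head writes a vector into the residual stream, but the next block only sees that write after the norm layer rescales the whole stream, so the write's raw magnitude can differ a lot from the change it actually causes downstream.

```python
# Minimal sketch: raw write vs. the change it actually produces after LayerNorm.
# Illustrative only; vectors and sizes are arbitrary.
import torch

torch.manual_seed(0)
d_model = 16

residual = torch.randn(d_model) * 5.0   # residual stream before the head writes (large norm)
head_write = torch.randn(d_model)       # the head's raw write vector

ln = torch.nn.LayerNorm(d_model, elementwise_affine=False)

# What the next layer receives with and without the write:
with_write = ln(residual + head_write)
without_write = ln(residual)

# Naive attribution: treat the raw write's size as the head's contribution.
naive_contribution = head_write.norm()

# Post-norm view: the change the write actually causes after normalisation.
post_norm_change = (with_write - without_write).norm()

print(f"raw write norm:   {naive_contribution:.3f}")
print(f"post-norm change: {post_norm_change:.3f}")
```

Because the residual stream here has a much larger norm than the write, the post-norm change is far smaller than the raw write, which is the kind of gap an attribution method that skips the normalisation will miss.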
Here is the full preprint paper and the code I used. https://github.com/patrickod32/landed_writes
Happy to hear any insight from interested people, and I'd like to know whether others here have been working on anything similar. This seems like a real gap in the research to me.