The AI Safety Puzzle Everyone Avoids: How to Measure Impact, Not Intent

1 patrick0d 1 7/24/2025, 11:52:38 AM lesswrong.com ↗

Comments (1)

patrick0d · 3d ago
I am an AI interpretability researcher and have a new proposal for a way to measure the per token contribution of each head and neuron in LLMs. I found that the normalisation that happens in every LLM is avoided by modern attribution methods despite it having a large impact on the model's computation.
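To make the issue concrete, here is a minimal sketch (my own toy example, not the method from the linked repo; all shapes and names are illustrative assumptions) of how a common attribution approach treats normalisation as linear by freezing its scale, and how that can disagree with an ablation through the actual nonlinear LayerNorm:

```python
import torch

torch.manual_seed(0)

d_model, n_heads = 64, 4
resid_pre = torch.randn(d_model)                   # residual stream before the heads
head_writes = torch.randn(n_heads, d_model) * 0.3  # what each head adds to the stream
resid_post = resid_pre + head_writes.sum(dim=0)    # stream after all heads

ln = torch.nn.LayerNorm(d_model)                   # final normalisation before unembedding
unembed_dir = torch.randn(d_model)                 # unembedding direction for one token

with torch.no_grad():
    # (a) "Frozen LayerNorm" attribution: centre each head's write, divide by the
    # scale computed on the *full* residual stream, and project onto the logit
    # direction, i.e. treat the normalisation as if it were linear.
    denom = torch.sqrt(resid_post.var(unbiased=False) + ln.eps)
    frozen_attr = torch.stack([
        ((w - w.mean()) / denom * ln.weight) @ unembed_dir
        for w in head_writes
    ])

    # (b) Ablation through the real (nonlinear) LayerNorm: remove one head's
    # write, re-normalise, and measure how much the logit changes.
    full_logit = ln(resid_post) @ unembed_dir
    ablation_attr = torch.stack([
        full_logit - ln(resid_post - w) @ unembed_dir
        for w in head_writes
    ])

print("frozen-LN attribution:", frozen_attr)
print("ablation attribution: ", ablation_attr)
print("max discrepancy:      ", (frozen_attr - ablation_attr).abs().max())
```

The two attributions differ because removing a head's write also changes the normalisation statistics, which is exactly the effect a linearised treatment of the normalisation ignores.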

Here are the full preprint and the code I used: https://github.com/patrickod32/landed_writes. Happy to hear insights from anyone interested, and I'd like to know whether others here have been working on anything similar. This seems like a real gap in the research to me.