Solving the Issue of Interpretability of AI

2 mikeai686 1 7/20/2025, 1:21:38 PM

# Making AI Thoughts Understandable Through Separate Translator Models

I want to propose a new approach to the problem of AI opacity.

## The Core Problem

Modern AI systems work as "black boxes": we can't see how they think. Recently, leading researchers warned that we might soon lose even the small amount of transparency we currently have. Here's the dilemma: if we force AI to "think aloud" in human language, it becomes less efficient, but if we let it use efficient mathematical representations, we can't understand what's happening.

## Proposed Solution: A Modular System with Translators

I propose dividing the system into four parts:

*1. Free Internal Thinking* Let AI use any mathematical representations that are most efficient for solving tasks. We don't limit its thinking methods.

*2. Multiple Specialized Translator Models* We use several separate models trained to translate AI's internal representations into human-understandable language. Each translator can:

- explain the logical structure of reasoning
- highlight the main concepts the model is working with
- explain how confident the model is in its conclusions

Each function is performed by several different translators so results can be cross-checked.

*3. Contradiction Resolution Mechanisms* When translators give different explanations, we:

- Highlight areas where they agree (high reliability)
- Emphasize discrepancies (likely complex or ambiguous reasoning)
- Explain why different interpretations arose

If translator results don't contradict each other, we combine non-contradictory aspects into a unified explanation.
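To make the agree/disagree split concrete, here is a minimal sketch in Python. It assumes each translator's output has already been reduced to a set of short claim strings, which is a big simplification; the `reconcile` function, the `Reconciliation` type, and the translator names are all hypothetical and only illustrate the bookkeeping, not how real translators would be compared.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Reconciliation:
    agreed: set[str]                 # claims produced by every translator
    disputed: dict[str, list[str]]   # claim -> translators that produced it


def reconcile(outputs: dict[str, set[str]]) -> Reconciliation:
    """Split translator claims into unanimous and disputed ones."""
    if not outputs:
        return Reconciliation(set(), {})
    # Count how many translators produced each claim.
    counts = Counter(claim for claims in outputs.values() for claim in claims)
    total = len(outputs)
    agreed = {claim for claim, k in counts.items() if k == total}
    disputed = {
        claim: sorted(name for name, claims in outputs.items() if claim in claims)
        for claim, k in counts.items()
        if k < total
    }
    return Reconciliation(agreed, disputed)


# Hypothetical outputs from three translators reading the same internal state.
outputs = {
    "translator_a": {"uses analogy", "high confidence"},
    "translator_b": {"uses analogy", "low confidence"},
    "translator_c": {"uses analogy", "high confidence"},
}
result = reconcile(outputs)
print("agreed:", result.agreed)
print("disputed:", result.disputed)
```

Exact string matching is the weakest part of this sketch: two translators can say the same thing in different words, so a real system would need some notion of semantic equivalence before counting agreement.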

*4. Ethics Verification* We use "constitutional AI" (a special rule system, like in Claude.ai) to check:

- Compliance with ethical standards
- Logical consistency
- Alignment with human values

## Main Advantages

- *No delays*: The model can think and produce results without delays (especially important in verbal dialogue), while explanations can be generated in parallel for quality control and, if necessary, future corrections.
- *Moderation*: For critically important decisions requiring human moderation, we can wait for the translation and for the human moderator's decision.
- *Different perspectives*: Different translators show different aspects of thinking.
- *Transparency of complexities*: When translators disagree, we know the reasoning is complex.
- *Ethical safety*: An additional verification layer ensures alignment with values.
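The "no delays" and "moderation" points can be sketched as one control flow. The Python below is only an illustration under stated assumptions: `solve`, `translate`, and `moderator_approves` are hypothetical stand-ins for the main model, a translator model, and a human moderator, and the internal trace is faked as a list of numbers.

```python
import asyncio
from typing import Optional


async def solve(task: str) -> tuple[str, list[float]]:
    """Stand-in for the main model: returns an answer and an opaque trace."""
    await asyncio.sleep(0)
    return f"answer to {task}", [0.1, 0.7, 0.2]


async def translate(trace: list[float]) -> str:
    """Stand-in for a translator model, assumed slower than answering."""
    await asyncio.sleep(0.01)
    return f"explanation of a {len(trace)}-step trace"


async def moderator_approves(explanation: str) -> bool:
    """Stand-in for a human moderator's decision."""
    return True


async def handle(task: str, critical: bool) -> tuple[Optional[str], "asyncio.Task[str]"]:
    answer, trace = await solve(task)
    # Start the translation in the background right away.
    explaining = asyncio.create_task(translate(trace))
    if critical:
        # Critical path: hold the answer until it is explained and approved.
        explanation = await explaining
        if not await moderator_approves(explanation):
            return None, explaining
    # Non-critical path: the answer is returned while translation may still run.
    return answer, explaining


async def main() -> tuple[bool, Optional[str]]:
    answer, explaining = await handle("small talk", critical=False)
    ready_at_reply = explaining.done()  # was the explanation ready with the answer?
    await explaining                    # the explanation arrives later, for review
    gated_answer, _ = await handle("approve loan", critical=True)
    return ready_at_reply, gated_answer


ready_at_reply, gated_answer = asyncio.run(main())
print(ready_at_reply, gated_answer)
```

The design choice here is that the translator never sits between the model and the user unless the task is flagged as critical, so ordinary dialogue pays no latency cost for explanations.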

## Open Questions

1. How do we train translators without "correct answers" from humans?
2. What is the optimal number of translators?
3. What do we do if none of the translators can clearly explain the reasoning?
4. How do we prove that translators accurately reflect internal thinking?

## Next Steps

I would like to:

- Create a simple example of such a system working
- Develop methods to verify translation accuracy
- Combine this approach with existing tools

I would appreciate community feedback, especially regarding potential problems and practical challenges.

Comments (1)

ijk · 3h ago

It sounds like you're proposing doing this operation on the tokens in the reasoning. While it would be interesting to know whether allowing it to choose arbitrary tokens helps, the biggest issue is that there's quite a bit of evidence that the tokens it prints have only a loose relationship with the internal model processes.

I question your premise; first demonstrate that having it think aloud in "efficient mathematical representations" is a useful efficiency. Then you can demonstrate that you can do any interpretability work on the output.
