# Mechanistic Interpretability: Reverse-Engineering Neural Network Circuits and Features

## TL;DR
Mechanistic interpretability treats neural networks as scientific objects to be reverse-engineered: it locates the circuits, features, and computational pathways that produce specific behaviors. Instead of asking "what does the model output?", it asks "how does the model compute this output?", with the aim of enabling targeted fixes for safety, bias, and reliability.

## Core Explanation
Three concepts are central:

1. Features: directions in activation space that correspond to human-interpretable concepts (e.g., "dog", "French text", "sycophancy"). Features are typically not aligned with individual neurons; they can lie along arbitrary directions and are often stored in superposition.
2. Circuits: subnetworks of attention heads and MLP layers that implement specific computations. Examples include induction heads (in-context learning), name-mover heads (indirect object identification), and the greater-than circuit (numeric comparison, e.g., of years).
3. Superposition: a network represents more features than it has dimensions by encoding them along near-orthogonal directions. Several concepts then share the same neurons, which gives rise to polysemantic neurons.

Sparse autoencoders (SAEs) are the primary tool for feature extraction. An overcomplete autoencoder is trained on model activations with an L1 sparsity penalty, yielding a sparse overcomplete basis in which each latent dimension ideally corresponds to a single feature.
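To make the SAE idea concrete, here is a minimal NumPy sketch rather than a reference implementation. It builds toy "activations" in which 16 sparse features are squeezed into 8 dimensions (superposition), then trains an overcomplete autoencoder with an L1 penalty on its latents. Everything specific here is an assumption for illustration: the toy data, the sizes, the hyperparameters, the helper names (`sae_forward`, `sae_loss`, `train_step`), and the use of plain full-batch gradient descent. Practical SAEs usually add details omitted here, such as constraining decoder norms so the L1 penalty cannot be bypassed by rescaling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for model activations: 16 sparse features squeezed into
# 8 dimensions (superposition). Sizes and hyperparameters are assumptions.
n_samples, d_model, n_true, n_latents = 512, 8, 16, 32
directions = rng.normal(size=(n_true, d_model))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
is_active = rng.random((n_samples, n_true)) < 0.1
strengths = is_active * rng.uniform(0.5, 1.5, size=(n_samples, n_true))
activations = strengths @ directions  # shape (n_samples, d_model)

# Overcomplete autoencoder: n_latents > d_model.
params = {
    "W_enc": rng.normal(scale=0.1, size=(d_model, n_latents)),
    "b_enc": np.zeros(n_latents),
    "W_dec": rng.normal(scale=0.1, size=(n_latents, d_model)),
    "b_dec": np.zeros(d_model),
}

def sae_forward(params, x):
    """Encode with a ReLU, then decode linearly."""
    pre = x @ params["W_enc"] + params["b_enc"]
    latents = np.maximum(pre, 0.0)
    recon = latents @ params["W_dec"] + params["b_dec"]
    return pre, latents, recon

def sae_loss(x, latents, recon, l1_coeff):
    """Squared reconstruction error plus an L1 penalty on the latents."""
    recon_err = np.mean(np.sum((recon - x) ** 2, axis=1))
    l1 = np.mean(np.sum(np.abs(latents), axis=1))
    return recon_err + l1_coeff * l1

def train_step(params, x, l1_coeff, lr):
    """One full-batch gradient descent step with hand-derived gradients."""
    n = x.shape[0]
    pre, latents, recon = sae_forward(params, x)
    loss = sae_loss(x, latents, recon, l1_coeff)
    d_recon = 2.0 * (recon - x) / n
    # Latents are non-negative, so the L1 term contributes a constant slope.
    d_latents = d_recon @ params["W_dec"].T + l1_coeff / n
    d_pre = d_latents * (pre > 0)  # ReLU passes gradient only where active
    grads = {
        "W_enc": x.T @ d_pre,
        "b_enc": d_pre.sum(axis=0),
        "W_dec": latents.T @ d_recon,
        "b_dec": d_recon.sum(axis=0),
    }
    for name, grad in grads.items():
        params[name] -= lr * grad
    return loss  # loss measured before this step's update

losses = [train_step(params, activations, l1_coeff=0.01, lr=0.05)
          for _ in range(500)]
_, latents, recon = sae_forward(params, activations)

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
print(f"mean active latents per sample: {(latents > 0).sum(axis=1).mean():.1f}")
```

After training, the rows of `params["W_dec"]` are the candidate feature directions; in a real analysis one would inspect which inputs most strongly activate each latent before treating it as a feature.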

## Detailed Analysis
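Circuit discovery typically proceeds in three stages:

1. Activation patching: run the model on a clean and a corrupted input, replace the activation of one component (an attention head, an MLP output) in one run with its value from the other, and measure the effect on downstream behavior. Ablation, which zeroes or mean-replaces a component, is a simpler variant of the same idea.
2. Causal tracing and attribution patching: localize the components that are necessary and sufficient for a behavior, aiming at a minimal subgraph. Causal tracing restores clean activations in a noise-corrupted run; attribution patching is a gradient-based approximation that estimates many patching effects at once. Rank-one updates belong to model-editing methods such as ROME, which build on causal tracing, rather than to patching itself.
3. Automated circuit discovery (ACDC, attribution graphs): algorithms that search for the circuit behind a behavior without a manually specified hypothesis.

Key findings: Anthropic's Transformer Circuits work identified induction heads, attention heads that find an earlier occurrence of the current token, attend to the token that followed it, and copy that token forward; they are a major mechanism behind in-context learning. Sparse autoencoders have since been scaled to production models: OpenAI (2024) trained SAEs with up to 16 million latents on GPT-4 activations, and Anthropic's Scaling Monosemanticity work (2024) extracted millions of features from Claude 3 Sonnet, including features related to deception, power-seeking, and sycophancy. The AI Safety & Security guide (2026) documents the emerging capability to "edit" models by clamping or neutralizing specific features, for example suppressing a sycophancy feature with the goal of leaving helpfulness intact. ACM Computing Surveys (2025) published a survey unifying MI research across vision, language, and multimodal models.

Critical limitation: complete circuit-level understanding has so far been achieved mainly for toy models (1-2 layer transformers) and for narrow tasks in small language models such as GPT-2 small; scaling to frontier models with billions of parameters and complex reasoning remains largely aspirational.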
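The activation patching step of the workflow above can be sketched on a toy network. This is an illustrative sketch under stated assumptions, not a recipe for a real transformer: the model (`run_model`), its six hidden "components", and the random clean and corrupted inputs are all hypothetical stand-ins. In an actual language model, the same cache-and-overwrite logic would be applied to attention head or MLP outputs, typically through forward hooks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: 6 hidden "components" feeding a nonlinear readout.
d_in, n_comp, d_mid = 4, 6, 5
W1 = rng.normal(size=(d_in, n_comp))
W2 = rng.normal(size=(n_comp, d_mid))
w_out = rng.normal(size=d_mid)

def run_model(x, patch=None):
    """Run the toy model.

    `patch` maps a component index to an activation value that overwrites
    that component before the rest of the forward pass runs.
    Returns the scalar output and the (possibly patched) component activations.
    """
    h = np.tanh(x @ W1)
    if patch:
        h = h.copy()
        for idx, value in patch.items():
            h[idx] = value
    mid = np.maximum(h @ W2, 0.0)
    return float(mid @ w_out), h

clean_x = rng.normal(size=d_in)
corrupt_x = rng.normal(size=d_in)

# 1. Cache component activations on the clean run.
clean_out, clean_h = run_model(clean_x)
corrupt_out, _ = run_model(corrupt_x)

# 2. Patch one component at a time into the corrupted run and record
#    how far the output moves away from the corrupted baseline.
effects = {}
for i in range(n_comp):
    patched_out, _ = run_model(corrupt_x, patch={i: clean_h[i]})
    effects[i] = patched_out - corrupt_out

# 3. Sanity check: patching every component reproduces the clean output.
all_patched_out, _ = run_model(
    corrupt_x, patch={i: clean_h[i] for i in range(n_comp)}
)

for i, effect in sorted(effects.items(), key=lambda kv: -abs(kv[1])):
    print(f"component {i}: output shift {effect:+.3f}")
```

Components with the largest shifts are the first candidates for membership in the circuit. Because of the nonlinear readout, individual effects need not sum to the full clean-versus-corrupted difference, which is one reason single-component patching is usually followed by patching groups of components or paths.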

## Further Reading
- Transformer Circuits Thread (Anthropic, 2021-2024)
- Distill.pub: Feature Visualization & Circuits
- SAE-Vis: Sparse Autoencoder Visualizer (MIT)