# Concept-Based Explainability: TCAV and Concept Bottleneck Models

## TL;DR
Concept-based explainability moves beyond pixel-level saliency maps to higher-level, human-understandable concepts. TCAV (Testing with Concept Activation Vectors) quantifies how sensitive a trained model's prediction is to a user-defined concept, for example whether "stripes" matter when it classifies zebras. Concept Bottleneck Models instead build interpretable concept reasoning into the architecture itself.

## Core Explanation
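Post-hoc saliency methods such as Grad-CAM and Integrated Gradients show where a model looks, but not which concept it uses. TCAV addresses this in three steps:

1. Collect examples of the concept (e.g., images with stripes) and a set of counterexamples (images without stripes, or random images).
2. Choose a layer, record its activations for both sets, and train a linear classifier to separate them. The Concept Activation Vector (CAV) is the vector normal to the classifier's decision boundary, pointing toward the concept examples.
3. For each input of the target class, compute the directional derivative of the class output along the CAV. The TCAV score is the fraction of those inputs for which the derivative is positive, i.e., for which moving the activations toward the concept increases the class score.

Statistical significance is assessed by training many CAVs against different random counterexample sets and comparing the resulting TCAV scores with those of CAVs trained on random "concepts", using a two-sided t-test.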
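The three TCAV steps can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the reference implementation: the activations are synthetic, the "model head" (`logit = w_head · tanh(a)`) is a hypothetical stand-in for the layers above the chosen layer, and the names `train_cav`, `tcav_score`, and `logit_gradient` are invented for this example. The significance test over repeated random sets is omitted for brevity.

```python
import numpy as np

def train_cav(concept_acts, random_acts, steps=500, lr=0.1):
    """Fit logistic regression (concept=1, random=0) by gradient descent
    and return its unit-norm weight vector, which serves as the CAV."""
    X = np.vstack([concept_acts, random_acts])
    y = np.concatenate([np.ones(len(concept_acts)), np.zeros(len(random_acts))])
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(concept)
        w -= lr * (X.T @ (p - y)) / len(y)      # gradient of the log loss
        b -= lr * np.mean(p - y)
    return w / np.linalg.norm(w)

def tcav_score(class_grads, cav):
    """Fraction of inputs whose directional derivative along the CAV is positive."""
    return float(np.mean(class_grads @ cav > 0))

# --- Toy setup: every value below is an illustrative assumption ---
rng = np.random.default_rng(0)
dim = 8
concept_direction = np.zeros(dim)
concept_direction[0] = 3.0  # the "concept" shifts activations along axis 0

concept_acts = rng.normal(size=(100, dim)) + concept_direction
random_acts = rng.normal(size=(100, dim))

# Hypothetical head above the chosen layer: logit = w_head . tanh(a)
w_head = rng.normal(size=dim)
w_head[0] = 2.0

def logit_gradient(acts):
    """Analytic gradient of the toy logit with respect to the activations."""
    return w_head * (1.0 - np.tanh(acts) ** 2)

class_acts = rng.normal(size=(200, dim))  # activations of target-class inputs
cav = train_cav(concept_acts, random_acts)
score = tcav_score(logit_gradient(class_acts), cav)
print(f"TCAV score: {score:.2f}")
```

In a real network the gradients would come from automatic differentiation of the class logit with respect to the layer's activations, and the score would only be reported if it differs significantly from scores obtained with random CAVs.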

## Detailed Analysis
Concept Bottleneck Models (CBMs) force every prediction to pass through a concept layer: the model first predicts a set of human-defined concepts from the input, then predicts the label from those concepts alone. The main advantages are that the reasoning is inherently interpretable and that a user can intervene at test time, correcting a mispredicted concept (e.g., setting "has stripes" to true) so that the label prediction updates accordingly. The main limitations are that training requires concept annotations and that the bottleneck can reduce accuracy relative to unconstrained end-to-end models.

Recent work addresses some of these limits. Hybrid CBMs (CVPR 2025) relax the bottleneck with residual connections. Visual-TCAV (2024) generates concept saliency maps that show where in the image a concept is recognized.

Applications include medical imaging (reasoning that clinicians can verify), autonomous driving (concept-level failure analysis), and bias auditing (checking whether a model relies on protected attributes as concepts).
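The two-stage structure of a CBM and the idea of test-time intervention can be illustrated with a small NumPy sketch. Everything here is an assumption made for illustration: the `ConceptBottleneck` class, the hand-picked weights, the two concepts ("has stripes", "has mane"), and the two classes (zebra, horse) are hypothetical and not taken from any particular paper or library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ConceptBottleneck:
    """Toy CBM: input -> concept probabilities -> class label."""

    def __init__(self, W_concept, W_label):
        self.W_concept = W_concept  # shape (n_features, n_concepts)
        self.W_label = W_label      # shape (n_concepts, n_classes)

    def predict_concepts(self, x):
        return sigmoid(x @ self.W_concept)

    def predict_label(self, concepts):
        # The label depends on the input only through the concepts.
        return int(np.argmax(concepts @ self.W_label))

    def forward(self, x, interventions=None):
        concepts = self.predict_concepts(x)
        # Test-time intervention: overwrite selected concept predictions
        # with values supplied by a human before predicting the label.
        for idx, value in (interventions or {}).items():
            concepts[idx] = value
        return concepts, self.predict_label(concepts)

# Hand-picked weights (illustrative only).
# Concepts: 0 = "has stripes", 1 = "has mane". Classes: 0 = zebra, 1 = horse.
W_concept = np.array([[2.0, 0.0],
                      [0.0, 2.0],
                      [-2.0, 0.0]])
W_label = np.array([[2.0, -2.0],
                    [0.0, 1.0]])
model = ConceptBottleneck(W_concept, W_label)

x = np.array([0.0, 1.0, 1.0])  # an input whose stripes the model misses
concepts_before, label_before = model.forward(x)
concepts_after, label_after = model.forward(x, interventions={0: 1.0})

print("before:", concepts_before.round(3), "label", label_before)
print("after: ", concepts_after.round(3), "label", label_after)
```

Because the label is computed only from the concept vector, correcting the "has stripes" concept is enough to change the predicted class, which is the property that makes CBM predictions auditable and editable by domain experts.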

## Further Reading
- TensorFlow TCAV GitHub Repository
- Yannic Kilcher's TCAV Video Explanation
- DALLE-3 Concept Understanding Analysis