Multi-Agent Reinforcement Learning: Cooperation, Competition, and Emergent Strategies

## TL;DR
Multi-Agent Reinforcement Learning (MARL) extends RL to systems where multiple agents learn simultaneously, whether they are collaborating, competing, or negotiating. In applications ranging from drone swarms to trading agents, MARL can produce emergent collective behavior that no individual policy exhibits on its own.

## Core Explanation
Single-agent RL: one agent learns a policy mapping states to actions. MARL: N agents each learn while the others are also learning, so from any one agent's point of view the environment is non-stationary, because the other agents' policies keep changing. Key paradigm: centralized training with decentralized execution (CTDE). Agents share information during training (a centralized critic sees all agents' observations) but act independently at execution (each decentralized actor uses only its local observation). Representative algorithms: MADDPG (multi-agent DDPG), QMIX (monotonic value factorization), MAPPO (multi-agent PPO), and COMA (counterfactual baseline).
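The split between a centralized critic and decentralized actors can be made concrete with a small sketch. The code below is illustrative only: the linear "networks", the dimensions, and the names `actor` and `centralized_critic` are assumptions made for this example, not part of any specific algorithm's reference implementation. The critic here also receives the joint action, which follows the MADDPG-style setup; other CTDE methods feed the critic differently.

```python
import random

def actor(weights, local_obs):
    """Decentralized actor: chooses an action from this agent's local observation only."""
    scores = [sum(w * o for w, o in zip(row, local_obs)) for row in weights]
    return max(range(len(scores)), key=scores.__getitem__)

def centralized_critic(critic_weights, all_obs, all_actions):
    """Centralized critic: scores the joint observations and actions of all agents."""
    joint = [x for obs in all_obs for x in obs] + [float(a) for a in all_actions]
    return sum(w * x for w, x in zip(critic_weights, joint))

rng = random.Random(0)
n_agents, obs_dim, n_actions = 2, 3, 2  # hypothetical sizes

# One weight matrix per agent (n_actions x obs_dim); stands in for a policy network.
actor_weights = [
    [[rng.uniform(-1, 1) for _ in range(obs_dim)] for _ in range(n_actions)]
    for _ in range(n_agents)
]
# One critic over the concatenated observations plus one action index per agent.
critic_weights = [rng.uniform(-1, 1) for _ in range(n_agents * obs_dim + n_agents)]

observations = [[rng.uniform(-1, 1) for _ in range(obs_dim)] for _ in range(n_agents)]

# Execution: each agent acts using only its own observation.
actions = [actor(actor_weights[i], observations[i]) for i in range(n_agents)]

# Training only: the critic is given every agent's observation and action.
joint_value = centralized_critic(critic_weights, observations, actions)
```

The design point is the function signatures: `actor` never receives another agent's data, so it can be deployed without any communication channel, while `centralized_critic` exists only during training and can be discarded afterwards.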

## Detailed Analysis
Cooperative MARL: agents share a team reward (traffic light control, warehouse robots, drone formation). Challenge: credit assignment, that is, which agent's action caused the success? Value factorization decomposes the joint Q-function into per-agent utilities: VDN sums them, while QMIX combines them through a mixing network constrained to be monotonic in each agent's utility.

Competitive MARL: agents have opposing rewards (poker AI, adversarial games). Challenge: policy cycling and convergence. Self-play (AlphaStar, OpenAI Five) and population-based approaches such as league training address this by maintaining diverse opponent pools.

Mixed-motive MARL: combines cooperation and competition (negotiation, autonomous driving at intersections).

LLM-augmented MARL (2025): LLMs provide strategic reasoning and communication protocols between agents.
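The value factorization idea from the cooperative case can be illustrated with a toy example. This is a minimal sketch under stated assumptions: the utility table `agent_qs`, the `raw_weights`, and the single linear mixing layer are hypothetical simplifications (the actual QMIX mixer is a small network whose weights come from hypernetworks conditioned on the global state). It shows why monotonic mixing matters: when the joint value never decreases as any agent's utility increases, each agent picking its own greedy action also maximizes the joint value.

```python
from itertools import product

# Hypothetical per-agent utilities Q_i(a_i): 2 agents, 3 actions each.
agent_qs = [
    [0.2, 1.0, -0.5],
    [0.7, 0.1, 0.4],
]

def vdn_mix(chosen_qs):
    """VDN-style mixing: the joint value is the plain sum of agent utilities."""
    return sum(chosen_qs)

def monotonic_mix(chosen_qs, raw_weights, bias):
    """QMIX-style mixing (simplified to one linear layer): abs() keeps weights
    non-negative, so the joint value is monotonic in every agent's utility."""
    return sum(abs(w) * q for w, q in zip(raw_weights, chosen_qs)) + bias

raw_weights = [-1.5, 0.3]  # assumed values; QMIX would generate these from the state
bias = 0.1

# Decentralized greedy choice: each agent maximizes its own utility.
greedy_joint = tuple(max(range(len(qs)), key=qs.__getitem__) for qs in agent_qs)

def best_joint(mix):
    """Brute-force search for the joint action that maximizes the mixed value."""
    joint_actions = product(*(range(len(qs)) for qs in agent_qs))
    return max(
        joint_actions,
        key=lambda acts: mix([agent_qs[i][a] for i, a in enumerate(acts)]),
    )

vdn_best = best_joint(vdn_mix)
qmix_best = best_joint(lambda qs: monotonic_mix(qs, raw_weights, bias))
```

In this toy setting both `vdn_best` and `qmix_best` coincide with `greedy_joint`, which is the property that lets agents act greedily on local utilities at execution time without searching the exponentially large joint action space.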

## Further Reading
- MADDPG Original Paper (Lowe et al., NeurIPS 2017)
- SMAC: StarCraft Multi-Agent Challenge (Samvelyan et al., 2019; University of Oxford WhiRL lab)
- PettingZoo Multi-Agent RL Library