KVScope: Profiling KV-Cache Memory Dynamics Across Frontier LLMs on H100
The Problem
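Memory leaks in LLM inference are hard to catch because PyTorch's caching allocator is opaque from the outside. Standard tools such as nvidia-smi report only the total GPU memory a process holds, so they cannot tell live tensors apart from memory the allocator has reserved but is not using. As a result they miss both allocator fragmentation and tensor references that are never released after generation.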
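To make the distinction concrete, here is a minimal sketch of the reserved-versus-allocated gap that nvidia-smi cannot show. The `AllocatorSnapshot` class and `take_snapshot` helper are hypothetical names used for illustration, not part of KVScope; the only real API calls are `torch.cuda.memory_allocated` and `torch.cuda.memory_reserved`.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class AllocatorSnapshot:
    """Hypothetical container for two caching-allocator counters."""

    allocated_bytes: int  # bytes occupied by live tensors
    reserved_bytes: int   # bytes the caching allocator holds from the driver

    @property
    def gap_gib(self) -> float:
        """Reserved-but-unused memory, in GiB."""
        return (self.reserved_bytes - self.allocated_bytes) / GIB


def take_snapshot(device: int = 0) -> AllocatorSnapshot:
    """Read the counters from PyTorch (needs a CUDA-enabled torch install)."""
    import torch  # imported lazily so the dataclass works without torch

    return AllocatorSnapshot(
        allocated_bytes=torch.cuda.memory_allocated(device),
        reserved_bytes=torch.cuda.memory_reserved(device),
    )
```

A process-level tool sees roughly the reserved figure (plus CUDA context overhead), so a large `gap_gib` is invisible to it.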
The Approach
Built KVScope, a profiling framework that registers forward hooks on attention modules. The hooks intercept DynamicCache objects and snapshot the KV-cache tensors at every step, with no changes to application code. Six heuristic detectors then check the snapshots for specific memory pathologies.
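The hook mechanism could look roughly like the sketch below. It is an illustration under stated assumptions, not KVScope's actual code: `KVRecorder` and `cache_bytes` are hypothetical names, attention modules are assumed to have class names ending in "Attention", the cache is assumed to arrive as a `past_key_values` (or `past_key_value`) keyword argument, and the cache layout is assumed to be either a `layers` list or parallel `key_cache` / `value_cache` lists, depending on the transformers version.

```python
def cache_bytes(cache) -> int:
    """Total bytes of KV tensors held by a DynamicCache-like object.

    Assumption: the cache exposes either `layers` (items with `.keys` and
    `.values`) or parallel `key_cache` / `value_cache` lists.
    """
    layers = getattr(cache, "layers", None)
    if layers is not None:
        pairs = [(layer.keys, layer.values) for layer in layers]
    else:
        pairs = list(zip(cache.key_cache, cache.value_cache))
    # Skip placeholders (None, empty lists) that are not tensors.
    return sum(
        t.numel() * t.element_size()
        for pair in pairs
        for t in pair
        if hasattr(t, "numel")
    )


class KVRecorder:
    """Hypothetical recorder: one (module name, KV bytes) sample per attention call."""

    def __init__(self):
        self.samples = []
        self.handles = []

    def attach(self, model):
        # Assumption: attention modules are identifiable by class name.
        for name, module in model.named_modules():
            if type(module).__name__.endswith("Attention"):
                handle = module.register_forward_hook(
                    self._make_hook(name), with_kwargs=True
                )
                self.handles.append(handle)

    def _make_hook(self, name):
        def hook(module, args, kwargs, output):
            # Explicit None checks: an empty cache may evaluate as falsy.
            cache = kwargs.get("past_key_values")
            if cache is None:
                cache = kwargs.get("past_key_value")
            if cache is not None:
                self.samples.append((name, cache_bytes(cache)))
            # Returning None leaves the module output unchanged.

        return hook

    def detach(self):
        for handle in self.handles:
            handle.remove()
        self.handles.clear()
```

Because the hooks only read keyword arguments and return nothing, the model's outputs are untouched, which is what allows profiling without editing application code. Since one cache object is typically shared by all layers, each sample reflects the whole cache at the moment that layer ran.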
Key Findings
- Gemma 4 post-EOS retention: 4.7–5.3 GB retained per generation
- gpt-oss-120B allocator gap: 14.5 GiB reserved-but-unused on H100
- 8-bit quantisation cost: Gemma 4 perplexity rose 4.6%, versus ≤0.25% for the other models
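As an illustration of how a retention figure like the first finding might be measured, the sketch below compares allocated bytes before and after one generation call. The `retained_after` helper is a hypothetical name and a simplification; the source does not describe how KVScope's detectors are implemented.

```python
import gc
from typing import Callable


def retained_after(
    generate: Callable[[], None],
    read_allocated: Callable[[], int],
) -> int:
    """Bytes still allocated after `generate` returns, relative to the baseline.

    `read_allocated` is injected so the helper can be exercised without a GPU;
    on CUDA it could be `torch.cuda.memory_allocated`.
    """
    gc.collect()
    baseline = read_allocated()
    generate()
    gc.collect()  # drop unreachable Python objects so only live references count
    return read_allocated() - baseline
```

With a real model, usage might look like `retained_after(lambda: model.generate(**inputs), torch.cuda.memory_allocated)`, where `model` and `inputs` are assumed to exist. The lambda discards the output, so any growth reported points at references held elsewhere.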