The Problem: The LLM Observability Gap
Modern transformer-based language models rely on a key–value (KV) cache that grows linearly with sequence length, often exceeding the size of the model weights themselves during long-context generation. In containerised inference services, an over-grown KV cache is the primary cause of out-of-memory (OOM) failures.
However, standard hardware-level monitoring tools like nvidia-smi or Prometheus node exporters report only aggregate memory utilisation. They fail to distinguish between three operationally distinct conditions: (i) healthy linear cache growth, (ii) structural retention bugs, and (iii) allocator fragmentation. Closing this gap requires a profiler that records the structural properties of the KV cache—per-layer tensor shapes and growth curves—correlating them with NVML-reported usage.
The KVScope Profiler
I developed KVScope, an instrumentation framework that attaches forward hooks to the attention modules of any Hugging Face Transformers model. It extracts live KV tensors regardless of cache layout (DynamicCache, HybridCache, or legacy tuples) and records dual software/hardware memory telemetry.
The profiler runs as a context manager around model.generate, requiring zero modifications to the underlying model code. After generation, recorded snapshots are routed through a six-stage detector pipeline:
- GrowthCurveDetector: Fits footprint as a function of tokens via linear/quadratic regression; flags anomalies where quadratic fit improves R² by >0.05.
- PostEOSDetector: Calibrates leak scores by comparing post-generation NVML state to baseline; identifies systematic retention where memory persists after empty_cache.
- FragmentationDetector: Tracks the ratio between PyTorch reserved and allocated pools (Fₜ), flagging averages >15%.
- LayerAnomalyDetector: Uses Z-scores (threshold >3σ) to catch model-loading or sliding-window configuration bugs.
- LayerUniformityDetector: Classifies per-layer footprint bimodality using Coefficient of Variation (CV); identifies 1:1 sliding/full ratios in hybrid models.
- CacheDensityDetector: Measures bytes per generated token per attention layer (e.g., 8.0 KB/token/layer baseline for Pythia-1.4B).
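Three of the detectors above reduce to short numeric checks. The following is a minimal sketch, not KVScope's actual code: the function names are hypothetical, Fₜ is assumed to mean (reserved − allocated) / reserved, and the Pythia-1.4B figures (16 heads, head dimension 128, fp16) are assumed from the published model configuration.

```python
from statistics import fmean

def _solve(a, b):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def poly_r2(xs, ys, degree):
    """R² of a least-squares polynomial fit (normal equations)."""
    scale = max(xs)                      # rescale x to (0, 1] for conditioning
    xs = [x / scale for x in xs]         # R² is unchanged by rescaling x
    powers = range(degree + 1)
    a = [[sum(x ** (i + j) for x in xs) for j in powers] for i in powers]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in powers]
    coef = _solve(a, b)
    pred = [sum(c * x ** i for i, c in enumerate(coef)) for x in xs]
    mean_y = fmean(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, pred))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

def growth_is_anomalous(tokens, footprint_bytes, min_gain=0.05):
    """Flag growth when a quadratic fit beats a linear fit by more than min_gain in R²."""
    gain = poly_r2(tokens, footprint_bytes, 2) - poly_r2(tokens, footprint_bytes, 1)
    return gain > min_gain

def mean_fragmentation(samples):
    """Average of (reserved - allocated) / reserved over (reserved, allocated) pairs.
    Assumed definition of F_t; the write-up only calls it a ratio."""
    return fmean([(r - a) / r for r, a in samples])

def expected_bytes_per_token_layer(n_kv_heads, head_dim, bytes_per_elem):
    """One key vector plus one value vector per KV head."""
    return 2 * n_kv_heads * head_dim * bytes_per_elem

tokens = list(range(64, 4097, 64))
print(growth_is_anomalous(tokens, [8192 * t for t in tokens]))  # linear growth
print(growth_is_anomalous(tokens, [t * t for t in tokens]))     # quadratic growth
print(expected_bytes_per_token_layer(16, 128, 2))               # assumed Pythia-1.4B config
```

Under those assumptions the last call gives 8192 bytes, which matches the 8.0 KB/token/layer baseline quoted for Pythia-1.4B. A mean_fragmentation result above 0.15 would correspond to the 15% flag.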
The profiler also includes custom Triton GPU kernels for operations requiring GPU-native accuracy: per-head L2 norm measurement (identifies dead KV heads wasting cache space), MLA latent compression ratio analysis, and attention entropy scoring for token eviction candidate identification.
Key Findings
- Systematic post-EOS leak identified across 13/15 prompts (mean score 0.48, persisting after empty_cache).
- Reserved-but-unused allocator gap (CV 0.94) invisible to standard monitoring.
- Gemma 4 degrades +4.56% under 8-bit (vs. ≤0.25% for Pythia and GLM). The instruction-tuned checkpoint inflates absolute PPL; the delta is the meaningful signal.
The study evaluated Pythia-1.4B (baseline Multi-Head Attention), Gemma 4 (Grouped-Query Attention with hybrid sliding-window), GLM-4.7-Flash (Mixture of Experts), and gpt-oss-120B across 15 structured prompts. The data revealed systemic inefficiencies:
1. Systematic Post-EOS Retention (Gemma 4)
The most actionable finding: Gemma 4 retains between 4.7 and 5.3 GB of its peak KV-cache overhead after generation returns. This persists even after torch.cuda.empty_cache() and gc.collect(). KVScope's deep-tensor check revealed that this originates from the model's interleaved local (sliding-window) and global (full) layers, where buffers are held at the NVML level by hybrid-cache implementations.
Crucially, KVScope performs an explicit torch.equal(k,v) check during extraction to avoid double-counting aliased tensors in Gemma's global layers—a detail missed by simpler profilers.
Bars show memory still unreleased after torch.cuda.empty_cache(). Gemma 4 retains roughly 5 GB (4.7 to 5.3 GB) per request.
Residual NVML usage (red) persisting after generation — Gemma 4 retains ~half its peak KV overhead.
2. Bimodal Footprint and Allocator Gaps (gpt-oss-120B)
KVScope identified a strong bimodal footprint in gpt-oss-120B (Coefficient of Variation 0.94). The model alternates sliding-window and full layers in a 1:1 ratio. Sliding layers stabilise at approximately 0.25 MB (window saturation), while full layers grow linearly to ~8 MB at 4096 tokens. The profiler also recorded a constant ~14.5 GiB gap between reserved and allocated memory, a reservation incurred during MoE expert weight materialisation and large enough to severely disrupt multi-tenant GPU placement and bin-packing efficiency.
Δ = torch.cuda.memory_reserved() − torch.cuda.memory_allocated(). Logarithmic visual scale. gpt-oss-120B holds 14.5 GiB of reserved-but-unused memory.
A constant ~14.5 GiB allocator pool reservation incurred during MoE expert weight materialisation.
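The per-layer figures and the 0.94 coefficient of variation quoted for gpt-oss-120B can be cross-checked with simple arithmetic. The sketch below is illustrative only: the attention configuration (8 KV heads, head dimension 64, 2-byte elements, a 128-token sliding window, 36 layers) is an assumption taken from the public model configuration, not something stated in this write-up, and the helper name is hypothetical.

```python
from statistics import mean, pstdev

MIB = 1024 ** 2

def kv_bytes_per_token(n_kv_heads, head_dim, bytes_per_elem):
    """Bytes one attention layer caches per token (keys plus values)."""
    return 2 * n_kv_heads * head_dim * bytes_per_elem

# Assumed gpt-oss-120B attention configuration (not stated in the text).
per_token = kv_bytes_per_token(n_kv_heads=8, head_dim=64, bytes_per_elem=2)
window, seq_len, n_layers = 128, 4096, 36

sliding_mib = min(seq_len, window) * per_token / MIB  # saturates at the window
full_mib = seq_len * per_token / MIB                  # grows with every token

# 1:1 interleaving of sliding-window and full-attention layers.
footprints = [sliding_mib, full_mib] * (n_layers // 2)
cv = pstdev(footprints) / mean(footprints)

print(sliding_mib, full_mib, round(cv, 2))  # 0.25 8.0 0.94
```

Because the two layer types alternate 1:1, the coefficient of variation depends only on the two footprints and not on the layer count, so the result holds for any even number of layers.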
3. Architectural Sensitivity to 8-bit Quantisation
The results contradict the assumption that "8-bit is always free." Pythia and GLM-4.7-Flash showed ≤ 0.25% degradation, whereas Gemma 4's hybrid sliding-window architecture suffered a +4.56% perplexity increase, which suggests that standard LLM.int8() routing may be insufficient for these attention patterns.
* Gemma 4 absolute perplexity is high due to evaluating an instruction-tuned checkpoint with sliding-window masking — not comparable to base-model benchmarks. The +4.56% delta between native and 8-bit is the meaningful signal.
WikiText-103 perplexity on a fixed 1M-character chunk. Gemma 4's +4.56% degradation under 8-bit disproves the "quantisation is always free" assumption for hybrid GQA architectures.
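Since the relative delta is the quantity being compared across models, it helps to pin down how it is computed. The sketch below assumes perplexity is the exponential of the mean per-token negative log-likelihood and that the delta is taken relative to the native-precision run. The function names and the input values are hypothetical placeholders, not measurements from the study.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def relative_delta_pct(ppl_native, ppl_quantised):
    """Percentage change in perplexity relative to the native-precision run."""
    return 100.0 * (ppl_quantised - ppl_native) / ppl_native

# Placeholder inputs for illustration only.
native = perplexity([2.0, 2.2, 1.8])
quantised = perplexity([2.05, 2.25, 1.85])
print(f"{relative_delta_pct(native, quantised):+.2f}%")
```

Reporting the delta rather than the absolute value is what makes the instruction-tuned Gemma 4 checkpoint comparable with the base models: a constant offset in absolute perplexity cancels out of the ratio.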
Why It Matters
For infrastructure and ML engineers, these findings have immediate operational implications. Retaining roughly 5 GB per request is not a minor inefficiency: under concurrent load it is a production outage waiting to happen.
Teams that rely solely on system-level metrics are flying blind. KVScope demonstrates that tensor-level profiling is needed to deploy modern hybrid-attention LLMs safely. Identifying these allocator gaps and retention bugs lets teams patch the framework hooks, adjust PyTorch's caching allocator settings, or choose a different serving architecture entirely.
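As one illustration of adjusting the caching allocator, PyTorch reads the PYTORCH_CUDA_ALLOC_CONF environment variable when CUDA is first initialised. The snippet below is a sketch of how such a setting might be applied. Whether expandable_segments actually narrows the reserved-versus-allocated gap reported here is an assumption that would have to be measured, not a result of this study.

```python
import os

# Must be set before the first CUDA allocation, so in practice before
# importing torch in the serving process. setdefault keeps any value that
# the deployment environment has already provided.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

Other documented allocator options, such as max_split_size_mb and garbage_collection_threshold, can be combined in the same comma-separated string.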