KVScope Case Study — Rahul Surya

The Problem: The LLM Observability Gap

Modern transformer-based language models rely on a key–value (KV) cache that grows linearly with sequence length and, during long-context generation, can exceed the size of the model weights themselves. In containerised inference services, an over-grown KV cache is a primary cause of out-of-memory (OOM) failures.

However, standard hardware-level monitoring tools like nvidia-smi or Prometheus node exporters report only aggregate memory utilisation. They fail to distinguish between three operationally distinct conditions: (i) healthy linear cache growth, (ii) structural retention bugs, and (iii) allocator fragmentation. Closing this gap requires a profiler that records the structural properties of the KV cache (per-layer tensor shapes and growth curves) and correlates them with NVML-reported usage.

The KVScope Profiler

I developed KVScope, an instrumentation framework that attaches forward hooks to the attention modules of any Hugging Face Transformers model. It extracts live KV tensors regardless of cache layout (DynamicCache, HybridCache, or legacy tuples) and records dual software/hardware memory telemetry.
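A minimal sketch of the hook attachment described above, assuming only PyTorch's standard forward-hook interface: register_forward_hook(hook) returns a handle with remove(), and hooks are called as hook(module, args, output). The names record_attention_outputs and summarise and the snapshot layout are hypothetical; this is not KVScope's actual API.

```python
from contextlib import contextmanager


@contextmanager
def record_attention_outputs(attention_modules, snapshots, summarise=None):
    """Attach a forward hook to each module and detach all hooks on exit.

    attention_modules: iterable of objects exposing register_forward_hook,
        as torch.nn.Module does.
    snapshots: list that receives one dict per hooked forward call.
    summarise: hypothetical callable that reduces a module output to a small
        record (for example shapes and byte counts). By default only the
        output's type name is stored, so no tensor reference is kept alive.
    """
    if summarise is None:
        summarise = lambda output: type(output).__name__

    handles = []

    def make_hook(layer_index):
        # PyTorch invokes forward hooks as hook(module, args, output).
        def hook(module, args, output):
            snapshots.append({"layer": layer_index, "summary": summarise(output)})
        return hook

    try:
        for index, module in enumerate(attention_modules):
            handles.append(module.register_forward_hook(make_hook(index)))
        yield snapshots
    finally:
        # Detach even if generation raises, so the model is left unmodified.
        for handle in handles:
            handle.remove()
```

With a Transformers model, the context manager would wrap the call to model.generate, with attention_modules chosen by the caller, for example by filtering model.named_modules().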

The profiler runs as a context manager around model.generate, requiring zero modifications to the underlying model code. After generation, recorded snapshots are routed through a six-stage detector pipeline:

GrowthCurveDetector: Fits footprint as a function of tokens via linear/quadratic regression; flags anomalies where quadratic fit improves R² by >0.05.
PostEOSDetector: Calibrates leak scores by comparing post-generation NVML state to baseline; identifies systematic retention where memory persists after empty_cache.
FragmentationDetector: Tracks the ratio between PyTorch reserved and allocated pools (Fₜ), flagging averages >15%.
LayerAnomalyDetector: Uses Z-scores (threshold >3σ) to catch model-loading or sliding-window configuration bugs.
LayerUniformityDetector: Classifies per-layer footprint bimodality using Coefficient of Variation (CV); identifies 1:1 sliding/full ratios in hybrid models.
CacheDensityDetector: Measures bytes per generated token per attention layer (e.g., 8.0 KB/token/layer baseline for Pythia-1.4B).
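As an illustration of the first detector in the list above, the sketch below compares linear and quadratic least-squares fits of footprint against token count and flags a run when the quadratic fit improves R² by more than 0.05. Only that threshold comes from the description; the function names, the normal-equation solver, and the input scaling are assumptions chosen to keep the example dependency-free, not KVScope's implementation.

```python
def polyfit(xs, ys, degree):
    """Least-squares coefficients [c0, c1, ...]; needs degree + 1 distinct xs."""
    n = degree + 1
    # Augmented normal equations [A^T A | A^T y] for the monomial basis.
    rows = [
        [sum(x ** (i + j) for x in xs) for j in range(n)]
        + [sum(y * x ** i for x, y in zip(xs, ys))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(n):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][n] / rows[i][i] for i in range(n)]


def r_squared(xs, ys, coeffs):
    """Coefficient of determination of a polynomial fit."""
    mean_y = sum(ys) / len(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    if ss_tot == 0:
        return 1.0  # constant footprint: any fit through the mean is exact
    ss_res = sum(
        (y - sum(c * x ** k for k, c in enumerate(coeffs))) ** 2
        for x, y in zip(xs, ys)
    )
    return 1.0 - ss_res / ss_tot


def r2_gain(tokens, footprint_bytes):
    """R² of the quadratic fit minus R² of the linear fit."""
    # Rescaling both axes leaves R² unchanged and keeps the solve well conditioned.
    x_scale = max(tokens)
    y_scale = max(abs(y) for y in footprint_bytes) or 1.0
    xs = [t / x_scale for t in tokens]
    ys = [y / y_scale for y in footprint_bytes]
    linear = r_squared(xs, ys, polyfit(xs, ys, 1))
    quadratic = r_squared(xs, ys, polyfit(xs, ys, 2))
    return quadratic - linear


def is_superlinear(tokens, footprint_bytes, min_r2_gain=0.05):
    """Flag growth that a quadratic explains markedly better than a line."""
    return r2_gain(tokens, footprint_bytes) > min_r2_gain
```

A cache that adds a fixed number of bytes per token yields a gain near zero and is not flagged, while a footprint growing with the square of the token count is.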

The profiler also includes custom Triton GPU kernels for operations requiring GPU-native accuracy: per-head L2 norm measurement (identifies dead KV heads wasting cache space), MLA latent compression ratio analysis, and attention entropy scoring for token eviction candidate identification.
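The two simpler measurements named above can be written as a plain-Python reference. This is not the Triton kernel code, and the function names are hypothetical; it only shows the quantities involved: the L2 norm of each head's key or value entries, and the Shannon entropy of one attention distribution. How KVScope turns these scores into dead-head or eviction decisions is not specified here.

```python
import math


def head_l2_norms(heads):
    """L2 norm of each head, given each head's entries as a flat sequence."""
    return [math.sqrt(sum(x * x for x in head)) for head in heads]


def attention_entropy(weights):
    """Shannon entropy in nats of attention weights that sum to 1."""
    # Zero-weight positions contribute nothing (the limit of p * log p is 0).
    return -sum(p * math.log(p) for p in weights if p > 0.0)
```

A head whose norm stays near zero contributes almost nothing to attention while still occupying cache space. Uniform attention over n positions has entropy log(n), and attention concentrated on a single position has entropy 0.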

Gemma 4 retention: 4.7–5.3 GB. Systematic post-EOS leak identified across 13/15 prompts (mean score 0.48, persisting after empty_cache).

gpt-oss-120B gap: 14.5 GiB. Reserved-but-unused allocator gap (CV 0.94) invisible to standard monitoring.

Quantisation cost: +4.56%. Gemma 4 degrades +4.56% under 8-bit (vs. ≤0.25% for Pythia and GLM). The instruction-tuned checkpoint inflates absolute PPL; the delta is the meaningful signal.

Key Findings

The study evaluated Pythia-1.4B (baseline Multi-Head Attention), Gemma 4 (Grouped-Query Attention with hybrid sliding-window), GLM-4.7-Flash (Mixture of Experts), and gpt-oss-120B across 15 structured prompts. The data revealed systemic inefficiencies:

1. Systematic Post-EOS Retention (Gemma 4)

The most actionable finding: Gemma 4 retains between 4.7 and 5.3 GB of its peak KV-cache overhead after generation returns. This persists even after torch.cuda.empty_cache() and gc.collect(). KVScope's deep-tensor check revealed that this originates from the model's interleaved local (sliding-window) and global (full) layers, where buffers are held at the NVML level by hybrid-cache implementations.

Crucially, KVScope performs an explicit torch.equal(k,v) check during extraction to avoid double-counting aliased tensors in Gemma's global layers—a detail missed by simpler profilers.
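The retention measurement in this section can be illustrated with a small sketch. It assumes the leak score is the fraction of peak memory overhead still held after cleanup; the text does not define the score, so this formula, the function name, and the numbers in the usage note are illustrative assumptions only.

```python
def leak_score(baseline_bytes, peak_bytes, post_bytes):
    """Fraction of the peak overhead above baseline that remains after cleanup.

    All three readings are device-level used-memory values, for example the
    used field returned by pynvml's nvmlDeviceGetMemoryInfo, taken before
    generation, at peak, and after gc.collect() and torch.cuda.empty_cache().
    """
    overhead = peak_bytes - baseline_bytes
    if overhead <= 0:
        return 0.0  # nothing was allocated above baseline
    residual = max(post_bytes - baseline_bytes, 0)
    return residual / overhead
```

With hypothetical readings of 10 GB at baseline, 20 GB at peak, and 15 GB after cleanup, the score is 0.5, meaning half of the generation overhead was never returned to the device.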

Fig 4. Post-EOS memory for prompt 1 (baseline, peak KV, and residual), with panels for Pythia-1.4B, Gemma 4, GLM-4.7-Flash, and gpt-oss-120B. Bars show memory still unreleased after torch.cuda.empty_cache(); residual NVML usage persisting after generation is drawn in red. Gemma 4 retains about 5 GB per request, roughly half its peak KV overhead.

2. Bimodal Footprint and Allocator Gaps (gpt-oss-120B)

KVScope identified a strongly bimodal per-layer footprint in gpt-oss-120B (Coefficient of Variation 0.94). The model alternates sliding-window and full layers in a 1:1 ratio. Sliding layers stabilise at approximately 0.25 MB (window saturation), while full layers grow linearly to ~8 MB at 4096 tokens. The allocator also shows a constant ~14.5 GiB gap between reserved and allocated memory, enough to severely disrupt multi-tenant GPU placement and bin-packing efficiency.
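The per-layer figures above can be checked with a short worked calculation. The attention settings used here (8 KV heads of dimension 64, a 128-token sliding window, a 16-bit cache, 36 layers) are assumptions taken from my reading of the public gpt-oss-120B configuration, not from this case study, and using the population standard deviation for the CV is likewise an assumption. Under those assumptions the arithmetic gives the same 0.25 MB, 8 MB, and 0.94 values.

```python
from statistics import mean, pstdev


def kv_bytes_per_token(num_kv_heads, head_dim, bytes_per_element=2):
    """Bytes one attention layer stores per token: one key and one value
    vector per KV head, at the given element width."""
    return 2 * num_kv_heads * head_dim * bytes_per_element


MIB = 2 ** 20

# Assumed gpt-oss-120B attention configuration (see lead-in).
per_token = kv_bytes_per_token(num_kv_heads=8, head_dim=64)  # 2048 bytes
sliding_mib = per_token * 128 / MIB   # window saturates at 128 tokens: 0.25
full_mib = per_token * 4096 / MIB     # grows with context: 8.0 at 4096 tokens

# 1:1 alternation of sliding and full layers; the layer count does not
# change the CV as long as the split is even.
layer_footprints = [sliding_mib, full_mib] * 18
cv = pstdev(layer_footprints) / mean(layer_footprints)  # about 0.94
```

The same helper gives 8192 bytes per token per layer for an assumed Pythia-1.4B configuration of 16 heads with dimension 128 at 16-bit precision, which matches the 8.0 KB/token/layer baseline quoted for CacheDensityDetector.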

Fig 6. PyTorch allocator state at end of generation for prompt 1 (logarithmic visual scale), with panels for Pythia-1.4B, Gemma 4, GLM-4.7-Flash, and gpt-oss-120B. Δ = torch.cuda.memory_reserved() − torch.cuda.memory_allocated(). gpt-oss-120B holds ~14.5 GiB of reserved-but-unused memory, a constant allocator pool reservation incurred during MoE expert weight materialisation.
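The allocator gap and the FragmentationDetector threshold can be sketched as follows. The gap is reserved minus allocated, as in the figure. The ratio definition (gap divided by reserved) is an assumption, since the text only says the detector tracks the ratio between the two pools and flags averages above 15%. Function names are hypothetical.

```python
def allocator_gap(reserved_bytes, allocated_bytes):
    """Bytes held by the caching allocator but not backing any live tensor."""
    return reserved_bytes - allocated_bytes


def fragmentation_ratio(reserved_bytes, allocated_bytes):
    """Assumed definition: share of the reserved pool that is unused."""
    if reserved_bytes <= 0:
        return 0.0
    return allocator_gap(reserved_bytes, allocated_bytes) / reserved_bytes


def is_fragmented(samples, threshold=0.15):
    """Flag a run whose mean fragmentation ratio exceeds the threshold.

    samples: (reserved_bytes, allocated_bytes) pairs, for example from
    torch.cuda.memory_reserved() and torch.cuda.memory_allocated() read
    at each generation step.
    """
    if not samples:
        return False
    ratios = [fragmentation_ratio(r, a) for r, a in samples]
    return sum(ratios) / len(ratios) > threshold
```

For example, a run that averages 10 GiB reserved with 8 GiB allocated has a ratio of 0.2 and would be flagged, while 10 GiB reserved with 9 GiB allocated would not.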

3. Architectural Sensitivity to 8-bit Quantisation

The results contradict the assumption that "8-bit is always free." While Pythia-1.4B and GLM-4.7-Flash showed ≤ 0.25% degradation, Gemma 4 suffered a +4.56% perplexity increase, suggesting that standard LLM.int8() routing may be insufficient for its hybrid sliding-window attention pattern.

Fig 7. WikiText-103 perplexity: native vs. BnB 8-bit quantisation

Model            Native PPL       8-bit PPL        Change
Pythia-1.4B      12.4             12.5             +0.18%
GLM-4.7-Flash    15.3             15.3             +0.21%
gpt-oss-120B     317.7 (MXFP4)    not supported    n/a
Gemma 4          827.7*           865.5            +4.56%

* Gemma 4 absolute perplexity is high due to evaluating an instruction-tuned checkpoint with sliding-window masking, so it is not comparable to base-model benchmarks. The +4.56% delta between native and 8-bit is the meaningful signal.

WikiText-103 perplexity on a fixed 1M-character chunk. Gemma 4's +4.56% degradation under 8-bit disproves the "quantisation is always free" assumption for hybrid GQA architectures.
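For reference, the quantities compared above follow the standard definitions: perplexity is the exponential of the mean per-token negative log-likelihood, and the reported change is the relative difference between the two perplexities. The sketch below shows only that arithmetic. The function names are hypothetical and the evaluation loop that produces the per-token losses is omitted.

```python
import math


def perplexity(token_nlls):
    """exp of the mean per-token negative log-likelihood (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def relative_change_percent(native_ppl, quantised_ppl):
    """Perplexity increase under quantisation, as a percentage of native."""
    return 100.0 * (quantised_ppl - native_ppl) / native_ppl
```

Applied to the rounded Gemma 4 values (827.7 and 865.5), relative_change_percent gives about 4.57%, consistent with the reported +4.56% up to rounding of the inputs.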

Why It Matters

For infrastructure and ML engineers, these findings have immediate operational implications. Memory retention of roughly 5 GB per request is not a minor inefficiency; under concurrent load it is a production outage waiting to happen.

Engineering teams that rely solely on system-level metrics are flying blind. KVScope demonstrates that tensor-level profiling is necessary for deploying modern, hybrid-attention LLMs safely. Identifying these allocator gaps and retention bugs gives teams three options: patch the framework hooks, adjust PyTorch's caching allocator settings, or choose a different serving architecture entirely.

Published research

Published April 2026 on Zenodo (DOI: 10.5281/zenodo.19871039, CC-BY 4.0). The paper details the forward-hook implementation, mathematical definitions of all six anomaly detectors, WikiText-103 perplexity evaluation, and a full data quality assessment of the profiling dataset.
