arXiv 2026 · Preprint

TetherCache: Stabilizing Autoregressive Long-Form Video Generation with Gated Recall and Trusted Alignment

A training-free and plug-and-play KV-cache policy that enables stable, minute-level autoregressive video generation by combining gated relevance-diversity recall (GRAB) with trusted statistical alignment (TAME).

Yu Meng1, Xiangyang Luo1, Letian Li1, Wenyuan Jiang2, Chen Gao1, Xinlei Chen1, Yong Li1, Xiao-Ping Zhang1
1Tsinghua University  ·  2D-INFK, ETH Zürich
mengy24@mails.tsinghua.edu.cn

Ultra-Long Video Generation, No Retraining

TetherCache extends short-clip autoregressive diffusion models to 5-minute video generation while suppressing accumulated drift, color shifts, and structural distortion — all without modifying model parameters.

5-Minute Continuous Generation

Built on top of Self-Forcing (Wan2.1-1.3B-T2V), TetherCache pushes the rollout horizon far beyond the training window. The video on the left is generated autoregressively for 300 seconds, staying visually coherent throughout the rollout.

Duration: 300s · Resolution: 832×480 · Backbone: Self-Forcing (Wan2.1-1.3B) · Training: None

Abstract

Autoregressive video diffusion models provide a natural formulation for streaming and variable-length video generation by conditioning newly generated frames on previously generated content. However, extending these models to minute-level generation remains challenging: the limited KV-cache budget prevents the model from retaining the full history, while repeatedly conditioning on self-generated frames induces a context distribution shift that accumulates over time, leading to visual artifacts, quality degradation, and temporal drift. We propose TetherCache, a training-free and plug-and-play cache management strategy. TetherCache organizes the cache into Sink, Memory, and Recent regions, and introduces two complementary mechanisms: GRAB selects long-range memory frames using a gated score that combines attention-based relevance with temporal diversity; TAME lightly edits newly recalled memory tokens by aligning their statistics to a trusted context distribution. On VBench-Long, TetherCache consistently improves long-video generation across 30s, 60s, and 240s settings. In particular, for 240s generation, it reduces quality drift from 7.84 to 1.33.

Highlights

Three properties that make TetherCache an attractive component for long video systems.

Training-Free, Plug-and-Play

No fine-tuning or parameter changes. Drop-in replacement of the KV-cache policy at inference time.

Drift-Resistant Long Rollouts

Reduces ΔQuality Drift from 7.84 to 1.33 on 240s generation, suppressing color shifts, noise, and identity drift.

Modest Overhead

Less than 6% latency overhead over the baseline; reuses the original KV-cache tensor with light metadata.

Method Overview

TetherCache reinterprets the fixed KV cache as three contiguous regions — Sink (trusted early frames), Memory (selective long-range recall), and Recent (sliding window) — and equips it with two complementary mechanisms.
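The three-region layout can be pictured with a small bookkeeping sketch. This is an illustration only, not the released implementation: the class name `RegionCache`, the per-region budgets, and the use of frame indices in place of KV tensors are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class RegionCache:
    """Fixed-budget frame cache split into Sink, Memory, and Recent regions.

    Hypothetical sketch: frames are represented by integer indices, and the
    region sizes are illustrative parameters, not values from the paper.
    """
    sink_size: int
    memory_size: int
    recent_size: int
    sink: list = field(default_factory=list)
    memory: list = field(default_factory=list)
    recent: deque = field(default_factory=deque)

    def append(self, frame_id):
        """Add a new frame; return the frame evicted from Recent, if any."""
        # The earliest frames fill the Sink and are never evicted.
        if len(self.sink) < self.sink_size:
            self.sink.append(frame_id)
            return None
        # Later frames enter the sliding Recent window.
        self.recent.append(frame_id)
        if len(self.recent) > self.recent_size:
            # The oldest recent frame leaves the window and becomes a
            # candidate for the Memory region (decided by a recall policy).
            return self.recent.popleft()
        return None

    def layout(self):
        """Frame order as seen by attention: Sink, then Memory, then Recent."""
        return self.sink + self.memory + list(self.recent)


cache = RegionCache(sink_size=2, memory_size=3, recent_size=4)
evicted = [cache.append(i) for i in range(8)]
print(evicted)          # frames 2 and 3 are pushed out of Recent
print(cache.layout())
```

In this sketch the Memory region stays empty because no recall policy is attached; the evicted frame is simply handed back to the caller, which is where a selection rule would decide whether to admit it.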

Overview of TetherCache. The fixed KV cache is divided into Sink, Memory, and Recent regions. GRAB performs relevance-diversity memory recall from evicted recent frames and existing memory, while TAME uses trusted sink statistics to align newly recalled KV tokens. The two modules jointly preserve informative long-range context and suppress context distribution shift without retraining.

GRAB

Gated Recall with Attention-Diversity Balancing

For every evicted recent frame, GRAB re-scores the entire memory plus the candidate using a gated score combining attention-based relevance and temporal diversity. This keeps memory informative yet temporally spread, instead of being filled with recent neighbors. Old memory entries can be demoted when a more useful candidate appears.
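A minimal sketch of this re-scoring step follows. The exact gating function is not given on this page, so the example assumes a fixed scalar `gate` that linearly mixes relevance and diversity, and it measures diversity as the normalized distance to the nearest other frame in the pool. The names `grab_update`, `temporal_diversity`, `gate`, and `horizon` are hypothetical.

```python
def temporal_diversity(index, others, horizon):
    """Distance to the nearest other frame, normalized and clipped to [0, 1]."""
    if not others:
        return 1.0
    nearest = min(abs(index - o) for o in others)
    return min(nearest / horizon, 1.0)


def grab_update(memory, candidate, relevance, budget, gate=0.5, horizon=100):
    """Re-score Memory plus one evicted frame and keep at most `budget` frames.

    memory:    frame indices currently held in Memory
    candidate: frame index just evicted from Recent
    relevance: dict mapping frame index -> attention-based relevance in [0, 1]
    gate:      assumed mixing weight (1.0 = relevance only, 0.0 = diversity only)
    """
    pool = memory + [candidate]
    if len(pool) <= budget:
        return sorted(pool)

    scores = {}
    for idx in pool:
        others = [o for o in pool if o != idx]
        diversity = temporal_diversity(idx, others, horizon)
        scores[idx] = gate * relevance[idx] + (1.0 - gate) * diversity

    # Exactly one frame must go: drop the lowest-scoring one, which may be
    # the candidate itself or an older Memory entry that is now redundant.
    dropped = min(pool, key=lambda i: scores[i])
    return sorted(i for i in pool if i != dropped)


relevance = {10: 0.6, 50: 0.5, 52: 0.7, 90: 0.4}
print(grab_update([10, 50, 52], 90, relevance, budget=3))
```

Here frames 50 and 52 are near-duplicates in time, so both receive a low diversity term; the less relevant of the two (frame 50) is demoted and the temporally distant candidate (frame 90) is admitted, even though its relevance is the lowest in the pool.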

TAME

Trusted Alignment via Memory Editing

Newly admitted KV tokens are partially aligned to the per-head, per-channel statistics of a trusted pool (sink + existing memory). This tethers drifted historical features back to a stable distribution, preventing polluted context from corrupting future attention.
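One way to read "partially aligned" is as an interpolation between the original token and a moment-matched version of it. The sketch below works under that assumption for a single attention head, with tokens stored as plain lists of channel values. The function names and the `strength` parameter are hypothetical, and the actual TAME update may differ.

```python
from statistics import fmean, pstdev


def channel_stats(tokens):
    """Per-channel mean and standard deviation of a tokens x channels list."""
    columns = list(zip(*tokens))
    return [fmean(c) for c in columns], [pstdev(c) for c in columns]


def tame_align(new_tokens, trusted_tokens, strength=0.5, eps=1e-6):
    """Pull new tokens toward the trusted pool's per-channel statistics.

    strength: assumed interpolation weight (0.0 = unchanged,
              1.0 = mean and std fully matched to the trusted pool).
    """
    mu_new, sd_new = channel_stats(new_tokens)
    mu_ref, sd_ref = channel_stats(trusted_tokens)

    edited = []
    for token in new_tokens:
        row = []
        for c, x in enumerate(token):
            # Standardize with the new tokens' statistics, then re-scale
            # and re-center with the trusted pool's statistics.
            matched = (x - mu_new[c]) / (sd_new[c] + eps) * sd_ref[c] + mu_ref[c]
            # Blend the original value with the fully matched value.
            row.append((1.0 - strength) * x + strength * matched)
        edited.append(row)
    return edited


trusted = [[0.0, 1.0], [2.0, 3.0]]      # e.g. sink + existing memory tokens
drifted = [[4.0, 10.0], [8.0, 14.0]]    # newly admitted tokens
print(tame_align(drifted, trusted, strength=0.5))
```

A per-head version would apply the same routine independently to each head's tokens. Keeping `strength` below 1.0 preserves part of the recalled content while still moving its statistics toward the trusted distribution.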

60-Second Generation: Side-by-Side Comparison

For each prompt, we show the Self-Forcing baseline (left) and TetherCache (right). Compared with the baseline, our method better preserves visual quality, object identity, and temporal coherence over long autoregressive rollouts.

Ablation Study

Comparison of three variants: the Self-Forcing baseline, TetherCache without TAME, and full TetherCache.

Self-Forcing (Baseline)
w/o TAME
TetherCache