arXiv preprint · 2025 · Topics: world models, video generation, diffusion models, large language models, action-conditioned prediction, long-horizon simulation, robotics, embodied AI

PAN: A World Model for General, Interactable, and Long-Horizon World Simulation

PAN Team, Institute of Foundation Models

PAN is a general-purpose world model that predicts future world states through high-quality video simulation conditioned on history and natural language actions. It employs a Generative Latent Prediction (GLP) architecture combining an LLM-based autoregressive latent dynamics backbone with a video diffusion decoder to achieve unified latent space reasoning and realizable world dynamics.


PAN: A General, Interactable, and Long-Horizon World Model

1. Introduction and Problem Statement

PAN (Predictive Action-conditioned Network) addresses a fundamental challenge in AI: building a general-purpose world model that can simulate how the world evolves in response to actions, enabling agents to reason, plan, and act with foresight.

1.1 The Core Problem

Existing approaches fall into two inadequate categories:

  • Video generation models (Sora, KLING, Gen-3): Produce visually stunning videos but operate in "prompt-to-video" mode without:

    • Real-time causal control
    • Interactive feedback loops
    • Explicit state/action representations
    • Long-horizon consistency
  • Domain-specific world models: Limited to narrow contexts (games, robotics, 3D scenes) with:

    • Restricted action spaces
    • Poor generalization across domains
    • Lack of temporal dynamics or interactivity

Key Insight: A true world model must unify broad-domain generality, long-range interactive dynamics, and coherent simulation to support reasoning and planning.

1.2 PAN's Solution

PAN introduces a Generative Latent Prediction (GLP) architecture that:

  1. Separates abstract causal dynamics (latent space reasoning) from perceptual realization (observation generation)
  2. Grounds simulation in language actions using an LLM backbone
  3. Maintains long-horizon consistency through autoregressive state evolution
  4. Generates high-fidelity observations via video diffusion decoding

2. Technical Approach: The GLP Architecture

2.1 Core Framework

The Generative Latent Prediction (GLP) framework defines three components that model the joint distribution over observations:

Mathematical Formulation:
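A minimal sketch of the factorization, using assumed notation ($o_t$ observations, $a_t$ natural-language actions, $z_t$ latent states) and the three components defined below:

$$
z_t = h(o_t), \qquad \hat{z}_{t+1} = f\big(z_{\le t},\, a_t\big), \qquad \hat{o}_{t+1} \sim g\big(\cdot \mid \hat{z}_{t+1}\big),
$$

so that, step by step, the model approximates $p(o_{t+1} \mid o_{\le t}, a_{\le t})$ by decoding the predicted latent state.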

Components:

  1. Encoder h: Maps observations → latent states

    • Uses Qwen2.5-VL-7B vision tower (ViT)
    • 14×14 spatial patches with 3D temporal grouping
    • 2D rotary positional embeddings
  2. Predictive Module f: Evolves latent states under actions

    • LLM-based autoregressive backbone (Qwen2.5-VL-7B)
    • Predicts the next latent state from the latent history and the current language action
    • Maintains global temporal consistency
  3. Decoder g: Reconstructs observations from latent states

    • Adapted from Wan2.1-T2V-14B (14B-parameter DiT)
    • Generates perceptually detailed video chunks
    • Ensures local temporal smoothness
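To make the division of labor concrete, here is a minimal PyTorch-style sketch of how the three components compose during simulation; module names, tensor shapes, and the `step`/`rollout` interface are illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn as nn


class GLPWorldModel(nn.Module):
    """Illustrative composition of the three GLP components."""

    def __init__(self, encoder: nn.Module, predictor: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder      # h: observations -> latent states (e.g., a ViT vision tower)
        self.predictor = predictor  # f: autoregressive latent dynamics (e.g., an LLM backbone)
        self.decoder = decoder      # g: latent state -> video chunk (e.g., a diffusion decoder)

    def step(self, obs_history: torch.Tensor, action_tokens: torch.Tensor) -> torch.Tensor:
        """Simulate one step: encode the history, predict the next latent, decode a chunk."""
        z_history = self.encoder(obs_history)              # latent states for past observations
        z_next = self.predictor(z_history, action_tokens)  # predicted next latent state
        return self.decoder(z_next)                        # generated video chunk

    @torch.no_grad()
    def rollout(self, obs: torch.Tensor, actions: list[torch.Tensor]) -> list[torch.Tensor]:
        """Closed-loop simulation: each generated chunk becomes the new observation history."""
        chunks = []
        for action in actions:
            chunk = self.step(obs, action)
            chunks.append(chunk)
            obs = chunk  # autoregressive state evolution over generated observations
        return chunks
```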

2.2 Generative Supervision: Solving the Collapse Problem

Why Not JEPA?

Previous encoder-only models (JEPA) minimize latent-space distance:
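A typical form of such an objective, sketched with the notation above (the target latent often comes from a frozen or EMA copy of the encoder):

$$
\mathcal{L}_{\text{latent}} = \big\| f\big(h(o_{\le t}),\, a_t\big) - h(o_{t+1}) \big\|_2^2
$$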

Problem: This objective admits trivial solutions in which the latents collapse to constant vectors (representation collapse).

PAN's Solution: Generative supervision in observation space:
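Sketched in the same notation, generative supervision scores the predicted latent by how well it explains the next observation; in PAN this log-likelihood term is realized through the video diffusion decoder's denoising loss rather than an exact likelihood:

$$
\mathcal{L}_{\text{gen}} = \mathbb{E}\Big[ -\log g\big(o_{t+1} \mid f(h(o_{\le t}),\, a_t)\big) \Big]
$$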

Key Advantage: Every latent transition must correspond to a realizable sensory change, preventing collapse while grounding dynamics in observable reality.


3. Key Innovations

3.1 Autoregressive World Model Backbone

Multi-Turn Conversational Format:

<|user|> <image state 1> <action 1>
<|assistant|> <query embedding × 256>
<|user|> <video state 2> <action 2>
<|assistant|> <query embedding × 256>
...
  • 256 learnable query embeddings represent compact latent states
  • Teacher forcing during training; closed-loop rollouts during inference
  • Associative memory preserves global consistency across time
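A minimal sketch of how such a round could be assembled, assuming learnable query embeddings appended after each observation-action pair; the constants `NUM_QUERIES` and `HIDDEN` and the helper names are illustrative:

```python
import torch
import torch.nn as nn

NUM_QUERIES = 256   # compact latent state per round, as described above
HIDDEN = 3584       # illustrative width; the real backbone's hidden size may differ


class LatentStateQueries(nn.Module):
    """Learnable query embeddings appended after each (observation, action) round.
    The backbone's outputs at these positions serve as the predicted latent state."""

    def __init__(self):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(NUM_QUERIES, HIDDEN) * 0.02)

    def build_round(self, obs_embeds: torch.Tensor, action_embeds: torch.Tensor) -> torch.Tensor:
        """One conversational round: user-side obs + action embeddings, then the queries."""
        batch = obs_embeds.shape[0]
        queries = self.queries.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([obs_embeds, action_embeds, queries], dim=1)


def teacher_forced_sequence(rounds: list[tuple[torch.Tensor, torch.Tensor]],
                            state_queries: LatentStateQueries) -> torch.Tensor:
    """Training-time sequence: all rounds concatenated along the token axis.
    At inference, rounds are instead appended one at a time in a closed loop."""
    return torch.cat([state_queries.build_round(o, a) for o, a in rounds], dim=1)
```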

3.2 Causal Shift-Window Denoising Process Model (Causal Swin-DPM)

Problem: Naive sequential generation causes:

  1. Local inconsistency between adjacent video chunks
  2. Rapid quality degradation from error accumulation

Solution: Sliding temporal window with chunk-wise causal attention


Key Features:

  • Two chunks at different noise levels (K/2 and K)
  • Chunk-wise causal attention mask: Later chunk only attends to previous chunk (prevents future action leakage)
  • Fuzzy conditioning: Partially noised history suppresses unpredictable pixel details while preserving semantic structure
  • Noise augmentation: Conditioning frame gets small noise (k=0.055) to prevent error accumulation
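Below is a hedged sketch of the two ideas most specific to Causal Swin-DPM: the chunk-wise causal attention mask and the sliding two-chunk window in which the older chunk sits at roughly half the noise level of the newer one. The `denoise` callable and the scheduling details are placeholders, not the paper's exact procedure:

```python
import torch


def chunkwise_causal_mask(num_chunks: int, tokens_per_chunk: int) -> torch.Tensor:
    """Boolean attention mask: tokens attend within their own chunk and to earlier
    chunks, never to later ones, so future actions cannot leak backward in time."""
    chunk_ids = torch.arange(num_chunks).repeat_interleave(tokens_per_chunk)
    return chunk_ids[:, None] >= chunk_ids[None, :]  # [T, T], True = attention allowed


def sliding_window_step(denoise, window, actions, K: int):
    """One step of an illustrative shift-window schedule. The window holds two chunks:
    the older at roughly noise level K/2, the newer at level K. Each call denoises both
    by K/2 steps; the older chunk finishes and is emitted, and a fresh fully noised
    chunk enters the window."""
    older, newer = window                    # noise levels ~K/2 and K, respectively
    older, newer = denoise(older, newer, actions, steps=K // 2)
    finished = older                         # fully denoised -> next emitted video chunk
    fresh = torch.randn_like(newer)          # new chunk starts from pure noise (level K)
    return finished, [newer, fresh]
```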

Flow Matching Objective:

Model predicts velocity across 1000 denoising steps with shifted schedule.
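As a sketch, a standard rectified-flow formulation consistent with this description (assumed notation: $x_0$ the clean latent video chunk, $\epsilon \sim \mathcal{N}(0, I)$, $c$ the world-state and action conditioning):

$$
x_\tau = (1-\tau)\,x_0 + \tau\,\epsilon, \qquad
\mathcal{L}_{\text{FM}} = \mathbb{E}_{x_0,\,\epsilon,\,\tau}\big\| v_\theta(x_\tau, \tau, c) - (\epsilon - x_0) \big\|_2^2,
$$

with $\tau$ drawn from the shifted schedule discretized into 1000 denoising steps.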

3.3 Conditioning Architecture

Dual-Stream Cross-Attention:

  • World state stream: Linear projection → cross-attention → zero-initialized projection
  • Action stream: umT5-encoded text → original cross-attention pathway
  • Summation: Integrates global state context with action-specific changes
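A compact PyTorch sketch of such a dual-stream conditioning block; the zero-initialized output projection and the summation of the two streams follow the description above, while layer names, residual wiring, and head count are assumptions:

```python
import torch
import torch.nn as nn


class DualStreamConditioning(nn.Module):
    """Illustrative dual-stream cross-attention conditioning block."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.state_proj = nn.Linear(dim, dim)   # linear projection of world-state tokens
        self.state_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.state_out = nn.Linear(dim, dim)
        nn.init.zeros_(self.state_out.weight)   # zero-init: the new stream starts as a no-op
        nn.init.zeros_(self.state_out.bias)
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # original pathway

    def forward(self, x: torch.Tensor, state_tokens: torch.Tensor,
                text_tokens: torch.Tensor) -> torch.Tensor:
        s = self.state_proj(state_tokens)
        state_ctx, _ = self.state_attn(x, s, s)                    # attend to global world state
        text_ctx, _ = self.text_attn(x, text_tokens, text_tokens)  # attend to umT5 action text
        return x + self.state_out(state_ctx) + text_ctx            # sum of the two streams
```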

4. Training Strategy

4.1 Stage 1: Module-Wise Training

Objective: Adapt Wan2.1-T2V-14B to Causal Swin-DPM architecture

  • Frozen: Wan-VAE, text encoder, Qwen2.5-VL
  • Trained: Video diffusion decoder only
  • Infrastructure:
    • 960 NVIDIA H200 GPUs
    • Hybrid Sharded Data Parallel (HSDP): FSDP within 8-GPU nodes
    • FlashAttention-3 (cross-attention) + FlexAttention (causal attention)
    • BFloat16, AdamW (lr=1e-5), 5 epochs
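For orientation, a hedged sketch of what this hybrid-sharded setup looks like with PyTorch's built-in FSDP; the wrapping granularity and the rest of the training harness are not specified here:

```python
import torch
from torch.distributed.fsdp import (FullyShardedDataParallel as FSDP,
                                    MixedPrecision, ShardingStrategy)


def wrap_decoder_for_stage1(decoder: torch.nn.Module) -> FSDP:
    """Hedged sketch: shard the trainable video decoder within each node and replicate
    across nodes (hybrid sharding), in BFloat16. Requires an initialized process group;
    auto-wrap policy and other details are omitted."""
    return FSDP(
        decoder,
        sharding_strategy=ShardingStrategy.HYBRID_SHARD,  # FSDP within a node, replication across nodes
        mixed_precision=MixedPrecision(param_dtype=torch.bfloat16,
                                       reduce_dtype=torch.bfloat16,
                                       buffer_dtype=torch.bfloat16),
    )


# The frozen components (Wan-VAE, text encoder, Qwen2.5-VL) stay out of the optimizer;
# only the wrapped decoder is optimized, e.g.:
# optimizer = torch.optim.AdamW(wrapped_decoder.parameters(), lr=1e-5)
```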

4.2 Stage 2: Joint Training

Objective: Integrate encoder-backbone-decoder with generative supervision

  • Frozen: Vision-language model
  • Trained: Query embeddings + video decoder
  • Context window: 10 most recent rounds
  • Sequence parallelism: Ulysses method (SP group size 4)
  • Early stopping: After 1 epoch (validation convergence)

5. Training Data Pipeline

5.1 Video Segmentation

  1. Frame-level heuristics: Detect scene boundaries
  2. Temporal merging: Combine similar adjacent segments
  3. Quality filtering: Select clips in target duration range

5.2 Multi-Stage Filtering

Rule-Based Filters:

  • Motion metrics: Remove static or overly dynamic clips using optical flow and edge differences (see the sketch after this list)
  • Trivial motion: Filter uniform translation/zoom (sparse feature tracking)
  • Pure color frames: Exclude fade-in/fade-out transitions
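As an illustration of the motion-metric filter, a small OpenCV sketch using dense Farneback optical flow; the thresholds are tuning placeholders, not values from the paper:

```python
import cv2
import numpy as np


def mean_flow_magnitude(gray_frames: list[np.ndarray]) -> float:
    """Average dense optical-flow magnitude across a clip of grayscale frames."""
    mags = []
    for prev, curr in zip(gray_frames[:-1], gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mags.append(np.linalg.norm(flow, axis=-1).mean())
    return float(np.mean(mags))


def keep_clip(gray_frames: list[np.ndarray], low: float = 0.2, high: float = 20.0) -> bool:
    """Rule-based motion filter: drop clips that are nearly static or chaotically dynamic."""
    magnitude = mean_flow_magnitude(gray_frames)
    return low <= magnitude <= high
```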

Pretrained Detectors:

  • Aesthetic scorer: Frame-level quality assessment
  • Scene-text detector: Remove obstructive subtitles/watermarks

Custom VLM Filter: Detects and removes:

  • Lecture-type videos (static talking heads)
  • Text-dominated content
  • Screen recordings
  • Low-quality/heavily edited clips
  • Residual scene cuts

5.3 Dense Temporal Captions

Requirements:

  1. Factually rich descriptions (inspired by DALL-E 3)
  2. Temporally grounded: Focus on motion, events, changes (not static backgrounds)

Implementation: VLM re-captioning with prompts emphasizing evolving dynamics
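An illustrative prompt in this spirit (not the authors' wording):

```python
# Illustrative re-captioning prompt, showing the emphasis on motion, events, and
# changes over time rather than static scene description.
CAPTION_PROMPT = (
    "Describe this clip as a sequence of events. Focus on motions, interactions, and "
    "changes over time, including camera movement if present. Do not dwell on static "
    "background details, and keep the description factual."
)
```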


6. Experimental Results

6.1 Evaluation Benchmarks

Evaluation spans three complementary dimensions: action simulation fidelity, long-horizon simulation consistency, and support for simulative planning.

6.2 Baselines

Open-Source:

  • WAN 2.1/2.2-I2V-14B
  • Cosmos 1/2-14B (NVIDIA)
  • V-JEPA 2 (Meta)

Closed-Source:

  • KLING (Kuaishou)
  • MiniMax-Hailuo
  • Gen-3 (Runway)

6.3 Key Results

| Metric | PAN | Best Baseline | Improvement (pp) |
| --- | --- | --- | --- |
| Action Simulation Fidelity | 58.6% | 55.2% (KLING) | +3.4 |
| Transition Smoothness | 53.6% | 41.8% (Cosmos2) | +11.8 |
| Simulation Consistency | 64.1% | 52.3% (MiniMax) | +11.8 |
| Step-Wise Simulation | 56.1% | 48.7% (Cosmos2) | +7.4 |
| Open-Ended Planning | +26.7% | -8.3% (Gen-3) | +35.0 |
| Structured Planning | +23.4% | +12.1% (Cosmos2) | +11.3 |

Key Findings:

  1. Action Fidelity: PAN achieves highest accuracy among open-source models, surpassing most commercial systems
  2. Long-Horizon Stability: Substantially outperforms all baselines in maintaining quality over extended rollouts
  3. Planning Support: Only model showing consistent improvements when integrated with VLM agents (OpenAI-o3)

Critical Insight: Realistic appearance alone is insufficient—reliable causal grounding is essential for effective plan-time reasoning.


7. Practical Implications

7.1 Real-World Applications

Robotics:

  • Manipulation planning: Simulate action outcomes before execution (Agibot dataset)
  • Multi-step reasoning: Tree-structured search through simulated futures

Autonomous Driving:

  • Counterfactual simulation: "What if I change lanes now?"
  • Safety testing: Generate rare/dangerous scenarios

Content Creation:

  • Interactive storytelling: User-controlled narrative branches
  • Virtual environments: Consistent, explorable worlds

7.2 Advantages Over Existing Approaches

| Capability | Video Generators | Domain-Specific Models | PAN |
| --- | --- | --- | --- |
| Visual Quality | ✅ High | ⚠️ Variable | ✅ High |
| Action Control | ❌ None | ⚠️ Limited | ✅ Natural Language |
| Long-Horizon | ❌ Drift | ⚠️ Domain-Specific | ✅ Stable |
| Generality | ⚠️ Passive | ❌ Narrow | ✅ Open-Domain |
| Causal Grounding | ❌ Weak | ✅ Strong | ✅ Strong |

8. Related Work

8.1 World Models Evolution

Historical Context:

  • Classical: Domain-specific simulators (Ha & Schmidhuber 2018)
  • Robotics: Model-based RL (Yang et al. 2023, Zhou et al. 2024)
  • Autonomous Driving: Path planning (Wang et al. 2023, Hu et al. 2023)
  • Games: Genie 2, Matrix-Game (interactive 3D)

PAN's Position: First general-purpose model unifying:

  • Open-domain generality
  • Natural language actions
  • Long-horizon dynamics
  • High-fidelity observations

8.2 Video Generation Models

Diffusion-Based:

  • Sora (OpenAI): Non-autoregressive, limited control
  • Wan2.1 (base for PAN decoder): Single-shot generation
  • Cosmos (NVIDIA): Physics-focused, domain-specific

PAN's Distinction: Hybrid autoregressive-diffusion enables:

  • On-the-fly action control
  • Closed-loop simulation
  • Thought experiments for reasoning

8.3 Encoder-Only Predictive Models

JEPA (LeCun 2022):

  • Latent-space matching objective
  • Collapse problem (constant vector solutions)
  • Requires heuristic regularizers

DINO-WM:

  • Fixed DINOv2 features
  • Stable but ungrounded transitions

PAN's Advantage: Generative supervision ensures every transition corresponds to realizable sensory change.


9. Limitations and Future Work

9.1 Current Limitations

  1. Computational Cost: 960 H200 GPUs for training
  2. Context Window: Limited to 10 recent rounds (Qwen2.5-VL constraint)
  3. Latent Representation: The current implementation uses purely continuous latents rather than the discrete-continuous hybrid envisioned by the full GLP design

9.2 Future Directions

Architectural Enhancements:

  • Mixed backbone: Combine LLM + diffusion embedders (full GLP vision)
  • Hierarchical embeddings: Multi-scale temporal abstractions
  • Higher-order dynamics: Beyond Markovian state transitions

Scaling:

  • Broader modalities: Audio, tactile, proprioceptive signals
  • Larger datasets: More diverse domains and interaction types
  • Longer contexts: Beyond 10-round history

Applications:

  • Real-time decision-making: Faster inference for robotics
  • Multi-agent simulation: Social interactions and coordination
  • Safety-critical systems: Certified guarantees for autonomous vehicles

10. Conclusion

PAN represents a significant step toward general-purpose world models by:

  1. Unifying latent reasoning and perceptual realization through the GLP framework
  2. Grounding simulation in language-based knowledge via LLM backbones
  3. Maintaining long-horizon consistency through Causal Swin-DPM
  4. Supporting simulative reasoning for downstream planning

Core Contribution: PAN demonstrates that world models can serve as internal simulators for thought experiments, enabling agents to reason about actions before execution—a fundamental capability for general intelligence.

The model achieves state-of-the-art performance among open-source systems and competitive results with commercial models, while uniquely supporting interactive, long-horizon simulation across diverse domains. Future work will focus on scaling to broader modalities, enhancing temporal abstraction, and enabling real-time decision-making for embodied AI systems.