arXiv preprint · 2025 · Topics: world models, video generation, diffusion models, large language models, action-conditioned prediction, long-horizon simulation, robotics, embodied AI

PAN: A World Model for General, Interactable, and Long-Horizon World Simulation

PAN Team, Institute of Foundation Models

PAN is a general-purpose world model that predicts future world states through high-quality video simulation conditioned on history and natural language actions. It employs a Generative Latent Prediction (GLP) architecture combining an LLM-based autoregressive latent dynamics backbone with a video diffusion decoder to achieve unified latent space reasoning and realizable world dynamics.


PAN: A General, Interactable, and Long-Horizon World Model

1. Introduction and Problem Statement

PAN (Predictive Action-conditioned Network) addresses a fundamental challenge in AI: building a general-purpose world model that can simulate how the world evolves in response to actions, enabling agents to reason, plan, and act with foresight.

1.1 The Core Problem

Existing approaches fall into two inadequate categories:

  • Video generation models (Sora, KLING, Gen-3): Produce visually stunning videos but operate in "prompt-to-video" mode without:

    • Real-time causal control
    • Interactive feedback loops
    • Explicit state/action representations
    • Long-horizon consistency
  • Domain-specific world models: Limited to narrow contexts (games, robotics, 3D scenes) with:

    • Restricted action spaces
    • Poor generalization across domains
    • Lack of temporal dynamics or interactivity

Key Insight: A true world model must unify broad-domain generality, long-range interactive dynamics, and coherent simulation to support reasoning and planning.

1.2 PAN's Solution

PAN introduces a Generative Latent Prediction (GLP) architecture that:

  1. Separates abstract causal dynamics (latent space reasoning) from perceptual realization (observation generation)
  2. Grounds simulation in language actions using an LLM backbone
  3. Maintains long-horizon consistency through autoregressive state evolution
  4. Generates high-fidelity observations via video diffusion decoding

2. Technical Approach: The GLP Architecture

2.1 Core Framework

The Generative Latent Prediction (GLP) framework defines three components that model the joint distribution over observations:

Mathematical Formulation:
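A minimal sketch of the factorization, using assumed notation ($o_t$ observations, $a_t$ natural-language actions, $z_t$ latent states) and the three components defined below:

$$
z_t = h(o_t), \qquad \hat{z}_{t+1} = f\big(z_{\le t},\, a_t\big), \qquad \hat{o}_{t+1} \sim g\big(\cdot \mid \hat{z}_{t+1}\big),
$$

so that, step by step, the model approximates $p(o_{t+1} \mid o_{\le t}, a_{\le t})$ by decoding the predicted latent state.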

Components:

  1. Encoder h: Maps observations → latent states

    • Uses Qwen2.5-VL-7B vision tower (ViT)
    • 14×14 spatial patches with 3D temporal grouping
    • 2D rotary positional embeddings
  2. Predictive Module f: Evolves latent states under actions

    • LLM-based autoregressive backbone (Qwen2.5-VL-7B)
    • Predicts the next latent state from the latent history and the current language action
    • Maintains global temporal consistency
  3. Decoder g: Reconstructs observations from latent states

    • Adapted from Wan2.1-T2V-14B (14B-parameter DiT)
    • Generates perceptually detailed video chunks
    • Ensures local temporal smoothness
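To make the division of labor concrete, here is a minimal PyTorch-style sketch of how the three components compose during simulation; module names, tensor shapes, and the `step`/`rollout` interface are illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn as nn


class GLPWorldModel(nn.Module):
    """Illustrative composition of the three GLP components."""

    def __init__(self, encoder: nn.Module, predictor: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder      # h: observations -> latent states (e.g., a ViT vision tower)
        self.predictor = predictor  # f: autoregressive latent dynamics (e.g., an LLM backbone)
        self.decoder = decoder      # g: latent state -> video chunk (e.g., a diffusion decoder)

    def step(self, obs_history: torch.Tensor, action_tokens: torch.Tensor) -> torch.Tensor:
        """Simulate one step: encode the history, predict the next latent, decode a chunk."""
        z_history = self.encoder(obs_history)              # latent states for past observations
        z_next = self.predictor(z_history, action_tokens)  # predicted next latent state
        return self.decoder(z_next)                        # generated video chunk

    @torch.no_grad()
    def rollout(self, obs: torch.Tensor, actions: list[torch.Tensor]) -> list[torch.Tensor]:
        """Closed-loop simulation: each generated chunk becomes the new observation history."""
        chunks = []
        for action in actions:
            chunk = self.step(obs, action)
            chunks.append(chunk)
            obs = chunk  # autoregressive state evolution over generated observations
        return chunks
```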

2.2 Generative Supervision: Solving the Collapse Problem

Why Not JEPA?

Previous encoder-only models (JEPA) minimize latent-space distance:
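A typical form of such an objective, sketched with the notation above (the target latent often comes from a frozen or EMA copy of the encoder):

$$
\mathcal{L}_{\text{latent}} = \big\| f\big(h(o_{\le t}),\, a_t\big) - h(o_{t+1}) \big\|_2^2
$$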

Problem: This objective admits trivial solutions in which the latents collapse to constant vectors (representation collapse).

PAN's Solution: Generative supervision in observation space:
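Sketched in the same notation, generative supervision scores the predicted latent by how well it explains the next observation; in PAN this log-likelihood term is realized through the video diffusion decoder's denoising loss rather than an exact likelihood:

$$
\mathcal{L}_{\text{gen}} = \mathbb{E}\Big[ -\log g\big(o_{t+1} \mid f(h(o_{\le t}),\, a_t)\big) \Big]
$$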

Key Advantage: Every latent transition must correspond to a realizable sensory change, preventing collapse while grounding dynamics in observable reality.


3. Key Innovations

3.1 Autoregressive World Model Backbone

Multi-Turn Conversational Format:

<|user|> <image state 1> <action 1>
<|assistant|> <query embedding × 256>
<|user|> <video state 2> <action 2>
<|assistant|> <query embedding × 256>
...
  • 256 learnable query embeddings represent compact latent states
  • Teacher forcing during training; closed-loop rollouts during inference
  • Associative memory preserves global consistency across time
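A minimal sketch of how such a round could be assembled, assuming learnable query embeddings appended after each observation-action pair; the constants `NUM_QUERIES` and `HIDDEN` and the helper names are illustrative:

```python
import torch
import torch.nn as nn

NUM_QUERIES = 256   # compact latent state per round, as described above
HIDDEN = 3584       # illustrative width; the real backbone's hidden size may differ


class LatentStateQueries(nn.Module):
    """Learnable query embeddings appended after each (observation, action) round.
    The backbone's outputs at these positions serve as the predicted latent state."""

    def __init__(self):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(NUM_QUERIES, HIDDEN) * 0.02)

    def build_round(self, obs_embeds: torch.Tensor, action_embeds: torch.Tensor) -> torch.Tensor:
        """One conversational round: user-side obs + action embeddings, then the queries."""
        batch = obs_embeds.shape[0]
        queries = self.queries.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([obs_embeds, action_embeds, queries], dim=1)


def teacher_forced_sequence(rounds: list[tuple[torch.Tensor, torch.Tensor]],
                            state_queries: LatentStateQueries) -> torch.Tensor:
    """Training-time sequence: all rounds concatenated along the token axis.
    At inference, rounds are instead appended one at a time in a closed loop."""
    return torch.cat([state_queries.build_round(o, a) for o, a in rounds], dim=1)
```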

3.2 Causal Shift-Window Denoising Process Model (Causal Swin-DPM)

Problem: Naive sequential generation causes:

  1. Local inconsistency between adjacent video chunks
  2. Rapid quality degradation from error accumulation

Solution: Sliding temporal window with chunk-wise causal attention


Key Features:

  • Two chunks at different noise levels (K/2 and K)
  • Chunk-wise causal attention mask: Later chunk only attends to previous chunk (prevents future action leakage)
  • Fuzzy conditioning: Partially noised history suppresses unpredictable pixel details while preserving semantic structure
  • Noise augmentation: Conditioning frame gets small noise (k=0.055) to prevent error accumulation
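Below is a hedged sketch of the two ideas most specific to Causal Swin-DPM: the chunk-wise causal attention mask and the sliding two-chunk window in which the older chunk sits at roughly half the noise level of the newer one. The `denoise` callable and the scheduling details are placeholders, not the paper's exact procedure:

```python
import torch


def chunkwise_causal_mask(num_chunks: int, tokens_per_chunk: int) -> torch.Tensor:
    """Boolean attention mask: tokens attend within their own chunk and to earlier
    chunks, never to later ones, so future actions cannot leak backward in time."""
    chunk_ids = torch.arange(num_chunks).repeat_interleave(tokens_per_chunk)
    return chunk_ids[:, None] >= chunk_ids[None, :]  # [T, T], True = attention allowed


def sliding_window_step(denoise, window, actions, K: int):
    """One step of an illustrative shift-window schedule. The window holds two chunks:
    the older at roughly noise level K/2, the newer at level K. Each call denoises both
    by K/2 steps; the older chunk finishes and is emitted, and a fresh fully noised
    chunk enters the window."""
    older, newer = window                    # noise levels ~K/2 and K, respectively
    older, newer = denoise(older, newer, actions, steps=K // 2)
    finished = older                         # fully denoised -> next emitted video chunk
    fresh = torch.randn_like(newer)          # new chunk starts from pure noise (level K)
    return finished, [newer, fresh]
```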

Flow Matching Objective:

Model predicts velocity across 1000 denoising steps with shifted schedule.
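As a sketch, a standard rectified-flow formulation consistent with this description (assumed notation: $x_0$ the clean latent video chunk, $\epsilon \sim \mathcal{N}(0, I)$, $c$ the world-state and action conditioning):

$$
x_\tau = (1-\tau)\,x_0 + \tau\,\epsilon, \qquad
\mathcal{L}_{\text{FM}} = \mathbb{E}_{x_0,\,\epsilon,\,\tau}\big\| v_\theta(x_\tau, \tau, c) - (\epsilon - x_0) \big\|_2^2,
$$

with $\tau$ drawn from the shifted schedule discretized into 1000 denoising steps.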

3.3 Conditioning Architecture

Dual-Stream Cross-Attention:

  • World state stream: Linear projection → cross-attention → zero-initialized projection
  • Action stream: umT5-encoded text → original cross-attention pathway
  • Summation: Integrates global state context with action-specific changes
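A compact PyTorch sketch of such a dual-stream conditioning block; the zero-initialized output projection and the summation of the two streams follow the description above, while layer names, residual wiring, and head count are assumptions:

```python
import torch
import torch.nn as nn


class DualStreamConditioning(nn.Module):
    """Illustrative dual-stream cross-attention conditioning block."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.state_proj = nn.Linear(dim, dim)   # linear projection of world-state tokens
        self.state_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.state_out = nn.Linear(dim, dim)
        nn.init.zeros_(self.state_out.weight)   # zero-init: the new stream starts as a no-op
        nn.init.zeros_(self.state_out.bias)
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # original pathway

    def forward(self, x: torch.Tensor, state_tokens: torch.Tensor,
                text_tokens: torch.Tensor) -> torch.Tensor:
        s = self.state_proj(state_tokens)
        state_ctx, _ = self.state_attn(x, s, s)                    # attend to global world state
        text_ctx, _ = self.text_attn(x, text_tokens, text_tokens)  # attend to umT5 action text
        return x + self.state_out(state_ctx) + text_ctx            # sum of the two streams
```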

4. Training Strategy

4.1 Stage 1: Module-Wise Training

Objective: Adapt Wan2.1-T2V-14B to Causal Swin-DPM architecture

  • Frozen: Wan-VAE, text encoder, Qwen2.5-VL
  • Trained: Video diffusion decoder only
  • Infrastructure:
    • 960 NVIDIA H200 GPUs
    • Hybrid Sharded Data Parallel (HSDP): FSDP within 8-GPU nodes
    • FlashAttention-3 (cross-attention) + FlexAttention (causal attention)
    • BFloat16, AdamW (lr=1e-5), 5 epochs
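For orientation, a hedged sketch of what this hybrid-sharded setup looks like with PyTorch's built-in FSDP; the wrapping granularity and the rest of the training harness are not specified here:

```python
import torch
from torch.distributed.fsdp import (FullyShardedDataParallel as FSDP,
                                    MixedPrecision, ShardingStrategy)


def wrap_decoder_for_stage1(decoder: torch.nn.Module) -> FSDP:
    """Hedged sketch: shard the trainable video decoder within each node and replicate
    across nodes (hybrid sharding), in BFloat16. Requires an initialized process group;
    auto-wrap policy and other details are omitted."""
    return FSDP(
        decoder,
        sharding_strategy=ShardingStrategy.HYBRID_SHARD,  # FSDP within a node, replication across nodes
        mixed_precision=MixedPrecision(param_dtype=torch.bfloat16,
                                       reduce_dtype=torch.bfloat16,
                                       buffer_dtype=torch.bfloat16),
    )


# The frozen components (Wan-VAE, text encoder, Qwen2.5-VL) stay out of the optimizer;
# only the wrapped decoder is optimized, e.g.:
# optimizer = torch.optim.AdamW(wrapped_decoder.parameters(), lr=1e-5)
```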

4.2 Stage 2: Joint Training

Objective: Integrate encoder-backbone-decoder with generative supervision

  • Frozen: Vision-language model
  • Trained: Query embeddings + video decoder
  • Context window: 10 most recent rounds
  • Sequence parallelism: Ulysses method (SP group size 4)
  • Early stopping: After 1 epoch (validation convergence)

5. Training Data Pipeline

5.1 Video Segmentation

  1. Frame-level heuristics: Detect scene boundaries
  2. Temporal merging: Combine similar adjacent segments
  3. Quality filtering: Select clips in target duration range

5.2 Multi-Stage Filtering

Rule-Based Filters:

  • Motion metrics: Remove static or overly dynamic clips using optical flow and edge differences (see the sketch after this list)
  • Trivial motion: Filter uniform translation/zoom (sparse feature tracking)
  • Pure color frames: Exclude fade-in/fade-out transitions
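As an illustration of the motion-metric filter, a small OpenCV sketch using dense Farneback optical flow; the thresholds are tuning placeholders, not values from the paper:

```python
import cv2
import numpy as np


def mean_flow_magnitude(gray_frames: list[np.ndarray]) -> float:
    """Average dense optical-flow magnitude across a clip of grayscale frames."""
    mags = []
    for prev, curr in zip(gray_frames[:-1], gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mags.append(np.linalg.norm(flow, axis=-1).mean())
    return float(np.mean(mags))


def keep_clip(gray_frames: list[np.ndarray], low: float = 0.2, high: float = 20.0) -> bool:
    """Rule-based motion filter: drop clips that are nearly static or chaotically dynamic."""
    magnitude = mean_flow_magnitude(gray_frames)
    return low <= magnitude <= high
```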

Pretrained Detectors:

  • Aesthetic scorer: Frame-level quality assessment
  • Scene-text detector: Remove obstructive subtitles/watermarks

Custom VLM Filter: Detects and removes:

  • Lecture-type videos (static talking heads)
  • Text-dominated content
  • Screen recordings
  • Low-quality/heavily edited clips
  • Residual scene cuts

5.3 Dense Temporal Captions

Requirements:

  1. Factually rich descriptions (inspired by DALL-E 3)
  2. Temporally grounded: Focus on motion, events, changes (not static backgrounds)

Implementation: VLM re-captioning with prompts emphasizing evolving dynamics
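An illustrative prompt in this spirit (not the authors' wording):

```python
# Illustrative re-captioning prompt, showing the emphasis on motion, events, and
# changes over time rather than static scene description.
CAPTION_PROMPT = (
    "Describe this clip as a sequence of events. Focus on motions, interactions, and "
    "changes over time, including camera movement if present. Do not dwell on static "
    "background details, and keep the description factual."
)
```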


6. Experimental Results

6.1 Evaluation Benchmarks

Evaluation spans three complementary dimensions: action simulation fidelity, long-horizon simulation consistency, and support for simulative planning.

6.2 Baselines

Open-Source:

  • WAN 2.1/2.2-I2V-14B
  • Cosmos 1/2-14B (NVIDIA)
  • V-JEPA 2 (Meta)

Closed-Source:

  • KLING (Kuaishou)
  • MiniMax-Hailuo
  • Gen-3 (Runway)

6.3 Key Results

| Metric | PAN | Best Baseline | Improvement (pp) |
| --- | --- | --- | --- |
| Action Simulation Fidelity | 58.6% | 55.2% (KLING) | +3.4 |
| Transition Smoothness | 53.6% | 41.8% (Cosmos2) | +11.8 |
| Simulation Consistency | 64.1% | 52.3% (MiniMax) | +11.8 |
| Step-Wise Simulation | 56.1% | 48.7% (Cosmos2) | +7.4 |
| Open-Ended Planning | +26.7% | -8.3% (Gen-3) | +35.0 |
| Structured Planning | +23.4% | +12.1% (Cosmos2) | +11.3 |

Key Findings:

  1. Action Fidelity: PAN achieves highest accuracy among open-source models, surpassing most commercial systems
  2. Long-Horizon Stability: Substantially outperforms all baselines in maintaining quality over extended rollouts
  3. Planning Support: Only model showing consistent improvements when integrated with VLM agents (OpenAI-o3)

Critical Insight: Realistic appearance alone is insufficient—reliable causal grounding is essential for effective plan-time reasoning.


7. Practical Implications

7.1 Real-World Applications

Robotics:

  • Manipulation planning: Simulate action outcomes before execution (Agibot dataset)
  • Multi-step reasoning: Tree-structured search through simulated futures

Autonomous Driving:

  • Counterfactual simulation: "What if I change lanes now?"
  • Safety testing: Generate rare/dangerous scenarios

Content Creation:

  • Interactive storytelling: User-controlled narrative branches
  • Virtual environments: Consistent, explorable worlds

7.2 Advantages Over Existing Approaches

| Capability | Video Generators | Domain-Specific Models | PAN |
| --- | --- | --- | --- |
| Visual Quality | ✅ High | ⚠️ Variable | ✅ High |
| Action Control | ❌ None | ⚠️ Limited | ✅ Natural Language |
| Long-Horizon | ❌ Drift | ⚠️ Domain-Specific | ✅ Stable |
| Generality | ⚠️ Passive | ❌ Narrow | ✅ Open-Domain |
| Causal Grounding | ❌ Weak | ✅ Strong | ✅ Strong |

8. Related Work

8.1 World Models Evolution

Historical Context:

  • Classical: Domain-specific simulators (Ha & Schmidhuber 2018)
  • Robotics: Model-based RL (Yang et al. 2023, Zhou et al. 2024)
  • Autonomous Driving: Path planning (Wang et al. 2023, Hu et al. 2023)
  • Games: Genie 2, Matrix-Game (interactive 3D)

PAN's Position: First general-purpose model unifying:

  • Open-domain generality
  • Natural language actions
  • Long-horizon dynamics
  • High-fidelity observations

8.2 Video Generation Models

Diffusion-Based:

  • Sora (OpenAI): Non-autoregressive, limited control
  • Wan2.1 (base for PAN decoder): Single-shot generation
  • Cosmos (NVIDIA): Physics-focused, domain-specific

PAN's Distinction: Hybrid autoregressive-diffusion enables:

  • On-the-fly action control
  • Closed-loop simulation
  • Thought experiments for reasoning

8.3 Encoder-Only Predictive Models

JEPA (LeCun 2022):

  • Latent-space matching objective
  • Collapse problem (constant vector solutions)
  • Requires heuristic regularizers

DINO-WM:

  • Fixed DINOv2 features
  • Stable but ungrounded transitions

PAN's Advantage: Generative supervision ensures every transition corresponds to realizable sensory change.


9. Limitations and Future Work

9.1 Current Limitations

  1. Computational Cost: 960 H200 GPUs for training
  2. Context Window: Limited to 10 recent rounds (Qwen2.5-VL constraint)
  3. Latent Representation: The current implementation uses purely continuous latents rather than the discrete-continuous hybrid envisioned by the full GLP design

9.2 Future Directions

Architectural Enhancements:

  • Mixed backbone: Combine LLM + diffusion embedders (full GLP vision)
  • Hierarchical embeddings: Multi-scale temporal abstractions
  • Higher-order dynamics: Beyond Markovian state transitions

Scaling:

  • Broader modalities: Audio, tactile, proprioceptive signals
  • Larger datasets: More diverse domains and interaction types
  • Longer contexts: Beyond 10-round history

Applications:

  • Real-time decision-making: Faster inference for robotics
  • Multi-agent simulation: Social interactions and coordination
  • Safety-critical systems: Certified guarantees for autonomous vehicles

10. Conclusion

PAN represents a significant step toward general-purpose world models by:

  1. Unifying latent reasoning and perceptual realization through the GLP framework
  2. Grounding simulation in language-based knowledge via LLM backbones
  3. Maintaining long-horizon consistency through Causal Swin-DPM
  4. Supporting simulative reasoning for downstream planning

Core Contribution: PAN demonstrates that world models can serve as internal simulators for thought experiments, enabling agents to reason about actions before execution—a fundamental capability for general intelligence.

The model achieves state-of-the-art performance among open-source systems and competitive results with commercial models, while uniquely supporting interactive, long-horizon simulation across diverse domains. Future work will focus on scaling to broader modalities, enhancing temporal abstraction, and enabling real-time decision-making for embodied AI systems.