Computer Architecture · NeurIPS 2025 · Tags: nested learning, associative memory, continual learning, in-context learning, optimization, transformers, language models, memory systems

Nested Learning: The Illusion of Deep Learning Architectures

Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, Vahab Mirrokni

This paper introduces Nested Learning (NL), a new learning paradigm that represents models as nested, multi-level optimization problems with distinct context flows. NL reveals that deep learning methods compress their context flow and explains in-context learning emergence, leading to three contributions: Deep Optimizers (showing gradient-based optimizers are associative memory modules), Self-Modifying Titans (a sequence model that learns its own update algorithm), and a Continuum Memory System with the HOPE architecture.



1. Introduction and Problem Statement

1.1 The Challenge

Current Large Language Models (LLMs) face a fundamental limitation: they are largely static after deployment. While they excel at tasks learned during pre-training, they cannot continually acquire new capabilities beyond their immediate context window. This creates a condition analogous to anterograde amnesia in neuroscience—where the model can only access:

  • Immediate context (short-term memory via attention mechanisms)
  • Long-past knowledge (frozen parameters from pre-training)

"Current Models only Experience the Immediate Present" - lacking the ability to consolidate new information into long-term memory after deployment.

1.2 Key Limitations of Traditional Deep Learning

Traditional deep learning approaches face several challenges:

  1. Computational depth may not increase with more layers
  2. Capacity improvements show diminishing returns with depth/width
  3. Optimization may converge to suboptimal solutions
  4. Adaptation abilities (continual learning, out-of-distribution generalization) don't automatically improve with more layers

2. Nested Learning (NL) Paradigm

2.1 Core Concept

Nested Learning represents a machine learning model as a coherent system of nested, multi-level, and/or parallel optimization problems, each with its own "context flow." This paradigm reveals that:

Deep learning methods learn from data through compressing their own context flow, explaining how in-context learning emerges in large models.


2.2 Associative Memory Foundation

NL builds on the concept of associative memory—the ability to form and retrieve connections between events.

Definition (Associative Memory): Given keys K ⊆ ℝ^{d_k} and values V ⊆ ℝ^{d_v}, an associative memory is an operator M: K → V that minimizes:

M* = arg min_M L̃(M(K); V)

Key Insight: Memory is a neural update caused by input, while learning is the process of acquiring effective memory.
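
As a concrete (if minimal) illustration of the definition, the sketch below trains a linear operator M to map a small set of keys to their values by gradient descent on a squared-error objective; the dimensions, data, and learning rate are arbitrary choices for the example, not anything prescribed by the paper.

```python
import numpy as np

# A linear associative memory M trained to map keys to values by gradient descent on a
# squared-error objective L~(M(K); V).  Dimensions, data, and the learning rate are
# arbitrary choices for the example.
rng = np.random.default_rng(0)
d_k, d_v, n = 16, 4, 8                      # key dim, value dim, number of stored pairs
K = rng.normal(size=(n, d_k))               # keys
V = rng.normal(size=(n, d_v))               # values to associate with those keys

M = np.zeros((d_v, d_k))                    # the memory operator M: R^{d_k} -> R^{d_v}
lr = 0.05
for _ in range(2000):
    err = K @ M.T - V                       # M(K) - V
    M -= lr * (err.T @ K) / n               # gradient step on 0.5 * mean ||M(K) - V||^2

print(float(np.mean((K @ M.T - V) ** 2)))   # near zero: the associations have been written into M
```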

2.3 Update Frequency Hierarchy

Components are ordered by their update frequency (f_A):

  • Higher levels = Lower frequency updates (e.g., pre-training parameters)
  • Lower levels = Higher frequency updates (e.g., attention states)
  • Parallel components = Same frequency but independent computation (A =_f B)

3. Technical Approach

3.1 Decomposing Gradient Descent

Example 1: Simple MLP Training with Gradient Descent

The weight update rule:

W_{t+1} = W_t - η_{t+1} ∇_{W_t} L(W_t; x_{t+1})
         = W_t - η_{t+1} ∇_{y_{t+1}} L(W_t; x_{t+1}) ⊗ x_{t+1}

Can be reformulated as an optimization problem:

W_{t+1} = arg min_W ⟨Wx_{t+1}, u_{t+1}⟩ + (1/2η_{t+1})||W - W_t||²_2

where u_{t+1} = ∇_{y_{t+1}} L(W_t; x_{t+1}) is the Local Surprise Signal (LSS)—quantifying mismatch between current output and objective structure.
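
The equivalence can be checked numerically. The sketch below (assuming a squared-error loss on a linear layer, which is my choice for the example) compares one gradient-descent step with the minimizer of the proximal objective above, obtained by iterating on that objective directly.

```python
import numpy as np

# Check that one gradient-descent step on a linear layer y = W x coincides with the
# minimizer of the proximal objective  <W x, u> + (1/(2*eta)) ||W - W_t||^2,
# where u = dL/dy is the Local Surprise Signal.  The squared-error loss is an assumption.
rng = np.random.default_rng(1)
d_in, d_out, eta = 5, 3, 0.1
W_t = rng.normal(size=(d_out, d_in))
x = rng.normal(size=d_in)
target = rng.normal(size=d_out)

u = W_t @ x - target                        # LSS for L = 0.5 * ||y - target||^2
W_gd = W_t - eta * np.outer(u, x)           # ordinary GD step: dL/dW = u ⊗ x

# Minimize the proximal objective directly (small gradient steps on W) and compare.
W = W_t.copy()
for _ in range(2000):
    g = np.outer(u, x) + (W - W_t) / eta    # gradient of the proximal objective w.r.t. W
    W -= 0.001 * g

print(np.allclose(W, W_gd, atol=1e-4))      # True: the two formulations agree
```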

Example 2: Gradient Descent with Momentum

W_{t+1} = W_t - m_{t+1}
m_{t+1} = m_t - η_{t+1} ∇_{W_t} L(W_t; x_{t+1})

Reformulated as a 2-level optimization:

W_{t+1} = W_t - m_{t+1}  (outer level)
m_{t+1} = arg min_m -⟨m, ∇_{W_t} L(W_t; x_{t+1})⟩ + η_{t+1}||m - m_t||²_2  (inner level)

Key Discovery: Momentum is a key-less associative memory that compresses gradients into its parameters using gradient descent.
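
A minimal sketch of this two-level view, on an assumed least-squares toy problem: the inner buffer m accumulates (compresses) the gradient stream, and the outer level applies the compressed state to the weights.

```python
import numpy as np

# Momentum as a key-less associative memory on an assumed least-squares toy problem:
# the buffer m compresses the stream of gradients into one state (inner level), and the
# weights then consume that compressed history (outer level).  All constants are assumptions.
rng = np.random.default_rng(2)
n, dim = 32, 4
X = rng.normal(size=(n, dim))
w_true = rng.normal(size=dim)
y = X @ w_true

W, m = np.zeros(dim), np.zeros(dim)
eta, alpha = 0.05, 0.9                               # step size and momentum decay

for t in range(500):
    grad = X.T @ (X @ W - y) / n                     # gradient of 0.5 * mean (X W - y)^2
    m = alpha * m + eta * grad                       # inner level: m memorizes the gradient stream
    W = W - m                                        # outer level: apply the compressed state

print(np.allclose(W, w_true, atol=1e-3))             # True: the two-level system solves the regression
```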

3.2 Architectural Decomposition

Linear Attention as Nested Optimization:

Standard linear attention formulation:

k_t = x_t W_k,  v_t = x_t W_v,  q_t = x_t W_q
M_t = M_{t-1} + v_t k_t^⊤
y_t = M_t q_t

The memory update M_t can be reformulated as:

M_{t+1} = arg min_M -⟨M k_{t+1}, v_{t+1}⟩ + (1/2)||M - M_t||²_2

This reveals linear attention as a 2-level optimization:

  • Inner level: Memory M_t compresses key-value mappings (gradient descent)
  • Outer level: Projection matrices W_k, W_v, W_q are trained (gradient descent)
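
The sketch below walks through the recurrence above with randomly initialized projections; sizes and inputs are placeholders, and the outer-level training of W_k, W_v, W_q is omitted.

```python
import numpy as np

# Minimal walk-through of the linear-attention recurrence: the state M_t is an associative
# memory accumulating value-key outer products, and each output is the read-out M_t q_t.
# Dimensions, inputs, and the random projections are placeholders; the outer-level training
# of W_k, W_v, W_q is not shown.
rng = np.random.default_rng(3)
d_model, d_k, d_v, T = 16, 8, 8, 10
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_v))
W_q = rng.normal(size=(d_model, d_k))

M = np.zeros((d_v, d_k))                    # inner-level memory, rewritten at every token
outputs = []
for t in range(T):
    x_t = rng.normal(size=d_model)
    k_t, v_t, q_t = x_t @ W_k, x_t @ W_v, x_t @ W_q
    M = M + np.outer(v_t, k_t)              # one "write": a single GD step on the memory objective
    outputs.append(M @ q_t)                 # read-out y_t = M_t q_t

print(np.stack(outputs).shape)              # (10, 8): one d_v-dimensional output per token
```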

3.3 Deep Optimizers: Enhanced Gradient-Based Methods

Building on the insight that optimizers are associative memories, NL proposes several extensions:

3.3.1 More Expressive Association (Preconditioning)

W_{t+1} = W_t + m_{t+1}
m_{t+1} = α_{t+1} m_t - η_t P_t ∇L(W_t; x_t)

where P_t provides meaningful value mappings (e.g., Hessian information).
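
A one-step sketch of this form, using a diagonal inverse-second-moment preconditioner as an illustrative stand-in for P_t (the paper leaves the choice of P_t open; this particular choice is mine):

```python
import numpy as np

# One step of preconditioned momentum in the form above.  P_t is realized here as a diagonal
# inverse-second-moment preconditioner, an illustrative stand-in for curvature information.
def precond_momentum_step(W, m, v, grad, eta=0.01, alpha=0.9, beta=0.999, eps=1e-8):
    v = beta * v + (1 - beta) * grad**2               # running per-coordinate second moment
    m = alpha * m - eta * grad / (np.sqrt(v) + eps)   # m_{t+1} = alpha*m_t - eta * P_t * grad
    W = W + m                                         # W_{t+1} = W_t + m_{t+1}
    return W, m, v

W, m, v = np.zeros(4), np.zeros(4), np.zeros(4)
W, m, v = precond_momentum_step(W, m, v, grad=np.array([0.5, -1.0, 0.1, 2.0]))
print(W)                                              # weights after a single preconditioned-momentum step
```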

3.3.2 Delta-Rule Based Updates

Using ℓ2 regression loss instead of dot-product:

W_{t+1} = W_t + m_{t+1}
m_{t+1} = (α_{t+1}I - ∇L(W_t; x_t)^⊤ ∇L(W_t; x_t)) m_t - η_t P_t ∇L(W_t; x_t)

This allows better capacity management for memorizing gradient sequences.
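
A literal one-step sketch of this update, treating the gradient as a row vector so that ∇L^⊤∇L is an outer product and taking P_t = I for simplicity (both are assumptions for the example):

```python
import numpy as np

# One delta-rule momentum update following the formula above: the gradient is treated as a
# row vector (so grad^T grad is an outer product) and P_t is taken to be the identity.
def delta_rule_momentum_step(m, grad, alpha=0.9, eta=0.01):
    g = grad.reshape(1, -1)                          # row vector
    A = alpha * np.eye(g.shape[1]) - g.T @ g         # (alpha * I - grad^T grad)
    return A @ m - eta * grad                        # m_{t+1}

m = delta_rule_momentum_step(np.zeros(3), grad=np.array([0.2, -0.1, 0.4]))
print(m)                                             # momentum state after one update
```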

3.3.3 Deep Momentum Gradient Descent (DMGD)

Replace linear momentum with an MLP:

W_{t+1} = W_t + m_{t+1}(u_t)
m_{t+1} = α_{t+1} m_t - η_t ∇L^(2)(m_t; u_t, I)

where u_t = ∇L(W_t; x_t) and m(·) is an MLP with higher capacity.
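
The sketch below replaces the momentum buffer with a tiny two-layer MLP whose parameters get one inner gradient step per token. The inner objective (reconstructing the incoming gradient u_t) and the omission of the retention term α are simplifying assumptions, not the paper's exact L^(2).

```python
import numpy as np

# Deep momentum sketch: the momentum buffer is replaced by a tiny two-layer MLP m(.)
# whose parameters get one inner gradient step per token, trained here to reconstruct the
# incoming gradient u_t (an assumed inner objective; the retention term alpha is omitted).
rng = np.random.default_rng(4)
dim, hidden = 4, 16
W = rng.normal(size=dim)                             # outer-level weights
A1 = rng.normal(size=(hidden, dim)) * 0.1            # parameters of the MLP momentum m(.)
A2 = rng.normal(size=(dim, hidden)) * 0.1
eta_outer, eta_inner = 0.1, 0.01

for t in range(50):
    x = rng.normal(size=dim)
    u = (W @ x) * x                                  # toy outer gradient u_t for L = 0.5*(W.x)^2
    h = np.tanh(A1 @ u)
    out = A2 @ h                                     # m(u_t): the MLP's proposed update direction
    # inner level: one GD step on ||m(u_t) - u_t||^2 w.r.t. the MLP's own parameters
    err = out - u
    grad_A2 = np.outer(err, h)
    grad_A1 = np.outer((A2.T @ err) * (1 - h**2), u)
    A2 -= eta_inner * grad_A2
    A1 -= eta_inner * grad_A1
    # outer level: the weights consume the MLP's output (applied here as a descent direction)
    W = W - eta_outer * out

print(np.round(W, 3))                                # outer weights after 50 self-modifying updates
```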

3.3.4 Context-Aware Gradient Descent

Considering data dependencies with ℓ2 objective:

W_{t+1} = W_t (I - x_t x_t^⊤) - η_{t+1} ∇_{y_t} L(W_t; x_t) ⊗ x_t
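
A direct one-step implementation of this update, assuming a squared-error loss on a linear layer so that the local surprise signal ∇_{y_t} L can be computed in closed form:

```python
import numpy as np

# One context-aware update step, assuming a squared-error loss on a linear layer so that
# the local surprise signal u = dL/dy has a closed form.
def context_aware_gd_step(W, x, target, eta=0.1):
    u = W @ x - target                               # LSS for L = 0.5 * ||W x - target||^2
    I = np.eye(x.shape[0])
    return W @ (I - np.outer(x, x)) - eta * np.outer(u, x)

rng = np.random.default_rng(5)
W = rng.normal(size=(3, 5))
x = rng.normal(size=5)
x = x / np.linalg.norm(x)                            # unit-norm input keeps (I - x x^T) a projection
W = context_aware_gd_step(W, x, rng.normal(size=3))
print(np.round(W, 2))                                # memory after one context-aware write
```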

4. HOPE Architecture

4.1 Continuum Memory System (CMS)

Traditional view: binary memory (short-term attention vs. long-term MLP)

NL perspective: Memory should exist on a continuum of update frequencies

The Continuum Memory System is formalized as a chain of MLP blocks with different update frequencies:

y_t = MLP^(f_k)(MLP^(f_{k-1})(···MLP^(f_1)(x_t)))

where parameters θ^(f_ℓ) are updated every C^(ℓ) steps:

θ^(f_ℓ)_{i+1} = θ^(f_ℓ)_i - Σ_{t=i-C^(ℓ)}^{i} η^(ℓ)_t f(θ^(f_ℓ)_t; x_t)   if i ≡ 0 (mod C^(ℓ))
θ^(f_ℓ)_{i+1} = θ^(f_ℓ)_i                                                 otherwise

That is, a level's parameters are refreshed only at the end of each C^(ℓ)-step chunk and are left unchanged in between.
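
The scheduling logic is easy to sketch: each level ℓ runs in the forward chain at every step but only refreshes its parameters every C^(ℓ) steps. In the toy code below the per-chunk update itself is a placeholder (an accumulated outer product), since the point is the frequency hierarchy rather than the exact inner objective.

```python
import numpy as np

# A chain of MLP blocks where level l only refreshes its parameters every C[l] steps
# (higher levels update less frequently).  The per-chunk update is a placeholder
# accumulated outer product; a real CMS level would descend its own inner objective.
rng = np.random.default_rng(6)
d, levels = 8, 3
C = [1, 4, 16]                                       # assumed update periods per level
params = [rng.normal(size=(d, d)) * 0.1 for _ in range(levels)]
accum = [np.zeros((d, d)) for _ in range(levels)]
eta = 0.01

for t in range(1, 65):                               # 64 "tokens"
    h = rng.normal(size=d)                           # stand-in for the input x_t
    for l in range(levels):
        h = np.tanh(params[l] @ h)                   # y_t = MLP^(f_k)(... MLP^(f_1)(x_t) ...)
        accum[l] += np.outer(h, h)                   # placeholder for the level's accumulated update
        if t % C[l] == 0:                            # level l "wakes up" every C[l] steps
            params[l] -= eta * accum[l]
            accum[l] = np.zeros((d, d))

print([p.shape for p in params])                     # each level keeps its shape; only schedules differ
```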

4.2 HOPE: Self-Modifying Learning Module

HOPE (Higher-Order Parameter Evolution) combines:

  1. Self-referential sequence model (based on Titans architecture)
  2. Continuum Memory System (multi-frequency MLPs)
  3. Context-aware gradient descent (from Section 3.3.4)

Key Innovation: HOPE learns to modify its own update algorithm, enabling online consolidation of memories from short-term to long-term storage.

5. Key Results

5.1 Language Modeling Performance

HOPE demonstrates strong performance across multiple scales:

Model            Size    Wiki ppl ↓    LMB ppl ↓    Avg. Acc ↑

760M parameters / 30B tokens
HOPE             760M    26.05         29.38        46.90
Transformer++    760M    25.21         27.64        48.69
Titans (LMM)     760M    20.04         21.96        51.56

1.3B parameters / 100B tokens
HOPE             1.3B    15.11         11.63        57.23
Transformer++    1.3B    18.53         18.32        52.25
Titans (LMM)     1.3B    15.60         11.41        56.82

5.2 Common-Sense Reasoning Benchmarks

HOPE outperforms baselines on multiple reasoning tasks:

1.3B Model Results:

  • PIQA: 73.29% (vs. 73.09% Titans)
  • HellaSwag: 56.84% (vs. 56.31% Titans)
  • WinoGrande: 60.19% (vs. 59.81% Titans)
  • ARC-Challenge: 41.24% (vs. 40.82% Titans)

5.3 Key Findings

  1. Dynamic Context Adaptation: HOPE's self-modifying mechanism enables better context-dependent learning
  2. Multi-Scale Memory: Continuum Memory System provides more effective knowledge storage than binary short/long-term memory
  3. Optimizer Insights: Deep momentum variants show improved convergence properties
  4. Emergent Capabilities: Higher-order in-context learning abilities emerge from multi-level optimization structure

6. Practical Implications

6.1 Continual Learning

Problem: Traditional LLMs cannot update long-term memory after pre-training without catastrophic forgetting.

Solution: HOPE's multi-frequency update mechanism enables:

  • Online consolidation of short-term memories into long-term storage
  • Gradual adaptation without disrupting existing knowledge
  • Context-aware updates that preserve important information

6.2 Optimizer Design

Insight: Viewing optimizers as associative memories provides a principled framework for enhancement:

  • Deep Momentum: Use MLPs instead of linear momentum for higher capacity
  • Preconditioning: Provide meaningful value mappings (e.g., Hessian information)
  • Delta-Rule Updates: Better capacity management for gradient sequences

6.3 Long-Context Reasoning

The Continuum Memory System naturally handles long contexts by:

  • Hierarchical compression across different time scales
  • Efficient memory consolidation without full backpropagation
  • Adaptive forgetting based on update frequency

7. Related Work and Connections

7.1 Connection to Neuroscience

Memory Consolidation: Human brain uses two-stage consolidation:

  1. Online (Synaptic): Rapid stabilization during wakefulness
  2. Offline (Systems): Replay during sleep for long-term storage

HOPE focuses on online consolidation, mimicking the brain's rapid memory stabilization.

7.2 Fast Weight Programs (FWPs)

NL generalizes FWPs by:

  • Extending beyond 2-level hierarchies to arbitrary depth
  • Providing formal framework for update frequency ordering
  • Connecting to optimization theory

7.3 Modern Recurrent Models

HOPE builds on recent architectures:

  • Linear Attention: Efficient key-value memory
  • Titans: Self-referential learning
  • State Space Models: Continuous-time dynamics

But adds:

  • Multi-frequency updates (CMS)
  • Self-modification capabilities
  • Nested optimization perspective

7.4 Test-Time Training (TTT)

Related approaches:

  • TTT-Linear: Updates linear layers at test time
  • Cartridges: Lightweight context representations

HOPE differs by:

  • Integrated training/inference (no separate test-time phase)
  • Multi-level optimization (not just single-level updates)
  • Continuum memory (not binary short/long-term)

8. Theoretical Contributions

8.1 White-Box Interpretability

NL provides transparent gradient flows for each component:

  • Each level has explicit optimization objective
  • Update frequencies are clearly defined
  • Memory compression is mathematically formalized

8.2 Expressivity Analysis

Theorem (Informal): A k-level nested learning model can implement algorithms requiring k nested loops, while traditional deep learning (1-level) cannot.

This explains:

  • Why stacking layers doesn't always increase computational depth
  • How in-context learning emerges (as inner-level optimization)
  • Path to higher-order in-context learning (more levels)

9. Limitations and Future Directions

9.1 Current Limitations

  1. Computational Overhead: Multi-level optimization requires careful scheduling
  2. Hyperparameter Tuning: Update frequencies C^(ℓ) need task-specific tuning
  3. Theoretical Gaps: Formal convergence guarantees for nested optimization remain open

9.2 Future Research Directions

  • Offline Consolidation: Incorporating sleep-like replay mechanisms
  • Adaptive Frequencies: Learning optimal update schedules
  • Hybrid Architectures: Combining NL with other paradigms (e.g., mixture-of-experts)
  • Scaling Laws: Understanding how nested depth affects scaling behavior

10. Conclusion

Nested Learning offers a paradigm shift from viewing neural networks as static computational graphs to dynamic systems of nested optimization problems. Key takeaways:

"Existing deep learning methods learn from data through compressing their own context flow"

  1. Optimizers are memories: Gradient-based optimizers compress gradient sequences
  2. Architectures are nested: Components operate at different update frequencies
  3. New dimension: Expressivity increases with optimization depth, not just layer depth
  4. Practical benefits: HOPE demonstrates improved continual learning and reasoning

This framework opens new avenues for designing more adaptive, interpretable, and powerful machine learning systems that can truly learn continuously—moving beyond the "anterograde amnesia" of current LLMs.