Computer Architecture · NeurIPS 2025 · Tags: nested learning, associative memory, continual learning, in-context learning, optimization, transformers, language models, memory systems

Nested Learning: The Illusion of Deep Learning Architectures

Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, Vahab Mirrokni

This paper introduces Nested Learning (NL), a new learning paradigm that represents models as nested, multi-level optimization problems with distinct context flows. NL reveals that deep learning methods compress their context flow and explains in-context learning emergence, leading to three contributions: Deep Optimizers (showing gradient-based optimizers are associative memory modules), Self-Modifying Titans (a sequence model that learns its own update algorithm), and a Continuum Memory System with the HOPE architecture.



1. Introduction and Problem Statement

1.1 The Challenge

Current Large Language Models (LLMs) face a fundamental limitation: they are largely static after deployment. While they excel at tasks learned during pre-training, they cannot continually acquire new capabilities beyond their immediate context window. This creates a condition analogous to anterograde amnesia in neuroscience—where the model can only access:

  • Immediate context (short-term memory via attention mechanisms)
  • Long-past knowledge (frozen parameters from pre-training)

"Current Models only Experience the Immediate Present" - lacking the ability to consolidate new information into long-term memory after deployment.

1.2 Key Limitations of Traditional Deep Learning

Traditional deep learning approaches face several challenges:

  1. Computational depth may not increase with more layers
  2. Capacity improvements show diminishing returns with depth/width
  3. Optimization may converge to suboptimal solutions
  4. Adaptation abilities (continual learning, out-of-distribution generalization) don't automatically improve with more layers

2. Nested Learning (NL) Paradigm

2.1 Core Concept

Nested Learning represents a machine learning model as a coherent system of nested, multi-level, and/or parallel optimization problems, each with its own "context flow." This paradigm reveals that:

Deep learning methods learn from data through compressing their own context flow, explaining how in-context learning emerges in large models.


2.2 Associative Memory Foundation

NL builds on the concept of associative memory—the ability to form and retrieve connections between events.

Definition (Associative Memory): Given keys K ⊆ ℝ^{d_k} and values V ⊆ ℝ^{d_v}, an associative memory is an operator M: K → V that minimizes:

M* = arg min_M L̃(M(K); V)

Key Insight: Memory is a neural update caused by input, while learning is the process of acquiring effective memory.
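
As a concrete (if minimal) illustration of the definition, the sketch below trains a linear operator M to map a small set of keys to their values by gradient descent on a squared-error objective; the dimensions, data, and learning rate are arbitrary choices for the example, not anything prescribed by the paper.

```python
import numpy as np

# A linear associative memory M trained to map keys to values by gradient descent on a
# squared-error objective L~(M(K); V).  Dimensions, data, and the learning rate are
# arbitrary choices for the example.
rng = np.random.default_rng(0)
d_k, d_v, n = 16, 4, 8                      # key dim, value dim, number of stored pairs
K = rng.normal(size=(n, d_k))               # keys
V = rng.normal(size=(n, d_v))               # values to associate with those keys

M = np.zeros((d_v, d_k))                    # the memory operator M: R^{d_k} -> R^{d_v}
lr = 0.05
for _ in range(2000):
    err = K @ M.T - V                       # M(K) - V
    M -= lr * (err.T @ K) / n               # gradient step on 0.5 * mean ||M(K) - V||^2

print(float(np.mean((K @ M.T - V) ** 2)))   # near zero: the associations have been written into M
```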

2.3 Update Frequency Hierarchy

Components are ordered by their update frequency (f_A):

  • Higher levels = Lower frequency updates (e.g., pre-training parameters)
  • Lower levels = Higher frequency updates (e.g., attention states)
  • Parallel components = Same frequency but independent computation (A =_f B)

3. Technical Approach

3.1 Decomposing Gradient Descent

Example 1: Simple MLP Training with Gradient Descent

The weight update rule:

W_{t+1} = W_t - η_{t+1} ∇_{W_t} L(W_t; x_{t+1})
         = W_t - η_{t+1} ∇_{y_{t+1}} L(W_t; x_{t+1}) ⊗ x_{t+1}

Can be reformulated as an optimization problem:

W_{t+1} = arg min_W ⟨Wx_{t+1}, u_{t+1}⟩ + (1/2η_{t+1})||W - W_t||²_2

where u_{t+1} = ∇_{y_{t+1}} L(W_t; x_{t+1}) is the Local Surprise Signal (LSS)—quantifying mismatch between current output and objective structure.
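
The equivalence can be checked numerically. The sketch below (assuming a squared-error loss on a linear layer, which is my choice for the example) compares one gradient-descent step with the minimizer of the proximal objective above, obtained by iterating on that objective directly.

```python
import numpy as np

# Check that one gradient-descent step on a linear layer y = W x coincides with the
# minimizer of the proximal objective  <W x, u> + (1/(2*eta)) ||W - W_t||^2,
# where u = dL/dy is the Local Surprise Signal.  The squared-error loss is an assumption.
rng = np.random.default_rng(1)
d_in, d_out, eta = 5, 3, 0.1
W_t = rng.normal(size=(d_out, d_in))
x = rng.normal(size=d_in)
target = rng.normal(size=d_out)

u = W_t @ x - target                        # LSS for L = 0.5 * ||y - target||^2
W_gd = W_t - eta * np.outer(u, x)           # ordinary GD step: dL/dW = u ⊗ x

# Minimize the proximal objective directly (small gradient steps on W) and compare.
W = W_t.copy()
for _ in range(2000):
    g = np.outer(u, x) + (W - W_t) / eta    # gradient of the proximal objective w.r.t. W
    W -= 0.001 * g

print(np.allclose(W, W_gd, atol=1e-4))      # True: the two formulations agree
```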

Example 2: Gradient Descent with Momentum

W_{t+1} = W_t - m_{t+1}
m_{t+1} = m_t - η_{t+1} ∇_{W_t} L(W_t; x_{t+1})

Reformulated as a 2-level optimization:

W_{t+1} = W_t - m_{t+1}  (outer level)
m_{t+1} = arg min_m -⟨m, ∇_{W_t} L(W_t; x_{t+1})⟩ + η_{t+1}||m - m_t||²_2  (inner level)

Key Discovery: Momentum is a key-less associative memory that compresses gradients into its parameters using gradient descent.
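
A minimal sketch of this two-level view, on an assumed least-squares toy problem: the inner buffer m accumulates (compresses) the gradient stream, and the outer level applies the compressed state to the weights.

```python
import numpy as np

# Momentum as a key-less associative memory on an assumed least-squares toy problem:
# the buffer m compresses the stream of gradients into one state (inner level), and the
# weights then consume that compressed history (outer level).  All constants are assumptions.
rng = np.random.default_rng(2)
n, dim = 32, 4
X = rng.normal(size=(n, dim))
w_true = rng.normal(size=dim)
y = X @ w_true

W, m = np.zeros(dim), np.zeros(dim)
eta, alpha = 0.05, 0.9                               # step size and momentum decay

for t in range(500):
    grad = X.T @ (X @ W - y) / n                     # gradient of 0.5 * mean (X W - y)^2
    m = alpha * m + eta * grad                       # inner level: m memorizes the gradient stream
    W = W - m                                        # outer level: apply the compressed state

print(np.allclose(W, w_true, atol=1e-3))             # True: the two-level system solves the regression
```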

3.2 Architectural Decomposition

Linear Attention as Nested Optimization:

Standard linear attention formulation:

k_t = x_t W_k,  v_t = x_t W_v,  q_t = x_t W_q
M_t = M_{t-1} + v_t k_t^⊤
y_t = M_t q_t

The memory update M_t can be reformulated as:

M_{t+1} = arg min_M -⟨M k_{t+1}, v_{t+1}⟩ + (1/2)||M - M_t||²_2

This reveals linear attention as a 2-level optimization:

  • Inner level: Memory M_t compresses key-value mappings (gradient descent)
  • Outer level: Projection matrices W_k, W_v, W_q are trained (gradient descent)
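
The sketch below walks through the recurrence above with randomly initialized projections; sizes and inputs are placeholders, and the outer-level training of W_k, W_v, W_q is omitted.

```python
import numpy as np

# Minimal walk-through of the linear-attention recurrence: the state M_t is an associative
# memory accumulating value-key outer products, and each output is the read-out M_t q_t.
# Dimensions, inputs, and the random projections are placeholders; the outer-level training
# of W_k, W_v, W_q is not shown.
rng = np.random.default_rng(3)
d_model, d_k, d_v, T = 16, 8, 8, 10
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_v))
W_q = rng.normal(size=(d_model, d_k))

M = np.zeros((d_v, d_k))                    # inner-level memory, rewritten at every token
outputs = []
for t in range(T):
    x_t = rng.normal(size=d_model)
    k_t, v_t, q_t = x_t @ W_k, x_t @ W_v, x_t @ W_q
    M = M + np.outer(v_t, k_t)              # one "write": a single GD step on the memory objective
    outputs.append(M @ q_t)                 # read-out y_t = M_t q_t

print(np.stack(outputs).shape)              # (10, 8): one d_v-dimensional output per token
```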

3.3 Deep Optimizers: Enhanced Gradient-Based Methods

Building on the insight that optimizers are associative memories, NL proposes several extensions:

3.3.1 More Expressive Association (Preconditioning)

W_{t+1} = W_t + m_{t+1}
m_{t+1} = α_{t+1} m_t - η_t P_t ∇L(W_t; x_t)

where P_t provides meaningful value mappings (e.g., Hessian information).
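
A one-step sketch of this form, using a diagonal inverse-second-moment preconditioner as an illustrative stand-in for P_t (the paper leaves the choice of P_t open; this particular choice is mine):

```python
import numpy as np

# One step of preconditioned momentum in the form above.  P_t is realized here as a diagonal
# inverse-second-moment preconditioner, an illustrative stand-in for curvature information.
def precond_momentum_step(W, m, v, grad, eta=0.01, alpha=0.9, beta=0.999, eps=1e-8):
    v = beta * v + (1 - beta) * grad**2               # running per-coordinate second moment
    m = alpha * m - eta * grad / (np.sqrt(v) + eps)   # m_{t+1} = alpha*m_t - eta * P_t * grad
    W = W + m                                         # W_{t+1} = W_t + m_{t+1}
    return W, m, v

W, m, v = np.zeros(4), np.zeros(4), np.zeros(4)
W, m, v = precond_momentum_step(W, m, v, grad=np.array([0.5, -1.0, 0.1, 2.0]))
print(W)                                              # weights after a single preconditioned-momentum step
```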

3.3.2 Delta-Rule Based Updates

Using ℓ2 regression loss instead of dot-product:

W_{t+1} = W_t + m_{t+1}
m_{t+1} = (α_{t+1}I - ∇L(W_t; x_t)^⊤ ∇L(W_t; x_t)) m_t - η_t P_t ∇L(W_t; x_t)

This allows better capacity management for memorizing gradient sequences.
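
A literal one-step sketch of this update, treating the gradient as a row vector so that ∇L^⊤∇L is an outer product and taking P_t = I for simplicity (both are assumptions for the example):

```python
import numpy as np

# One delta-rule momentum update following the formula above: the gradient is treated as a
# row vector (so grad^T grad is an outer product) and P_t is taken to be the identity.
def delta_rule_momentum_step(m, grad, alpha=0.9, eta=0.01):
    g = grad.reshape(1, -1)                          # row vector
    A = alpha * np.eye(g.shape[1]) - g.T @ g         # (alpha * I - grad^T grad)
    return A @ m - eta * grad                        # m_{t+1}

m = delta_rule_momentum_step(np.zeros(3), grad=np.array([0.2, -0.1, 0.4]))
print(m)                                             # momentum state after one update
```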

3.3.3 Deep Momentum Gradient Descent (DMGD)

Replace linear momentum with an MLP:

W_{t+1} = W_t + m_{t+1}(u_t)
m_{t+1} = α_{t+1} m_t - η_t ∇L^(2)(m_t; u_t, I)

where u_t = ∇L(W_t; x_t) and m(·) is an MLP with higher capacity.
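
The sketch below replaces the momentum buffer with a tiny two-layer MLP whose parameters get one inner gradient step per token. The inner objective (reconstructing the incoming gradient u_t) and the omission of the retention term α are simplifying assumptions, not the paper's exact L^(2).

```python
import numpy as np

# Deep momentum sketch: the momentum buffer is replaced by a tiny two-layer MLP m(.)
# whose parameters get one inner gradient step per token, trained here to reconstruct the
# incoming gradient u_t (an assumed inner objective; the retention term alpha is omitted).
rng = np.random.default_rng(4)
dim, hidden = 4, 16
W = rng.normal(size=dim)                             # outer-level weights
A1 = rng.normal(size=(hidden, dim)) * 0.1            # parameters of the MLP momentum m(.)
A2 = rng.normal(size=(dim, hidden)) * 0.1
eta_outer, eta_inner = 0.1, 0.01

for t in range(50):
    x = rng.normal(size=dim)
    u = (W @ x) * x                                  # toy outer gradient u_t for L = 0.5*(W.x)^2
    h = np.tanh(A1 @ u)
    out = A2 @ h                                     # m(u_t): the MLP's proposed update direction
    # inner level: one GD step on ||m(u_t) - u_t||^2 w.r.t. the MLP's own parameters
    err = out - u
    grad_A2 = np.outer(err, h)
    grad_A1 = np.outer((A2.T @ err) * (1 - h**2), u)
    A2 -= eta_inner * grad_A2
    A1 -= eta_inner * grad_A1
    # outer level: the weights consume the MLP's output (applied here as a descent direction)
    W = W - eta_outer * out

print(np.round(W, 3))                                # outer weights after 50 self-modifying updates
```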

3.3.4 Context-Aware Gradient Descent

Considering data dependencies with ℓ2 objective:

W_{t+1} = W_t (I - x_t x_t^⊤) - η_{t+1} ∇_{y_t} L(W_t; x_t) ⊗ x_t
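
A direct one-step implementation of this update, assuming a squared-error loss on a linear layer so that the local surprise signal ∇_{y_t} L can be computed in closed form:

```python
import numpy as np

# One context-aware update step, assuming a squared-error loss on a linear layer so that
# the local surprise signal u = dL/dy has a closed form.
def context_aware_gd_step(W, x, target, eta=0.1):
    u = W @ x - target                               # LSS for L = 0.5 * ||W x - target||^2
    I = np.eye(x.shape[0])
    return W @ (I - np.outer(x, x)) - eta * np.outer(u, x)

rng = np.random.default_rng(5)
W = rng.normal(size=(3, 5))
x = rng.normal(size=5)
x = x / np.linalg.norm(x)                            # unit-norm input keeps (I - x x^T) a projection
W = context_aware_gd_step(W, x, rng.normal(size=3))
print(np.round(W, 2))                                # memory after one context-aware write
```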

4. HOPE Architecture

4.1 Continuum Memory System (CMS)

Traditional view: binary memory (short-term attention vs. long-term MLP)

NL perspective: Memory should exist on a continuum of update frequencies

The Continuum Memory System is formalized as a chain of MLP blocks with different update frequencies:

y_t = MLP^(f_k)(MLP^(f_{k-1})(···MLP^(f_1)(x_t)))

where parameters θ^(f_ℓ) are updated every C^(ℓ) steps:

θ^(f_ℓ)_{i+1} = θ^(f_ℓ)_i - Σ_{t=i-C^(ℓ)}^{i} η^(ℓ)_t f(θ^(f_ℓ)_t; x_t)   if i ≡ 0 (mod C^(ℓ))
θ^(f_ℓ)_{i+1} = θ^(f_ℓ)_i                                                 otherwise

That is, a level's parameters are refreshed only at the end of each C^(ℓ)-step chunk and are left unchanged in between.
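
The scheduling logic is easy to sketch: each level ℓ runs in the forward chain at every step but only refreshes its parameters every C^(ℓ) steps. In the toy code below the per-chunk update itself is a placeholder (an accumulated outer product), since the point is the frequency hierarchy rather than the exact inner objective.

```python
import numpy as np

# A chain of MLP blocks where level l only refreshes its parameters every C[l] steps
# (higher levels update less frequently).  The per-chunk update is a placeholder
# accumulated outer product; a real CMS level would descend its own inner objective.
rng = np.random.default_rng(6)
d, levels = 8, 3
C = [1, 4, 16]                                       # assumed update periods per level
params = [rng.normal(size=(d, d)) * 0.1 for _ in range(levels)]
accum = [np.zeros((d, d)) for _ in range(levels)]
eta = 0.01

for t in range(1, 65):                               # 64 "tokens"
    h = rng.normal(size=d)                           # stand-in for the input x_t
    for l in range(levels):
        h = np.tanh(params[l] @ h)                   # y_t = MLP^(f_k)(... MLP^(f_1)(x_t) ...)
        accum[l] += np.outer(h, h)                   # placeholder for the level's accumulated update
        if t % C[l] == 0:                            # level l "wakes up" every C[l] steps
            params[l] -= eta * accum[l]
            accum[l] = np.zeros((d, d))

print([p.shape for p in params])                     # each level keeps its shape; only schedules differ
```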

4.2 HOPE: Self-Modifying Learning Module

HOPE (Higher-Order Parameter Evolution) combines:

  1. Self-referential sequence model (based on Titans architecture)
  2. Continuum Memory System (multi-frequency MLPs)
  3. Context-aware gradient descent (from Section 3.3.4)

Key Innovation: HOPE learns to modify its own update algorithm, enabling online consolidation of memories from short-term to long-term storage.

5. Key Results

5.1 Language Modeling Performance

HOPE demonstrates strong performance across multiple scales:

Model            Size    Wiki ppl ↓    LMB ppl ↓    Avg. Acc ↑

760M parameters / 30B tokens
HOPE             760M    26.05         29.38        46.90
Transformer++    760M    25.21         27.64        48.69
Titans (LMM)     760M    20.04         21.96        51.56

1.3B parameters / 100B tokens
HOPE             1.3B    15.11         11.63        57.23
Transformer++    1.3B    18.53         18.32        52.25
Titans (LMM)     1.3B    15.60         11.41        56.82

5.2 Common-Sense Reasoning Benchmarks

HOPE outperforms baselines on multiple reasoning tasks:

1.3B Model Results:

  • PIQA: 73.29% (vs. 73.09% Titans)
  • HellaSwag: 56.84% (vs. 56.31% Titans)
  • WinoGrande: 60.19% (vs. 59.81% Titans)
  • ARC-Challenge: 41.24% (vs. 40.82% Titans)

5.3 Key Findings

  1. Dynamic Context Adaptation: HOPE's self-modifying mechanism enables better context-dependent learning
  2. Multi-Scale Memory: Continuum Memory System provides more effective knowledge storage than binary short/long-term memory
  3. Optimizer Insights: Deep momentum variants show improved convergence properties
  4. Emergent Capabilities: Higher-order in-context learning abilities emerge from multi-level optimization structure

6. Practical Implications

6.1 Continual Learning

Problem: Traditional LLMs cannot update long-term memory after pre-training without catastrophic forgetting.

Solution: HOPE's multi-frequency update mechanism enables:

  • Online consolidation of short-term memories into long-term storage
  • Gradual adaptation without disrupting existing knowledge
  • Context-aware updates that preserve important information

6.2 Optimizer Design

Insight: Viewing optimizers as associative memories provides a principled framework for enhancement:

  • Deep Momentum: Use MLPs instead of linear momentum for higher capacity
  • Preconditioning: Provide meaningful value mappings (e.g., Hessian information)
  • Delta-Rule Updates: Better capacity management for gradient sequences

6.3 Long-Context Reasoning

The Continuum Memory System naturally handles long contexts by:

  • Hierarchical compression across different time scales
  • Efficient memory consolidation without full backpropagation
  • Adaptive forgetting based on update frequency

7. Related Work and Connections

7.1 Connection to Neuroscience

Memory Consolidation: Human brain uses two-stage consolidation:

  1. Online (Synaptic): Rapid stabilization during wakefulness
  2. Offline (Systems): Replay during sleep for long-term storage

HOPE focuses on online consolidation, mimicking the brain's rapid memory stabilization.

7.2 Fast Weight Programs (FWPs)

NL generalizes FWPs by:

  • Extending beyond 2-level hierarchies to arbitrary depth
  • Providing formal framework for update frequency ordering
  • Connecting to optimization theory

7.3 Modern Recurrent Models

HOPE builds on recent architectures:

  • Linear Attention: Efficient key-value memory
  • Titans: Self-referential learning
  • State Space Models: Continuous-time dynamics

But adds:

  • Multi-frequency updates (CMS)
  • Self-modification capabilities
  • Nested optimization perspective

7.4 Test-Time Training (TTT)

Related approaches:

  • TTT-Linear: Updates linear layers at test time
  • Cartridges: Lightweight context representations

HOPE differs by:

  • Integrated training/inference (no separate test-time phase)
  • Multi-level optimization (not just single-level updates)
  • Continuum memory (not binary short/long-term)

8. Theoretical Contributions

8.1 White-Box Interpretability

NL provides transparent gradient flows for each component:

  • Each level has explicit optimization objective
  • Update frequencies are clearly defined
  • Memory compression is mathematically formalized

8.2 Expressivity Analysis

Theorem (Informal): A k-level nested learning model can implement algorithms requiring k nested loops, while traditional deep learning (1-level) cannot.

This explains:

  • Why stacking layers doesn't always increase computational depth
  • How in-context learning emerges (as inner-level optimization)
  • Path to higher-order in-context learning (more levels)

9. Limitations and Future Directions

9.1 Current Limitations

  1. Computational Overhead: Multi-level optimization requires careful scheduling
  2. Hyperparameter Tuning: Update frequencies C^(ℓ) need task-specific tuning
  3. Theoretical Gaps: Formal convergence guarantees for nested optimization remain open

9.2 Future Research Directions

  • Offline Consolidation: Incorporating sleep-like replay mechanisms
  • Adaptive Frequencies: Learning optimal update schedules
  • Hybrid Architectures: Combining NL with other paradigms (e.g., mixture-of-experts)
  • Scaling Laws: Understanding how nested depth affects scaling behavior

10. Conclusion

Nested Learning offers a paradigm shift from viewing neural networks as static computational graphs to dynamic systems of nested optimization problems. Key takeaways:

"Existing deep learning methods learn from data through compressing their own context flow"

  1. Optimizers are memories: Gradient-based optimizers compress gradient sequences
  2. Architectures are nested: Components operate at different update frequencies
  3. New dimension: Expressivity increases with optimization depth, not just layer depth
  4. Practical benefits: HOPE demonstrates improved continual learning and reasoning

This framework opens new avenues for designing more adaptive, interpretable, and powerful machine learning systems that can truly learn continuously—moving beyond the "anterograde amnesia" of current LLMs.