Computer Architecture Research
Stay current with breakthrough research and emerging trends. Explore cutting-edge papers from top-tier conferences and understand their practical implications.
2025
HipKittens: Fast and Furious AMD Kernels
HipKittens is a C++ embedded domain-specific language that provides tile-based programming primitives for high-performance AI kernel development on AMD GPUs. The framework introduces novel scheduling patterns (8-wave ping-pong and 4-wave interleave), explicit register management, and chiplet-aware cache optimization to achieve performance competitive with or exceeding hand-optimized assembly kernels across diverse AI workloads.
Impact: HipKittens addresses the critical software gap limiting AMD GPU adoption in AI workloads, often called the 'CUDA moat.' By providing accessible C++ programming primitives, it enables developers to write high-performance AMD kernels without resorting to raw assembly. The framework achieves 1.2-10× speedups over existing baselines in various settings and matches AMD's hand-optimized assembly kernels across key operations like GEMM and attention. This work is particularly impactful for democratizing AI hardware access, as AMD MI355X GPUs offer competitive or superior specifications to NVIDIA alternatives (2.5 PFLOPs BF16, 8 TB/s bandwidth, 288 GB memory). The open-source release enables the AI community to leverage diverse hardware platforms, potentially breaking vendor lock-in and accelerating AI development through increased compute availability.
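The tile abstraction behind such DSLs can be illustrated with a toy CPU-side sketch. This is plain NumPy, not HipKittens' actual C++ API; the tile size and loop structure are illustrative. The point is that the programmer works in units of tiles (which a real DSL maps to GPU register fragments and shared memory) rather than individual scalar elements:

```python
import numpy as np

TILE = 16  # tile edge; a tile DSL exposes tiles like this as the unit of work

def tiled_matmul(A, B, tile=TILE):
    """Toy tile-based GEMM: each (i, j) output tile is accumulated from a
    row of A-tiles and a column of B-tiles, mirroring how a tile DSL maps
    tiles onto register/shared-memory fragments on the GPU."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % tile == 0 and K % tile == 0 and N % tile == 0
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            acc = np.zeros((tile, tile), dtype=A.dtype)  # accumulator tile
            for k in range(0, K, tile):
                acc += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
            C[i:i+tile, j:j+tile] = acc
    return C
```

In the real framework, the inner tile multiply-accumulate would lower to AMD matrix-core instructions, and scheduling patterns like 8-wave ping-pong decide which waves load tiles while others compute.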
Nested Learning: The Illusion of Deep Learning Architectures
This paper introduces Nested Learning (NL), a new learning paradigm that represents models as nested, multi-level optimization problems with distinct context flows. NL reveals that deep learning methods compress their context flow and explains in-context learning emergence, leading to three contributions: Deep Optimizers (showing gradient-based optimizers are associative memory modules), Self-Modifying Titans (a sequence model that learns its own update algorithm), and a Continuum Memory System with the HOPE architecture.
Impact: Nested Learning provides a new theoretical framework for understanding and designing machine learning models with enhanced continual learning capabilities. The HOPE architecture demonstrates practical improvements over Transformers and modern recurrent networks in language modeling tasks, achieving better perplexity and accuracy on benchmark datasets. The framework addresses the static nature of Large Language Models after deployment by enabling online memory consolidation, similar to human brain processes. This has significant implications for developing AI systems that can continually acquire new knowledge beyond their immediate context window, potentially reducing the need for expensive retraining and enabling more adaptive AI systems in production environments.
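One core idea, parameter groups that update at different frequencies, can be sketched in a few lines. This is a heavily simplified stand-in for NL's multi-level optimization (the variable names and two-level split are illustrative assumptions, not the paper's formulation): a fast group updates every step, while a slow group consolidates an averaged gradient every K steps, loosely analogous to short- and long-timescale memory:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two parameter groups with distinct update frequencies: a toy stand-in
# for nested optimization levels with different "context flows."
w_fast = np.zeros(4)            # updated every step (fast level)
w_slow = np.zeros(4)            # consolidated every K steps (slow level)
K, lr = 8, 0.1
slow_grad_buffer = np.zeros(4)

# Toy regression task: recover true_w from noiseless linear observations.
X = rng.normal(size=(256, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w

losses = []
for t in range(256):
    x, target = X[t], y[t]
    pred = x @ (w_fast + w_slow)        # prediction combines both levels
    err = pred - target
    losses.append(err ** 2)
    g = err * x                          # squared-loss gradient
    w_fast -= lr * g                     # fast level: immediate update
    slow_grad_buffer += g
    if (t + 1) % K == 0:                 # slow level: periodic consolidation
        w_slow -= lr * slow_grad_buffer / K
        slow_grad_buffer[:] = 0.0
```

The sketch only conveys the multi-frequency structure; HOPE's actual memory system learns its own update rules rather than applying fixed SGD at each level.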
PAN: A World Model for General, Interactable, and Long-Horizon World Simulation
PAN is a general-purpose world model that predicts future world states through high-quality video simulation conditioned on history and natural language actions. It employs a Generative Latent Prediction (GLP) architecture combining an LLM-based autoregressive latent dynamics backbone with a video diffusion decoder to achieve unified latent space reasoning and realizable world dynamics.
Impact: PAN enables practical applications in robotics, autonomous systems, and AI planning by providing a general-purpose simulator that can predict future world states based on natural language actions. The model's ability to maintain long-horizon consistency and support interactive simulation makes it valuable for training robotic policies, testing autonomous vehicle scenarios, generating synthetic training data, and enabling agents to perform 'thought experiments' before taking real-world actions. Its open-domain generalization allows deployment across diverse environments without domain-specific retraining, while the natural language action interface makes it accessible for human-in-the-loop applications and planning systems.
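The GLP loop, encode an observation into latent space, advance the latent state with an action-conditioned autoregressive backbone, and decode each latent back to an observable frame, can be shown as a skeleton. All modules below are random-linear stand-ins with made-up shapes; PAN's real backbone is an LLM and its decoder is a video diffusion model:

```python
import numpy as np

rng = np.random.default_rng(1)
LATENT = 8

# Stand-in modules (random linear maps) for the three GLP parts:
# encoder (obs -> latent), dynamics (latent + action -> latent),
# decoder (latent -> frame). Shapes and names are illustrative only.
W_enc = rng.normal(size=(LATENT, LATENT))
W_dyn = rng.normal(size=(LATENT, 2 * LATENT)) * 0.1
W_dec = rng.normal(size=(16, LATENT))

def encode(obs):
    return W_enc @ obs

def dynamics(z, action):
    # Autoregressive latent step conditioned on the action embedding.
    return np.tanh(W_dyn @ np.concatenate([z, action]))

def decode(z):
    return W_dec @ z  # diffusion-decoder stand-in: latent -> "frame"

def rollout(obs0, actions):
    """Simulate forward: one latent step per action, decoding each
    latent to an observable frame. Long-horizon consistency comes from
    reasoning in latent space rather than pixel space."""
    z = encode(obs0)
    frames = []
    for a in actions:  # natural-language actions would be embedded here
        z = dynamics(z, a)
        frames.append(decode(z))
    return frames
```

Separating latent reasoning from pixel-level decoding is what lets such models stay coherent over long horizons while still producing realizable video.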
2024
KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation
KV-Runahead is a parallelization scheme that accelerates the LLM prompt phase by dual-purposing the KV-cache for parallel generation, achieving 1.4× and 1.6× speedups for Llama 7B and Falcon 7B through asynchronous communication and context-level load balancing.
Impact: Directly reduces time-to-first-token (TTFT) in production LLM serving systems, enabling better user experience for long-context applications like RAG, summarization, and in-context learning.
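The intuition behind context-level load balancing can be sketched simply. Under causal attention, token i attends to i+1 keys, so the per-token cost grows with position; splitting the prompt into equal-length chunks would overload later workers. The heuristic below (an illustration of the idea, not the paper's exact scheme) instead balances cumulative attention cost, giving later workers fewer tokens:

```python
def balanced_partition(n_tokens, n_workers):
    """Split a prompt into contiguous chunks so each worker's causal-
    attention cost (token i attends to i + 1 keys) is roughly equal.
    Illustrative heuristic, not KV-Runahead's exact partitioner."""
    total = n_tokens * (n_tokens + 1) // 2   # sum of per-token costs
    target = total / n_workers
    bounds, acc, start = [], 0, 0
    for i in range(n_tokens):
        acc += i + 1                          # cost of token i
        if acc >= target and len(bounds) < n_workers - 1:
            bounds.append((start, i + 1))     # close this chunk
            start, acc = i + 1, 0
    bounds.append((start, n_tokens))          # last worker takes the rest
    return bounds
```

For a 100-token prompt and 4 workers this yields chunks of decreasing length, since each later token is more expensive to process than each earlier one.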
2023
Dynamic Warp Scheduling for Improved GPU Utilization
This work proposes a machine-learning-based warp scheduler that adapts to workload characteristics, achieving 15-25% performance improvements across diverse GPU workloads.
Impact: Directly applicable to next-generation GPU architectures, with major vendors expressing interest in the approach for future products.
Scalable Cache Coherence for Manycore Processors
2017
Attention Is All You Need
This paper introduces the Transformer, a novel neural network architecture based entirely on attention mechanisms, eliminating the need for recurrence and convolutions. The model achieves state-of-the-art results on machine translation tasks (28.4 BLEU on WMT 2014 English-to-German) while being significantly more parallelizable and requiring less training time than previous approaches.
Impact: The Transformer architecture has revolutionized natural language processing and beyond, becoming the foundation for modern large language models like BERT, GPT, and their successors. Its parallel processing capabilities enable efficient training on modern GPU hardware, reducing computational costs and training time significantly. The architecture's success in machine translation demonstrated that attention mechanisms alone could outperform recurrent networks, leading to widespread adoption across various domains including computer vision, speech recognition, and protein folding. The model's interpretability through attention visualizations has also provided insights into how neural networks process sequential data, influencing both research and production systems in industry.
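The paper's core operation, scaled dot-product attention, is compact enough to state directly. A minimal NumPy version of Equation 1 from the paper (single head, no masking or batching):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # scaling keeps dot products in range
    return softmax(scores) @ V        # convex combination of value rows
```

Because every query attends to every key in one matrix product, the whole sequence is processed in parallel; this is the parallelizability that recurrent models lack and that made the architecture such a good fit for GPU training.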
Stay Ahead of the Curve
Computer architecture is rapidly evolving. Our research summaries help you understand the latest breakthroughs and their practical implications for system design.