Bringing Blazing Fast State Space Models to the Modular MAX Framework
02/04/2026


By Evan Owen

Two weeks. That's how long it took the QWERKY AI team to bring the Mamba 1 architecture to Modular's MAX framework. Along the way, we built the kernel-level infrastructure needed for Mamba 2, validated everything against the most widely used open-source implementations, and shipped something that, as far as we can tell, nobody else has done yet: CPU-only selective scan and causal convolution kernels for state space models (SSMs).

I want to walk through what we built, what made it interesting (read: hard), and why this matters for what we're building at QWERKY.

A Quick Primer on MAX by Modular

If you're not familiar with Modular, here's the short version: MAX (Modular Accelerated Xecution) is an AI inference platform built to be fast, portable, and vendor-agnostic. It's not just a serving wrapper around PyTorch. It's a full graph compiler and runtime that optimizes your model at the graph level and executes it across NVIDIA GPUs, AMD GPUs, and CPUs without requiring you to rewrite anything for each target.

Under the hood, MAX is built on Mojo, Modular's systems-level programming language that compiles down to multi-level intermediate representation (MLIR). What that means in practice is you can write custom kernels in something that feels almost like Python but compiles to code that actually screams on the hardware. The framework handles kernel fusion, continuous batching, and multi-GPU scaling, and it ships as a lean Kubernetes-compatible container.

For us at QWERKY, the appeal was immediate. We build custom SSM architectures, and we need to deploy them across different hardware without maintaining separate codebases for every vendor's accelerator stack. As we described in Modular's case study on our partnership, traditional frameworks like PyTorch and TensorFlow proved inefficient for our custom Mamba-based architectures. We could absolutely write CUDA kernels ourselves, but it doesn't make sense when you also need those same kernels running on AMD GPUs via ROCm, which means maintaining a completely separate codebase with different toolchains and APIs. MAX solves that. We write our kernels once in Mojo and deploy them across NVIDIA and AMD without the rewrite overhead.

But here's the part that made this project interesting: MAX was architected primarily around transformer-based models. Attention mechanisms, key-value (KV) caches, the standard transformer block pipeline. State space models are a fundamentally different computational paradigm, and when we started, the infrastructure to support them simply didn't exist in the framework. That's what we set out to build.

Why State Space Models? Why Mamba?

For readers who haven't spent time with state space models, here's the core idea. Transformers (the architecture behind GPT, Claude, Llama, and most of the models you've heard of) use an "attention" mechanism that lets every token in a sequence look at every other token. That's powerful, but it scales quadratically with sequence length. Double the context window and you quadruple the compute. That gets expensive fast.

State space models like Mamba take a different approach. Instead of attention, they use a recurrence, a mechanism where the model processes the sequence one step at a time, maintaining a compressed hidden state that summarizes everything it's seen so far. The key innovation in Mamba is that this recurrence is selective: the model dynamically decides at each timestep what information to keep and what to discard, based on the actual content of the input. This gives you linear-time inference scaling. Double the sequence length and you only double the compute.
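To make that concrete, here's a minimal NumPy sketch of the kind of recurrence involved. It's a simplified, unbatched version of the selective scan; the names, shapes, and discretization are illustrative rather than a description of our production kernel:

```python
import numpy as np

def selective_scan(x, delta, A, B, C, D):
    """Simplified selective scan for a single sequence.

    x:     (T, d)  input sequence
    delta: (T, d)  input-dependent step sizes (the "selective" part)
    A:     (d, N)  state transition parameters
    B, C:  (T, N)  input-dependent projections
    D:     (d,)    skip connection
    """
    T, d = x.shape
    h = np.zeros((d, A.shape[1]))              # recurrent hidden state
    y = np.zeros((T, d))
    for t in range(T):
        dA = np.exp(delta[t][:, None] * A)     # discretized transition, (d, N)
        dBx = delta[t][:, None] * B[t][None, :] * x[t][:, None]
        h = dA * h + dBx                       # keep/forget is decided by the input itself
        y[t] = h @ C[t] + D * x[t]             # project the state back to outputs
    return y
```

Because delta, B, and C are computed from the input at each timestep, the model decides on the fly how strongly to retain or overwrite its state, which is the "selective" part of the name.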

The original Mamba paper by Albert Gu and Tri Dao reported four to five times higher inference throughput compared to similarly sized transformers, with linear scaling in sequence length. Those are the published benchmarks from the architecture itself, not numbers we measured. But they track with what we've seen in our own work, and they're a big part of why we've bet our company on this family of architectures. If you want a deeper introduction to the theory, Maarten Grootendorst's visual guide to Mamba and state space models is an excellent starting point.

What We Built

The core of this effort was a set of eight custom kernels, including two fused kernels optimized for Mamba.

The variable-length causal conv1d kernel handles the one-dimensional causal convolution at the front of every Mamba block. Causal convolutions are conceptually straightforward: you're convolving over a sequence while ensuring the output at position t depends only on inputs at positions ≤ t. Getting this right for variable-length sequences in a batched setting requires careful handling of padding and masking. Our implementation handles arbitrary sequence lengths without wasting compute by padding everything to a fixed maximum.
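For readers who want the semantics spelled out, here's a short reference-style sketch of a causal depthwise conv1d in NumPy. It's illustrative only, not the fused kernel we shipped:

```python
import numpy as np

def causal_conv1d(x, w, bias=None):
    """Causal depthwise conv1d: output at position t sees only inputs at positions <= t.

    x: (T, d) sequence, w: (d, K) per-channel filter taps.
    """
    T, d = x.shape
    K = w.shape[1]
    # Left-pad with zeros so no future token leaks into the window.
    x_pad = np.concatenate([np.zeros((K - 1, d)), x], axis=0)
    y = np.zeros((T, d))
    for t in range(T):
        # Window ends at position t; each channel applies its own K taps.
        y[t] = np.einsum('kd,dk->d', x_pad[t:t + K], w)
    return y if bias is None else y + bias
```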

The variable-length selective scan kernels implement the heart of the Mamba architecture: the selective state space mechanism. This is where Mamba gets its ability to dynamically gate information flow through the recurrence, and where most of the computational complexity lives. Variable-length support is critical for efficient batched inference when you're serving requests with different sequence lengths simultaneously.
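One common way to handle variable lengths is to pack all the sequences in a batch into a single flat buffer and track the boundaries with cumulative sequence lengths, resetting the recurrent state at each boundary. The sketch below shows that idea with a simplified diagonal recurrence; the packing layout and names are hypothetical, not MAX's actual kernel signature:

```python
import numpy as np

def selective_scan_varlen(x, a, b, cu_seqlens):
    """x, a, b: (total_tokens, d) packed across sequences.
    cu_seqlens: boundaries of the packed sequences, e.g. [0, 7, 12, 30]."""
    y = np.empty_like(x)
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        h = np.zeros(x.shape[1])          # state resets at each sequence boundary
        for t in range(start, end):
            h = a[t] * h + b[t] * x[t]    # simplified diagonal recurrence
            y[t] = h
    return y
```

The point of doing this in the kernel is that no compute is wasted on padding tokens, which matters when request lengths in a batch vary widely.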

We also implemented the standard causal conv1d and selective scan kernels for fixed-length use cases, along with normalization and fused add and root-mean-square (RMS) norm operations. Fusing the add and RMS norm into a single kernel pass avoids an extra round-trip to global memory, the kind of small optimization that compounds quickly across billions of tokens.
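Here's what the fused add and RMS norm computes, written unfused for clarity. Shapes are illustrative; the fused kernel does this in a single pass without the intermediate round-trip to memory:

```python
import numpy as np

def fused_add_rms_norm(x, residual, weight, eps=1e-6):
    """Add the residual, then RMS-normalize. Returns both the normalized
    output and the updated residual so the next block can reuse it.
    x, residual: (T, d), weight: (d,)."""
    hidden = x + residual                                     # the "add"
    rms = np.sqrt(np.mean(hidden * hidden, axis=-1, keepdims=True) + eps)
    return (hidden / rms) * weight, hidden                    # normalized output, new residual
```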

Beyond the individual kernels, we built a state space cache layer that didn't previously exist in MAX. This is a critical piece. Without a proper SSM cache, you'd have to recompute the entire recurrence from scratch on every forward pass during autoregressive generation. That's fine for prefill, but it's a disaster for decode latency and memory consumption. Our cache layer stores the recurrent state between decode steps, cutting both inference time and memory requirements during generation.
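Conceptually, the cache holds two small pieces of per-sequence state: the rolling window feeding the causal conv1d and the hidden state of the selective scan. The sketch below shows the idea of a single-token decode step against that cache; the layout and names are illustrative, not MAX's actual cache API:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class SSMCacheEntry:
    """Per-sequence recurrent state kept between decode steps (illustrative).

    conv_state: (d, K-1)  last K-1 inputs feeding the causal conv1d
    ssm_state:  (d, N)    hidden state of the selective scan
    """
    conv_state: np.ndarray
    ssm_state: np.ndarray

def decode_step(x_t, delta_t, A, B_t, C_t, cache: SSMCacheEntry):
    # One-token update: no need to replay the whole sequence.
    # (The conv_state window would be rolled by one token in the same way;
    # omitted here to keep the sketch short.)
    dA = np.exp(delta_t[:, None] * A)
    cache.ssm_state = dA * cache.ssm_state + delta_t[:, None] * B_t[None, :] * x_t[:, None]
    return cache.ssm_state @ C_t
```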

The Fun Part: What Made This Hard

Implementing a different computational paradigm inside a framework built for another one is, how do I put this, not a weekend project. Here are some of the more interesting problems we ran into.

Getting a Transformer Framework to Think in Recurrences

MAX's inference pipeline was designed around the transformer computational pattern: big batched matrix multiplications flowing through attention layers with a KV cache for autoregressive decode. SSMs don't work that way. The core operation in Mamba is a recurrence where the state at position t depends on the state at position t-1. That's a different data flow, and a lot of the assumptions baked into the framework's scheduling, memory management, and caching didn't directly apply.

The KV cache is the big one. In a transformer, the KV cache is how you avoid recomputing attention over the entire sequence at every decode step. SSMs need something analogous: you have to persist the recurrent hidden state between steps. But the shape of that state, the way it gets updated, and the access patterns are all different. There was no "SSM cache" in MAX. We had to design the memory layout and update logic from scratch, making sure it integrated cleanly with the existing serving infrastructure while respecting the different lifecycle of a recurrent state versus a key-value store.

Selective Scan on Hardware That Loves MatMuls

Here's a fun tension: modern GPUs are optimized for large, regular matrix multiplications. Transformers give them exactly that: big, general matrix-matrix multiplies (GEMMs) that saturate the tensor cores. The selective scan in Mamba is not a GEMM. It's an inherently sequential computation along the sequence dimension, involving element-wise operations, gating, and recurrent-state updates at each step. You're working against the hardware's natural strengths.

Tri Dao's original CUDA implementation of the selective scan is a genuinely impressive hardware-aware algorithm design: kernel fusion, parallel scan decompositions, careful memory staging to make it fast despite not being a natural fit for GPU architecture. Reimplementing this in Mojo meant we couldn't just port CUDA line by line. We had to understand the algorithmic intent behind each optimization and reexpress it using Mojo's abstractions and MAX's compilation pipeline. Getting the same hardware efficiency with a completely different toolchain while maintaining numerical correctness was one of the more satisfying puzzles of this project.
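The trick that makes a parallel scan possible at all is that a linear recurrence can be expressed with an associative combine operator, so the sequential loop can be replaced by a tree of combines. Here's a tiny NumPy illustration of that equivalence for a scalar recurrence (a teaching sketch, not the kernel):

```python
import numpy as np

def combine(f, g):
    """Compose two affine updates h -> a*h + b. Associativity of this
    operator is what lets the recurrence run as a parallel prefix scan."""
    a1, b1 = f
    a2, b2 = g
    return a2 * a1, a2 * b1 + b2

def scan_sequential(a, b):
    """h_t = a_t * h_{t-1} + b_t with h_0 = 0, computed the obvious way."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return np.array(out)

def scan_associative(a, b):
    """Same values via prefix combination; on a GPU the tree of combine()
    calls runs in O(log T) parallel steps instead of T sequential ones."""
    acc, out = (1.0, 0.0), []
    for f in zip(a, b):
        acc = combine(acc, f)
        out.append(acc[1])
    return np.array(out)
```

Both functions produce identical outputs; the associative form is what the parallel decompositions in the CUDA kernel (and our Mojo reimplementation) exploit.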

Cross-Vendor Portability for Non-Standard Operations

Running standard transformer operations across NVIDIA and AMD is well-trodden ground, with mature libraries on both sides. Custom SSM kernels? Less so. The memory hierarchies, warp sizes versus wavefront sizes, and optimal access patterns differ between vendors, and operations like the selective scan are sensitive to these details because they're memory-bandwidth-bound rather than compute-bound.

This is actually one of the places where Mojo and MAX's hardware abstraction layer really earned their keep. The promise is "write once, run anywhere," and for standard operations, that mostly just works. For custom kernels doing non-standard things at the edge of what the framework was designed for, careful profiling and tuning were still required. But the fact that we could do it without maintaining two completely separate kernel codebases was a real win.

Building CPU Kernels from Scratch

This was probably the most unexpected part of the project. As far as we could find, the fused selective scan and causal conv1d kernels have only ever existed as GPU implementations. There's no reference CPU implementation and no established set of tricks for making these operations fast on x86. We were starting from a blank page.

CPU execution has totally different performance characteristics. You're optimizing for cache utilization and single-instruction, multiple-data (SIMD) vectorization rather than thread occupancy and memory coalescing. The parallel scan decomposition that works beautifully on a GPU with thousands of threads doesn't translate directly to a CPU with a handful of cores. We had to rethink the algorithmic structure for an entirely different execution model.
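The resulting loop structure looks quite different from the GPU version: you keep the sequential dependency along time but vectorize across the channel and state dimensions, which is what maps onto SIMD lanes. A rough NumPy sketch of that shape of computation (illustrative only; the real kernel also blocks for cache and avoids materializing large temporaries):

```python
import numpy as np

def selective_scan_cpu_style(dA, dBx, C):
    """dA, dBx: (T, d, N) precomputed discretized terms, C: (T, N)."""
    T, d, N = dA.shape
    h = np.zeros((d, N))
    y = np.empty((T, d))
    for t in range(T):                 # the time dependency you can't remove
        h = dA[t] * h + dBx[t]         # elementwise across d*N lanes -> SIMD-friendly
        y[t] = h @ C[t]                # small matvec over contiguous memory
    return y
```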

The payoff, though, is significant: CPU-only SSM inference opens up deployment scenarios that were previously not possible: edge devices, local development without GPU drivers, and cost-sensitive batch processing. As far as we know, nobody else has shipped these, and we think they'll matter more than people currently expect.

Numerical Precision across Recurrent Steps

SSMs are notoriously sensitive to numerical precision. The Mamba authors note in their repository that parameters should be stored in fp32, as fp16 storage can introduce instabilities. If you've read my post on incidental non-determinism, you know this is a topic close to my heart. When you're running a recurrence, small floating-point errors at position t propagate forward to t+1, then t+2, and so on. Over long sequences, these errors accumulate in ways that just don't happen in attention-based models, where each position's computation is relatively independent.
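You can see the effect with a toy probe: run the same scalar recurrence once in float64 and once with the state stored and updated at reduced precision, and watch how far they drift apart by the end of the sequence. This is just an illustration of error propagation, not our validation suite:

```python
import numpy as np

def recurrence_drift(a, b, dtype=np.float16):
    """Compare h_t = a_t * h_{t-1} + b_t at full vs. reduced precision."""
    h_hi = np.float64(0.0)
    h_lo = dtype(0.0)
    for a_t, b_t in zip(a, b):
        h_hi = a_t * h_hi + b_t
        h_lo = dtype(dtype(a_t) * h_lo + dtype(b_t))  # every step rounds, and the rounding feeds forward
    return abs(float(h_hi) - float(h_lo))

rng = np.random.default_rng(0)
a = rng.uniform(0.9, 1.0, size=4096)   # decay factors near 1 keep old state (and old error) alive
b = 0.01 * rng.normal(size=4096)
print(recurrence_drift(a, b))
```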

Our validation had to be thorough. We verified numerical agreement with the reference implementations from Tri Dao's codebase and vLLM across a range of sequence lengths, batch sizes, and model configurations, on both NVIDIA and AMD hardware. It took several rounds of careful debugging to get exact agreement within floating-point tolerance across all configurations. Not glamorous, but absolutely necessary.

Validation and End-to-End Results

We tested and validated every kernel against the popular implementations by Tri Dao and vLLM to make sure our kernels match the exact specification of the original forward pass. Not approximate equivalence, but exact numerical agreement within floating-point tolerance.
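The pattern behind that validation is simple to state: feed identical inputs to our kernel and to the reference, sweep over shapes, and assert agreement within a floating-point tolerance. A schematic version (function names, shapes, and tolerances here are illustrative, not our actual test suite):

```python
import numpy as np

def check_kernel(ours, reference, make_inputs, shapes, rtol=1e-5, atol=1e-6):
    """Run our implementation and the reference on the same inputs and
    require numerical agreement within tolerance, across a sweep of shapes."""
    for shape in shapes:
        inputs = make_inputs(shape)
        np.testing.assert_allclose(ours(*inputs), reference(*inputs),
                                   rtol=rtol, atol=atol)
```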

With the kernel suite and cache layer in place, we validated end-to-end inference for Mamba at 130 million and two billion parameters. Both run successfully, confirming the full stack from tokenization through the SSM blocks to output logits works correctly. We also confirmed that Mamba runs on both NVIDIA and AMD hardware within the framework, and we blazed a trail with what we believe are the first CPU-only selective scan and causal conv1d kernels, opening future paths to CPU-only architectures and local inference.

Why This Really Matters: From Mamba Kernels to QWERKY's Architecture

Here's where I want to connect the engineering work to the bigger picture, because this isn't just about getting vanilla Mamba models running in MAX.

QWERKY doesn't ship off-the-shelf Mamba. Our proprietary QDistill technology optimizes models into our own state-space model variants, architectures built on the same foundational principles as Mamba (selective state spaces, causal convolutions, recurrent hidden states), but with QWERKY-specific modifications that push the performance envelope further. QDistill reduces VRAM usage by up to 90 percent and boosts throughput by up to tenfold while maintaining model accuracy. To put that in practical terms: our custom-trained eight-billion-parameter state space models deliver the responsiveness that you'd typically need a hundred-billion-plus-parameter transformer to achieve, at a fraction of the computational cost.

The key insight is that our models are built on the same kernel infrastructure as standard Mamba, but extend it in proprietary ways. Think of it like this: if vanilla Mamba defines the grammar, our architecture speaks the same language but says more interesting things with it. The eight kernels we built for MAX (the selective scans, the causal convolutions, the cache layer) form the foundation on which our optimized models run. By establishing first-class SSM support in MAX, we've laid the groundwork for deploying our proprietary architecture on a high-performance, hardware-portable inference engine.

This is a commercial win on both sides of the partnership. For Modular, SSM support expands the range of what MAX can serve. The entire industry is watching state space models as a serious post-transformer architecture, and MAX is now one of the first major inference frameworks with native SSM kernel support, not through a compatibility shim, but through purpose-built kernels validated against the reference implementations. That's a meaningful differentiator for their platform.

For QWERKY, it means we have a production-grade inference path for our optimized models across NVIDIA, AMD, and CPU hardware. Our team can write CUDA kernels, and we have. But maintaining separate CUDA and ROCm codebases for every kernel, across every hardware target, is not a good use of engineering time when the alternative is writing it once in Mojo and letting MAX handle the compilation. As Modular noted in their case study, the equivalent kernel in Mojo is often twenty to thirty lines of readable, Python-like code compared to hundreds of lines of intricate memory management in CUDA, and it runs across different hardware automatically. That development velocity matters enormously for a small team. We can iterate on model improvements in hours instead of weeks, test locally on whatever hardware is available, and deploy the exact same code to production across our entire infrastructure.

The practical result is that QWERKY's customers get enterprise-grade AI capabilities at a fraction of the cost of running comparable transformer models. We recently announced a partnership with Inbox Beverage to build a custom AI-powered design platform, powered by our eight-billion-parameter state space model. That model delivers the feel of something many times its size because the underlying kernel infrastructure we've built is tuned for efficient SSM execution.

What's Next

In the coming weeks, we're adding support for the Mamba 2 and Gated DeltaNet architectures to the MAX framework.

Mamba 2 introduces what Dao and Gu call "structured state space duality" (SSD), a theoretical framework that reveals deep connections between state space models and attention and recasts the selective state space mechanism as a structured matrix operation. This enables chunkwise computation that combines the efficient recurrent mode for inference with a parallelizable quadratic form for training. The practical result is an algorithm that's significantly faster to train and simpler to implement than Mamba 1, with the ability to scale to much larger hidden state dimensions.
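The duality is easiest to see in the decay-free case. Below is a small NumPy sketch of chunkwise linear attention, which is the skeleton of the chunkwise idea with Mamba 2's per-step decay terms omitted for brevity; this is my simplification for illustration, not the paper's full SSD algorithm:

```python
import numpy as np

def chunkwise_linear_attention(q, k, v, chunk=64):
    """Causal linear attention (no decay), computed chunk by chunk.
    q, k: (T, d_k), v: (T, d_v)."""
    T, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))              # running inter-chunk state
    y = np.zeros((T, d_v))
    for start in range(0, T, chunk):
        end = min(start + chunk, T)
        qc, kc, vc = q[start:end], k[start:end], v[start:end]
        y_inter = qc @ S                  # contribution from all earlier chunks via the state
        scores = np.tril(qc @ kc.T)       # intra-chunk: quadratic attention with a causal mask
        y[start:end] = y_inter + scores @ vc
        S += kc.T @ vc                    # carry the state into the next chunk
    return y
```

Within each chunk the work is dense and GEMM-shaped, which is what the hardware wants for training; between chunks only the small state matrix is carried forward, which is what inference wants.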

Gated DeltaNet, published at ICLR 2025, combines two complementary mechanisms: gating for adaptive memory control (similar to Mamba's approach) and the delta update rule for precise, targeted memory modifications. The result is an architecture that outperforms both Mamba 2 and standard DeltaNet across language modeling, in-context retrieval, and long-context understanding benchmarks. Adding it broadens the set of SSM architectures that MAX can serve natively and gives us more options for our own optimization pipeline.
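As a rough sketch of the mechanism, here is one step of a gated delta-rule update as we read the formulation, with scalar gates and illustrative shapes; treat this as our own simplification rather than the paper's exact algorithm:

```python
import numpy as np

def gated_delta_step(S, q_t, k_t, v_t, alpha_t, beta_t):
    """S: (d_v, d_k) fast-weight state mapping keys to values.
    alpha_t: scalar gate that decays the whole memory.
    beta_t:  scalar writing strength for the delta update."""
    S = alpha_t * S                               # gating: adaptively forget
    pred = S @ k_t                                # what the memory currently returns for k_t
    S = S + beta_t * np.outer(v_t - pred, k_t)    # delta rule: correct the memory toward v_t
    o_t = S @ q_t                                 # read out with the query
    return S, o_t
```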

Beyond the architecture work, we're also cooking up something bigger with the Modular team. We're not ready to share details yet, but I'll just say this: the MAX build we've put together gives us the infrastructure to train and deploy state space models at a scale we haven't attempted before. We're really looking forward to telling you more about that soon.

The ultimate goal is full native support for QWERKY's optimized model architectures within the MAX ecosystem, making our QDistill pipeline truly production-ready from optimization through deployment. A more detailed technical deep dive into some kernel-level implementation details is coming soon. Keep an eye out right here on the blog.