AMD MI300X Attention Kernel: MoonMath AI Open-Sources HIP Solution That Outperforms AITER v3
MoonMath AI's Breakthrough in GPU Optimization
The AI hardware sector continues to see rapid advances in specialized kernels that maximize GPU throughput for transformer models, as developers seek alternatives to dominant platforms amid growing demand for efficient inference and training on diverse accelerators.
Kernel Design and Technical Approach
MoonMath AI has open-sourced a bf16 forward attention kernel written in HIP for AMD’s MI300X GPU, targeting the gfx942 ISA without reliance on hand-written assembly. The kernel computes the fused softmax(QKáµ€/√d)·V operation and supports BSHD or BHSD layouts with a fixed head dimension of 128, handling any sequence length including cross-attention. Key architectural choices include:
- Eight waves per block divided into two groups of four, offset by phase to keep matrix cores active during softmax and loads.
- Two s_barrier instructions per iteration for synchronization.
- Memory placement with K streamed into double-buffered LDS (32 KiB), V kept resident in L1, and Q plus accumulators in VGPRs.
- Use of 16×16×16 MFMA instructions to reduce accumulator pressure compared with larger tiles.
The core technique employs one-instruction asm wrappers in __device__ __forceinline__ functions. Extended asm constraints allow opcode selection while the compiler manages register allocation and data flow, as shown in this example tying input and output to the same VGPR:
// in/out tied to the SAME VGPR → no accumulator rename, no v_mov copy.
__device__ __forceinline__ void asm_mfma(bf16x4_t a, bf16x4_t b, fp32x4_t& c) {
asm volatile("v_mfma_f32_16x16x16_bf16 %0, %1, %2, %0"
: "+v"(c) : "v"(a), "v"(b));
}
Additional optimizations such as a third Q tile and Flash-Decoding-style tail KV splitting further improve reuse and utilization across the MI300X’s 304 compute units.
Performance Results and Real-World Application
Benchmarks on MI300X in bf16 with head dimension 128 demonstrate consistent gains across shapes and rounding modes (RTNE, RTNA, RTZ). Geomean speedups versus AITER v3 reach 1.18× (RTNE), 1.15× (RTNA), and 1.08× (RTZ), with individual peaks up to 1.26×. Versus Modular MAX, geomeans range from 1.44× to 1.49×. A practical deployment in a SGLang diffusion PR replaced AITER attention with the new kernel on Wan2.1-T2V-1.3B-Diffusers, yielding a 1.23× end-to-end generation improvement on MI300X with no quality regression. The kernel installs via pip, launches on the caller’s stream, and supports BSHD layout directly for diffusion workloads.
Fact Check
- MoonMath AI released a bf16 forward attention kernel for AMD MI300X written in HIP under MIT license.
- The kernel beats AITER v3 on every tested shape and rounding mode with reported geomean speedups of 1.18×/1.15×/1.08×.
- Memory placement (K in LDS, V in L1, Q in registers) and one-instruction asm wrappers form the primary optimization techniques.
- An SGLang PR using the kernel achieved 1.23× speedup on Wan2.1 video diffusion without quality loss.
- Limitations include no causal mask, no GQA, and exclusive support for gfx942 hardware.
How do you see this shaping your industry?
