Radix Sort
GPU radix sort for u32 arrays
Quick Start
Loading...
Source
Loading...
Documentation
Radix Sort
GPU-accelerated radix sort for unsigned 32-bit integers. Pure compute — no visual output. Significantly faster than CPU sorting for large arrays (100K+ elements).
Usage
import { createRadixSort } from './radix-sort';
const sorter = createRadixSort(device, { maxElements: 1_000_000 });
// Sort a buffer in-place
sorter.sort(myBuffer, elementCount);
// Clean up when done
sorter.destroy();
API
createRadixSort(device, options?)
Returns a RadixSort instance.
| Option | Type | Default | Description |
|---|---|---|---|
maxElements |
number |
1_000_000 |
Maximum number of elements (determines internal buffer sizes) |
sorter.sort(buffer, elementCount)
Sorts a GPUBuffer of u32 values in-place (ascending order).
| Param | Type | Description |
|---|---|---|
buffer |
GPUBuffer |
Buffer containing u32 values. Must have STORAGE | COPY_SRC | COPY_DST usage. |
elementCount |
number |
Number of elements to sort (not bytes) |
sorter.sortPairs(keysBuffer, valuesBuffer, elementCount)
Sorts keys and reorders associated values to match. Currently sorts keys only — see "Extending to key-value sort" below for how to implement paired reordering.
sorter.destroy()
Releases all internal GPU buffers.
Algorithm
This implements a parallel radix sort processing 4 bits per pass (16 buckets), requiring 8 passes to sort 32-bit keys.
Each pass runs three compute kernels in sequence:
Histogram — Each workgroup counts how many of its elements fall into each of the 16 buckets for the current 4-bit digit. Uses workgroup-shared atomics for fast local counting, then writes results to global memory.
Prefix sum — An exclusive scan over the per-workgroup histograms computes global write offsets. The histogram is laid out as
[bucket][workgroup], so scanning linearly preserves the relative order of elements within each bucket (stable sort).Scatter — Each element looks up its destination from the prefix sum and writes to the output buffer. Uses atomics to get unique positions for elements in the same bucket within a workgroup.
Ping-pong buffers alternate between passes to avoid copying.
WGSL loading
The default import uses Vite's ?raw suffix:
import shaderSource from './radix-sort.wgsl?raw';
If you're not using a bundler, load via fetch:
const shaderSource = await fetch(new URL('./radix-sort.wgsl', import.meta.url)).then((r) =>
r.text()
);
Modifying
Sort f32 values instead of u32
Floating-point values need a bit transformation to sort correctly with radix sort. Before sorting, apply this transform to each key:
if (key >> 31) == 1: key = ~key // negative: flip all bits
else: key = key ^ 0x80000000 // positive: flip sign bit
Apply the inverse after sorting. You can add a pre-processing and post-processing compute pass, or modify the histogram and scatter kernels to apply the transform inline.
Change the radix width
The current implementation uses 4 bits (16 buckets, 8 passes). You could use:
- 8 bits (256 buckets, 4 passes) — fewer passes but more shared memory and larger histograms
- 2 bits (4 buckets, 16 passes) — less shared memory but more passes
Change RADIX_BITS and NUM_BUCKETS in the WGSL, and NUM_PASSES in the TypeScript.
Extending to key-value sort
To sort key-value pairs, add a second set of ping-pong buffers for values. In the scatter kernel, when writing output[dest] = element, also write values_output[dest] = values_input[idx]. This requires adding additional buffer bindings to the bind group.
Use as a building block
Radix sort is commonly used as a sub-routine in:
- Particle systems — sort particles by depth for correct alpha blending
- Spatial hashing — sort by cell index to group spatially close elements
- GPU BVH construction — sort Morton codes for tree building
- Prefix sum / stream compaction — the prefix sum kernel can be reused independently
Further Reading
Resources on GPU radix sort algorithms and parallel sorting techniques.
Core Papers
Satish, Harris, Garland, "Designing Efficient Sorting Algorithms for Manycore GPUs" (2009) The foundational paper on GPU radix sort. Introduces the histogram–prefix-sum–scatter pattern used in this module. Benchmarks against CPU sorting and other GPU approaches. https://mgarland.org/files/papers/nvr-2008-001.pdf
Merrill, Grimshaw, "High Performance and Scalable Radix Sorting" (2011) Advances the state of the art with a more efficient single-pass scan and better workgroup utilization. Basis for CUB's radix sort. https://escholarship.org/uc/item/8051s4pj
Harada, Howes, "Introduction to GPU Radix Sort" (2011) A practical, accessible walkthrough of implementing radix sort on the GPU. Good companion to the academic papers. http://www.heterogeneouscompute.org/wordpress/wp-content/uploads/2011/06/RadixSort.pdf
Prefix Sum (Scan)
Blelloch, "Prefix Sums and Their Applications" (1990) The classic reference on parallel prefix sum (scan). The Blelloch up-sweep/down-sweep pattern is used in this module's prefix sum kernel. https://www.cs.cmu.edu/~guyb/papers/Ble93.pdf
Harris, Sengupta, Owens, "Parallel Prefix Sum (Scan) with CUDA" (GPU Gems 3, Chapter 39) Practical GPU implementation of parallel prefix sum with work-efficient algorithms and bank conflict avoidance. https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda
WebGPU-Specific
Tint WGSL Compiler — Workgroup shared memory Understanding workgroup-shared variables and barriers in WGSL, which are critical for the histogram and prefix sum kernels. https://www.w3.org/TR/WGSL/#var-and-value
WebGPU Specification — Compute Shaders The official spec for compute shader dispatch, workgroup sizes, and synchronization primitives. https://www.w3.org/TR/webgpu/#compute-passes
General References
Cormen, Leiserson, Rivest, Stein, "Introduction to Algorithms" (CLRS) Chapter 8 covers radix sort in the sequential setting. Understanding the sequential algorithm helps reason about the parallel version's correctness.
Duane Merrill, "CUB Library" CUDA's high-performance primitives library, including the most optimized GPU radix sort implementation. A good reference for advanced optimization techniques. https://github.com/NVIDIA/cub
Vulkan / CUDA Radix Sort Benchmarks Practical performance comparisons across GPU sorting implementations, useful for understanding expected throughput. https://github.com/b0nes164/GPUSorting