flashinfer.testing.bench_gpu_time_with_cudagraph¶
- flashinfer.testing.bench_gpu_time_with_cudagraph(fn, dry_run_iters: int = None, repeat_iters: int = None, dry_run_time_ms: int = 25, repeat_time_ms: int = 100, num_iters_within_graph: int = 10, l2_flush: bool | None = None, l2_flush_size_mb: int | None = None, l2_flush_device: str | None = None, sleep_after_run: bool = False, input_args: Tuple = (), input_kwargs: dict | None = None, cold_l2_cache: bool = True)¶
Benchmark GPU time using CUDA graphs with amortized kernel launch overhead.
CUDA graphs capture a sequence of GPU operations and replay them with minimal CPU overhead. By running multiple iterations within a single graph, kernel launch latency is amortized, yielding measurements closer to pure GPU time.
Cold-L2 Benchmarking:
When cold_l2_cache=True, the function uses rotating buffers to ensure a cold L2 cache for each kernel invocation within the graph. Multiple copies of the GPU tensors in input_args/input_kwargs are created and rotated through during graph capture, so each kernel invocation operates on a different memory region. The number of buffer copies is computed automatically from the device's L2 cache size.
- Parameters:
fn (Callable) – The kernel function to benchmark.
dry_run_iters (int, optional) – Number of warmup iterations (not timed). If None, computed from dry_run_time_ms.
repeat_iters (int, optional) – Number of measured iterations (graph replays). If None, computed from repeat_time_ms.
dry_run_time_ms (int) – Target warmup duration in ms (default: 25).
repeat_time_ms (int) – Target measurement duration in ms (default: 100).
num_iters_within_graph (int) – Number of kernel calls captured in the graph (default: 10). Higher values better amortize launch overhead but use more memory when rotating buffers.
sleep_after_run (bool) – If True, sleep briefly after each iteration (default: False).
input_args (tuple) – Positional arguments to pass to fn. GPU tensors in this structure will be cloned when cold_l2_cache=True.
input_kwargs (dict, optional) – Keyword arguments to pass to fn. GPU tensors in this structure will be cloned when cold_l2_cache=True.
cold_l2_cache (bool) – If True, use rotating buffers to ensure a cold L2 cache for each kernel invocation within the graph (default: True).
- Returns:
Per-iteration execution times in milliseconds. Each time is the graph replay duration divided by num_iters_within_graph.
- Return type:
List[float]
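Because the return value is a plain list of per-iteration times, it can be summarized with standard statistics. A minimal, hypothetical helper (not part of flashinfer) that reports the median, mean, and spread of a run:

```python
import statistics

def summarize_times(times_ms):
    """Summarize per-iteration GPU times in milliseconds.

    `times_ms` is the List[float] returned by
    bench_gpu_time_with_cudagraph; this helper is illustrative only.
    """
    return {
        "median_ms": statistics.median(times_ms),
        "mean_ms": statistics.fmean(times_ms),
        "stdev_ms": statistics.pstdev(times_ms),
    }

# Example with synthetic timings:
print(summarize_times([0.51, 0.50, 0.52, 0.49, 0.50]))
```

The median is usually the preferred summary for GPU timings, since occasional clock or scheduling outliers skew the mean upward.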
Example
Cold-L2 benchmarking (default, for memory-bound kernels):
>>> def run_attention(q, k, v, o):
...     flashinfer.single_prefill_with_kv_cache(q, k, v, o)
...
>>> q = torch.randn(batch, heads, seq_len, head_dim, device="cuda")
>>> k = torch.randn(batch, heads, seq_len, head_dim, device="cuda")
>>> v = torch.randn(batch, heads, seq_len, head_dim, device="cuda")
>>> o = torch.empty_like(q)
>>> times = bench_gpu_time_with_cudagraph(
...     fn=run_attention,
...     input_args=(q, k, v, o),
... )
>>> print(f"Cold-L2 median time: {np.median(times):.3f} ms")
Example
Hot L2 benchmarking (for compute-bound kernels):
>>> times = bench_gpu_time_with_cudagraph(
...     fn=lambda: torch.matmul(q, k.T),
...     cold_l2_cache=False,
... )
Note
When using input_args/input_kwargs, the function must accept the tensors as arguments (not capture them from a closure). GPU tensors are automatically detected and cloned; non-tensor arguments (scalars, booleans, etc.) are preserved across all copies.
Memory usage scales with the number of rotations needed to exceed the L2 cache.
See also
calculate_rotation_count: Computes required buffer copies for cold-L2.
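The rotation count grows with the device's L2 size and shrinks with the working-set size of the cloned tensors. A rough sketch of the idea behind calculate_rotation_count (assumed semantics, not the actual flashinfer implementation):

```python
import math

def rotation_count_sketch(workset_bytes: int, l2_cache_bytes: int) -> int:
    """Estimate how many buffer copies are needed so that consecutive
    iterations within the graph never reuse data still resident in L2.

    Assumed heuristic: enough copies that the combined working set
    exceeds the L2 cache, with a minimum of one copy. The real
    calculate_rotation_count may differ.
    """
    if workset_bytes <= 0:
        return 1
    return max(1, math.ceil(l2_cache_bytes / workset_bytes) + 1)

# e.g. a 50 MB L2 cache and 16 MB of input tensors per invocation:
print(rotation_count_sketch(16 * 2**20, 50 * 2**20))  # -> 5
```

This is why memory usage grows for small working sets: a kernel whose inputs fit easily in L2 needs many rotated copies before reuse is pushed out of cache.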
Deprecated
The l2_flush, l2_flush_size_mb, and l2_flush_device parameters are deprecated. Use cold_l2_cache instead.