flashinfer.testing.bench_gpu_time_with_cudagraph

flashinfer.testing.bench_gpu_time_with_cudagraph(fn, dry_run_iters: int | None = None, repeat_iters: int | None = None, dry_run_time_ms: int = 25, repeat_time_ms: int = 100, num_iters_within_graph: int = 10, l2_flush: bool | None = None, l2_flush_size_mb: int | None = None, l2_flush_device: str | None = None, sleep_after_run: bool = False, input_args: Tuple = (), input_kwargs: dict | None = None, cold_l2_cache: bool = True)

Benchmark GPU time using CUDA graphs with amortized kernel launch overhead.

CUDA graphs capture a sequence of GPU operations and replay them with minimal CPU overhead. By running multiple iterations within a single graph, kernel launch latency is amortized, yielding measurements closer to pure GPU time.
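
In outline, the mechanism looks like the following sketch (illustrative only: it omits the cold-L2 buffer rotation and the iteration-count heuristics, and graph_time_ms is a hypothetical name, not part of the library):

import torch

def graph_time_ms(fn, num_iters_within_graph=10, repeat_iters=100):
    # Warm up outside the graph so lazy init / autotuning is done.
    for _ in range(3):
        fn()
    torch.cuda.synchronize()
    # Capture num_iters_within_graph calls in a single graph.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        for _ in range(num_iters_within_graph):
            fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    times = []
    for _ in range(repeat_iters):
        start.record()
        g.replay()  # one replay = num_iters_within_graph kernel launches
        end.record()
        torch.cuda.synchronize()
        # Per-kernel time: launch overhead amortized over the graph body.
        times.append(start.elapsed_time(end) / num_iters_within_graph)
    return times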

Cold-L2 Benchmarking:

When cold_l2_cache=True, the function uses rotating buffers to ensure cold L2 cache for each kernel invocation within the graph. Multiple copies of the GPU tensors in input_args/input_kwargs are created and rotated through during graph capture, ensuring each kernel invocation operates on different memory regions. The number of buffer copies is automatically calculated based on the device’s L2 cache size.
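
A rough sketch of how such a rotation count could be derived (an assumption for illustration; the actual logic lives in calculate_rotation_count, listed under See also, and safety_factor here is hypothetical):

import math
import torch

def rotation_count(tensors, device=0, safety_factor=2):
    # Bytes touched by one invocation: sum over the GPU tensors involved.
    bytes_per_call = sum(t.numel() * t.element_size()
                         for t in tensors if t.is_cuda)
    # L2 size via cudaDeviceProp (exposed by recent PyTorch versions).
    l2_bytes = torch.cuda.get_device_properties(device).L2_cache_size
    # Enough distinct buffer copies that a replay cycle overflows L2.
    return max(1, math.ceil(safety_factor * l2_bytes / max(bytes_per_call, 1)))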

Parameters:
  • fn (Callable) – The kernel function to benchmark.

  • dry_run_iters (int, optional) – Number of warmup iterations (not timed). If None, computed from dry_run_time_ms (see the sketch after this parameter list).

  • repeat_iters (int, optional) – Number of measured iterations (graph replays). If None, computed from repeat_time_ms.

  • dry_run_time_ms (int) – Target warmup duration in ms (default: 25).

  • repeat_time_ms (int) – Target measurement duration in ms (default: 100).

  • num_iters_within_graph (int) – Number of kernel calls captured in the graph (default: 10). Higher values better amortize launch overhead but use more memory when rotating buffers.

  • sleep_after_run (bool) – If True, sleep briefly after each iteration (default: False).

  • input_args (tuple) – Positional arguments to pass to fn. GPU tensors in this structure will be cloned when cold_l2_cache=True.

  • input_kwargs (dict, optional) – Keyword arguments to pass to fn. GPU tensors in this structure will be cloned when cold_l2_cache=True.

  • cold_l2_cache (bool) – If True, use rotating buffers to ensure cold L2 cache for each kernel invocation within the graph (default: True).
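
When dry_run_iters or repeat_iters is left as None, the count is sized from the corresponding target duration. A plausible sketch of that derivation (the library's exact heuristic may differ; estimate_iters is a hypothetical helper):

import torch

def estimate_iters(fn, target_time_ms):
    # Time one call to estimate per-iteration cost, then size the loop
    # so the total run roughly hits the target duration.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    fn()
    end.record()
    torch.cuda.synchronize()
    per_iter_ms = max(start.elapsed_time(end), 1e-3)
    return max(1, int(target_time_ms / per_iter_ms))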

Returns:

Per-iteration execution times in milliseconds. Each time is the graph replay duration divided by num_iters_within_graph.

Return type:

List[float]

Example

Cold-L2 benchmarking (default, for memory-bound kernels):

>>> import torch
>>> import numpy as np
>>> import flashinfer
>>> from flashinfer.testing import bench_gpu_time_with_cudagraph
>>> batch, heads, seq_len, head_dim = 1, 8, 1024, 128
>>> def run_attention(q, k, v, o):
...     flashinfer.single_prefill_with_kv_cache(q, k, v, o)
...
>>> q = torch.randn(batch, heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
>>> k = torch.randn(batch, heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
>>> v = torch.randn(batch, heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
>>> o = torch.empty_like(q)
>>> times = bench_gpu_time_with_cudagraph(
...     fn=run_attention,
...     input_args=(q, k, v, o),
... )
>>> print(f"Cold-L2 median time: {np.median(times):.3f} ms")

Example

Hot L2 benchmarking (for compute-bound kernels):

>>> times = bench_gpu_time_with_cudagraph(
...     fn=lambda: torch.matmul(q, k.transpose(-2, -1)),
...     cold_l2_cache=False,
... )

Note

  • When using input_args/input_kwargs, fn must accept the tensors as arguments rather than capturing them from a closure; otherwise the rotating buffer copies cannot be substituted between invocations.

  • GPU tensors are automatically detected and cloned, while non-tensor arguments (scalars, booleans, etc.) are preserved across all copies (see the sketch after these notes).

  • Memory usage scales with the number of rotations needed to exceed L2 cache.
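
The cloning rule in the second note can be pictured as a small recursive tree-map (a sketch, not the library's implementation):

import torch

def clone_gpu_tensors(obj):
    # GPU tensors get fresh storage in each buffer copy; everything
    # else (scalars, booleans, strings, ...) is shared unchanged.
    if isinstance(obj, torch.Tensor) and obj.is_cuda:
        return obj.clone()
    if isinstance(obj, (list, tuple)):
        return type(obj)(clone_gpu_tensors(x) for x in obj)
    if isinstance(obj, dict):
        return {k: clone_gpu_tensors(v) for k, v in obj.items()}
    return obj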

See also

  • calculate_rotation_count: Computes required buffer copies for cold-L2.

Deprecated: the l2_flush, l2_flush_size_mb, and l2_flush_device parameters are deprecated. Use cold_l2_cache instead.
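
For migration, drop the flush parameters in favor of the flag (parameter values shown are illustrative):

>>> # Before (deprecated):
>>> times = bench_gpu_time_with_cudagraph(fn, l2_flush=True, l2_flush_size_mb=256)
>>> # After:
>>> times = bench_gpu_time_with_cudagraph(fn, cold_l2_cache=True)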