GPU code implementation details

This file contains detailed explanations of the algorithms and optimization strategies used in the GPU version of the library.

The focus is on clarity and reproducibility of the core computational techniques, including spreading/interpolation schemes, memory access patterns, and kernel launch structures.

Note

This is a living document, started in 2025 (v 2.4.1). Implementation details are subject to change as performance and accuracy improvements are integrated.

Output Driven

The output-driven spreading strategy is designed to reduce global memory traffic and exploit shared memory locality. A CUDA block corresponds to a spatial tile in the output grid, and shared memory is used to accumulate updates from multiple nonuniform points. The original approach was developed by Juan Ignacio Polanco in NonuniformFFTs.jl.

The process follows three main stages:

  1. Per-thread kernel evaluation:

    Each thread takes care of a single NU point and calculates all kernel values for that point. Kernel values are stored into shared memory (kerevals) in a batched layout, allowing reuse by all threads in the block. kerevals is a 3D array with shape (Np, dim, ns), where Np is the number of NUFFT points. Using CUDA parallelism it is possible to evaluate all the kernel values in parallel accessing kerevals(thread.id, dim, 0). The third parameter is always 0 because eval_kernel_vec takes a pointer and writes ns values in one go. This corresponds to:

    eval_kernel_vec<T, ns>(&kerevals(i, 0, 0), x1, es_c, es_beta);
    
  2. Thread-cooperative accumulation in shared memory:

    • Instead of assigning 1 thread per point (which would lead to shared memory collisions), all threads iterate over a small batch (Np) of NUFFT points. That is, the points are not processed in parallel, but the inner loop (tensor product) is.

      The Shared Memory (SM) approach does:

      parallel for point = 0 to NumPoints
        ...
        for x = 0 to ns
          for y = 0 to ns
            for z = 0 to ns
              ...
      

      The Output-driven approach does:

      For each point:

      • Loop over NUFFT points sequentially.

      • Parallelize over kernel grid entries using a flattened loop up to :math:n_s^{\text{dim}}.

      Example pseudocode:

      for point = 0 to NumPoints, point+=np
        ...
        parallel for i = 0 to pow(ns, dim)
          ...
        ...
      

      The parallelism is flipped: SM parallelizes the outer loop (over points), while Output-driven parallelizes the inner loop (over the kernel values). There is no collision because the local grid tile (local_subgrid) is accessed by (ix, iy, iz) — and these are unique per thread as determined by the thread ID. This removes the need for AtomicAdd on the local subgrid.

  3. Atomic addition to global memory:

    Unchanged from SM: once all points have been processed and accumulated into local_subgrid, the block performs an atomic write to global memory (fw). Since this step is amortized over many points, its overhead is negligible.

Memory Organization

  • kerevals: Stores kernel weights in shape (Np, dim, ns). Threads access only their assigned batch rows.

  • local_subgrid: A padded shared-memory grid with shape :math:(bin\_size + padding)^{dim}. Where passing is :math:padding = 2((ns+1)/2). Threads write to disjoint sections during accumulation to avoid races.

Design Insights

This hybrid parallelization combines per-point parallelism (step 1) and spatial parallelism (step 2):

  • Threads collaborate rather than compete on shared memory access.

  • Batching (Np) controls memory footprint and allows tuning for hardware constraints.

  • Synchronization barriers ensure correctness before accessing shared buffers.