MATLAB GPU interfaces

Note

See the MATLAB GPU installation page for how to build these interfaces.

We in 2025 we added MATLAB Parallel Computing Toolbox gpuArray interfaces to CUFINUFFT, our GPU library. Here we follow the MATLAB/Octave CPU tutorial page, which you should to read first, to learn the basics of the interfaces and array indexing. The key change for the GPU user is that our CUDA FINUFFT (aka CUFINUFFT) MATLAB interface acts on gpuArray objects for the main I/O data arrays, and this requires MATLAB’s Parallel Computing Toolbox. This is a commercial product, and we do not currently have an Octave solution for the GPU.

Quick-start example

We jump straight into a 2D type 1 transform using the simple interface, in single precision. Let’s request a rectangular output Fourier mode array of 10000 modes in the x direction but 5000 in the y direction. We create 100 millions source points directly on the GPU, with coordinates lying in the square of side length \(2\pi\):

M = 1e8;
x = 2*pi*gpuArray.rand(M,1,'single');   % random pts in [0,2pi]^2
y = 2*pi*gpuArray.rand(M,1,'single');
c = gpuArray.randn(M,1,'single')+1i*gpuArray.randn(M,1,'single');    % iid random complex data
N1 = 10000; N2 = 5000;                   % desired Fourier mode array sizes
tol = 1e-3;
f = cufinufft2d1(x,y,c,+1,tol,N1,N2);    % do it (takes around 0.2 sec)

The resulting output f is a complex single-precision gpuArray of size 10000 by 5000. The first dimension (number of rows) corresponds to the x input coordinate, and the second to y. If you need to change the definition of the period from \(2\pi\), you cannot; instead linearly rescale your points before sending them to FINUFFT. The above shows a throughput of about 0.5 billion points/sec (on my A6000). For the full example code that also verifies one of the outputs, see simple1d1f_gpu.m.

Note

Timing GPU functions in MATLAB is misleading when using plain tic and toc, because of asynchronous computation: the toc is often executed before the gpuArray function has actually completed! For correct timings, use the following pattern:

dev = gpuDevice();
tic
f = cufinufft2d1(x,y,c,+1,tol,N1,N2);
wait(dev)
toc

Note

Under the hood cuFINUFFT has double- and single-precision libraries. The simple and vectorized GPU MATLAB interfaces infer which precision library to call by checking the precision of its input arrays, which must all match (ie, all must be double gpuArrays or all must be single gpuArrays). The precision of a gpuArray cannot be inferred via class; instead one can use underlyingType (introduced recently in R2020b). In contrast, precision in the guru interface is set with the cufinufft_plan option string opts.floatprec, either 'double' (the default), or 'single'.

Warning

The cufinufft?d? MATLAB commands are a guaranteed way to call the GPU library. We also offer an experimental GPU overloading of the finufft?d? MATLAB commands: when the first argument x is a gpuArray then this redirects to the corresponding cufinufft?d? function. This follows The MathWorks style, and is achieved by the tiny wrapper codes in matlab/@gpuArray/. (We actually exploit this in test/fullmathtest.m.) For this to work the user must add the FINUFFT matlab directory to the path before any gpuArray objects are used in the current session. This is because the gpuArray class is “closed”. Since TMW may prevent library developers from overloading in this way at any time in the future, the overloaded interface remains experimental as of FINUFFT version 2.4.

For several examples of using the cuFINUFFT guru (plan) interface, see examples in the repo.

Full documentation of GPU interface

Here are the help documentation strings for all MATLAB GPU interfaces. The interfaces are the same as the GPU ones except preceded by “cu”. The options descriptions are rather abbreviated in the below; for full documentation see Options for GPU code. Informative warnings and errors are raised in MATLAB style with unique codes (see sources errhandler.m, cufinufft.mw, and valid_*.m found here). The low-level error number codes are not used.

The individual GPU commands have the full help documentation:

CUFINUFFT1D1   GPU 1D complex nonuniform FFT, type 1 (nonuniform to uniform).

f = cufinufft1d1(x,c,isign,eps,ms)
f = cufinufft1d1(x,c,isign,eps,ms,opts)

This computes on the GPU, to relative precision eps, via a fast algorithm:

              nj
    f(k1) =  SUM c[j] exp(+/-i k1 x(j))  for -ms/2 <= k1 <= (ms-1)/2
             j=1
  Inputs:
    x     length-nj vector of real-valued locations of nonuniform sources
    c     length-nj complex vector of source strengths. If numel(c)>nj,
          expects a stack of vectors (eg, a nj*ntrans matrix) each of which is
          transformed with the same source locations.
    isign if >=0, uses + sign in exponential, otherwise - sign.
    eps   relative precision requested (generally between 1e-15 and 1e-1)
    ms    number of Fourier modes computed, may be even or odd;
          in either case, mode range is integers lying in [-ms/2, (ms-1)/2]
    opts   optional struct with optional fields controlling the following:
    opts.debug:   0 (silent, default), 1 (timing breakdown), 2 (debug info).
    opts.upsampfac:   sigma.  2.0 (default), or 1.25 (low RAM, smaller FFT).
    opts.gpu_method:  0 (auto, default), 1 (GM or GM-sort), 2 (SM).
    opts.gpu_sort:  0 (do not sort NU pts), 1 (sort when GM method, default).
    opts.gpu_kerevalmeth:  0 (slow reference). 1 (Horner ppoly, default).
    opts.gpu_maxsubprobsize:  max # NU pts per subprob (gpu_method=2 only).
    opts.gpu_binsize{x,y,z}:  various binsizes in GM-sort/SM (for experts).
    opts.gpu_maxbatchsize:   0 (auto, default), or many-vector batch size.
    opts.gpu_device_id:  sets the GPU device ID (experts only).
    opts.modeord: 0 (CMCL increasing mode ordering, default), 1 (FFT ordering)
    opts.gpu_spreadinterponly: 0 (do NUFFT, default), 1 (only spread/interp)
  Outputs:
    f     size-ms complex column vector of Fourier coefficients, or, if
          ntrans>1, a matrix of size (ms,ntrans).

Notes:
 * For CUFINUFFT all array I/O is in the form of gpuArrays (on-device).
 * The precision of gpuArray input x controls whether the double or
   single precision GPU library is called; all array inputs must match in
   location (ie, be gpuArrays), and in precision.
 * The vectorized (many vector) interface, ie ntrans>1, can be faster
   than repeated calls with the same nonuniform points. Note that here the
   I/O data ordering is stacked not interleaved. See ../docs/matlab_gpu.rst
 * For more details about the opts fields, see ../docs/c_gpu.rst
 * See ERRHANDLER, VALID_* and CUFINUFFT_PLAN for possible warning/error IDs.
 * Full documentation is online at http://finufft.readthedocs.io

See also CUFINUFFT_PLAN.
CUFINUFFT1D2   GPU 1D complex nonuniform FFT, type 2 (uniform to nonuniform).

c = cufinufft1d2(x,isign,eps,f)
c = cufinufft1d2(x,isign,eps,f,opts)

This computes on the GPU, to relative precision eps, via a fast algorithm:

   c[j] = SUM   f[k1] exp(+/-i k1 x[j])      for j = 1,...,nj
           k1
    where sum is over -ms/2 <= k1 <= (ms-1)/2.

 Inputs:
    x     length-nj vector of real-valued locations of nonuniform sources
    f     complex Fourier coefficients. If a vector, length sets ms
          (with mode ordering given by opts.modeord). If a matrix, each
          of ntrans columns is transformed with the same nonuniform targets.
    isign if >=0, uses + sign in exponential, otherwise - sign.
    eps   relative precision requested (generally between 1e-15 and 1e-1)
    opts   optional struct with optional fields controlling the following:
    opts.debug:   0 (silent, default), 1 (timing breakdown), 2 (debug info).
    opts.upsampfac:   sigma.  2.0 (default), or 1.25 (low RAM, smaller FFT).
    opts.gpu_method:  0 (auto, default), 1 (GM or GM-sort), 2 (SM).
    opts.gpu_sort:  0 (do not sort NU pts), 1 (sort when GM method, default).
    opts.gpu_kerevalmeth:  0 (slow reference). 1 (Horner ppoly, default).
    opts.gpu_maxsubprobsize:  max # NU pts per subprob (gpu_method=2 only).
    opts.gpu_binsize{x,y,z}:  various binsizes in GM-sort/SM (for experts).
    opts.gpu_maxbatchsize:   0 (auto, default), or many-vector batch size.
    opts.gpu_device_id:  sets the GPU device ID (experts only).
    opts.modeord: 0 (CMCL increasing mode ordering, default), 1 (FFT ordering)
    opts.gpu_spreadinterponly: 0 (do NUFFT, default), 1 (only spread/interp)
 Outputs:
    c     complex column vector of nj answers at targets, or,
          if ntrans>1, matrix of size (nj,ntrans).

Notes:
 * For CUFINUFFT all array I/O is in the form of gpuArrays (on-device).
 * The precision of gpuArray input x controls whether the double or
   single precision GPU library is called; all array inputs must match in
   location (ie, be gpuArrays), and in precision.
 * The vectorized (many vector) interface, ie ntrans>1, can be faster
   than repeated calls with the same nonuniform points. Note that here the
   I/O data ordering is stacked not interleaved. See ../docs/matlab_gpu.rst
 * For more details about the opts fields, see ../docs/c_gpu.rst
 * See ERRHANDLER, VALID_* and CUFINUFFT_PLAN for possible warning/error IDs.
 * Full documentation is online at http://finufft.readthedocs.io

See also CUFINUFFT_PLAN.
CUFINUFFT1D3   GPU 1D complex nonuniform FFT, type 3 (nonuniform to nonuniform).

f = cufinufft1d3(x,c,isign,eps,s)
f = cufinufft1d3(x,c,isign,eps,s,opts)

This computes on the GPU, to relative precision eps, via a fast algorithm:

             nj
    f[k]  =  SUM   c[j] exp(+-i s[k] x[j]),      for k = 1, ..., nk
             j=1
  Inputs:
    x     length-nj vector of real-valued locations of nonuniform sources
    c     length-nj complex vector of source strengths. If numel(c)>nj,
          expects a stack of vectors (eg, a nj*ntrans matrix) each of which is
          transformed with the same source and target locations.
    isign if >=0, uses + sign in exponential, otherwise - sign.
    eps   relative precision requested (generally between 1e-15 and 1e-1)
    s     length-nk vector of frequency locations of nonuniform targets
    opts   optional struct with optional fields controlling the following:
    opts.debug:   0 (silent, default), 1 (timing breakdown), 2 (debug info).
    opts.upsampfac:   sigma.  2.0 (default), or 1.25 (low RAM, smaller FFT).
    opts.gpu_method:  0 (auto, default), 1 (GM or GM-sort), 2 (SM).
    opts.gpu_sort:  0 (do not sort NU pts), 1 (sort when GM method, default).
    opts.gpu_kerevalmeth:  0 (slow reference). 1 (Horner ppoly, default).
    opts.gpu_maxsubprobsize:  max # NU pts per subprob (gpu_method=2 only).
    opts.gpu_binsize{x,y,z}:  various binsizes in GM-sort/SM (for experts).
    opts.gpu_maxbatchsize:   0 (auto, default), or many-vector batch size.
    opts.gpu_device_id:  sets the GPU device ID (experts only).
  Outputs:
    f     length-nk complex vector of values at targets, or, if ntrans>1,
          a matrix of size (nk,ntrans)

Notes:
 * For CUFINUFFT all array I/O is in the form of gpuArrays (on-device).
 * The precision of gpuArray input x controls whether the double or
   single precision GPU library is called; all array inputs must match in
   location (ie, be gpuArrays), and in precision.
 * The vectorized (many vector) interface, ie ntrans>1, can be faster
   than repeated calls with the same nonuniform points. Note that here the
   I/O data ordering is stacked not interleaved. See ../docs/matlab_gpu.rst
 * For more details about the opts fields, see ../docs/c_gpu.rst
 * See ERRHANDLER, VALID_* and CUFINUFFT_PLAN for possible warning/error IDs.
 * Full documentation is online at http://finufft.readthedocs.io

See also CUFINUFFT_PLAN.
CUFINUFFT2D1   GPU 2D complex nonuniform FFT, type 1 (nonuniform to uniform).

f = cufinufft2d1(x,y,c,isign,eps,ms,mt)
f = cufinufft2d1(x,y,c,isign,eps,ms,mt,opts)

This computes on the GPU, to relative precision eps, via a fast algorithm:

                  nj
    f[k1,k2] =   SUM  c[j] exp(+-i (k1 x[j] + k2 y[j]))
                 j=1

    for -ms/2 <= k1 <= (ms-1)/2,  -mt/2 <= k2 <= (mt-1)/2.

  Inputs:
    x,y   real-valued coordinates of nonuniform sources in the plane,
          each a length-nj vector
    c     length-nj complex vector of source strengths. If numel(c)>nj,
          expects a stack of vectors (eg, a nj*ntrans matrix) each of which is
          transformed with the same source locations.
    isign if >=0, uses + sign in exponential, otherwise - sign.
    eps   relative precision requested (generally between 1e-15 and 1e-1)
    ms,mt  number of Fourier modes requested in x & y; each may be even or odd.
           In either case the mode range is integers lying in [-m/2, (m-1)/2]
    opts   optional struct with optional fields controlling the following:
    opts.debug:   0 (silent, default), 1 (timing breakdown), 2 (debug info).
    opts.upsampfac:   sigma.  2.0 (default), or 1.25 (low RAM, smaller FFT).
    opts.gpu_method:  0 (auto, default), 1 (GM or GM-sort), 2 (SM).
    opts.gpu_sort:  0 (do not sort NU pts), 1 (sort when GM method, default).
    opts.gpu_kerevalmeth:  0 (slow reference). 1 (Horner ppoly, default).
    opts.gpu_maxsubprobsize:  max # NU pts per subprob (gpu_method=2 only).
    opts.gpu_binsize{x,y,z}:  various binsizes in GM-sort/SM (for experts).
    opts.gpu_maxbatchsize:   0 (auto, default), or many-vector batch size.
    opts.gpu_device_id:  sets the GPU device ID (experts only).
    opts.modeord: 0 (CMCL increasing mode ordering, default), 1 (FFT ordering)
    opts.gpu_spreadinterponly: 0 (do NUFFT, default), 1 (only spread/interp)
  Outputs:
    f     size (ms,mt) complex matrix of Fourier coefficients
          (ordering given by opts.modeord in each dimension; ms fast, mt slow),
          or, if ntrans>1, a 3D array of size (ms,mt,ntrans).

Notes:
 * For CUFINUFFT all array I/O is in the form of gpuArrays (on-device).
 * The precision of gpuArray input x controls whether the double or
   single precision GPU library is called; all array inputs must match in
   location (ie, be gpuArrays), and in precision.
 * The vectorized (many vector) interface, ie ntrans>1, can be faster
   than repeated calls with the same nonuniform points. Note that here the
   I/O data ordering is stacked not interleaved. See ../docs/matlab_gpu.rst
 * For more details about the opts fields, see ../docs/c_gpu.rst
 * See ERRHANDLER, VALID_* and CUFINUFFT_PLAN for possible warning/error IDs.
 * Full documentation is online at http://finufft.readthedocs.io

See also CUFINUFFT_PLAN.
CUFINUFFT2D2   GPU 2D complex nonuniform FFT, type 2 (uniform to nonuniform).

c = cufinufft2d2(x,y,isign,eps,f)
c = cufinufft2d2(x,y,isign,eps,f,opts)

This computes on the GPU, to relative precision eps, via a fast algorithm:

   c[j] =  SUM   f[k1,k2] exp(+/-i (k1 x[j] + k2 y[j]))  for j = 1,..,nj
          k1,k2
    where sum is over -ms/2 <= k1 <= (ms-1)/2, -mt/2 <= k2 <= (mt-1)/2,

 Inputs:
    x,y   real-valued coordinates of nonuniform targets in the plane,
          each a vector of length nj
    f     complex Fourier coefficient matrix, whose size determines (ms,mt).
          (Mode ordering given by opts.modeord, in each dimension.)
          If a 3D array, 3rd dimension sets ntrans, and each of ntrans
          matrices is transformed with the same nonuniform targets.
    isign if >=0, uses + sign in exponential, otherwise - sign.
    eps   relative precision requested (generally between 1e-15 and 1e-1)
    opts   optional struct with optional fields controlling the following:
    opts.debug:   0 (silent, default), 1 (timing breakdown), 2 (debug info).
    opts.upsampfac:   sigma.  2.0 (default), or 1.25 (low RAM, smaller FFT).
    opts.gpu_method:  0 (auto, default), 1 (GM or GM-sort), 2 (SM).
    opts.gpu_sort:  0 (do not sort NU pts), 1 (sort when GM method, default).
    opts.gpu_kerevalmeth:  0 (slow reference). 1 (Horner ppoly, default).
    opts.gpu_maxsubprobsize:  max # NU pts per subprob (gpu_method=2 only).
    opts.gpu_binsize{x,y,z}:  various binsizes in GM-sort/SM (for experts).
    opts.gpu_maxbatchsize:   0 (auto, default), or many-vector batch size.
    opts.gpu_device_id:  sets the GPU device ID (experts only).
    opts.modeord: 0 (CMCL increasing mode ordering, default), 1 (FFT ordering)
    opts.gpu_spreadinterponly: 0 (do NUFFT, default), 1 (only spread/interp)
 Outputs:
    c     complex column vector of nj answers at targets, or,
          if ntrans>1, matrix of size (nj,ntrans).

Notes:
 * For CUFINUFFT all array I/O is in the form of gpuArrays (on-device).
 * The precision of gpuArray input x controls whether the double or
   single precision GPU library is called; all array inputs must match in
   location (ie, be gpuArrays), and in precision.
 * The vectorized (many vector) interface, ie ntrans>1, can be faster
   than repeated calls with the same nonuniform points. Note that here the
   I/O data ordering is stacked not interleaved. See ../docs/matlab_gpu.rst
 * For more details about the opts fields, see ../docs/c_gpu.rst
 * See ERRHANDLER, VALID_* and CUFINUFFT_PLAN for possible warning/error IDs.
 * Full documentation is online at http://finufft.readthedocs.io

See also CUFINUFFT_PLAN.
CUFINUFFT2D3   GPU 2D complex nonuniform FFT, type 3 (nonuniform to nonuniform).

f = cufinufft2d3(x,y,c,isign,eps,s,t)
f = cufinufft2d3(x,y,c,isign,eps,s,t,opts)

This computes on the GPU, to relative precision eps, via a fast algorithm:

             nj
    f[k]  =  SUM   c[j] exp(+-i (s[k] x[j] + t[k] y[j])),  for k = 1, ..., nk
             j=1
  Inputs:
    x,y   coordinates of nonuniform sources in R^2, each a length-nj vector.
    c     length-nj complex vector of source strengths. If numel(c)>nj,
          expects a stack of vectors (eg, a nj*ntrans matrix) each of which is
          transformed with the same source and target locations.
    isign if >=0, uses + sign in exponential, otherwise - sign.
    eps   relative precision requested (generally between 1e-15 and 1e-1)
    s,t   frequency coordinates of nonuniform targets in R^2,
          each a length-nk vector.
    opts   optional struct with optional fields controlling the following:
    opts.debug:   0 (silent, default), 1 (timing breakdown), 2 (debug info).
    opts.upsampfac:   sigma.  2.0 (default), or 1.25 (low RAM, smaller FFT).
    opts.gpu_method:  0 (auto, default), 1 (GM or GM-sort), 2 (SM).
    opts.gpu_sort:  0 (do not sort NU pts), 1 (sort when GM method, default).
    opts.gpu_kerevalmeth:  0 (slow reference). 1 (Horner ppoly, default).
    opts.gpu_maxsubprobsize:  max # NU pts per subprob (gpu_method=2 only).
    opts.gpu_binsize{x,y,z}:  various binsizes in GM-sort/SM (for experts).
    opts.gpu_maxbatchsize:   0 (auto, default), or many-vector batch size.
    opts.gpu_device_id:  sets the GPU device ID (experts only).
  Outputs:
    f     length-nk complex vector of values at targets, or, if ntrans>1,
          a matrix of size (nk,ntrans)

Notes:
 * For CUFINUFFT all array I/O is in the form of gpuArrays (on-device).
 * The precision of gpuArray input x controls whether the double or
   single precision GPU library is called; all array inputs must match in
   location (ie, be gpuArrays), and in precision.
 * The vectorized (many vector) interface, ie ntrans>1, can be faster
   than repeated calls with the same nonuniform points. Note that here the
   I/O data ordering is stacked not interleaved. See ../docs/matlab_gpu.rst
 * For more details about the opts fields, see ../docs/c_gpu.rst
 * See ERRHANDLER, VALID_* and CUFINUFFT_PLAN for possible warning/error IDs.
 * Full documentation is online at http://finufft.readthedocs.io

See also CUFINUFFT_PLAN.
CUFINUFFT3D1   GPU 3D complex nonuniform FFT, type 1 (nonuniform to uniform).

f = cufinufft3d1(x,y,z,c,isign,eps,ms,mt,mu)
f = cufinufft3d1(x,y,z,c,isign,eps,ms,mt,mu,opts)

This computes on the GPU, to relative precision eps, via a fast algorithm:

                      nj
    f[k1,k2,k3] =    SUM  c[j] exp(+-i (k1 x[j] + k2 y[j] + k3 z[j]))
                     j=1

    for -ms/2 <= k1 <= (ms-1)/2,  -mt/2 <= k2 <= (mt-1)/2,
        -mu/2 <= k3 <= (mu-1)/2.

  Inputs:
    x,y,z real-valued coordinates of nonuniform sources,
          each a length-nj vector
    c     length-nj complex vector of source strengths. If numel(c)>nj,
          expects a stack of vectors (eg, a nj*ntrans matrix) each of which is
          transformed with the same source locations.
    isign if >=0, uses + sign in exponential, otherwise - sign.
    eps   relative precision requested (generally between 1e-15 and 1e-1)
    ms,mt,mu  number of Fourier modes requested in x,y and z; each may be
          even or odd.
          In either case the mode range is integers lying in [-m/2, (m-1)/2]
    opts   optional struct with optional fields controlling the following:
    opts.debug:   0 (silent, default), 1 (timing breakdown), 2 (debug info).
    opts.upsampfac:   sigma.  2.0 (default), or 1.25 (low RAM, smaller FFT).
    opts.gpu_method:  0 (auto, default), 1 (GM or GM-sort), 2 (SM).
    opts.gpu_sort:  0 (do not sort NU pts), 1 (sort when GM method, default).
    opts.gpu_kerevalmeth:  0 (slow reference). 1 (Horner ppoly, default).
    opts.gpu_maxsubprobsize:  max # NU pts per subprob (gpu_method=2 only).
    opts.gpu_binsize{x,y,z}:  various binsizes in GM-sort/SM (for experts).
    opts.gpu_maxbatchsize:   0 (auto, default), or many-vector batch size.
    opts.gpu_device_id:  sets the GPU device ID (experts only).
    opts.modeord: 0 (CMCL increasing mode ordering, default), 1 (FFT ordering)
    opts.gpu_spreadinterponly: 0 (do NUFFT, default), 1 (only spread/interp)
  Outputs:
    f     size (ms,mt,mu) complex array of Fourier coefficients
          (ordering given by opts.modeord in each dimension; ms fastest, mu
          slowest), or, if ntrans>1, a 4D array of size (ms,mt,mu,ntrans).

Notes:
 * For CUFINUFFT all array I/O is in the form of gpuArrays (on-device).
 * The precision of gpuArray input x controls whether the double or
   single precision GPU library is called; all array inputs must match in
   location (ie, be gpuArrays), and in precision.
 * The vectorized (many vector) interface, ie ntrans>1, can be faster
   than repeated calls with the same nonuniform points. Note that here the
   I/O data ordering is stacked not interleaved. See ../docs/matlab_gpu.rst
 * For more details about the opts fields, see ../docs/c_gpu.rst
 * See ERRHANDLER, VALID_* and CUFINUFFT_PLAN for possible warning/error IDs.
 * Full documentation is online at http://finufft.readthedocs.io

See also CUFINUFFT_PLAN.
CUFINUFFT3D2   GPU 3D complex nonuniform FFT, type 2 (uniform to nonuniform).

c = cufinufft3d2(x,y,z,isign,eps,f)
c = cufinufft3d2(x,y,z,isign,eps,f,opts)

This computes on the GPU, to relative precision eps, via a fast algorithm:

   c[j] =   SUM   f[k1,k2,k3] exp(+/-i (k1 x[j] + k2 y[j] + k3 z[j]))
          k1,k2,k3
                           for j = 1,..,nj
    where sum is over -ms/2 <= k1 <= (ms-1)/2, -mt/2 <= k2 <= (mt-1)/2,
                      -mu/2 <= k3 <= (mu-1)/2.

 Inputs:
    x,y,z real-valued coordinates of nonuniform targets,
          each a vector of length nj
    f     complex Fourier coefficient array, whose size sets (ms,mt,mu).
          (Mode ordering given by opts.modeord, in each dimension.)
          If a 4D array, 4th dimension sets ntrans, and each of ntrans
          3D arrays is transformed with the same nonuniform targets.
    isign if >=0, uses + sign in exponential, otherwise - sign.
    eps   relative precision requested (generally between 1e-15 and 1e-1)
    opts   optional struct with optional fields controlling the following:
    opts.debug:   0 (silent, default), 1 (timing breakdown), 2 (debug info).
    opts.upsampfac:   sigma.  2.0 (default), or 1.25 (low RAM, smaller FFT).
    opts.gpu_method:  0 (auto, default), 1 (GM or GM-sort), 2 (SM).
    opts.gpu_sort:  0 (do not sort NU pts), 1 (sort when GM method, default).
    opts.gpu_kerevalmeth:  0 (slow reference). 1 (Horner ppoly, default).
    opts.gpu_maxsubprobsize:  max # NU pts per subprob (gpu_method=2 only).
    opts.gpu_binsize{x,y,z}:  various binsizes in GM-sort/SM (for experts).
    opts.gpu_maxbatchsize:   0 (auto, default), or many-vector batch size.
    opts.gpu_device_id:  sets the GPU device ID (experts only).
    opts.modeord: 0 (CMCL increasing mode ordering, default), 1 (FFT ordering)
    opts.gpu_spreadinterponly: 0 (do NUFFT, default), 1 (only spread/interp)
 Outputs:
    c     complex column vector of nj answers at targets, or,
          if ntrans>1, matrix of size (nj,ntrans).

Notes:
 * For CUFINUFFT all array I/O is in the form of gpuArrays (on-device).
 * The precision of gpuArray input x controls whether the double or
   single precision GPU library is called; all array inputs must match in
   location (ie, be gpuArrays), and in precision.
 * The vectorized (many vector) interface, ie ntrans>1, can be faster
   than repeated calls with the same nonuniform points. Note that here the
   I/O data ordering is stacked not interleaved. See ../docs/matlab_gpu.rst
 * For more details about the opts fields, see ../docs/c_gpu.rst
 * See ERRHANDLER, VALID_* and CUFINUFFT_PLAN for possible warning/error IDs.
 * Full documentation is online at http://finufft.readthedocs.io

See also CUFINUFFT_PLAN.
CUFINUFFT3D3   GPU 3D complex nonuniform FFT, type 3 (nonuniform to nonuniform).

f = cufinufft3d3(x,y,z,c,isign,eps,s,t,u)
f = cufinufft3d3(x,y,z,c,isign,eps,s,t,u,opts)

This computes on the GPU, to relative precision eps, via a fast algorithm:

             nj
    f[k]  =  SUM   c[j] exp(+-i (s[k] x[j] + t[k] y[j] + u[k] z[j])),
             j=1
                             for k = 1, ..., nk
  Inputs:
    x,y,z  coordinates of nonuniform sources in R^3, each a length-nj vector.
    c     length-nj complex vector of source strengths. If numel(c)>nj,
          expects a stack of vectors (eg, a nj*ntrans matrix) each of which is
          transformed with the same source and target locations.
    isign if >=0, uses + sign in exponential, otherwise - sign.
    eps   relative precision requested (generally between 1e-15 and 1e-1)
    s,t,u  frequency coordinates of nonuniform targets in R^3,
          each a length-nk vector.
    opts   optional struct with optional fields controlling the following:
    opts.debug:   0 (silent, default), 1 (timing breakdown), 2 (debug info).
    opts.upsampfac:   sigma.  2.0 (default), or 1.25 (low RAM, smaller FFT).
    opts.gpu_method:  0 (auto, default), 1 (GM or GM-sort), 2 (SM).
    opts.gpu_sort:  0 (do not sort NU pts), 1 (sort when GM method, default).
    opts.gpu_kerevalmeth:  0 (slow reference). 1 (Horner ppoly, default).
    opts.gpu_maxsubprobsize:  max # NU pts per subprob (gpu_method=2 only).
    opts.gpu_binsize{x,y,z}:  various binsizes in GM-sort/SM (for experts).
    opts.gpu_maxbatchsize:   0 (auto, default), or many-vector batch size.
    opts.gpu_device_id:  sets the GPU device ID (experts only).
  Outputs:
    f     length-nk complex vector of values at targets, or, if ntrans>1,
          a matrix of size (nk,ntrans)

Notes:
 * For CUFINUFFT all array I/O is in the form of gpuArrays (on-device).
 * The precision of gpuArray input x controls whether the double or
   single precision GPU library is called; all array inputs must match in
   location (ie, be gpuArrays), and in precision.
 * The vectorized (many vector) interface, ie ntrans>1, can be faster
   than repeated calls with the same nonuniform points. Note that here the
   I/O data ordering is stacked not interleaved. See ../docs/matlab_gpu.rst
 * For more details about the opts fields, see ../docs/c_gpu.rst
 * See ERRHANDLER, VALID_* and CUFINUFFT_PLAN for possible warning/error IDs.
 * Full documentation is online at http://finufft.readthedocs.io

See also CUFINUFFT_PLAN.
CUFINUFFT_PLAN   is a class which wraps the guru GPU interface to FINUFFT.

 Full documentation is online at http://finufft.readthedocs.io
 Also see examples in matlab/cuda/examples and matlab/cuda/test

PROPERTIES
  mwptr - opaque pointer to a C++ finufft_plan object (see MWrap manual),
          whose properties cannot be accessed directly
  floatprec - either 'double' or 'single', tracks what precision of C++
          library is being called
  type, dim, n_modes, n_trans, nj, nk, xj, yj, zj - other plan parameters.
 Note: the user should never alter these plan properties directly! Rather,
 the below methods should be used to create, use, and destroy plans.

METHODS
  cufinufft_plan - create plan object for one/many general nonuniform FFTs.
  setpts       - process nonuniform points for general FINUFFT transform(s).
  execute      - execute single or many-vector FINUFFT transforms in a plan.

General notes:
 * All array inputs and outputs are MATLAB gpuArrays of the same precision.
 * Use delete(plan) to remove a plan after use.
 * See ERRHANDLER, VALID_*, and this code for warning/error IDs.



=========== Detailed description of guru methods ==========================

1) CUFINUFFT_PLAN create plan object for one/many general nonuniform FFTs.

plan = cufinufft_plan(type, n_modes_or_dim, isign, ntrans, eps)
plan = cufinufft_plan(type, n_modes_or_dim, isign, ntrans, eps, opts)

Creates a cufinufft_plan MATLAB object in the interface to GPU FINUFFT, of
 type 1, 2 or 3, and with given numbers of Fourier modes (unless type 3).

Inputs:
    type            transform type: 1, 2, or 3
    n_modes_or_dim  if type is 1 or 2, the number of Fourier modes in each
                    dimension: [ms] in 1D, [ms mt] in 2D, or [ms mt mu] in 3D.
                    Its length sets the dimension, which must be 1, 2 or 3.
                    If type is 3, in contrast, its *value* fixes the dimension
    isign if >=0, uses + sign in exponential, otherwise - sign.
    eps   relative precision requested (generally between 1e-15 and 1e-1)
    opts   optional struct with optional fields controlling the following:
    opts.debug:   0 (silent, default), 1 (timing breakdown), 2 (debug info).
    opts.upsampfac:   sigma.  2.0 (default), or 1.25 (low RAM, smaller FFT).
    opts.gpu_method:  0 (auto, default), 1 (GM or GM-sort), 2 (SM).
    opts.gpu_sort:  0 (do not sort NU pts), 1 (sort when GM method, default).
    opts.gpu_kerevalmeth:  0 (slow reference). 1 (Horner ppoly, default).
    opts.gpu_maxsubprobsize:  max # NU pts per subprob (gpu_method=2 only).
    opts.gpu_binsize{x,y,z}:  various binsizes in GM-sort/SM (for experts).
    opts.gpu_maxbatchsize:   0 (auto, default), or many-vector batch size.
    opts.gpu_device_id:  sets the GPU device ID (experts only).
    opts.floatprec: library precision to use, 'double' (default) or 'single'.
    for type 1 and 2 only, the following opts fields are also relevant:
    opts.modeord: 0 (CMCL increasing mode ordering, default), 1 (FFT ordering)
    opts.gpu_spreadinterponly: 0 (do NUFFT, default), 1 (only spread/interp)
Outputs:
    plan            cufinufft_plan object (opaque pointer)

Notes:
 * For type 1 and 2, this does the cuFFT planning and kernel-FT precompute.
 * For type 3, this does very little, since cuFFT sizes are not yet known.
 * The vectorized (many vector) plan, ie ntrans>1, can be faster
   than repeated calls with the same nonuniform points. Note that here the
   I/O data ordering is stacked not interleaved. See ../docs/matlab_gpu.rst
 * For more details about the opts fields, see ../docs/c_gpu.rst


2) SETPTS   process nonuniform points for general GPU FINUFFT transform(s).

plan.setpts(xj)
plan.setpts(xj, yj)
plan.setpts(xj, yj, zj)
plan.setpts(xj, [], [], s)
plan.setpts(xj, yj, [], s, t)
plan.setpts(xj, yj, zj, s, t, u)

 When plan is a cufinufft_plan MATLAB object, brings in nonuniform point
 coordinates (xj,yj,zj), and additionally in the type 3 case, nonuniform
 frequency target points (s,t,u). Empty arrays may be passed in the case of
 unused dimensions. For all types, sorting is done to internally store a
 reindexing of points, and for type 3 the spreading and FFTs are planned.
 The nonuniform points may be used for multiple transforms.

Inputs:
    xj     vector of x-coords of all nonuniform points
    yj     empty (if dim<2), or vector of y-coords of all nonuniform points
    zj     empty (if dim<3), or vector of z-coords of all nonuniform points
    s      vector of x-coords of all nonuniform frequency targets
    t      empty (if dim<2), or vector of y-coords of all frequency targets
    u      empty (if dim<3), or vector of z-coords of all frequency targets
Input/Outputs:
    plan   cufinufft_plan object

Notes:
 * The values in xj (and if nonempty, yj and zj) are real-valued, and
   invariant under translations by multiples of 2pi. For type 1
   they are "sources", whereas for type 2 they are "targets".
   For type 3 there is no periodicity, and no restrictions other
   than the resulting size of the internal fine grids.
 * s (and t and u) are only relevant for type 3, and may be omitted otherwise
 * The matlab vectors xj,... and s,... should not be changed before calling
   future execute calls, because the plan stores only pointers to the
   arrays (they are not duplicated internally).
 * The precision (double/single) of all inputs must match that chosen at the
   plan stage using opts.floatprec, otherwise an error is raised.


3) EXECUTE   execute single or many-vector GPU FINUFFT transforms in a plan.

result = plan.execute(data_in);

 For plan a previously created cufinufft_plan object also containing all
 needed nonuniform point coordinates, do a single (or if ntrans>1 in the
 plan stage, multiple) NUFFT transform(s), with the strengths or Fourier
 coefficient inputs vector(s) from data_in. The result of the transform(s)
 is returned as a (possibly multidimensional) gpuArray.

Inputs:
    plan     cufinufft_plan object
    data_in  strengths (types 1 or 3) or Fourier coefficients (type 2)
             vector, matrix, or array of appropriate size. For type 1 and 3,
             this is either a length-M vector (where M is the length of xj),
             or an (M,ntrans) matrix when ntrans>1. For type 2, in 1D this is
             length-ms, in 2D size (ms,mt), or in 3D size (ms,mt,mu), or
             each of these with an extra last dimension ntrans if ntrans>1.
Outputs:
    result   vector of output strengths at targets (types 2 or 3), or array
             of Fourier coefficients (type 1), or, if ntrans>1, a stack of
             such vectors or arrays, of appropriate size.
             Specifically, if ntrans=1, for type 1, in 1D
             this is a length-ms column vector, in 2D a matrix of size
             (ms,mt), or in 3D an array of size (ms,mt,mu); for types 2 and 3
             it is a column vector of length M (the length of xj in type 2),
             or nk (the length of s in type 3). If ntrans>1 its is a stack
             of such objects, ie, it has an extra last dimension ntrans.

Notes:
 * The precision (double/single) of all gpuArrays must match that chosen at
   the plan stage using opts.floatprec, otherwise an error is raised.


4) To deallocate (delete) a GPU nonuniform FFT plan, use delete(plan)

This deallocates all stored cuFFT plans, nonuniform point sorting arrays,
 kernel Fourier transforms arrays, etc.