MATLAB GPU interfaces¶
Note
See the MATLAB GPU installation page for how to build these interfaces.
We in 2025 we added MATLAB Parallel Computing Toolbox gpuArray interfaces to
CUFINUFFT, our GPU library.
Here we follow the MATLAB/Octave CPU tutorial page, which
you should to read first, to learn the basics of the interfaces and
array indexing.
The key change for the GPU user
is that our CUDA FINUFFT (aka CUFINUFFT) MATLAB interface
acts on gpuArray objects for the main I/O data arrays, and this requires
MATLAB’s Parallel Computing Toolbox. This is a commercial product,
and we do not currently have an Octave solution for the GPU.
Quick-start example¶
We jump straight into a 2D type 1 transform using the simple interface, in single precision. Let’s request a rectangular output Fourier mode array of 10000 modes in the x direction but 5000 in the y direction. We create 100 millions source points directly on the GPU, with coordinates lying in the square of side length \(2\pi\):
M = 1e8;
x = 2*pi*gpuArray.rand(M,1,'single'); % random pts in [0,2pi]^2
y = 2*pi*gpuArray.rand(M,1,'single');
c = gpuArray.randn(M,1,'single')+1i*gpuArray.randn(M,1,'single'); % iid random complex data
N1 = 10000; N2 = 5000; % desired Fourier mode array sizes
tol = 1e-3;
f = cufinufft2d1(x,y,c,+1,tol,N1,N2); % do it (takes around 0.2 sec)
The resulting output f is a complex single-precision gpuArray of size
10000 by 5000. The first dimension
(number of rows) corresponds to the x input coordinate, and the second to y.
If you need to change the definition of the period from \(2\pi\), you cannot;
instead linearly rescale your points before sending them to FINUFFT.
The above shows a throughput of about 0.5 billion points/sec (on my A6000).
For the full example code that also verifies one of the outputs,
see simple1d1f_gpu.m.
Note
Timing GPU functions in MATLAB is misleading when using plain tic and toc, because of asynchronous computation: the toc is often executed before the gpuArray function has actually completed! For correct timings, use the following pattern:
dev = gpuDevice();
tic
f = cufinufft2d1(x,y,c,+1,tol,N1,N2);
wait(dev)
toc
Note
Under the hood cuFINUFFT has double- and single-precision libraries. The simple and vectorized GPU MATLAB interfaces infer which precision library to call by checking the precision of its input arrays, which must all match (ie, all must be double gpuArrays or all must be single gpuArrays). The precision of a gpuArray cannot be inferred via class; instead one can use underlyingType (introduced recently in R2020b). In contrast, precision in the guru interface is set with the cufinufft_plan option string opts.floatprec, either 'double' (the default), or 'single'.
Warning
The cufinufft?d? MATLAB commands are a guaranteed way to call the GPU library. We also offer an experimental GPU overloading of the finufft?d? MATLAB commands: when the first argument x is a gpuArray then this redirects to the corresponding cufinufft?d? function. This follows The MathWorks style, and is achieved by the tiny wrapper codes in matlab/@gpuArray/. (We actually exploit this in test/fullmathtest.m.) For this to work the user must add the FINUFFT matlab directory to the path before any gpuArray objects are used in the current session. This is because the gpuArray class is “closed”. Since TMW may prevent library developers from overloading in this way at any time in the future, the overloaded interface remains experimental as of FINUFFT version 2.4.
For several examples of using the cuFINUFFT guru (plan) interface, see examples in the repo.
Full documentation of GPU interface¶
Here are the help documentation strings for all MATLAB GPU interfaces.
The interfaces are the same as the GPU ones except preceded by “cu”.
The options descriptions are rather abbreviated in the below;
for full documentation see Options for GPU code.
Informative warnings and errors are raised in MATLAB style with unique
codes (see sources errhandler.m, cufinufft.mw, and
valid_*.m found here).
The low-level error number codes are not used.
The individual GPU commands have the full help documentation:
CUFINUFFT1D1 GPU 1D complex nonuniform FFT, type 1 (nonuniform to uniform).
f = cufinufft1d1(x,c,isign,eps,ms)
f = cufinufft1d1(x,c,isign,eps,ms,opts)
This computes on the GPU, to relative precision eps, via a fast algorithm:
nj
f(k1) = SUM c[j] exp(+/-i k1 x(j)) for -ms/2 <= k1 <= (ms-1)/2
j=1
Inputs:
x length-nj vector of real-valued locations of nonuniform sources
c length-nj complex vector of source strengths. If numel(c)>nj,
expects a stack of vectors (eg, a nj*ntrans matrix) each of which is
transformed with the same source locations.
isign if >=0, uses + sign in exponential, otherwise - sign.
eps relative precision requested (generally between 1e-15 and 1e-1)
ms number of Fourier modes computed, may be even or odd;
in either case, mode range is integers lying in [-ms/2, (ms-1)/2]
opts optional struct with optional fields controlling the following:
opts.debug: 0 (silent, default), 1 (timing breakdown), 2 (debug info).
opts.upsampfac: sigma. 2.0 (default), or 1.25 (low RAM, smaller FFT).
opts.gpu_method: 0 (auto, default), 1 (GM or GM-sort), 2 (SM).
opts.gpu_sort: 0 (do not sort NU pts), 1 (sort when GM method, default).
opts.gpu_kerevalmeth: 0 (slow reference). 1 (Horner ppoly, default).
opts.gpu_maxsubprobsize: max # NU pts per subprob (gpu_method=2 only).
opts.gpu_binsize{x,y,z}: various binsizes in GM-sort/SM (for experts).
opts.gpu_maxbatchsize: 0 (auto, default), or many-vector batch size.
opts.gpu_device_id: sets the GPU device ID (experts only).
opts.modeord: 0 (CMCL increasing mode ordering, default), 1 (FFT ordering)
opts.gpu_spreadinterponly: 0 (do NUFFT, default), 1 (only spread/interp)
Outputs:
f size-ms complex column vector of Fourier coefficients, or, if
ntrans>1, a matrix of size (ms,ntrans).
Notes:
* For CUFINUFFT all array I/O is in the form of gpuArrays (on-device).
* The precision of gpuArray input x controls whether the double or
single precision GPU library is called; all array inputs must match in
location (ie, be gpuArrays), and in precision.
* The vectorized (many vector) interface, ie ntrans>1, can be faster
than repeated calls with the same nonuniform points. Note that here the
I/O data ordering is stacked not interleaved. See ../docs/matlab_gpu.rst
* For more details about the opts fields, see ../docs/c_gpu.rst
* See ERRHANDLER, VALID_* and CUFINUFFT_PLAN for possible warning/error IDs.
* Full documentation is online at http://finufft.readthedocs.io
See also CUFINUFFT_PLAN.
CUFINUFFT1D2 GPU 1D complex nonuniform FFT, type 2 (uniform to nonuniform).
c = cufinufft1d2(x,isign,eps,f)
c = cufinufft1d2(x,isign,eps,f,opts)
This computes on the GPU, to relative precision eps, via a fast algorithm:
c[j] = SUM f[k1] exp(+/-i k1 x[j]) for j = 1,...,nj
k1
where sum is over -ms/2 <= k1 <= (ms-1)/2.
Inputs:
x length-nj vector of real-valued locations of nonuniform sources
f complex Fourier coefficients. If a vector, length sets ms
(with mode ordering given by opts.modeord). If a matrix, each
of ntrans columns is transformed with the same nonuniform targets.
isign if >=0, uses + sign in exponential, otherwise - sign.
eps relative precision requested (generally between 1e-15 and 1e-1)
opts optional struct with optional fields controlling the following:
opts.debug: 0 (silent, default), 1 (timing breakdown), 2 (debug info).
opts.upsampfac: sigma. 2.0 (default), or 1.25 (low RAM, smaller FFT).
opts.gpu_method: 0 (auto, default), 1 (GM or GM-sort), 2 (SM).
opts.gpu_sort: 0 (do not sort NU pts), 1 (sort when GM method, default).
opts.gpu_kerevalmeth: 0 (slow reference). 1 (Horner ppoly, default).
opts.gpu_maxsubprobsize: max # NU pts per subprob (gpu_method=2 only).
opts.gpu_binsize{x,y,z}: various binsizes in GM-sort/SM (for experts).
opts.gpu_maxbatchsize: 0 (auto, default), or many-vector batch size.
opts.gpu_device_id: sets the GPU device ID (experts only).
opts.modeord: 0 (CMCL increasing mode ordering, default), 1 (FFT ordering)
opts.gpu_spreadinterponly: 0 (do NUFFT, default), 1 (only spread/interp)
Outputs:
c complex column vector of nj answers at targets, or,
if ntrans>1, matrix of size (nj,ntrans).
Notes:
* For CUFINUFFT all array I/O is in the form of gpuArrays (on-device).
* The precision of gpuArray input x controls whether the double or
single precision GPU library is called; all array inputs must match in
location (ie, be gpuArrays), and in precision.
* The vectorized (many vector) interface, ie ntrans>1, can be faster
than repeated calls with the same nonuniform points. Note that here the
I/O data ordering is stacked not interleaved. See ../docs/matlab_gpu.rst
* For more details about the opts fields, see ../docs/c_gpu.rst
* See ERRHANDLER, VALID_* and CUFINUFFT_PLAN for possible warning/error IDs.
* Full documentation is online at http://finufft.readthedocs.io
See also CUFINUFFT_PLAN.
CUFINUFFT1D3 GPU 1D complex nonuniform FFT, type 3 (nonuniform to nonuniform).
f = cufinufft1d3(x,c,isign,eps,s)
f = cufinufft1d3(x,c,isign,eps,s,opts)
This computes on the GPU, to relative precision eps, via a fast algorithm:
nj
f[k] = SUM c[j] exp(+-i s[k] x[j]), for k = 1, ..., nk
j=1
Inputs:
x length-nj vector of real-valued locations of nonuniform sources
c length-nj complex vector of source strengths. If numel(c)>nj,
expects a stack of vectors (eg, a nj*ntrans matrix) each of which is
transformed with the same source and target locations.
isign if >=0, uses + sign in exponential, otherwise - sign.
eps relative precision requested (generally between 1e-15 and 1e-1)
s length-nk vector of frequency locations of nonuniform targets
opts optional struct with optional fields controlling the following:
opts.debug: 0 (silent, default), 1 (timing breakdown), 2 (debug info).
opts.upsampfac: sigma. 2.0 (default), or 1.25 (low RAM, smaller FFT).
opts.gpu_method: 0 (auto, default), 1 (GM or GM-sort), 2 (SM).
opts.gpu_sort: 0 (do not sort NU pts), 1 (sort when GM method, default).
opts.gpu_kerevalmeth: 0 (slow reference). 1 (Horner ppoly, default).
opts.gpu_maxsubprobsize: max # NU pts per subprob (gpu_method=2 only).
opts.gpu_binsize{x,y,z}: various binsizes in GM-sort/SM (for experts).
opts.gpu_maxbatchsize: 0 (auto, default), or many-vector batch size.
opts.gpu_device_id: sets the GPU device ID (experts only).
Outputs:
f length-nk complex vector of values at targets, or, if ntrans>1,
a matrix of size (nk,ntrans)
Notes:
* For CUFINUFFT all array I/O is in the form of gpuArrays (on-device).
* The precision of gpuArray input x controls whether the double or
single precision GPU library is called; all array inputs must match in
location (ie, be gpuArrays), and in precision.
* The vectorized (many vector) interface, ie ntrans>1, can be faster
than repeated calls with the same nonuniform points. Note that here the
I/O data ordering is stacked not interleaved. See ../docs/matlab_gpu.rst
* For more details about the opts fields, see ../docs/c_gpu.rst
* See ERRHANDLER, VALID_* and CUFINUFFT_PLAN for possible warning/error IDs.
* Full documentation is online at http://finufft.readthedocs.io
See also CUFINUFFT_PLAN.
CUFINUFFT2D1 GPU 2D complex nonuniform FFT, type 1 (nonuniform to uniform).
f = cufinufft2d1(x,y,c,isign,eps,ms,mt)
f = cufinufft2d1(x,y,c,isign,eps,ms,mt,opts)
This computes on the GPU, to relative precision eps, via a fast algorithm:
nj
f[k1,k2] = SUM c[j] exp(+-i (k1 x[j] + k2 y[j]))
j=1
for -ms/2 <= k1 <= (ms-1)/2, -mt/2 <= k2 <= (mt-1)/2.
Inputs:
x,y real-valued coordinates of nonuniform sources in the plane,
each a length-nj vector
c length-nj complex vector of source strengths. If numel(c)>nj,
expects a stack of vectors (eg, a nj*ntrans matrix) each of which is
transformed with the same source locations.
isign if >=0, uses + sign in exponential, otherwise - sign.
eps relative precision requested (generally between 1e-15 and 1e-1)
ms,mt number of Fourier modes requested in x & y; each may be even or odd.
In either case the mode range is integers lying in [-m/2, (m-1)/2]
opts optional struct with optional fields controlling the following:
opts.debug: 0 (silent, default), 1 (timing breakdown), 2 (debug info).
opts.upsampfac: sigma. 2.0 (default), or 1.25 (low RAM, smaller FFT).
opts.gpu_method: 0 (auto, default), 1 (GM or GM-sort), 2 (SM).
opts.gpu_sort: 0 (do not sort NU pts), 1 (sort when GM method, default).
opts.gpu_kerevalmeth: 0 (slow reference). 1 (Horner ppoly, default).
opts.gpu_maxsubprobsize: max # NU pts per subprob (gpu_method=2 only).
opts.gpu_binsize{x,y,z}: various binsizes in GM-sort/SM (for experts).
opts.gpu_maxbatchsize: 0 (auto, default), or many-vector batch size.
opts.gpu_device_id: sets the GPU device ID (experts only).
opts.modeord: 0 (CMCL increasing mode ordering, default), 1 (FFT ordering)
opts.gpu_spreadinterponly: 0 (do NUFFT, default), 1 (only spread/interp)
Outputs:
f size (ms,mt) complex matrix of Fourier coefficients
(ordering given by opts.modeord in each dimension; ms fast, mt slow),
or, if ntrans>1, a 3D array of size (ms,mt,ntrans).
Notes:
* For CUFINUFFT all array I/O is in the form of gpuArrays (on-device).
* The precision of gpuArray input x controls whether the double or
single precision GPU library is called; all array inputs must match in
location (ie, be gpuArrays), and in precision.
* The vectorized (many vector) interface, ie ntrans>1, can be faster
than repeated calls with the same nonuniform points. Note that here the
I/O data ordering is stacked not interleaved. See ../docs/matlab_gpu.rst
* For more details about the opts fields, see ../docs/c_gpu.rst
* See ERRHANDLER, VALID_* and CUFINUFFT_PLAN for possible warning/error IDs.
* Full documentation is online at http://finufft.readthedocs.io
See also CUFINUFFT_PLAN.
CUFINUFFT2D2 GPU 2D complex nonuniform FFT, type 2 (uniform to nonuniform).
c = cufinufft2d2(x,y,isign,eps,f)
c = cufinufft2d2(x,y,isign,eps,f,opts)
This computes on the GPU, to relative precision eps, via a fast algorithm:
c[j] = SUM f[k1,k2] exp(+/-i (k1 x[j] + k2 y[j])) for j = 1,..,nj
k1,k2
where sum is over -ms/2 <= k1 <= (ms-1)/2, -mt/2 <= k2 <= (mt-1)/2,
Inputs:
x,y real-valued coordinates of nonuniform targets in the plane,
each a vector of length nj
f complex Fourier coefficient matrix, whose size determines (ms,mt).
(Mode ordering given by opts.modeord, in each dimension.)
If a 3D array, 3rd dimension sets ntrans, and each of ntrans
matrices is transformed with the same nonuniform targets.
isign if >=0, uses + sign in exponential, otherwise - sign.
eps relative precision requested (generally between 1e-15 and 1e-1)
opts optional struct with optional fields controlling the following:
opts.debug: 0 (silent, default), 1 (timing breakdown), 2 (debug info).
opts.upsampfac: sigma. 2.0 (default), or 1.25 (low RAM, smaller FFT).
opts.gpu_method: 0 (auto, default), 1 (GM or GM-sort), 2 (SM).
opts.gpu_sort: 0 (do not sort NU pts), 1 (sort when GM method, default).
opts.gpu_kerevalmeth: 0 (slow reference). 1 (Horner ppoly, default).
opts.gpu_maxsubprobsize: max # NU pts per subprob (gpu_method=2 only).
opts.gpu_binsize{x,y,z}: various binsizes in GM-sort/SM (for experts).
opts.gpu_maxbatchsize: 0 (auto, default), or many-vector batch size.
opts.gpu_device_id: sets the GPU device ID (experts only).
opts.modeord: 0 (CMCL increasing mode ordering, default), 1 (FFT ordering)
opts.gpu_spreadinterponly: 0 (do NUFFT, default), 1 (only spread/interp)
Outputs:
c complex column vector of nj answers at targets, or,
if ntrans>1, matrix of size (nj,ntrans).
Notes:
* For CUFINUFFT all array I/O is in the form of gpuArrays (on-device).
* The precision of gpuArray input x controls whether the double or
single precision GPU library is called; all array inputs must match in
location (ie, be gpuArrays), and in precision.
* The vectorized (many vector) interface, ie ntrans>1, can be faster
than repeated calls with the same nonuniform points. Note that here the
I/O data ordering is stacked not interleaved. See ../docs/matlab_gpu.rst
* For more details about the opts fields, see ../docs/c_gpu.rst
* See ERRHANDLER, VALID_* and CUFINUFFT_PLAN for possible warning/error IDs.
* Full documentation is online at http://finufft.readthedocs.io
See also CUFINUFFT_PLAN.
CUFINUFFT2D3 GPU 2D complex nonuniform FFT, type 3 (nonuniform to nonuniform).
f = cufinufft2d3(x,y,c,isign,eps,s,t)
f = cufinufft2d3(x,y,c,isign,eps,s,t,opts)
This computes on the GPU, to relative precision eps, via a fast algorithm:
nj
f[k] = SUM c[j] exp(+-i (s[k] x[j] + t[k] y[j])), for k = 1, ..., nk
j=1
Inputs:
x,y coordinates of nonuniform sources in R^2, each a length-nj vector.
c length-nj complex vector of source strengths. If numel(c)>nj,
expects a stack of vectors (eg, a nj*ntrans matrix) each of which is
transformed with the same source and target locations.
isign if >=0, uses + sign in exponential, otherwise - sign.
eps relative precision requested (generally between 1e-15 and 1e-1)
s,t frequency coordinates of nonuniform targets in R^2,
each a length-nk vector.
opts optional struct with optional fields controlling the following:
opts.debug: 0 (silent, default), 1 (timing breakdown), 2 (debug info).
opts.upsampfac: sigma. 2.0 (default), or 1.25 (low RAM, smaller FFT).
opts.gpu_method: 0 (auto, default), 1 (GM or GM-sort), 2 (SM).
opts.gpu_sort: 0 (do not sort NU pts), 1 (sort when GM method, default).
opts.gpu_kerevalmeth: 0 (slow reference). 1 (Horner ppoly, default).
opts.gpu_maxsubprobsize: max # NU pts per subprob (gpu_method=2 only).
opts.gpu_binsize{x,y,z}: various binsizes in GM-sort/SM (for experts).
opts.gpu_maxbatchsize: 0 (auto, default), or many-vector batch size.
opts.gpu_device_id: sets the GPU device ID (experts only).
Outputs:
f length-nk complex vector of values at targets, or, if ntrans>1,
a matrix of size (nk,ntrans)
Notes:
* For CUFINUFFT all array I/O is in the form of gpuArrays (on-device).
* The precision of gpuArray input x controls whether the double or
single precision GPU library is called; all array inputs must match in
location (ie, be gpuArrays), and in precision.
* The vectorized (many vector) interface, ie ntrans>1, can be faster
than repeated calls with the same nonuniform points. Note that here the
I/O data ordering is stacked not interleaved. See ../docs/matlab_gpu.rst
* For more details about the opts fields, see ../docs/c_gpu.rst
* See ERRHANDLER, VALID_* and CUFINUFFT_PLAN for possible warning/error IDs.
* Full documentation is online at http://finufft.readthedocs.io
See also CUFINUFFT_PLAN.
CUFINUFFT3D1 GPU 3D complex nonuniform FFT, type 1 (nonuniform to uniform).
f = cufinufft3d1(x,y,z,c,isign,eps,ms,mt,mu)
f = cufinufft3d1(x,y,z,c,isign,eps,ms,mt,mu,opts)
This computes on the GPU, to relative precision eps, via a fast algorithm:
nj
f[k1,k2,k3] = SUM c[j] exp(+-i (k1 x[j] + k2 y[j] + k3 z[j]))
j=1
for -ms/2 <= k1 <= (ms-1)/2, -mt/2 <= k2 <= (mt-1)/2,
-mu/2 <= k3 <= (mu-1)/2.
Inputs:
x,y,z real-valued coordinates of nonuniform sources,
each a length-nj vector
c length-nj complex vector of source strengths. If numel(c)>nj,
expects a stack of vectors (eg, a nj*ntrans matrix) each of which is
transformed with the same source locations.
isign if >=0, uses + sign in exponential, otherwise - sign.
eps relative precision requested (generally between 1e-15 and 1e-1)
ms,mt,mu number of Fourier modes requested in x,y and z; each may be
even or odd.
In either case the mode range is integers lying in [-m/2, (m-1)/2]
opts optional struct with optional fields controlling the following:
opts.debug: 0 (silent, default), 1 (timing breakdown), 2 (debug info).
opts.upsampfac: sigma. 2.0 (default), or 1.25 (low RAM, smaller FFT).
opts.gpu_method: 0 (auto, default), 1 (GM or GM-sort), 2 (SM).
opts.gpu_sort: 0 (do not sort NU pts), 1 (sort when GM method, default).
opts.gpu_kerevalmeth: 0 (slow reference). 1 (Horner ppoly, default).
opts.gpu_maxsubprobsize: max # NU pts per subprob (gpu_method=2 only).
opts.gpu_binsize{x,y,z}: various binsizes in GM-sort/SM (for experts).
opts.gpu_maxbatchsize: 0 (auto, default), or many-vector batch size.
opts.gpu_device_id: sets the GPU device ID (experts only).
opts.modeord: 0 (CMCL increasing mode ordering, default), 1 (FFT ordering)
opts.gpu_spreadinterponly: 0 (do NUFFT, default), 1 (only spread/interp)
Outputs:
f size (ms,mt,mu) complex array of Fourier coefficients
(ordering given by opts.modeord in each dimension; ms fastest, mu
slowest), or, if ntrans>1, a 4D array of size (ms,mt,mu,ntrans).
Notes:
* For CUFINUFFT all array I/O is in the form of gpuArrays (on-device).
* The precision of gpuArray input x controls whether the double or
single precision GPU library is called; all array inputs must match in
location (ie, be gpuArrays), and in precision.
* The vectorized (many vector) interface, ie ntrans>1, can be faster
than repeated calls with the same nonuniform points. Note that here the
I/O data ordering is stacked not interleaved. See ../docs/matlab_gpu.rst
* For more details about the opts fields, see ../docs/c_gpu.rst
* See ERRHANDLER, VALID_* and CUFINUFFT_PLAN for possible warning/error IDs.
* Full documentation is online at http://finufft.readthedocs.io
See also CUFINUFFT_PLAN.
CUFINUFFT3D2 GPU 3D complex nonuniform FFT, type 2 (uniform to nonuniform).
c = cufinufft3d2(x,y,z,isign,eps,f)
c = cufinufft3d2(x,y,z,isign,eps,f,opts)
This computes on the GPU, to relative precision eps, via a fast algorithm:
c[j] = SUM f[k1,k2,k3] exp(+/-i (k1 x[j] + k2 y[j] + k3 z[j]))
k1,k2,k3
for j = 1,..,nj
where sum is over -ms/2 <= k1 <= (ms-1)/2, -mt/2 <= k2 <= (mt-1)/2,
-mu/2 <= k3 <= (mu-1)/2.
Inputs:
x,y,z real-valued coordinates of nonuniform targets,
each a vector of length nj
f complex Fourier coefficient array, whose size sets (ms,mt,mu).
(Mode ordering given by opts.modeord, in each dimension.)
If a 4D array, 4th dimension sets ntrans, and each of ntrans
3D arrays is transformed with the same nonuniform targets.
isign if >=0, uses + sign in exponential, otherwise - sign.
eps relative precision requested (generally between 1e-15 and 1e-1)
opts optional struct with optional fields controlling the following:
opts.debug: 0 (silent, default), 1 (timing breakdown), 2 (debug info).
opts.upsampfac: sigma. 2.0 (default), or 1.25 (low RAM, smaller FFT).
opts.gpu_method: 0 (auto, default), 1 (GM or GM-sort), 2 (SM).
opts.gpu_sort: 0 (do not sort NU pts), 1 (sort when GM method, default).
opts.gpu_kerevalmeth: 0 (slow reference). 1 (Horner ppoly, default).
opts.gpu_maxsubprobsize: max # NU pts per subprob (gpu_method=2 only).
opts.gpu_binsize{x,y,z}: various binsizes in GM-sort/SM (for experts).
opts.gpu_maxbatchsize: 0 (auto, default), or many-vector batch size.
opts.gpu_device_id: sets the GPU device ID (experts only).
opts.modeord: 0 (CMCL increasing mode ordering, default), 1 (FFT ordering)
opts.gpu_spreadinterponly: 0 (do NUFFT, default), 1 (only spread/interp)
Outputs:
c complex column vector of nj answers at targets, or,
if ntrans>1, matrix of size (nj,ntrans).
Notes:
* For CUFINUFFT all array I/O is in the form of gpuArrays (on-device).
* The precision of gpuArray input x controls whether the double or
single precision GPU library is called; all array inputs must match in
location (ie, be gpuArrays), and in precision.
* The vectorized (many vector) interface, ie ntrans>1, can be faster
than repeated calls with the same nonuniform points. Note that here the
I/O data ordering is stacked not interleaved. See ../docs/matlab_gpu.rst
* For more details about the opts fields, see ../docs/c_gpu.rst
* See ERRHANDLER, VALID_* and CUFINUFFT_PLAN for possible warning/error IDs.
* Full documentation is online at http://finufft.readthedocs.io
See also CUFINUFFT_PLAN.
CUFINUFFT3D3 GPU 3D complex nonuniform FFT, type 3 (nonuniform to nonuniform).
f = cufinufft3d3(x,y,z,c,isign,eps,s,t,u)
f = cufinufft3d3(x,y,z,c,isign,eps,s,t,u,opts)
This computes on the GPU, to relative precision eps, via a fast algorithm:
nj
f[k] = SUM c[j] exp(+-i (s[k] x[j] + t[k] y[j] + u[k] z[j])),
j=1
for k = 1, ..., nk
Inputs:
x,y,z coordinates of nonuniform sources in R^3, each a length-nj vector.
c length-nj complex vector of source strengths. If numel(c)>nj,
expects a stack of vectors (eg, a nj*ntrans matrix) each of which is
transformed with the same source and target locations.
isign if >=0, uses + sign in exponential, otherwise - sign.
eps relative precision requested (generally between 1e-15 and 1e-1)
s,t,u frequency coordinates of nonuniform targets in R^3,
each a length-nk vector.
opts optional struct with optional fields controlling the following:
opts.debug: 0 (silent, default), 1 (timing breakdown), 2 (debug info).
opts.upsampfac: sigma. 2.0 (default), or 1.25 (low RAM, smaller FFT).
opts.gpu_method: 0 (auto, default), 1 (GM or GM-sort), 2 (SM).
opts.gpu_sort: 0 (do not sort NU pts), 1 (sort when GM method, default).
opts.gpu_kerevalmeth: 0 (slow reference). 1 (Horner ppoly, default).
opts.gpu_maxsubprobsize: max # NU pts per subprob (gpu_method=2 only).
opts.gpu_binsize{x,y,z}: various binsizes in GM-sort/SM (for experts).
opts.gpu_maxbatchsize: 0 (auto, default), or many-vector batch size.
opts.gpu_device_id: sets the GPU device ID (experts only).
Outputs:
f length-nk complex vector of values at targets, or, if ntrans>1,
a matrix of size (nk,ntrans)
Notes:
* For CUFINUFFT all array I/O is in the form of gpuArrays (on-device).
* The precision of gpuArray input x controls whether the double or
single precision GPU library is called; all array inputs must match in
location (ie, be gpuArrays), and in precision.
* The vectorized (many vector) interface, ie ntrans>1, can be faster
than repeated calls with the same nonuniform points. Note that here the
I/O data ordering is stacked not interleaved. See ../docs/matlab_gpu.rst
* For more details about the opts fields, see ../docs/c_gpu.rst
* See ERRHANDLER, VALID_* and CUFINUFFT_PLAN for possible warning/error IDs.
* Full documentation is online at http://finufft.readthedocs.io
See also CUFINUFFT_PLAN.
CUFINUFFT_PLAN is a class which wraps the guru GPU interface to FINUFFT.
Full documentation is online at http://finufft.readthedocs.io
Also see examples in matlab/cuda/examples and matlab/cuda/test
PROPERTIES
mwptr - opaque pointer to a C++ finufft_plan object (see MWrap manual),
whose properties cannot be accessed directly
floatprec - either 'double' or 'single', tracks what precision of C++
library is being called
type, dim, n_modes, n_trans, nj, nk, xj, yj, zj - other plan parameters.
Note: the user should never alter these plan properties directly! Rather,
the below methods should be used to create, use, and destroy plans.
METHODS
cufinufft_plan - create plan object for one/many general nonuniform FFTs.
setpts - process nonuniform points for general FINUFFT transform(s).
execute - execute single or many-vector FINUFFT transforms in a plan.
General notes:
* All array inputs and outputs are MATLAB gpuArrays of the same precision.
* Use delete(plan) to remove a plan after use.
* See ERRHANDLER, VALID_*, and this code for warning/error IDs.
=========== Detailed description of guru methods ==========================
1) CUFINUFFT_PLAN create plan object for one/many general nonuniform FFTs.
plan = cufinufft_plan(type, n_modes_or_dim, isign, ntrans, eps)
plan = cufinufft_plan(type, n_modes_or_dim, isign, ntrans, eps, opts)
Creates a cufinufft_plan MATLAB object in the interface to GPU FINUFFT, of
type 1, 2 or 3, and with given numbers of Fourier modes (unless type 3).
Inputs:
type transform type: 1, 2, or 3
n_modes_or_dim if type is 1 or 2, the number of Fourier modes in each
dimension: [ms] in 1D, [ms mt] in 2D, or [ms mt mu] in 3D.
Its length sets the dimension, which must be 1, 2 or 3.
If type is 3, in contrast, its *value* fixes the dimension
isign if >=0, uses + sign in exponential, otherwise - sign.
eps relative precision requested (generally between 1e-15 and 1e-1)
opts optional struct with optional fields controlling the following:
opts.debug: 0 (silent, default), 1 (timing breakdown), 2 (debug info).
opts.upsampfac: sigma. 2.0 (default), or 1.25 (low RAM, smaller FFT).
opts.gpu_method: 0 (auto, default), 1 (GM or GM-sort), 2 (SM).
opts.gpu_sort: 0 (do not sort NU pts), 1 (sort when GM method, default).
opts.gpu_kerevalmeth: 0 (slow reference). 1 (Horner ppoly, default).
opts.gpu_maxsubprobsize: max # NU pts per subprob (gpu_method=2 only).
opts.gpu_binsize{x,y,z}: various binsizes in GM-sort/SM (for experts).
opts.gpu_maxbatchsize: 0 (auto, default), or many-vector batch size.
opts.gpu_device_id: sets the GPU device ID (experts only).
opts.floatprec: library precision to use, 'double' (default) or 'single'.
for type 1 and 2 only, the following opts fields are also relevant:
opts.modeord: 0 (CMCL increasing mode ordering, default), 1 (FFT ordering)
opts.gpu_spreadinterponly: 0 (do NUFFT, default), 1 (only spread/interp)
Outputs:
plan cufinufft_plan object (opaque pointer)
Notes:
* For type 1 and 2, this does the cuFFT planning and kernel-FT precompute.
* For type 3, this does very little, since cuFFT sizes are not yet known.
* The vectorized (many vector) plan, ie ntrans>1, can be faster
than repeated calls with the same nonuniform points. Note that here the
I/O data ordering is stacked not interleaved. See ../docs/matlab_gpu.rst
* For more details about the opts fields, see ../docs/c_gpu.rst
2) SETPTS process nonuniform points for general GPU FINUFFT transform(s).
plan.setpts(xj)
plan.setpts(xj, yj)
plan.setpts(xj, yj, zj)
plan.setpts(xj, [], [], s)
plan.setpts(xj, yj, [], s, t)
plan.setpts(xj, yj, zj, s, t, u)
When plan is a cufinufft_plan MATLAB object, brings in nonuniform point
coordinates (xj,yj,zj), and additionally in the type 3 case, nonuniform
frequency target points (s,t,u). Empty arrays may be passed in the case of
unused dimensions. For all types, sorting is done to internally store a
reindexing of points, and for type 3 the spreading and FFTs are planned.
The nonuniform points may be used for multiple transforms.
Inputs:
xj vector of x-coords of all nonuniform points
yj empty (if dim<2), or vector of y-coords of all nonuniform points
zj empty (if dim<3), or vector of z-coords of all nonuniform points
s vector of x-coords of all nonuniform frequency targets
t empty (if dim<2), or vector of y-coords of all frequency targets
u empty (if dim<3), or vector of z-coords of all frequency targets
Input/Outputs:
plan cufinufft_plan object
Notes:
* The values in xj (and if nonempty, yj and zj) are real-valued, and
invariant under translations by multiples of 2pi. For type 1
they are "sources", whereas for type 2 they are "targets".
For type 3 there is no periodicity, and no restrictions other
than the resulting size of the internal fine grids.
* s (and t and u) are only relevant for type 3, and may be omitted otherwise
* The matlab vectors xj,... and s,... should not be changed before calling
future execute calls, because the plan stores only pointers to the
arrays (they are not duplicated internally).
* The precision (double/single) of all inputs must match that chosen at the
plan stage using opts.floatprec, otherwise an error is raised.
3) EXECUTE execute single or many-vector GPU FINUFFT transforms in a plan.
result = plan.execute(data_in);
For plan a previously created cufinufft_plan object also containing all
needed nonuniform point coordinates, do a single (or if ntrans>1 in the
plan stage, multiple) NUFFT transform(s), with the strengths or Fourier
coefficient inputs vector(s) from data_in. The result of the transform(s)
is returned as a (possibly multidimensional) gpuArray.
Inputs:
plan cufinufft_plan object
data_in strengths (types 1 or 3) or Fourier coefficients (type 2)
vector, matrix, or array of appropriate size. For type 1 and 3,
this is either a length-M vector (where M is the length of xj),
or an (M,ntrans) matrix when ntrans>1. For type 2, in 1D this is
length-ms, in 2D size (ms,mt), or in 3D size (ms,mt,mu), or
each of these with an extra last dimension ntrans if ntrans>1.
Outputs:
result vector of output strengths at targets (types 2 or 3), or array
of Fourier coefficients (type 1), or, if ntrans>1, a stack of
such vectors or arrays, of appropriate size.
Specifically, if ntrans=1, for type 1, in 1D
this is a length-ms column vector, in 2D a matrix of size
(ms,mt), or in 3D an array of size (ms,mt,mu); for types 2 and 3
it is a column vector of length M (the length of xj in type 2),
or nk (the length of s in type 3). If ntrans>1 its is a stack
of such objects, ie, it has an extra last dimension ntrans.
Notes:
* The precision (double/single) of all gpuArrays must match that chosen at
the plan stage using opts.floatprec, otherwise an error is raised.
4) To deallocate (delete) a GPU nonuniform FFT plan, use delete(plan)
This deallocates all stored cuFFT plans, nonuniform point sorting arrays,
kernel Fourier transforms arrays, etc.