Options parameters (CPU)¶
Aside from the mandatory inputs (dimension, type, nonuniform points, strengths or coefficients, and, in C++/C/Fortran/MATLAB, the sign of the imaginary unit and the tolerance), FINUFFT has optional parameters. These adjust the workings of the algorithm, change the output format, or print debug/timing text to stdout. Sensible default options are chosen, so that a new user need not worry about changing them. However, users wanting to increase speed or see more timing breakdowns will want to change options from their defaults. See each language doc page for how this is done; generally one creates an options structure, changes fields from their defaults, then passes it (or a pointer to it) to the simple, vectorized, or guru makeplan routines. Recall how to do this from C++:
// (... set up M,x,c,tol,N, and allocate F here...)
finufft_opts opts;
finufft_default_opts(&opts);
opts.debug = 1;
int ier = finufft1d1(M,x,c,+1,tol,N,F,&opts);
This setting produces more timing output to stdout.
Warning
In C/C++ and Fortran, don’t forget to call the command which sets default options (finufft_default_opts or finufftf_default_opts) before you start changing them and passing them to FINUFFT.
Summary and quick advice¶
Here is a 1-line summary of each option, taken from the code (the header include/finufft_opts.h):
// FINUFFT options:
// data handling opts...
int modeord; // (type 1,2 only): 0 CMCL-style increasing mode order
// 1 FFT-style mode order
int chkbnds; // [DEPRECATED] 0 don't check NU pts in [-3pi,3pi), 1 do (<few % slower)
// diagnostic opts...
int debug; // 0 silent, 1 some timing/debug, or 2 more
int spread_debug; // spreader: 0 silent, 1 some timing/debug, or 2 tonnes
int showwarn; // 0 don't print warnings to stderr, 1 do
// algorithm performance opts...
int nthreads; // number of threads to use, or 0 uses all available
int fftw; // plan flags to FFTW (FFTW_ESTIMATE=64, FFTW_MEASURE=0,...)
int spread_sort; // spreader: 0 don't sort, 1 do, or 2 heuristic choice
int spread_kerevalmeth; // spreader: 0 exp(sqrt()), 1 Horner piecewise poly (faster)
int spread_kerpad; // (exp(sqrt()) only): 0 don't pad kernel to 4n, 1 do
double upsampfac; // upsampling ratio sigma: 2.0 std, 1.25 small FFT, 0.0 auto
int spread_thread; // (vectorized ntr>1 only): 0 auto, 1 seq multithreaded,
// 2 parallel single-thread spread
int maxbatchsize; // (vectorized ntr>1 only): max transform batch, 0 auto
int spread_nthr_atomic; // if >=0, threads above which spreader OMP critical goes
// atomic
int spread_max_sp_size; // if >0, overrides spreader (dir=1) max subproblem size
Here are their default settings (from src/finufft_core.cpp:finufft_default_opts):
o->modeord = 0;
o->chkbnds = 1;
o->debug = 0;
o->spread_debug = 0;
o->showwarn = 1;
o->nthreads = 0;
#ifdef FINUFFT_USE_DUCC0
o->fftw = 0;
#else
o->fftw = FFTW_ESTIMATE;
#endif
o->spread_sort = 2;
o->spread_kerevalmeth = 1;
o->spread_kerpad = 1;
o->upsampfac = 0.0;
o->spread_thread = 0;
o->maxbatchsize = 0;
o->spread_nthr_atomic = -1;
o->spread_max_sp_size = 0;
o->fftw_lock_fun = nullptr;
o->fftw_unlock_fun = nullptr;
o->fftw_lock_data = nullptr;
As for quick advice, the main options you’ll want to play with are:
- upsampfac : to trade off between spread/interpolate vs FFT speed and RAM
- modeord : to flip (“fftshift”) the Fourier mode ordering
- debug : to look at timing output (to determine whether your problem is spread/interpolation dominated vs FFT dominated)
- nthreads : to run with a different number of threads than the current maximum available through OpenMP (a large number can sometimes be detrimental, and very small problems can sometimes run faster on 1 thread)
- fftw : to try slower FFTW plan modes which give faster transforms. The next natural one to try is FFTW_MEASURE (look at the FFTW3 docs)
See Troubleshooting for good advice on trying options, and read the full options descriptions below.
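For concreteness, here is a minimal sketch (reusing the setup variables M, x, c, tol, N, F from the earlier snippet; the particular values chosen are illustrative, not recommendations) that combines several of these options in one simple-interface call:
// illustrative sketch only: combine commonly-tuned options before a simple call
finufft_opts opts;
finufft_default_opts(&opts);
opts.debug = 1;             // timing breakdown: spreading vs FFT dominated?
opts.nthreads = 8;          // or 0 to use all threads available to OpenMP
opts.upsampfac = 1.25;      // compare against the default auto choice (0.0)
opts.fftw = FFTW_MEASURE;   // macro from <fftw3.h>; slower planning, possibly faster repeats
int ier = finufft1d1(M, x, c, +1, tol, N, F, &opts);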
Warning
Some of the options are for experts only; setting them inappropriately can result in slow or incorrect results. Please test options in a small known test case so that you understand their effect.
Documentation of all options¶
Data handling options¶
modeord: Fourier coefficient frequency index ordering in every dimension. For type 1 this applies to the output; for type 2, to the input. It has no effect in type 3. Here we use N to denote the size in any of the relevant dimensions:
- modeord=0 : frequency indices are in increasing order, namely \(\{-N/2,-N/2+1,\dots,N/2-1\}\) if \(N\) is even, or \(\{-(N-1)/2,\dots,(N-1)/2\}\) if \(N\) is odd. For example, if N=6 the indices are -3,-2,-1,0,1,2, whereas if N=7 they are -3,-2,-1,0,1,2,3. This is called “CMCL ordering” since it is that of the CMCL NUFFT.
- modeord=1 : frequency indices are ordered as in the usual FFT, increasing from zero then jumping to negative indices half way along, namely \(\{0,1,\dots,N/2-1,-N/2,-N/2+1,\dots,-1\}\) if \(N\) is even, or \(\{0,1,\dots,(N-1)/2,-(N-1)/2,\dots,-1\}\) if \(N\) is odd. For example, if N=6 the indices are 0,1,2,-3,-2,-1, whereas if N=7 they are 0,1,2,3,-3,-2,-1.
Note
The index sets are the same for the two modeord choices; their orderings differ only by a cyclic shift. The FFT ordering cyclically shifts the CMCL indices \(\mbox{floor}(N/2)\) to the left (often called an “fftshift”).
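To illustrate the Note above, here is a minimal sketch (not part of the FINUFFT API; the helper names cmcl_to_fft and fft_to_cmcl are hypothetical) showing how a 1D coefficient array of length N could be permuted between the two orderings by a cyclic shift:
#include <algorithm>
#include <complex>
#include <vector>
// CMCL (modeord=0) -> FFT (modeord=1): cyclic shift floor(N/2) to the left
void cmcl_to_fft(std::vector<std::complex<double>> &F) {
  std::rotate(F.begin(), F.begin() + F.size() / 2, F.end());
}
// FFT (modeord=1) -> CMCL (modeord=0): shift left by the remaining N - floor(N/2)
void fft_to_cmcl(std::vector<std::complex<double>> &F) {
  std::rotate(F.begin(), F.begin() + (F.size() - F.size() / 2), F.end());
}
For higher dimensions the same shift would be applied independently along each mode axis.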
chkbnds: [DEPRECATED] has no effect.
Diagnostic options¶
debug: Controls the amount of overall debug/timing output to stdout.
- debug=0 : silent
- debug=1 : prints some information
- debug=2 : prints more information
spread_debug: Controls the amount of debug/timing output from the spreader/interpolator.
- spread_debug=0 : silent
- spread_debug=1 : prints some timing information
- spread_debug=2 : prints lots. This can print thousands of lines since it includes one line per subproblem.
showwarn: Whether to print warnings (these go to stderr).
- showwarn=0 : suppresses such warnings
- showwarn=1 : prints warnings
Algorithm performance options¶
nthreads: (Ignored in single-threaded library builds.) If positive, sets the number of threads to use throughout a multithreaded build of the library; if 0, uses the maximum number of threads available according to OpenMP. In the positive case, no cap is placed on this number. This number of threads is passed to bin-sorting (which may choose to use fewer threads), but is adhered to in the FFTW and spreading/interpolation steps. This number of threads (or 1 for single-threaded builds) also controls the batch size for vectorized transforms (i.e. ntr>1 here).
For medium-to-large transforms, 0 is usually recommended. However, for (repeated) small transforms it can be advantageous to use a small number, even as small as 1.
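As a minimal sketch (same assumed setup as the earlier snippet), one might time repeated small transforms with a single thread versus the default:
finufft_opts opts;
finufft_default_opts(&opts);
opts.nthreads = 1;   // try 1 (or another small number) for repeated tiny problems; 0 = all threads
int ier = finufft1d1(M, x, c, +1, tol, N, F, &opts);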
fftw: FFTW planner flags. This number is simply passed to FFTW’s planner; the flags are documented in FFTW’s planner-flags documentation. A good first choice is FFTW_ESTIMATE; however, if you will be making multiple calls, consider FFTW_MEASURE, which can spend many seconds planning but gives a faster run-time when called again from the same process. These macros are bit-wise flags defined in /usr/include/fftw3.h on a Linux system; they currently have the values FFTW_ESTIMATE=64 and FFTW_MEASURE=0. Note that FFTW plans are saved automatically (by FFTW’s library) from call to call in the same executable (and, incidentally, in the same MATLAB/octave or python session); there is a small overhead for the lookup of such plans, which with many repeated small problems can motivate use of the guru interface.
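As a sketch of the guru-interface route just mentioned (reusing the assumed 1D type 1 setup M, x, c, tol, N, F from the earlier snippets; the repeat count nreps is hypothetical), the FFTW_MEASURE planning cost can be paid once and the plan reused:
#include <fftw3.h>                                       // for the FFTW_MEASURE macro
finufft_opts opts;
finufft_default_opts(&opts);
opts.fftw = FFTW_MEASURE;                                // thorough planning, done once in makeplan
int64_t Ns[3] = {N, 1, 1};                               // number of modes in each dimension
finufft_plan plan;
finufft_makeplan(1, 1, Ns, +1, 1, tol, &plan, &opts);    // type 1, 1D, 1 transform per execute
for (int r = 0; r < nreps; ++r) {                        // e.g. new nonuniform points each iteration
  finufft_setpts(plan, M, x, NULL, NULL, 0, NULL, NULL, NULL);
  finufft_execute(plan, c, F);
}
finufft_destroy(plan);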
spread_sort: Sorting mode within the spreader/interpolator.
- spread_sort=0 : never sorts
- spread_sort=1 : always sorts
- spread_sort=2 : uses a heuristic to decide whether to sort or not
The heuristic bakes in empirical findings such as: generally it is not worth sorting in 1D type 2 transforms, or when the number of nonuniform points is small. Feel free to try experimenting here; if you have highly structured nonuniform point ordering (such as that coming from polar-grid or propeller-type MRI k-points) it may be advantageous not to sort.
spread_kerevalmeth: Kernel evaluation method in the spreader/interpolator. This should not be changed from its default value, unless you are an expert wanting to compare against the outdated method.
- spread_kerevalmeth=0 : direct evaluation of exp(beta*sqrt(1-x*x)) in the ES kernel. This is outdated, and its only possible use is in exploring upsampling factors \(\sigma\) different from the standard ones (see below).
- spread_kerevalmeth=1 : use Horner’s rule applied to piecewise polynomials with precomputed coefficients. This is faster, less brittle to compiler/glibc/CPU variations, and is the recommended approach. It only works for the standard upsampling factors listed below.
spread_kerpad: whether to pad the number of direct kernel evaluations per dimension and per nonuniform point to a multiple of four; this can help SIMD vectorization. It only applies to the (outdated) spread_kerevalmeth=0 choice, so there is little reason for the nonexpert to mess with this option.
- spread_kerpad=0 : do not pad
- spread_kerpad=1 : pad to the next multiple of four
upsampfac: This is the internal factor by which the FFT (fine grid) is chosen larger than the number of requested modes in each dimension, for type 1 and 2 transforms. For type 3 transforms this factor is effectively squared, due to a type 2 transform being nested inside a type-1-style spreading operation, so it has even more influence. We have built efficient kernels for only the two settings below; otherwise, setting it to zero chooses a good heuristic:
- upsampfac=0.0 : use heuristics to choose upsampfac as one of the values below, and use that value internally. The value chosen is visible in the text output by setting debug>=2. This setting is recommended for basic users; however, if you seek more performance it is quick to try the other of the two values by hand.
- upsampfac=2.0 : standard upsampling. Due to kernel width restrictions, this is necessary if you need to exceed 9 digits of accuracy.
- upsampfac=1.25 : low-upsampling option, with lower RAM and smaller FFTs, but a wider spreading kernel. The latter can be much faster than the standard setting when the number of nonuniform points is similar to or smaller than the number of modes, and/or if low accuracy is required. It is especially (2 to 3 times) faster for type 3 transforms. However, the kernel widths \(w\) are about 50% larger in each dimension, which can lead to slower spreading (it can also be faster due to the smaller size of the fine grid). Because the kernel width is currently limited to 16, only 9-digit accuracy can be reached when using upsampfac=1.25.
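A minimal sketch (same assumed setup as before) of how one might compare these choices:
finufft_opts opts;
finufft_default_opts(&opts);
opts.upsampfac = 0.0;    // auto choice; with debug>=2 the chosen sigma and timings are printed
opts.debug = 2;
int ier = finufft1d1(M, x, c, +1, tol, N, F, &opts);
// ...then repeat with opts.upsampfac = 2.0 and 1.25 forced, and compare the reported timings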
spread_thread: in the case of multiple transforms per call (ntr>1, or the “many” interfaces), controls how multithreading is used to spread/interpolate each batch of data.
- spread_thread=0 : makes an automatic choice between the options below. Recommended.
- spread_thread=1 : acts on each vector in the batch in sequence, using multithreaded spread/interpolate on that vector. It can be slightly better than 2 for large problems.
- spread_thread=2 : acts on all vectors in a batch (of size typically chosen to be the number of threads) simultaneously, assigning each its own thread which performs a single-threaded spread/interpolate. It is much better than 1 for all but large problems. (Historical note: this was used by Melody Shih for the original “2dmany” interface in 2018.)
Note
A former option 3 has been removed. It was like 2 except that it allowed nested OMP parallelism, so that multithreaded spread/interpolate was used on each of the vectors in a batch, in parallel. It was used by Andrea Malleo in 2019. We have not yet found a case where it beats both 1 and 2, so it was removed due to the complications of changing the OMP nesting state in both old and new OMP versions.
maxbatchsize: in the case of multiple transforms per call (ntr>1, or the “many” interfaces), sets the largest batch size of data vectors. Here 0 makes an automatic choice. If you are unhappy with this, then for small problems it should equal the number of threads, while for large problems 1 is often better (since otherwise too much simultaneous RAM movement occurs). Some further work is needed to optimize this parameter.
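A minimal sketch (assuming the vectorized simple interface finufft1d1many, with ntr strength vectors stacked contiguously in c of total length ntr*M and output of length ntr*N in F; the values shown are illustrative, not recommendations):
finufft_opts opts;
finufft_default_opts(&opts);
opts.spread_thread = 2;   // single-threaded spread per vector, vectors in a batch handled in parallel
opts.maxbatchsize = 8;    // cap the batch size (0 = automatic choice)
int ier = finufft1d1many(ntr, M, x, c, +1, tol, N, F, &opts);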
spread_nthr_atomic: if non-negative: for numbers of threads up to this value, an OMP critical block for add_wrapped_subgrid is used in spreading (type 1 transforms). Above this value, OMP atomic writes are used instead, which scale better for large thread numbers. If negative, the heuristic default in the spreader is used, set in src/spreadinterp.cpp:setup_spreader().
spread_max_sp_size: if positive, overrides the maximum subproblem (chunking) size for multithreaded spreading (type 1 transforms). Otherwise the default in the spreader is used, set in src/spreadinterp.cpp:setup_spreader(), which we believe is a decent heuristic for Intel i7 and Xeon machines.
Thread safety options (advanced)¶
By default, with FFTW as the FFT library, FINUFFT is thread safe so long as no other threads are calling FFTW plan creation/destruction routines independently of FINUFFT. If those FFTW routines are called outside of FINUFFT, the program is liable to crash. In most cases, the calling program can simply call the FFTW routine fftw_make_planner_thread_safe() before threading out, and thread safety will be maintained. However, in instances where this is less desirable, we provide a way to supply your own FFTW locking mechanism. The following example code should exercise FFTW thread safety, and can be built with c++ thread_test.cpp -o thread_test -lfinufft -lfftw3_threads -lfftw3 -fopenmp -std=c++11, assuming the finufft include and library paths are set.
// thread_test.cpp
#include <vector>
#include <mutex>
#include <complex>
#include <fftw3.h>
#include <finufft.h>
#include <omp.h>
using namespace std;

constexpr int N = 65384;

void locker(void *lck) { reinterpret_cast<recursive_mutex *>(lck)->lock(); }
void unlocker(void *lck) { reinterpret_cast<recursive_mutex *>(lck)->unlock(); }

int main() {
  int64_t Ns[3]; // guru describes mode array by vector [N1,N2..]
  Ns[0] = N;
  recursive_mutex lck;

  finufft_opts opts;
  finufft_default_opts(&opts);
  opts.nthreads = 1;
  opts.debug = 0;
  opts.fftw_lock_fun = locker;
  opts.fftw_unlock_fun = unlocker;
  opts.fftw_lock_data = reinterpret_cast<void *>(&lck);

  // complex input data for FFTW (contents unimportant; plan
  // creation/destruction is what exercises thread safety here)
  vector<complex<double>> c(N);

  // init FFTW threads
  fftw_init_threads();

  // FFTW and FINUFFT execution using OpenMP parallelization
#pragma omp parallel for
  for (int j = 0; j < 100; ++j) {
    // allocate output array for FFTW...
    vector<complex<double>> F1(N);

    // FFTW plan creation/destruction, protected by the same lock FINUFFT uses
    lck.lock();
    fftw_plan_with_nthreads(1);
    fftw_plan plan = fftw_plan_dft_1d(N, reinterpret_cast<fftw_complex *>(c.data()),
                                      reinterpret_cast<fftw_complex *>(F1.data()),
                                      FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_destroy_plan(plan);
    lck.unlock();

    // FINUFFT plan creation/destruction; FINUFFT invokes the lock callbacks
    // around its internal FFTW planner calls
    finufft_plan nufftplan;
    finufft_makeplan(1, 1, Ns, 1, 1, 1e-6, &nufftplan, &opts);
    finufft_destroy(nufftplan);
  }
  return 0;
}
fftw_lock_fun: void (*fun)(void *). C-style callback function used to lock calls to FFTW plan manipulation routines. A nullptr or 0 value is ignored. If non-null, fftw_unlock_fun must also be set.
fftw_unlock_fun: void (*fun)(void *). C-style callback function used to unlock calls to FFTW plan manipulation routines. A nullptr or 0 value is ignored. If non-null, fftw_lock_fun must also be set.
fftw_lock_data: void *data pointer, typically to the lock object itself. It is passed to fftw_lock_fun and fftw_unlock_fun if they are set.