Options parameters (CPU)¶

Aside from the mandatory inputs (dimension, type, nonuniform points, strengths or coefficients, and, in C++/C/Fortran/MATLAB, sign of the imaginary unit and tolerance) FINUFFT has optional parameters. These adjust the workings of the algorithm, change the output format, or provide debug/timing text to stdout. Sensible default options are chosen, so that the new user need not worry about changing them. However, users wanting to try to increase speed or see more timing breakdowns will want to change options from their defaults. See each language doc page for how this is done, but is generally by creating an options structure, changing fields from their defaults, then passing this (or a pointer to it) to the simple, vectorized, or guru makeplan routines. Recall how to do this from C++:

// (... set up M,x,c,tol,N, and allocate F here...)
finufft_opts opts;
finufft_default_opts(&opts);
opts.debug = 1;
int ier = finufft1d1(M,x,c,+1,tol,N,F,&opts);

This setting produces more timing output to stdout.

Warning

In C/C++ and Fortran, don’t forget to call the command which sets default options (finufft_default_opts or finufftf_default_opts) before you start changing them and passing them to FINUFFT.

Summary and quick advice¶

Here is a 1-line summary of each option, taken from the code (the header include/finufft_opts.h):

  // FINUFFT options:
  // data handling opts...
  int modeord;          // (type 1,2 only): 0 CMCL-style increasing mode order
                        //                  1 FFT-style mode order
  int spreadinterponly; // (type 1,2 only): 0 do actual NUFFT
                        // 1 only spread (if type 1) or interpolate (type 2)

  // diagnostic opts...
  int debug;        // 0 silent, 1 some timing/debug, or 2 more
  int spread_debug; // spreader: 0 silent, 1 some timing/debug, or 2 tonnes
  int showwarn;     // 0 don't print warnings to stderr, 1 do

  // algorithm performance opts...
  int nthreads;           // number of threads to use, or 0 uses all available
  int fftw;               // plan flags to FFTW (FFTW_ESTIMATE=64, FFTW_MEASURE=0,...)
  int spread_sort;        // spreader: 0 don't sort, 1 do, or 2 heuristic choice
  int spread_kerevalmeth; // deprecated, retained for ABI; Horner is always used
  int spread_kerpad;      // deprecated, retained for ABI; padding has no effect
  double upsampfac;       // upsampling ratio sigma: 2.0 std, 1.25 small FFT, 0.0 auto
  int spread_thread;      // (vectorized ntr>1 only): 0 auto, 1 seq multithreaded,
                          //                          2 parallel single-thread spread
  int maxbatchsize;       // (vectorized ntr>1 only): max transform batch, 0 auto
  int spread_nthr_atomic; // if >=0, threads above which spreader OMP critical goes
                          // atomic
  int spread_max_sp_size; // if >0, overrides spreader (dir=1) max subproblem size
  int spread_kerformula;  // kernel function formula: 0 default, [>0 devs/debug only]
                          // Non-zero values are unsupported and behavior can change

Here are their default settings (from src/finufft_core.cpp:finufft_default_opts):

  o->modeord          = 0;
  o->spreadinterponly = 0;

  o->debug        = 0;
  o->spread_debug = 0;
  o->showwarn     = 1;

  o->nthreads = 0;
#ifdef FINUFFT_USE_DUCC0
  o->fftw = -1;
#else
  o->fftw = FFTW_ESTIMATE;
#endif
  o->spread_sort        = 2;
  o->spread_kerevalmeth = 1; // deprecated
  o->spread_kerpad      = 1; // deprecated
  o->upsampfac          = 0.0;
  o->spread_thread      = 0;
  o->maxbatchsize       = 0;
  o->spread_nthr_atomic = -1;
  o->spread_max_sp_size = 0;
  o->spread_kerformula  = 0;
  o->fftw_lock_fun      = nullptr;
  o->fftw_unlock_fun    = nullptr;
  o->fftw_lock_data     = nullptr;

As for quick advice, the main options you’ll want to play with are:

upsampfac to trade-off between spread/interpolate vs FFT speed and RAM
modeord to flip (“fftshift”) the Fourier mode ordering
debug to look at timing output (to determine if your problem is spread/interpolation dominated, vs FFT dominated)
nthreads to run with a different number of threads than the current maximum available through OpenMP (a large number can sometimes be detrimental, and very small problems can sometimes run faster on 1 thread)
fftw to try slower FFTW plan modes which give faster transforms. The next natural one to try is FFTW_MEASURE (look at the FFTW3 docs)

See Troubleshooting for good advice on trying options, and read the full options descriptions below.

Warning

Some of the options are for experts only, and will result in slow or incorrect results. Please test options in a small known test case so that you understand their effect.

Documentation of all options¶

Data handling options¶

modeord: Fourier coefficient frequency index ordering in every dimension. For type 1, this is for the output; for type 2 the input. It has no effect in type 3. Here we use N to denote the size in any of the relevant dimensions:

if modeord=0: frequency indices are in increasing ordering, namely \(\{-N/2,-N/2+1,\dots,N/2-1\}\) if \(N\) is even, or \(\{-(N-1)/2,\dots,(N-1)/2\}\) if \(N\) is odd. For example, if N=6 the indices are -3,-2,-1,0,1,2, whereas if N=7 they are -3,-2,-1,0,1,2,3. This is called “CMCL ordering” since it is that of the CMCL NUFFT.
if modeord=1: frequency indices are ordered as in the usual FFT, increasing from zero then jumping to negative indices half way along, namely \(\{0,1,\dots,N/2-1,-N/2,-N/2+1,\dots,-1\}\) if \(N\) is even, or \(\{0,1,\dots,(N-1)/2,-(N-1)/2,\dots,-1\}\) if \(N\) is odd. For example, if N=6 the indices are 0,1,2,-3,-2,-1, whereas if N=7 they are 0,1,2,3,-3,-2,-1.

Note

The index sets are the same in the two modeord choices; their ordering differs only by a cyclic shift. The FFT ordering cyclically shifts the CMCL indices \(\mbox{floor}(N/2)\) to the left (often called an “fftshift”).

spreadinterponly: [only has effect for type 1 or 2.] Controls whether a NUFFT is performed, or only spreading or interpolation. For experts only.

If 0 do the NUFFT as intended.
If 1, omit the FFT and deconvolution (diagonal division by kernel Fourier transform) steps, thus returning garbage answers as a NUFFT, but allowing experts to perform solely spreading (if type 1) or solely interpolation (if type 2) via the FINUFFT API. The spreading is onto the grid of the user-given size (N1 in x, N2 in y, etc), with grid points located at coordinates \(\{-\pi, -\pi+h, \dots, \pi-h\}\) in each dimension, where \(h = 2\pi/N\) is the spacing for that dimension (\(N\) here meaning N1, etc). Interpolation is from that same grid. The kernel (width and shape parameter) is determined by tol and opts.upsampfac, just as it would be in an actual NUFFT. Note that the upsampling factor here only controls the kernel; the grid size never differs from N1, etc. The kernel is not directly accessible, leaving the user to figure out how to make use of this interface to extract the actual kernel function. This provides a convenient interface to our spreadinterp module (including looping over multiple vectors, if ntransf>1).

Note

The known use-case of spreadinterponly=1 is estimating so-called density compensation weights, conventionally used in MRI (see MRI-NUFFT). It may also be useful in spectral Ewald or other scientific applications.

Diagnostic options¶

debug: Controls the amount of overall debug/timing output to stdout.

debug=0 : silent
debug=1 : prints some information
debug=2 : prints more information

spread_debug: Controls the amount of debug/timing output from the spreader/interpolator.

spread_debug=0 : silent
- spread_debug=1 : prints some timing information
- spread_debug=2 : prints lots. This can print thousands of lines since it includes one line per subproblem.

showwarn: Whether to print warnings (these go to stderr).

showwarn=0 : suppresses such warnings
showwarn=1 : prints warnings

Algorithm performance options¶

nthreads: (Ignored in single-threaded library builds.) If positive, sets the number of threads to use throughout (multi-threaded build of) library, or if 0 uses the maximum number of threads available according to OpenMP. In the positive case, no cap is placed on this number. This number of threads is passed to bin-sorting (which may choose to use less threads), but is adhered to in FFTW and spreading/interpolation steps. This number of threads (or 1 for single-threaded builds) also controls the batch size for vectorized transforms (ie ntr>1 here). For medium-to-large transforms, 0 is usually recommended. However, for (repeated) small transforms it can be advantageous to use a small number, even as small as 1.

fftw: FFTW planner flags. This number is simply passed to FFTW’s planner; the flags are documented here. A good first choice is FFTW_ESTIMATE; however if you will be making multiple calls, consider FFTW_MEASURE, which could spend many seconds planning, but will give a faster run-time when called again from the same process. These macros are bit-wise flags defined in /usr/include/fftw3.h on a linux system; they currently have the values FFTW_ESTIMATE=64 and FFTW_MEASURE=0. Note that FFTW plans are saved (by FFTW’s library) automatically from call to call in the same executable (incidentally, also in the same MATLAB/octave or python session); there is a small overhead for lookup of such plans, which with many repeated small problems can motivate use of the guru interface.

spread_sort: Sorting mode within the spreader/interpolator.

spread_sort=0 : never sorts
spread_sort=1 : always sorts
spread_sort=2 : uses a heuristic to decide whether to sort or not.

The heuristic bakes in empirical findings such as: generally it is not worth sorting in 1D type 2 transforms, or when the number of nonuniform points is small. Feel free to try experimenting here; if you have highly-structured nonuniform point ordering (such as coming from polar-grid or propeller-type MRI k-points) it may be advantageous not to sort.

upsampfac: This is the internal factor \(\sigma\) by which the FFT (fine grid) is chosen larger than the number of requested modes in each dimension, for type 1 and 2 transforms. For type 3 transforms this factor gets squared, due to type 2 nested in a type-1-spreading operation, so has even more influence. As of v2.5.0, due to on-the-fly polynomial coefficient fitting, the kernel is equally efficient for an arbitrary upsampling factor greater than 1, but the useful range is around 1.2 up to 3.0.

upsampfac=0.0 : use heuristics to choose a good upsampfac based on the problem.

The value chosen is visible in the text output via setting debug>=1. This default setting is recommended for most users; however, if you seek more performance you may want to set if

upsampfac>1.0 : fix the upsampling factor, overriding the heuristic choice. A standard setting is 2 (which is good for achieving 9-digit or more accuracy), while a typical “low” setting is 1.25 (this reduces the RAM and FFT costs, and is good for up to 5-digit accuracy, unless the density M/N is high enough that its 50% wider spreading kernel would be counterproductive). Low upsampfac is especially efficient for type 3 transforms. Because the kernel width is limited to 16, only 9-digit accuracy can be reached when using upsampfac=1.25, for instance.

spread_thread: in the case of multiple transforms per call (ntr>1, or the “many” interfaces), controls how multithreading is used to spread/interpolate each batch of data.

spread_thread=0 : makes an automatic choice between the below. Recommended.
spread_thread=1 : acts on each vector in the batch in sequence, using multithreaded spread/interpolate on that vector. It can be slightly better than 2 for large problems.
spread_thread=2 : acts on all vectors in a batch (of size chosen typically to be the number of threads) simultaneously, assigning each a thread which performs a single-threaded spread/interpolate. It is much better than 1 for all but large problems. (Historical note: this was used by Melody Shih for the original “2dmany” interface in 2018.)

Note

Historical note: A former option 3 has been removed. This was like 2 except allowing nested OMP parallelism, so multi-threaded spread-interpolate was used for each of the vectors in a batch in parallel. This was used by Andrea Malleo in 2019. We have not yet found a case where this beats both 1 and 2, hence removed it due to complications with changing the OMP nesting state in both old and new OMP versions.

maxbatchsize: in the case of multiple transforms per call (ntr>1, or the “many” interfaces), set the largest batch size of data vectors. Here 0 makes an automatic choice. If you are unhappy with this, then for small problems it should equal the number of threads, while for large problems it appears that 1 often better (since otherwise too much simultaneous RAM movement occurs). Some further work is needed to optimize this parameter.

spread_nthr_atomic: if non-negative: for numbers of threads up to this value, an OMP critical block for add_wrapped_subgrid is used in spreading (type 1 transforms). Above this value, instead OMP atomic writes are used, which scale better for large thread numbers. If negative, the heuristic default in the spreader is used, set in src/spreadinterp.cpp:setup_spreader().

spread_max_sp_size: if positive, overrides the maximum subproblem (chunking) size for multithreaded spreading (type 1 transforms). Otherwise the default in the spreader is used, set in src/spreadinterp.cpp:setup_spreader(), which we believe is a decent heuristic for Intel i7 and xeon machines.

spread_kerformula: 0 uses default spreading (gridding) kernel with default shape choice, whereas positive integers select among various kernels and shape parameter choices. In particular 1 returns to the “legacy ES” choices used from the first 2017 code to v2.4.1 (2025). Only developers should mess with this parameter; users should leave it at default.

spread_kerevalmeth: [DEPRECATED] Kernel evaluation method in spreader/interpolator; retained only for API compatibility and documentation. The library now always uses the Horner piecewise-polynomial evaluation internally (the historical =1 choice). Setting this field has no effect.

spread_kerpad: [DEPRECATED] This option historically controlled padding to help SIMD vectorization for the removed direct-evaluation method. It is ignored by the library.

Thread safety options (advanced)¶

With DUCC0 as the FFT, there are no thread safety issues. However, with FFTW as the FFT library, FINUFFT is thread safe so long as no other threads are calling FFTW plan creation/destruction routines independently of FINUFFT. If these FFTW routines are called outside of FINUFFT, then the program is liable to crash. In most cases, the calling program can simply call the FFTW routine fftw_make_planner_thread_safe() before threading out and thread safety will be maintained. However, in instances where this is less desirable, we provide a means to provide your own FFTW locking mechanism. The following example code should exercise FFTW thread safety, and can be built with c++ thread_test.cpp -o thread_test -lfinufft -lfftw3_threads -lfftw3 -fopenmp -std=c++11, assuming the finufft include and library paths are set.

// thread_test.cpp
#include <vector>
#include <mutex>
#include <complex>

#include <fftw3.h>
#include <finufft.h>
#include <omp.h>

using namespace std;

constexpr int N = 65384;

void locker(void *lck) { reinterpret_cast<recursive_mutex *>(lck)->lock(); }
void unlocker(void *lck) { reinterpret_cast<recursive_mutex *>(lck)->unlock(); }

int main() {
  int64_t Ns[3]; // guru describes mode array by vector [N1,N2..]
  Ns[0] = N;
  recursive_mutex lck;

  finufft_opts opts;
  finufft_default_opts(&opts);
  opts.nthreads = 1;
  opts.debug = 0;
  opts.fftw_lock_fun = locker;
  opts.fftw_unlock_fun = unlocker;
  opts.fftw_lock_data = reinterpret_cast<void *>(&lck);

  // random nonuniform points (x) and complex strengths (c)
  vector<complex<double>> c(N);

  // init FFTW threads
  fftw_init_threads();

  // FFTW and FINUFFT execution using OpenMP parallelization
  #pragma omp parallel for
  for (int j = 0; j < 100; ++j) {
    // allocate output array for FFTW...
    vector<complex<double>> F1(N);

    // FFTW plan
    lck.lock();
    fftw_plan_with_nthreads(1);
    fftw_plan plan = fftw_plan_dft_1d(N, reinterpret_cast<fftw_complex*>(c.data()),
                                      reinterpret_cast<fftw_complex*>(F1.data()),
                                      FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_destroy_plan(plan);
    lck.unlock();

    // FINUFFT plan
    finufft_plan nufftplan;
    finufft_makeplan(1, 1, Ns, 1, 1, 1e-6, &nufftplan, &opts);
    finufft_destroy(nufftplan);
  }

  return 0;
}

fftw_lock_fun: void (fun*)(void *) C-style callback function to lock calls to FFTW plan manipulation routines. A nullptr or 0 value will be ignored. If non-null, fftw_unlock_fun must also be set.

fftw_unlock_fun: void (fun*)(void *) C-style callback function to unlock calls to FFTW plan manipulation routines. A nullptr or 0 value will be ignored. If non-null, fftw_lock_fun must also be set.

fftw_lock_data: void *data pointer, typically to the lock object itself. Pointer will be passed to fftw_lock_fun and fftw_unlock_fun if they are set.

Options parameters (CPU)¶

Summary and quick advice¶

Documentation of all options¶

Data handling options¶

Diagnostic options¶

Algorithm performance options¶

Thread safety options (advanced)¶

Table of Contents

Previous topic

Next topic

This Page