Troubleshooting

If you are having issues (segfaults, slowness, “wrong” answers, etc.), there is a high probability it is something we already know about, so please first read all of the advice below in the section relevant to your problem: math, speed, or crashing.

Mathematical “issues” and advice

  • When requested tolerance is around \(10^{-14}\) or less in double-precision, or \(10^{-6}\) or less in single-precision, it will most likely be impossible for FINUFFT (or any other NUFFT library) to achieve this, due to inevitable round-off error. Here, “error” is to be understood relative to the norm of the returned vector of values. This is especially true when there is a large number of modes in any single dimension (\(N_1\), \(N_2\) or \(N_3\)), since this empirically scales the round-off error (fortunately, round-off does not appear to scale with the total \(N\) or \(M\)). Such round-off error is analysed and measured in Section 4.2 of our SISC paper.

  • If you request a tolerance that FINUFFT knows it cannot achieve, it will return ier=1 after performing transforms as accurately as it can. However, the status ier=0 does not imply that the requested accuracy was achieved, merely that parameters were chosen to give this estimated accuracy, if possible. As our SISC paper shows, for typical situations the relative \(\ell_2\) errors match the requested tolerances over a wide range. Users should always check convergence (for instance by varying tol and measuring any changes in their results; see the code sketch at the end of this section); this is generally true in scientific computing.

  • On the above topic: while the tolerance is usually a good upper bound for the actual accuracy, strangely, in single precision requesting a tolerance of \(10^{-7}\) or \(10^{-6}\) can give slightly worse accuracy than \(10^{-5}\) or \(10^{-4}\). We are looking into this. It is usually best to request a tolerance at least 2–3 digits above the respective machine precision, for either single or double precision. I.e., do not request tolerances below \(10^{-5}\) in single precision or \(10^{-13}\) in double precision.

  • The type 1 and type 2 transforms are adjoints but not inverses of each other (unlike in the plain FFT case, where, up to a constant factor \(N\), the adjoint is the inverse). Therefore, if you are not getting the expected answers, please check that you have not made this assumption; the matrix form just below makes the distinction concrete. In the tutorials we will add examples showing how to invert the NUFFT; also see NFFT3 inverse transforms.
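
To make the adjoint/inverse distinction concrete, here is the 1D matrix form (a sketch consistent with the definitions in our documentation; the signs correspond to opposite iflag choices):

    \[ f_k = \sum_{j=1}^{M} c_j e^{i k x_j} \ \text{(type 1, applies } A), \qquad \tilde{c}_j = \sum_{k=-N/2}^{N/2-1} f_k e^{-i k x_j} \ \text{(type 2, applies } A^*), \]

where \(A\) is the \(N \times M\) matrix with entries \(A_{kj} = e^{i k x_j}\). For the uniform grid \(x_j = 2\pi j/N\) with \(M = N\), one has \(A^* A = N I\), so the adjoint inverts up to the factor \(N\); for nonuniform points \(A^* A \ne N I\) in general, so applying type 2 after type 1 does not return \(N\) times the original coefficients.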
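
As suggested above, a simple self-convergence check is to repeat a transform at a tighter tolerance and measure the relative change. Here is a minimal C++ sketch, assuming the double-precision simple interface finufft1d1 (NULL requests default options; the sizes are arbitrary):

    // selfconv.cpp: estimate the actual accuracy of a 1D type 1 transform
    // by comparing against a run at a much tighter tolerance.
    // Build (library installed): g++ selfconv.cpp -o selfconv -lfinufft
    #include <finufft.h>
    #include <cmath>
    #include <complex>
    #include <cstdio>
    #include <random>
    #include <vector>
    using namespace std;

    int main() {
      int64_t M = 100000, N = 10000;            // nonuniform pts, modes
      mt19937 rng(0);
      uniform_real_distribution<double> u(-M_PI, M_PI);
      vector<double> x(M);
      vector<complex<double>> c(M), F1(N), F2(N);
      for (int64_t j = 0; j < M; ++j) {
        x[j] = u(rng);                          // points in [-pi, pi)
        c[j] = complex<double>(u(rng), u(rng)); // arbitrary strengths
      }
      finufft1d1(M, x.data(), c.data(), +1, 1e-6, N, F1.data(), NULL);
      finufft1d1(M, x.data(), c.data(), +1, 1e-9, N, F2.data(), NULL);
      double num = 0.0, den = 0.0;              // rel l2 diff ||F1-F2||/||F2||
      for (int64_t k = 0; k < N; ++k) {
        num += norm(F1[k] - F2[k]);             // std::norm(z) = |z|^2
        den += norm(F2[k]);
      }
      printf("rel l2 change, tol 1e-6 vs 1e-9: %.3g\n", sqrt(num / den));
      return 0;
    }

If this measured change is much larger than the requested tolerance, revisit the round-off advice above.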

Speed issues and advice

CPU library speed

If FINUFFT is slow (eg, less than \(10^6\) to \(10^7\) nonuniform points per second, depending on application), here is some advice:

  • Try printing debug output to see step-by-step timings: set opts.debug to 1 or 2, then look at the timing information sent to stdout, as in the snippet below.
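
    For example, in C++ (a sketch; the same field exists in the other language interfaces, and the options struct must be initialized first):

    finufft_opts opts;
    finufft_default_opts(&opts);   // set defaults before overriding fields
    opts.debug = 1;                // 1: stage timings; 2: more detail
    // ... then pass &opts to your finufft call as usual ...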

  • Check that our test codes give similar speeds to what you observe, for a similar problem size. For example, if your application uses a 2D type 1 transform from a million nonuniform points to 500-by-500 modes, at 5-digit accuracy, using 8 threads, then build the tests and run:

    OMP_NUM_THREADS=8 test/finufft2d_test 500 500 1e6 1e-5
    

    which will give something like (on a laptop):

    test 2d type 1:
        1000000 NU pts to (500,500) modes in 0.0403 s   2.48e+07 NU pts/s
        one mode: rel err in F[185,130] is 4.38e-07
    test 2d type 2:
        (500,500) modes to 1000000 NU pts in 0.0274 s   3.65e+07 NU pts/s
        one targ: rel err in c[500000] is 6.1e-07
    test 2d type 3:
        1000000 NU to 250000 NU in 0.0626 s             2e+07 tot NU pts/s
        one targ: rel err in F[125000] is 2.76e-06
    

    Extract the relevant transform type (all three types are included), and compare its timing and throughput to your own. Usually the fact that these tests use random NU point distributions does not affect the speed that much compared to typical applications. If you instead use the vectorized (“many”) interface for a stack of, say, 50 such transforms, use:

    OMP_NUM_THREADS=8 test/finufft2dmany_test 50 500 500 1e6 1e-5
    

    which compares the stack of transforms to the same transforms performed individually. For single precision tests, append f to the executable name in both of the above examples. The command line options for each tester can be seen by executing without any options.

  • Try reducing the number of threads, either via the OMP_NUM_THREADS environment variable or via opts.nthreads (see the snippet below), perhaps down to 1 thread, to make sure you are not suffering from thread collisions or thread overheads. Hyperthreading (more threads than physical cores) rarely helps much. Thread collisions are possible when large problems are run with a large number of threads (say, more than 64). Another cause of slowness is very many repetitions of small problems: see test/manysmallprobs, which exceeds \(10^7\) points/sec with one thread via the guru interface, but can become drastically slower with many threads; see https://github.com/flatironinstitute/finufft/issues/86
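
    For instance, in C++ (a sketch; here opts has been initialized with finufft_default_opts as shown earlier):

    opts.nthreads = 1;   // force FINUFFT to use a single thread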

  • Try setting a crude tolerance, eg tol=1e-3. How many digits do you actually need? This has a big effect in higher dimensions, since the number of flops scales like \((\log 1/\epsilon)^d\), but not quite as big an effect as this scaling would suggest, because in higher dimensions the flops/RAM ratio is higher.

  • If type 3, make sure your choice of points does not have a massive space-bandwidth product (ie, product of the volumes of the smallest \(d\)-dimension axes-aligned cuboids enclosing the nonuniform source and the target points); see Remark 5 of our SISC paper. In short, if the spreads of \(\mathbf{x}_j\) and of \(\mathbf{s}_k\) are both big, you may be in trouble. This can lead to enormous fine grids and hence slow FFTs. Set opts.debug=1 to examine the nf1, etc, fine grid sizes being chosen, and the array allocation sizes. If they are huge, consider direct summation, as discussed here.

  • The timing of the first FFTW call is complicated, depending on the FFTW planner flags (plan mode) used; this is an FFTW planner-flag issue rather than a FINUFFT one. Such issues are known, and the modes are benchmarked in other documentation, eg for 2D in poppy. In short, more expensive FFTW planning modes such as FFTW_MEASURE can give better performance for repeated FFTW calls, but are much more expensive in the first (planning) call. This is why we chose FFTW_ESTIMATE as the default for opts.fftw; you can override it as in the sketch below.
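
    For example (a sketch; the FFTW_* planner-flag macros come from fftw3.h, and opts has been initialized as shown earlier):

    #include <fftw3.h>             // for FFTW_MEASURE etc
    // ...
    opts.fftw = FFTW_MEASURE;      // slower first plan, faster repeated FFTs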

  • Check that you are not using too much RAM, hence swapping to hard disk or SSD. The multithreaded type-1 spreader can use up to another fine grid’s worth of storage in the form of subgrids. If RAM is too large, try overriding opts.spread_max_sp_size to a nonzero value smaller than the default value set in src/spreadinterp.cpp:setup_spreader(), to reduce RAM. However, note that this may slow it down, because we have built in a decent heuristic for the default.

  • Make sure you did not override opts.spread_sort, which if set to zero does no sorting, which can give very slow RAM access if the nonuniform points are ordered poorly (eg randomly) in larger 2D or 3D problems.

  • Are you calling the simple interface a huge number of times for small problems, but these tasks have something in common (number of modes, or locations of nonuniform points)? If so, try the vectorized or guru interfaces, which remove the overheads of repeated FFTW plan look-ups and bin-sorting; they can be 10-100x faster. See the guru sketch below.
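
    A compact sketch of the guru pattern in C++ (hypothetical small sizes; the same plan is reused across tasks that share the mode count, and error checking is omitted):

    #include <finufft.h>
    #include <complex>
    #include <vector>
    using namespace std;

    int main() {
      int64_t M = 1000, N = 256;                // one small repeated problem
      int ntasks = 10000;                       // number of such transforms
      vector<double> x(M);                      // points (fill per task)
      vector<complex<double>> c(M), F(N);       // strengths in, modes out
      finufft_plan plan;
      int64_t Ns[3] = {N, 1, 1};                // modes per dim (1D type 1)
      finufft_makeplan(1, 1, Ns, +1, 1, 1e-6, &plan, NULL);
      for (int t = 0; t < ntasks; ++t) {
        // ... fill x and c with this task's points and strengths ...
        finufft_setpts(plan, M, x.data(), NULL, NULL, 0, NULL, NULL, NULL);
        finufft_execute(plan, c.data(), F.data());
        // ... use F ...
      }
      finufft_destroy(plan);
      return 0;
    }

    The FFT plan and sorting workspace are created once rather than ntasks times; if the points are also shared between tasks, move finufft_setpts out of the loop, and use ntrans>1 in finufft_makeplan to batch transforms.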

GPU library speed

If cuFINUFFT is slow (eg, less than \(10^8\) nonuniform points per second), here is some advice:

  • Run our test codes with a similar problem size on your hardware. Build the tests, then, for example (matching the vectorized CPU example above):

    test/cuda/cufinufft2dmany_test 1 1 500 500 50 0 1000000 1e-5 1e-4 f
    

    which gives (on my A6000) the output:

    #modes = 250000, #inputs = 50, #NUpts = 1000000
    [time  ] dummy warmup call to CUFFT  0.00184 s
    [time  ] cufinufft plan:             0.000624 s
    [time  ] cufinufft setNUpts:         0.000431 s
    [time  ] cufinufft exec:             0.0839 s
    [time  ] cufinufft destroy:          0.00194 s
    [gpu   ] 49th data one mode: rel err in F[185,130] is 2.61e-05
    [totaltime] 8.69e+04 us, speed 5.76e+08 NUpts/s
                                        (exec-only throughput: 5.96e+08 NU pts/s)
    

    Check if your time is dominated by the plan stage, and if so, try to reuse your plan (often one has repeated transforms with sizes or points in common). Sometimes the CUFFT warm-up call can take as long as 0.2 seconds; make sure you do such a call (or a dummy transform) before your timed usage occurs. See https://github.com/flatironinstitute/finufft/issues/385 for an example of this discovery process. The command line options for each tester can be seen by executing without any options. Note that 1e6 for the GPU testers is not interpreted as \(10^6\), unlike in the CPU testers.

  • Try the different method types, starting with method=1. For instance, for type 1 transforms, method 2 (SM in the paper) is supposed to be faster than method 1 (GM-sort in the paper), but on the above test it is only 2% faster. In the test call, the first argument sets the method and the second the transform type.

  • There is not currently a debug option for cufinufft, so timing a test problem on your hardware, as above, is a good substitute. You could also place timers around the various cufinufft calls in your own code, just as in our test codes; see the sketch below.
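
    For example, standard CUDA events can bracket any stage you care about (a sketch; the commented line stands in for your own cufinufft call):

    #include <cuda_runtime.h>
    #include <cstdio>
    // ...
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    // ... e.g. the cufinufft execute call you want to time ...
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);          // wait until the GPU work finishes
    float ms = 0;
    cudaEventElapsedTime(&ms, t0, t1); // elapsed time in milliseconds
    printf("stage took %.3f ms\n", ms);
    cudaEventDestroy(t0); cudaEventDestroy(t1);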

Crash (segfault) issues and advice

  • The most common problem is passing in pointers to the wrong size of object, eg single vs double precision, or int32 vs int64. The library includes both precisions, so make sure you are calling the correct one (commands begin finufft for double, finufftf for single); see the sketch below.
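
    The two precisions side by side (a sketch with hypothetical data arrays; the argument types must match the function name):

    // double precision: double* points, complex<double>* strengths/modes
    ier = finufft1d1(M, x, c, +1, tol, N, F, NULL);
    // single precision (note the extra 'f'): float* and complex<float>*
    ier = finufftf1d1(M, xf, cf, +1, tolf, N, Ff, NULL);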

  • If you use C++/C/Fortran and tried to change options, did you forget to call finufft_default_opts first? (See the pattern below.)
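
    The required pattern in C/C++ (a sketch; setting fields of an uninitialized struct gives undefined behavior):

    finufft_opts opts;
    finufft_default_opts(&opts);   // must come first
    opts.upsampfac = 1.25;         // ...then override what you need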

  • Maybe you have switched off nonuniform point bounds checking (opts.chkbnds=0) for a little extra speed? Try switching it on again to catch illegal coordinates.

  • Thread-safety: are you calling FINUFFT from inside a multithreaded block of code without setting opts.nthreads=1? If gdb indicates crashes during FFTW calls, this is another sign.

  • To isolate where a crash is occurring, set opts.debug to 1 or 2, and check the text output of the various stages. Note that with a debug setting of 2 or above, a large amount of text can be generated when ntrans>1.

  • To diagnose problems with the spread/interpolation stage, similarly setting opts.spread_debug to 1 or 2 will print even more output. Here the setting 2 generates a large amount of output even for a single transform.

  • For the GPU code, did you run out of GPU memory? Keep track of this with nvidia-smi.

Other known issues with library or interfaces

The master list of known issues is the project's GitHub issues page, https://github.com/flatironinstitute/finufft/issues.

A secondary and more speculative list is in the TODO text file.

Please look through those issue topics, since sometimes workarounds are discussed before the problem is fixed in a release.

Bug reports

If you think you have found a new bug, and have read the above, please file a new issue on the github project page, https://github.com/flatironinstitute/finufft/issues. Include a minimal code which reproduces the bug, along with details about your machine, operating system, compiler, version of FINUFFT, and output with opts.debug=2. If you have a known bug and have ideas, please add to the comments for that issue.

You may also contact Alex Barnett (abarnett at-sign flatironinstitute.org) with FINUFFT in the subject line.