List of features / changes made / release notes, in reverse chronological order

* multithreaded one-mode accuracy test in C++ tests, t1 & t3.

V 2.0.1 (10/6/20)

* python (under-the-hood) interfacing changed from pybind11 to cleaner ctypes.
* non-stochastic test/*.cpp routines, zeroing small chance of incorrect failure
* Windows compatible makefile
* mac OSX improved installation instructions and*

V 2.0.0 (8/28/20)

* major changes to code, internally, and major improvements to operation and
  language interfaces.

	WARNING!: Here are all the interface compatibility changes from 1.1.2:
	- opts (nufft_opts) is now always passed as a pointer in C++/C, not
	  pass-by-reference as in v1.1.2 or earlier.
	- Fortran simple calls are now finufft?d?(..) not finufft?d?_f(..), and
	  they add a penultimate opts argument.
	- Python module name is now finufft not finufftpy, and the interface has
	  been completely changed (allowing major improvements, see below).
	- ier=1 is now a warning not an error; this indicates requested tol
	  was too small, but that a transform *was* done at the best possible
	- opts.fftw directly controls the FFTW plan mode consistently in all
	  language interfaces (thus changing the meaning of fftw=0 in MATLAB).
	- Octave now needs version >= 4.4, since OO features used by guru.

	These changes were deemed necessary to rationalize and improve FINUFFT
	for the long term.
	There are also many other new interface options (many-vector, guru)
	added; see docs.
* the C++ library is now dual-precision, with distinct function interfaces for
  double vs single precision operation, that are C and C++ compatible. Under
  the hood this is achieved via simple C macros. All language interfaces now
  have dual precision options.
* completely new (although backward compatible) MATLAB/octave interface,
  including object-style wrapper around the guru interface, dual precision.
* completely new Fortran interface, allowing >2^31 sized (int64) arrays,
  all simple, many-vector and guru interface, with full options control,
  and dual precisions.
* all simple and many-vector interfaces now call guru interface, for much
  better maintainability and less code repetition.
* new guru interface, by Andrea Malleo and Alex Barnett, allowing easier
  language wrapping and control of point-setting, reuse of sorting and FFTW
  plans. This finally bypasses the 0.1ms/thread cost of FFTW looking up previous
  wisdom, which slowed down performance for many small problems.
* removed obsolete -DNEED_EXTERN_C flag.
* major rewrite of documentation, plus tutorial application examples in MATLAB.
* numdiff dependency is removed for pass-fail library validation.
* new (professional!) logo for FINUFFT. Sphinx HTML and PDF aesthetics.

V 1.1.2 (1/31/20)

* Ludvig's padding of Horner loop to w=4n, speeds up kernel, esp for GCC5.4.
* Bansal's Mingw32 python patches.

V 1.1.1 (11/2/18)

* Mac OSX installation on clang and gcc-8, clearer install docs.
* LIBSOMP split off in makefile.
* printf(...%lld..) w/ long long typecast
* new basic passfail tester
* precompiled binaries

V 1.1 (9/24/18)

* NOTE TO USERS: changed interface for setting default opts in C++ and C, from
  pass by reference to pass by value of a pointer (see docs/). Unifies C++/C
  interfaces in a clean way.
* fftw3_omp instead of fftw3_threads (on linux), is faster.
* rationalized header files.

V 1.0.1 (9/14/18)

* Ludvig's removal of omp chunksize in dir=2, another 20%+ speedup.
* Matlab doesn't change omp internal state.

V 1.0 (8/20/18)

* repo transferred to flatironinstitute
* usage doc simpler
* 2d1many and 2d2many interfaces by Melody Shih, for multiple vectors with same
  nonuniform points. All tests and docs for these interfaces.
* horner optimized kernel for sigma=5/4 (low upsampling), to go along with the
  default sigma=2. Cmdline arg to change sigma in finufft?d_test.
* simplified various int types: only BIGINT remains.
* clearer docs.
* remaining C interfaces, with opts control.

V 0.99 (4/24/18)

* piecewise polynomial kernel evaluation by Horner, for faster spreading esp at
  low accuracy and 1d or 2d.
* various heuristic decisions re whether to sort, and if sorting is single or
* single-precision libs get an "f" suffix so can coexist with double-prec.

V 0.98 (3/1/18)

* makefile includes for OS-specific defs.
* decided that, since Legendre nodes code of GLR alg by Hale/Burkhardt is LGPL
  licensed, our use (not changing source) is not a "derived work", therefore
  ok under our Apache v2 license. See:
  * fixed MATLAB FFTW incompat alloc crash, by hack of Joakim, calling fft()
* python tests fixed, brought into makefile.
* brought in af Klinteberg spreader optimizations & SSE tricks.
* logo

V 0.97 (12/6/17)

* tidied all docs -> host. now a stub. TODO tidied.
* made sort=1 in tests for xeon (used to be 0)
* removed mcwrap and python dirs
* changed name of py routines to nufft* from finufft*
* python interfaces doc, up-to-date. Removed ms,.. from type-2 interfaces.
* removed RESCALEs from lower dims in bin_sort, speeds up a few % in 1D.
* allowed NU pts to be currectly folded from +-1 periods from central box, as
  per David Stein request. Adds 5% to time at 1e-2 accuracy, less at higher acc.
* corrected dynamic C++ array allocs in spreader (some made static, 5% speedup)
* removed all C++11 dependencies, mainly that opts structs are all explicitly
* fixed python interface to have chkbnds.
* tidied MEX interface
* removed memory leaks (!)
* opts.modeord implemented and exposed to matlab/python interfaces. Also removes
  looping backwards in RAM in deconvolveshuffle.

V 0.96 (10/15/17)

* apache v2 license, exposed flags in python interface.

V 0.95 (10/2/17)

* brought in JFM's in-package python wrapper & doc, create lib-static dir,
  removed devel dir.

V 0.9: (6/17/17)

* adapted adv4 into main code, inner loops separate by dim, kill
  the current spreader. Incorporate old ideas such as: checkerboard
  per-thread grid cuboids, compare speed in 2d and 3d against
  current 1d slicing. See cnufftspread:set_thread_index_box()
* added FFTW_MEAS vs FFTW_EST (set as default) opts flag in nufft_opts, and
  matlab/python interfaces
* removed opts.maxnalloc in favor of #defined MAX_NF
* fixed the 1-target case in type-3, all dims, to avoid nan; clarified logic
  for handling X=0 and/or S=0. 6/12/17
* changed arraywidcen to snap to C=0 if relative shift is <0.1, avoids cexps in
* t3: if C1 < X1/10 and D1 < S1/10 then don't rephase. Same for d=2,3.
* removed the 1/M type-1 prefactor, also in all test routines. 6/6/17
* removed timestamp-based make decision whether to rebuild matlab/finufft.cpp,
  since git clone creates files with random timestamp order!
* theory work on exp(sqrt) being close to PSWF. Analysis.
* fix issue w/ needing mwrap when do make matlab.
* makefile has variables customizing openmp and precision, non-omp tested
* fortran single-prec demos (required all direct ft's in single prec too!)
* examples changed to err rel to max F.
* matlab interface control of opts.spread_sort.
* matlab interface using doubles as big ints w/ correct typecasting.
* twopispread removed, used flag in spread_opts for [-pi,pi] input instead.
* testfinufft* use same integer type INT as for interfaces, typecast all %ld in
  printf warnings, use omp rand array filling
* INT64 for necessary size-setting arrays, removed all %lf printf warnings in
* all internal array indexing is BIGINT, switchable from long long to int via
  SMALLINT compile flag (default off, see utils.h)
* all integers in interfaces are type INT, default 64-bit, switchable to 32 bit
  by INTERGER32 compile flag (see utils.h)
* test big probs (speed, crashing) and decide if BIGINT is long long or int?
  slows any array access down, or spreading? allows I/O sizes
  (M, N1*N2*N3) > 2^31. Note June-Yub int*8 in nufft-1.3.x slowed things by
  factor 2-3.
* tidy up spreader to be BIGINT = long long compatible and test > 2^31.
* spreadtest parallel rand()
* sort flag passed to spreader via finufft, and test scripts check if Xeon
  (-> sort=0)
* opts in the manual
* removed all xk2, dNU2 sorted arrays, and not-needed dims y,z; halved RAM usage

V 0.8: (3/27/17)

* bnderr checking done in dir=1,2 main loops, not before.
* all kx2, dNU2 arrays removed, just done by permutation index when needed.
* MAC OSX test, makefile, instructions.
* matlab wrappers in 3D
* matlab wrappers, mcwrap issue w/ openmp, mex, and subdirs. Ship mex
  executables for linux. Link to .a
* matlab wrappers need ier output? yes, and internal omp numthreads control
  (since matlab's is poor)
* wrappers for MEX octave, instructions. Ship .mex for octave.
* python wrappers - Dan Foreman-Mackey starting to add something similar to
* check is done before attempting to malloc ridiculous array sizes, eg if a
  large product of x width and k width is requested in type 3 transforms.
* draft make python
* basic manual (txt)

V. 0.7:

* build static & shared lib
* fixed bug when Nth>Ntop
* fortran drivers use dynamic malloc to prevent stack segfaults that CMCL had
* bugs found in fortran drivers, removed
* split out devel text files (TODO, etc)
* made pass-fail test script counting crashes and numdiff fails.
* finufft?d_test have a no-timings option, and exit with ier.
* global error codes
* made finufft routines & testers return error codes rather than exit().
* dumbinput test executable
* found nan returned error for nj=0 in type-1, fixed so returns the zero array.
* fixed type 2 to not segfault when ms,mt, or mu=0, doing dir=2 0-padding right
* array utils use pointers to make which vars they write to explicit.
* don't do final type-3 rephase if C1 nan or 0.
* finished all dumbinputs, all dims
* fortran compilation fixed
* makefile self-documents
* nf1 (etc) size check before alloc, exit gracefully if exceeds RAM
* integrate into nufft_comparison, esp vs NFFT - jfm did
* simple examples, simpler than the test drivers
* fortran link via gfortran, better fortran docs
* boilerplate stuff as in CMCL page

pre-V. 0.7: (Jan-Feb 2017)

* efficient modulo in spreader, done by conditionals
* removed data-zeroing bug in t-II spreader, slowness of large arrays in t-I.
* clean dir tree
* spreader dir=1,2 math tests in 3d, then nd.
* Jeremy's request re only computing kernel vals needed (actually
  was vital for efficiency in dir=1 openmp version), Ie fix KB kereval in
  spreader so doesn't wdo 3d fill when 1 or 2 will do.
* spreader removed modulo altogether in favor of ifs
* OpenMP spreader, all dims
* multidim spreader test, command line args and bash driver
* cnufft->finufft names, except spreader still called cnufft
* make ier report accuracy out of range, malloc size errors, etc
* moved wrappers to own directories so the basic lib is clean
* fortran wrapper added ier argument
* types 1,2 in all dims, using 1d kernel for all dims.
* fix twopispread so doesn't create dummy ky,kz, and fix sort so doesn't ever
  access unused ky,kz dims.
* cleaner spread and nufft test scripts
* build universal ndim Fourier coeff copiers in C and use for finufft
* makefile opts and compiler directives to link against FFTW.
* t-I, t-II convergence params test: R=M/N and KB params
* overall scale factor understand in KB
* check J's bessel10 approx is ok. - became irrelevant
* meas speed of I_0 for KB kernel eval - became irrelevant
* understand origin of dfftpack (netlib fftpack is real*4) - not needed
* [spreader: make compute_sort_indices sensible for 1d and 2d. not needed]
* next235even for nf's
* switched pre/post-amp correction from DFT of kernel to F series (FT) of
  kernel, more accurate
* Gauss-Legendre quadrature for direct eval of kernel FT, openmp since cexp slow
* optimize q (# G-L nodes) for kernel FT eval on reg and irreg grids
  (common.cpp). Needs q a bit bigger than like (2-3x the PTR, when 1.57x is
  expected). Why?
* type 3 segfault in dumb case of nj=1 (SX product = 0). By keeping gam>1/S
* optimize that phi(z) kernel support is only +-(nspread-1)/2, so w/ prob 1 you
  only use nspread-1 pts in the support. Could gain several % speed for same acc
* new simpler kernel entirely
* cleaned up set_nf calls and removed params from within core libs
* test isign=-1 works
* type 3 in 2d, 3d
* style: headers should only include other headers needed to compile the .h;
  all other headers go in .cpp, even if that involves repetition I guess.
* changed library interface and twopispread to dcomplex
* fortran wrappers (rmdir greengard_work, merge needed into fortran)

Started: mid-January 2017.