# Details

## Transform Definition

The factor \(\omega\) used in the transform depends on the direction:

- \(\omega_{N}^{k,n} = e^{-2\pi i \frac{k n}{N}}\): *Forward* transform from space domain to frequency domain
- \(\omega_{N}^{k,n} = e^{2\pi i \frac{k n}{N}}\): *Backward* transform from frequency domain to space domain
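As a minimal sketch of the two sign conventions, the factor \(\omega_{N}^{k,n}\) can be evaluated directly. The function name `omega` and its `forward` flag are illustrative only and are not part of the SpFFT API.

```cpp
#include <cassert>
#include <cmath>
#include <complex>

// Twiddle factor omega_N^{k,n} = exp(-/+ 2*pi*i*k*n/N).
// forward == true selects the negative exponent (space -> frequency),
// forward == false the positive exponent (frequency -> space).
std::complex<double> omega(int k, int n, int N, bool forward) {
  assert(N > 0);
  const double pi = std::acos(-1.0);
  const double angle = 2.0 * pi * static_cast<double>(k) * n / N;
  return std::polar(1.0, forward ? -angle : angle);
}
```

For the same `k`, `n` and `N`, the forward and backward factors are complex conjugates of each other, so their product is 1.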

## Complex Number Format

SpFFT always assumes an interleaved format (real part followed by imaginary part) in double or single precision. The alignment of memory provided for space domain data is guaranteed to fulfill the requirements of std::complex (for C++17), C complex types, and the GPU complex types of CUDA or ROCm.
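To illustrate what "interleaved" means in practice, the sketch below reads a `std::complex<double>` array as a flat array of doubles. The helper `interleavedAt` is hypothetical and only demonstrates the layout that the C++ standard guarantees for `std::complex`.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

static_assert(sizeof(std::complex<double>) == 2 * sizeof(double),
              "std::complex<double> stores exactly two doubles");

// Read position i of the interleaved view re0, im0, re1, im1, ...
// Even positions hold real parts, odd positions hold imaginary parts.
double interleavedAt(const std::vector<std::complex<double>>& values, std::size_t i) {
  assert(i < 2 * values.size());
  const double* raw = reinterpret_cast<const double*>(values.data());
  return raw[i];
}
```

The same memory can therefore be handed to an interface that expects plain `double` pointers without copying.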

## Indexing

Indices for a dimension of size *n* must be either in the interval \([0, n - 1]\) or \(\left [ \left \lfloor \frac{n}{2} \right \rfloor - n + 1, \left \lfloor \frac{n}{2} \right \rfloor \right ]\). For Real-To-Complex transforms additional restrictions apply (see next section).
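The two intervals can be checked with a few lines of code. This is a sketch with hypothetical helper names. The wrap-around mapping of negative indices (adding *n*) is the usual FFT convention and is assumed here rather than taken from the SpFFT sources.

```cpp
#include <cassert>

// True if idx lies in [0, n-1] or in the centered interval
// [floor(n/2) - n + 1, floor(n/2)].
bool isValidIndex(int idx, int n) {
  assert(n > 0);
  const int half = n / 2;  // equals floor(n/2) for positive n
  return (idx >= 0 && idx <= n - 1) || (idx >= half - n + 1 && idx <= half);
}

// Map a valid index to a non-negative position in [0, n-1].
// Assumed convention: a negative index idx refers to position idx + n.
int toStorageIndex(int idx, int n) {
  assert(isValidIndex(idx, n));
  return idx < 0 ? idx + n : idx;
}
```

For example, with *n* = 4 the centered interval is \([-1, 2]\), so the index -1 is valid and refers to the same frequency as index 3.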

## Real-To-Complex Transforms

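The frequency domain data of a real-valued function is Hermitian, \(f(x, y, z) = f^*(-x, -y, -z)\), so only about half of it is required. For this reason, all indices in \(x\) *must* be in the interval \(\left [ 0, \left \lfloor \frac{n}{2} \right \rfloor \right ]\). To fully utilize the symmetry property, the following steps can be followed: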

- Only non-redundant z-columns on the y-z plane at \(x = 0\) have to be provided. A z-column must be complete and can be provided at either \(y\) or \(-y\).
- All redundant values in the z-column at \(x = 0\), \(y = 0\) can be omitted.
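One possible way to apply these steps is sketched below. It assumes centered indexing and an input list that contains the full Hermitian-symmetric set of index triplets. The names `Triplet`, `isRequired` and `dropRedundant` are hypothetical, and keeping the \(+y\) column and the \(z \ge 0\) values is an arbitrary choice among the allowed ones.

```cpp
#include <array>
#include <vector>

using Triplet = std::array<int, 3>;  // {x, y, z}, centered indexing assumed

// Decide whether an element has to be provided for a real-to-complex transform.
bool isRequired(const Triplet& t) {
  const int x = t[0], y = t[1], z = t[2];
  if (x < 0) return false;   // x is restricted to [0, floor(n/2)]
  if (x > 0) return true;    // no reduction is applied away from x = 0
  if (y != 0) return y > 0;  // plane x = 0: keep the whole column at +y, drop -y
  return z >= 0;             // column x = 0, y = 0: keep z, drop -z
}

// Keep only the required elements of a full, symmetric index set.
std::vector<Triplet> dropRedundant(const std::vector<Triplet>& input) {
  std::vector<Triplet> kept;
  for (const Triplet& t : input) {
    if (isRequired(t)) kept.push_back(t);
  }
  return kept;
}
```

Because a kept column at \(x = 0\), \(y \neq 0\) is never filtered by its z-index, every z-column that survives remains complete.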

## Normalization

Normalization is only available for the forward transform, with a scaling factor of \(\frac{1}{N_x N_y N_z}\). Applying a forward transform with scaling enabled followed by a backward transform will therefore reproduce the original input (within numerical accuracy).
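The effect of the scaling can be reproduced in one dimension with a naive DFT. The sketch below is not SpFFT code: `dft` and `roundTrip` are hypothetical helpers, and the \(O(N^2)\) loop stands in for the real FFT purely for illustration.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using Signal = std::vector<std::complex<double>>;

// Naive O(N^2) 1D DFT. sign is -1 for the forward and +1 for the backward
// transform; every output element is multiplied by scale.
Signal dft(const Signal& in, int sign, double scale) {
  const double pi = std::acos(-1.0);
  const std::size_t N = in.size();
  Signal out(N);
  for (std::size_t k = 0; k < N; ++k) {
    std::complex<double> sum = 0.0;
    for (std::size_t n = 0; n < N; ++n) {
      const double angle =
          sign * 2.0 * pi * static_cast<double>(k * n) / static_cast<double>(N);
      sum += in[n] * std::polar(1.0, angle);
    }
    out[k] = scale * sum;
  }
  return out;
}

// Forward transform scaled by 1/N, followed by an unscaled backward transform.
Signal roundTrip(const Signal& in) {
  const Signal freq = dft(in, -1, 1.0 / static_cast<double>(in.size()));
  return dft(freq, +1, 1.0);
}
```

Without the \(\frac{1}{N}\) factor on the forward pass, the round trip would return the input multiplied by \(N\).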

## Optimal Sizing

The underlying computation is done by FFT libraries such as FFTW and cuFFT, which provide optimized implementations for sizes of the form \(2^a 3^b 5^c 7^d\), where \(a, b, c, d\) are non-negative integers. Typically, smaller prime factors perform better, so the size of each dimension is ideally chosen accordingly.
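A dimension size can be tested against this form, and rounded up to the next matching size, with a short helper. Both function names are hypothetical. Whether a grid may be enlarged at all depends on the application.

```cpp
#include <cassert>
#include <initializer_list>

// True if n has no prime factors other than 2, 3, 5 and 7.
bool isFastSize(int n) {
  assert(n > 0);
  for (int p : {2, 3, 5, 7}) {
    while (n % p == 0) n /= p;
  }
  return n == 1;
}

// Smallest size >= n that is of the form 2^a 3^b 5^c 7^d.
int nextFastSize(int n) {
  assert(n > 0);
  while (!isFastSize(n)) ++n;
  return n;
}
```

For instance, a requested size of 97 (a prime) would be rounded up to 98, which factors as \(2 \cdot 7^2\).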

## Data Distribution

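Space domain data is distributed as slabs and frequency domain data as sets of z-columns: all elements of an \(x\)-\(y\) plane in space domain, and all elements of a z-column in frequency domain, *must* be on the same MPI rank. The order and distribution of frequency space elements can have a significant impact on performance. Locally, elements are best grouped by z-columns and ordered by their z-index within each column. The ideal distribution of z-columns between MPI ranks differs for execution on host and GPU.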
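The recommended local ordering can be produced by sorting the index triplets with \((x, y)\) as the primary key and \(z\) as the secondary key. This is a sketch with hypothetical names. The relative order of the columns themselves (here lexicographic in \(x\), then \(y\)) is an assumption, since the text only asks for grouping by column.

```cpp
#include <algorithm>
#include <array>
#include <tuple>
#include <vector>

using Triplet = std::array<int, 3>;  // {x, y, z}

// Group elements by z-column (equal x and y) and order by z within a column.
void sortByZColumns(std::vector<Triplet>& indices) {
  std::sort(indices.begin(), indices.end(),
            [](const Triplet& a, const Triplet& b) {
              return std::tie(a[0], a[1], a[2]) < std::tie(b[0], b[1], b[2]);
            });
}
```

In a real application the frequency values would have to be permuted together with their indices, for example by sorting an array of positions with the same comparison.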

## MPI Exchange

The MPI exchange is based on a collective MPI call. The following options are available:

- SPFFT_EXCH_BUFFERED
Exchange with MPI_Alltoall. Requires repacking of data into a buffer. MPI implementations possibly optimize this call best for a large number of ranks, but it does not adapt well to non-uniform data distributions.

- SPFFT_EXCH_COMPACT_BUFFERED
Exchange with MPI_Alltoallv. Requires repacking of data into a buffer. Performance is usually close to MPI_Alltoall, and it adapts well to non-uniform data distributions.

- SPFFT_EXCH_UNBUFFERED
Exchange with MPI_Alltoallw. Does not require repacking of data into a buffer (outside of the MPI library). Performance varies widely between systems and MPI implementations. It is generally difficult to optimize for a large number of ranks, but may perform best under certain conditions.

For both *SPFFT_EXCH_BUFFERED* and *SPFFT_EXCH_COMPACT_BUFFERED*, an exchange in single precision can be selected. With transforms in double precision, this halves the number of bytes sent and received. For execution on GPUs without GPUDirect, the data transfer between GPU and host also benefits. This option can provide a significant speedup, but incurs a slight accuracy loss. The double precision values are converted to and from single precision between the transform in z and the transform in x / y, while all actual calculations are still done in the selected precision.
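The trade-off of a single precision exchange can be illustrated without MPI. The sketch below converts a double precision buffer to single precision and back, which is assumed to mirror the conversion around the exchange. The helper names are hypothetical and do not correspond to SpFFT internals.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Convert to single precision, as would be done before sending.
std::vector<std::complex<float>> packAsFloat(const std::vector<std::complex<double>>& in) {
  std::vector<std::complex<float>> out;
  out.reserve(in.size());
  for (const auto& v : in) {
    out.emplace_back(static_cast<float>(v.real()), static_cast<float>(v.imag()));
  }
  return out;
}

// Convert back to double precision, as would be done after receiving.
std::vector<std::complex<double>> unpackToDouble(const std::vector<std::complex<float>>& in) {
  std::vector<std::complex<double>> out;
  out.reserve(in.size());
  for (const auto& v : in) {
    out.emplace_back(static_cast<double>(v.real()), static_cast<double>(v.imag()));
  }
  return out;
}

// Payload size in bytes of a buffer holding n elements of type T.
template <typename T>
std::size_t payloadBytes(std::size_t n) {
  return n * sizeof(T);
}
```

The single precision payload is half the size of the double precision one, and values that are not exactly representable as `float` come back with a small rounding error.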

## Thread-Safety

The creation of Grid and Transform objects is thread-safe only if:

- No FFTW library calls are executed concurrently.
- In the distributed case, MPI thread support is set to *MPI_THREAD_MULTIPLE*.

The execution of transforms is thread-safe if:

- Each thread executes using its own Grid and associated Transform object.
- In the distributed case, MPI thread support is set to *MPI_THREAD_MULTIPLE*.

## GPU

Note

Additional environment variables may have to be set for some MPI implementations to allow GPUDirect usage.

Note

The execution of a transform is synchronized with the default stream.

## Multi-GPU

Multi-GPU support is not available for individual transform operations, but each Grid / Transform can be associated with a different GPU. At creation time, the current GPU id is stored internally and used for later operations. Multiple GPUs can therefore be used at the same time through either the asynchronous execution mode or the multi-transform functionality.