
Collaborator

@vneiger vneiger commented Nov 9, 2024

This PR adds a ulong-based version of the small-prime FFT. It is a draft; comments and suggestions are highly welcome on any aspect (for example, I have no idea whether n_fft is an appropriate name).

For the moment, the features implemented are:

  • forward FFT, inverse FFT, transposed forward FFT, transposed inverse FFT
  • restriction on the modulus: it must be 62 bits at most (for performance reasons)
  • the length must be a power of 2 (other lengths require zero padding, so timings are not smooth between powers of 2)
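
To make the first and third items concrete, here is a minimal sketch of a forward/inverse pair of power-of-2 transforms modulo a word-size prime. It is not the code of this PR: the names `mulmod`, `powmod`, `fft_dif` and `ifft_dit` are hypothetical, the reduction uses a plain 128-bit remainder (via the GCC/Clang `unsigned __int128` extension) rather than any precomputation, and the modulus is assumed to be a prime below 2^63.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* a*b mod n through a 128-bit product (unsigned __int128: GCC/Clang extension). */
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t n)
{
    return (uint64_t) (((unsigned __int128) a * b) % n);
}

/* a^e mod n by binary exponentiation. */
static uint64_t powmod(uint64_t a, uint64_t e, uint64_t n)
{
    uint64_t r = 1 % n;
    a %= n;
    while (e)
    {
        if (e & 1) r = mulmod(r, a, n);
        a = mulmod(a, a, n);
        e >>= 1;
    }
    return r;
}

/* Forward transform, in place, decimation in frequency.
   len is a power of 2, w a primitive len-th root of unity mod n.
   Input in natural order, output in bit-reversed order. */
static void fft_dif(uint64_t *a, size_t len, uint64_t w, uint64_t n)
{
    assert(len != 0 && (len & (len - 1)) == 0);
    assert(n < (UINT64_C(1) << 63));  /* so that u + v cannot overflow */
    for (size_t half = len / 2; half >= 1; half /= 2)
    {
        for (size_t start = 0; start < len; start += 2 * half)
        {
            uint64_t t = 1;  /* runs over the powers of the current root */
            for (size_t i = 0; i < half; i++)
            {
                uint64_t u = a[start + i], v = a[start + i + half];
                a[start + i] = (u + v) % n;
                a[start + i + half] = mulmod(u + n - v, t, n);
                t = mulmod(t, w, n);
            }
        }
        w = mulmod(w, w, n);  /* root of half the order for the next layer */
    }
}

/* Inverse transform, in place, decimation in time: undoes fft_dif.
   Input in bit-reversed order, output in natural order. Assumes n is prime. */
static void ifft_dit(uint64_t *a, size_t len, uint64_t w, uint64_t n)
{
    assert(len != 0 && (len & (len - 1)) == 0);
    uint64_t winv = powmod(w, n - 2, n);  /* inverse by Fermat's little theorem */
    for (size_t half = 1; half < len; half *= 2)
    {
        uint64_t ws = powmod(winv, len / (2 * half), n);  /* root for this layer */
        for (size_t start = 0; start < len; start += 2 * half)
        {
            uint64_t t = 1;
            for (size_t i = 0; i < half; i++)
            {
                uint64_t u = a[start + i];
                uint64_t v = mulmod(a[start + i + half], t, n);
                a[start + i] = (u + v) % n;
                a[start + i + half] = (u + n - v) % n;
                t = mulmod(t, ws, n);
            }
        }
    }
    uint64_t leninv = powmod((uint64_t) len % n, n - 2, n);
    for (size_t i = 0; i < len; i++)
        a[i] = mulmod(a[i], leninv, n);  /* each layer contributed a factor 2 */
}
```

For instance, with n = 998244353 (for which 3 is a primitive root) and `w = powmod(3, (n - 1) / 8, n)`, calling `fft_dif` and then `ifft_dit` on a length-8 array returns the original coefficients. An input of another length would first be zero-padded to the next power of 2, which is the source of the non-smooth timings mentioned above.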

Performance: measured on a few different machines (AMD Zen 4 and various Intel processors). This slightly outperforms NTL's versions of the forward and inverse FFTs (speed-up of 0% to 30%, depending on the length). It is between 2 and 4 times slower, often around 3, than the vectorized floating-point-based small-prime FFT in fft_small (or than the similar AVX-based version in NTL). This version uses no SIMD: enabling or disabling automatic vectorization does not change performance, and a straightforward "manual" vectorization should not bring much, because every few operations there is a full 64-bit multiplication (umul_ppmm). (Still, I made some experiments that suggest AVX could help, maybe substantially on AMD processors, which have a very fast vpmullq, but I leave this aside for later.)
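
To illustrate where the full 64-bit multiplication comes from, here is a sketch of the classical modular multiplication by a fixed element with a precomputed quotient (often attributed to Shoup, and used in NTL). Whether this PR uses exactly this scheme is an assumption; the function names are hypothetical, and the 128-bit product is written with the GCC/Clang `unsigned __int128` extension rather than with FLINT's `umul_ppmm`. The high word of a 64x64-bit product is needed for each multiplication by a twiddle factor, and that is the operation plain SIMD instruction sets do not provide directly.

```c
#include <assert.h>
#include <stdint.h>

/* High 64 bits of the 128-bit product a*b (the high word that a macro
   such as umul_ppmm would return). */
static uint64_t mulhi(uint64_t a, uint64_t b)
{
    return (uint64_t) (((unsigned __int128) a * b) >> 64);
}

/* For a fixed w < n, precompute floor(w * 2^64 / n). */
static uint64_t precomp_quot(uint64_t w, uint64_t n)
{
    assert(w < n);
    return (uint64_t) (((unsigned __int128) w << 64) / n);
}

/* Returns a value in [0, 2n) congruent to a*w modulo n.
   Valid for any 64-bit a, provided w < n < 2^63 and w_pr = precomp_quot(w, n).
   The subtraction is done modulo 2^64; the true result fits in 64 bits. */
static uint64_t mulmod_precomp_lazy(uint64_t a, uint64_t w, uint64_t w_pr, uint64_t n)
{
    uint64_t q = mulhi(a, w_pr);  /* approximate quotient of a*w by n */
    return a * w - q * n;
}
```

The result is only reduced to [0, 2n), and a final conditional subtraction gives the canonical residue. Lazy butterflies built on this typically let intermediate values grow to [0, 4n), which requires 4n to fit in a word; this is one plausible reason for a bound such as 62 bits on the modulus, stated here as an assumption rather than as a description of this PR's internals.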

Planned:

  • more thorough test files (for the transposed variants, which are only tested indirectly at the moment)
  • clean things up here and there, add documentation
  • add a mechanism to avoid excessively memory-consuming precomputation when a root of unity of very large order is available (maybe, in a first version, simply forbid transforms of length more than 2**25 or so?).

Planned, but likely not within this PR:

  • truncated FFT variants, for smooth performance when length varies from one power of 2 to the next
  • versions with strides, useful e.g. for polynomial matrices stored as a list of matrix coefficients (e.g., might help for the half-GCD algorithm)