small prime FFT based on ulong #2107

vneiger · 2024-11-09T10:24:02Z

This PR aims to provide a ulong-based version of the small-prime FFT. This is a draft; comments and suggestions are highly welcome on any aspect (for example, I have no idea whether n_fft is a suitable name).

For the moment, the features implemented are:

- forward FFT, inverse FFT, transposed forward FFT, transposed inverse FFT
- restriction on the modulus: it must have at most 62 bits (for performance reasons)
- length must be a power of 2 (other lengths require zero padding, which gives non-smooth timings between powers of 2)
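To make the feature list concrete, here is a minimal sketch of what a forward FFT of power-of-2 length over Z/nZ computes. It is not the n_fft code: the names `mulmod`, `fft_dif`, `bitrev` and `eval_naive` are hypothetical, the modular product uses the GCC/Clang `unsigned __int128` extension, and every value is fully reduced at each step, whereas the PR relies on precomputed roots and lazy reductions. The sketch itself only needs n < 2^63; that the 62-bit restriction leaves extra headroom for lazy reductions is an assumption, not something stated above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* a*b mod n via the GCC/Clang 128-bit integer extension */
uint64_t mulmod(uint64_t a, uint64_t b, uint64_t n)
{
    return (uint64_t)(((unsigned __int128)a * b) % n);
}

/* In-place radix-2 decimation-in-frequency FFT over Z/nZ.
   Requires: len a power of 2, w of multiplicative order len mod n,
   entries of p in [0, n), and n < 2^63 so that sums fit in 64 bits.
   Output is in bit-reversed order:
   p[bitrev(k)] = sum_j p_in[j] * w^(j*k) mod n. */
void fft_dif(uint64_t *p, size_t len, uint64_t w, uint64_t n)
{
    assert(len != 0 && (len & (len - 1)) == 0);
    for (size_t half = len / 2; half >= 1; half /= 2)
    {
        for (size_t start = 0; start < len; start += 2 * half)
        {
            uint64_t t = 1; /* t = w^j for the current layer's root w */
            for (size_t j = 0; j < half; j++)
            {
                uint64_t u = p[start + j];
                uint64_t v = p[start + j + half];
                p[start + j] = (u + v) % n;
                p[start + j + half] = mulmod(u + n - v, t, n);
                t = mulmod(t, w, n);
            }
        }
        w = mulmod(w, w, n); /* next layer uses a root of half the order */
    }
}

/* Reverse the lowest `bits` bits of k */
size_t bitrev(size_t k, unsigned bits)
{
    size_t r = 0;
    for (unsigned i = 0; i < bits; i++)
    {
        r = (r << 1) | (k & 1);
        k >>= 1;
    }
    return r;
}

/* Naive Horner evaluation of sum_j a[j] * x^j mod n, used as a reference */
uint64_t eval_naive(const uint64_t *a, size_t len, uint64_t x, uint64_t n)
{
    uint64_t acc = 0;
    for (size_t j = len; j-- > 0; )
        acc = (mulmod(acc, x, n) + a[j]) % n;
    return acc;
}
```

For instance, with n = 17 and w = 2 (which has order 8 modulo 17), calling `fft_dif` on 8 coefficients leaves at index `bitrev(k, 3)` the value of the polynomial at 2^k mod 17.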

Performance: observed on a few different machines (AMD Zen 4 and various Intel processors). This slightly outperforms NTL's versions of the forward and inverse FFTs (speedup of 0% to 30% depending on the length). It is between 2 and 4 times slower, often around 3, than the vectorized floating-point-based small-prime FFT in fft_small (or than the similar AVX-based version in NTL). This version uses no SIMD: enabling or disabling automatic vectorization does not change performance, and a straightforward "manual" vectorization should not bring much, because every few operations there is a full 64-bit multiplication (umul_ppmm). (Still, I made some experiments that suggest AVX could help, maybe substantially on AMD processors, which have a very fast vpmullq, but I leave this aside for later.)
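To illustrate why a full 64-bit multiplication shows up every few operations, here is a minimal sketch of a modular product by a fixed multiplier using a precomputed quotient (Shoup-style). The names `mul_wide`, `precomp_quotient` and `mulmod_lazy` are hypothetical, `mul_wide` stands in for umul_ppmm through the GCC/Clang `unsigned __int128` extension, and whether n_fft uses exactly this scheme is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* High and low words of the full 64x64 -> 128 bit product,
   the job umul_ppmm does (here via the __int128 extension). */
void mul_wide(uint64_t *hi, uint64_t *lo, uint64_t a, uint64_t b)
{
    unsigned __int128 prod = (unsigned __int128)a * b;
    *hi = (uint64_t)(prod >> 64);
    *lo = (uint64_t)prod;
}

/* Precompute floor(w * 2^64 / n) for a fixed multiplier w < n. */
uint64_t precomp_quotient(uint64_t w, uint64_t n)
{
    assert(w < n);
    return (uint64_t)((((unsigned __int128)w) << 64) / n);
}

/* Returns a value congruent to a*w mod n, lazily reduced to [0, 2n).
   Valid for any 64-bit a, w < n, and n < 2^63.
   Cost: one high product plus two low (wrapping) products. */
uint64_t mulmod_lazy(uint64_t a, uint64_t w, uint64_t w_pre, uint64_t n)
{
    uint64_t q, lo;
    mul_wide(&q, &lo, a, w_pre); /* q is floor(a*w/n) or one less */
    (void)lo;                    /* only the high word is needed */
    return a * w - q * n;        /* wraps mod 2^64; true value is in [0, 2n) */
}
```

The high word of `a * w_pre` is the part that needs the full-width multiplication. The result is left in [0, 2n) rather than [0, n), which is the kind of lazy reduction the commit messages below allude to.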

Planned:

- more thorough test files (for the transposed variants, which are only tested indirectly at the moment)
- clean things up here and there, add documentation
- add a mechanism to avoid overly memory-consuming precomputation when a root of unity of very large order is available (maybe, in a first version, simply forbid transforms of length more than 2**25 or so?)
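A back-of-the-envelope sketch of the memory concern in the last item, under a purely hypothetical table layout (one 64-bit root plus one 64-bit precomputed quotient per entry, and 2^(depth-1) entries for transforms of length up to 2^depth). The actual tables in this PR may differ, and the function names are made up for illustration.

```c
#include <stdint.h>

/* Bytes needed by the hypothetical table for transforms up to length 2^depth:
   2^(depth-1) entries, each a pair (root, precomputed quotient). */
uint64_t table_bytes(unsigned depth)
{
    uint64_t entries = (depth == 0) ? 0 : ((uint64_t)1 << (depth - 1));
    return entries * 2 * sizeof(uint64_t);
}

/* Largest d such that 2^d divides n - 1. For a prime n this is the largest d
   for which a root of unity of order 2^d exists modulo n. */
unsigned max_depth(uint64_t n)
{
    uint64_t m = n - 1;
    unsigned d = 0;
    while (m != 0 && (m & 1) == 0)
    {
        m >>= 1;
        d++;
    }
    return d;
}

/* Depth actually used for precomputation when lengths above 2^cap are forbidden. */
unsigned capped_depth(uint64_t n, unsigned cap)
{
    unsigned d = max_depth(n);
    return d < cap ? d : cap;
}
```

Under these assumptions, a cap at 2**25 corresponds to a 256 MiB table, whereas a modulus n with n - 1 divisible by 2^57 would call for about 2^60 bytes if tables were built for the full available order.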

Planned, but likely not within this PR:

- truncated FFT variants, for smooth performance when the length varies from one power of 2 to the next
- versions with strides, useful e.g. for polynomial matrices stored as a list of matrix coefficients (this might help, for example, for the half-GCD algorithm)
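To see why zero padding gives non-smooth timings, which is what truncated variants are meant to avoid, here is a small sketch using a simplified cost model: a radix-2 transform of length 2^d is counted as d * 2^(d-1) butterflies. The model and the function names are illustrative assumptions, not measurements from this PR.

```c
#include <stddef.h>

/* Smallest power of 2 that is >= len (assumes len is small enough
   that this power of 2 fits in size_t). */
size_t padded_len(size_t len)
{
    size_t p = 1;
    while (p < len)
        p *= 2;
    return p;
}

/* Exponent d of a power of 2, p = 2^d */
unsigned log2_pow2(size_t p)
{
    unsigned d = 0;
    while (p > 1)
    {
        p /= 2;
        d++;
    }
    return d;
}

/* Butterfly count of a radix-2 FFT after zero padding len to a power of 2 */
size_t butterfly_count(size_t len)
{
    size_t p = padded_len(len);
    return (p / 2) * log2_pow2(p);
}
```

In this model, length 1024 costs 5120 butterflies while length 1025 is padded to 2048 and costs 11264, more than twice as much for a single extra coefficient.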

vneiger added 30 commits September 16, 2024 11:24

add profile for powmod

0bf0127

Merge branch 'main' into introduce_nmod_fft

c363adb

add .h file

4a887b4

fix ifndef

aee38b3

context and init code

17faaea

add profile

e738256

fix include

dcaede7

improve profile init

cd50787

rename ctx init

afa5ddc

testing init

fd24de2

fix explanations and complete test for init

211ab75

remove printf

6368823

forgot to add main

9eeedd6

dft, test passes

3fa7944

add profile

ff33533

clean things a bit

f4520c9

introducing dft32 base case

e10c29c

dft32 base case

7b605a6

cleaning things

1f236d8

testing from length 1

9bf18c7

fix

fb88c54

remove useless function argument

f6cc96c

vaguely faster with added lazy14 layer

a675b68

clean explanations

28b3276

finalize lazy14 version

b71649d

small fixes

8cd392c

tentative fix for flint_bits == 32

9fa9020

dft8 is now a macro, code generation was too unpredictable

ccd3f71

putting more args slightly slows down for large lengths...

f0587e5

macro for dft16 helps, let's see for dft32

4cf7343

vneiger added 29 commits November 4, 2025 00:48

cleaner version of TFT for ilen == olen

a8e2991

version using faster divrem

efdcd34

some minor fixes; speed of TFT seems ok; still needs lazy consolidation and handling other ilen

ce95be2

Merge branch 'flintlib:main' into introduce_nmod_fft

36d3e5b

add macros for TFT; put prototypes in impl.h

c32d5d8

add impl.h

29041af

inserting base cases

455117f

fix some typos in base cases; tested

d41a3da

clean reduce circulant

757f0c5

simplify computation of new depth / new node

b4ed9d9

working version with arbitrary ilen

0dfc3c4

working version with arbitrary ilen, improved for large ilen

c164063

improving array traversal

2c9f586

profiling and testing more

1d9da3e

accelerate ilen close to len/2, clean code a bit

5c85008

TFT seems ready. Next step: inverse TFT.

8a7b853

gather functions in file

e22b6d7

removed now unused file

a787980

move some implem in impl.h and add _prepare_tft function

7fa7173

added more helper functions (not fully tested yet)

79ccf99

add functions for remainder and unrolling modulo product of x-w, w belonging to specific set of roots of unity

f7988c7

complete testing/fixing of functions for remainder and unrolling modulo product of x-w

ecf1230

reorganize and add prototypes

a121eb7

add first version of inv-TFT, tested -- to be better documented and benchmarked

eb357e2

profile file for itft

5b68243

some improvements and fixes

42f984b

towards a version with node 0

08036fd

more improvements and some profiling

4e247e9

itft -- refinements in progress

e0c3412

vneiger mentioned this pull request Nov 15, 2025

Accelerate single point evaluation for nmod_poly #2492

Open
