Monday, 8 October 2012

The 'Classic' Non-Recursive DIF FFT Routine.

http://www.engineeringproductivitytools.com/stuff/T0001/PT03.HTM#Head134



The Radix 2 Decimation In Frequency (DIF) Algorithm.

The Maths.

We defined the FFT as:
$$F(k) = \sum_{n=0}^{N-1} f(n)\, e^{-2\pi j n k / N}, \qquad k = 0 \ldots N-1$$
If N is even, the above sum can be split into 'top' (n=0..N/2-1) and 'bottom' (n=N/2..N-1) halves and re-arranged as follows:
$$F(k) = \sum_{n=0}^{N/2-1} f(n)\, e^{-2\pi j n k / N} \;+\; \sum_{n=N/2}^{N-1} f(n)\, e^{-2\pi j n k / N}$$
$$F(k) = \sum_{n=0}^{N/2-1} f(n)\, e^{-2\pi j n k / N} \;+\; \sum_{n=0}^{N/2-1} f(n+N/2)\, e^{-2\pi j (n+N/2) k / N}$$
$$F(k) = \sum_{n=0}^{N/2-1} \left[ f(n) + (-1)^{k} f(n+N/2) \right] e^{-2\pi j n k / N}$$
If we now consider the form of this result for even and odd valued k separately, we can see how it enables us to express an N point FFT in terms of two (N/2) point FFT's.
Even k, k=2k' (k'=0..N/2-1):
$$F(2k') = \sum_{n=0}^{N/2-1} \left[ f(n) + f(n+N/2) \right] e^{-2\pi j n (2k') / N}$$
or equivalently..
$$F(2k') = \sum_{n=0}^{N/2-1} f_e(n)\, e^{-2\pi j n k' / (N/2)}$$
$$\text{where } f_e(n) = f(n) + f(n+N/2)$$

Odd k, k=2k'+1 (k'=0..N/2-1):
$$F(2k'+1) = \sum_{n=0}^{N/2-1} \left[ f(n) - f(n+N/2) \right] e^{-2\pi j n (2k'+1) / N}$$
or equivalently..
$$F(2k'+1) = \sum_{n=0}^{N/2-1} f_o(n)\, e^{-2\pi j n k' / (N/2)}$$
$$\text{where } f_o(n) = \left[ f(n) - f(n+N/2) \right] T_N(n)$$
$$\text{and } T_N(n) = e^{-2\pi j n / N}$$
The process of dividing the frequency components into even and odd parts is what gives this algorithm its name 'Decimation In Frequency'. If N is a power of 2, we can apply this method recursively until we get to the trivial 1 point transform. The factors T_N(n) are conventionally referred to as 'twiddle factors'.
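To make the notation concrete, here is a minimal Python sketch (mine, not from the original article) of the direct DFT definition and the twiddle factor T_N(n); the helper names T and dft_direct are illustrative, and the later sketches in this section assume them:

import cmath

def T(N, n):
    # Twiddle factor T_N(n) = exp(-2*pi*j*n/N)
    return cmath.exp(-2j * cmath.pi * n / N)

def dft_direct(f):
    # Direct O(N^2) evaluation of F(k) = sum_n f(n)*exp(-2*pi*j*n*k/N),
    # used only as a reference to check the FFT sketches below.
    N = len(f)
    return [sum(f[n] * T(N, n * k) for n in range(N)) for k in range(N)]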

A Recursive DIF FFT Routine.

Given the above results, we can now have a 'first stab' at a recursive routine to implement this algorithm (in a hypothetical Pascal like programming language which supports complex numbers and allows arrays as function arguments and results):
FUNCTION DIF(N,f);
LOCAL N',n,fe,fo,Fe,Fo,k',F;
IF N==1
   THEN RETURN(f); {trivial if N==1}
   ELSE BEGIN      {perform 2 sub-transforms}
        N':=N/2; {size of sub-transforms}
        FOR n:=0 TO (N'-1) DO {perform N' DIF 'butterflies'}
          BEGIN
          fe[n]:= f[n]+f[n+N'];         {even subset}
          fo[n]:=(f[n]-f[n+N'])*T(N,n); {odd  subset}
          END;
        Fe:=DIF(N',fe); {even k}
        Fo:=DIF(N',fo); {odd  k}
        FOR k':=0 TO (N'-1) DO
          BEGIN
          F[2*k'  ]:= Fe[k']; {even k}
          F[2*k'+1]:= Fo[k']; {odd  k}
          END;
        RETURN(F);
        END;
This is the simplest form of DIF implementation and directly reflects the mathematical derivation of the algorithm. The process of calculating the fe and fo components is conventionally referred to as a (DIF) 'butterfly', and is the basic primitive operation of the FFT. (Apparently, this name was acquired from certain diagrammatic representations of this operation which resemble butterflies.)
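For concreteness, the routine above transliterates almost line for line into a real language; the following is only a sketch in Python (my own rendering, assuming the T helper defined earlier, with plain lists standing in for the local arrays):

def dif_recursive(f):
    # Recursive radix-2 DIF FFT; len(f) must be a power of 2.
    N = len(f)
    if N == 1:
        return list(f)                              # trivial if N == 1
    Nd = N // 2                                     # N' = size of sub-transforms
    fe = [ f[n] + f[n + Nd]            for n in range(Nd)]  # even subset
    fo = [(f[n] - f[n + Nd]) * T(N, n) for n in range(Nd)]  # odd  subset
    Fe = dif_recursive(fe)                          # even k
    Fo = dif_recursive(fo)                          # odd  k
    F = [0j] * N
    for kd in range(Nd):
        F[2 * kd]     = Fe[kd]                      # even k = 2k'
        F[2 * kd + 1] = Fo[kd]                      # odd  k = 2k'+1
    return F

# e.g. dif_recursive([1, 2, 3, 4]) should agree with dft_direct([1, 2, 3, 4])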

DIF FFT Algorithmic Complexity (Why it's fast).

The algorithmic complexity of an FFT is usually quantified in terms of the total number of butterfly operations performed. Let this number be C(p) for a 2^p point transform. Looking at the DIF routine above, it is easy to see that C(p) must satisfy the following recurrence relation:
$$C(p) = 2\, C(p-1) + 2^{p-1}, \qquad C(0) = 0$$
This has solution:
$$C(p) = p\, 2^{p-1}$$
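If the closed form is not obvious, it drops out by dividing the recurrence through by 2^p, which turns it into a simple arithmetic progression:

$$\frac{C(p)}{2^{p}} = \frac{C(p-1)}{2^{p-1}} + \frac{1}{2}, \quad C(0) = 0 \;\Rightarrow\; \frac{C(p)}{2^{p}} = \frac{p}{2} \;\Rightarrow\; C(p) = p\, 2^{p-1}$$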
or, in terms of N (=2^p):
$$C = \frac{N}{2} \log_2 N$$
Dropping the constant scaling factors (including the log base), we get an algorithmic complexity of O(N.logN).

A Recursive 'In Place' DIF FFT Routine.

If you coded something like the above routine in a real programming language it would work fine. However, from an efficiency point of view it is somewhat less than ideal. Allocating local arrays on the stack will make this implementation fairly memory hungry, and the last loop in the above routine performs no useful numerical computation; it simply re-arranges values previously calculated.
What if the last loop were somehow replaced by a simple concatenation of the results of the two sub-transforms? Say we put all the even k samples in the top half of the result and all the odd k samples in the bottom half of the result. Treating k as a binary number, the least significant bit of k determines which 'half' of the result array the corresponding value resides in: the least significant bit of k becomes the most significant bit of the corresponding (binary) array index. Because the function is recursive, this bit swapping will occur for each sub-transform, and so on. That is, the array index is determined by bit reversing k. Doing this won't save any time unless we can arrange that the concatenation operation is actually performed implicitly by each sub-transform. This suggests that an 'in place' algorithm should be used (one which performs all necessary operations in a single array, rather than allocating temporary arrays on the stack). Looking at the form of the first loop in the above routine indicates how to do this: put the 'even' subset values back in f[n] (the top half), put the 'odd' subset values back in f[n+N'] (the bottom half), then perform the DIF operation on each half. Pulling all these ideas together, we get a more efficient (recursive) in-place DIF routine:
{Perform in place DIF of N points starting at position BaseE
 DIF(0,N,f) performs DIF FFT on entire array f, N= size of f
}
PROCEDURE DIF(BaseE,N, VAR f); {f is an external array}
LOCAL N',BaseO,n,e,o;
IF N==1
   THEN {do nothing}
   ELSE BEGIN
        N':=N>>1; {shift right to get size of sub-transforms}
        BaseO:=BaseE+N'; {split block into 2 halves}
        FOR n:=0 TO (N'-1) DO
          BEGIN
          e:= f[BaseE+n]+f[BaseO+n];
          o:=(f[BaseE+n]-f[BaseO+n])*T(N,n);
          f[BaseE+n]:=e;
          f[BaseO+n]:=o;
          END;
        DIF(BaseE,N',f); {even sub-transform}
        DIF(BaseO,N',f); {odd  sub-transform}
        END;
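A Python rendering of this in-place version might look something like the sketch below (my names, not the original's; the T helper from earlier is assumed, and the output is left in bit reversed order):

def dif_in_place(f, base=0, N=None):
    # In-place recursive DIF FFT on f[base .. base+N-1].
    # Call as dif_in_place(f) for the whole array; output is in
    # bit reversed order.
    if N is None:
        N = len(f)
    if N == 1:
        return                                      # do nothing
    Nd = N >> 1                                     # size of sub-transforms
    base_o = base + Nd                              # split block into 2 halves
    for n in range(Nd):
        e =  f[base + n] + f[base_o + n]
        o = (f[base + n] - f[base_o + n]) * T(N, n)
        f[base + n]   = e                           # even subset, top half
        f[base_o + n] = o                           # odd  subset, bottom half
    dif_in_place(f, base,   Nd)                     # even sub-transform
    dif_in_place(f, base_o, Nd)                     # odd  sub-transform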
This version of the DIF routine is a little simpler and more efficient than the first, but has the disadvantage that the output is in 'jumbly' (bit reversed) order. This may or may not be a serious problem, depending on what the output is to be used for and whether or not your processor/programming language supports bit reversed addressing (most DSP's do). If bit reversed addressing is not available then you may need to produce a bit reversed index look up table to access the output.
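If you do need such a table, it is easy to build; a possible sketch (the function name is mine):

def bit_reverse_table(p):
    # r[k] is the p-bit reversal of k, for k = 0 .. 2^p - 1.
    N = 1 << p
    r = [0] * N
    for k in range(N):
        x, y = k, 0
        for _ in range(p):
            y = (y << 1) | (x & 1)                  # shift bits out of x into y
            x >>= 1
        r[k] = y
    return r

# Reading the bit reversed output back in natural order:
#   r = bit_reverse_table(p);  F = [f[r[k]] for k in range(1 << p)]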
It is worth noting that this is the simplest formulation of the 'in-place' DIF algorithm. It takes normal order input and generates bit reversed order output. However, this is not fundamental to the algorithm; it is merely a consequence of the simplest implementation. It is possible to write a reverse order DIF FFT which takes bit reversed order input and generates normal order output. This requires the DIF FFT routine to be amended in two ways:
  1. Passing an additional parameter which specifies the spacing between elements in each sub-array to be transformed. In the simple code above, this is always 1. In the reverse order FFT, this will start at 1 and double for each sub-transform. This parameter is also a useful feature for multi-dimensional transforms.
  2. Since the input to each sub-transform is now in bit reversed order, the twiddle factors must also be used in bit reversed order. This isn't difficult if the twiddle factors are taken from a table in bit reversed order, or if bit reversed addressing is available.
This modified algorithm could be used to implement an inverse DIF FFT, but it's probably simpler to use the DIT algorithm presented in the next section.

The 'Classic' Non-Recursive DIF FFT Routine.

For many processors/languages a recursive routine is not attractive, because of the overhead incurred by a procedure call. Be careful though: in reality an aversion to recursion could cost performance (see the note about cache efficiency). There is also a particular problem with DSP's, which often have small hardware stacks, so deeply recursive routines can cause stack overflow. It is fairly easy to see that the above routine can be flattened into nested loops, to yield something resembling the 'classic' (non recursive) DIF FFT. Each DIF uses 2 'half size' DIF's, which in turn will use 4 'quarter size' DIF's, and so on; the last (non-trivial) DIF will be for size 2. So, keeping notation consistent with the above recursive DIF, we get 3 nested loops:
  • Pass Loop: An N (=2^p) point transform will perform p 'passes', indexed by P=0..p-1.
  • Block Loop: Pass P will operate on B_P (=2^P) sub-blocks, each of size N_P (=N/B_P=2^(p-P)), indexed by b=0..B_P-1.
  • Butterfly Loop: Each sub-block operation will perform N'_P (=N_P/2=2^(p-P-1)) butterflies, indexed by n=0..N'_P-1.
{Perform in place DIF of 2^p points (=size of f)}
PROCEDURE DIF(p,VAR f);
LOCAL Bp,Np,Np',P,b,n,BaseE,BaseO,e,o;
BEGIN {DIF}
{initialise pass parameters}
Bp:=1;    {No. of blocks}
Np:=1<<p; {No. of points in each block}  
{perform p passes}
FOR P:=0 TO (p-1) DO
  BEGIN {pass loop}
  Np':=Np>>1; {No. of butterflies}
  BaseE:=0;   {Reset even base index}
  FOR b:=0 TO (Bp-1) DO
    BEGIN {block loop}
    BaseO:=BaseE+Np'; {calc odd base index}
    FOR n:=0 TO (Np'-1) DO
      BEGIN {butterfly loop}
      e:= f[BaseE+n]+f[BaseO+n];
      o:=(f[BaseE+n]-f[BaseO+n])*T(Np,n);
      f[BaseE+n]:=e;
      f[BaseO+n]:=o;
      END; {butterfly loop}
    BaseE:=BaseE+Np; {start of next block}
    END; {block loop}
  {calc parameters for next pass}
  Bp:=Bp<<1; {twice as many blocks}
  Np:=Np>>1; {half as many points in each block}
  END; {pass loop}
END; {DIF}
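As a cross-check, the same triple-loop structure might be sketched in Python as follows (my rendering, assuming the T helper from earlier; like the pseudocode, it leaves the output in bit reversed order):

def dif_classic(f):
    # Non-recursive in-place DIF FFT of 2^p points.
    N = len(f)
    p = N.bit_length() - 1                          # N = 2^p
    Bp, Np = 1, N                                   # no. of blocks, points per block
    for _ in range(p):                              # pass loop
        Npd = Np >> 1                               # no. of butterflies per block
        base_e = 0                                  # reset even base index
        for _ in range(Bp):                         # block loop
            base_o = base_e + Npd                   # odd base index
            for n in range(Npd):                    # butterfly loop
                e =  f[base_e + n] + f[base_o + n]
                o = (f[base_e + n] - f[base_o + n]) * T(Np, n)
                f[base_e + n] = e
                f[base_o + n] = o
            base_e += Np                            # start of next block
        Bp <<= 1                                    # twice as many blocks
        Np >>= 1                                    # half as many points per block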
That's as far as we'll go on DIF code. Of course, if you look, there's still plenty of scope for further optimisation of this routine. We won't do it here for fear of obscuring the code even more. The remainder of this section is devoted to a few observations about the practical details of the twiddle factors, which so far have been largely neglected.

Twiddle Factor Tricks.

Recall the form of the twiddle factors:
$$T_N(n) = e^{-2\pi j n / N}, \qquad n = 0 \ldots N/2 - 1$$
In practice, calculating these will require time-consuming COS's and SIN's, so you certainly don't want to do this for every butterfly (if you do, your FFT will be anything but fast). What you need is a look up table (an array of size N/2) which is calculated just once. The important thing to realise is that you don't need a separate table for each DIF pass (N halves each pass). Instead you use the same table each pass and introduce an additional 'twiddle_step_size' parameter which starts at 1 and doubles after each pass. The twiddle factor used has index n*twiddle_step_size. (The array bounds will never be exceeded, because N halves on each pass and the maximum value of n is N/2-1.)
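One possible sketch of this table trick in Python (again with names of my own choosing; the table lookup replaces the per-butterfly call to T(Np,n) in the classic routine above):

import cmath

def twiddle_table(N):
    # One table of N/2 twiddle factors, computed once: W[n] = exp(-2*pi*j*n/N)
    return [cmath.exp(-2j * cmath.pi * n / N) for n in range(N // 2)]

# Inside the classic routine, with W = twiddle_table(N) built once:
#   step = 1                                  (twiddle_step_size, before the pass loop)
#   ...
#   o = (f[base_e + n] - f[base_o + n]) * W[n * step]   (instead of T(Np, n))
#   ...
#   step <<= 1                                (double after each pass)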
The next trick is to notice that often the twiddle factor will be either +1 or -j (it will never be -1 or +j). In such cases there is no need to do a full blown complex multiply (remember this requires 4 real multiplies and 2 real adds); simple addition and subtraction will suffice. We shall call butterflies that use one or other of these twiddle factors 'trivial butterflies'. Specifically:
$$T_N(0) = 1, \qquad T_N(N/4) = e^{-\pi j / 2} = -j$$
How often does this occur? Consider the last DIF FFT pass (N=2). In this case T_2(0)=1 is the only twiddle factor used, so the entire last pass consists of trivial butterflies. Similarly, the first butterfly of any sub-block will use T_N(0)=1. In general, in pass P (P=0..p-1), there are B_P (=2^P) sub-blocks and therefore 2^P butterflies which use a twiddle factor of +1.
In the penultimate (N=4) pass the butterflies will use either T_4(0)=1 or T_4(1)=-j, so this pass also uses only trivial butterflies. As in the above case, in each of the previous (N_P>4, P<p-2) passes, there are B_P (=2^P) sub-blocks and therefore 2^P butterflies which use a twiddle factor of -j (at n=N_P/4).
We can now calculate exactly how many of these trivial butterflies Triv(p) there are in a typical DIF FFT for sizes >4 (p>2). This is the sum of the number of butterflies in the last 2 passes (the easy bit) and the number of trivial butterflies in all the other passes (the hard bit, but the result in Annex B is useful here):

$$\mathrm{Triv}(p) = 2 \cdot 2^{p-1} + \sum_{P=0}^{p-3} 2 \cdot 2^{P}$$
$$\mathrm{Triv}(p) = 2^{p} + 2\,(2^{p-2} - 1)$$
$$\mathrm{Triv}(p) = 3 \cdot 2^{p-1} - 2$$
The total number of butterflies was calculated earlier:
$$C(p) = p\, 2^{p-1}$$
So for a 2^p point DIF FFT, the proportion of trivial butterflies is about 3/p. If this shortcut is exploited, then large FFT's benefit less than small FFT's. For example, in a 512 point (p=9) DIF FFT about 1/3 of the butterflies will be trivial.
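The count is easy to check by brute force with a throwaway sketch like this (a hypothetical helper that just enumerates the twiddle indices pass by pass):

def count_trivial(p):
    # Count butterflies whose twiddle factor is +1 (n = 0) or -j (n = Np/4)
    # over all p passes of a 2^p point DIF FFT.
    count = 0
    Np = 1 << p                                     # block size, halves each pass
    for _ in range(p):
        blocks = (1 << p) // Np                     # B_P sub-blocks this pass
        for n in range(Np // 2):
            if n == 0 or 4 * n == Np:
                count += blocks
        Np >>= 1
    return count

# For p > 2, count_trivial(p) == 3 * 2**(p - 1) - 2, e.g. count_trivial(9) == 766.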
The non-recursive routine presented above can be modified to exploit trivial butterflies by doing something like this:
  • Make the last 2 passes 'specials' which have the appropriate twiddle factors (+1 or -j) 'hard wired' into the code.
  • For the remaining passes (N_P>4, N'_P>2), break the inner butterfly loop into two halves, each of which has a 'special' as its first butterfly. The special butterfly in the first half should use +1 as the twiddle factor; the special butterfly in the second half should use -j.
Of course, exactly how much time this will really save in practice is greatly dependent on the processor used.
As a final note, we can observe that there are further opportunities to exploit which we haven't considered:
$$T_N(N/8) = e^{-\pi j / 4} = \frac{1 - j}{\sqrt{2}}, \qquad T_N(3N/8) = e^{-3\pi j / 4} = \frac{-(1 + j)}{\sqrt{2}}$$
These aren't quite so trivial (should we call them semi-trivial?), but there are still potentially faster ways of using them than a full blown complex multiplication.
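For instance, multiplying by (1-j)/√2 costs only 2 real multiplies (both by the constant 1/√2) and 2 real adds, instead of the full 4 multiplies and 2 adds, since:

$$(a + jb)\,\frac{1 - j}{\sqrt{2}} = \frac{a + b}{\sqrt{2}} + j\,\frac{b - a}{\sqrt{2}}$$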

