AN2094 Freescale Semiconductor / Motorola, AN2094 Datasheet - Page 27

ITU-T G.729 Implementation on StarCore SC140
After these optimizations, the ACELP_Codebook() function ran 1.2 times faster than the initial version, but it now
consumed 45 percent of the optimized C encoder's execution time instead of the initial 31 percent. Thus, the decision
was made to apply algorithmic changes directly to the original ACELP_Codebook() function and its descendants.
4.2.2 Algorithmic Changes
The original code contained many pointers, indices, and variables. For example, the original search block used
19 pointers and at least 18 variables inside four nested loops, which required many time-consuming save and
restore operations in both C and assembly. Thus, reducing the number of variables became a primary focus of
the algorithm change phase. The major modifications to the algorithms were as follows:
• Include the sign information in the correlation matrix at the time it is built. In the original code, the sign
information is not introduced until the D4i40_17() function. We modified Cor_h_X() to compute
the sign information and reversed the sequence of the first two function calls so that Cor_h_X() is
called first. Thus, the sign information is now computed in Cor_h_X() and passed to Cor_h()
through a separate vector. The function Cor_h() now includes the sign information as it builds the
correlation matrix, so the sign computations in D4i40_17() are no longer necessary.
• Rearrange the correlation matrix and correlation vector so that the correlation vector is addressed
sequentially using the same pointer, thus reducing the number of pointers required from 19 to 13.
• Combine the two 16-bit values E and C in the innermost loop into a single 32-bit value, E:C. In this
way the two pairs of 16-bit values are stored in two registers instead of four, and computations are
performed more efficiently using 32-bit multiplication instructions.
After these changes, the ACELP_Codebook() function was 1.43 times faster than the initial version.
4.2.3 Reoptimizing C After Algorithmic Changes
Because the algorithmic changes were applied to the reference G.729 C code, a new function-level C optimization
was needed. The following optimizations were included in both the C and assembly versions:
• All vectors were aligned on 8-byte boundaries to enable multiple data transfers.
• The max instruction was used instead of the classic compare function.
• Split summation and multisample were applied to inner loops.
• Two or more data variables were combined into a single variable.
• Multiplication was performed by right-shifting.
• The else branch was removed from if(…)…else statements.
The C optimizations after the algorithmic changes focused primarily on D4i40_17() because it was by far the
most time-consuming function. Programming tips used to speed up the C implementation included:
• Replace function calls with operators when integer values are used. For example, replace
time = sub(time,1) with time--.
• Merge two lower-level loops into a single loop.
• Multisample by two instead of four in all loops.
• Use variables instead of constants to force the C compiler to use all four ALUs. For example, in the
initial C code, the constant 1 in statements such as L = L_mac(L, v[i], 1) results in assembly
instructions such as mac #1,Dx,Dy, which cannot be grouped into a single execution set because of
their instruction size. Replacing the constant 1 with the variable one, initialized with the value 1,
enables the use of a statement such as L = L_mac(L, v[i], one).
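The E:C combination listed among the algorithmic changes can be pictured with a small sketch. The helper names pack_ec(), unpack_e(), and unpack_c() are hypothetical and do not come from the G.729 reference code, and placing E in the upper half is an assumption based only on the E:C notation. The sketch shows the storage layout, not the 32-bit multiplication sequence used in the optimized search loop.

```c
#include <stdint.h>

/* Hypothetical helpers, not part of the G.729 reference code. */

/* Pack E into the upper 16 bits and C into the lower 16 bits.
   Unsigned arithmetic keeps the shift well defined for negative inputs. */
static uint32_t pack_ec(int16_t e, int16_t c)
{
    return ((uint32_t)(uint16_t)e << 16) | (uint32_t)(uint16_t)c;
}

/* Convert a raw 16-bit pattern back to a signed value without relying on
   implementation-defined narrowing conversions. */
static int16_t to_signed16(uint16_t u)
{
    return (int16_t)(u < 0x8000u ? (int32_t)u : (int32_t)u - 0x10000);
}

/* Recover E from the upper half of the packed word. */
static int16_t unpack_e(uint32_t ec)
{
    return to_signed16((uint16_t)(ec >> 16));
}

/* Recover C from the lower half of the packed word. */
static int16_t unpack_c(uint32_t ec)
{
    return to_signed16((uint16_t)(ec & 0xFFFFu));
}
```

With this layout, two (E, C) pairs occupy two 32-bit words rather than four 16-bit variables, which is the register saving the text describes.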
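The two source-level tips that mention sub() and L_mac() can be illustrated as follows. The sub() and L_mac() bodies here are simplified stand-ins written for this sketch, assuming the usual saturating semantics of the ITU-T basic operators; the loop functions and the demo_v array are hypothetical. The sketch only shows that the rewritten forms give the same results at the C level. The effect on instruction grouping depends on the StarCore compiler and is not reproduced here.

```c
#include <stdint.h>

typedef int16_t Word16;
typedef int32_t Word32;

/* Simplified stand-in for the saturating 16-bit subtract. */
static Word16 sub(Word16 a, Word16 b)
{
    int32_t d = (int32_t)a - (int32_t)b;
    if (d > INT16_MAX) d = INT16_MAX;
    if (d < INT16_MIN) d = INT16_MIN;
    return (Word16)d;
}

/* Clamp a 64-bit intermediate to the 32-bit range. */
static Word32 sat32(int64_t x)
{
    if (x > INT32_MAX) return INT32_MAX;
    if (x < INT32_MIN) return INT32_MIN;
    return (Word32)x;
}

/* Simplified stand-in for L_mac: acc + 2*a*b, saturating the product
   first and then the sum. */
static Word32 L_mac(Word32 acc, Word16 a, Word16 b)
{
    Word32 prod = sat32(2 * (int64_t)a * (int64_t)b);
    return sat32((int64_t)acc + prod);
}

/* Counter updated through the function call. */
static int count_with_sub(Word16 time)
{
    int n = 0;
    while (time > 0) { time = sub(time, 1); n++; }
    return n;
}

/* Same loop with the operator form; safe because time never saturates. */
static int count_with_operator(Word16 time)
{
    int n = 0;
    while (time > 0) { time--; n++; }
    return n;
}

/* Hypothetical sample data for the accumulation loops. */
static const Word16 demo_v[4] = { 1, -2, 3, 4 };

/* Accumulation with the literal constant 1. */
static Word32 sum_with_constant(const Word16 *v, int n)
{
    Word32 L = 0;
    for (int i = 0; i < n; i++) L = L_mac(L, v[i], 1);
    return L;
}

/* Same accumulation with the constant held in a variable named one. */
static Word32 sum_with_variable(const Word16 *v, int n)
{
    Word16 one = 1;
    Word32 L = 0;
    for (int i = 0; i < n; i++) L = L_mac(L, v[i], one);
    return L;
}
```

The operator form is only equivalent to sub() while the value stays inside the 16-bit range, which is why the tip restricts it to plain integer quantities such as loop counters.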
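Split summation and multisample, both named in the optimization lists, can be sketched in plain C. All function names, the pair32 type, and the demo arrays are hypothetical, and the code uses ordinary non-saturating integer arithmetic rather than the codec's basic operators. It is an illustration of the two loop shapes under those assumptions, not the code used in the application note.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sample data; demo_h is longer so that lag 1 stays in range. */
static const int16_t demo_x[4] = { 1, -2, 3, 4 };
static const int16_t demo_h[6] = { 5, 6, -7, 8, 9, 10 };

/* Reference loop: one accumulator, one product per iteration. */
static int32_t dot_plain(const int16_t *x, const int16_t *h, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)x[i] * h[i];
    return acc;
}

/* Split summation: even and odd products feed separate accumulators, so
   the two multiply-accumulates in one iteration do not depend on each
   other. Assumes n is even. */
static int32_t dot_split(const int16_t *x, const int16_t *h, int n)
{
    int32_t acc0 = 0, acc1 = 0;
    assert(n % 2 == 0);
    for (int i = 0; i < n; i += 2) {
        acc0 += (int32_t)x[i] * h[i];
        acc1 += (int32_t)x[i + 1] * h[i + 1];
    }
    return acc0 + acc1;
}

typedef struct { int32_t y0, y1; } pair32;

/* Multisample by two: the outputs for lag k and lag k+1 are computed in
   the same pass, so each x[i] is read once and used twice.
   h must hold at least n + k + 1 elements. */
static pair32 corr_multisample2(const int16_t *x, const int16_t *h, int n, int k)
{
    pair32 y = { 0, 0 };
    for (int i = 0; i < n; i++) {
        int32_t xi = x[i];
        y.y0 += xi * h[i + k];
        y.y1 += xi * h[i + k + 1];
    }
    return y;
}
```

With saturating accumulation, changing the order of the additions can change the result when an intermediate sum saturates, so a split-summation rewrite of fixed-point codec code has to be checked against the reference output.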
ITU-T G.729 Implementation on the StarCore™ SC140/SC1400 Cores, Rev. 1