AN2094 Freescale Semiconductor / Motorola, AN2094 Datasheet

AN2094

Manufacturer Part Number
AN2094
Description
ITU-T G.729 Implementation on StarCore SC140
Manufacturer
Freescale Semiconductor / Motorola
Datasheet
ITU-T G.729 Implementation on the StarCore™ SC140/SC1400 Cores
By Bogdan Costinescu, Razvan Ungureanu, Madalin Stoica, Emilian Medve, Radu Preda, Mugur Alexiu, and Costel Ilas
Freescale Semiconductor Application Note
© Freescale Semiconductor, Inc., 2001, 2004. All rights reserved.

This application note illustrates the process of optimizing the C source code of applications written for Freescale Semiconductor DSPs based on the StarCore™ SC140/SC1400 cores while ensuring that the bit-level output of the modified software is identical to that of the original application (bit-exactness). The optimization steps and the results for this application are discussed in detail. The application chosen for this purpose is the vocoder defined by the ITU-T G.729 Recommendation [1], but the principles illustrated can be applied to the C source code for any application.

StarCore SC140 features give powerful support and allow advanced implementations of DSP algorithms. Characteristics of the StarCore SC140/SC1400 cores include:
• Integer and fractional 16-bit data types supported in hardware
• Instruction parallelism: up to six instructions per cycle (four DALU and two AGU)
• Suitable architecture for a high-performance compiler
• Efficient support for double-precision arithmetic
• Single-cycle operation for almost all instructions, including MAC
• Zero-overhead hardware loops with up to four levels of nesting
• Modulo addressing for circular buffers (delay lines, sample buffers)
• Multiple operands support via MOVE in single instructions: up to 64 bytes for aligned data
CONTENTS
1 G.729 Recommendation for Speech Compression ... 2
1.1 Assessing Speech Quality ... 2
1.2 Technical Overview of ITU-T G.729 ... 2
2 Optimization Process ... 5
2.1 Test Vectors and Development Tools ... 5
2.2 Porting G.729 Code to the SC140 ... 6
2.3 Project-Level Optimizations ... 7
2.4 Function-Level C Optimization ... 11
2.5 Algorithm Changes ... 14
2.6 Function Implementation in Assembly ... 16
3 Results ... 18
3.1 Measurement Techniques ... 18
3.2 Performance Estimation ... 20
3.3 Project Milestones ... 20
3.4 Execution Time ... 20
3.5 Code Size ... 22
3.6 Data Size ... 23
3.7 Testing ... 24
4 Details of Selected Functions ... 24
4.1 Optimizations in Norm_Corr() ... 24
4.2 Optimizations in ACELP_Codebook() ... 26
4.3 Optimizations in Lag_max() ... 28
5 Implementation Strategies ... 30
5.1 Theoretical Background ... 30
5.2 Project Implementation ... 31
5.3 Project Results ... 32
6 Conclusions ... 33
7 References ... 34
APPENDIXES
A Selected C and Pseudocode Listings ... 36
B Selected Assembler Operations ... 40
C Script Sources ... 45
Rev. 1, 12/2004
AN2094

AN2094 Summary of contents

Page 2

• Stack optimization for improved multi-tasking
• Peak performance of 1200 DSP-MIPS at 300 MHz

This application note is written for SC140 programmers, system engineers, tool developers, and project managers.

1 G.729 Recommendation for Speech ...

Page 3

[Figure label: Excitation Codebook]

1.2.1 Encoding
The G.729 encoding scheme is based on a code-excited linear-prediction model. In this model, the locally decoded signal is compared with the original signal. The filter parameters are then selected to minimize the mean-square weighted error ...

Page 4

[Block diagram labels: Fixed Codebook, Adaptive Codebook, Parameter Encoding, LPC Info]

1.2.2 Decoding
The G.729 decoder synthesizes the output speech samples from the received bitstream, as shown in Figure 3.

Figure 3. Principle of ...

Page 5

... The development tools included the StarCore SC140 Enterprise C compiler and Metrowerks® CodeWarrior® for StarCore.

Description
Data type definitions, introduction of intrinsic functions, multichannel transformations ...

Page 6

The PC compiler used to check the modifications to the C code and generate the test vectors for unit testing was Microsoft Visual C++ 6.0. The project was developed on the Windows 2000 Professional platform. Hardware tests were ...

Page 7

This routine requires four cycles to execute. However, this approach forces the compiler to use a dedicated DALU register. A more acceptable solution is to use the intrinsic functions provided by the C compiler, as shown in Example 2. A= ...

Page 8

2.3.1 Function Inlining
Function inlining (also referred to as ‘inline expansion’ or simply ‘inlining’) is the process of replacing a function call with the body of the function itself. This optimization technique improves execution time by eliminating function-call ...

Page 9

Manual inlining is recommended when several identical calls are made in sequence, and enables the programmer to fully exploit potential parallelism. By replacing a function call with the C statements of the function, the programmer places in parallel, by hand, ...
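The effect of manual inlining can be sketched in portable C. This is a minimal illustration, not code from the vocoder: the helper add_sat() and the two mix_*() routines are hypothetical names, and the saturating add stands in for whatever small function is being inlined. Both versions produce identical output, so bit-exactness is preserved.

```c
#include <stdint.h>

typedef int16_t Word16;
typedef int32_t Word32;

/* Hypothetical small helper: 16-bit saturating addition. */
static Word16 add_sat(Word16 a, Word16 b)
{
    Word32 s = (Word32)a + b;
    if (s > 32767) s = 32767;
    if (s < -32768) s = -32768;
    return (Word16)s;
}

/* Version with one function call per sample. */
void mix_called(const Word16 *x, const Word16 *y, Word16 *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = add_sat(x[i], y[i]);
}

/* Manually inlined version: the helper body is written in the loop,
 * so no call overhead remains and the statements of neighbouring
 * iterations are visible to the compiler for scheduling. */
void mix_inlined(const Word16 *x, const Word16 *y, Word16 *out, int n)
{
    for (int i = 0; i < n; i++) {
        Word32 s = (Word32)x[i] + y[i];
        if (s > 32767) s = 32767;
        if (s < -32768) s = -32768;
        out[i] = (Word16)s;
    }
}
```

Comparing the two output buffers on the existing test vectors is a cheap way to confirm that an inlining step changed nothing at the bit level.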

Page 10

When internal pointers are used, conflicts can arise when two different functions require the same array to be aligned differently. Tests should be run to identify the type of alignment that gives better overall performance. When a compromise ...

Page 11

Profile data was obtained after the project-level optimizations were implemented, focusing on the total number of cycles per application step. The new profile data served as a baseline for all further analysis.

2.4 Function-Level C Optimization
The primary focus of ...

Page 12

2.4.3.1 Multisample
Multisampling is a pipelining technique to process multiple samples simultaneously. It takes full advantage of the SC140 multiple-ALU architecture to maximize parallel operation of the execution units. In addition, this technique preserves bit-exactness and reduces the ...
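A minimal sketch of the multisample idea, written in portable C rather than with the compiler intrinsics: four correlation outputs are accumulated per pass so that each loaded operand is reused by several multiply-accumulates. The routine names, the correlation kernel, and the L_mac() stand-in are illustrative assumptions; the real intrinsic is provided by the compiler. Each accumulator sees its products in the same order as the reference loop, which is why the saturating result stays bit-exact.

```c
#include <stdint.h>

typedef int16_t Word16;
typedef int32_t Word32;

/* Portable stand-in for the saturating fractional multiply-accumulate. */
Word32 L_mac(Word32 acc, Word16 a, Word16 b)
{
    int64_t p = (int64_t)a * b * 2;       /* Q15 x Q15 -> Q31 */
    if (p > INT32_MAX) p = INT32_MAX;     /* only when a == b == -32768 */
    int64_t s = (int64_t)acc + p;
    if (s > INT32_MAX) s = INT32_MAX;
    if (s < INT32_MIN) s = INT32_MIN;
    return (Word32)s;
}

/* Reference: one output per outer iteration.
 * y must hold at least n + nlags - 1 values. */
void corr_reference(const Word16 *x, const Word16 *y,
                    int n, int nlags, Word32 *out)
{
    for (int lag = 0; lag < nlags; lag++) {
        Word32 acc = 0;
        for (int j = 0; j < n; j++)
            acc = L_mac(acc, x[j], y[lag + j]);
        out[lag] = acc;
    }
}

/* Multisample: four outputs per outer iteration.
 * Assumes nlags is a multiple of 4 and n >= 1. */
void corr_multisample(const Word16 *x, const Word16 *y,
                      int n, int nlags, Word32 *out)
{
    for (int lag = 0; lag < nlags; lag += 4) {
        Word32 c0 = 0, c1 = 0, c2 = 0, c3 = 0;
        Word16 y0 = y[lag], y1 = y[lag + 1], y2 = y[lag + 2];
        for (int j = 0; j < n; j++) {
            Word16 xj = x[j];             /* loaded once, used four times */
            Word16 y3 = y[lag + j + 3];   /* the only new y value needed */
            c0 = L_mac(c0, xj, y0);
            c1 = L_mac(c1, xj, y1);
            c2 = L_mac(c2, xj, y2);
            c3 = L_mac(c3, xj, y3);
            y0 = y1; y1 = y2; y2 = y3;    /* slide the operand window */
        }
        out[lag] = c0; out[lag + 1] = c1;
        out[lag + 2] = c2; out[lag + 3] = c3;
    }
}
```

The four independent accumulators map naturally onto four ALUs, while the inner loop needs only two new loads per iteration instead of eight.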

Page 13

Loop Merging
Combining two or more loops into a single loop loads the ALUs more efficiently and reduces the number of AGU operations, as illustrated in Code Example ... The merged loop still does not use all available ...
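Loop merging can be illustrated with a small, hypothetical pair of routines (not taken from the vocoder source): two passes over the same buffer, one computing an energy and one a cross-correlation, are fused into a single pass. Plain 64-bit accumulation is used here instead of the saturating intrinsics to keep the sketch short.

```c
#include <stdint.h>

/* Two separate loops: x[] is traversed and loaded twice. */
void stats_separate(const int16_t *x, const int16_t *y, int n,
                    int64_t *energy, int64_t *cross)
{
    int64_t e = 0, c = 0;
    for (int i = 0; i < n; i++)
        e += (int64_t)x[i] * x[i];
    for (int i = 0; i < n; i++)
        c += (int64_t)x[i] * y[i];
    *energy = e;
    *cross = c;
}

/* Merged loop: one traversal, x[i] loaded once, and two independent
 * multiply-accumulates available per iteration. */
void stats_merged(const int16_t *x, const int16_t *y, int n,
                  int64_t *energy, int64_t *cross)
{
    int64_t e = 0, c = 0;
    for (int i = 0; i < n; i++) {
        int16_t xi = x[i];
        e += (int64_t)xi * xi;
        c += (int64_t)xi * y[i];
    }
    *energy = e;
    *cross = c;
}
```

Because each accumulator still receives its terms in the original order, merging of this kind does not change the results.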

Page 14

• Manually inline small functions, such as L_Extract(), that involve multisample operations.
• Do not apply the C operators (*, +) to operations on fractional values that only employ intrinsic functions (L_mult(), L_add()). Also, do not use intrinsic ...

Page 15

It is also useful to examine the relationship between module results and internal computed values. Often the number of values a module computes and stores is significantly larger than the number of values returned. For example, the output of a ...

Page 16

2.5.3 Platform-Dependent Changes
Platform-dependent changes to algorithms are those that reorder and restructure data or reorder and regroup computation blocks to take advantage of the parallel architecture of a particular processor. Such changes, if they are done, prove ...

Page 17

... Therefore the proper approach is to nest hardware loops in reverse order of their indices.

2.6.4 Programming Tips
The following programming tips for assembly optimization are valid for all versions of the development tools; some of these ideas also apply to C optimizations [13].
• ...

Page 18

... Measurement Techniques
Various tools and techniques were used to measure the execution time, code size, and data size of each function.

3.1.1 Execution Time
The execution time of functions and modules was evaluated using the number of simulated cycles spent in the measured unit ...

Page 19

... All tools report the number of cycles to execute each function. For the vocoder project, a simple computation was used to convert the number of cycles to the processing load, measured in million cycles per second (MCPS). The number of MCPS required to encode or decode a frame is obtained by multiplying the measured number of cycles by the number of frames to be processed per second (in G ...
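The cycles-to-MCPS conversion described above is a single multiplication. The sketch below assumes the 10 ms frame length of G.729, that is, 100 frames per second; the function and macro names are illustrative, not taken from the project scripts.

```c
#include <stdint.h>

/* Assumption: G.729 processes one 10 ms frame at a time. */
#define G729_FRAMES_PER_SECOND 100u

/* Convert a measured cycle count per frame into millions of
 * cycles per second (MCPS). */
double cycles_to_mcps(uint32_t cycles_per_frame, unsigned frames_per_second)
{
    return (double)cycles_per_frame * frames_per_second / 1.0e6;
}
```

As a consistency check, a frame that costs 292,900 cycles corresponds to about 29.29 MCPS at 100 frames per second, which matches the figure quoted for the ported version (the per-frame cycle count here is back-computed from that figure, not a separately reported measurement).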

Page 20

... script that performs the analysis is presented in Appendix C. For stack size measurements, it is important to note that the encoder and decoder do not run concurrently. Thus, the stack figure is the maximum of the individual stack ...

Page 21

[Figure labels: Ported version (29.29 MCPS); Project-level opt. (24.7 MCPS); Our target]

The X axis represents the manpower used to achieve each milestone, and the Y axis represents the number of MCPS for ...

Page 22

3.5 Code Size
The evolution in code size as the project progressed is summarized in Figure 5. Again, the X axis represents the manpower used to achieve each milestone. The Y axis represents the code size ...

Page 23

The final implementation in assembly resulted in a substantial decrease in code size. Although only a few functions were implemented in assembly, the code size was reduced to 36.2 KB, which is actually less than the initial code size. This ...

Page 24

One way to decrease the channel data size is to remove the storage area allocated to hold new samples. For example, the speech data structure contains two main parts: old speech, which contains previous frames, and new ...

Page 25

Scale the filtered excitation to avoid overflow and compute the energy of the scaled filtered excitation.
4. For every possible delay between minimum and maximum, compute the normalized correlation vector and modify the excitation for the next iteration.

4.1.1 ...

Page 26

These modifications were first applied to the unoptimized C code to verify bit-exactness. The function was then reoptimized in C, making it 2.8 times faster than the unoptimized version.

4.1.3 Assembly Implementation
Assembly implementation ...

Page 27

• Replace function calls with operators when integer values are used. For example, replace time = sub(time,1) with time--.
• Merge two lower-level loops into a single loop.

After these optimizations, the ACELP_Codebook() function ran 1.2 times faster than the initial ...

Page 28

• Replace the L_shl() function with the << operator.
• Replace the mult() instruction with L_mult() combined with a 16-bit right shift.

After function-level reoptimization the encoder ran 1.77 times faster than the initial version.

4.2.4 ...
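The mult() replacement mentioned above can be checked with portable stand-ins for the two basic operations. The definitions below are a sketch of the usual ITU-T basic-operation semantics, not the compiler intrinsics themselves, and they assume that >> on a negative signed value is an arithmetic shift, as it is on mainstream compilers.

```c
#include <stdint.h>

typedef int16_t Word16;
typedef int32_t Word32;

/* Stand-in for mult(): Q15 x Q15 -> Q15 with saturation.
 * Assumes arithmetic right shift for negative values. */
Word16 mult(Word16 a, Word16 b)
{
    Word32 p = ((Word32)a * b) >> 15;
    return (p > 32767) ? (Word16)32767 : (Word16)p;
}

/* Stand-in for L_mult(): Q15 x Q15 -> Q31 with saturation. */
Word32 L_mult(Word16 a, Word16 b)
{
    Word32 p = (Word32)a * b;
    return (p == 0x40000000) ? INT32_MAX : p * 2;
}

/* The replacement form: L_mult() followed by a 16-bit right shift. */
Word16 mult_via_L_mult(Word16 a, Word16 b)
{
    return (Word16)(L_mult(a, b) >> 16);
}
```

The two forms agree for every input pair, including the saturating case -32768 x -32768, where both return 32767. That equivalence is what allows the substitution without losing bit-exactness.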

Page 29

The reference C code uses ‘greater or equal’ to compare values in its search for the maximum correlation, but there is no such instruction for the SC140. The ‘greater or equal’ comparison was replaced with ‘greater’ so that the compiler ...
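One way such a comparison change can keep the selected index unchanged is to reverse the scan direction at the same time. The sketch below is an assumption about how this could be done, not a description of the authors' code: the function names and the downward scan of the reference form are illustrative. Scanning downward with ‘greater or equal’ and scanning upward with strict ‘greater’ both return the lowest index that holds the maximum.

```c
#include <stdint.h>

/* Reference style (assumed): scan downward, update on >=. */
int argmax_ge_descending(const int32_t *corr, int lo, int hi)
{
    int32_t max = INT32_MIN;
    int best = hi;
    for (int i = hi; i >= lo; i--) {
        if (corr[i] >= max) {
            max = corr[i];
            best = i;
        }
    }
    return best;
}

/* Alternative: scan upward, update only on strict >.
 * best starts at lo so that an all-INT32_MIN input gives the
 * same answer as the reference form. Requires lo <= hi. */
int argmax_gt_ascending(const int32_t *corr, int lo, int hi)
{
    int32_t max = INT32_MIN;
    int best = lo;
    for (int i = lo; i <= hi; i++) {
        if (corr[i] > max) {
            max = corr[i];
            best = i;
        }
    }
    return best;
}
```

Ties are the only inputs where the two comparisons can differ, so test vectors containing repeated maxima are the ones worth checking.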

Page 30

Table 12. Lag_max() Performance Summary
Rows: Initial C version, Optimized C version, Final assembly version
Note: Includes three calls.

These results show that the C compiler generates efficient code when optimization techniques are used. Code generated for the inner ...

Page 31

S is the application performance improvement, P is the percentage of the application run-time taken by the G1 functions, and f is the optimization factor. Given ... computed as ... A common rule of thumb for software performance ...
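The relation among S, P, and f defined above has the shape of Amdahl's law. The following is a reconstruction under that assumption, with P written as a fraction between 0 and 1; the equation itself did not survive in this extract, so the exact form used by the authors may differ.

```latex
S = \frac{1}{(1 - P) + \dfrac{P}{f}}
\qquad\Longleftrightarrow\qquad
P = \frac{1 - \dfrac{1}{S}}{1 - \dfrac{1}{f}}
```

As a purely hypothetical example, if the optimized functions account for P = 0.9 of the run-time and are made f = 3 times faster, then S = 1 / (0.1 + 0.3) = 2.5. The unoptimized 10 percent of the run-time caps the overall gain at 10, no matter how large f becomes.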

Page 32

Compared to the ported version, we achieved an application improvement of 2.28 times for optimized C and 3.47 times for the best assembly implementation. These values translate into a StarCore optimization factor of 2.44 for the optimized C ...

Page 33

Table 13. Performance Versus Number of Assembly-Implemented Functions
Number of Functions

The diamonds in Figure 7 represent a hypothetical variation based on our actual results. In this variation, only those functions which are not eventually ...

Page 34

The algorithmic changes employed on the vocoder project included both architecture-independent modifications and StarCore-specific adaptations. The latter algorithmic changes would not have been necessary if the algorithms were developed specifically for a StarCore implementation from the beginning. Basic algorithmic ...

Page 35

... SC140 DSP Core Reference Manual (order number MNSC140CORE/D).
[9] SC100 C Compiler User’s Manual (order number MNSC100CC/D).
[10] SC100 Assembly Language Tools User’s Manual (order number MNSC100ALT/D).
[11] SC100 Application Binary Interface Reference Manual (order number MNSC100ABI/D).
[12] StarCore Multisample Programming Technique Application Note (order number STCR140MLAN/D) ...

Page 36

Appendix A Selected C and Pseudocode Listings
A.1 Norm_Corr() prototype
void Norm_Corr(Word16 exc[], Word16 xn[], Word16 h[], Word16 L_subfr, Word16 t_min, Word16 t_max, Word16 corr_norm[])
{
#pragma align *exc 8
#pragma align *xn 8
#pragma align *h 8
Word16 ...

Page 37

E=0; for(j=0;j<40;j++) { s_excf[j]=excf[j] >> s_excf[j] * s_excf[j /*loop for every possible period*/ for(i=t_min; i<t_max; i++) { /*compute 1/sqrt[E]}*/ E1 = Inv_sqrt(E); /* Compute correlation between xn[] and ...

Page 38

A.3 Lag_max code
Word16 Lag_max( /* output: lag found */
Word16 signal[], /* input : signal used to compute the open loop pitch */
Word16 L_frame, /* input : length of frame to compute pitch */
Word16 lag_max, /* ...

Page 39

L_mac(c2, rs, sig3 L_mac(c3, rs, sig0 ref_signal[j+2]; sig1 = signal[i+j+5 L_mac(c0, rs, sig2 L_mac(c1, rs, sig3 L_mac(c2, rs, sig0 L_mac(c3, rs, sig1 ref_signal[j+3]; sig2 ...

Page 40

Appendix B Selected Assembler Operations
B.1 32-Bit DPF Operations
Multiplying a 16-bit integer by a 32-bit DPF can be viewed as the multiplication of a Q31 number by a Q15 number, with the result in Q31 format: L_32 = ...
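The Q31 by Q15 multiplication can be sketched in C with a 32-bit value split into a high and a low 16-bit part, where L_32 = hi * 65536 + lo * 2. The routines below are modeled on the L_Extract() and Mpy_32_16() helpers of the ITU-T reference code, but they are written here as portable stand-ins under that assumption and are not copied from the application note. Arithmetic right shift of negative values is assumed.

```c
#include <stdint.h>

typedef int16_t Word16;
typedef int32_t Word32;

static Word32 sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (Word32)v;
}

/* Stand-ins for the saturating basic operations. */
Word32 L_mult(Word16 a, Word16 b) { return sat32((int64_t)a * b * 2); }
Word32 L_mac(Word32 acc, Word16 a, Word16 b)
{
    return sat32((int64_t)acc + L_mult(a, b));
}
Word16 mult(Word16 a, Word16 b)
{
    Word32 p = ((Word32)a * b) >> 15;   /* arithmetic shift assumed */
    return (p > 32767) ? (Word16)32767 : (Word16)p;
}

/* Split a 32-bit value into DPF parts: L_32 = hi*65536 + lo*2,
 * with lo in the range 0..32767 (the lowest bit of L_32 is dropped). */
void L_Extract(Word32 L_32, Word16 *hi, Word16 *lo)
{
    *hi = (Word16)(L_32 >> 16);
    *lo = (Word16)((L_32 >> 1) - (Word32)(*hi) * 32768);
}

/* Multiply a DPF number (Q31) by a 16-bit value n (Q15); result in Q31. */
Word32 Mpy_32_16(Word16 hi, Word16 lo, Word16 n)
{
    Word32 L_32 = L_mult(hi, n);          /* high part contribution */
    return L_mac(L_32, mult(lo, n), 1);   /* add scaled low part */
}
```

For example, splitting 0x12345678 gives hi = 0x1234 and lo = 0x2B3C, and multiplying by n = 16384 (0.5 in Q15) returns 0x091A2B3C, which is half of the original value.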

Page 41

Code Example 13 illustrates how this formula can be applied to the multiplication of two 32-bit DPF numbers in StarCore. In this example, d0 and d1 contain the 32-bit DPF values, d4 contains -2, and the result is stored in ...

Page 42

;* lag_mag-lag_min is multiplied by 2 because variables used are Word16 *; [ push d6 sub #3, two maxima are used ;* initialized with MIN_32 value (-2147483648 move.l #-2147483648,d0 asl d1, move.l d1,r2 ...

Page 43

Determine maximal correlation using two values—one for even and one for odd. ;* Four correlations are ...

Page 44

mac d4,d4,d2 move.f (r2)+n3,d5 move.f (r3)+n3,d4 ] loopend3 add d2,d0,d0 ;* 1/sqrt(energy) jsr _Inv_sqrt tfr d7,d1 ;* **************************************************************************************** *; ;* max = max/sqrt(energy) ;* This result will always be a 16-bit value! ;* **************************************************************************************** *; mpysu d0,d1,d3 and #$fffe0000,d3,d3 ...

Page 45

Appendix C Script Sources
C.1 Perl Script for Generating Cycle Statistics
#!c:/Perl/bin/perl
# DSP Center Romania, 2000
=head1
The script parses the log file generated by coder_worst_case.sc or decoder_worst_case.sc simulator scripts. The output of this script is a table with ...

Page 46

system("simsc100 $cmd_file");
}
# check and open the worst-case simulation log
$log_file_name=$module."_worst_case.log";
open ( fin, $log_file_name ) || die "Can't open log file : $!\n";
# the comparison Perl module is needed to check if the test was passed ...

Page 47

# if the time per last frame is greater than current maximum, keep it if ($cycles > $max_cycles_test) { $max_cycles_test = $cycles run summarizy routine if end of test case if ($count == $frames[$test_count]) { $frames_in_test = $frames[$test_count]-$frames[$test_count-1]; ...

Page 48

C.2 Simulator Command Files for Worst-Case and Average Analysis
C.2.1 Command File for Coder (coder_worst_case.sc)
display off break off output off input off load ..\..\..\..\..\code\bin\coder.eld radix d break _g729_encode s break _frame_end s break _exit log s coder_worst_case.log -o ...

Page 49

# run the module tester to extract build number and date # g729coder.ini has to be renamed temporary to not perform the test system("mv g729coder.ini g729coder.tmp"); system("runsc100 $exec_file > tmp.txt"); open (tmpfile,"tmp.txt") || die "Cannot create temporary files!!!"; while (<tmpfile>) ...

Page 50

if ($begin_flag == > $stack_top ) { $stack_top = $ else { $stack_base = $1; $begin_flag = output stack dimension $decoder_stack = $stack_top - $stack_base; print "Decoder ...

Page 51

Example 19. Fragments from the Vocoder Map File Value Size … 1 0x00000200 15464 Section: .data … 2 0x000005f0 5956 3 0x000005f0 4 0x00000ff0 5 0x00001270 … 6 0x00010000 66592 Section: .text … 7 0x000159b0 4638 8 0x00016030 9 0x00016bce ...

Page 52

... P.O. Box 5405 Denver, Colorado 80217 1-800-441-2447 or 303-675-2140 Fax: 303-675-2150 LDCForFreescaleSemiconductor@hibbertgroup.com Document Order No.: AN2094 Rev. 1 12/2004 Information in this document is provided solely to enable system and software implementers to use Freescale Semiconductor products. There are no express or implied copyright licenses granted hereunder to design or fabricate any integrated circuits or integrated circuits based on the information in this document ...
