AN2203 Freescale Semiconductor / Motorola, AN2203 Datasheet

Manufacturer Part Number: AN2203
Description: MPC7450 RISC Microprocessor Family Software Optimization Guide
Manufacturer: Freescale Semiconductor / Motorola

Application Note
AN2203/D
Rev. 1, 07/2002
MPC7450 RISC Microprocessor Family Software Optimization Guide
Part I Overview
This document helps programmers write optimal code for the MPC750, MPC7400, and MPC7450
microprocessors, which implement the PowerPC architecture. It places particular emphasis on
the MPC7450, which differs significantly from previous designs. The target audience is
performance-oriented compiler writers and authors of hand-coded assembly. This document may
be regarded as a companion to the PowerPC Compiler Writer’s Guide (CWG), with major updates
for new implementations not covered by that work.
This document is not intended as a guide for making a basic PowerPC compiler work; for
basic compiler guidelines, see the CWG. However, many of the code sequences suggested in
the CWG are no longer optimal, especially for the MPC7450.
The following documentation provides useful information about the three different
microprocessors and compiler guidelines in detail:
• MPC750 RISC Microprocessor Family User’s Manual
• MPC7410 & MPC7400 RISC Microprocessor User’s Manual
• MPC7450 RISC Microprocessor Family User’s Manual
• The PowerPC Compiler Writer’s Guide (available on the IBM website) for compiler information
Table 1-1 lists the three main processors referenced in this document and their derivatives. The
derivative list is not necessarily complete and will change.
Table 1-1. Microarchitecture List
First Implementation    Derivatives (Similar Devices)
MPC750                  MPC740, MPC745, MPC755
MPC7400                 MPC7410
MPC7450                 MPC7441, MPC7451
Freescale Semiconductor, Inc.
For More Information On This Product, Go to: www.freescale.com

AN2203 Summary of contents

Page 2

1.1 Terminology and Conventions This section provides an alphabetical glossary of terms used in this chapter. These definitions review these commonly used terms and point out specific ways these terms are used in this ...

Page 3

• Pipeline: In the context of instruction timing, the term ‘pipeline’ refers to the interconnection of the stages. The events necessary to process an instruction are broken into several cycle-length tasks to allow work to be performed on ...

Page 4

Part II Processor Overview This section describes the high-level differences between the MPC750, MPC7400, and MPC7450. It also describes the pipeline differences for these three processors. 2.1 High-Level Differences To achieve a higher frequency, the ...

Page 5

Table 2-1. High-Level Differences (continued) Microprocessor Feature Data cache load hit (integer, vector, float) IU1 (add, shift, rotate, logical) IU2: multiply (32-bit) IU2: divide FPU: single (add, mul, madd) FPU: single (divide) FPU: double (add) FPU: double ...

Page 6

2.2 Pipeline Differences The MPC7450 instruction pipeline differs significantly from the MPC750 and MPC7400 pipelines. Figure 2-1 shows the basic pipeline of the MPC750/MPC7400 processors. Figure 2-1. MPC750 and MPC7400 Pipeline Diagram Table 2-2 briefly ...

Page 7

Table 2-3 briefly explains the MPC7450 pipeline stages. Table 2-3. MPC7450 Pipeline Stages Pipeline Stage Abbreviation Fetch1 Fetch2 Branch execute Dispatch Issue Execute E, E0, E1, ... Completion Write back The MPC7450 pipeline is longer than the ...

Page 8

Figure 2-3. MPC750 Microprocessor Block Diagram

Page 9

Instructions are fetched from the instruction cache and placed into a six-entry IQ. When the fetch pipeline is fully utilized, as many as four instructions can be fetched to the IQ during each clock cycle, subject to ...

Page 10

If the branch is either b or bc, a taken branch can get instructions from the BTIC. The BTIC lookup is automatically performed based on the instruction address of the executing branch, and ...

Page 11

Figure 2-4. MPC7400 Microprocessor Block Diagram

Page 12

2.3.2.2 MPC7400 Compiler Model A good compiler scheduling model for the MPC7400 includes the dispatch limitations of two instructions per clock, a base model of the CQ with a maximum of eight instructions, the ...

Page 13

Figure 2-5. MPC7450 Microprocessor Block Diagram

Page 14

2.3.3.1 Dispatch The bottom three IQ entries are available for dispatch, which involves the following: • Renaming: 16 rename registers are available for each of the integer, floating-point, and vector operations. • Dispatching: Available issue ...

Page 15

• Operand check: All source operands must be available before any execution can start. • Serialization check: If the instruction is execution serialized, it must wait to become the oldest instruction in the machine (bottom of the CQ entry) ...

Page 16

a code performance point of view, the need for biasing the branch to be fall-through has increased to avoid the 1- or 2-cycle fetch bubble of a taken branch. The longer pipeline makes ...

Page 17

Part III MPC7450 Microprocessor Details This section describes many architectural details of the MPC7450 and gives examples of the pipeline behavior. These attributes are also described in the MPC7450 RISC Microprocessor Family User’s Manual. 3.1 Fetch/Branch Considerations ...

Page 18

3.1.1.1 Fetch Alignment Example The following code loop is a simple array accumulation operation.
    xxxxxx18 loop: lwzu r10,0x4(r9)
    xxxxxx1C       add  r11,r11,r10
    xxxxxx20       bdnz loop
The lwzu and add are the last two instructions in one ...
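For readers more at home in C, the three-instruction loop above can be read roughly as follows. This is only a sketch: the function name `accumulate` and its parameters are hypothetical, `base` stands in for r9, the running sum for r11 (assumed to start at zero), and `count` for the CTR value consumed by bdnz.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C reading of the lwzu/add/bdnz loop.
 * base  : plays the role of r9. lwzu advances it by 4 bytes before loading,
 *         so the first word read is base[1], not base[0].
 * count : plays the role of CTR. It must be nonzero, because bdnz
 *         decrements the counter before testing it. */
uint32_t accumulate(const uint32_t *base, size_t count)
{
    assert(base != NULL && count > 0);
    uint32_t sum = 0;            /* r11, assumed to start at zero */
    do {
        base += 1;               /* lwzu: update r9 by 4 bytes ...      */
        sum += *base;            /* ... load r10, then add r11,r11,r10  */
    } while (--count != 0);      /* bdnz loop                           */
    return sum;
}
```

Calling `accumulate` on an array whose first element is a placeholder mirrors the pre-increment addressing of lwzu: the placeholder is skipped and the following `count` words are summed.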

Page 19

Table 3-2. MPC7450 Loop Example: Three Iterations (continued) Instruction 0 add (3) bdnz (3) Loop unrolling and vectorization can further increase performance. These are described in Section 4.4.3, “Loop Unrolling for Long Pipelines,” and Section 4.4.4, “Vectorization.” 3.1.1.2 ...

Page 20

Table 3-4. Eliminating the Branch-Taken Bubble Instruction lwz cmpi beq add 3.1.2 Branch Conditionals The cost of mispredictions increases with pipeline length. The following section shows common problems and suggests how to minimize them. 3.1.2.1 ...

Page 21

the misprediction causes the outer loop code to be dispatched in cycle 13. If the branch had been correctly predicted as not taken, these instructions would have dispatched five cycles earlier in cycle 8. Table 3-6 shows ...

Page 22

As Table 3-7 shows, the inner loop termination branch does not need to be predicted and is executed as a fall-through branch. Instructions in the outer loop start dispatching in cycle 8, saving five cycles ...

Page 23

3.1.4 Using the Link Register (LR) Versus the Count Register (CTR) for Branch Indirect Instructions On the MPC7450, a bclr uses the link stack to predict the target. To use the link stack correctly, each branch-and-link (bl) ...

Page 24

ack: .... (possible calls to other functions) .... lwzu r4,4(r1) mtlr r4 bclr The bl in main pushes a value onto the hardware-managed link stack (in addition to the architecturally defined link register). Then the ...
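The push-and-pop pairing described above can be pictured with a small software model. The following is a sketch under stated assumptions rather than a description of the hardware: the depth `LINK_STACK_DEPTH`, the overflow behavior (dropping the oldest entry), and all function names are hypothetical choices made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LINK_STACK_DEPTH 8   /* assumed depth, for illustration only */

typedef struct {
    uint32_t entry[LINK_STACK_DEPTH];
    int      top;            /* number of valid entries */
} link_stack;

/* Models a bl: record the return address for a later bclr.
 * Assumption: when the stack is full, the oldest entry is discarded. */
void ls_push(link_stack *s, uint32_t return_addr)
{
    assert(s != NULL);
    if (s->top == LINK_STACK_DEPTH) {
        for (int i = 1; i < LINK_STACK_DEPTH; i++)
            s->entry[i - 1] = s->entry[i];
        s->top--;
    }
    s->entry[s->top++] = return_addr;
}

/* Models a bclr: pop the newest entry and use it as the predicted target.
 * Returns true when the prediction matches where the branch really went. */
bool ls_predict_return(link_stack *s, uint32_t actual_target)
{
    assert(s != NULL);
    if (s->top == 0)
        return false;        /* nothing to predict from */
    uint32_t predicted = s->entry[--s->top];
    return predicted == actual_target;
}
```

In this model, properly nested calls and returns always predict correctly, while a bclr that is not paired with an earlier bl (for example, one used as a computed branch) pops an unrelated entry and mispredicts.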

Page 25

Table 3-9. Position-Independent Code Example Instr. Instruction 0 1 No. 0 bcl 20, 31, $+ mfl - 2 addi r2, r2,#constant F1 F2 - 3 mtctr - ...

Page 26

break; } Assume r6 holds the address of SWITCH_TABLE for the following assembly code: lwz r4,x slwi r4, r4, 2 lwzx r5, r4, r6 mtctr r5 bctr Function pointers and virtual function calls should also ...
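The five-instruction sequence above is a table-driven indirect branch. A rough C analogue is sketched below, with the caveat that C can only express it as an indirect call, whereas a compiled switch jumps to a label. The table contents, the handler functions, and the name `dispatch_case` are hypothetical stand-ins for SWITCH_TABLE and the case bodies.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*case_handler)(void);

static int case0(void) { return 10; }
static int case1(void) { return 11; }
static int case2(void) { return 12; }

/* Hypothetical stand-in for SWITCH_TABLE: one code address per case. */
static const case_handler switch_table[] = { case0, case1, case2 };

int dispatch_case(size_t x)                  /* lwz r4,x */
{
    assert(x < sizeof switch_table / sizeof switch_table[0]);
    /* slwi r4,r4,2 scales the index by the 4-byte entry size and
     * lwzx r5,r4,r6 loads the entry; array indexing does both here. */
    case_handler target = switch_table[x];
    /* mtctr r5 / bctr: branch indirectly through the count register. */
    return target();
}
```

The bounds check has no counterpart in the assembly excerpt; a compiler would normally emit a range test before the table lookup.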

Page 27

The dispatcher can send three instructions to the various issue queues, with a maximum of three to the GIQ, two to the VIQ, and one to the FIQ. Thus only two instructions can be dispatched per cycle ...
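The per-queue limits above can be turned into a small counting model. The sketch below assumes strictly in-order dispatch (a blocked instruction also blocks younger ones) and ignores every other dispatch constraint, such as rename register or completion queue availability. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { Q_GIQ = 0, Q_VIQ = 1, Q_FIQ = 2 } issue_queue;

/* Returns how many of the first n instructions (given by their target
 * issue queue) could dispatch in one cycle under the stated limits:
 * at most three in total, three to the GIQ, two to the VIQ, one to the FIQ. */
int dispatch_count(const issue_queue *group, size_t n)
{
    assert(group != NULL || n == 0);
    int room[3] = { 3, 2, 1 };           /* GIQ, VIQ, FIQ slots this cycle */
    int sent = 0;
    for (size_t i = 0; i < n && sent < 3; i++) {
        if (room[group[i]] == 0)
            break;                       /* assumed in-order: stop at first blocked one */
        room[group[i]]--;
        sent++;
    }
    return sent;
}
```

Under this model a run of vector instructions dispatches two per cycle and a run of floating-point instructions one per cycle, while a mix of one of each kind reaches the full three.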

Page 28

3.2.2 Dispatching Load/Store Strings and Multiples The MPC7450 splits load/store multiple instructions (lmw and stmw) and strings (lsw and stsw) into micro-operations at the dispatch point. The processor can dispatch only one micro-operation per cycle, ...

Page 29

Because the MPC7450 can dispatch only one LSU operation per cycle, the lmw is micro-oped at a rate of one per cycle and so in this example takes seven cycles to dispatch all the operations. However, when ...

Page 30

Table 3-12. GIQ Timing Example (continued) Instr. Instruction 0 No. 7 subf F2 GIQ5 GIQ4 GIQ3 GIQ2 GIQ1 GIQ0 Similar examples could also be given for loads bypassing adds, and multiplies bypassing loads. However, ...

Page 31

3.4 Completion Queue The following sections describe the conditions for the completion queue such as the re-order sizing, how the instruction sequence is grouped, and the effects of serialization. 3.4.1 Reorder Size The completion queue size on ...

Page 32

three IU1 instructions are stuck in the three reservation stations, requiring operands (or until the GIQ or dispatcher stalls for other reasons). Table 3-12 shows a case where although two IU1s are blocked, the third ...

Page 33

Like the IU2, the FPU requires a separate finish stage to return CR and FPSCR data, as shown in Table 3-16. However, FPR data produced in E4 (the fifth stage) is ready and can be forwarded directly ...

Page 34

The timing for this sequence in Table 3-17 assumes that the load misses in the data cache. Here, after the first four fadds, the MPC7450 runs out of FPSCR rename registers and the pipeline stalls. ...

Page 35

Instruction 0 vaddfp D vsubfp D - vlogefp - vcmpbfp. - vmaddfp F2 - 3.7 Load/Store Unit (LSU) The LSU has two reservation stations. Instruction execution is allowed only from the bottom reservation station (reservation station 0). ...

Page 36

3.7.2 Store Hit Pipeline The pipeline for stores before the data is written to the cache includes several different queues. A store instruction must go through E0 and E1 to handle address generation and ...

Page 37

3.7.3 Store Gathering and Merging The MPC7450 implements two techniques to improve store performance by coalescing adjacent entries in the CSQ. Store gathering refers to coalescing adjacent cache-inhibited or write-through stores; store merging refers to coalescing adjacent ...

Page 38

Table 3-23. Load/Store Interaction (Assuming Full Alias) Instruction 0 1 stw r3,0x0(r9) lwz r4,0x0(r9) 3.7.5 Misalignment Effects Misalignment, particularly the back-to-back misalignment of loads, can cause negative performance effects. The ...

Page 39

Table 3-25. Data Cache Miss, L2 Cache Hit Timing Instruction 0 lwz r4,0x0(r9) E0 add r5,r4,r3 - load misses in the L1 data cache and in the L2 data cache, critical data forwarding occurs, followed ...

Page 40

Note that instruction 2 stalls in stage E1 (in the RA latch in Table 3-27). This stall occurs because the line miss caused by instruction 0 is the same line that instruction 2 requires. ...

Page 41

3.7.7 DST Instructions and the Vector Touch Engine (VTE) The MPC7450 VTE engine is similar to that on the MPC7400 but can only initiate an access every three cycles rather than two. However, due to miss-handling differences ...

Page 42

For a 128-Kbyte object, 82.8 percent is left in the L2 cache after one pass, but for a 256-Kbyte object only slightly less than two-thirds of the structure is left in the L2 cache. ...

Page 43

However, if hardware prefetching is enabled, the hardware starts prefetching the line desired by instruction 4 even before instruction 4 accesses (and misses) the L1 data cache, thus parallelizing some serialized bus accesses. In Table 3-30, with ...

Page 44

Part IV Microprocessor Application to Optimal Code Although many of the code optimizations described in this document can also be performed by hand in assembly language, this chapter focuses on improving ...

Page 45

in the final code. A general set of rules is given below. Although these rules are generally reliable, there are always a few cases where it can make sense to break them. • Use the load update ...

Page 46

4.2.2 Using the Link Register The CTR instruction pair mtctr/bcctr should be used for all computed branches. This includes case statement jumps and all indirect function calls. Note that to save ...

Page 47

In future high performance processors that implement the PowerPC architecture, the preferred instruction alignment will be that the branch target be the first instruction in a quad word (target address = 0xxxxx_xxx0). 4.3.3 Load Hoisting Load hoisting ...

Page 48

aval += bval; ... *a = aval bval; } else { * &= 0xff *b; ... } } Assembly code: cmpw 0,3,4 beq alias_case lwz 9,0(3) ...

Page 49

For example, when it is known (or strongly suspected) that a 128-byte array structure is not in the data cache, it is often not a good idea to load using a looped series of ...

Page 50

4.4.2 Software Pipelining With longer pipelines, more functional units, and a higher instruction issue rate, the MPC7450 can provide more instruction level parallelism (ILP) than previous microprocessors. Loops that have long dependency chains may ...
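As a generic illustration of software pipelining (not the specific example from the application note, which is not visible in this excerpt), the sketch below restructures a load-multiply-accumulate loop so that the load for the next iteration is issued before the arithmetic on the current value. The function names and the choice of loop body are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Straightforward form: each iteration loads a value and uses it at once,
 * so the multiply-add waits on the load it has just issued. */
int32_t scale_sum_plain(const int32_t *a, size_t n, int32_t k)
{
    assert(a != NULL || n == 0);
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * k;
    return sum;
}

/* Software-pipelined form: the load for iteration i is issued while the
 * value loaded in iteration i-1 is being consumed. */
int32_t scale_sum_pipelined(const int32_t *a, size_t n, int32_t k)
{
    assert(a != NULL || n == 0);
    if (n == 0)
        return 0;
    int32_t sum = 0;
    int32_t cur = a[0];              /* prologue: first load         */
    for (size_t i = 1; i < n; i++) {
        int32_t next = a[i];         /* load for the next iteration  */
        sum += cur * k;              /* compute on the earlier load  */
        cur = next;
    }
    sum += cur * k;                  /* epilogue: last computation   */
    return sum;
}
```

Both functions return the same result; the second simply exposes independent work between a load and its use. An optimizing compiler may already perform this transformation, so the hand-written form is mainly useful for seeing the prologue, steady state, and epilogue.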

Page 51

Table 4-1. MPC7450 Execution of One to Two Iterations of Code Loop Example (continued) Instruction 0 add (4) bdnz lwzu (5) add (5) 4.4.4 Vectorization Transforming code to reference vector data as opposed to scalar data can produce significant ...

Page 52

Table 4-2. MPC7450 Execution of 1–2 Iterations of Code Loop Example Instruction lvx (1-4) D vaddsws (1-4) D lvx (5-8) - vaddsws (5-8) - lvx (9-12) vaddsws (9-12) lvx (13-16) vaddsws (13-16) addi ...
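The vaddsws instruction in the table above adds four signed 32-bit words per operation, saturating each lane instead of wrapping. A scalar C model of that behavior is sketched below; the function names are hypothetical, and the model captures only the arithmetic result, not timing or the saturation status bit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Signed 32-bit add that clamps to the representable range. */
static int32_t add_sat_s32(int32_t x, int32_t y)
{
    int64_t wide = (int64_t)x + (int64_t)y;   /* cannot overflow in 64 bits */
    if (wide > INT32_MAX) return INT32_MAX;
    if (wide < INT32_MIN) return INT32_MIN;
    return (int32_t)wide;
}

/* Scalar model of one vaddsws: four independent saturating word adds. */
void vaddsws_model(int32_t dst[4], const int32_t a[4], const int32_t b[4])
{
    assert(dst != NULL && a != NULL && b != NULL);
    for (int lane = 0; lane < 4; lane++)
        dst[lane] = add_sat_s32(a[lane], b[lane]);
}
```

Each row pair in the table (an lvx followed by a vaddsws) therefore covers four array elements, which is why the iterations are labeled 1-4, 5-8, and so on.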

Page 53

Part V Optimized Code Sequences Many of the code sequences given as optimal in the PowerPC Compiler Writer’s Guide are no longer optimal for current microprocessors. The primary problem with the sequences suggested ...

Page 54

5.2 Comparisons and Comparisons Against Zero Table 5-2 shows the code sequences from Section D.1 of the PowerPC Compiler Writer’s Guide. In each example located in r3 and v1 is ...

Page 55

Table 5-2. Comparisons and Comparisons Against Zero (continued) Operation ltu/gtu r = (unsigned_word) v0 < (unsigned_word (unsigned_word) v1 > (unsigned_word) v0; eq0 r = (v0 == 0); ne0 r = (v0 != 0); les0 ...
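To make the eq0 and ne0 rows concrete, the sketch below shows one well-known branch-free way to compute them, based on counting leading zeros (the role cntlzw plays on PowerPC, followed by a right shift by 5). Whether Table 5-2 recommends exactly this sequence is not visible in the excerpt, so treat it as an illustration of the idea only; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Portable count of leading zero bits; returns 32 for an input of zero,
 * matching the behavior of cntlzw. */
static uint32_t clz32(uint32_t v)
{
    uint32_t n = 0;
    while (n < 32 && (v & 0x80000000u) == 0) {
        v <<= 1;
        n++;
    }
    return n;
}

/* eq0: r = (v0 == 0). Only zero has 32 leading zeros, and 32 is the only
 * possible count with bit 5 set, so shifting right by 5 yields 1 or 0. */
uint32_t eq0(uint32_t v0)
{
    return clz32(v0) >> 5;
}

/* ne0: r = (v0 != 0), obtained here by inverting the eq0 result. */
uint32_t ne0(uint32_t v0)
{
    return eq0(v0) ^ 1u;
}
```

For example, `eq0(0)` evaluates to 1 and `eq0(7)` to 0, with no conditional branch in the comparison itself (the loop in `clz32` only stands in for a single hardware instruction).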

Page 56

Table 5-3. Negative Comparisons and Negative Comparisons Against Zero Operation neq r = -(v0 == v1) nne r = -(v0 != v1) nles/nges r = -((signed_word) v0 <= (signed_word) v1) ...
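The negated forms in Table 5-3 produce 0 or an all-ones word rather than 0 or 1, which makes the result usable directly as a bit mask. The sketch below shows the neq and nne definitions in C and one typical use, a branch-free select. The function names, including `select_if_equal`, are hypothetical and are not taken from the table.

```c
#include <assert.h>
#include <stdint.h>

/* neq: r = -(v0 == v1), giving 0xFFFFFFFF when equal and 0 otherwise. */
uint32_t neq(uint32_t v0, uint32_t v1)
{
    return 0u - (uint32_t)(v0 == v1);
}

/* nne: r = -(v0 != v1), giving 0xFFFFFFFF when different and 0 otherwise. */
uint32_t nne(uint32_t v0, uint32_t v1)
{
    return 0u - (uint32_t)(v0 != v1);
}

/* Example use of the mask: pick if_equal when v0 == v1, else if_not,
 * without a conditional branch. */
uint32_t select_if_equal(uint32_t v0, uint32_t v1,
                         uint32_t if_equal, uint32_t if_not)
{
    uint32_t mask = neq(v0, v1);
    return (if_equal & mask) | (if_not & ~mask);
}
```

Unsigned arithmetic is used so that the negation is well defined in C; on the processor the same bit pattern is simply the two's complement of the 0 or 1 comparison result.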

Page 57

Table 5-3. Negative Comparisons and Negative Comparisons Against Zero (continued) Operation nne0 nles0 r = -((signed_word) v0 <= 0); nges0 r = -((signed_word) v1 >= 0); nlts0 r = -((signed_word) v0 < ...

Page 58

are r5. For the cases where the second operand is assumed such as eq0+, assume that and v2 is ...

Page 59

Table 5-4. Comparisons with Addition (continued) Operation les0 ((signed_word) v0 < ges0 ((signed_word) v0 > lts0 ((signed_word) v0 < gts0 ...

Page 60

Appendix A MPC7450 Execution Latencies This appendix lists the MPC750, MPC7400, and MPC7450 instruction execution latencies. Instructions are sorted by mnemonic, primary, extend, form, unit, and cycle. A high level summary of execution latencies ...

Page 61

• The variable ‘b’ represents the processor/system-bus clock ratio. • The term ‘broadcast’ indicates a bus broadcast that has a minimum value of 3*b. • Additional cycles due to serialization are indicated in the cycles column with ...

Page 62

Table A-3. System Operation Instruction Execution Latencies (continued) MPC750 Mnemonic Unit mftb SRU mtmsr SRU mtspr (DBATs) SRU mtspr (IBATs) SRU mtspr (MSS mtspr (other) SRU mtspr (XER) SRU mtsr SRU ...

Page 63

Table A-4. Condition Register Logical Execution Latencies (continued) Mnemonic mcrxr mfcr mtcrf 1 mtcrf of a single field is executed by an IU1 in a single cycle and is not serialized. The single field mtcrf executes significantly ...

Page 64

Table A-5. Integer Unit Execution Latencies (continued) Mnemonic cntlzw[.] divwu[o][.] divw[o][.] eqv[.] extsb[.] extsh[.] mulhwu[.] mulhw[.] mulli mull[o][.] nand[.] neg[o][.] nor[.] orc[.] ori oris or[.] rlwimi[.] rlwinm[.] rlwnm[.] slw[.] srawi[.] sraw[.] srw[.] subfc[o][.] subfe[o][.] ...

Page 65

Table A-5. Integer Unit Execution Latencies (continued) Mnemonic xori xoris xor[.] 1 If the record bit is set, the GPR result is available in 1 cycle while the CR result is available in the second cycle. 2 ...

Page 66

Table A-6. Floating-Point Unit (FPU) Execution Latencies (continued) Mnemonic fnabs[.] fneg[.] fnmadds[.] fnmadd[.] fnmsubs[.] fnmsub[.] fres[.] frsp[.] frsqrte[.] fsel[.] fsubs[.] fsub[.] mcrfs mffs[.] mtfsb0[.] mtfsb1[.] mtfsfi[.] mtfsf[.] Table A-7 shows load and store instruction ...

Page 67

Table A-7. Store Unit (LSU) Instruction Latencies (continued) Mnemonic Class dssall - NA - dsts[ dst[ eciwx - NA - icbi - NA - lbz - NA - lbzu GPR lbzux ...

Page 68

Table A-7. Store Unit (LSU) Instruction Latencies (continued) Mnemonic Class lvsl Vector lvsr Vector lvx Vector lvxl Vector lwarx GPR lwbrx GPR lwz GPR lwzu GPR lwzux GPR lwzx GPR stb GPR stbu GPR ...

Page 69

Table A-7. Store Unit (LSU) Instruction Latencies (continued) Mnemonic Class stvehx Vector stvewx Vector stvx Vector stvxl Vector stw GPR stwbrx GPR stwcx. GPR stwu GPR stwux GPR stwx GPR tlbie - NA - tlbld - NA ...

Page 70

Table A-8. AltiVec Operations: Vector Simple Integer Unit (continued) Mnemonic vand vandc vavgsb vavgsh vavgsw vavgub vavguh vavguw vcmpequb[.] vcmpequh[.] vcmpequw[.] vcmpgtsb[.] vcmpgtsh[.] vcmpgtsw[.] vcmpgtub[.] vcmpgtuh[.] vcmpgtuw[.] vmaxsb vmaxsh vmaxsw vmaxub vmaxuh vmaxuw vminsb vminsh ...

Page 71

Table A-8. AltiVec Operations: Vector Simple Integer Unit (continued) Mnemonic vslb vslh vslw vsrab vsrah vsraw vsrb vsrh vsrw vsubcuw vsubsbs vsubshs vsubsws vsububm vsububs vsubuhm vsubuhs vsubuwm vsubuws vxor Table A-9 lists vector complex integer instruction latencies. ...

Page 72

Table A-9. AltiVec Operations: Vector Complex Integer Unit (continued) Mnemonic vmulesb vmulesh vmuleub vmuleuh vmulosb vmulosh vmuloub vmulouh vsum2sws vsum4sbs vsum4shs vsum4ubs vsumsws Table A-10 lists vector floating-point (VFPU) instruction latencies. Table A-10. AltiVec Operations: Vector ...

Page 73

Table A-10. AltiVec Operations: Vector Floating-Point Unit (continued) Mnemonic vnmsubfp vrefp vrfim vrfin vrfip vrfiz vrsqrtefp vsubfp 1 In Java mode, MPC7400 VFPU instructions need a fifth cycle of execution (5:1) but data dependencies are still forwarded from ...

Page 74

Table A-11. AltiVec Operations: Vector Permute Unit (continued) Mnemonic vsplth vspltisb vspltish vspltisw vspltw vsr vsro vupkhpx vupkhsb vupkhsh vupklpx vupklsb vupklsh

Page 75

Appendix B Revision History Table B-1 provides a revision history for this hardware specification. Rev. No. 0 Initial release, 11/01 Section 3.1.4, third sentence in the third paragraph, ...

Page 76

... Motorola and the Stylized M Logo are registered in the U.S. Patent and Trademark Office. digital dna is a trademark of Motorola, Inc. All other product or service names are the property of their respective owners. Motorola, Inc. is an Equal Opportunity/Affirmative Action Employer. © Motorola, Inc. 2002 AN2203/D
