AN2203 Freescale Semiconductor / Motorola, AN2203 Datasheet - Page 49

AN2203

Manufacturer Part Number

AN2203

Description

MPC7450 RISC Microprocessor Family Software Optimization Guide

Manufacturer

Freescale Semiconductor / Motorola

Datasheet

1.AN2203.pdf (76 pages)


For example, when it is known (or strongly suspected) that a 128-byte array structure is not in the data cache, it is often not a good idea to load it by using a looped series of lwzu rx, 0x4(ry) instructions. Note that 128 bytes is equal to four cache blocks on the MPC750, MPC7400, and MPC7450, because all three microprocessors have 32-byte cache blocks.

The second and subsequent loads to the first cache block stall until the first load gets its data from memory. When the 9th, 17th, and 25th loads miss, the 10th, 18th, and 26th loads collide on them and again stall the pipeline. Better bandwidth can be achieved if the four cache block misses are allowed to go out in parallel, which requires that each of the first four accesses be to a different one of the four cache blocks that need loading.
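The miss pattern described above can be checked with a small model. This is a minimal sketch, assuming the array starts on a 32-byte boundary and the cache is cold; the function names are hypothetical and do not come from the guide.

```c
#include <assert.h>

#define BLOCK_BYTES 32  /* cache block size stated in the text */
#define WORD_BYTES  4   /* each lwzu advances the address by 4 bytes */

/* Returns 1 if the n-th load (1-based) is the first access to its
   cache block, that is, the load that misses and starts the fetch. */
static int load_misses(int n)
{
    assert(n >= 1);
    return ((n - 1) * WORD_BYTES) % BLOCK_BYTES == 0;
}

/* Counts how many of the first `loads` sequential loads miss. */
static int count_misses(int loads)
{
    int n, misses = 0;
    for (n = 1; n <= loads; n++)
        misses += load_misses(n);
    return misses;
}
```

For the 32 loads that cover 128 bytes, only loads 1, 9, 17, and 25 miss, but in the simple loop each miss is issued only after the previous block has arrived.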

Determining whether this is best done with loads, dcbt instructions, a dst, or a combination of these can be complicated. In the scenario above, one load and three dcbt instructions may be the best solution. Generally, dcbt instructions are best used to prefetch a few cache blocks, whereas dst instructions are best used to pull in a larger amount of data. However, the trade-offs are often application dependent.
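The "one load and three dcbt" idea might look like the following in C. This is a sketch under stated assumptions, not the guide's own code: it uses the GCC/Clang `__builtin_prefetch` builtin as a stand-in for dcbt (whether a dcbt is actually emitted depends on the compiler and target), it assumes the array is aligned to a 32-byte block, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_BYTES 32u  /* cache block size stated in the text */
#define ARRAY_WORDS 32u  /* 128 bytes of 4-byte words */

/* Sums a 128-byte array, touching blocks 1..3 before the first load so
   that all four block fetches can be outstanding at the same time. */
static uint32_t sum_with_touches(const uint32_t *a)
{
    const unsigned char *p = (const unsigned char *)a;
    uint32_t total = 0;
    size_t i;

    /* Assumption: the array begins on a cache block boundary. */
    assert(((uintptr_t)a % BLOCK_BYTES) == 0);

    /* Three touches, one per remaining block (stand-in for dcbt). */
    __builtin_prefetch(p + 1 * BLOCK_BYTES);
    __builtin_prefetch(p + 2 * BLOCK_BYTES);
    __builtin_prefetch(p + 3 * BLOCK_BYTES);

    /* The first load in this loop is the one that misses on block 0. */
    for (i = 0; i < ARRAY_WORDS; i++)
        total += a[i];
    return total;
}
```

Issuing the touches before the first load is a choice made for this sketch; the compiler and the processor may reorder them, and measurement on the target is the only reliable guide.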

The VTE engine on the MPC7450 can initiate a prefetch once every three cycles. Because the engine can sometimes fall behind actual code execution and thus become useless, one useful trick is to prefetch less data with a particular dst, and then refresh the dst every so often with a new block to prefetch. The amount of data to prefetch with a particular dst, and the refresh rate, often depend heavily on the application (and on the platform and environment), and finding good values usually requires some trial-and-error experimentation. See Section 5.2.1.8, "Stream Usage Notes," in the AltiVec Technology Programming Environments Manual for additional reasons why numerous small dst operations are likely to provide better performance than a few large dst operations.
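A refreshed dst could be written along these lines. This is a minimal sketch and not the guide's code: `dst_control` and `chunked_sum` are hypothetical names, the control word layout (block size in 16-byte vectors, block count, byte stride) follows the dst encoding as I understand it from the AltiVec manuals, and the `vec_dst`/`vec_dss` calls are compiled only when the compiler defines `__ALTIVEC__`, so that path has not been exercised here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifdef __ALTIVEC__
#include <altivec.h>
#endif

#define CHUNK_WORDS 64u  /* 64 words * 4 bytes = 8 blocks of 32 bytes */

/* Packs a dst control word. size_vectors is the block size in 16-byte
   vectors (1..32), count is the number of blocks (1..256), and
   stride_bytes is the signed distance between blocks. The values 32
   and 256 wrap to 0, which is how the encoding represents them. */
static uint32_t dst_control(unsigned size_vectors, unsigned count, int stride_bytes)
{
    assert(size_vectors >= 1 && size_vectors <= 32);
    assert(count >= 1 && count <= 256);
    return ((uint32_t)(size_vectors & 0x1Fu) << 24)
         | ((uint32_t)(count & 0xFFu) << 16)
         | ((uint32_t)stride_bytes & 0xFFFFu);
}

/* Sums n words, restarting a small prefetch stream for every chunk. */
static uint32_t chunked_sum(const uint32_t *a, size_t n)
{
    uint32_t total = 0;
    size_t i, j;

    for (i = 0; i < n; i += CHUNK_WORDS) {
        size_t end = (n - i < CHUNK_WORDS) ? n : i + CHUNK_WORDS;
#ifdef __ALTIVEC__
        /* Refresh stream 0: 8 blocks of 2 vectors (32 bytes), contiguous. */
        vec_dst(a + i, (int)dst_control(2, 8, 32), 0);
#endif
        for (j = i; j < end; j++)
            total += a[j];
    }
#ifdef __ALTIVEC__
    vec_dss(0);  /* stop stream 0 once the data has been consumed */
#endif
    return total;
}
```

The chunk size of 64 words is only a starting point; as the text notes, the best size and refresh rate are found by experiment.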

The following pseudo-code shows two loops. The first loop performs a single dst operation for the entire data stream, while the second performs several smaller dst operations. If the VTE engine falls behind for the first loop, it provides no benefit from that time forward. If the VTE engine in the second loop falls behind the computation, it is likely that in the next iteration of the outer loop, the VTE engine will again be prefetching useful data, as the VTE engine is reprogrammed to prefetch what is going to be required next. For example, assume that the VTE engine only prefetches the first four blocks in the dst before falling behind. In the first loop, only 4 out of 256 blocks are prefetched. In the second loop, the first four blocks in each iteration of the outer loop are prefetched in time, for a total of 128 blocks usefully prefetched.

/* Single dst for entire array. */
vec_dst(A, <256 blocks of 32 byte size>)
for (i=0; i<2048; i++) {
    total += A[i];
}

/* Series of smaller dsts. */
for (i=0; i<2048; i+=64) { /* 32 iterations of this loop. */
    vec_dst(&A[i], <8 blocks of 32 byte size>)
    for (j=i; j<i+64; j++) {
        total += A[j];
    }
}
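The 4-versus-128 comparison in the text follows from simple arithmetic, sketched below. The model assumes, as the text does, that the engine delivers only the first few blocks of every dst in time before falling behind; `useful_blocks` is a hypothetical helper name.

```c
#include <assert.h>

/* Returns the number of blocks prefetched in time when a stream of
   total_blocks is split into dsts of blocks_per_dst each and the
   engine keeps up for only the first `lead` blocks of every dst. */
static unsigned useful_blocks(unsigned total_blocks,
                              unsigned blocks_per_dst,
                              unsigned lead)
{
    unsigned streams, per_stream;

    assert(blocks_per_dst > 0 && total_blocks % blocks_per_dst == 0);
    streams = total_blocks / blocks_per_dst;
    per_stream = (lead < blocks_per_dst) ? lead : blocks_per_dst;
    return streams * per_stream;
}
```

With 256 blocks and a lead of four, a single dst gives `useful_blocks(256, 256, 4)`, which is 4, while 32 dsts of 8 blocks give `useful_blocks(256, 8, 4)`, which is 128.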

Other Optimizations Worth Investigating
