AN2203 Freescale Semiconductor / Motorola, AN2203 Datasheet - Page 49

MPC7450 RISC Microprocessor Family Software Optimization Guide

For example, when it is known (or strongly suspected) that a 128-byte array structure is not in the data cache,
it is often not a good idea to load it in by using a looped series of lwzu rx, 0x4(ry) instructions. Note that
128 bytes is equal to four cache blocks on the MPC750/MPC7400/MPC7450, because all three
microprocessors have 32-byte cache blocks.
The second (and subsequent) loads stall until the first gets its data from memory. Each 32-byte cache block
holds eight 4-byte words, so the 9th, 17th, and 25th loads also miss, and the 10th, 18th, and 26th loads
collide on them and again stall the pipe. Better bandwidth can be achieved if the four cache block misses are
allowed to go out in parallel, which requires that each of the first four accesses be to a different one of
the four lines that need loading.
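The miss pattern described above can be checked with a little arithmetic. The following is a minimal sketch, assuming 4-byte loads, 32-byte cache blocks, and an array that starts on a cache-block boundary; the function names are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { CACHE_BLOCK_BYTES = 32, WORD_BYTES = 4 };

/* Returns 1 if the n-th load (1-based) of a sequential word-by-word walk
 * is the first access to a new cache block. Assumes the array starts on
 * a cache-block boundary. */
int load_starts_new_block(unsigned n)
{
    assert(n >= 1);
    return ((n - 1) * WORD_BYTES) % CACHE_BLOCK_BYTES == 0;
}

/* Stores the 1-based indices of the loads that begin a new block in out[]
 * (at most cap entries) and returns how many were stored. */
size_t first_touch_loads(unsigned total_loads, unsigned *out, size_t cap)
{
    size_t count = 0;
    for (unsigned n = 1; n <= total_loads; n++) {
        if (load_starts_new_block(n) && count < cap) {
            out[count++] = n;
        }
    }
    return count;
}
```

For the 32 loads that cover a 128-byte array, this reports loads 1, 9, 17, and 25 as the ones that touch a new block; these are the accesses that would ideally be issued first.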
Determining whether this is best done with loads, dcbt instructions, a dst, or a combination of these can be
complicated. In the above scenario, one load and three dcbt instructions may be the best solution.
Generally, dcbt instructions are best used to prefetch a few cache blocks of information, whereas dst
instructions are best used when pulling in a larger amount of information. However, the trade-offs are often
application dependent.
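As an illustration of the "one load and three dcbt instructions" idea, the following C sketch issues one demand load and three prefetch hints before walking the array. It is not taken from the guide. It assumes a GCC-compatible compiler, where __builtin_prefetch is available and, on PowerPC targets, is typically lowered to a dcbt; the names PREFETCH_READ and sum_128_bytes are hypothetical. It also assumes the 128-byte array is aligned to a 32-byte boundary, so that it covers exactly four cache blocks.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_BLOCK_BYTES 32u
#define WORDS_PER_BLOCK   (CACHE_BLOCK_BYTES / sizeof(uint32_t))

/* Assumption: GCC-compatible compilers provide __builtin_prefetch.
 * On other compilers the hint is dropped and the code still runs. */
#if defined(__GNUC__)
#define PREFETCH_READ(p) __builtin_prefetch((p), 0)
#else
#define PREFETCH_READ(p) ((void)(p))
#endif

/* Sums a 128-byte array (32 words, assumed 32-byte aligned). */
uint32_t sum_128_bytes(const uint32_t *a)
{
    assert(a != 0);

    /* Touch all four cache blocks up front: one demand load for the
     * first block, then three hints for the remaining blocks, so the
     * four misses can be outstanding at the same time. */
    uint32_t total = a[0];
    PREFETCH_READ(a + 1 * WORDS_PER_BLOCK);
    PREFETCH_READ(a + 2 * WORDS_PER_BLOCK);
    PREFETCH_READ(a + 3 * WORDS_PER_BLOCK);

    /* Consume the remaining 31 words in order. */
    for (unsigned i = 1; i < 4 * WORDS_PER_BLOCK; i++) {
        total += a[i];
    }
    return total;
}
```

Whether this ordering actually beats a plain loop depends on the memory system and the surrounding code, so it is best treated as a starting point for measurement.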
The VTE engine on the MPC7450 can initiate a prefetch once every three cycles. Because the engine can
sometimes fall behind actual code execution and thus become useless, one useful trick is to prefetch
less data with a particular dst, and then refresh the dst every so often with a new block to prefetch.
The amount of data to prefetch with a particular dst and the refresh rate are often very dependent on the
application (and on the platform and environment), and determining them usually requires some
trial-and-error experimentation. See Section 5.2.1.8, “Stream Usage Notes,” in the AltiVec Technology
Programming Environments Manual for additional reasons why numerous small dst operations are likely to
provide better performance than a few large dst operations.
The following pseudo-code shows two loops. The first loop performs a single dst operation for the
entire data stream, while the second performs several smaller dst operations. If the VTE engine falls behind
for the first loop, it provides no benefit from that time forward. If the VTE engine in the second loop falls
behind the computation, it is likely that in the next iteration of the outer loop, the VTE engine will again be
prefetching useful data, as the VTE engine is reprogrammed to prefetch what is going to be required next.
For example, assume that the VTE engine only prefetches the first four blocks in the dst before falling
behind. In the first loop, only 4 out of 256 blocks are prefetched. In the second loop, the first four blocks in
each iteration of the outer loop are prefetched in time, for a total of 128 blocks usefully prefetched.
/* Single dst for entire array. */
vec_dst(a, <256 blocks of 32 byte size>)
for (i=0; i<2048; i++) {
    total += A[i];
}

/* Series of smaller dsts. */
for (i=0; i<2048; i+=64) {    /* 32 iterations of this loop. */
    vec_dst(a[i], <8 blocks of 32 byte size>)
    for (j=i; j<i+64; j++) {
        total += A[j];
    }
}
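The second loop and the block counts quoted in the text can be sketched in portable C. This is only an approximation under stated assumptions: __builtin_prefetch (a GCC-compatible builtin) stands in for the AltiVec dst, which is a different mechanism handled by the VTE engine, and the helper names prefetch_blocks, sum_chunked, and useful_blocks are hypothetical. The useful_blocks function encodes the simple model from the text, in which each dst manages to prefetch only a fixed number of blocks before falling behind.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { BLOCK_BYTES = 32 };

/* Assumption: GCC-compatible compilers provide __builtin_prefetch. */
#if defined(__GNUC__)
#define PREFETCH_READ(p) __builtin_prefetch((p), 0)
#else
#define PREFETCH_READ(p) ((void)(p))
#endif

/* Hypothetical stand-in for one small dst: hint nblocks cache blocks. */
void prefetch_blocks(const void *start, size_t nblocks)
{
    const char *p = start;
    for (size_t b = 0; b < nblocks; b++) {
        PREFETCH_READ(p + b * BLOCK_BYTES);
    }
}

/* Sums n words, re-issuing a prefetch for each chunk of `chunk` words. */
uint32_t sum_chunked(const uint32_t *a, size_t n, size_t chunk)
{
    assert(chunk > 0 && n % chunk == 0);
    size_t blocks_per_chunk = chunk * sizeof a[0] / BLOCK_BYTES;
    uint32_t total = 0;
    for (size_t i = 0; i < n; i += chunk) {
        prefetch_blocks(&a[i], blocks_per_chunk);
        for (size_t j = i; j < i + chunk; j++) {
            total += a[j];
        }
    }
    return total;
}

/* Model from the text: each dst usefully prefetches at most
 * blocks_before_behind blocks before the engine falls behind. */
size_t useful_blocks(size_t n_elems, size_t elem_bytes,
                     size_t chunk_elems, size_t blocks_before_behind)
{
    assert(chunk_elems > 0 && n_elems % chunk_elems == 0);
    size_t blocks_per_dst = chunk_elems * elem_bytes / BLOCK_BYTES;
    size_t dst_count = n_elems / chunk_elems;
    size_t per_dst = blocks_per_dst < blocks_before_behind
                         ? blocks_per_dst
                         : blocks_before_behind;
    return dst_count * per_dst;
}
```

With 2048 four-byte elements and four blocks per dst before falling behind, the model gives 4 useful blocks for a single dst over the whole array and 128 for 32 dsts of 64 elements each, matching the figures in the text.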