Indigo2 and POWER Indigo2 Technical Report

Section 9 MIPSpro Compiler Technology

This chapter consists of three sections:

  • 9.1 The MIPSpro Compiler

  • 9.2 Optimization Technology in the MIPSpro Compilation System

  • 9.3 Porting Cray Code

The chapter concludes with a brief reference section.


9.1 The MIPSpro Compiler

The MIPSpro compilers are Silicon Graphics' third-generation family of optimizing and parallelizing compilers. The compilers generate 64-bit optimized code for the MIPS R8000-based Indigo2 systems running IRIX 6.0.1, the new 64-bit operating system, and support Fortran 77, C, and C++. Highlights:

The compilers can be installed on IRIX 6.0.1- or IRIX 5.3-based Silicon Graphics systems:

Figure 22 diagrams the relationship between MIPSpro compilers and IRIX systems.

[MIPSpro Compilers and IRIX Versions]

FIGURE 22 MIPSpro Compilers and IRIX Versions


Figure 23 illustrates the various components of the MIPSpro compilation system. Components include the compiler back end and the Parallel C and Fortran 77 analyzer. Scalar and superscalar optimizations common to all the compilers are performed in the common back end. All parallel processing optimizations are performed in the parallel analyzer phases of the compilation system. Kuck and Associates (KAI) technology is an integral part of the MIPSpro compilation system.

The compilers support value-added extensions to:

[MIPSpro Compiler Components]

This section explains:

9.1.1 Optimizations

The compilers perform a range of general-purpose and architecture-specific optimizations that improve application performance by reducing the number of instructions executed, making better use of the CPU's instruction set, maximizing register use, minimizing memory references, and eliminating unused or redundant code.

A range of new optimization techniques takes maximum advantage of the new processor features such as on-chip and off-chip caches, pipelining, and superscalar chip architecture. The optimizations are applicable to a wide range of scientific and engineering applications and benefit both scalar and parallel performance of applications.

A rich assortment of command-line options lets users apply different combinations of optimizations. In general, the optimizations are spread across the compilation system for better efficiency. For instance, high-level optimizations like loop interchange and loop unrolling are performed in the compiler front ends, whereas architecture-specific optimizations like software pipelining and automatic blocking are implemented in the common back end. All optimizations are fine-tuned to take advantage of the new system. Key optimizations:

  • Architecture-specific optimizations

    • software pipelining

    • instruction scheduling

    • automatic blocking

    • register blocking

    • array padding

    • global instruction distribution


  • Statement-level optimizations

    • array expansion

    • common subexpression elimination

    • global constant propagation

    • dead code elimination

    • Global Code Motion (GCM)


  • Loop-level optimizations, including combinations of:

    • loop unrolling

    • loop interchange

    • unroll-and-jam

    • loop distribution

    • loop fusion

    • loop invariant code motion

    • sum reduction


  • Procedure-level optimizations

    • procedure inlining

    • interprocedural analysis (IPA)


9.1.2 Memory Hierarchy (Cache) Optimizations

Memory hierarchy optimizations play a key role in matching the performance capabilities of the fast superscalar processor with the relatively slower main memory system. The primary function of the cache subsystem is to bridge the gap between processor and main memory speed. In this sense the cache architecture of a modern superscalar microprocessor like the MIPS R8000 is similar to the vector register sets of traditional vector supercomputers that used vector registers to keep the processors busy with computation without having to frequently reference main memory.

Caches are based on the observation that most application programs exhibit some degree of locality of reference: programs access pieces of data that are "near" already requested data, in space and time. A program that accesses memory without regard to locality of reference might perform poorly because of a high number of cache misses. The compiler plays a crucial role in restructuring programs to reduce cache misses by interchanging loops, or by tiling or blocking loop nests so that data is consumed most efficiently by the processor. This arrangement is similar to traditional vectorizing compilers that restructured programs to fit in vector memory (registers) in pieces. In short, the compiler restructures programs so that a useful subset of the problem can fit into the cache. Thus the processor can work on patches of the original code and data from the cache memory (thereby avoiding main memory references) before moving on to the next patch.

Figure 24 illustrates the cache misses per FLOP as a function of different cache sizes for a broad range of scientific and engineering applications.

[Working set effects on scientific applications]

FIGURE 24 Working set effects on scientific applications [1]

[1] From "Working Sets, Cache Sizes, and Node Granularity Issues for Large-scale Multiprocessors" by Jaswinder Pal Singh, Anoop Gupta, and Edward Rothberg, in Proceedings of the 20th International Symposium on Computer Architecture.


For cache sizes of 1 MB or more, the miss ratio becomes negligible. This experiment illustrates how a combination of a moderately large cache size and good compiler technology can reduce cache misses to a negligible amount for a large (but not all-inclusive) class of big scientific and engineering problems.


9.1.3 Automatic and User-assisted Parallelization

The MIPSpro Power compilers (MIPSpro Power Fortran 77 and MIPSpro Power C) support automatic and user-directed parallelization of Fortran and C applications for multiprocessing execution. The compilers employ automatic parallelization techniques to analyze and restructure user applications for parallel execution; this suits users who prefer to rely on the compiler to parallelize their applications.

The compilers also provide a comprehensive set of standards-based comment directives that enable users to assist the compiler in the parallelization process. Users can use these directives to provide additional information to the compiler to boost parallel performance.

This parallelization technology means that developers who use a POWER Indigo2 to write applications destined for POWER Challenge and POWER Onyx systems can take full advantage of the POWER Challenge system architecture. A combination of automatic and user-assisted parallelization can lead to substantial improvements in the performance of many programs.


9.1.4 High-performance Scientific Libraries

The compilers are complemented by CHALLENGEcomplib, a comprehensive collection of scientific and math subroutine libraries that provide support for mathematical and numerical algorithms used in scientific computing. The key motivation for creating CHALLENGEcomplib is to provide standard library functionality and to improve the runtime performance of applications.

CHALLENGEcomplib is similar to scientific libraries provided by other supercomputing vendors like the Cray SCILIB, IBM ESSL and the Convex VECLIB. The library consists of two subcomponents: complib.sgimath and complib.slatec. These libraries include support for Basic Linear Algebra Subprograms (BLAS), Extended BLAS (Level 2 and Level 3), LAPACK, FFT, Convolutions, and selected routines from Cray's SCILIB and SLATEC from the Energy, Science and Technology Software Center.
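
As a brief illustration, the standard Level 3 BLAS matrix-multiply routine DGEMM can be called directly from Fortran 77. The program below is a minimal sketch; the exact link option (for example -lcomplib.sgimath) is an assumption and may vary by release.

c     Minimal sketch: calling the Level 3 BLAS routine DGEMM.
c     (Linking against complib.sgimath is assumed; the precise
c     flag is not specified here.)
      program gemmdemo
      integer n
      parameter (n = 4)
      double precision a(n,n), b(n,n), c(n,n)
      integer i, j

c     fill a and b with simple test values, clear c
      do j = 1, n
        do i = 1, n
          a(i,j) = dble(i + j)
          b(i,j) = dble(i - j)
          c(i,j) = 0.0d0
        enddo
      enddo

c     c = 1.0 * a * b + 0.0 * c  (standard BLAS DGEMM interface)
      call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)

      write (*,*) 'c(1,1) = ', c(1,1)
      end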


9.1.5 Software Development Support

The MIPSpro compiler family includes basic debugging and program runtime analysis tools, among them dbx, pixie and prof.

  • dbx is the source level debugger that facilitates debugging of Fortran 77, C and C++ code. dbx also supports debugging of parallel Fortran 77 and C code.

  • prof is the standard profiling tool that provides "program counter (PC) sampling" of an application's execution. This information identifies the compute intensive portions of the application and forms the basis for performance tuning of programs.

  • pixie is also a profiling tool; it provides a statement-level execution profile by using a technique called basic-block counting, giving much finer resolution than prof.

    Both pixie and prof can be used to profile parallel programs.

  • CASEVision/Workshop, a suite of software development tools that includes:

    • Workshop/Pro MPF, the parallel program development tool

    • Workshop/Debugger, the parallel debugger

    • Workshop/Performance Analyzer, the parallel program profiling and performance tuning tools

    • other static and dynamic analysis tools


9.2 Optimization Technology in the MIPSpro Compilation System

Optimization technology is an integral part of the MIPSpro Compilation system. Supercomputing microprocessors like the MIPS R8000 offer enormous performance potential that needs to be harnessed by advanced optimization techniques. The optimizer takes the processor, system and program characteristics into consideration when restructuring programs for performance.

Figure 25 illustrates three important parameters that influence the effectiveness of the optimization phases of the compilers.

[Optimization Phase Parameters]

FIGURE 25 Optimization Phase Parameters


This section discusses

9.2.1 Processor Architecture

The optimizer has a sound understanding of the architecture and performance characteristics of the MIPS R8000 64-bit superscalar processor and it uses this information in generating optimized code for applications. The R8000 processor implements the MIPS IV Instruction Set Architecture (ISA) and has the following features:

  • a large fast-access, high-throughput cache subsystem specially designed for moving floating-point data

  • 64-bit integer and floating-point operations, 32 floating-point registers, and virtual addressing capabilities

  • on-chip instruction and data caches

  • branch-prediction capabilities

  • ANSI/IEEE-754 standard floating-point coprocessor with imprecise exceptions

  • full compatibility with earlier 32-bit and 64-bit MIPS microprocessors


9.2.2 Application Characteristics

The compilers perform extensive analysis of application programs to perform appropriate optimizing transformations. Different applications exhibit different characteristics that determine the degree of optimizations that can be performed. For instance, compute-intensive applications that reference data in a regular manner may be very amenable to an optimization known as loop blocking. Similarly, applications that reference different sections of data simultaneously can be parallelized by the compiler for concurrent execution. In short, application characteristics play an important role in determining the degree and effectiveness of the optimizations that can be performed.


9.2.3 Optimization Technology

The MIPSpro compilers perform a hierarchy of optimizations ranging from fine-grained instruction-level optimizations to coarse-grained parallelization through loops and tasks to reduce the execution time of applications. The optimization phases are spread across the compilation system.

Instruction-level optimizations are performed mostly in the common back end to extract maximum performance out of the R8000 superscalar processor. Common instruction-level optimizations include software pipelining, instruction scheduling, global instruction movement and register allocation.

Loop-level optimizations are performed in the early stages of the compilation process. Key loop-level optimizations include automatic loop blocking, loop interchange and loop unrolling.

Automatic loop blocking and loop interchange are memory hierarchy optimizations that take advantage of the cache architecture of the machine. Similarly, loop unrolling attempts to expose more instruction level parallelism to the optimizer for fine-grained parallelism.

Other optimizations such as loop distribution and loop fusion are important for efficient parallel execution. The compiler uses extensive analysis and transformation techniques to detect parallelism in programs. The compilation system supports a complete runtime environment for parallel execution. This runtime library is common to all the MIPSpro compilers.

Figure 26 illustrates the different kinds of parallelism exploited by the compilers.

[Compiler Parallelisms]

FIGURE 26 Compiler Parallelisms


9.2.4 Basic Block Optimizations

Basic optimizations are performed at optimization level 1 (-O1) and are meant to create efficient scalar code at the basic-block level for both C and Fortran programs. A basic block is a sequence of statements ending with a conditional or unconditional branch. Many scalar optimizations are performed at the basic-block level to improve the efficiency of the generated code. The following are some of the most common optimizations performed at the basic-block level (a small sketch after the lists below illustrates several of them):

  • algebraic simplification

  • common-subexpression elimination

  • constant propagation and constant folding

  • dead-code elimination

The compiler also performs the full range of scalar optimizations, including:

  • invariant IF floating

  • loop unrolling and loop rerolling

  • loop fusion, loop peeling

  • array expansion
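
As a rough sketch (not actual compiler output; the variables are illustrative), the fragment below shows how common-subexpression elimination, constant folding, and dead-code elimination can rewrite a basic block:

c     before optimization
      d = (a + b) * (a + b)
      e = 2.0 * 3.0
      if (.false.) f = d + e

c     after optimization (conceptually): the subexpression a + b is
c     computed once, the constant expression is folded to 6.0, and
c     the unreachable assignment to f is removed
      t = a + b
      d = t * t
      e = 6.0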


9.2.5 Global Optimizations

The compilers perform extensive global optimizations at optimization level 2 (-O2) that are usually beneficial to most applications, and are conservative in nature to ensure integrity of the results of floating point computations. The compiler performs memory hierarchy optimizations to maximize reuse of data in the cache. Optimizations performed at this level include:

  • loop blocking, loop interchange

  • global constant propagation to propagate and fold constants across basic blocks.

  • control flow optimizations to remove redundant statements, delete unreachable sections of the program, and to combine different basic-blocks into large basic blocks.

  • strength reduction to replace expensive operations with simple ones.

  • induction variable simplification

  • fenceposting

  • backward motion of region invariants

  • forward motion of stores

  • collapsing if statements

  • global copy propagation

  • arithmetic expression folding

  • all -O1 level optimizations


9.2.6 Advanced Optimizations

The compilers perform aggressive optimizations at level 3 (-O3) that focus on maximum code quality. Considerable flexibility is provided to enable combination of optimizations at this level for maximum floating point performance. Important superscalar optimizations like software pipelining are performed at this level.

9.2.7 Floating-point Optimizations

The compilers normally generate floating-point code that conforms to the IEEE 754 floating-point standard. However, many floating-point-intensive codes that have not been written with careful attention to floating-point behavior do not require precise conformance with the source language expression evaluation standards or the IEEE 754 arithmetic standards. It is therefore possible to relax the conformance restrictions in favor of better performance. The MIPSpro compilers provide a number of different command-line options to accomplish this goal:

  • roundoff options

  • IEEE floating-point options

  • reciprocal and reciprocal square-root

  • fast intrinsics


Roundoff Option

The -OPT:roundoff=n flag is available to determine the extent to which optimizations are allowed to affect floating-point results, in terms of both accuracy and overflow/underflow behavior.


IEEE Option

The -OPT:IEEE_arithmetic=n flag specifies the extent to which optimizations should preserve IEEE floating-point arithmetic.


Reciprocal and Reciprocal Square Root

The flexible floating-point options provide users with a range of alternatives to trade off accuracy for speed. Thus applications can take advantage of fast MIPS IV instructions like recip and rsqrt. This is particularly significant for applications that were running on Crays, which have several fewer bits of precision than IEEE 64-bit. Heavy users of Cray and other non-IEEE compliant vector machines who have a need for speed can use these options.

In short, turning divides into multiplies by the reciprocal, and lifting the reciprocal calculation outside the loop, can yield substantial performance improvements. Consider the loop below. If IEEE conformance is required, the generated code must perform the n loop iterations in order, with a divide and an add in each iteration. If IEEE conformance is not required, x/y can be implemented as x * recip(y) (and sqrt(x) as x * rsqrt(x)), so the divide is treated as a(i) * (1.0/divisor). On the MIPS R8000, the reciprocal can then be calculated once before the loop is entered, reducing the loop body to a much faster multiply and add per iteration, which can be a single madd instruction:

INTEGER i,n
REAL sum, divisor, a(n)
sum = 0.0
do i = 1, n
  sum = sum + a(i) / divisor
enddo
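
When IEEE conformance is relaxed, the compiler can in effect transform the loop as sketched below (a hand-written illustration, not compiler output; the temporary rdiv is illustrative):

REAL rdiv
c  reciprocal computed once, outside the loop
rdiv = 1.0 / divisor
do i = 1, n
  sum = sum + a(i) * rdiv
enddo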

As another example, a loop encountered in WESDYN, a computational chemistry application developed at Wesleyan University, is representative of loops frequently found in the compute-intensive portions of chemistry applications. The loop contains reciprocal as well as square-root operations that are candidates for this treatment.

do 200 i = 1, n

  r2(i) = 1 / ( xx(i3) ** 2 + xx(i3 + 1) ** 2 + xx(i3 + 2) ** 2)
  r1(i) = sqrt( r2(i))
  i3 = i3 + 3

200 continue

recip and rsqrt are also important to graphics applications that use the reciprocal and reciprocal square root operations in important computational parts.


Fast Intrinsics

The MIPSpro compilation system supports a fast version of intrinsic library functions. Selected mathematical functions from the standard mathematical library are hand-coded in assembly language to take maximum advantage of the MIPS IV instruction set of the MIPS R8000 architecture. Specifically, frequently used intrinsics like the transcendental functions (log, exp, power, sin, cos, cis and tan) are hand-coded in assembly and are part of a separate fast mathematical library.

The fast library can be invoked with the -lfastm command-line flag. The accuracy of all the hand-coded transcendental functions (except for tan) is better than 2 ULPs (units in the last place).


9.2.8 Global Code Motion

Global Code Motion (GCM) is a general-purpose optimization that redistributes the instructions among basic blocks along an execution path to improve instruction-level parallelism and make better use of machine resources.

Traditional global optimizers avoid moving instructions in cases that might cause them to be executed along control flow paths where they would not have been in the original program. The MIPSpro global optimizer does perform such code motion, called speculative code motion, because the instructions moved are executed based on speculation that they will actually prove useful. This kind of aggressive code motion is unique to the MIPSpro compilers. By default, GCM is very conservative in its speculations. However, a number of options are available to control the degree of speculation.

Valid speculative code motion must normally avoid moving operations that might cause runtime traps. As a result, turning off certain traps at runtime enables more code motion. Thus, applications that can ignore floating point exceptions in certain segments of the program can take advantage of this optimization. Similarly, applications that can ignore memory access exceptions can also take advantage of this feature. For example, in the SPECfp92 benchmark 013-spice2g6, speculative code motion can be enabled for a critical loop by turning off floating-point and memory exceptions around that loop, resulting in a healthy performance gain.

Figure 27 illustrates the different kinds of speculations available.

[Speculation Types]

FIGURE 27 Speculation Types


What follows is a brief description of each of these optimizations (with appropriate user-level flag control):

  • aggressive speculation. (-GCM:aggressive_speculation [ = (ON | OFF) ] )

    GCM normally does not move instructions to basic blocks that are already using most of the instruction execution resources available, since doing so will likely extend the execution time of the block. This option minimizes that bias, which often helps floating-point-intensive code.

  • array speculation. (-GCM:array_speculation [ = (ON | OFF)])

    A form of speculation that is often very effective is called bottom loading. It moves instructions from the top of a loop's body both to the block before the loop (for the first iteration) and to the end of the loop body (which executes them at the end of one iteration so that they are ready early for the next iteration). Doing this, however, means that the instructions are executed one or more extra times in the last iteration(s) of the loop. If the instructions moved are loading elements of an array, extra accesses might occur beyond the end of the array. This option permits such out-of-bounds array references by padding the arrays to prevent the out-of-bounds references from causing memory exceptions.

  • pointer speculation. (-GCM:pointer_speculation [ = (ON | OFF ) ] )

    This option allows speculative motion of loads of two kinds, both involving pointer usage. The first allows motion of loads through pointers that may be NULL. The second form moves a reference like *(p + n), for a small integer n, to a block that already contains a reference to *p. The assumption is that if p is a valid address, p+n will be, too. In the example below, the load of p->next->val can be moved before the if through a potentially NULL pointer, as can the load of p->final_val, which is offset by a small amount from the p->next reference.

    if (p->next != NULL) {
      sum = sum + p->next->val;
    } else {
      sum = sum + p->final_val;
    }
    


  • static load speculation. (-GCM:static_load_speculation [ = (ON | OFF) ] )

    This option allows the speculative motion of loads from static data areas.


9.2.9 Software Pipelining

Software pipelining is arguably the most important architecture-specific optimization implemented in the MIPSpro Compilers. It is a practical, efficient and general-purpose scheduling technique for exploiting fine-grained, instruction level parallelism available in modern-day superscalar processors like the MIPS R8000.

In software pipelining, iterations of loops are continuously initiated at constant intervals without having to wait for preceding iterations to complete. That is, multiple iterations, in different stages of computation, are in progress simultaneously. The steady state of this pipeline constitutes the loop body of the object code. Consider this simple DAXPY loop:

do i = 1, n
  v(i) = v(i) + X * w(i)
enddo

On the MIPS R8000 architecture, this loop can be coded in assembly language as two load instructions followed by a multiply-add instruction and a store. Figure 28 shows this execution order constraint.

[Execution Order Constraint]

FIGURE 28 Execution Order Constraint


This simple schedule completes one iteration of the loop body in five machine cycles. Given that the R8000 processor can initiate up to two memory operations and two floating-point operations in the same cycle, the instructions above are initiated by the machine as outlined in Table 8:

Cycle Count    Memory Operations             Floating-point Operations

0              load &v(i); load &w(i)        madd X, w(i), v(i)
1
2
3
4              store &v(i)

TABLE 8 Simple Schedule for Loop Body of DAXPY

If the same loop were unrolled by a factor of four, the loop body would look like:

do i = 1, n, 4

  v(i) = v(i) + X * w(i)
  v(i+1) = v(i+1) + X * w(i+1)
  v(i+2) = v(i+2) + X * w(i+2)
  v(i+3) = v(i+3) + X * w(i+3)

enddo

The unrolled loop allows for instructions from four independent iterations to execute in parallel, thereby increasing the instruction level parallelism and hence the performance of this loop. Table 9 illustrates the sequence of loads, madds and stores for this unrolled loop. (This schedule completes four iterations of the loop in eight machine cycles)

Cycle Count    Memory Operations                 Floating Point Operations

0              load &v(i); load &w(i)          madd X, w(i), v(i)
1              load &v(i+1); load &w(i+1)      madd X, w(i+1), v(i+1)
2              load &v(i+2); load &w(i+2)      madd X, w(i+2), v(i+2)
3              load &v(i+3); load &w(i+3)      madd X, w(i+3), v(i+3)
4              store &v(i)
5              store &v(i+1)
6              store &v(i+2)
7              store &v(i+3)

TABLE 9 Schedule for Loop Body of Four-way Unrolled DAXPY

The schedule above completes four iterations of the loop body in eight cycles, thereby improving performance to one quarter of the R8000's peak MFLOPS rate. However, this schedule still leaves room for improvement. Each store has to wait three cycles for its corresponding madd instruction to complete, forcing the four store operations to be initiated in different cycles. In other words, the schedule above does not take advantage of the R8000's ability to do two stores in one cycle.

By using software pipelining, the loop instructions can be initiated at constant intervals such that each iteration executes a combination of loads and stores from different iterations. In the DAXPY example, this can result in a schedule that would complete two iterations every three cycles to realize significant performance improvements over the two previous schedules.

Table 10 illustrates the machine schedule for the software pipelined version of the above loop. To prepare properly for entry into such a loop, a prologue section of code is added that sets up the registers for the first few stores in the main loop. Similarly, in order to exit the loop properly, an epilog section is added that performs the final stores. Any preparation of registers needed for the epilog is done in the cleanup section of the code. (The main loop completes four different iterations in six machine cycles).

Cycle Count    Memory Operations        Floating Point Operations

PROLOG:
               t1 = load &v(i);         t7 = madd X, w(i), v(i)
               t2 = load &w(i)

               t4 = load &v(i+1);       t8 = madd X, w(i+1), v(i+1)
               t5 = load &w(i+1)
MAINLOOP:

0              t1 = load &v(i+2);       t3 = madd X, w(i+2), v(i+2)
               t2 = load &w(i+2)

1              t4 = load &v(i+3);       t6 = madd X, w(i+3), v(i+3)
               t5 = load &w(i+3)

2              store t7; store t8       beq CLEANUP

3              t1 = load &v(i+4);       t7 = madd X, w(i+4), v(i+4)
               t2 = load &w(i+4)

4              t4 = load &v(i+5);       t8 = madd X, w(i+5), v(i+5)
               t5 = load &w(i+5)

5              store t3; store t6       bne MAINLOOP

EPILOG:        store t7; store t8       br ALLDONE

CLEANUP:       t7 = t3; t8 = t6
               br EPILOG

ALLDONE:

TABLE 10 Schedule for Software-pipelined DAXPY

In the main loop, the code completes four different iterations in six cycles, which is better than the previous two schedules. Table 11 illustrates the performance improvement in the three cases.

Scheduling Type        Performance                 Speedup

Simple scheduling      1 iteration in 5 cycles     1.0
4-way unrolling        4 iterations in 8 cycles    2.5
Software pipelining    4 iterations in 6 cycles    3.3
TABLE 11 DAXPY Speedup Relative to the Simple Schedule

As Table 11 suggests, software pipelining can make a huge difference in the performance of compute-intensive applications.

Another advantage of software pipelining is its ability to generate compact code, compared to transformations like loop unrolling that can increase the program size by a noticeable amount. Compact code prevents instruction cache penalties resulting from increased code size. The most important aspect of software pipelining is its ability to generate near-optimal code for loops.

Software pipelining in the MIPSpro compilers is based on the concept of "modulo iteration-interval scheduling", a technique for software pipelining innermost loops. This is a proven technique for generating code with near-optimal performance. In effect, this technique sets a performance goal for each loop prior to scheduling and then attempts to achieve this goal by taking into account resource constraints and the program data reference constraints.

Typical benchmark programs, like the SPECfp92 benchmarks, illustrate substantial improvements in performance when compiled with software pipelining. Figure 29 summarizes the performance and cost effectiveness of software pipelining as implemented in the MIPSpro compilers.

[Cost-effectiveness of Software Pipelining]

FIGURE 29 Cost-effectiveness of Software Pipelining


9.2.10 Pointer Optimization

For many compiler optimizations, ranging from simply holding a value in a register to the parallel execution of a loop, it is necessary to determine whether two distinct memory references designate distinct objects. If the objects are not distinct, the references are said to be aliases. When these references are pointers, there is often not enough information available within a single function or compilation unit to determine whether the two pointers are aliased. Even when enough information is available, the analysis can require substantial amounts of time and space. For example, it could require an analysis of a whole program to determine the possible values of a pointer that is a function parameter.

In short, compilers must normally be conservative in optimizing memory references involving pointers (especially in languages like C), since aliases (i.e., different ways of accessing the same memory location) may be very hard to detect. Consider the following example:

float x[100];
float *c;

void f4(n, p, g)
int n;
float *p;
float *g;
{
  int i;

  for (i = 0; i < n; i++) {
    p[i] = g[i] + c[i] + x[i];
  }
}

To be safe, the compiler must assume that the references through p, g, and c may all be aliased to each other. This in turn precludes aggressive loop optimizations by the optimizer.

The MIPSpro compilers alleviate this general problem of aliasing by providing users with a number of different options for specifying pointer aliasing information to the compiler. The compiler can use this information to perform aggressive optimizations in the presence of pointers and thereby achieve healthy performance improvements.

Table 12 illustrates the various user-controlled options that can be useful for improving runtime performance.

Flag                   Description

-OPT:alias=any         Compiler should assume that any pair of
                       memory references may be aliases unless
                       proven otherwise. This is the default setting
                       in the compiler and reflects a safe assumption
                       by the compiler.

-OPT:alias=typed       Compiler assumes that any pair of memory
                       references that are of distinct types cannot
                       be aliased. For example:

                       void dbl(i, f)
                       int *i;
                       float *f;
                       {
                         *i = *i + *i;
                         *f = *f + *f;
                       }

                       The compiler assumes that i and f point to
                       different memory locations as they are of
                       different types. This can result in producing
                       an overlapped schedule for the two calculations.

-OPT:alias=unnamed     Compiler can assume that pointers will never
                       point to named objects. In the following
                       example, the compiler will assume that the
                       pointer p cannot point to the object g, and
                       will produce an overlapped schedule for the
                       two calculations. This is the default
                       assumption for the pointers implicit in Fortran
                       dummy arguments according to the ANSI standard.

                       float g;
                       void dbl(p)
                       float *p;
                       {
                         g = g * g;
                         *p = *p + *p;
                       }

-OPT:alias=restrict    Compiler should assume a very restrictive
                       model of aliasing, where no two pointers ever
                       point to the same memory area. This is a
                       rather restrictive assumption, but when
                       applied for specific well-controlled, valid
                       cases, can produce significantly better code.

                       void dbl(p, q)
                       int *p;
                       int *q;
                       {
                         *p = *p + *p;
                         *q = *q + *q;
                       }

TABLE 12 User-assisted Pointer Optimizations


9.2.11 Loop Optimizations

Most compute-intensive programs spend a significant portion of their execution time in loops, so the MIPSpro compilers spend a significant portion of their effort optimizing the loops in a program. Various techniques optimize the performance of loops, many of them enabled automatically by the compilers at the various optimization levels. This section explains important loop transformations performed by the MIPSpro compilers.


Loop Interchange

Loop interchange is a memory hierarchy optimization that modifies the data access pattern of nested loops to match the way data is laid out in memory. Fortran stores array elements in column-major order (not row-major, as most other programming languages do). In the typical Fortran 77 loop nest of Figure 30, each iteration of the i loop steps across contiguous elements of A, while each iteration of the j loop steps over an entire column of A. Assuming that A is dimensioned as A(M,N), each iteration of the j loop steps across M elements of A. If M is larger than the page size of the machine, each iteration of the j loop touches a new page, exhibiting poor locality. As a result, the program may spend a considerable portion of its time moving data between memory and the cache system, and exhibit poor performance.

[Loop Comparison]

FIGURE 30 Loop Comparison


This problem of the innermost loop having a large stride is eliminated by interchanging the two loops, as shown in the right-hand example in Figure 30. Now the innermost loop runs across contiguous elements, minimizing page faults and cache misses. Depending on the dimensions of the arrays, the transformed loop can exhibit significantly better runtime performance.
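
A minimal sketch of the transformation, assuming A is dimensioned A(M,N) as above (the loop body is illustrative):

c     before interchange: the inner j loop strides through memory
c     by M elements per iteration
      do i = 1, M
        do j = 1, N
          A(i,j) = A(i,j) + 1.0
        enddo
      enddo

c     after interchange: the inner i loop walks contiguous elements
c     of each column
      do j = 1, N
        do i = 1, M
          A(i,j) = A(i,j) + 1.0
        enddo
      enddo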

Another advantage of loop interchange is its ability to move parallelism to outer levels of a nested loop. Before interchange, the innermost j loop can be parallelized by the compiler. However, the amount of work performed within the j loop may not be sufficient for efficient parallel execution. Once loop interchange is performed, the parallelism moves to the outer loop thereby increasing the amount of work in the loop. In effect the compiler is able to parallelize a larger region of code for better performance.


Loop Distribution

The compiler performs loop distribution to partition a single loop into multiple loops. Loop distribution has the advantage of making a loop's working set better fit the paging structure of the underlying machine. It can also expose more parallelism to the compiler. By distributing the loop into a sequential loop and a parallel loop, the compiler is able to efficiently execute parts of the original loop in parallel. The multiple loops are usually smaller (in body size) compared to the original loop and are more amenable to software pipelining. Figure 31 illustrates this transformation:

[Loop Distribution]

FIGURE 31 Loop Before (Left) and After (Right) Distribution

The original loop (left) cannot be parallelized because of the data dependency arising from the reference to array D. However, after distribution the first i loop can be parallelized and the ii loop software pipelined for performance.
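
An illustrative sketch of the kind of loop described above (the arrays are hypothetical, chosen to mirror the description of Figure 31):

c     original loop: the recurrence on D serializes the whole loop
      do i = 2, n
        A(i) = B(i) + C(i)
        D(i) = D(i-1) + A(i)
      enddo

c     after distribution: the first i loop is parallelizable, and
c     the ii loop with the recurrence can be software pipelined
      do i = 2, n
        A(i) = B(i) + C(i)
      enddo
      do ii = 2, n
        D(ii) = D(ii-1) + A(ii)
      enddo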


Loop Fusion

Loop fusion, the inverse of loop distribution, involves "jamming" two originally separate loops into a single loop. Figure 32 illustrates this transformation.

[Loop Fusion]

FIGURE 32 Loop Before (Left) and After (Right) Fusion


Loop fusion can be used in many cases to combine two loops, each of which utilizes a large portion of the page space of the machine. The fused loop can have a working set that is smaller than the sum of the two individual loops, improving data reuse, and permitting better register allocation. In Figure 32, after loop fusion the elements of array A are immediately available for use by the second statement in each iteration. The optimizer recognizes the reuse of elements of A, and keeps them in registers for the operation.

Loop fusion can also increase the size of loops to improve the efficiency of parallel execution. By combining two small loops into a bigger loop, fusion sets the stage for profitable parallelization of the bigger loop. In Figure 32, the two individual loops may be too small to overcome the overheads of parallelization. However, the combined loop after fusion may be large enough to realize performance improvements from parallelization.
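
A minimal sketch of fusion (the arrays are illustrative):

c     before fusion: two separate loops over the same index range
      do i = 1, n
        A(i) = B(i) + C(i)
      enddo
      do i = 1, n
        E(i) = A(i) * 2.0
      enddo

c     after fusion: each A(i) is still in a register when the
c     second statement consumes it
      do i = 1, n
        A(i) = B(i) + C(i)
        E(i) = A(i) * 2.0
      enddo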


Loop Blocking

Loop blocking is an effective technique available in the MIPSpro compilers for optimizing the performance of the memory hierarchy for numerical algorithms. The reason for blocking is that entire matrices typically do not fit in the fast data storage (for example, the register file or cache) of the machine. Blocking decomposes matrix operations into submatrix operations, with a submatrix size chosen so that the operands can fit in the register file or cache. Since elements of a submatrix are reused in matrix operations, this reduces slow memory accesses and speeds up the computation.

Figure 33 (courtesy of Dr. James Winget, Chief Scientist, Silicon Graphics) shows the change in the memory access pattern as a result of loop blocking.

[Memory Reference Pattern Before and After Loop Blocking]

FIGURE 33 Memory Reference Pattern Before (Top) and After (Bottom) Loop Blocking


The before picture in Figure 33 references four sets of consecutive addresses over a certain period of time before repeating the access pattern. Blocking restructures the loop to reflect the memory access pattern illustrated in the after picture. Here, subsets of all four data sets reside in cache and are accessed in a shorter period of time. This arrangement enables useful computation to be performed efficiently on a cache resident subset of the original dataset before moving on to the next subset. Performance improvements come from reduced processor-to-main memory traffic as a result of efficient cache utilization.
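
A minimal sketch of blocking applied to a matrix multiply (the block size nb is illustrative; in practice the compiler chooses it based on the cache size):

c     unblocked matrix multiply: once the matrices exceed the cache,
c     operands are repeatedly refetched from main memory
      do j = 1, n
        do k = 1, n
          do i = 1, n
            C(i,j) = C(i,j) + A(i,k) * B(k,j)
          enddo
        enddo
      enddo

c     blocked version: the computation proceeds on nb-wide blocks
c     whose operands fit in the cache and are reused before moving on
      do jj = 1, n, nb
        do kk = 1, n, nb
          do j = jj, min(jj+nb-1, n)
            do k = kk, min(kk+nb-1, n)
              do i = 1, n
                C(i,j) = C(i,j) + A(i,k) * B(k,j)
              enddo
            enddo
          enddo
        enddo
      enddo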


Loop Unrolling

Loop unrolling is a fundamental transformation that is a basic component of other restructuring techniques like software pipelining and unroll-and-jam. Unrolling the outer loops of nested loop regions is usually important for good use of the memory hierarchy. Unrolling inner loops improves the usage of the floating-point registers and provides more room for instruction overlap. Unrolling also decreases the trip count of loops, thereby reducing the loop's conditional branch overhead.

The number of times a loop should be unrolled (unrolling factor) is determined by the compiler, based on numerous considerations including the amount of data referenced in the loop body, the data access dependencies, the availability of registers, the size of data cache, and the purpose of unrolling. Figure 34 illustrates the process of unrolling.

[Loop Unrolling]

FIGURE 34 Loop Before (Left) and After (Right) Unrolling


In Figure 34, the unrolled loop exposes a lot of instruction level parallelism as the different assignments in the unrolled loop can be overlapped for performance.
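
A minimal sketch of four-way unrolling (assuming the trip count is a multiple of four; the compiler generates cleanup code for the general case):

c     original loop
      do i = 1, n
        A(i) = B(i) * s
      enddo

c     unrolled by a factor of four: the four independent assignments
c     can be overlapped by the instruction scheduler
      do i = 1, n, 4
        A(i)   = B(i)   * s
        A(i+1) = B(i+1) * s
        A(i+2) = B(i+2) * s
        A(i+3) = B(i+3) * s
      enddo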


Loop Multiversioning

Multiversioning is a technique employed by the compiler to improve the efficiency of parallel execution. Many loops, especially in Fortran, have symbolic bounds, so their trip counts cannot be determined at compile time. However, the compiler can generate multiple versions of the original code at compile time; the resulting program executes the appropriate version depending on the loop's trip count as determined dynamically at execution time.

[Loop Multiversioning]

FIGURE 35 Loop Before (Left) and After (Right) Multiversioning.


Multiversioning improves the overall efficiency of parallel execution by using accurate information at program execution time.
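
A conceptual sketch of multiversioning (the threshold and loop body are illustrative; the test is generated by the compiler, not written by the programmer):

c     original loop: the trip count n is unknown at compile time
      do i = 1, n
        A(i) = A(i) + B(i)
      enddo

c     multiversioned code, conceptually
      if (n .gt. 1000) then
c       parallel version for large trip counts
        do i = 1, n
          A(i) = A(i) + B(i)
        enddo
      else
c       serial version for small trip counts
        do i = 1, n
          A(i) = A(i) + B(i)
        enddo
      endif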


Pattern Matching

The compiler back end understands patterns of standard computational kernels like SAXPY, DAXPY, LINPACK, and the like, and generates optimal code for such code sequences. Pattern matching is also used for performing basic dependency analysis on loops to improve the effectiveness of software pipelining.


9.2.12 Parallelization

The MIPSpro Fortran 77 and C compilers fully support automatic and user-assisted parallelization. User-assisted parallelization is available in the form of comment-based directives to guide program parallelization, which is enabled by specifying the -mp flag for both Fortran and C.

The directives provide comprehensive support for specifying and controlling the degree and dynamics of parallel execution. For example, the directives can be used to specify conditional parallelism to ensure that parallelism occurs only under certain dynamic conditions. In the two examples shown in Figure 36, the if clause in the directive specifies the conditions for parallel execution. In the Fortran 77 example, the loop executes in parallel only if the value of jmax is greater than 1000. Similarly, the C example executes in parallel only if the value of max is greater than 1000.

[Fortran and C Parallelization Examples]

FIGURE 36 Fortran and C Parallelization Examples
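
A sketch of what the Fortran 77 case might look like, using the Silicon Graphics c$doacross directive with an if clause (as listed in Table 16 later in this chapter); the loop body is illustrative:

c$doacross if (jmax .gt. 1000)
c$and local (j)
      do j = 1, jmax
        A(j) = B(j) + C(j)
      enddo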


The compilers automatically detect program parallelism by employing the technique of data dependency analysis. Both Fortran 77 (-pfa flag) and C (-pca flag) compilers have this capability. Data dependence information is also used by a number of loop transformations listed in previous sections of this chapter.


9.2.13 Procedure Inlining

The compilers provide for automatic and user-directed inlining of functions and subroutines in Fortran and C programs. Inlining is the process of replacing a function reference with the text of the function. This process eliminates function call overhead and improves the effectiveness of numerous scalar and parallel optimizations by exposing the relationships between function arguments, returned values, and the surrounding code.
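
A minimal sketch of the transformation (the routine and array names are illustrative):

c     before inlining: a small routine is called inside a hot loop
      subroutine saxpy1(x, y, s)
      real x, y, s
      x = x + s * y
      return
      end

      do i = 1, n
        call saxpy1(A(i), B(i), s)
      enddo

c     after inlining: the call overhead disappears and the loop body
c     is exposed to software pipelining and parallelization
      do i = 1, n
        A(i) = A(i) + s * B(i)
      enddo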

The compilers provide command-line options to direct the inlining of the specified list of subroutines. Flags are available to limit inlining to routines that are referenced in deeply nested loops, where the reduced call overhead or enhanced optimization is multiplied. Options exist to perform interprocedural inlining, whereby instances of routines can be inlined across different files.

One drawback of unlimited inlining is its tendency to increase the code size of the resulting program. Uncontrolled replacement of function or subroutine calls with the actual body of the called routine can cause "code explosion," which in turn increases compile time and reduces the effectiveness of other optimizations. The technique of Interprocedural Analysis (IPA) provides the benefits of inlining without necessitating inlining the code.


9.2.14 Interprocedural Analysis (IPA)

To increase the effectiveness of the automatic parallelizer via interprocedural analysis, the -ipa[=list] option can be used to specify the degree of interprocedural analysis to be performed. By default, this option turns on interprocedural analysis on all eligible routines. IPA can be enabled on selective routines by specifying the list of routines.

IPA tracks the flow of control and data across procedure boundaries and uses this information to drive the optimization process. IPA is particularly useful for interprocedural constant propagation, which makes information such as incoming loop bounds available at compile time. This ability can be useful for driving optimization decisions. Figure 37 shows an example.

[Loops With and Without Interprocedural Analysis]

FIGURE 37 Loops With and Without Interprocedural Analysis


In this example, the value of n is used as a loop bound within the subroutine foo. In the absence of IPA, the compiler must assume that the value of n may be modified inside the call to subroutine foo.

Moreover, the value of n on entry to subroutine foo will not be known at compile-time. As a result, the compiler must generate multiversion code when parallelizing the j loop within foo. However, with IPA turned on, it is possible to know (at compile time) the value of n on entry to foo at the call site. This information is then used by the automatic parallelizer to decide how to parallelize the j loop in subroutine foo. If the value of n is small, the loop may not be profitably parallelized. On the other hand, if the value of n is large, the loop gets parallelized for profitable execution.
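
A sketch matching the description above (the constant bound 1000 is illustrative):

c     caller: the argument that becomes the loop bound is a
c     compile-time constant
      program main
      real A(1000)
      call foo(A, 1000)
      end

      subroutine foo(A, n)
      integer n, j
      real A(n)
c     without IPA, n is unknown here and the compiler must emit
c     multiversion code for the j loop; with IPA the constant 1000
c     is propagated from the call site and the loop can be
c     parallelized unconditionally
      do j = 1, n
        A(j) = A(j) * 2.0
      enddo
      return
      end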

In short, IPA provides a mechanism to propagate information across procedure boundaries without having to inline calls, thereby increasing the effectiveness of all optimizations.


9.3 Porting Cray Code [2]

The MIPSpro compilers have a number of value-added features and extensions to facilitate easy migration of existing Cray Fortran programs over to Silicon Graphics systems. Important Cray extensions are accepted as-is by the MIPSpro compilers. A large portion of the remaining Cray features are handled by providing equivalent functionality in the MIPSpro compilation system.

Before examining the degree of Cray compatibility provided by the MIPSpro compilers, it is worthwhile to understand the key features of Cray's compilation system. Cray extensions fall into three categories:

  • language extensions

  • compiler directives for scalar and vector performance

  • parallel-processing directives

This section describes the Cray extensions in each category and the corresponding capabilities available in the MIPSpro compilers.

[2] This section is based on work done by David (Wei) Chen, Systems Engineer, Silicon Graphics.

9.3.1 Language Extensions

Cray Fortran 77 supports the 64-bit programming model for most elementary datatypes. For instance, basic datatypes like REAL, INTEGER and LOGICAL are 64-bit quantities for Cray. Support also exists for the quadruple-precision COMPLEX data type. Table 13 illustrates the equivalent functionality supported by the MIPSpro compilation system.


Cray Fortran    Corresponding        Comments
Extension       Silicon Graphics
                Functionality

real            real*8               Replace real with real*8, or turn
                                     on the command-line option -r8,
                                     which promotes all real data
                                     types to 64-bit quantities.


integer         integer*8            Replace integer declarations
                                     with integer*8, or turn on the
                                     command-line option -i8 to
                                     promote all integer declarations
                                     to 64-bit quantities. Explicit
                                     integer*8 declarations provide a
                                     convenient way to selectively
                                     promote only the integers that
                                     need to be 64-bit quantities,
                                     letting the default integer data
                                     type remain a 32-bit quantity and
                                     preventing the data cache
                                     conflicts that can result from
                                     larger integer quantities. This
                                     is a more flexible,
                                     performance-oriented approach to
                                     managing 64-bit integer data
                                     types. Library routines called
                                     from user applications should
                                     conform to the integer*8 format.

logical         logical*8            Replace logical with logical*8
                                     if a 64-bit quantity is required.

complex         complex*32           Use complex*32 to get quad
                                     precision support.

TABLE 13 Cray Fortran 77 Data Type Extensions

Cray Fortran 77 also supports Fortran 90-style indexed array syntax. Table 14 illustrates the equivalent Silicon Graphics functionality for handling Cray Fortran 77 array extensions. MIPSpro Fortran 77 supports the Fortran 90-style array syntax and also supports dynamic allocation of arrays.

Cray Program with Indexed and Vector-     Equivalent Silicon
valued Array Section Selectors            Graphics Program

dimension A(10), B(10), C(5), TEMP(5)     dimension A(10), B(10), C(5)
dimension X(5,5), Y(2:10)                 dimension X(5,5), Y(2:10)

A=B                                       do i = 1, 10
                                          A(i) = B(i)
                                          enddo

C = A(3:7)                                do i = 1,5
                                          C(i)=A(i+2)
                                          enddo

B(1:5) = X(3, 1:5)                        do i = 1, 5
                                          B(i) = X(3, i)
                                          enddo

A = sin(B)                                do i = 1, 10
                                          A(i) = sin(B(i))
                                          enddo

TEMP = A(C)                               do i = 1,5
                                          TEMP(i) = A(C(i))
                                          enddo

TEMP = X(1, C)                            do i = 1,5
                                          TEMP(i) = X(1, C(i))
                                          enddo

TEMP = A(C+C)                             do i = 1,5
                                          TEMP(i) = A(C(i) + C(i))
                                          enddo

TABLE 14 Cray Fortran 77 Array Syntax


9.3.2 Compiler Directives for Scalar and Vector Performance

Cray Fortran supports a few directives to improve the performance of scalar and vector code. Notable vectorization directives are listed in Table 15 with the equivalent Silicon Graphics functionality.

Cray Fortran
Vectorization      Silicon Graphics Corresponding Functionality
Directive

cdir$ivdep         This directive informs the Cray compiler that there
		   is no data dependency in the loop that follows this
		   directive, thus ensuring complete vectorization of
		   the loop. This directive is accepted as is by the
		   MIPSpro compilers. It tells the compiler to be less
		   strict when deciding whether it can get some sort of
		   parallelism between loop iterations. By default, the
		   compiler makes conservative assumptions about
		   possible memory reference conflicts. The directive
		   allows the compiler to be more aggressive about such
		   assumptions. Thus, superscalar optimizations like
		   software pipelining can benefit from recognizing
		   this directive.


cdir$nextscalar    This directive informs the compiler to generate
		   only scalar (nonparallel) instructions for the
		   loop that immediately follows this directive. The
		   MIPSpro compiler accepts this directive as is and
		   adheres to the original meaning of this directive.

TABLE 15 Cray Fortran 77 Vectorization Directives
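
For example, a gather/scatter loop whose index vector the programmer knows contains no repeated values might be annotated as follows (an illustrative sketch; the directive is the one described in Table 15):

c     the programmer asserts that ix(i) never repeats within the
c     loop, so iterations are independent despite the indirection
cdir$ivdep
      do i = 1, n
        A(ix(i)) = A(ix(i)) + B(i)
      enddo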


9.3.3 Cray Compiler Parallel-processing Directives

The Cray compilers support three different techniques to parallelize Fortran code. These are:

  • autotasking (Cray's version of the automatic parallelizer): The automatic parallelization capability of the MIPSpro Power Fortran 77 compiler is equivalent to this feature of Cray.

  • microtasking (Cray's version of user-assisted parallelization): Users can insert comment-based directives in their original Fortran 77 code to specify loop-level and region-level parallelism to the compiler. MIPSpro Power Fortran 77 compiler supports a complete set of Parallel Computing Forum (PCF) standards-based comment directives. The PCF directives support loop-level and region-level parallelism and are similar to Cray's microtasking facility. By using the PCF directives users can replace the Cray microtasking directives with equivalent PCF directives to specify the same degree of parallelism.

  • macrotasking (Cray's version of task-level parallelism): This was Cray's original facility for specifying coarse-grained process-level parallelism. Silicon Graphics compilation system provides a complete user-level threads library for users interested in taking advantage of process-level parallelism. In general, macrotasking is nonportable because of the dependency on vendor-specific libraries.

Table 16 maps Cray's parallel processing capabilities with their equivalent Silicon Graphics functionality.

Cray's Parallel                      Silicon Graphics
Processing Functionality             Equivalent Capability

The "cmic$parallel" directive        The PCF directive "c*KAP*
is used to declare a parallel        parallel region" provides
region in Cray Fortran.              equivalent functionality.

  cmic$ parallel [ if (exp) ]        c*KAP* parallel region [ if (exp) ]
  cmic$ shared ( var, ... )          c*KAP* shared ( var, ... )
  cmic$ private ( var, ... )         c*KAP* local ( var, ... )

    Parallel code includes loops       Parallel code includes loops
    with independent iterations,       with independent iterations,
    adjacent independent blocks        adjacent independent blocks
    of code, critical sections,        of code, critical sections,
    or a combination of all            or a combination of all 
    these cases.                       these cases.

  cmic$ endparallel                  c*KAP* end parallel region

The "cmic$doall" directive is        This functionality can be
used for specifying loop-level       replicated by using Silicon
parallelism in Cray Fortran.         Graphics Fortran MP "c$doacross"
                                     directive.

  cmic$ doall [ if (exp) ]           c$doacross [ if (exp) ]
  cmic$ shared ( var, ... )          c$and shared ( var, ... ),
  cmic$ private ( var, ... )               local ( var, ... )
  cmic$ [ single | chunksize(n) |    c$and lastlocal ( var, ... )
          numchunks(n) | guided |    c$and [ mp_schedtype=type, chunk=n ]
          vector ] [ savelast ]

    do i = n1, n2                      do i = n1, n2
    ....                               ....
    enddo                              enddo

  cmic$ endparallel

TABLE 16 Cray Parallel Processing Functionality

To summarize, the MIPSpro compilers have sufficient functionality built into them to facilitate smooth migration of Cray code onto Silicon Graphics platforms. Important Cray directives like cdir$ivdep are recognized and processed to improve the performance of superscalar optimizations like software pipelining, while other vector-specific directives like cdir$vector are essentially ignored by the MIPSpro compilers. In terms of general-purpose optimizations, the MIPSpro compilers favor superscalar optimizations like software pipelining and global speculative code motion, while the Cray compilers attempt vectorization for best performance.


9.4 References

