RISC Fights Back with the Mips R12000

The latest Rx000-series processor reaffirms Silicon Graphics' commitment to RISC.

Tom R. Halfhill

It hasn't been a stellar year for RISC. The biggest desktop RISC vendor, Apple, has its own troubles. In the workstation and server markets, Intel's x86 and Microsoft's Windows NT are making significant inroads against RISC and Unix. Microsoft halted development of NT on the PowerPC and the Mips Rx000-series microprocessors. And Silicon Graphics, Inc. (SGI) -- Mips Technologies' parent company -- has agreed to make x86-based workstations that run NT.

Where does that leave Mips? Still in the ball game, if the new R12000 processor is successful. Derived from the R10000, announced in 1994 (see "T5: Brute Force," November 1994 BYTE), the R12000 is an evolutionary design that improves upon the R10000 in several ways. The R12000 isn't a radical redesign of the Rx000 microarchitecture, as was the R10000. Instead, Mips decided to tweak a proven core.

The R12000 was taped out in early September and should begin volume production in the first half of 1998 -- assuming there aren't any last-minute snags in the silicon. It will be manufactured by Mips's foundry partners, NEC Electronics and Toshiba. Workstations and servers should follow in the second half of 1998.

Smaller, Faster, Better

The R12000 will debut at 300 MHz on a 0.25-micron, four-layer-metal CMOS process. The R10000 currently peaks at 200 MHz on a 0.35-micron, four-layer process. Actually, NEC and Toshiba can produce five metal layers with their 0.25-micron processes, but Mips engineers limited themselves to four layers to accelerate the production schedule. They could press the optional fifth layer into service if they encounter problems with the initial samples. At four layers, the R12000's die area is 204 square millimeters -- roughly one-third smaller than the R10000, even though the new chip has about 100,000 more transistors (6.9 million total).
Sometime in 1999 or 2000, the next-generation 0.18-micron processes should become available, along with at least six layers of metal interconnects. Those advances will greatly shrink the R12000 and permit even higher clock speeds (over 400 MHz), lower operating voltages, and reduced power consumption.

At 0.25 micron, with a 2.5-V core and 1.5-V I/O, the first production version of the R12000 should dissipate about 20 W. It will first appear in a 600-pin ceramic land grid array (CLGA) package, but it will soon afterward adopt the more popular ball grid array (BGA). It's pin-compatible with the R10000.

Like Father, Like Son

The R12000 retains the basic 64-bit core of the R10000, which was the first single-chip superscalar processor from Mips. It can execute up to five instructions per cycle and retire up to four per cycle using its two integer units, two FPUs, and load/store unit, as shown in the figure "R12000 Microarchitecture". The processor's minimum pipeline depth is five stages.

The R12000 can execute instructions out of order, dynamically predict branches, and speculatively execute instructions up to four branches deep. It has 64 integer registers and 64 FP registers (each 64 bits wide), which the CPU dynamically renames to represent the architectural set of 32 integer and 32 FP registers. It adheres to the 64-bit Mips 4 architecture, and Mips says that binaries optimized for the R10000 should run even better on the R12000 without recompiling.

Even the Level 1 (L1) caches remain unchanged, bucking the trend toward more on-chip memory. However, the caches are respectably large to begin with: 32 KB each for instructions and data, twice as much as on a Pentium II.

One of the most significant changes in the R12000 is that it can juggle 50 percent more pending instructions than an R10000 while reordering the instruction stream.
In effect, this opens a larger window onto the executing program and gives the R12000 more flexibility to rearrange the instructions in the most efficient order to keep its execution units busy.

Here's how it works. The CPU maintains a list of occupied registers, called the active list. Registers on the active list can have two states: active (currently in use by an executing instruction) or completed (the final result of an executed instruction). When a completed result retires, the register is free to handle a new instruction, so the CPU removes it from the active list. The more instructions the CPU can maintain on the active list, the larger the chunk of code it can reorder to optimize the instruction stream. The R10000 maintains an active list of 32 instructions; the R12000 increases that number to 48.

Mips also added a branch target buffer (BTB) and quadrupled the size of the branch-prediction table. The BTB is a 32-entry, two-way set-associative cache that holds the target addresses of branches. Most of the time, the R12000 finds the target address it needs in this cache instead of fetching from the L1 cache.

The branch-prediction table now holds 2048 entries instead of 512. Each entry is a 2-bit value that predicts the outcome of a branch instruction. Two bits allow four possibilities: strongly taken, weakly taken, weakly not taken, and strongly not taken. The CPU dynamically adjusts those predictions by watching the outcomes of previous branches.

Likewise, Mips doubled the size of the way-prediction table for the Level 2 (L2) cache; it now holds 16,384 entries. This table allows the CPU to fetch things more quickly from the cache. Because the cache is two-way set-associative, the CPU loads two lines of instructions and data during each fetch, one after the other. The way-prediction table helps the CPU decide which line to load first.

All these changes should improve performance when running large programs, especially databases.
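
The four-state scheme described for the branch-prediction table is commonly modeled as a table of 2-bit saturating counters. The Python sketch below illustrates that general technique; it is not Mips's actual circuit. The 2048-entry size comes from the article, but the class name, the way a branch address is mapped to a table entry, and the initial counter state are all assumptions made for illustration.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters (illustrative model only)."""

    # Counter values 0..3 map onto the four states named in the article.
    STATES = ("strongly not taken", "weakly not taken",
              "weakly taken", "strongly taken")

    def __init__(self, entries=2048, initial=1):
        # 2048 entries matches the R12000 figure quoted in the article.
        # Starting every entry at "weakly not taken" is an assumption.
        self.entries = entries
        self.table = [initial] * entries

    def _index(self, branch_addr):
        # Assumed indexing: drop the two byte-offset bits of a 4-byte
        # instruction address, then wrap to the table size.
        return (branch_addr >> 2) % self.entries

    def predict(self, branch_addr):
        # Predict "taken" when the counter is in either taken state.
        return self.table[self._index(branch_addr)] >= 2

    def state(self, branch_addr):
        return self.STATES[self.table[self._index(branch_addr)]]

    def update(self, branch_addr, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        i = self._index(branch_addr)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


def simulate(outcomes, branch_addr=0x1000):
    """Run one branch through the predictor; return (hits, final state)."""
    predictor = TwoBitPredictor()
    correct = 0
    for taken in outcomes:
        if predictor.predict(branch_addr) == taken:
            correct += 1
        predictor.update(branch_addr, taken)
    return correct, predictor.state(branch_addr)
```

For a loop branch that is taken nine times and then falls through, `simulate([True] * 9 + [False])` returns `(8, "weakly taken")`: the predictor misses once while warming up and once at loop exit. Because a single miss only moves a counter one step, the entry still predicts "taken" the next time the loop is entered.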
Mips points out that processors like the R12000 are typically found in servers and workstations, not in desktop PCs, so they should be optimized for different tasks. For instance, the expanded way-prediction table works best with an L2 cache of 4 MB -- eight to 16 times larger than the L2 caches typically found in desktop PCs.

That's also why the R12000 (like the R10000) supplements the 64-bit-wide system bus with a 128-bit-wide backside bus for the L2 cache. That's twice as wide as the backside bus on a Pentium Pro or a Pentium II. Also, the signaling required for the backside bus (address tags, error correction, and so on) travels on separate wires instead of being multiplexed with the data. The result is higher throughput. The trade-off is a package with 600 pins -- too many for the R12000 to be an economical mass-market processor.

One important difference between the backside bus on the R10000 and that on the R12000 is that the new processor can't drive its bus at the core frequency. Clock divisors range from 1.5 to 3.5 in 0.5 increments, so the R12000's backside bus cannot run faster than 200 MHz if the core runs at 300 MHz. Intel's 0.25-micron Pentium II (aka Deschutes) will have the ability to drive its backside bus at core speeds of 333 MHz or more when it appears in midyear. But the R12000's bus is twice as wide, so even at 200 MHz it will have more peak bandwidth (3.2 GBps) than a Deschutes at 333 MHz (2.6 GBps).

Alive and Kickin'

Together with an improved die layout and optimized signal paths, all these tweaks should boost the R12000's performance about 50 percent beyond the R10000's. Although Mips has successfully tested Unix on simulations of the R12000, the engineers were still awaiting the first silicon samples when this article went to press, so actual benchmarks are not yet available.

Will the R12000 be good enough to fend off Intel for another CPU generation?
With its superior bus bandwidth, wider parallelism, and stronger emphasis on FP performance, the R12000 should be better suited for high-end graphics workstations and servers.

Where to Find

NEC Electronics
Santa Clara, CA
Phone: 408-588-6000
Fax: 408-588-6130
Internet: http://www.nec.com/necel/

Silicon Graphics/Mips Group
Mountain View, CA
Phone: 650-933-3900
Fax: 650-960-0197
Internet: http://www.sgi.com/MIPS/

Toshiba
Irvine, CA
Phone: 714-455-2000
Fax: 714-859-3963
Internet: http://www.toshiba.com/taec/

R12000: What's New

- 64-bit core is substantially the same as the R10000's; it's compatible with the Mips 4 architecture.
- 300-MHz core frequency; the dedicated L2 cache bus can run at 200 MHz.
- Estimated performance is about 50 percent greater than the R10000's at 200 MHz.
- Handles up to 48 pending instructions, versus 32 on the R10000.
- New 32-entry branch target buffer for caching branch addresses.
- Branch-prediction table is quadrupled in size (2048 entries).
- L2 cache way-prediction table is doubled in size (16,384 entries).
- 0.25-micron, four-layer-metal CMOS, 600-pin CLGA package, pin-compatible with the R10000.
- 6.9 million transistors on a 204-square-millimeter die.
- Tape-out in early September 1997; volume production in the first half of 1998.
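
The backside-bus figures quoted in the article (a 200-MHz ceiling and 3.2 GBps of peak bandwidth for the R12000, versus 2.6 GBps for Deschutes) can be checked with simple arithmetic. The Python sketch below assumes one transfer per bus clock and counts a gigabyte as 10^9 bytes. The function names are hypothetical, and the 64-bit Pentium II bus width is inferred from the article's statement that the R12000's 128-bit bus is twice as wide.

```python
# Backside-bus clock divisors quoted in the article: 1.5 to 3.5 in 0.5 steps.
DIVISORS = [1.5, 2.0, 2.5, 3.0, 3.5]


def backside_clock_mhz(core_mhz, divisor):
    """Backside-bus clock for a given core clock and divisor."""
    return core_mhz / divisor


def peak_bandwidth_gbps(bus_width_bits, clock_mhz):
    """Peak bandwidth in GB/s (10^9 bytes), assuming one transfer per clock."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * clock_mhz / 1000


# Fastest R12000 backside clock: 300-MHz core with the smallest divisor.
r12000_bus_mhz = max(backside_clock_mhz(300, d) for d in DIVISORS)

# R12000: 128-bit bus at that clock.
r12000_gbps = peak_bandwidth_gbps(128, r12000_bus_mhz)

# Deschutes: assumed 64-bit bus running at the 333-MHz core speed.
deschutes_gbps = peak_bandwidth_gbps(64, 333)

print(f"R12000 backside bus: {r12000_bus_mhz:.0f} MHz, {r12000_gbps:.2f} GB/s")
print(f"Deschutes backside bus: 333 MHz, {deschutes_gbps:.2f} GB/s")
```

Under these assumptions the R12000 works out to 3.2 GB/s and Deschutes to about 2.66 GB/s, which the article gives as 2.6 GBps.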