RISC Fights Back with the Mips R12000 

The latest Rx000-series processor reaffirms Silicon Graphics'
commitment to RISC.

Tom R. Halfhill 

It hasn't been a stellar year for RISC. The biggest desktop RISC
vendor, Apple, has its own troubles. In the workstation and server
markets, Intel's x86 and Microsoft's Windows NT are making
significant inroads against RISC and Unix. Microsoft halted
development of NT on the PowerPC and the Mips Rx000-series
microprocessors. And Silicon Graphics, Inc. (SGI) -- Mips
Technologies' parent company -- has agreed to make x86-based
workstations that run NT.

Where does that leave Mips? Still in the ball game, if the new
R12000 processor is successful. Derived from the R10000, announced
in 1994 (see "T5: Brute Force," November 1994 BYTE), the R12000 is an
evolutionary design that improves upon the R10000 in several ways.

The R12000 isn't a radical redesign of the Rx000 microarchitecture,
as was the R10000. Instead, Mips decided to tweak a proven core. The
R12000 was taped out in early September and should begin volume
production in the first half of 1998 -- assuming there aren't any
last-minute snags in the silicon. It will be manufactured by Mips's
foundry partners, NEC Electronics and Toshiba. Workstations and
servers should follow in the second half of 1998.


Smaller, Faster, Better 

The R12000 will debut at 300 MHz on a 0.25-micron, four-layer-metal
CMOS process. The R10000 currently peaks at 200 MHz on a 0.35-micron,
four-layer process. Actually, NEC and Toshiba can produce five metal
layers with their 0.25-micron processes, but Mips engineers limited
themselves to four layers to accelerate the production schedule. They
could press the optional fifth layer into service if they encounter
problems with the initial samples.

At four layers, the R12000's die area is 204 square millimeters --
roughly one-third smaller than the R10000, even though the new chip
has about 100,000 more transistors (6.9 million total).

Sometime in 1999 or 2000, the next-generation 0.18-micron processes
should become available, along with at least six layers of metal
interconnects. Those advances will greatly shrink the R12000 and
permit even higher clock speeds (over 400 MHz), lower operating
voltages, and reduced power consumption.

At 0.25 micron, with a 2.5-V core and 1.5-V I/O, the first production
version of the R12000 should dissipate about 20 W. It will first
appear in a 600-pin ceramic land grid array (CLGA) package, but it
will soon afterward adopt the more popular ball grid array (BGA).
It's pin-compatible with the R10000.


Like Father, Like Son 

The R12000 retains the basic 64-bit core of the R10000, which was the
first single-chip superscalar processor from Mips. It can execute up
to five instructions per cycle and retire up to four per cycle using
its two integer units, two FPUs, and load/store unit, as shown in the
figure "R12000 Microarchitecture." The processor's minimum pipeline
depth is five stages.

The R12000 can execute instructions out of order, dynamically predict
branches, and speculatively execute instructions up to four branches
deep. It has 64 integer registers and 64 FP registers (each 64 bits
wide), which the CPU dynamically renames to represent the
architectural set of 32 integer and 32 FP registers. It adheres to the
64-bit Mips 4 architecture, and Mips says that binaries optimized for
the R10000 should run even better on the R12000 without recompiling.
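
To make the renaming idea concrete, here is a minimal Python sketch of a rename map for the integer registers. Only the counts (32 architectural, 64 physical) come from the description above; the class, its method names, and the free-list policy are assumptions chosen for illustration, not a description of the chip's actual logic.

```python
# Illustrative sketch only: structure and names are assumptions.
NUM_ARCH = 32   # architectural integer registers (from the article)
NUM_PHYS = 64   # physical integer registers (from the article)

class RenameMap:
    def __init__(self):
        # Assumed starting state: architectural register i lives in physical register i.
        self.map = list(range(NUM_ARCH))
        self.free = list(range(NUM_ARCH, NUM_PHYS))

    def rename(self, dest, srcs):
        """Return (phys_dest, phys_srcs, old_phys) for one instruction."""
        phys_srcs = [self.map[s] for s in srcs]
        if not self.free:
            raise RuntimeError("no free physical register; decode would stall")
        phys_dest = self.free.pop(0)
        old_phys = self.map[dest]      # previous home of this architectural register
        self.map[dest] = phys_dest
        return phys_dest, phys_srcs, old_phys

    def release(self, old_phys):
        # Called once the overwriting instruction has retired.
        self.free.append(old_phys)

rm = RenameMap()
# Two back-to-back writes to r5 receive different physical registers,
# so the second write need not wait for readers of the first.
first = rm.rename(dest=5, srcs=[1, 2])
second = rm.rename(dest=5, srcs=[3, 4])
```

In this sketch the two writes to the same architectural register land in separate physical registers, which is what lets an out-of-order core overlap them.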

Even the Level 1 (L1) caches remain unchanged, bucking the trend
toward more on-chip memory. However, the caches are respectably
large to begin with: 32 KB each for instructions and data, twice as
much as on a Pentium II.

One of the most significant changes in the R12000 is that it can
juggle 50 percent more pending instructions than an R10000 while
reordering the instruction stream. In effect, this opens a larger
window onto the executing program and gives the R12000 more
flexibility to rearrange the instructions in the most efficient order
to keep its execution units busy.

Here's how it works. The CPU maintains a list of occupied registers,
called the active list. Registers on the active list can have two
states: active (currently in use by an executing instruction) or
completed (the final result of an executed instruction). When a
completed result retires, the register is free to handle a new
instruction, so the CPU removes it from the active list. The more
instructions the CPU can maintain on the active list, the larger the
chunk of code it can reorder to optimize the instruction stream. The
R10000 maintains an active list of 32 instructions; the R12000
increases that number to 48.
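
The effect of a longer active list can be sketched in a few lines of Python. The capacities (32 and 48) and the four-per-cycle retirement limit come from the article; the class, its method names, and the scenario of a stalled oldest instruction are hypothetical and exist only to illustrate the idea.

```python
from collections import deque

class ActiveList:
    """Toy model: entries enter in program order and retire in program order."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()          # each entry is [tag, completed]

    def dispatch(self, tag):
        if len(self.entries) >= self.capacity:
            return False                # list is full, so decode stalls
        self.entries.append([tag, False])
        return True

    def complete(self, tag):
        for entry in self.entries:
            if entry[0] == tag:
                entry[1] = True
                return

    def retire(self, max_per_cycle=4):
        retired = []
        while (self.entries and self.entries[0][1]
               and len(retired) < max_per_cycle):
            retired.append(self.entries.popleft()[0])
        return retired

def in_flight_behind_stall(capacity):
    """How many instructions can be tracked while the oldest one is stalled."""
    al = ActiveList(capacity)
    n = 0
    while al.dispatch(n):
        n += 1
    for tag in range(1, n):             # everything but the oldest completes
        al.complete(tag)
    assert al.retire() == []            # nothing retires past the stalled head
    return n

r10000_window = in_flight_behind_stall(32)
r12000_window = in_flight_behind_stall(48)
```

Under these assumptions, the 48-entry list keeps 16 more instructions in flight behind a stalled instruction before the front end has to stop.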

Mips also added a branch target buffer (BTB) and quadrupled the size
of the branch-prediction table. The BTB is a 32-entry, two-way
set-associative cache that holds the target addresses of branches.
Most of the time, the R12000 finds the target address it needs in
this cache instead of fetching from the L1 cache.
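
A 32-entry, two-way set-associative buffer works out to 16 sets of two entries each. The Python sketch below models such a structure; the indexing by low-order instruction-address bits and the least-recently-used replacement are assumptions made for illustration, since the article doesn't describe either detail.

```python
class BranchTargetBuffer:
    ENTRIES = 32                     # from the article
    WAYS = 2                         # from the article
    SETS = ENTRIES // WAYS           # 16 sets

    def __init__(self):
        # Each set holds up to two (tag, target) pairs, most recently used last.
        self.sets = [[] for _ in range(self.SETS)]

    def _index_tag(self, pc):
        word = pc >> 2               # Mips instructions are 4 bytes long
        return word % self.SETS, word // self.SETS

    def lookup(self, pc):
        idx, tag = self._index_tag(pc)
        ways = self.sets[idx]
        for i, (t, target) in enumerate(ways):
            if t == tag:
                ways.append(ways.pop(i))   # mark as most recently used
                return target
        return None                  # miss: target must come from elsewhere

    def update(self, pc, target):
        idx, tag = self._index_tag(pc)
        ways = self.sets[idx]
        ways[:] = [(t, tg) for (t, tg) in ways if t != tag]
        if len(ways) >= self.WAYS:
            ways.pop(0)              # evict least recently used (assumed policy)
        ways.append((tag, target))

btb = BranchTargetBuffer()
btb.update(0x1000, 0x2000)
hit = btb.lookup(0x1000)             # target found in the buffer
miss = btb.lookup(0x1004)            # a branch never seen before
```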

The branch-prediction table now holds 2048 entries instead of 512.
Each entry is a 2-bit value that predicts the outcome of a branch
instruction. Two bits allow four possibilities: strongly taken,
weakly taken, weakly not taken, and strongly not taken. The CPU
dynamically adjusts those predictions by watching the outcomes of
previous branches.
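
The 2-bit scheme is commonly modeled as a saturating counter, and the sketch below takes that approach. The 2048-entry table size and the four states come from the article; the way a branch address selects a table entry and the starting state of each counter are assumptions for illustration.

```python
TABLE_ENTRIES = 2048   # from the article

class TwoBitPredictor:
    # Counter values: 0 = strongly not taken, 1 = weakly not taken,
    #                 2 = weakly taken,       3 = strongly taken
    def __init__(self, entries=TABLE_ENTRIES):
        self.entries = entries
        self.counters = [1] * entries        # assumed start: weakly not taken

    def _index(self, pc):
        return (pc >> 2) % self.entries      # assumed indexing by address bits

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

def run_loop_branch(iterations, loops):
    """Count mispredictions for a loop-closing branch run several times."""
    predictor = TwoBitPredictor()
    pc = 0x400
    mispredicts = 0
    for _ in range(loops):
        for i in range(iterations):
            taken = i < iterations - 1       # taken until the loop exits
            if predictor.predict(pc) != taken:
                mispredicts += 1
            predictor.update(pc, taken)
    return mispredicts

misses = run_loop_branch(iterations=10, loops=3)
```

Because one wrong guess only moves a counter from "strongly" to "weakly," the loop exit costs a single misprediction each time through rather than two.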

Likewise, Mips doubled the size of the way-prediction table for the
Level 2 (L2) cache; it now holds 16,384 entries. This table allows
the CPU to fetch things more quickly from the cache. Because the
cache is two-way set-associative, the CPU loads two lines of
instructions and data during each fetch, one after the other. The
way-prediction table helps the CPU decide which line to load first.
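
A rough Python sketch shows how a way-prediction table might steer the order of those two probes. The table size comes from the article; everything else here (one prediction bit per cache set, the fill policy on a miss, and all names) is a simplifying assumption rather than a description of the R12000's actual behavior.

```python
WAY_TABLE_ENTRIES = 16384   # from the article

class WayPredictedL2:
    def __init__(self, num_sets=WAY_TABLE_ENTRIES):
        self.num_sets = num_sets
        self.tags = [[None, None] for _ in range(num_sets)]  # two ways per set
        self.predicted_way = [0] * num_sets   # assumed: one bit per set

    def access(self, line_addr):
        """Return ways probed before a hit (1 or 2), or 0 for a cache miss."""
        s = line_addr % self.num_sets
        tag = line_addr // self.num_sets
        first = self.predicted_way[s]
        if self.tags[s][first] == tag:
            return 1                          # predicted way was correct
        other = 1 - first
        if self.tags[s][other] == tag:
            self.predicted_way[s] = other     # remember the way that hit
            return 2
        # Miss: fill the non-predicted way (assumed policy) and predict it next.
        self.tags[s][other] = tag
        self.predicted_way[s] = other
        return 0

l2 = WayPredictedL2(num_sets=4)   # tiny cache so the behavior is easy to follow
trace = [l2.access(a) for a in (0, 0, 4, 4, 0, 0)]
```

In this toy trace, a repeated access to the same line hits on the first probe, while switching between two lines in the same set costs one extra probe before the prediction catches up.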

All these changes should improve performance when running large
programs, especially databases. Mips points out that processors like
the R12000 are typically found in servers and workstations, not in
desktop PCs, so they should be optimized for different tasks. For
instance, the expanded way-prediction table works best with an L2
cache of 4 MB -- eight to 16 times larger than the L2 caches
typically found in desktop PCs.

That's also why the R12000 (like the R10000) supplements the
64-bit-wide system bus with a 128-bit-wide backside bus for the L2
cache. That's twice as wide as the backside bus on a Pentium Pro or a
Pentium II. Also, the signaling required for the backside bus
(address tags, error correction, and so on) travels on separate wires
instead of being multiplexed with the data. The result is higher
throughput. The trade-off is a package with 600 pins -- too many for
the R12000 to be an economical mass-market processor.

One important difference between the backside bus on the R10000 and
that on the R12000 is that the new processor can't drive its bus at
the core frequency. Clock divisors range from 1.5 to 3.5 in 0.5
increments, so the R12000's backside bus cannot run faster than 200
MHz if the core runs at 300 MHz. Intel's 0.25-micron Pentium II (aka
Deschutes) will have the ability to drive its backside bus at core
speeds of 333 MHz or more when it appears in midyear. But the
R12000's bus is twice as wide, so even at 200 MHz it will have more
peak bandwidth (3.2 GBps) than a Deschutes at 333 MHz (2.6 GBps).
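
The bandwidth figures follow from simple arithmetic: bus width in bytes times bus clock. The short Python sketch below reproduces them from the numbers in the text; the function names are made up for illustration, and the 64-bit Deschutes bus width is inferred from the statement that the R12000's 128-bit bus is twice as wide.

```python
def backside_clock_mhz(core_mhz, divisor):
    """Backside bus clock for a given core clock and clock divisor."""
    return core_mhz / divisor

def peak_bandwidth_gbps(bus_bits, bus_mhz):
    """Peak transfer rate in GBps (decimal): bytes per transfer times MHz."""
    return (bus_bits / 8) * bus_mhz / 1000

# Divisors from 1.5 to 3.5 in 0.5 increments, as stated in the article.
divisors = [1.5, 2.0, 2.5, 3.0, 3.5]

# The smallest divisor gives the fastest bus a 300-MHz core can drive.
r12000_bus_mhz = max(backside_clock_mhz(300, d) for d in divisors)

r12000_bw = peak_bandwidth_gbps(128, r12000_bus_mhz)   # 128-bit bus
deschutes_bw = peak_bandwidth_gbps(64, 333)            # 64-bit bus at core speed
```

With these inputs the R12000 works out to 3.2 GBps and the Deschutes to roughly 2.66 GBps, which matches the article's figures after rounding.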


Alive and Kickin' 

Together with an improved die layout and optimized signal paths, all
these tweaks should boost the R12000's performance about 50 percent
beyond the R10000's. Although Mips has successfully tested Unix on
simulations of the R12000, the engineers were still awaiting the
first silicon samples when this article went to press, so actual
benchmarks are not yet available.

Will the R12000 be good enough to fend off Intel for another CPU
generation? With its superior bus bandwidth, wider parallelism, and
stronger emphasis on FP performance, the R12000 should be better
suited for high-end graphics workstations and servers.


Where to Find 

NEC Electronics
Santa Clara, CA
Phone: 408-588-6000
Fax: 408-588-6130
Internet: http://www.nec.com/necel/

Silicon Graphics/Mips Group
Mountain View, CA
Phone: 650-933-3900
Fax: 650-960-0197
Internet: http://www.sgi.com/MIPS/

Toshiba
Irvine, CA
Phone: 714-455-2000
Fax: 714-859-3963
Internet: http://www.toshiba.com/taec/


R12000: What's New 

64-bit core is substantially the same as the R10000's; it's
compatible with the Mips 4 architecture.

300-MHz core frequency; the dedicated L2 cache bus can run
at 200 MHz.

Estimated performance is about 50 percent greater than the R10000's
at 200 MHz.

Handles up to 48 pending instructions, versus 32 on the R10000.

New 32-entry branch target buffer for caching branch addresses.

Branch-prediction table is quadrupled in size (2048 entries).

L2 cache way-prediction table is doubled in size (16,384 entries).

0.25-micron, four-layer-metal CMOS, 600-pin CLGA package,
pin-compatible with the R10000.

6.9 million transistors on a 204-square-millimeter die.

Tape-out in early September 1997; volume production in the first 
half of 1998.