

Origin200/2000 Series

Overview and Architecture

Last Change: 11/Oct/2004

As any technical manager of a major computing facility will know, planning for the future can be a nightmare when it comes to cost. Or at least it used to be. Until now, buying a multiprocessor supercomputer of any kind involved an annoying tradeoff between price/performance and room for future expansion.

As a result, one saw people using expensive rack systems with few CPUs (thus wasting the system's considerable spare capacity), or smaller systems filled with CPUs but without the ability to expand in the future. Of course, those who did have big budgets might have had a big system with many CPUs, but the architecture would always mean that the CPUs had to compete for resources over a shared bus topology of some kind.

What was needed in multiprocessing systems was the ability to scale the infrastructure that supports the processing as well as the number of processors. Origin offered this functionality for the first time. One decides on how many processors one needs in the first instance and purchases a system that offers the required number of processors combined with sufficient memory and I/O bandwidth to properly support them. The latter aspect is very important: many systems offer large numbers of processors, but these older systems do not scale their memory bandwidth capacity as the number of processors scales. As a result, more and more processors means more competition for the fixed bandwidth available. Origin breaks this bottleneck by allowing one to expand the bandwidth infrastructure as desired.

Fundamentally, one can purchase a small Origin system maximally configured and, at a later date, purchase a second system which is connected to the first using high-speed CrayLink cables. This connection is so fast that, to the user/programmer/etc., the combined systems are treated as a single system. This allows one to buy added processing power and memory bandwidth as and when one needs it, as opposed to having to cater for future memory/processor requirements in the first instance.

At the low end, one can use the Origin200, offering 1 or 2 R10K/R12K CPUs and up to 2GB RAM. When performance/memory/storage demands increase, one can purchase a second Origin200 and connect the two systems together via CrayLink, giving a single system comprising 4 CPUs and up to 4GB RAM. I ran a dual-180 O200 for several years in a research dept. - they are very capable machines and can easily support an environment comprising many dozens of users.
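
To a program running on such a CrayLinked pair, the machine simply looks like one computer with four CPUs and the combined memory. As a minimal illustration (not specific to IRIX, where one would normally use the hinv command or sysconf(_SC_NPROC_ONLN)), a small C program using common POSIX/Linux sysconf names would report the combined totals:

    /* Minimal sketch: on a single-system-image machine such as a
     * CrayLinked pair of Origin200s, ordinary system calls see the
     * combined resources.  _SC_NPROCESSORS_ONLN and _SC_PHYS_PAGES
     * are the common POSIX/Linux spellings; names vary by OS. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        long cpus  = sysconf(_SC_NPROCESSORS_ONLN);
        long pages = sysconf(_SC_PHYS_PAGES);
        long psize = sysconf(_SC_PAGE_SIZE);

        printf("online CPUs : %ld\n", cpus);
        printf("physical RAM: %.1f GB\n",
               (double)pages * (double)psize / (1024.0 * 1024.0 * 1024.0));
        return 0;
    }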

In the middle and high end of the range is the deskside Origin2000, offering 1 to 8 CPUs. These can be rack-mounted and expanded to a single-image system of up to 128 CPUs and then, using metarouter racks, up to thousands of CPUs connected as a cluster of multiple 128-CPU systems, e.g. the 6400-CPU Blue Mountain system at Los Alamos.

At this stage, I'll bow to the superior technical prose of SGI's engineers - they have already written a detailed architectural description of Origin; what follows is a local copy of the original document.



Technical Overview of the Origin Family

This guide describes the hardware architecture of the Origin family and its specific implementations:

The entry-level Origin200 system, consisting of a maximum of two towers (up to four processors) that can be linked together.

The deskside/rackmount Origin2000 system, comprising from 1 to 128 processors, housed in deskside and rackmount cabinets.

The Origin family is a revolutionary follow-on to the Challenge-class symmetric multiprocessing (SMP) system. It uses Silicon Graphics' distributed shared-memory multiprocessing architecture, called S2MP.

The development path of Silicon Graphics' multiprocessor systems is shown in Figure 1-1.

[Figure 1-1: Developmental Path of Silicon Graphics Multiprocessing Architectures]

Figure 1-2 is a block diagram of an Origin2000 system showing the central Node board, which can be viewed as a system controller from which all other system components radiate.

[Figure 1-2: Origin2000 Block Diagram]

Figure 1-3 is a block diagram of a system with four Node boards; Nodes 1 and 3 connect to Crossbow 1, and Nodes 2 and 4 connect to Crossbow 2. Crossbow 1 connects to XIO boards 1 through 6 and Crossbow 2 connects to XIO boards 7 through 12.

[Figure 1-3: Block Diagram of a 4-Node System]

The components shown in Figure 1-2 are described further in this chapter, in the section titled "Origin2000 Components."

As illustrated in Figure 1-4, Origin2000 is a number of processing nodes linked together by an interconnection fabric. Each processing node contains either one or two processors, a portion of shared memory, a directory for cache coherence, and two interfaces: one that connects to I/O devices and another that links system nodes through the interconnection fabric.

[Figure 1-4: Nodes in an Origin2000 System]
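
As a rough mental model of that composition, here is a small C sketch; all type and field names are invented for illustration and do not correspond to any SGI software interface:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical stand-in types, defined only so the sketch compiles. */
    struct cpu       { int id; };            /* stand-in for an R10000 processor */
    struct dir_entry { uint64_t sharers; };  /* per-cache-line sharer bits       */
    struct port      { int link_id; };       /* stand-in for a physical link     */

    #define MAX_CPUS_PER_NODE 2

    struct origin_node {
        struct cpu       cpus[MAX_CPUS_PER_NODE]; /* one or two processors (second slot may be empty) */
        size_t           mem_bytes;               /* this node's share of memory, up to 4 GB          */
        struct dir_entry *directory;              /* cache-coherence directory for the local memory   */
        struct port      xio;                     /* Hub interface to I/O devices (XIO)               */
        struct port      craylink;                /* Hub interface to the interconnection fabric      */
    };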

The interconnection fabric links nodes to each other, but it differs from a bus in several important ways. A bus is a resource that can be used by only one processor at a time. The interconnection fabric is a mesh of multiple, simultaneous, dynamically allocated connections (connections are made from processor to processor as they are needed) over which transactions proceed in parallel. This web of connections differs from a bus in the same way that multiple dimensions differ from a single dimension: if a bus is a one-dimensional line, then the interconnection fabric is a multi-dimensional mesh.

[Figure 1-5: Single Datapath Over a Bus]

As shown in Figure 1-5, a bus is a shared, common link that multiple processors must contend for and that only a single processor can use at a time. The interconnection fabric allows many nodes to communicate simultaneously, as shown in Figure 1-6. Paths through the interconnection fabric are constructed as they are needed by Router ASICs, which act as switches.

[Figure 1-6: Multi-dimensional Datapaths Through an Interconnection Fabric]
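
The practical consequence can be shown with a deliberately simplified back-of-envelope model in C: if every processor has a bulk transfer to make, a shared bus serializes them, while a switched fabric lets disjoint pairs proceed in parallel. The transfer size is arbitrary and the per-path bandwidth is borrowed loosely from Table 1-1 purely for scale; this is not a simulation of either machine:

    #include <stdio.h>

    int main(void)
    {
        const double bytes_per_cpu = 64.0 * 1024 * 1024;   /* 64 MB moved by each CPU (arbitrary) */
        const double link_bw       = 780e6;                /* bytes/s per path (illustrative)     */

        for (int cpus = 2; cpus <= 32; cpus *= 2) {
            double bus_time    = cpus * bytes_per_cpu / link_bw; /* shared bus: transfers serialized   */
            double fabric_time = bytes_per_cpu / link_bw;        /* fabric best case: disjoint pairs
                                                                    all transfer in parallel           */
            printf("%2d CPUs: bus %.2f s   fabric %.2f s\n",
                   cpus, bus_time, fabric_time);
        }
        return 0;
    }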

Origin2000 is said to be scalable, because it can range in size from 1 to 128 processors. As you add nodes, you add to and scale the system bandwidth. Origin2000 is also modular, in that it can be increased in size by adding standard modules to the interconnection fabric. The interconnection fabric is implemented on cables outside of these modules.

Origin2000 uses Silicon Graphics' new Scalable Shared-Memory Multiprocessor (S2MP) architecture to distribute shared memory amongst the nodes. This shared memory is accessible to all processors through the interconnection fabric and can be accessed with low latency.

 

Origin2000 Components

An Origin2000 system has the following components:

    processor(s)

    memory

    I/O controllers

    distributed memory controller (Hub ASIC)

    directory memory for cache coherence

    CrayLink Interconnect

    XIO and Crossbow interfaces

These are linked as shown in Figure 1-7, and all of these are described in this section.

[Figure 1-7: Components in an Origin2000 System]

Processor

Origin2000 uses the MIPS R10000, a high-performance 64-bit superscalar processor which supports dynamic scheduling. Important attributes of the R10000 are its large memory address space and its capacity for heavily overlapping memory transactions -- up to twelve per processor in Origin2000.
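
One way software benefits from this overlapping is prefetching: issuing prefetches several iterations ahead keeps multiple independent cache misses in flight at once. The sketch below uses GCC/Clang's __builtin_prefetch purely as a convenient stand-in (SGI's MIPSpro compilers could generate prefetches themselves), and the prefetch distance of 16 is an arbitrary choice:

    #include <stddef.h>

    /* Sum an array while keeping several cache misses outstanding. */
    double sum_with_prefetch(const double *a, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&a[i + 16], 0, 1);  /* read hint, low temporal locality */
            sum += a[i];
        }
        return sum;
    }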

Memory

Each Node board added to Origin2000 provides additional independently accessed memory, and each node is capable of supporting up to 4 GB of memory. Up to 64 nodes can be configured in a system, which implies a maximum memory capacity of 256 GB.

I/O Controllers

Origin2000 supports a number of high-speed I/O interfaces, including Ultra SCSI, Fast Wide SCSI, FibreChannel, 100BASE-TX, ATM, and HIPPI-Serial. Internally, these controllers are added through XIO cards, which have an embedded PCI-32 or PCI-64 bus. Thus, in Origin2000, I/O performance is added one bus at a time.

Hub

This ASIC is the distributed shared-memory controller. It is responsible for providing all processors and I/O devices with transparent, cache-coherent access to all of the distributed memory.

Directory Memory

This supplementary memory is controlled by the Hub. The directory keeps information about the cache status of all memory within its node. This status information is used to provide scalable cache coherence, and to migrate data to a node that accesses it more frequently than the present node.
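
Conceptually, a directory entry records, for each cache line of local memory, which nodes hold a copy and in what state. The C sketch below is a simplified illustration with invented names; the actual Hub directory formats are more compact and more involved:

    #include <stdint.h>

    enum dir_state { DIR_UNOWNED, DIR_SHARED, DIR_EXCLUSIVE };

    /* One entry per cache line of this node's local memory. */
    struct dir_entry {
        uint64_t       presence;  /* bit i set => node i holds a cached copy (up to 64 nodes) */
        enum dir_state state;     /* unowned, shared, or exclusively owned                    */
        uint16_t       owner;     /* owning node when state == DIR_EXCLUSIVE                  */
    };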

CrayLink Interconnect

This is a collection of very high speed links and routers that is responsible for tying together the set of hubs that make up the system. The important attributes of CrayLink Interconnect are its low latency, scalable bandwidth, modularity, and fault tolerance.

XIO and Crossbow

These are the internal I/O interfaces originating in each Hub and terminating on the targeted I/O controller. XIO uses the same physical link technology as CrayLink Interconnect, but uses a protocol optimized for I/O traffic. The Crossbow ASIC is a crossbar routing chip responsible for connecting two nodes to up to six I/O controllers.

 

What Makes Origin2000 Different

The following characteristics make Origin2000 different from previous system architectures:

    Origin2000 is scalable.

    Origin2000 is modular.

    Origin2000 uses an interconnection fabric to link system nodes and internal crossbars within the system ASICs (Hub, Router, Crossbow).

    Origin2000 has distributed shared-memory and distributed shared-I/O.

    Origin2000 shared memory is kept cache coherent using directories and a directory-based cache coherence protocol.

    Origin2000 uses page migration and replication to improve memory latency.

Scalability. Origin2000 is easily scaled by linking nodes together over an interconnection fabric, and system bandwidth scales linearly with an increase in the number of processors and the associated switching fabric. This means Origin2000 can have a low entry cost, since you can build a system upward from an inexpensive Origin200 configuration.

In contrast, traditional bus-based systems are scalable only in the amount of processing and I/O power they contain. The Everest (Challenge-class) interconnect is the E-bus, which has a fixed bandwidth and is the same size from entry-level to high-end.

Modularity. A system is composed of standard processing nodes. Each node contains processor(s), memory, a directory for cache coherence, an I/O interface, and a system interconnection. Node boards are placed in one of the following three types of system modules: entry-level, deskside, or rack.

Traditional bus-based systems are not as modular; there is a fixed number of slots in each deskside or rack system, and this number cannot be changed.

System interconnections. Origin2000 uses an interconnection fabric and crossbars. The interconnection fabric is a web of dynamically-allocated switch-connected links that attach nodes to one another. Crossbars are part of the interconnection fabric, and are located inside several of the ASICs -- the Crossbow, the Router, and the Hub. Crossbars dynamically link ASIC input ports with their output ports.

In traditional bus-based systems, processors access memory and I/O interfaces over a shared system bus that has a fixed size and a fixed bandwidth.

Distributed shared-memory (DSM) and I/O. Origin2000 memory is physically dispersed throughout the system for faster processor access. Page migration hardware moves data into memory closer to a processor that frequently uses it. This page migration scheme reduces memory latency -- the time it takes to retrieve data from memory. Although main memory is distributed, it is universally accessible and shared between all the processors in the system. Similarly, I/O devices are distributed among the nodes, and each device is accessible to every processor in the system.

Traditional bus-based systems have shared memory, but their memory is concentrated, not distributed, and they do not distribute I/O. All I/O accesses, and those memory accesses not satisfied by the cache, incur extra latencies when traversing the bus.

Directory-based cache coherence. Origin2000 uses caches to reduce memory latency. Cache coherence is supported by a hardware directory that is distributed among the nodes along with main memory. Cache coherence is applied across the entire system and all memory. In a snoopy protocol, every cache-line invalidation must be broadcast to all CPUs in the system, whether the CPU has a copy of the cache line or not. In contrast, a directory protocol relies on point-to-point messages that are only sent to those CPUs actually using the cache line. This removes the scalability problems inherent in the snoopy coherence scheme used by bus-based systems such as Challenge. A directory-based protocol is preferable to snooping since it reduces the amount of coherence traffic that must be sent throughout the system.

Traditional bus-based systems use a snoopy coherence protocol.
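
The traffic difference can be made concrete with a small sketch (hypothetical helper names, not SGI code): an invalidation driven by the directory's presence bits is sent only to the nodes whose bit is set, whereas a snoopy bus would broadcast to every CPU.

    #include <stdint.h>
    #include <stdio.h>

    static void send_invalidate(int node)          /* stand-in for a point-to-point fabric message */
    {
        printf("  invalidate -> node %d\n", node);
    }

    /* Walk the presence bits and message only the actual sharers. */
    static int invalidate_sharers(uint64_t presence, int writer_node)
    {
        int sent = 0;
        for (int node = 0; node < 64; node++) {
            if (((presence >> node) & 1) && node != writer_node) {
                send_invalidate(node);
                sent++;
            }
        }
        return sent;
    }

    int main(void)
    {
        /* nodes 3, 17 and 40 hold copies; node 3 is writing */
        uint64_t presence = (1ULL << 3) | (1ULL << 17) | (1ULL << 40);
        int n = invalidate_sharers(presence, 3);
        printf("%d point-to-point invalidations (a snoopy bus would broadcast to all CPUs)\n", n);
        return 0;
    }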

Page migration and replication. To provide better performance by reducing the amount of remote memory traffic, Origin2000 uses a process called page migration. Page migration moves data that is often used by a processor into memory close to that processor.

Traditional bus-based systems do not support page migration.
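
A simplified view of the migration decision, with invented names and an arbitrary threshold: per-page reference counts from the home node and from remote nodes are compared, and a page whose remote references sufficiently exceed its local ones becomes a candidate to move.

    #include <stdbool.h>
    #include <stdint.h>

    struct page_counters {
        uint32_t local_refs;    /* references from the page's home node          */
        uint32_t remote_refs;   /* references from the busiest remote node       */
        int      remote_node;   /* which remote node that is                     */
    };

    /* Returns true if the page should migrate, and says where. */
    bool should_migrate(const struct page_counters *pc,
                        uint32_t threshold, int *dest_node)
    {
        if (pc->remote_refs > pc->local_refs + threshold) {
            *dest_node = pc->remote_node;
            return true;
        }
        return false;
    }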

Scalability and Modularity

Origin2000 scalability and modularity allow one to start with a small system and incrementally add modules to make the system as large as needed. An entry-level module can hold from one to four MIPS R10000 processors, and an Origin2000 deskside module can hold from one to eight R10000 processors. A series of these deskside modules can be mounted in racks, scaling the system up to a maximum configuration of 128 processors and 256 GB of memory.

As one adds nodes to the interconnection fabric, bandwidth and performance scale linearly without significantly impacting system latencies. This is a result of the following design decisions:

  • replacing the fixed-size, fixed-bandwidth bus of traditional bus-based systems with the scalable interconnection fabric, whose bisection bandwidth (the bandwidth through the center of CrayLink Interconnect) scales linearly with the number of nodes in the system (a worked example follows this list).
  • reducing system latencies by replacing the centrally-located main memory of traditional bus based systems with the tightly-integrated but distributed shared-memory S2MP architecture of Origin2000.
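
As a quick sanity check of the linear-scaling claim, assuming the hypercube configurations and the per-link figures implied by Table 1-1: a hypercube of 2^n nodes has 2^(n-1) links crossing its midpoint, i.e. one link per pair of nodes, so doubling the node count doubles the number of links across the cut. At roughly 625 MB per second sustained per link in each direction, a 16-node (32-CPU) system has 8 links across the cut, giving about 5 GB per second, which matches the 32-CPU row of Table 1-2; 32 nodes (64 CPUs) gives 16 links and about 10 GB per second, and so on.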

 

System Interconnections

Origin2000 replaces the shared, fixed-bandwidth bus of traditional bus-based systems with the following:

  • a scalable interconnection fabric, in which processing nodes are linked by a set of routers
  • crossbar switches, which implement the interconnection fabric. Crossbars are located in the following places:

    • within the Crossbow ASIC, connecting the I/O interfaces to the nodes

    • within the Router ASIC, forming the interconnection fabric itself

    • within the Hub ASIC, which interconnects the processors, memory, I/O, and interconnection fabric interfaces within each node

These internal crossbars maximize the throughput of the major system components and allow many operations to proceed concurrently.

Interconnection Fabric

Origin2000 nodes are connected by an interconnection fabric. The interconnection fabric is a set of switches, called routers, that are linked by cables in various configurations, or topologies. The interconnection fabric differs from a standard bus in the following important ways:

  • The interconnection fabric is a mesh of multiple point-to-point links connected by the routing switches. These links and switches allow multiple transactions to occur simultaneously.

  • The links permit extremely fast switching. Each bidirectional link sustains as much bandwidth as the entire Challenge bus.

  • The interconnection fabric does not require arbitration, nor is it as limited by contention; a bus, by contrast, must be contended for through arbitration.

  • More routers and links are added as nodes are added, increasing the interconnection fabric's bandwidth. A shared bus has a fixed bandwidth that is not scalable.

  • The topology of the CrayLink Interconnect is such that the bisection bandwidth grows linearly with the number of nodes in the system.

The interconnection fabric provides a minimum of two separate paths to every pair of Origin2000 nodes. This redundancy allows the system to bypass failing routers or broken interconnection fabric links. Each fabric link is additionally protected by a CRC code and a link-level protocol, which retry any corrupted transmissions and provide fault tolerance for transient errors.
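
As a generic illustration of a link-level retry (this is not the actual CrayLink protocol or CRC polynomial; a standard CRC-16-CCITT is used only for the example): the sender appends a checksum, the receiver recomputes it, and a mismatch triggers a retransmission.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Standard bitwise CRC-16-CCITT, used purely for illustration. */
    static uint16_t crc16_ccitt(const uint8_t *data, size_t len)
    {
        uint16_t crc = 0xFFFF;
        for (size_t i = 0; i < len; i++) {
            crc ^= (uint16_t)data[i] << 8;
            for (int bit = 0; bit < 8; bit++)
                crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                     : (uint16_t)(crc << 1);
        }
        return crc;
    }

    /* Pretend to send a packet; 'corrupt' simulates a transient bit error. */
    static bool deliver(const uint8_t *payload, size_t len, uint16_t crc, bool corrupt)
    {
        uint8_t copy[64];
        memcpy(copy, payload, len);
        if (corrupt)
            copy[0] ^= 0x01;
        return crc16_ccitt(copy, len) == crc;   /* accept only if the CRC matches */
    }

    int main(void)
    {
        const uint8_t pkt[] = "hello, fabric";
        uint16_t crc = crc16_ccitt(pkt, sizeof pkt);

        for (int attempt = 1; attempt <= 3; attempt++) {
            /* the first attempt is corrupted in transit; the retry succeeds */
            if (deliver(pkt, sizeof pkt, crc, attempt == 1)) {
                printf("delivered on attempt %d\n", attempt);
                return 0;
            }
            printf("CRC mismatch on attempt %d, retrying\n", attempt);
        }
        return 1;
    }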

Earlier in this chapter, Figure 1-5 and Figure 1-6 showed how an interconnection fabric differs from an ordinary shared bus. Figure 1-8 amplifies this difference by illustrating an 8-node hypercube with its multiple datapaths. R1 can communicate with R0, R2 with R3, R4 with R6, and R5 with R7 simultaneously, without any of these transfers interfering with the others.

[Figure 1-8: Datapaths in an Interconnection Fabric]
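
Routing on such a hypercube is often explained with the classic dimension-ordered (e-cube) scheme: flip the differing address bits one dimension at a time, so the hop count equals the number of bits in which source and destination differ. The sketch below illustrates that idea only; it is not a description of the Router ASIC's actual, table-driven routing.

    #include <stdio.h>

    int main(void)
    {
        unsigned src = 1, dst = 6;   /* route from R1 to R6 in an 8-node (3-bit) cube */
        unsigned node = src;

        printf("route: R%u", node);
        for (unsigned dim = 0; dim < 3; dim++) {
            if ((node ^ dst) & (1u << dim)) {   /* this dimension's bit still differs */
                node ^= 1u << dim;              /* traverse one link                  */
                printf(" -> R%u", node);
            }
        }
        printf("\n");                           /* prints: route: R1 -> R0 -> R2 -> R6 */
        return 0;
    }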

 

Crossbar

Several of the ASICs (Hub, Router, and Crossbow) use a crossbar for linking on-chip inputs with on-chip output interfaces. For instance, an 8-way crossbar is used on the Crossbow ASIC; this crossbar creates direct point-to-point links between one or more nodes and multiple I/O devices. The crossbar switch also allows peer-to-peer communication, in which one I/O device can speak directly to another I/O device.

The Router ASIC uses a similar 6-way crossbar to link its six ports with the interconnection fabric, and the Hub ASIC links its four interfaces with a crossbar. A logical diagram of a 4-way (also referred to as four-by-four, or 4 x 4) crossbar is given in Figure 1-9; note that each output is determined by multiplexing the four inputs.

[Figure 1-9: Logical Illustration of a Four-by-Four (4 x 4) Crossbar]

Figure 1-10 shows a 6-way crossbar at work. In this example, the crossbar connects six ports, and each port has an input (I) and an output (O) buffer for flow control. Since there must be an output for every input, the six ports can be connected as six independent, parallel paths. The crossbar connections are shown at two clock intervals: Time=n, and Time=n+1.

[Figure 1-10: Crossbar Operation]

At clock T=n, the ports independently make the following parallel connections:

  • from port 1 to port 5

  • from port 2 to port 6

  • from port 3 to port 4

  • from port 4 to port 2

  • from port 5 to port 3

  • from port 6 to port 1

Figure 1-10 shows the source (Input) and target (Output) for each connection, and arrows indicate the direction of flow. When a connection is active, its source and target are not available for any other connection over the crossbar.

At the next clock, T=n+1, the ports independently reconfigure themselves into six new data links: 1-to-5, 2-to-4, 3-to-6, 4-to-1, 5-to-3 and 6-to-2. At clock intervals, the ports continue making new connections as needed. Connection decisions are based on algorithms that take into account flow control, routing, and arbitration.
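
A toy way to model those per-clock settings is a simple connection map, one output per input, with a check that no output is claimed twice. The port numbers below are the ones from the example above; this is only an illustration of the idea, not of the ASIC's arbitration logic:

    #include <stdbool.h>
    #include <stdio.h>

    #define PORTS 6

    /* conn[i] = j means "input of port i+1 drives output of port j+1". */
    static bool is_valid(const int conn[PORTS])
    {
        bool used[PORTS] = { false };
        for (int i = 0; i < PORTS; i++) {
            if (used[conn[i]])
                return false;       /* two inputs claiming the same output */
            used[conn[i]] = true;
        }
        return true;                /* a permutation: six parallel paths   */
    }

    int main(void)
    {
        int t_n [PORTS] = { 4, 5, 3, 1, 2, 0 };  /* T=n:   1->5, 2->6, 3->4, 4->2, 5->3, 6->1 */
        int t_n1[PORTS] = { 4, 3, 5, 0, 2, 1 };  /* T=n+1: 1->5, 2->4, 3->6, 4->1, 5->3, 6->2 */

        printf("T=n   valid: %s\n", is_valid(t_n)  ? "yes" : "no");
        printf("T=n+1 valid: %s\n", is_valid(t_n1) ? "yes" : "no");
        return 0;
    }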

 

Distributed Shared Address Space (Memory and I/O)

Origin2000 memory is located in a single shared address space. Memory within this space is distributed amongst all the processors, and is accessible over the interconnection fabric. This differs from a traditional bus-based system, in which memory is centrally located on and only accessible over a single shared bus. By distributing Origin2000's memory among the processors, memory latency is reduced: accessing memory near a processor takes less time than accessing remote memory. Although physically distributed, main memory is available to all processors.

I/O devices are also distributed within a shared address space; every I/O device is universally accessible throughout the system.

Origin2000 Memory Hierarchy

Memory in Origin2000 is organized into the following hierarchy:

[Figure: Memory Hierarchy, Based on Relative Latencies and Data Capacities]

Caches are used to reduce the amount of time it takes to access memory -- also known as a memory's latency -- by moving faster memory physically close to, or even onto, the processor. This faster memory is generally some version of static RAM, or SRAM.

The DSM structure of Origin2000 also creates the notion of local memory. This memory is close to the processor and has reduced latency compared to bus-based systems, where all memory must be accessed through a shared bus.

While data only exists in either local or remote memory, copies of the data can exist in various processor caches. Keeping these copies consistent is the responsibility of the logic of the various hubs. This logic is collectively referred to as a cache-coherence protocol.
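
The usual way to reason about such a hierarchy is an average-access-time estimate: time at the first level plus, for each miss, the time at the next level, with main-memory time itself a mix of local and remote latencies. The numbers in the sketch below are invented placeholders, not measured Origin2000 figures:

    #include <stdio.h>

    int main(void)
    {
        double l1_hit = 0.95, l2_hit = 0.95;                        /* hit rates (assumed)   */
        double t_l1 = 5, t_l2 = 50, t_local = 500, t_remote = 800;  /* latencies in ns (assumed) */
        double local_frac = 0.8;                 /* fraction of memory accesses served locally */

        /* memory time is a weighted mix of local and remote latency */
        double t_mem = local_frac * t_local + (1 - local_frac) * t_remote;

        /* average access time = L1 time + L1 misses * (L2 time + L2 misses * memory time) */
        double amat = t_l1 + (1 - l1_hit) * (t_l2 + (1 - l2_hit) * t_mem);

        printf("average access time: %.1f ns\n", amat);
        return 0;
    }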

 

System Bandwidth

Three types of bandwidth are quoted for Origin2000: peak bandwidth, sustained bandwidth, and bisection bandwidth.

Table 1-1 gives a comparison between peak and sustained data bandwidths of Origin2000.

Table 1-1: Peak and Sustained Bandwidths
_____________________________________________________________
Interface            Sustained Bandwidth      [Peak Bandwidth]
_____________________________________________________________
Memory               780 MB per second        [780 MB]
Node Card            1.25 GB per second       [1.56 GB]
Crossbow             2.5 GB per second        [3.12 GB]
Module (deskside)    5.0 GB per second        [6.24 GB]
Rack                 80 GB per second         [100 GB]
_____________________________________________________________

Table 1-2 lists the bisection bandwidths of various Origin2000 configurations both with and without Express Links.

Table 1-2: System Bisection Bandwidths
____________________________________________________________________________________________
System Size        Sustained Bisection Bandwidth       Sustained Bisection Bandwidth
(number of CPUs)   without Express Links [Peak]        with Express Links [Peak]
____________________________________________________________________________________________
8                  1.25 GB per second [1.56 GB]        2.5 GB per second [3.12 GB]*
16                 2.5 GB per second [3.12 GB]         5.0 GB per second [6.24 GB]
32                 5.0 GB per second [6.24 GB]         10 GB per second [12.5 GB]
64                 10 GB per second [12.5 GB]          N/A
128                20 GB per second [25.0 GB]          N/A
____________________________________________________________________________________________
* With Star Router

Table 1-3 lists the bandwidths of Ultra SCSI and FibreChannel devices.

Table 1-3: Peripheral Bandwidths
_______________________________________________
Peripheral         Bandwidth
_______________________________________________
Ultra SCSI         40 MB per second
FibreChannel       100 MB per second
_______________________________________________

