

Origin200/2000 Series

Overview and Architecture

Last Change: 11/Oct/2004

As any technical manager of a major computing facility will know, planning for the future can be a nightmare when it comes to cost. Or at least it used to be. Until now, buying a multiprocessor supercomputer of any kind involved an annoying tradeoff between price/performance and room for future expansion.

As a result, one saw people using expensive rack systems with few CPUs (thus wasting the system's considerable spare capacity), or smaller systems filled with CPUs but without the ability to expand in the future. Of course, those who did have big budgets might have had a big system with many CPUs, but the architecture would always mean that the CPUs had to compete for resources over a shared bus topology of some kind.

What was needed in multiprocessing systems was the ability to scale the infrastructure that supports the processing as well as the number of processors. Origin offered this functionality for the first time. One decides on how many processors one needs in the first instance and purchases a system that offers the required number of processors combined with sufficient memory and I/O bandwidth to properly support them. The latter aspect is very important: many systems offer large numbers of processors, but these older systems do not scale their memory bandwidth capacity as the number of processors scales. As a result, more and more processors means more competition for the fixed bandwidth available. Origin breaks this bottleneck by allowing one to expand the bandwidth infrastructure as desired.

Fundamentally, one can purchase a small Origin system maximally configured and, at a later date, purchase a second system which is connected to the first using high-speed CrayLink cables. This connection is so fast that, to the user/programmer/etc., the combined systems are treated as a single system. This allows one to buy added processing power and memory bandwidth as and when one needs it, as opposed to having to cater for future memory/processor requirements in the first instance.

At the low end, one can use the Origin200, offering 1 or 2 R10K/R12K CPUs and up to 2GB RAM. When performance/memory/storage demands increase, one can purchase a second Origin200 and connect the two systems together via CrayLink, giving a single system comprising 4 CPUs and up to 4GB RAM. I ran a dual-180 O200 for several years in a research dept. - they are very capable machines and can easily support an environment comprising many dozens of users.
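
To a program running on such a CrayLinked pair, the machine simply looks like one computer with four CPUs and the combined memory. As a minimal illustration (not specific to IRIX, where one would normally use the hinv command or sysconf(_SC_NPROC_ONLN)), a small C program using common POSIX/Linux sysconf names would report the combined totals:

    /* Minimal sketch: on a single-system-image machine such as a
     * CrayLinked pair of Origin200s, ordinary system calls see the
     * combined resources.  _SC_NPROCESSORS_ONLN and _SC_PHYS_PAGES
     * are the common POSIX/Linux spellings; names vary by OS. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        long cpus  = sysconf(_SC_NPROCESSORS_ONLN);
        long pages = sysconf(_SC_PHYS_PAGES);
        long psize = sysconf(_SC_PAGE_SIZE);

        printf("online CPUs : %ld\n", cpus);
        printf("physical RAM: %.1f GB\n",
               (double)pages * (double)psize / (1024.0 * 1024.0 * 1024.0));
        return 0;
    }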

In the middle and high end of the range is the deskside Origin2000, offering 1 to 8 CPUs. These can be rack-mounted and expanded to a single-image system of up to 128 CPUs and then, using metarouter racks, up to thousands of CPUs connected as a cluster of multiple 128-CPU systems, e.g. the 6400-CPU Blue Mountain system at Los Alamos.

At this stage, I'll bow to the superior technical prose of SGI's engineers - they have already written a detailed architectural description of Origin; what follows is a local copy of the original document.



Technical Overview of the Origin Family

This guide describes the hardware architecture of the Origin family and its specific implementations:

The entry-level Origin200 system, consisting of a maximum of two towers (up to four processors) that can be linked together.

The deskside/rackmount Origin2000 system, comprising from 1 to 128 processors, housed in deskside and rackmount cabinets.

The Origin family is a revolutionary follow-on to the Challenge-class symmetric multiprocessing (SMP) system. It uses Silicon Graphics' distributed shared-memory multiprocessing architecture, called S2MP.

The development path of Silicon Graphics' multiprocessor systems is shown in Figure 1-1.

[Figure 1-1: Developmental Path of Silicon Graphics Multiprocessing Architectures]

Figure 1-2 is a block diagram of an Origin2000 system showing the central Node board, which can be viewed as a system controller from which all other system components radiate.

[Figure 1-2: Origin2000 Block Diagram]

Figure 1-3 is a block diagram of a system with four Node boards; Nodes 1 and 3 connect to Crossbow 1, and Nodes 2 and 4 connect to Crossbow 2. Crossbow 1 connects to XIO boards 1 through 6 and Crossbow 2 connects to XIO boards 7 through 12.

[Figure 1-3: Block Diagram of a 4-Node System]

The components shown in Figure 1-2 are described further in this chapter, in the section titled "Origin2000 Components."

As illustrated in Figure 1-4, Origin2000 is a number of processing nodes linked together by an interconnection fabric. Each processing node contains either one or two processors, a portion of shared memory, a directory for cache coherence, and two interfaces: one that connects to I/O devices and another that links system nodes through the interconnection fabric.

[Figure 1-4: Nodes in an Origin2000 System]
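
As a rough mental model of that composition, here is a small C sketch; all type and field names are invented for illustration and do not correspond to any SGI software interface:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical stand-in types, defined only so the sketch compiles. */
    struct cpu       { int id; };            /* stand-in for an R10000 processor */
    struct dir_entry { uint64_t sharers; };  /* per-cache-line sharer bits       */
    struct port      { int link_id; };       /* stand-in for a physical link     */

    #define MAX_CPUS_PER_NODE 2

    struct origin_node {
        struct cpu       cpus[MAX_CPUS_PER_NODE]; /* one or two processors (second slot may be empty) */
        size_t           mem_bytes;               /* this node's share of memory, up to 4 GB          */
        struct dir_entry *directory;              /* cache-coherence directory for the local memory   */
        struct port      xio;                     /* Hub interface to I/O devices (XIO)               */
        struct port      craylink;                /* Hub interface to the interconnection fabric      */
    };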

The interconnection fabric links nodes to each other, but it differs from a bus in several important ways. A bus is a resource that can be used by only one processor at a time. The interconnection fabric is a mesh of multiple, simultaneous, dynamically allocated connections (connections are made from processor to processor as they are needed) over which transactions proceed in parallel. This web of connections differs from a bus in the same way that multiple dimensions differ from a single dimension: if a bus is a one-dimensional line, then the interconnection fabric is a multi-dimensional mesh.

[Figure 1-5: Single Datapath Over a Bus]

As shown in Figure 1-5, a bus is a shared, common link that multiple processors must contend for and that only a single processor can use at a time. The interconnection fabric allows many nodes to communicate simultaneously, as shown in Figure 1-6. Paths through the interconnection fabric are constructed as they are needed by Router ASICs, which act as switches.

[Figure 1-6: Multi-dimensional Datapaths Through an Interconnection Fabric]
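
The practical consequence can be shown with a deliberately simplified back-of-envelope model in C: if every processor has a bulk transfer to make, a shared bus serializes them, while a switched fabric lets disjoint pairs proceed in parallel. The transfer size is arbitrary and the per-path bandwidth is borrowed loosely from Table 1-1 purely for scale; this is not a simulation of either machine:

    #include <stdio.h>

    int main(void)
    {
        const double bytes_per_cpu = 64.0 * 1024 * 1024;   /* 64 MB moved by each CPU (arbitrary) */
        const double link_bw       = 780e6;                /* bytes/s per path (illustrative)     */

        for (int cpus = 2; cpus <= 32; cpus *= 2) {
            double bus_time    = cpus * bytes_per_cpu / link_bw; /* shared bus: transfers serialized   */
            double fabric_time = bytes_per_cpu / link_bw;        /* fabric best case: disjoint pairs
                                                                    all transfer in parallel           */
            printf("%2d CPUs: bus %.2f s   fabric %.2f s\n",
                   cpus, bus_time, fabric_time);
        }
        return 0;
    }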

Origin2000 is said to be scalable, because it can range in size from 1 to 128 processors. As you add nodes, you add to and scale the system bandwidth. Origin2000 is also modular, in that it can be increased in size by adding standard modules to the interconnection fabric. The interconnection fabric is implemented on cables outside of these modules.

Origin2000 uses Silicon Graphics' new Scalable Shared-Memory Multiprocessor (S2MP) architecture to distribute shared memory amongst the nodes. This shared memory is accessible to all processors through the interconnection fabric and can be accessed with low latency.

 

Origin2000 Components

An Origin2000 system has the following components:

    processor(s)

    memory

    I/O controllers

    distributed memory controller (Hub ASIC)

    directory memory for cache coherence

    CrayLink Interconnect

    XIO and Crossbow interfaces

These are linked as shown in Figure 1-7, and all of these are described in this section.

[Figure 1-7: Components in an Origin2000 System]

Processor

Origin2000 uses the MIPS R10000, a high-performance 64-bit superscalar processor which supports dynamic scheduling. Important attributes of the R10000 are its large memory address space and its capacity for heavily overlapping memory transactions -- up to twelve per processor in Origin2000.
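
One way software benefits from this overlapping is prefetching: issuing prefetches several iterations ahead keeps multiple independent cache misses in flight at once. The sketch below uses GCC/Clang's __builtin_prefetch purely as a convenient stand-in (SGI's MIPSpro compilers could generate prefetches themselves), and the prefetch distance of 16 is an arbitrary choice:

    #include <stddef.h>

    /* Sum an array while keeping several cache misses outstanding. */
    double sum_with_prefetch(const double *a, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&a[i + 16], 0, 1);  /* read hint, low temporal locality */
            sum += a[i];
        }
        return sum;
    }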

Memory

Each Node board added to Origin2000 provides additional independently accessed memory, and each node is capable of supporting up to 4 GB of memory. Up to 64 nodes can be configured in a system, which implies a maximum memory capacity of 256 GB.

I/O Controllers

Origin2000 supports a number of high-speed I/O interfaces, including Ultra SCSI, Fast Wide SCSI, FibreChannel, 100BASE-TX, ATM, and HIPPI-Serial. Internally, these controllers are added through XIO cards, which have an embedded PCI-32 or PCI-64 bus. Thus, in Origin2000, I/O performance is added one bus at a time.

Hub

This ASIC is the distributed shared-memory controller. It is responsible for providing all processors and I/O devices with transparent, cache-coherent access to all of the distributed memory.

Directory Memory

This supplementary memory is controlled by the Hub. The directory keeps information about the cache status of all memory within its node. This status information is used to provide scalable cache coherence, and to migrate data to a node that accesses it more frequently than the present node.
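
Conceptually, a directory entry records, for each cache line of local memory, which nodes hold a copy and in what state. The C sketch below is a simplified illustration with invented names; the actual Hub directory formats are more compact and more involved:

    #include <stdint.h>

    enum dir_state { DIR_UNOWNED, DIR_SHARED, DIR_EXCLUSIVE };

    /* One entry per cache line of this node's local memory. */
    struct dir_entry {
        uint64_t       presence;  /* bit i set => node i holds a cached copy (up to 64 nodes) */
        enum dir_state state;     /* unowned, shared, or exclusively owned                    */
        uint16_t       owner;     /* owning node when state == DIR_EXCLUSIVE                  */
    };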

CrayLink Interconnect

This is a collection of very high speed links and routers that is responsible for tying together the set of hubs that make up the system. The important attributes of CrayLink Interconnect are its low latency, scalable bandwidth, modularity, and fault tolerance.

XIO and Crossbow

These are the internal I/O interfaces originating in each Hub and terminating on the targeted I/O controller. XIO uses the same physical link technology as CrayLink Interconnect, but uses a protocol optimized for I/O traffic. The Crossbow ASIC is a crossbar routing chip responsible for connecting two nodes to up to six I/O controllers.

 

What Makes Origin2000 Different

The following characteristics make Origin2000 different from previous system architectures:

    Origin2000 is scalable.

    Origin2000 is modular.

    Origin2000 uses an interconnection fabric to link system nodes and internal crossbars within the system ASICs (Hub, Router, Crossbow).

    Origin2000 has distributed shared-memory and distributed shared-I/O.

    Origin2000 shared memory is kept cache coherent using directories and a directory-based cache coherence protocol.

    Origin2000 uses page migration and replication to improve memory latency.

Scalability. Origin2000 is easily scaled by linking nodes together over an interconnection fabric, and system bandwidth scales linearly with an increase in the number of processors and the associated switching fabric. This means Origin2000 can have a low entry cost, since you can build a system upward from an inexpensive Origin200 configuration.

In contrast, traditional bus-based systems are scalable only in the amount of processing and I/O power they contain. The Everest (Challenge-class) interconnect is the E-bus, which has a fixed bandwidth and is the same size from entry-level to high-end.

Modularity. A system is composed of standard processing nodes. Each node contains processor(s), memory, a directory for cache coherence, an I/O interface, and a system interconnection. Node boards are placed in one of the following three types of system modules: entry-level, deskside, or rack.

Traditional bus-based systems are not as modular; there is a fixed number of slots in each deskside or rack system, and this number cannot be changed.

System interconnections. Origin2000 uses an interconnection fabric and crossbars. The interconnection fabric is a web of dynamically-allocated switch-connected links that attach nodes to one another. Crossbars are part of the interconnection fabric, and are located inside several of the ASICs -- the Crossbow, the Router, and the Hub. Crossbars dynamically link ASIC input ports with their output ports.

In traditional bus-based systems, processors access memory and I/O interfaces over a shared system bus that has a fixed size and a fixed bandwidth.

Distributed shared-memory (DSM) and I/O. Origin2000 memory is physically dispersed throughout the system for faster processor access. Page migration hardware moves data into memory closer to a processor that frequently uses it. This page migration scheme reduces memory latency -- the time it takes to retrieve data from memory. Although main memory is distributed, it is universally accessible and shared between all the processors in the system. Similarly, I/O devices are distributed among the nodes, and each device is accessible to every processor in the system.

Traditional bus-based systems have shared memory, but their memory is concentrated, not distributed, and they do not distribute I/O. All I/O accesses, and those memory accesses not satisfied by the cache, incur extra latencies when traversing the bus.

Directory-based cache coherence. Origin2000 uses caches to reduce memory latency. Cache coherence is supported by a hardware directory that is distributed among the nodes along with main memory. Cache coherence is applied across the entire system and all memory. In a snoopy protocol, every cache-line invalidation must be broadcast to all CPUs in the system, whether the CPU has a copy of the cache line or not. In contrast, a directory protocol relies on point-to-point messages that are only sent to those CPUs actually using the cache line. This removes the scalability problems inherent in the snoopy coherence scheme used by bus-based systems such as Challenge. A directory-based protocol is preferable to snooping since it reduces the amount of coherence traffic that must be sent throughout the system.

Traditional bus-based systems use a snoopy coherence protocol.
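
The traffic difference can be made concrete with a small sketch (hypothetical helper names, not SGI code): an invalidation driven by the directory's presence bits is sent only to the nodes whose bit is set, whereas a snoopy bus would broadcast to every CPU.

    #include <stdint.h>
    #include <stdio.h>

    static void send_invalidate(int node)          /* stand-in for a point-to-point fabric message */
    {
        printf("  invalidate -> node %d\n", node);
    }

    /* Walk the presence bits and message only the actual sharers. */
    static int invalidate_sharers(uint64_t presence, int writer_node)
    {
        int sent = 0;
        for (int node = 0; node < 64; node++) {
            if (((presence >> node) & 1) && node != writer_node) {
                send_invalidate(node);
                sent++;
            }
        }
        return sent;
    }

    int main(void)
    {
        /* nodes 3, 17 and 40 hold copies; node 3 is writing */
        uint64_t presence = (1ULL << 3) | (1ULL << 17) | (1ULL << 40);
        int n = invalidate_sharers(presence, 3);
        printf("%d point-to-point invalidations (a snoopy bus would broadcast to all CPUs)\n", n);
        return 0;
    }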

Page migration and replication. To provide better performance by reducing the amount of remote memory traffic, Origin2000 uses a process called page migration. Page migration moves data that is often used by a processor into memory close to that processor.

Traditional bus-based systems do not support page migration.
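
A simplified view of the migration decision, with invented names and an arbitrary threshold: per-page reference counts from the home node and from remote nodes are compared, and a page whose remote references sufficiently exceed its local ones becomes a candidate to move.

    #include <stdbool.h>
    #include <stdint.h>

    struct page_counters {
        uint32_t local_refs;    /* references from the page's home node          */
        uint32_t remote_refs;   /* references from the busiest remote node       */
        int      remote_node;   /* which remote node that is                     */
    };

    /* Returns true if the page should migrate, and says where. */
    bool should_migrate(const struct page_counters *pc,
                        uint32_t threshold, int *dest_node)
    {
        if (pc->remote_refs > pc->local_refs + threshold) {
            *dest_node = pc->remote_node;
            return true;
        }
        return false;
    }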

Scalability and Modularity

Origin2000 scalability and modularity allow one to start with a small system and incrementally add modules to make the system as large as needed. An entry-level module can hold from one to four MIPS R10000 processors, and an Origin2000 deskside module can hold from one to eight R10000 processors. A series of these deskside modules can be mounted in racks, scaling the system up to a maximum configuration of 128 processors and 256 GB of memory.

As one adds nodes to the interconnection fabric, bandwidth and performance scale linearly without significantly impacting system latencies. This is a result of the following design decisions:

  • replacing the fixed-size, fixed-bandwidth bus of traditional bus-based systems with the scalable interconnection fabric, whose bisection bandwidth (the bandwidth through the center of CrayLink Interconnect) scales linearly with the number of nodes in the system (a worked example follows this list).
  • reducing system latencies by replacing the centrally-located main memory of traditional bus based systems with the tightly-integrated but distributed shared-memory S2MP architecture of Origin2000.
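
As a quick sanity check of the linear-scaling claim, assuming the hypercube configurations and the per-link figures implied by Table 1-1: a hypercube of 2^n nodes has 2^(n-1) links crossing its midpoint, i.e. one link per pair of nodes, so doubling the node count doubles the number of links across the cut. At roughly 625 MB per second sustained per link in each direction, a 16-node (32-CPU) system has 8 links across the cut, giving about 5 GB per second, which matches the 32-CPU row of Table 1-2; 32 nodes (64 CPUs) gives 16 links and about 10 GB per second, and so on.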

 

System Interconnections

Origin2000 replaces the shared, fixed-bandwidth bus of traditional bus-based systems with the following:

  • a scalable interconnection fabric, in which processing nodes are linked by a set of routers
  • crossbar switches, which implement the interconnection fabric. Crossbars are located in the following places:

    • within the Crossbow ASIC, connecting the I/O interfaces to the nodes

    • within the Router ASIC, forming the interconnection fabric itself

    • within the Hub ASIC, which interconnects the processors, memory, I/O, and interconnection fabric interfaces within each node

These internal crossbars maximize the throughput of the major system components and allow many operations to proceed concurrently.

Interconnection Fabric

Origin2000 nodes are connected by an interconnection fabric. The interconnection fabric is a set of switches, called routers, that are linked by cables in various configurations, or topologies. The interconnection fabric differs from a standard bus in the following important ways:

  • The interconnection fabric is a mesh of multiple point-to-point links connected by the routing switches. These links and switches allow multiple transactions to occur simultaneously.

  • The links permit extremely fast switching. Each bidirectional link sustains as much bandwidth as the entire Challenge bus.

  • The interconnection fabric does not require arbitration, nor is it as limited by contention; a bus, by contrast, must be contended for through arbitration.

  • More routers and links are added as nodes are added, increasing the interconnection fabric's bandwidth. A shared bus has a fixed bandwidth that is not scalable.

  • The topology of the CrayLink Interconnect is such that the bisection bandwidth grows linearly with the number of nodes in the system.

The interconnection fabric provides a minimum of two separate paths to every pair of Origin2000 nodes. This redundancy allows the system to bypass failing routers or broken interconnection fabric links. Each fabric link is additionally protected by a CRC code and a link-level protocol, which retry any corrupted transmissions and provide fault tolerance for transient errors.
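
As a generic illustration of a link-level retry (this is not the actual CrayLink protocol or CRC polynomial; a standard CRC-16-CCITT is used only for the example): the sender appends a checksum, the receiver recomputes it, and a mismatch triggers a retransmission.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Standard bitwise CRC-16-CCITT, used purely for illustration. */
    static uint16_t crc16_ccitt(const uint8_t *data, size_t len)
    {
        uint16_t crc = 0xFFFF;
        for (size_t i = 0; i < len; i++) {
            crc ^= (uint16_t)data[i] << 8;
            for (int bit = 0; bit < 8; bit++)
                crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                     : (uint16_t)(crc << 1);
        }
        return crc;
    }

    /* Pretend to send a packet; 'corrupt' simulates a transient bit error. */
    static bool deliver(const uint8_t *payload, size_t len, uint16_t crc, bool corrupt)
    {
        uint8_t copy[64];
        memcpy(copy, payload, len);
        if (corrupt)
            copy[0] ^= 0x01;
        return crc16_ccitt(copy, len) == crc;   /* accept only if the CRC matches */
    }

    int main(void)
    {
        const uint8_t pkt[] = "hello, fabric";
        uint16_t crc = crc16_ccitt(pkt, sizeof pkt);

        for (int attempt = 1; attempt <= 3; attempt++) {
            /* the first attempt is corrupted in transit; the retry succeeds */
            if (deliver(pkt, sizeof pkt, crc, attempt == 1)) {
                printf("delivered on attempt %d\n", attempt);
                return 0;
            }
            printf("CRC mismatch on attempt %d, retrying\n", attempt);
        }
        return 1;
    }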

Earlier in this chapter, Figure 1-5 and Figure 1-6 showed how an interconnection fabric differs from an ordinary shared bus. Figure 1-8 amplifies this difference by illustrating an 8-node hypercube with its multiple datapaths. R1 can communicate with R0, R2 with R3, R4 with R6, and R5 with R7 simultaneously, without any of these transfers interfering with the others.

[Figure 1-8: Datapaths in an Interconnection Fabric]
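
Routing on such a hypercube is often explained with the classic dimension-ordered (e-cube) scheme: flip the differing address bits one dimension at a time, so the hop count equals the number of bits in which source and destination differ. The sketch below illustrates that idea only; it is not a description of the Router ASIC's actual, table-driven routing.

    #include <stdio.h>

    int main(void)
    {
        unsigned src = 1, dst = 6;   /* route from R1 to R6 in an 8-node (3-bit) cube */
        unsigned node = src;

        printf("route: R%u", node);
        for (unsigned dim = 0; dim < 3; dim++) {
            if ((node ^ dst) & (1u << dim)) {   /* this dimension's bit still differs */
                node ^= 1u << dim;              /* traverse one link                  */
                printf(" -> R%u", node);
            }
        }
        printf("\n");                           /* prints: route: R1 -> R0 -> R2 -> R6 */
        return 0;
    }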

 

Crossbar

Several of the ASICs (Hub, Router, and Crossbow) use a crossbar for linking on-chip inputs with on-chip output interfaces. For instance, an 8-way crossbar is used on the Crossbow ASIC; this crossbar creates direct point-to-point links between one or more nodes and multiple I/O devices. The crossbar switch also allows peer-to-peer communication, in which one I/O device can speak directly to another I/O device.

The Router ASIC uses a similar 6-way crossbar to link its six ports with the interconnection fabric, and the Hub ASIC links its four interfaces with a crossbar. A logical diagram of a 4-way (also referred to as four-by-four, or 4 x 4) crossbar is given in Figure 1-9; note that each output is determined by multiplexing the four inputs.

[Figure 1-9: Logical Illustration of a Four-by-Four (4 x 4) Crossbar]

Figure 1-10 shows a 6-way crossbar at work. In this example, the crossbar connects six ports, and each port has an input (I) and an output (O) buffer for flow control. Since there must be an output for every input, the six ports can be connected as six independent, parallel paths. The crossbar connections are shown at two clock intervals: Time=n, and Time=n+1.

[Figure 1-10: Crossbar Operation]

At clock T=n, the ports independently make the following parallel connections:

  • from port 1 to port 5

  • from port 2 to port 6

  • from port 3 to port 4

  • from port 4 to port 2

  • from port 5 to port 3

  • from port 6 to port 1

Figure 1-10 shows the source (Input) and target (Output) for each connection, and arrows indicate the direction of flow. When a connection is active, its source and target are not available for any other connection over the crossbar.

At the next clock, T=n+1, the ports independently reconfigure themselves into six new data links: 1-to-5, 2-to-4, 3-to-6, 4-to-1, 5-to-3 and 6-to-2. At clock intervals, the ports continue making new connections as needed. Connection decisions are based on algorithms that take into account flow control, routing, and arbitration.
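
A toy way to model those per-clock settings is a simple connection map, one output per input, with a check that no output is claimed twice. The port numbers below are the ones from the example above; this is only an illustration of the idea, not of the ASIC's arbitration logic:

    #include <stdbool.h>
    #include <stdio.h>

    #define PORTS 6

    /* conn[i] = j means "input of port i+1 drives output of port j+1". */
    static bool is_valid(const int conn[PORTS])
    {
        bool used[PORTS] = { false };
        for (int i = 0; i < PORTS; i++) {
            if (used[conn[i]])
                return false;       /* two inputs claiming the same output */
            used[conn[i]] = true;
        }
        return true;                /* a permutation: six parallel paths   */
    }

    int main(void)
    {
        int t_n [PORTS] = { 4, 5, 3, 1, 2, 0 };  /* T=n:   1->5, 2->6, 3->4, 4->2, 5->3, 6->1 */
        int t_n1[PORTS] = { 4, 3, 5, 0, 2, 1 };  /* T=n+1: 1->5, 2->4, 3->6, 4->1, 5->3, 6->2 */

        printf("T=n   valid: %s\n", is_valid(t_n)  ? "yes" : "no");
        printf("T=n+1 valid: %s\n", is_valid(t_n1) ? "yes" : "no");
        return 0;
    }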

 

Distributed Shared Address Space (Memory and I/O)

Origin2000 memory is located in a single shared address space. Memory within this space is distributed amongst all the processors, and is accessible over the interconnection fabric. This differs from a traditional bus-based system, in which memory is centrally located on and only accessible over a single shared bus. By distributing Origin2000's memory among the processors, memory latency is reduced: accessing memory near a processor takes less time than accessing remote memory. Although physically distributed, main memory is available to all processors.

I/O devices are also distributed within a shared address space; every I/O device is universally accessible throughout the system.

Origin2000 Memory Hierarchy

Memory in Origin2000 is organized into the following hierarchy:

[Figure: Memory Hierarchy, Based on Relative Latencies and Data Capacities]

Caches are used to reduce the amount of time it takes to access memory -- also known as a memory's latency -- by moving faster memory physically close to, or even onto, the processor. This faster memory is generally some version of static RAM, or SRAM.

The DSM structure of Origin2000 also creates the notion of local memory. This memory is close to the processor and has reduced latency compared to bus-based systems, where all memory must be accessed through a shared bus.

While data only exists in either local or remote memory, copies of the data can exist in various processor caches. Keeping these copies consistent is the responsibility of the logic of the various hubs. This logic is collectively referred to as a cache-coherence protocol.
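
The usual way to reason about such a hierarchy is an average-access-time estimate: time at the first level plus, for each miss, the time at the next level, with main-memory time itself a mix of local and remote latencies. The numbers in the sketch below are invented placeholders, not measured Origin2000 figures:

    #include <stdio.h>

    int main(void)
    {
        double l1_hit = 0.95, l2_hit = 0.95;                        /* hit rates (assumed)   */
        double t_l1 = 5, t_l2 = 50, t_local = 500, t_remote = 800;  /* latencies in ns (assumed) */
        double local_frac = 0.8;                 /* fraction of memory accesses served locally */

        /* memory time is a weighted mix of local and remote latency */
        double t_mem = local_frac * t_local + (1 - local_frac) * t_remote;

        /* average access time = L1 time + L1 misses * (L2 time + L2 misses * memory time) */
        double amat = t_l1 + (1 - l1_hit) * (t_l2 + (1 - l2_hit) * t_mem);

        printf("average access time: %.1f ns\n", amat);
        return 0;
    }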

 

System Bandwidth

Three types of bandwidth are quoted for Origin2000: peak bandwidth, sustained bandwidth, and bisection bandwidth.

Table 1-1 gives a comparison between peak and sustained data bandwidths of Origin2000.

Table 1-1: Peak and Sustained Bandwidths
_____________________________________________________________
Interface            Sustained Bandwidth      [Peak Bandwidth]
_____________________________________________________________
Memory               780 MB per second        [780 MB]
Node Card            1.25 GB per second       [1.56 GB]
Crossbow             2.5 GB per second        [3.12 GB]
Module (deskside)    5.0 GB per second        [6.24 GB]
Rack                 80 GB per second         [100 GB]
_____________________________________________________________

Table 1-2 lists the bisection bandwidths of various Origin2000 configurations both with and without Express Links.

Table 1-2: System Bisection Bandwidths
____________________________________________________________________________________________
System Size        Sustained Bisection Bandwidth       Sustained Bisection Bandwidth
(number of CPUs)   without Express Links [Peak]        with Express Links [Peak]
____________________________________________________________________________________________
8                  1.25 GB per second [1.56 GB]        2.5 GB per second [3.12 GB]*
16                 2.5 GB per second [3.12 GB]         5.0 GB per second [6.24 GB]
32                 5.0 GB per second [6.24 GB]         10 GB per second [12.5 GB]
64                 10 GB per second [12.5 GB]          N/A
128                20 GB per second [25.0 GB]          N/A
____________________________________________________________________________________________
* With Star Router

Table 1-3 lists the bandwidths of Ultra SCSI and FibreChannel devices.

Table 1-3: Peripheral Bandwidths
_______________________________________________
Peripheral         Bandwidth
_______________________________________________
Ultra SCSI         40 MB per second
FibreChannel       100 MB per second
_______________________________________________

