O2 Architecture

Last Change: 04/Dec/2007

By Ian Mapleson

This page discusses SGI's O2 workstation, covering:

aspects of O2's design and the impact its architecture has on system performance, computational functions, exploitable hardware features, etc.
the various main CPUs available and how one should evaluate CPU differences, their importance, comparisons with other systems, etc.
the effects of screen resolution and colour depth on system performance and how this correlates with the O2's architectural design.
the benefits of O2's design when dealing with 3D scenes that involve complex geometry/lighting calculations, comparisons with other systems, etc.
ICE, the dedicated ASIC which O2 has for accelerating image/video tasks.

If you are researching the O2 system, note that the main index has other relevant information pages, eg. SGI Performance Comparisons, Melting the ICE (image and video processing on O2), comparison between O2 and Indigo2, primitive graphics benchmark comparisons with all other SGI systems, and so on. Just about everything you could want to know really. I recommend reading all relevant material before making any purchasing or upgrade decisions.

The O2 workstation, first released on October 10th 1996, uses a system design entitled Unified Memory Architecture, or UMA for short (not to be confused with NUMA which refers to the Origin server architecture). SGI has their own White Paper [47K PDF] on the subject of UMA design, which I strongly recommend you read. This page is the result of my own extensive investigative work.

Most existing (and older) workstations and computers (pre-1998) are based on 'bus' technologies, where data is moved around the system via a shared bus from one subsystem to another. Subsystems include main memory, CPU, graphics, texture, image, video, I/O connections, networking ports, etc. Sometimes, there is a fast link between CPU and main RAM, but even then the link that connects these two elements to the rest of the system (eg. the bridge + PCI in many PCs) is slow when judged by the demands of today's applications - painfully slow for some tasks.

The problem with the shared-bus design is that, as data bandwidth demands become greater and tasks increase in complexity, many tasks become difficult to accomplish since they require vast amounts of data to be moved around the system, eg. the use of an incoming video stream as a texture in a 3D model. The normal response to such problems has been to increase the clock speed of the bus or to make it wider, but there comes a point when the bandwidth gained is small and not worth the extra cost.

UMA solves this problem by having just one 'unified' high speed memory block. The heart of the system is no longer the main CPU; instead, the memory/graphics controller becomes the focus of attention. Understanding how O2's UMA system works, and what the memory/graphics controller can do, is the key to comprehending the things O2 can do which other systems such as traditional PCs cannot. Note, however, that as time moves on, traditional designs solve these problems using other methods, eg. the Intel i760 graphics ASIC can take a video stream directly into itself to use as a texture on a 3D model; however, if the main CPU wanted to carry out operations on that video stream, it would need to be copied to main memory first - this is not the case with O2.

UMA doesn't solve every problem, but when it comes to satisfying the demands placed on workstations by users, it is an ideal low-cost solution. However, there comes a point when a task's complexity is so great, eg. rendering 750MByte seismic data sets, that UMA is not an appropriate solution; thus, systems like Octane use a crossbar-based approach which offers massive system and memory bandwidth (Octane can render a 750MByte volumetric data set in two or three seconds). At the high-end, the crossbar concept is combined with advanced interconnection technologies to offer the massively scalable systems known as Origin (file/data/web/media serving, number crunching), Onyx2 (all the power of Origin combined with the fastest graphics designs in the world), and newer equivalent models (Origin3000, Onyx3000, Onyx4, Altix, etc.)

All these concepts share a common approach: the focus is on moving data around in the most efficient way possible, removing the bandwidth bottleneck which has existed for many years in the computing world, and allowing each computer subsystem to operate at its maximum potential.

Here is SGI's simple O2 diagram (note that the small annotated numbers do not refer to bus speeds or bandwidths; I believe they denote ASIC pin counts):

The above figure is rather sparse of details, but it's a good summary. However, for the complete picture with full details, here is SGI's full-blown O2 diagram, which even shows some of the board-level layout features:

So how does UMA work? An example: suppose a video stream is brought in from the O2 digital camera and the data is stored in an area in the main RAM block (termed a 'Digital Media Buffer', or DMbuffer). If one then wished to use the data as a texture for a 3D model, all that needs to be done is to pass a pointer for the data area to the CRM chip, thus saving the need to copy the data as a whole to another part of the system, such as a separate graphics card. Hence, video-as-texture (2.6MB MPEG) is trivial with O2.
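
As a loose analogy only, the difference between copying a frame into separate graphics memory and handing over a reference to the same memory can be sketched in plain Python. This is not the DMbuffer or OpenGL API; the names `frame_buffer`, `copy_to_graphics_card` and `share_with_graphics`, and the frame size, are hypothetical and chosen purely for illustration.

```python
# Analogy only: one shared block of memory versus a copy into a separate card.
FRAME_BYTES = 640 * 480 * 4  # hypothetical frame: 640x480, 4 bytes per pixel

# Stand-in for a buffer in main RAM that the video input writes into.
frame_buffer = bytearray(FRAME_BYTES)

def copy_to_graphics_card(frame):
    """Bus-based design: every byte is duplicated into separate texture memory."""
    return bytes(frame)  # full copy; cost grows with the frame size

def share_with_graphics(frame):
    """UMA-style design: only a reference to the same memory is handed over."""
    return memoryview(frame)  # no pixel data is copied

texture_copy = copy_to_graphics_card(frame_buffer)
texture_view = share_with_graphics(frame_buffer)

# A new video frame arrives (here, only the first pixel changes).
frame_buffer[0:4] = b"\xff\xff\xff\xff"

print(bytes(texture_view[0:4]))  # the shared view sees the new pixel
print(texture_copy[0:4])         # the copy is stale until it is copied again
```

The shared view always reflects the latest frame at no extra cost, whereas the copy must be repeated for every frame; that repeated copy is the traffic a unified memory design avoids.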

Another example is volume rendering. Since there is only a single memory block, one has access to effectively unlimited texture memory (ie. limited only by main RAM size). Thus, one can easily manipulate large texture sets, eg. 256MB of CAT scan data.

Of course, having virtually no limit to texture memory also benefits other areas such as visual simulation (many different textures are needed for landscapes, buildings, trees, etc.) The UMA design ensures fast and reliable access to the data, with 2.1GB/sec peak transfer rate between main RAM and the memory/graphics controller (CRM).

By way of a summary, here is an edited version of comments made by Tom Furlong (SGI's vice-president and general manager for desk-side systems) in an interview about the O2:

"We got rid of the bus because current bus based systems reached their [bandwidth] limit to CPU and standard I/O; when you add 3-D image processing or audio they start to fall apart, wasting precious bandwidth copying data around. O2 is based on a new unified memory architecture that puts 2.1 gigabytes of system bandwidth right where the computation is done, that's 20 times the bandwidth of today's [1996] fastest PC. O2 doesn't waste any of its bandwidth moving the data around the system. Instead it has multiple computational memory, the CPU coordinates the work of graphics I/O video compression to accomplish the computation without extraneous data movements; with 20 times bandwidth and no wasted data movements it can handle monstrously large data sets, movies, and hugely complicated special effects without missing a beat. To get the massive amounts of data O2 implements standard I/O that includes serial, parallel and embedded CD-ROM, two 40 Megabytes per second SCSI channels, auto sensing, 100 megabit ethernet connection, two channel audio I/O, two video channels, one video output and we have thrown in a 64-bit PCI for anything left out. It's really a technology tour de force."

The Main CPU

Currently, the R5000, R5200, R7000, R10000 and R12000 CPUs can be used in O2. General information on these can be found on the main index (eg. comparisons with other systems, or different R10000 CPUs in one particular system); the following information is more concerned with the specific use of these processors in O2.

Although the CRM ASIC handles most graphics functions in hardware, all geometry and lighting calculations are handled by the main CPU. At first one might think this is a disadvantage, but it's cheaper and also means there is an easy upgrade path to increased performance: just get a faster/newer CPU (there are other unexpected benefits too which are discussed later). To this end, the R5000 is a good solution: it has been specifically designed to handle operations that are typically found in 3D graphics tasks, eg. MADD instructions. Do not dismiss the R5000 just because it isn't an R10000.

Please examine the detailed test results for the use of these CPUs in O2 before forming any conclusions. Also read Byte's article on the R5000, examine the R10000 performance comparison pages for R10K/195 and R10K/250, the SGI Performance Comparisons page, etc.

Here is a SPEC95 performance summary table (weighted means only) for R5000, R5200, R7000, R10000 and R12000 in O2, ordered by integer performance:

                             SPECint95     SPECfp95

R7000SC 600MHz 1MB L3:           ?            ?
R12000SC 400MHz 2MB L2:        19.30        13.60
R12000SC 300MHz 1MB L2:        14.49        10.42
R12000SC 270MHz 1MB L2:        13.10         9.80
R10000SC 250MHz 1MB L2:        12.40         9.71
R10000SC 195MHz 1MB L2:        10.10         8.77
R7000SC 350MHz 1MB L3:           ?            ?
R7000SC 300MHz 1MB L3:           ?           7.50
R10000SC 175MHz 1MB L2:         9.10         6.60
R5200SC 300MHz 1MB L2:          8.04         6.86
R10000SC 150MHz 1MB L2:         7.40         6.20
R5000SC 200MHz 1MB L2:          5.40         5.70
R5000SC 180MHz 512K L2:         4.82         5.42
R5000PC 180MHz (no L2):         3.70         4.55

Some people react with disappointment when first examining the R10000's/R12000's floating point (fp) performance in O2, compared to other SGI systems which use R10000 such as Octane and Origin (the discussion here will refer to R10000, but the same issues apply to R12000 in O2 as well). There are several things to be said on this subject:

SPEC95 cannot properly measure the capabilities of O2. If you look at application performance (eg. AIM), the R10000 does quite well, especially for integer (int) tasks (O2 can be just as fast as Origin for some int tasks). If you're interested in an R10000 configuration, you must decide whether the extra cost is worth the performance gain. Make sure you have your particular task tested before you buy. Operations such as off-screen rendering will benefit from an R10K, but don't think that O2 is a number crunching box because it isn't and was never designed to be. If fp number crunching is your main task, then you should be looking at Origin, Octane or Fuel, not O2. Certainly, O2's int performance is greatly improved with an R10K, easily matching Octane, almost always beating older systems such as Power Challenge, Indigo2, etc. and often beating an R10K/180MHz Origin200 (in fact, for m88ksim on SPEC95 running on R10K/250, O2 outperformed an Origin2000!).
R10K was never designed for a memory system like UMA. R10K was designed for a faster memory system than is used in O2, such as is used in the Octane/Origin/Onyx2 line; O2's memory system runs at a lower clock speed and has higher memory latency properties than Origin.
CRM contains memory control circuitry for the R5K, but not for R10K. To compensate for this, R10K O2 systems have an extra ASIC on the R10K daughterboard to handle L2 cache requests. CRM was designed for 32byte cache refills, whereas R10K is designed for 64byte or 128byte refills. The extra ASIC converts R10K cache refill requests into multiple 32byte requests. This means extra time spent during cache misses. As a result, SPECfp95 results on O2 with R10K are much lower than R10K in other modern SGI systems, even though real world application performance can be 60% better than an R5K at a higher clock (SPEC95 punishes cache misses quite heavily). Also, R10K in O2 can only offer 1 outstanding cache miss, compared to 4 in Octane/Origin/Onyx2; cache-sensitive code will suffer because of this. Examine my R10K/195 performance comparison analysis for complete details.
R10K does not help much for 3D graphics tasks that do not involve 64bit processing (eg. Gouraud shading). This is because most 3D graphics tasks require only single precision floating point (fp) computation (this especially applies to lighting and geometry). Thus, a 180MHz R5K will be faster than a 150MHz R10K for non-textured 3D graphics tasks. But at an equal clock speed, R10K will be about 25% faster than R5K for certain graphics tasks. Performance figures at a similar clock are higher for R10K partly because 3D graphics does involve a degree of int processing (eg. pointer chasing, array handling) and such int tasks are faster with R10K.
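
The cache refill conversion described above (R10K refills of 64 or 128 bytes being broken into 32byte requests for CRM) can be illustrated with a few lines of Python. This only shows the arithmetic, not how the actual ASIC is implemented; the function name `split_refill` and the example address are hypothetical.

```python
CRM_REFILL_BYTES = 32  # refill size CRM was designed for, as stated in the text

def split_refill(address, size, chunk=CRM_REFILL_BYTES):
    """Break one cache-line refill into chunk-sized memory requests."""
    if size % chunk != 0:
        raise ValueError("refill size must be a multiple of the chunk size")
    # One (start address, length) pair per request sent on to memory.
    return [(address + offset, chunk) for offset in range(0, size, chunk)]

for line_size in (32, 64, 128):
    requests = split_refill(0x1000, line_size)
    print(line_size, "byte refill ->", len(requests), "request(s)")
```

A 64byte refill becomes two requests and a 128byte refill becomes four, so each cache miss costs several memory transactions instead of one; with only one outstanding miss allowed, that extra time cannot be hidden behind other misses.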

So how relevant is SPEC? Consider the JPEG compression int test. In O2, this operation can be done in real-time by dedicated hardware (ICE), so the SPEC result is of little value. Also, as far as the R5K is concerned, many SPEC tests employ double precision computation - something the R5K is not optimised for. The R5K does not have any special int optimisations. For R10K in O2, the int performance is very good, so O2 could easily act as a low-cost web server - indeed, I know of several institutions which use O2 for just this function.

Also, as far as I know, none of the SPECfp95 tests use the kind of single precision fp calculations that R5K was specifically designed for, namely MADD-style computation (the matrix math found in 3D graphics). See the Byte article for more information.

One must decide carefully whether the improved performance offered by an R10K is worth the extra cost, though R10K O2 systems have become considerably cheaper in recent years, except perhaps for R12K/400 systems which seem to retain their value quite well. Either way, always have your application tested before making any purchasing decision. It must be said though, R10K/R12K systems are definitely good for integer tasks and 2D work; my main O2 system used to be an R5200/300, but I replaced it with an R12K/400 after tests showed the R12K to be about 100% faster.

There are several additional aspects of the main CPU in O2 that are worthy of discussion.

The Impact of Screen Resolution on CPU Performance

Lower screen resolutions and shallower colour visuals will allow O2 to run some kinds of application faster. For example, on an R5000SC/200MHz O2, changing the screen resolution from 1280x1024 32+32 down to VGA16 improves the STREAM memory benchmark by 13 percent!

This effect may sound bizarre, but the explanation is quite simple and correlates correctly with the O2's UMA design (refer back to the architecture diagrams given earlier for clarification).

The CRM ASIC handles data transfers between itself and the:

main CPU,
Image Compression Engine (ICE),
Display Engine (DE),
and I/O Engine (IOE).

For most users, the vast majority of the available bandwidth from CRM will be used by the DE. A typical 32bit 1280x1024 72Hz display requires a bandwidth of 360MB/sec. While this data transfer is going on, the main CPU and other system components must utilise the remaining bandwidth. Thus, if one decreases the display complexity, there will be less data moving from CRM to DE and hence more opportunities for CRM to service the rest of the system. As a result, memory-intensive applications will speed up, and tasks such as video I/O won't have to compete to the same degree for bandwidth resources. There is always enough bandwidth to handle video I/O at real-time rates; the difference is that the IOE will be more likely to be able to transfer data when first requested (quicker response, better reliability, fewer conflicts with other data being moved around, lower possibility of error, etc.)
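
The 360MB/sec figure can be checked with simple arithmetic. The sketch below assumes that the display engine reads every visible pixel once per refresh and that "MB" here means 2^20 bytes; treating the 2.1GB/sec peak as decimal gigabytes is also an assumption. It ignores anything beyond the visible colour buffer.

```python
def display_bandwidth(width, height, bits_per_pixel, refresh_hz):
    """Bytes per second the display engine must fetch to refresh the screen."""
    return width * height * (bits_per_pixel // 8) * refresh_hz

MIB = 1024 * 1024
PEAK_BYTES_PER_SEC = 2.1e9  # peak CRM figure from the text (decimal GB assumed)

high = display_bandwidth(1280, 1024, 32, 72)
low = display_bandwidth(640, 480, 16, 60)

print(f"1280x1024 32bit 72Hz: {high / MIB:.0f} MB/sec")   # 360 MB/sec
print(f"640x480 16bit 60Hz:   {low / MIB:.0f} MB/sec")    # 35 MB/sec
print(f"share of peak at high res: {high / PEAK_BYTES_PER_SEC:.0%}")
```

Dropping to a small, shallow display cuts the refresh traffic by roughly a factor of ten, which is bandwidth CRM can then give to the CPU and the I/O engine.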

In fact, for someone whose main task is video I/O processing where the actual on-screen display isn't important (ie. only the video I/O signals matter, and perhaps parallel, serial, SCSI transfer too), it would be advantageous to be able to shut down the CRM/DE data transfer completely, allowing other system components to make full use of the available bandwidth, especially the main CPU (eg. offscreen rendering would be quicker). I am currently investigating how this can be achieved. It is likely that using a VT terminal connected to the serial port, instead of using the main monitor, would be one way of achieving this, but at the moment I have no practical data to prove this. When I work out how to use an O2 via a VT terminal and can obtain a suitable VT, I'll run some tests.

Who cares? Why does it matter? Well, consider that a 13% performance increase is, on average, better than the performance increase obtained by upgrading from a 180MHz R5000SC CPU to a 200MHz R5000SC CPU. For some users with memory-bound work, it could mean a task taking roughly 53 minutes instead of 60 minutes. If one has many such tasks to run, that extra saving could be very useful to those with time constraints; for medical people, it could be a life-saver.

Observe how the STREAM bandwidth figures in the following table (MB/sec), for a 200MHz R5000SC O2 running IRIX 6.3, gradually improve (ie. increase) as the display complexity decreases:

Display Complexity      Copy      Scale     Add     Triad

1280-1024-32-32-75      69.2      69.1      69.6    70.7
1280-1024-32-32-60      72.1      71.5      72.8    73.8
1280-1024-32-32-50      74.5      73.5      72.7    73.7
1280-1024-32-32-48      75.4      74.5      74.1    75.1
1280-1024-16-16-75      74.1      73.1      73.4    74.9
1280-1024-16-0-75       76.6      75.5      74.3    76.2
1024-768-32-32-75       75.6      75.1      75.1    75.9
1024-768-32-32-60       76.8      76.2      75.8    77.0
1024-768-32-0-60        77.1      76.1      76.1    77.0
1024-768-16-0-60        80.8      78.9      79.3    80.3
800-600-32-32-72        79.5      78.1      76.5    77.9
800-600-32-32-60        80.5      78.6      78.7    80.0
640-480-32-32-60        82.5      80.2      80.8    82.0
640-480-16-16-60        82.7      81.3      82.1    83.0
640-480-16-0-60         83.2      81.2      82.2    83.2
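
To make the trend easier to read, the gain over the most demanding display mode can be computed directly from the table. The sketch below copies the Copy and Triad columns for three rows only; the dictionary layout and the helper name `percent_gain` are my own choices for illustration.

```python
# Copy and Triad figures (MB/sec) taken from three rows of the table above.
stream = {
    "1280-1024-32-32-75": {"copy": 69.2, "triad": 70.7},
    "1024-768-32-32-60":  {"copy": 76.8, "triad": 77.0},
    "640-480-16-0-60":    {"copy": 83.2, "triad": 83.2},
}

def percent_gain(baseline, value):
    """Percentage improvement of value over baseline."""
    return (value / baseline - 1.0) * 100.0

base = stream["1280-1024-32-32-75"]
for mode, figures in stream.items():
    gains = {name: percent_gain(base[name], mbs) for name, mbs in figures.items()}
    print(f"{mode:20s} copy +{gains['copy']:.1f}%  triad +{gains['triad']:.1f}%")
```

On these figures, the least demanding mode in the table improves Copy by about 20 percent and Triad by about 18 percent relative to 1280-1024-32-32-75.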

I don't yet have any data for STREAM running on O2 when the display is a VT terminal. I would be most interested to hear from anyone who has run such a test, or from someone who has any idea how the CRM/DE data transfer could be shut down under software control.

It is possible that these effects apply to everyday tasks such as 3D modeling, movie conversion, etc. Some tasks may be I/O-disk bound, in which case the display complexity will be irrelevant; other tasks may be compute bound - lowering the display to VGA16 could give a good speed increase.

Note: forcing the monitor to go into power saving mode does not shut down the CRM/DE data transfer.

Geometry/Lighting and Comparing to Hardware Accelerated Systems

How can an R4600PC 100MHz Indy XL outperform an R4400SC 250MHz Indigo2 Elan for a 3D graphics task? Answer: when the 3D scene includes complex geometry and lighting calculations.

O2 does all geometry and lighting calculations in the main CPU. The same is true for Indy XL, Indigo2 XL, or any similar system such as Crimson Entry. At first this may sound like a disadvantage, but as main CPUs have improved in power, we have now entered an era where older systems with good main CPUs and no hardware graphics acceleration can easily outperform older systems with old types of hardware accelerator board (XS24, XZ, Elan and Extreme). When this situation occurs, it doesn't really matter what type of CPU is present in the system that has the hardware acceleration. The key point is that the main CPU in the former system has an effective fp performance that is better than the Geometry Engines (GEs) on the latter system's accelerator board.

The original XZ graphics offered 64MFLOPS of GE power; later revisions (seen as Elan by hinv on Indigo2, and XZ on Indy) offered 128MFLOPS, and Extreme offered 256MFLOPS. The R5000 CPU in Indy offered between 300MFLOPS and 360MFLOPS peak single-precision MADD performance, while the best currently known CPU for O2 (custom fit R7000C/600MHz) offers 1.2GFLOPS peak.

Complex lighting calculations can hit these older accelerator boards (XZ/Elan/Extreme) hard. All older SGIs with hardware acceleration only support one hardware light (compare to InfiniteReality which supports four), so when multiple lights are present, the calculations become too complex, context switches occur because temporary data must be stored somewhere, the graphics board FIFOs fill up because the main CPU is sending in data faster than the board can process it, the CPU has to pause constantly to wait for the FIFOs to drain, and thus the GEs become the main bottleneck. In such situations, the main CPU may be little used - I saw only 2% CPU usage when running such a scenario on my Indigo2 Elan.

On the other hand, systems like XL offload all such calculations onto the main CPU. When things get tough, the main CPU runs as fast as it can just as always, hence the situation with FIFOs filling up, context switches occurring, etc. never happens and the system is ironically able to give a fair performance. That is how an Indy XL can outperform an Indigo2 Elan. It is also the reason why an R5000 Indy XZ can be slower than an R5000 Indy XL (the former must do its geometry/lighting calculations on the XZ board, completely wasting the much higher fp power of the main CPU), although this will often not be the case for any scene that only involves one light source because the presence of the hardware Z buffer in an XZ Indy can be more important than the higher fp speed of the main CPU.

What relevance is this to O2? It means that as the CPU speed increases, it's possible that O2 can outperform even a MaxIMPACT for certain types of task - it'll definitely be able to outperform a HighIMPACT or SolidIMPACT anyway. The reason is geometry/lighting: as the main CPUs for O2 improve, the single-precision fp performance will eventually exceed that offered by the GEs of systems like SolidIMPACT (480MFLOPS), HighIMPACT (480MFLOPS) and MaxIMPACT (960MFLOPS). A 300MHz R12000 would offer 600MFLOPS, so I would expect an R12K/300 O2 to outperform a SolidIMPACT Indigo2 for tasks involving complex geometry and lighting (especially something like multiple spotlights). In theory, the R7K/600 should allow O2 to outperform an Octane/SI (at least 1GFLOPS for the O2 compared to half that for the Octane/SI's GEs).
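
The peak figures used in this comparison follow from clock speed multiplied by fp operations per clock. A minimal sketch, assuming 2 single-precision fp operations per clock (one multiply-add) for each CPU, and using the GE figures quoted above; the R5000/180 entry is simply one point in the 300 to 360MFLOPS range mentioned earlier. These are theoretical peaks, not measured throughput.

```python
FLOPS_PER_CLOCK = 2  # one multiply-add per clock (assumption for all CPUs listed)

def peak_mflops(clock_mhz, flops_per_clock=FLOPS_PER_CLOCK):
    """Theoretical peak single-precision MFLOPS for a given clock speed."""
    return clock_mhz * flops_per_clock

# Geometry Engine figures quoted in the text (MFLOPS).
geometry_engines = {
    "XZ": 64, "Elan": 128, "Extreme": 256,
    "SolidIMPACT": 480, "HighIMPACT": 480, "MaxIMPACT": 960,
}

cpus = {"R5000/180": 180, "R12000/300": 300, "R7000/600": 600}

for cpu, mhz in cpus.items():
    cpu_mflops = peak_mflops(mhz)
    beaten = [name for name, ge in geometry_engines.items() if cpu_mflops > ge]
    print(f"{cpu}: {cpu_mflops} MFLOPS peak, exceeds: {', '.join(beaten)}")
```

Whether the peak advantage turns into a real one still depends on the other bottlenecks (fill rate, texture bandwidth, memory latency), so the comparison is only meaningful for geometry/lighting-bound scenes.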

These effects will be important unless the bottleneck becomes something else such as:

pixel fill (much higher on MaxIMPACT),
texture bandwidth,
memory bandwidth/latency,
etc.

These could be important if, for example, the 3D scene contained a very large number of polygons, or a complex dynamic scene. My comments above mainly refer to scenes that involve multiple lights and low polygon counts, eg. VRML worlds, though O2 does have the extra advantage of texture memory capacity being limited only by main RAM size (compared to the small 4MB limit in IMPACT).

For a more thorough investigation and discussion of these issues, please see my HolliDance Benchmark page, which includes a table of example performance results for a typical dynamic 3D real-time scene that contains complex lighting. If you own or have access to an SGI, please consider submitting a set of results as I am convinced that, for relevant tasks, the HolliDance Benchmark results table will be a very useful resource to 2nd-hand buyers and those considering upgrades from older systems. It should also be useful to users of faster systems who may be interested in possible performance degradation when the number of lights reaches a certain threshold (see the benchmark page for more details).

When thinking about O2, these issues may be important if your task involves real-time 3D animation, VRML, low-end visual-simulation, etc. It could be especially relevant if you have an older system, are considering an upgrade, and aren't sure whether to go for something like an Indigo2 Extreme/IMPACT, an O2, or an entry-level Octane. It's quite surprising to think that O2 could gradually be seen to outperform many existing SGIs for tasks that involve complex lighting. However, I doubt this will occur with Onyx2 since IR supports four hardware lights and the GEs offer 2.56GFLOPS of processing power - much greater than even a theoretical 800MHz R14K (unless such a future CPU was able to do 4 fp operations per clock instead of 2 fp operations per clock).

With hindsight, and certainly for particular types of O2 user (eg. anyone doing VRML), the fact that O2 does all geometry/lighting calculations in software could prove very advantageous in terms of much greater performance in the future. Note that this kind of task is very different from the typical 'primitive' level benchmarks shown on technical reports and PR web sites. Such simplistic performance figures (eg. flat tris/sec, or lit, shaded, textured triangles/sec) almost always involve either no lighting whatsoever, or just a single directional light, thus hardware acceleration boards never experience the problem of having to deal with more light sources than can be handled by the hardware at one time. A good example of this is that although an R4600PC 100MHz Indy XL outperforms an R4400SC 250MHz Indigo2 Elan for the HolliDance 3D animation program by 8 percent (large window, no texture), if one turns all the lights off then the Indigo2 immediately becomes 158% quicker than the Indy.

What I've tried to highlight here is that you should be very wary of assuming O2 must be better or worse than older systems simply because it's newer, has a better main CPU, etc. The reality may be much more complex because of the way graphics hardware works and how the different components of a system interact, combined with the fact that different systems often work in very different ways.

For example, one might assume that an O2 should outperform an Indigo2 Extreme for Gouraud shaded tasks, and indeed it does on the primitive level benchmarks by a moderate to reasonable margin (between 7 and 65 percent for various CPUs); but what might be a surprise to many is that O2 can completely stomp over an Indigo2 Extreme for a 3D task that involves multiple lights. For the HolliDance benchmark, compared to R4400SC/250MHz Indigo2 Elan, the O2 was 510 percent faster! The primitive level benchmarks would have suggested a difference of around 170%.

But turn off the lights and the difference changes drastically: O2 is now 144% faster than Indigo2 Elan, a figure which correlates much better with the primitives tests. In other words, when the complex lighting is turned off, both systems speed up, but Indigo2 Elan speeds up by a much greater degree (300% compared to 60%) because all the horrible bottlenecks concerning the GEs are removed, though it's still slower overall. Obviously, I would expect the differences between O2 and Indigo2 Extreme for HolliDance to be less, but I reckon O2 would still be at least 200% quicker when the lights are turned on (as opposed to the 20% difference one might expect from the primitives tests).
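
The percentages quoted for O2 and Indigo2 Elan are consistent with each other, which a short calculation shows. The sketch assumes that "N percent faster" means a speed ratio of 1 + N/100; the helper names are hypothetical.

```python
def times_faster(percent):
    """Convert 'N percent faster' into a speed ratio (eg. 100% -> 2.0x)."""
    return 1.0 + percent / 100.0

def percent_faster(ratio):
    """Convert a speed ratio back into 'N percent faster'."""
    return (ratio - 1.0) * 100.0

o2_vs_elan_lit = times_faster(510)   # O2 relative to Indigo2 Elan, lights on
o2_gain_unlit = times_faster(60)     # O2 speed-up when the lights are turned off
elan_gain_unlit = times_faster(300)  # Indigo2 Elan speed-up with the lights off

# Both systems speed up, so the ratio between them shrinks accordingly.
o2_vs_elan_unlit = o2_vs_elan_lit * o2_gain_unlit / elan_gain_unlit
print(f"O2 is {percent_faster(o2_vs_elan_unlit):.0f}% faster with the lights off")
```

The result is 144%, matching the figure measured with the lights turned off: 6.1 times faster with the lights on, multiplied by 1.6 and divided by 4, gives 2.44 times.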

3D graphics is a strange thing. Yet again, this is more proof, if any were needed, that the only benchmark test one should really trust when making a purchasing decision is one's own application.

ICE (Image Compression Engine)

ICE consists of two parts: a 66MHz 64bit R4K-derived control logic unit plus a 66MHz SIMD 128bit MDMX-style central processing unit. The SIMD core can do sixteen 8bit MACs or eight 16bit MACs/clock. Each MAC is 2 operations (multiply + add), so 66M * 2 * 16 = approximately 2 billion operations/sec. So, the MAC figure is about 1 billion MACs/sec for 8bit integer ops and about 500 million MACs/sec for 16bit integer ops (ICE cannot be used for fp computation). The unit as a whole is designed to handle multiple data streams.
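
The throughput figures above can be reproduced from the clock speed and vector width. This sketch assumes the clock is exactly 66MHz and that one MAC per lane is issued every clock, so the results are theoretical peaks.

```python
CLOCK_HZ = 66_000_000  # 66MHz SIMD unit (exact value assumed)
VECTOR_BITS = 128      # width of the SIMD datapath
OPS_PER_MAC = 2        # one multiply plus one add

def macs_per_sec(element_bits):
    """Peak multiply-accumulates per second for a given integer element size."""
    lanes = VECTOR_BITS // element_bits  # 16 lanes at 8bit, 8 lanes at 16bit
    return CLOCK_HZ * lanes

for bits in (8, 16):
    macs = macs_per_sec(bits)
    print(f"{bits}bit: {macs / 1e6:.0f} million MACs/sec, "
          f"{macs * OPS_PER_MAC / 1e9:.2f} billion ops/sec")
```

This gives 1056 million MACs/sec (2.11 billion ops/sec) for 8bit data and 528 million MACs/sec for 16bit data, which round to the figures quoted above.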

The controller element is programmable, to allow for future video and image formats - this means it's likely that the unit is perfectly capable of doing four 32bit ops or two 64bit ops per clock, but I don't think the current libraries support such operations since today's video/image tasks don't need them.

ICE allows one to do some impressive real-time image and video operations, some of which are shown in the various O2 demo programs. Real-time examples include: edge detection, colour space conversion, luma and chroma keying, etc. For a more thorough description of ICE, please see my main ICE page.

Incidentally, because of the many questions about ICE that I've thrown at people in SGI, a member of SGI's Global Technical Support has begun the process of writing a proper report on ICE for a future issue of Pipeline (a few months' time probably). I will be helping in the creation of the report to a limited degree.

Finally, here is SGI's own description of the ICE system, including comparisons with Indy (note that IRIX 6.5 has a newer API for dealing with O2's digital media features):

The following table lists some key digital media hardware
differences between O2 and Indy:

                         Table: O2 vs. Indy Hardware

   O2                                 Indy

   Image and Compression Engine
   (ICE):
   * Built-in motion JPEG video       * Motion JPEG video
     compression/decompression          compression/decompression
                                        requires optional Cosmo
                                        Compress board
   * Built-in imaging                 * Imaging acceleration not
     acceleration                       available on Indy

   Video input and output             Video input (video output
                                      requires IndyVideo or
                                      IndyVideo 601 option)

   Screen-capture video source        Requires optional Indy
   (graphics screen available as      Video™ card
   video input device)

   Improved digital video camera      IndyCam™ and external
   with built-in microphone and       microphone
   shutter button

Silicon Graphics is also releasing IRIX™ 6.3 for O2. This updated
OS version has the following new elements:

   * New digital media buffer (DMbuffer) programming interface for
     sharing unified memory among the application, video I/O
     devices, compression, graphics rendering, and graphics
     display
   * New Video Library (VL) programming interface to DMbuffers
   * New digital media image conversion (dmIC) programming
     interface based on DMbuffer for direct data transfer among
     image-conversion algorithms/devices, video I/O, and graphics
   * Hardware-accelerated OpenGL imaging extensions


Audio and Video I/O Ports

The following I/O devices transfer audio samples and video pixels
into and out of main system memory:

   * Camera and camera microphone
   * Two line-level analog stereo outputs and one line-level
     analog stereo input
   * S-video and composite video in/out
   * Headphones out
   * Microphone in (mono)
   * Speaker output
   * Optional CCIR 601 digital video adapter in/out


Digital Media Buffer Architecture

The DMbuffer is a new API for programmatic access to a new IRIX
operating system feature that unifies the memory buffering systems
of live video devices, such as video input and output and image
compression and decompression. Also, OpenGL can both read from and
render to the DMbuffer system, thus enabling completely
programmable video effects: anything that you can render to a
window you can also render offscreen and send directly to video
output or compression. Furthermore, video input and decompression
output are available for graphics display.

The software architecture consists of the following elements:

   * DMbuffer
   * Ability to treat DMbuffer data as pbuffer or texture map data
     in OpenGL
   * VL receive/send DMbuffers to/from video I/O hardware
   * ICE (Image and Compression Engine) uses DMbuffers for input
     and output
   * New Digital Media Library (libdmedia) image conversion API
     (dmIC)
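
The central idea in the list above is that every subsystem works on
the same buffer in unified memory instead of copying it. The sketch
below models that idea only. BufferPool, allocate and release are
hypothetical names chosen for illustration, not DMbuffer routines,
and a Python memoryview stands in for a reference to shared memory.

```python
class BufferPool:
    """Hypothetical fixed-size buffer pool (illustration, not the DMbuffer API)."""

    def __init__(self, buffer_size, count):
        # One contiguous block of memory holds every buffer in the pool.
        self._storage = bytearray(buffer_size * count)
        self._size = buffer_size
        self._free = list(range(count))

    def allocate(self):
        """Hand out a free buffer as a view onto the shared storage."""
        index = self._free.pop()
        start = index * self._size
        return index, memoryview(self._storage)[start:start + self._size]

    def release(self, index):
        """Return a buffer to the pool once all users are finished."""
        self._free.append(index)


pool = BufferPool(buffer_size=8, count=2)
index, producer_view = pool.allocate()

# A second subsystem takes its own reference to the same bytes: no copy.
consumer_view = producer_view[:]

# The producer (say, video input) writes a pixel value...
producer_view[0] = 0x7F

# ...and the consumer (say, a codec) sees it immediately.
print(consumer_view[0])

pool.release(index)
```

Because both views refer to one block of storage, handing a buffer
from one stage to the next costs a reference rather than a transfer
of pixel data.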


Image Processing Engine

ICE is a chip, and digital media image conversion (dmIC) is a
software interface. Together, these two components enable video
compression/decompression functions; they also allow applications
to display multiple image streams.

The ICE chip contains the following components:

   * MIPS RISC core for program control
   * Integer vector unit capable of 8 multiply-accumulates per clock
   * Bit stream encoder and decoder
   * Intelligent DMA controller

These features are tied together with highly optimized code for
applications such as JPEG encode and decode, general and separable
convolutions, color matrix multiplies, and histogram generation.
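
To illustrate the difference between general and separable
convolutions mentioned above, the following sketch applies the same
3x3 filter both ways. It is plain Python for illustration, not ICE
code: the function names are made up here, borders are simply left
out ("valid" region only), and the kernel is applied without
flipping, which makes no difference for a symmetric kernel.

```python
def convolve_general(image, kernel):
    """Direct 2D filter: K*K multiply-accumulates per output pixel."""
    k = len(kernel)
    h, w = len(image), len(image[0])
    out = []
    for y in range(h - k + 1):
        row = []
        for x in range(w - k + 1):
            acc = 0
            for j in range(k):
                for i in range(k):
                    acc += kernel[j][i] * image[y + j][x + i]
            row.append(acc)
        out.append(row)
    return out


def convolve_separable(image, col_kernel, row_kernel):
    """Row pass then column pass: 2*K multiply-accumulates per output pixel."""
    k = len(row_kernel)
    h, w = len(image), len(image[0])
    # Horizontal pass over every input row.
    tmp = [[sum(row_kernel[i] * image[y][x + i] for i in range(k))
            for x in range(w - k + 1)]
           for y in range(h)]
    # Vertical pass over the intermediate result.
    return [[sum(col_kernel[j] * tmp[y + j][x] for j in range(k))
             for x in range(w - k + 1)]
            for y in range(h - k + 1)]


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
col = row = [1, 2, 1]
# A separable 2D kernel is the outer product of its column and row parts.
kernel = [[c * r for r in row] for c in col]

general = convolve_general(image, kernel)
separable = convolve_separable(image, col, row)
print(general == separable)
```

For a 3x3 kernel the separable form needs 6 multiply-accumulates per
pixel instead of 9; for 7x7 it is 14 instead of 49.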

Providing the functionality of the Cosmo Compress™ option card
for Indy, ICE is even more flexible than its predecessor. In
addition to handling single streams of live video, ICE is easily
shared between multiple smaller streams (of any size and rate);
for example: 4 quarter-size, full-rate streams are supported as
easily as 1 full-size, full-rate (or 2 half-size, or 3 third-size,
or 2 full-size, half-rate, and so on). Since there is no built-in
video clock or video dimensionality on the ICE chip, you can also
use it for non-standard sizes and rates; for example, film aspect
ratio at film rate for film animation preview to the graphics
monitor.
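
The stream combinations listed above are equivalent in total pixel
throughput, which the short calculation below makes explicit. The
frame size (640x480) and rate (30 frames per second) are assumed
nominal NTSC square-pixel figures used only for the arithmetic, and
the function name is invented for this sketch.

```python
from fractions import Fraction

FULL_WIDTH, FULL_HEIGHT = 640, 480  # assumed full-size frame
FULL_RATE = 30                      # assumed nominal full frame rate


def pixel_rate(streams, size_fraction, rate_fraction):
    """Total pixels per second for a set of identical streams."""
    pixels_per_frame = FULL_WIDTH * FULL_HEIGHT * size_fraction
    return streams * pixels_per_frame * FULL_RATE * rate_fraction


configs = {
    "1 full-size, full-rate": pixel_rate(1, Fraction(1), Fraction(1)),
    "2 half-size, full-rate": pixel_rate(2, Fraction(1, 2), Fraction(1)),
    "3 third-size, full-rate": pixel_rate(3, Fraction(1, 3), Fraction(1)),
    "4 quarter-size, full-rate": pixel_rate(4, Fraction(1, 4), Fraction(1)),
    "2 full-size, half-rate": pixel_rate(2, Fraction(1), Fraction(1, 2)),
}

for name, rate in configs.items():
    print(f"{name}: {int(rate)} pixels/sec")
```

Every configuration comes to the same total, which is consistent
with the text's statement that these combinations are handled
equally easily.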

With the Indy, all imaging and compression calculations were done
by the main CPU. ICE, which functions as a separate CPU, now
handles these calculations, which frees the main CPU to handle
other processes. Also with the Indy, you had to purchase dedicated
cards, such as a JPEG card, to handle jobs such as compression.
Silicon Graphics designed O2 with flexibility as a key objective.
Consequently, the system can handle JPEG compression as well as
image-processing functions, without having to purchase dedicated
cards for each process.

The IO Engine (IOE) is a chip that brings video and audio into and
out of the system. Both IOE and ICE feature direct memory access
(DMA) controllers, which enable them to read compressed images
and output the information to a video out channel.

Not only do IOE, ICE, and UMA simplify the sharing of digital
media data between subsystems, but their interaction is also many
times faster than more common methods of transferring data between
subsystems over a system bus.


New Image Conversion API

The Digital Media Library (libdmedia) that's included with IRIX
6.3 features a new digital media image conversion library (dmIC).
You use this low-level API for memory-to-memory image
compression/decompression and conversion.

dmIC supports the standard software image codecs supported by the
older Compression Library (libcl) interface in IRIX 6.2 and
earlier releases. dmIC also supports the real-time motion JPEG
encode/decode capability of the O2 ICE processor:

   * The dmIC interface makes software image codecs and
     hardware-accelerated memory-to-memory codecs look the same to
     application developers.

   * dmIC operates on image data stored in DMbuffers.
     This makes it possible to share image data between hardware
     or software codecs and OpenGL or the Video Library, without
     copying data.

   * dmIC does not support in-line compression devices that are
     integrated into video capture or playback hardware paths; for
     example, Cosmo Compress or Impact Compress.
     These kinds of devices require a slightly different
     programming model from the model used to send data to and
     receive data from an asynchronous memory-to-memory processor.
     The older libcl continues to provide the applications
     programming interface to these kinds of devices.

   * An application can query dmIC to determine whether the
     current system offers a real-time implementation of a
     particular memory-to-memory codec; for example, JPEG.
     The real-time JPEG codec on O2 supports full-rate
     encode/decode at NTSC/PAL square pixel, CCIR 601/525, and
     CCIR 601/625 video timings. On systems that are not equipped
     with a real-time memory-to-memory codec, an application can
     also use the non-real-time software implementation.

   * The Compression Library functionality offered in IRIX 6.2
     will continue to be supported in IRIX 6.3 and future releases
     in order to ensure backward compatibility for applications.

   * Starting with IRIX 6.3, MPEG audio/video encode and Cinepak
     encode capabilities are bundled with every Silicon Graphics
     system. These software encoders no longer require a Silicon
     Graphics run-time license.
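
The query-and-fall-back behaviour described in the list above
(prefer a real-time codec when the system has one, otherwise use the
software implementation) can be modelled as below. The Converter
record and pick_converter function are hypothetical and exist only
for this illustration; they are not dmIC routines, and the converter
names are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Converter:
    """Hypothetical description of one available image converter."""
    name: str
    codec: str
    realtime: bool  # True for a hardware-accelerated, real-time codec


def pick_converter(available, codec):
    """Prefer a real-time implementation; fall back to software."""
    matches = [c for c in available if c.codec == codec]
    if not matches:
        return None
    # True sorts above False, so a real-time converter wins when present.
    return max(matches, key=lambda c: c.realtime)


with_hardware = [
    Converter("software JPEG", "jpeg", realtime=False),
    Converter("hardware JPEG", "jpeg", realtime=True),
]
software_only = [Converter("software JPEG", "jpeg", realtime=False)]

print(pick_converter(with_hardware, "jpeg").name)
print(pick_converter(software_only, "jpeg").name)
```

The application code is the same in both cases; only the converter
that is selected differs.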

The new dmIC routines are declared in the public header
. The new DMbuffer routines for creating
and manipulating DMbuffers are declared in .


OpenGL Extensions for Image Data

Silicon Graphics created OpenGL extensions for O2, which allow you
to use DMbuffers as either pbuffers or texture maps. The company
also designed an OpenGL extension for rendering YCrCb (4:2:2)
interlaced data, which lets you save video display pixels in a
pixel format, rather than converting them to bits. Using these
extensions, you can also perform hardware color space conversions
from YCrCb to RGB.
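
As a rough picture of what a YCrCb-to-RGB conversion of 4:2:2 data
involves, the sketch below converts a pair of pixels that share one
set of chroma samples. The coefficients are the common full-range
ITU-R BT.601 (JFIF-style) values; the text does not say which exact
matrix or value range the O2 hardware uses, so treat them as an
assumption. The function names are invented for this example.

```python
def clamp8(value):
    """Round to the nearest integer and clamp to the 8-bit range."""
    return max(0, min(255, int(round(value))))


def ycrcb_to_rgb(y, cb, cr):
    """Convert one pixel using full-range BT.601 coefficients."""
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return clamp8(r), clamp8(g), clamp8(b)


def expand_422_pair(y0, y1, cb, cr):
    """In 4:2:2 data, two neighbouring pixels share one Cb and one Cr."""
    return ycrcb_to_rgb(y0, cb, cr), ycrcb_to_rgb(y1, cb, cr)


# Neutral chroma (128) yields greys: here one black and one white pixel.
print(expand_422_pair(0, 255, 128, 128))
```

The conversion is a 3x3 matrix multiply plus an offset per pixel,
which is the kind of operation a color matrix stage can perform.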

In addition to the new OpenGL extensions, O2 provides hardware
acceleration for the following existing extensions:

   * Color scale and bias
   * Color table look-ups
   * Convolutions: 3x3, 5x5, and 7x7 (separable and general)
   * Color matrix multiply
   * Histogram and MinMax
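
For readers unfamiliar with these pixel operations, the sketch below
shows what scale and bias, a table look-up, a histogram, and MinMax
each compute, using a tiny list of 8-bit values. It is plain Python
for illustration, not OpenGL code; the function names and the
choice of 8-bit integer values are assumptions of this example.

```python
def scale_and_bias(pixels, scale, bias):
    """Multiply each value by scale, add bias, clamp to 0..255."""
    return [max(0, min(255, int(round(p * scale + bias)))) for p in pixels]


def table_lookup(pixels, table):
    """Replace each value with the table entry it indexes."""
    return [table[p] for p in pixels]


def histogram(pixels, bins=4):
    """Count how many values fall into each equal-width bin."""
    counts = [0] * bins
    width = 256 // bins
    for p in pixels:
        counts[p // width] += 1
    return counts


def min_max(pixels):
    """Smallest and largest value present."""
    return min(pixels), max(pixels)


pixels = [10, 60, 120, 200]
stretched = scale_and_bias(pixels, scale=2, bias=-20)   # contrast stretch
inverted = table_lookup(stretched, [255 - i for i in range(256)])

print(stretched, inverted, histogram(inverted), min_max(inverted))
```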

The support of these operations should promote interesting
applications, with real-time feedback (attributable to the
performance increase), in the fields of medical imaging, GIS, and
post production. Moreover, the support of a common API (OpenGL)
enables applications to run across the product line, with
performance gains associated with the platform on which the
applications are running.


DMcolor and OpenGL Color Matrix Extensions

With O2, you can use OpenGL hardware to perform color transforms.
In addition, the DMcolor API can set up transform matrices that the
application can pass to OpenGL. libdmedia also includes a software
image color space conversion engine.


Video Library and DMbuffers

The system has new Video Library (VL) calls for receiving video
data (fields or pairs of fields interleaved to form frames) into
DMbuffers, and for sending video data using DMbuffers. In
addition, the video I/O path can handle mipmap generation for live
video. The older VLbuffer interface is still supported as well.
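
"Pairs of fields interleaved to form frames" means that the lines of
two fields alternate in the resulting frame. A minimal sketch, with
an invented function name and with each line represented by a
label; which field supplies the top line depends on the video
format and is an assumption here:

```python
def interleave_fields(first_field, second_field):
    """Build a frame by alternating lines from two equal-length fields."""
    frame = []
    for first_line, second_line in zip(first_field, second_field):
        frame.append(first_line)
        frame.append(second_line)
    return frame


field_a = ["line 0", "line 2", "line 4"]
field_b = ["line 1", "line 3", "line 5"]
print(interleave_fields(field_a, field_b))
```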


Audio Library Enhancements

Starting with IRIX 6.3, the Audio Library (AL) is packaged as a
DSO rather than as a static library. The Audio Library adds a
number of new functions and features; however, the 6.3 version of
the library is backward-compatible with previous releases.

New features in 6.3 include:

   * The ability to support multiple audio I/O devices in a single
     system.

   * Support for the O2 workstation's ability to lock audio and
     video sample rates together in hardware to prevent drift
     during synchronized audio/video recording or playback.
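
To see why locking the audio and video rates matters, consider how
quickly two free-running clocks drift apart. The figures below are
illustrative assumptions (a 50 parts-per-million mismatch and a
one-hour recording), not measurements of any SGI system.

```python
def drift_seconds(duration_s, clock_error_ppm):
    """Accumulated offset between two clocks that differ by a ppm error."""
    return duration_s * clock_error_ppm / 1_000_000


one_hour = 3600
drift = drift_seconds(one_hour, clock_error_ppm=50)
frames_at_25fps = drift * 25

print(f"{drift:.2f} s of drift, about {frames_at_25fps:.1f} frames at 25 fps")
```

Under these assumptions the audio would end up several frames out of
step with the picture, which is the kind of error a hardware lock
avoids.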

In addition, IRIX 6.3 introduces a new, generalized version of the
Audio Control Panel, which can automatically configure itself when
you add audio I/O devices to the system.


High-Resolution Timer for Synchronizing Audio and Video Streams

The O2 workstation includes audio/video hardware support for
Silicon Graphics' high-resolution digital media timer, the
unadjusted system time (UST) clock.

UST provides a common time base for timestamping audio samples and
video fields as they enter or leave the system through the
audio/video I/O ports. AL and VL each support timestamps based on
the UST clock. Applications can use this common time base to
correlate and synchronize audio and video input/output
streams. Refer to the man pages alGetFrameTime(3dm) and
vlGetUSTMSCPair(3dm) for more information.
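
The arithmetic behind this kind of synchronization can be sketched
as follows. Given one (UST, counter) reference pair for video and
one for audio, an application can work out which audio sample frame
lines up with a given video field. The functions below are
hypothetical helpers written for this example and are not AL or VL
calls; the sketch also assumes that UST counts nanoseconds and that
both rates are constant.

```python
from fractions import Fraction

NANOS_PER_SECOND = 1_000_000_000  # assumption: UST is in nanoseconds


def ust_of_field(msc, ref_ust, ref_msc, field_rate):
    """Extrapolate the UST of video field number `msc` from a reference pair."""
    elapsed = (msc - ref_msc) * NANOS_PER_SECOND / Fraction(field_rate)
    return ref_ust + int(elapsed)


def audio_frame_at(ust, ref_ust, ref_frame, sample_rate):
    """Extrapolate the audio sample frame number at a given UST."""
    return ref_frame + (ust - ref_ust) * sample_rate // NANOS_PER_SECOND


# Illustrative reference pairs: (UST, field count) and (UST, sample frame).
video_ref = (1_000_000_000, 100)
audio_ref = (1_000_000_000, 48_000)

field_ust = ust_of_field(150, *video_ref, field_rate=50)
matching_frame = audio_frame_at(field_ust, *audio_ref, sample_rate=48_000)
print(field_ust, matching_frame)
```

Because both streams are stamped against the same clock, no
knowledge of either device's own clock is needed to line them up.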

The O2 architecture makes the high-resolution UST clock visible to
PCI option cards as well as to the audio/video subsystems that are
standard on the system.


Movie Library Enhancements

Starting with IRIX 6.3, the Movie Library is packaged as a pair of
DSOs rather than as a single static library. The Movie Library API
is backward-compatible with previous IRIX releases. The two DSOs
are:

   * Movie file library (libmoviefile.so) deals with movie file
     reading, writing and editing. This DSO includes the functions
     defined in the public header .

   * Movie playback library (libmovieplay.so) provides high-level
     functions for movie playback with synchronized sound and
     images. This DSO includes the functions defined in the public
     header .

The IRIX 6.3 version of the Movie Library offers the following new
features:

   * Support for Indeo encoding and writing AVI files

   * Support for creating MPEG-1 video and systems bitstreams
     through the movie file library interface

   * Support for full-rate, full-resolution motion JPEG playback
     with synchronized audio by using the real-time JPEG decode
     capabilities of the O2 ICE processor

   * Ability to take advantage of the OpenGL extensions for
     rendering interlaced image data and YCrCb image data on O2


New Audio Conversion API

The Digital Media Library (libdmedia) that's included with IRIX 6.3
features a new digital media audio conversion library (dmAC). You
use this low-level API for memory-to-memory audio sample format
conversion, sample rate conversion, and compression/decompression.

dmAC supports these audio conversion operations:

   * Sample data format conversion (signed, unsigned, float,
     double, scaling)

   * Sample rate conversion (several algorithms)
   * Channel conversion (mono, stereo, 4-channel, and so on)
   * Compression/decompression
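
The format and channel conversions in the list above are simple
per-sample operations. The sketch below shows typical versions of a
few of them; the function names and the particular scaling
conventions are assumptions of this example, not a description of
how dmAC itself is implemented.

```python
def u8_to_s16(sample):
    """Unsigned 8-bit (0..255) to signed 16-bit (-32768..32767)."""
    return (sample - 128) * 256


def s16_to_float(sample):
    """Signed 16-bit to floating point in the range -1.0 to just under 1.0."""
    return sample / 32768.0


def float_to_s16(value):
    """Floating point back to signed 16-bit, clamping out-of-range input."""
    return max(-32768, min(32767, int(round(value * 32768.0))))


def stereo_to_mono(frames):
    """Average the left and right sample of each stereo frame."""
    return [(left + right) // 2 for left, right in frames]


print(u8_to_s16(255), s16_to_float(-32768), float_to_s16(1.0))
print(stereo_to_mono([(100, 300), (-4, 4)]))
```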

IRIX 6.3 supports the following audio compression algorithms:

   * CCITT G.711 mu-law and A-law
   * CCITT G.722
   * CCITT G.726 at 16, 24, 32, and 40 kbit/s
   * CCITT G.728
   * GSM
   * Intel DVI ADPCM
   * MPEG audio

All of the audio compression/decompression and conversion
algorithms are implemented in software. No special option hardware
is required to perform these conversions.
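
As an example of what one of these software codecs does, here is a
sketch of G.711 mu-law encoding and decoding, which reduces each
16-bit sample to 8 bits. It follows the widely used bias-and-segment
formulation of the algorithm and is written for illustration; it is
not taken from the SGI implementation.

```python
BIAS = 0x84    # 132, added before the logarithmic segment is chosen
CLIP = 32635   # largest magnitude that still fits after biasing


def linear_to_ulaw(sample):
    """Encode one signed 16-bit sample as an 8-bit mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # The segment number is the position of the highest set bit (7..14).
    exponent = (magnitude >> 7).bit_length() - 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # mu-law bytes are stored with all bits inverted.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def ulaw_to_linear(byte):
    """Decode one 8-bit mu-law byte to a signed 16-bit sample."""
    byte = ~byte & 0xFF
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude


for value in (0, 1000, -1000, 32767):
    encoded = linear_to_ulaw(value)
    print(value, hex(encoded), ulaw_to_linear(encoded))
```

Small values are reproduced closely and large values more coarsely,
which is the intended trade-off of a logarithmic code.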

The new dmAC routines are declared in the public header
.

Starting with IRIX 6.3, MPEG audio encoding is bundled with all
systems and no longer requires a license from Silicon Graphics.


Audio File Library Enhancements

The new version of the Audio File Library (libaudiofile) included
in IRIX 6.3 offers support for several additional sound file
formats:

   * Amiga IFF/8SVX
   * SampleVision
   * Audio Visual Research
   * Creative Labs VOC
   * Creative Labs SoundFont2

The library now offers transparent sample rate conversion in addition
to transparent sample format conversion and
compression/decompression. You can specify a virtual sample rate from
within your application; for example, 48 kHz. The application can
open sound files that contain data sampled at a variety of rates, and
the library automatically converts between the sample rates used in
the sound files (such as 44.1 kHz, 32 kHz, or 16 kHz) and the virtual
sample rate that the application requests.
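
The idea of a virtual sample rate can be illustrated with a very
simple converter. The sketch below resamples by linear
interpolation, which is an assumption made for brevity; the text
does not say which algorithms the library uses, and production
converters normally apply proper filtering. The function name is
invented for this example.

```python
from fractions import Fraction


def resample_linear(samples, src_rate, dst_rate):
    """Resample by linear interpolation between neighbouring samples."""
    if len(samples) < 2:
        return [float(s) for s in samples]
    step = Fraction(src_rate, dst_rate)  # source samples per output sample
    last = len(samples) - 1
    out = []
    pos = Fraction(0)
    while pos <= last:
        i = int(pos)          # index of the sample at or before pos
        frac = pos - i        # how far pos lies towards the next sample
        if i == last:
            out.append(float(samples[i]))
        else:
            out.append(float(samples[i] + (samples[i + 1] - samples[i]) * frac))
        pos += step
    return out


# A file sampled at 44.1 kHz presented at a virtual rate of 48 kHz.
ramp = list(range(148))
converted = resample_linear(ramp, 44100, 48000)
print(len(ramp), "->", len(converted))
```

The ratio 48000/44100 reduces to 160/147, so every 147 input
intervals become 160 output intervals.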
