[Future Technology Research Index] [SGI Tech/Advice Index] [Nintendo64 Tech Info Index]


[WhatsNew] [P.I.] [Indigo] [Indy] [O2] [Indigo2] [Crimson] [Challenge] [Onyx] [Octane] [Origin] [Onyx2]

Ian's SGI Depot: FOR SALE! SGI Systems, Parts, Spares and Upgrades

(check my current auctions!)

195MHz R10000 Performance Comparison Between O2,
Indigo2, Octane, Origin200, Origin2000 and Power Challenge

Last Change: 09/Aug/1998

SPEC's Introduction to SPEC95

SPECfp95 Analysis

SPECint95 Analysis


(Note: the 2D bar graphs shown on this page for the various SPEC95 tests have been drawn to the same scale)
(the graphs are also to the same scale as those given on the 250MHz R10000 comparison page)

195MHz R10000 SPECfp95 Performance Comparison

Introductory Notes:


Objectives

I compared the test results for the R10000 on various SGI systems. The goal was to discover how the same processor (in this case the 195MHz R10000) behaves on the different SGI systems that support it. It's interesting to see which systems benefit from a larger L2 cache, which are improved by a new architectural design or a faster bus, etc., and which hardly vary at all - ie. systems which would perhaps be faster simply with a higher clock speed, as opposed to higher memory I/O, L2 cache size, etc.

To aid visualisation, I've constructed a 3D Inventor model of the data; screenshots of this are included below. You can download the 3D model (1232 bytes gzipped) if you wish: load the file into SceneViewer or ivview and switch into Orthographic mode (ie. no perspective). Rotate the object 30 degrees horizontally and then 30 degrees vertically (use Roty and Rotx thumbwheels) - that'll give you the standard isometric view. I actually found slightly smaller angles make things a little clearer (15 or 20 degrees), so feel free to experiment. Changing the direction of the headlight can also help. Note that newer versions of popular browsers may be able to load and show the object directly, although I've found such browsers may not offer Orthographic viewing.

All source data for this analysis came from www.specbench.org.

Note: The R10000 used in the Origin200 for this analysis is 180MHz, not 195MHz (the latter is not available for Origin200 due to XIO timing issues). The figures for Origin200 given here have not been scaled by 195/180 to take account of this, so please bear in mind that the Origin200 results are for a lower-clocked R10000. If one does scale the Origin200 figures by 195/180 (just over 8%), the differences between Origin200 and the other systems lessen, but there is little point in comparing published results to something which is not available as a buyable system. Besides, scaling SPEC test results in a linear manner is not recommended.

Given below is a comparison table of the various R10000/195 SPECfp95 test results. Faster systems are leftmost in this table (in the Inventor graph, they're placed at the back). You may need to widen your browser window to view the complete table. After the table and 3D graphs is a short-cut index to the original results pages for the various systems.

Key:

O200 = Origin200 O2000 = Origin2000 PChall = Power Challenge I2 = Indigo2

System:   O2000    Octane    O200    PChall   PChall    I2       O2
L2:        4MB      1MB      1MB      2MB      1MB      1MB      1MB

tomcatv    26.9     25.3     22.2     16.7     16.1     12.1     9.78
swim       41.2     40.6     34.5     23.9     23.9     17.1     13.9
su2cor     11.5     9.64     8.47     8.74     7.35     6.40     4.72
hydro2d    12.6     9.97     7.99     6.36     5.03     4.01     3.17
mgrid      18.8     15.9     14.8     11.4     10.4     8.60     6.95
applu      11.7     11.2     11.0     8.69     8.54     7.44     5.92
turb3d     15.3     13.8     14.3     11.3     11.2     10.3     9.57
apsi       15.6     12.8     11.9     15.5     10.3     9.60     9.77
fpppp      29.6     29.7     28.3     31.1     31.3     31.4     29.3
wave5      25.5     22.4     20.3     21.3     18.4     17.3     11.8

          SPECfp95 Comparison Table for MIPS R10000 195MHz

[Left Isometric View] [Right Isometric View]

(click on the images above to download larger versions of the views shown)

[Test Suite Description | O2000 | Octane | O200 | PChall 2MB L2 | PChall 1MB L2 | Indigo2 | O2]


Next, a separate comparison graph for each of the ten SPECfp95 tests:

tomcatv:

tomcatv comparison graph

swim:

swim comparison graph

su2cor:

su2cor comparison graph

hydro2d:

hydro2d comparison graph

mgrid:

mgrid comparison graph

applu:

applu comparison graph

turb3d:

turb3d comparison graph

apsi:

apsi comparison graph

fpppp:

fpppp comparison graph

wave5:

wave5 comparison graph

Note: for Origin200, the peak apsi SPEC ratio (11.9) is less than the base apsi SPEC ratio (12.6) shown in the submitted results on www.specbench.org. This is apparently due to a mistake in the flags used for the peak test run.

Observations

These are easier to spot from the graphs, which is why I made them in the first place:

Of most note are the performance jumps from Indigo2/IMPACT to Origin200. Even though the Origin200 has a 180MHz R10000 whilst the Indigo2/IMPACT has a 195MHz R10000, many of the tests run much faster on the Origin200, by a factor of 2 in several cases. Such tests are clearly benefiting from the much higher memory bandwidth (Origin200's STREAM result is 5 times higher than Indigo2/IMPACT's), lower memory latency (Origin200's memory latency is about 40% better than Indigo2's) and an additional factor that is discussed below. As data set size increases, the improvement seen with Origin200 over Indigo2/IMPACT will increase.

However, there is a confusing factor in all this which makes it difficult to come to precise conclusions as to why some tests perform in a particular way on a certain system:

Cache accessing is obviously an important factor when looking at performance. I asked John for a summary description of the way cache systems operate and the relevant issues; he said:

"Any time you execute a Load instruction, the CPU checks in the caches to see if a copy of the data is handy. If the data is in the primary cache, the CPU proceeds at full speed, and the data can be used two cycles after the load (on MIPS processors, anyway). If there is not a copy of the data in the primary cache, the secondary cache is checked. If there is a copy in the secondary cache, it takes about 9-12 cycles (on the R10000) to get the data. If there is no copy of the data in the secondary cache, it takes about 65 cycles to get the data.

If the CPU continues executing other instructions after the Load that missed the primary cache, you have what is called a "non-blocking" cache. Some systems allow more loads to occur while the cache miss of the first load is outstanding, and will continue as long as all the subsequent loads hit in the cache. This is called a "hit under miss" cache. Typically a "hit under miss" cache will stall the CPU when a second cache miss occurs, i.e. the CPU will sit there doing nothing at all until the data for one of the loads gets returned either by the secondary cache or by the memory controller.

More sophisticated processors allow multiple outstanding misses, typically with different numbers of misses allowed at each level of the memory hierarchy. On the R10000, you can have 4 loads miss the cache and still continue executing instructions. The information about these outstanding misses sits in a buffer on the CPU that matches the data coming in from the secondary cache or main memory controller with the corresponding instruction."


Apparently, PowerPC 604/604e systems operate a "hit under miss" cache system.
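The latencies John describes can be made directly visible with a classic pointer-chasing loop: chase a randomly-ordered ring of indices so that each load depends on the previous one, defeating both prefetch and the R10000's ability to overlap outstanding misses. This is my own illustrative sketch, not code from the original tests:

```c
#include <stdlib.h>

/* Link n slots into one random ring: next[i] gives the slot after i.
   The random order defeats hardware prefetch, so each load pays the
   full latency of whichever level (L1, L2 or main memory) holds it. */
void build_ring(size_t *next, size_t n)
{
    size_t *order = malloc(n * sizeof *order);
    for (size_t i = 0; i < n; i++) order[i] = i;
    for (size_t i = n - 1; i > 0; i--) {          /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        next[order[i]] = order[(i + 1) % n];
    free(order);
}

/* Follow the ring for 'steps' loads.  Each next[i] depends on the
   previous load, so even a non-blocking cache with multiple
   outstanding misses cannot overlap them. */
size_t chase(const size_t *next, size_t start, size_t steps)
{
    size_t i = start;
    while (steps--) i = next[i];
    return i;
}
```

Timing chase() for ring sizes that fit in L1, that fit in L2 (1MB, 2MB or 4MB depending on the system) and that exceed L2 exposes the roughly 2-cycle, 9-12 cycle and 65-cycle levels John mentions.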

I asked John for some examples of the types of code which are sensitive to cache-access issues; he said:


Obviously, the precise reasons why a piece of code performs in a particular way can be quite complex.


Summary.

I've done this comparison because people still rely heavily on SPEC95 results when making purchasing and upgrade decisions, as I discovered when helping a colleague buy a new number-crunching system - he eventually ordered an Origin200 (full details given below). Thus, I hope my 3D graphs can help people understand a little more about how SPECfp95 actually behaves on different systems that use identical processors, the Origin200/180 notwithstanding.

What this shows, once again, is that it's important not to rely on final SPEC averages. More than anything, get your code physically tested on the target system if possible before making a purchasing decision.

Finally, please also read my page on what I call performance difference profiling - a technique that I hope may aid those who are involved in making upgrade decisions.


An Example R10000 Performance Comparison to Indigo2/R4400 and Indy/R5000

A colleague from the University of Central Lancashire's Fire and Explosion Studies Centre asked me for advice on upgrading from Indy/Indigo2 (using R4600, R5000 and R4400) to Indigo2/R10000 or perhaps Origin200.

I arranged for SGI UK to bring along an Indigo2 R10000 195MHz Solid IMPACT. The Centre then conducted tests on the R10000 Indigo2 to compare its performance against their existing 133MHz R4600 Indys and a 200MHz R4400 Indigo2. All the tests involved Fortran77 code using the MIPS Pro 7.0 compilers, performing 64bit floating point calculations. These were pure time-based number crunching tests, ie. no screen output or disk operations were involved.

The results are as follows (factor speedups, ie. how much faster the Indigo2 R10000/195 was compared to the target system):-

Standard problem, 50x50 grid, 100 steps:

11.4 times faster than Indy R5000SC/150

Larger Problem, 150x150 grid, 100 steps:

5.8 times faster than Indigo2 R4400/200
11.4 times faster than Indy R5000SC/150

Standard Problem, 51x51 grid, 100 steps:

2 times faster than Indigo2 R4400/200

Larger Problem, 101x101 grid, 10 steps:

2.6 times faster than Indigo2 R4400/200
3.2 times faster than Indy R5000SC/150


Other test results; each factor is how much faster the R10000/195 Indigo2 was compared to the R4400/200 Indigo2:

Pseudospectral algorithm with Fast Fourier Transform:

[l = 2048]:  5.39 times faster
[l = 32768]: 3.41 times faster

Finite-Difference algorithm in decomposed geometry:

[110x10/50x10/50x10]:       2.13 times faster
[1100x100/500x100/500x100]: 3.10 times faster

Eigenvalues, LAPACK:

300x300 DGEEV (ordinary):      3.79 times faster
300x300 DGEGV (generalised):   2.89 times faster
1000x1000 DGEEV (ordinary):    2.83 times faster
1000x1000 DGEGV (generalised): 2.35 times faster


195MHz R10000 SPECint95 Performance Comparison

The systems examined were the same as for the SPECfp95 comparison given above, including the Origin200 using a 180MHz R10000, so please bear this difference in mind when examining the data. Just as above, you can download a 3D performance graph (gzipped) if you wish: load the file into SceneViewer or ivview and switch into Orthographic mode (ie. no perspective), etc.

The rationale and method for this examination were the same as for SPECfp95. Thus, given below is a comparison table of the various R10000/195 SPECint95 test results. You may need to widen your browser window to view the complete table. After the table and 3D graphs is a short-cut index to the original results pages for the various systems.

Key:

O200 = Origin200 O2000 = Origin2000 PChall = Power Challenge I2 = Indigo2

System:   O2000    Octane    O200    PChall   PChall    I2      O2
L2:        4MB      1MB      1MB      2MB      1MB      1MB     1MB

go:        11.4     11.4     10.5     10.0     10.0     10.0    11.0
m88ksim:   11.3     11.3     10.4     9.15     9.18     9.14    11.1
gcc:       10.4     10.1     9.26     8.25     7.87     7.91    9.02
compress:  11.3     11.3     10.4     10.0     10.0     10.1    10.6
li:        9.57     9.59     8.79     7.79     7.85     7.87    9.42
ijpeg:     10.2     10.1     9.26     8.23     8.29     8.20    9.35
perl:      13.3     13.0     12.1     9.42     9.27     10.3    13.0
vortex:    14.4     11.2     12.4     8.25     7.86     7.97    8.20

         SPECint95 Comparison Table for MIPS R10000 195MHz

[Left Isometric View] [Right Isometric View]

(click on the images above to download larger versions of the views shown)

[Test Suite Description | O2000 | Octane | O200 | PChall 2MB L2 | PChall 1MB L2 | Indigo2 | O2]


Next, a separate comparison graph for each of the eight SPECint95 tests:

go:

go comparison graph

m88ksim:

m88ksim comparison graph

gcc:

gcc comparison graph

compress:

compress comparison graph

li:

li comparison graph

ijpeg:

ijpeg comparison graph

perl:

perl comparison graph

vortex:

vortex comparison graph

It is immediately obvious that the behaviour of these tests is very different from the SPECfp95 suite. Firstly, there is a much lower variance for all the systems. Other observations are:

Vortex is an interesting case. Though O2 does well on all tests, matching Origin and beating older systems handsomely most of the time, vortex is the one test where O2 does not do as well as Origin-based systems. Stranger still, Octane does not do as well as Origin200 for vortex. It's a shame that vortex is the only test in SPECint95 which gives rise to this behaviour; if one's code happens to be like vortex, ascertaining that fact may not be easy. I asked John about vortex; he replied:

"vortex in SPECint95 has a lot of trouble with TLB misses, as its memory access patterns cover a large memory space. The lower latency of the main memory system (for getting the new TLB entries) helps the newer machines a fair amount relative to the old machines. Somewhere along the lines we made an O/S modification that allowed this code to use large pages and reduce the TLB miss rate -- this modification would not be reflected in the results on the older machines, but would help them if we went back and ran the tests again."
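The TLB effect John describes can be sketched with a loop that touches one byte per page across a large buffer: every access needs a distinct TLB entry, so once the buffer spans more pages than the TLB can map at once, nearly every access takes a TLB miss; larger pages cover the same buffer with far fewer entries, which is the point of the O/S change he mentions. The 4096-byte page size below is an assumption for illustration only:

```c
#include <stddef.h>

#define PAGE_BYTES 4096  /* assumed base page size, illustration only */

/* Touch one byte per page.  Small pages: one TLB entry consumed per
   access, so a large buffer thrashes the TLB.  Large pages: the same
   buffer needs far fewer entries, cutting the miss rate - the effect
   described above for vortex. */
long touch_one_byte_per_page(const char *buf, size_t bytes)
{
    long sum = 0;
    for (size_t off = 0; off < bytes; off += PAGE_BYTES)
        sum += buf[off];
    return sum;
}
```
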

Note that SGI does not have the time to rerun tests on older systems. Running the base tests is trivial, but it isn't easy to gather together people who can work on finding the best optimising compiler options for the peak tests. SGI is probably busy enough as it is testing newer CPUs on the current systems.

The SPECfp95 discussion above refers to the absence of the R10K's outstanding cache-miss feature on older SGI systems and O2. However, for SPECint95, this appears to be much less of an issue. vortex might be an exception but it's difficult to tell without detailed knowledge of how vortex works. John's view on this was:

"Outstanding cache misses are not relevant on SPECint95 because almost no secondary cache misses occur! With default page sizes, only vortex and gcc seem to show any benefit from going from 1 MB to 4 MB caches."

Asking John why no cache misses were happening, his response was:

"The jobs are working on small data sets."

The next question had to be, what kind of tasks do use large data sets? John's reply was:

"Database stuff, large cpu simulators, integer programming/optimization (like travelling salesman problems, airline scheduling, etc.)"

Finally, given that the 2D graphs above are to the same scale, it's very clear that the R10000 shows a much larger floating point (fp) advantage over the SPEC reference system than an integer (int) advantage and also a greater variance for fp results. However, the problem with this observation is that one has no way of knowing how 'good' the original reference system was for int vs. fp work.

In other words, one has no way of knowing whether the int and fp tests were 'equally' difficult for the reference system. It would perhaps have been better if the tests had been tailored so that the variance in reference times between the int and fp tests was similar: SPEC95's reference times vary between 1400 and 9600 for the fp tests, but only between 1700 and 4600 for the int tests - a comparable spread for both suites would have been preferable. Also, for SPECfp95, there is an odd correlation between tests with high reference times and high final target-system ratios (swim and fpppp); is this because the reference system ran them rather slowly, or because SPEC made the tests more complex to slow them down? Just a coincidence? Or were some tests simply not tough enough in the first place? Or maybe they were, and the R10000 is just very good at that kind of work anyway? It is difficult to know what to conclude.
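Bear in mind that each per-test SPEC ratio is simply the reference machine's run time divided by the measured run time, which is why a test's reference time and its final ratio are not independent. A trivial sketch (the times paired here are hypothetical, not published figures):

```c
/* SPEC ratio = reference run time / measured run time.  A test with a
   large reference time can yield a large ratio either because the
   reference system was unusually slow on it, or because the workload
   happens to suit the tested CPU - the ratio alone cannot say which. */
double spec_ratio(double ref_seconds, double measured_seconds)
{
    return ref_seconds / measured_seconds;
}
```
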

Clearly, SPEC95 must be examined in detail to gain any genuinely useful information.

