Unfortunately, it may be difficult or even impossible to identify which test is similar to one's own code, especially since the source code isn't readily available. As a decision maker, one could make a serious mistake by basing a decision on apparent code similarities when in fact the data set types or code behaviour were different, rendering one's assumptions useless. Surely there must be a more empirical way of using the SPEC figures to find the appropriate test to look at?
In an attempt to offer one possible solution, this study presents an approach that may be useful to those who have access to several different SGI systems, even if this means asking another company or institution (or perhaps SGI) to run some tests on one's behalf.
Each test has what I call a performance profile: the set of relative performance factors it shows across a range of systems, measured against a common baseline.
Taking Indigo2 performance as a baseline, here is a relative SPECfp95 performance table for the other systems compared to Indigo2:
System:   O2000   Octane  O200    PChall  PChall  O2
L2:       4MB     1MB     1MB     2MB     1MB     1MB

tomcatv   2.22    2.09    1.84    1.38    1.33    0.81
swim      2.41    2.37    2.02    1.40    1.40    0.81
su2cor    1.80    1.51    1.32    1.37    1.15    0.74
hydro2d   3.14    2.49    1.99    1.59    1.25    0.79
mgrid     2.19    1.85    1.72    1.33    1.21    0.81
applu     1.57    1.51    1.48    1.17    1.15    0.80
turb3d    1.49    1.34    1.39    1.10    1.09    0.93
apsi      1.63    1.33    1.24    1.62    1.07    1.02
fpppp     0.95    0.95    0.90    0.99    1.00    0.93
wave5     1.47    1.30    1.17    1.23    1.06    0.68
One should bear in mind that these are relative performance factors compared to a 195MHz R10000 1MB L2 Indigo2, not absolute performance figures. This is why some numbers are less than 1.0: the Origin200 has a 180MHz CPU instead of 195MHz, and compiler alterations can make certain tests compile in a slightly different way, affecting the figures (it is possible for a base result to increase while the peak result decreases, ie. better results for those who use default flags).
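To make the derivation of these factors concrete: each entry in the table is simply a system's absolute SPECfp95 result for a test divided by the Indigo2 result for the same test. The absolute numbers in this sketch are invented purely for illustration, not real SPEC results:

```python
# Each relative factor in the table is (system result) / (Indigo2 result)
# for the same test. These absolute figures are made up for illustration.
indigo2_result = 10.0   # hypothetical Indigo2 SPECfp95 ratio for some test
o2_result = 8.1         # hypothetical O2 result for the same test

relative_factor = o2_result / indigo2_result
print(round(relative_factor, 2))  # prints 0.81, ie. slower than Indigo2
```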
So, how does 'performance difference profiling' work?
Suppose one was considering upgrading from a 1MB L2 Power Challenge to an Origin2000. Normally, one would be expected to 'guess' which SPECfp95 test is most like one's own application code. Thus, for example, one might think that 'applu' was most similar and so estimate a raw improvement of roughly 37% when upgrading (1.57 vs 1.15 in the table above, assuming the CPUs are the same). However, using performance difference profiles, it is possible to get a better insight into the nature of one's code.
If one has access to an Indigo2, one should run the code on both machines (the Power Challenge and the Indigo2), and on a 2MB L2 Power Challenge too if possible. Noting the performance differences between the Power Challenges and the Indigo2, one then looks at the table above to see which SPECfp95 test has a performance difference profile most similar to one's test results (ie. absolute performance is not an issue here). That should give a much better indication of whether an upgrade will help.
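As a sketch of how such a comparison might be automated: the profile for each test below is the pair of Power Challenge columns from the table above, and the matching rule (smallest sum of squared differences) is my own arbitrary choice, not anything defined by SPEC:

```python
# Relative SPECfp95 performance versus a 195MHz R10000 1MB L2 Indigo2,
# taken from the table above. Each profile is the pair
# (2MB L2 Power Challenge, 1MB L2 Power Challenge).
PROFILES = {
    "tomcatv": (1.38, 1.33),
    "swim":    (1.40, 1.40),
    "su2cor":  (1.37, 1.15),
    "hydro2d": (1.59, 1.25),
    "mgrid":   (1.33, 1.21),
    "applu":   (1.17, 1.15),
    "turb3d":  (1.10, 1.09),
    "apsi":    (1.62, 1.07),
    "fpppp":   (0.99, 1.00),
    "wave5":   (1.23, 1.06),
}

def closest_test(measured):
    """Return the SPECfp95 test whose performance difference profile is
    nearest to 'measured', a (2MB PChall, 1MB PChall) pair of speedups
    over Indigo2 for one's own application code."""
    def distance(test):
        return sum((m - p) ** 2 for m, p in zip(measured, PROFILES[test]))
    return min(PROFILES, key=distance)

# A code that is 62% faster on the 2MB model and 7% faster on the 1MB
# model has exactly apsi's profile:
print(closest_test((1.62, 1.07)))  # prints "apsi"
```

More systems could be added as extra columns in each tuple; the matching function would not need to change.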
Consider hydro2d and apsi.
Looking at the apsi test, compared to Indigo2, the 2MB L2 Power Challenge gives a 52% bigger speedup than the 1MB L2 Power Challenge. So one might think that L2 cache is all important and thus a 4MB L2 Origin would really fly. But the 1MB L2 Power Challenge result is almost identical to the base Indigo2 result (only 7% quicker).
On the other hand, for hydro2d, the 2MB L2 Power Challenge gives a smaller speedup than the 1MB L2 Power Challenge over Indigo2 (27% extra instead of 52%), showing that L2 obviously matters, but just as important is that the 1MB L2 Power Challenge does 25% better than Indigo2 for hydro2d. This shows that memory bandwidth and memory latency may also be important (plus other factors such as support for multiple outstanding cache misses), suggesting that an Origin2000 might do a lot better than one would otherwise believe, because all these factors are much better with Origin2000. In fact, hydro2d on Origin2000 turns out to be 3.14 times as fast as on Indigo2!
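The arithmetic behind this comparison can be checked directly from the table (figures copied from above; small rounding differences from the quoted percentages are expected):

```python
# Extra speedup the 2MB L2 Power Challenge gives over the 1MB L2 model,
# using the relative factors from the table above.
apsi_2mb, apsi_1mb = 1.62, 1.07
hydro2d_2mb, hydro2d_1mb = 1.59, 1.25

apsi_extra = (apsi_2mb / apsi_1mb - 1) * 100          # ~51% extra for apsi
hydro2d_extra = (hydro2d_2mb / hydro2d_1mb - 1) * 100 # ~27% extra for hydro2d
print(round(apsi_extra), round(hydro2d_extra))  # prints "51 27"
```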
Thus, a smaller performance difference between Power Challenge models may in fact mean a larger overall gain from an upgrade to Origin2000, depending on how the relative performances compare to an Indigo2; ie. one can use performance difference information instead of just absolute performance information to aid in decision making.
So, suppose one thought that one's application code was most like apsi, but in tests the relevant performance difference factors compared to Indigo2 turned out to be 1.4 for a 2MB L2 Power Challenge and 1.2 for a 1MB L2 Power Challenge. This suggests that even though the code may look like apsi, it's actually behaving like hydro2d or mgrid. In other words, this technique enables one to empirically test one's assumptions about which SPECfp95 test most resembles one's application code.
Another example: suppose one has a Challenge DM R4400/250MHz and an Indigo2 R4400/250MHz. Should one upgrade to Origin2000? Suppose one runs tests and the performance difference profile is similar to fpppp, a test which barely varies between Indigo2 and either type of Power Challenge. With no real difference between the three systems, one could confidently state that a machine upgrade would be pointless unless the CPU itself was actually running at a significantly faster clock or was a better design. In this case a simple CPU upgrade might be better if at all possible, taking budget constraints into account: if the option is cheaper, one would recommend keeping the same system and upgrading to R10000, not moving to a different machine (Origin) with R10000.
For the curious, fpppp behaves in the way it does because it's a tiny data set that fits completely into L2 cache (I reckon most people wouldn't see such behaviour, but some will).
There is, however, a final possibility: what if one sees a performance difference profile that is nothing like any of the SPECfp95 tests? Such an observation would suggest that one's application code doesn't behave like any SPECfp95 test, in which case SPEC may not be a useful or relevant aid to decision making at all. This is a possibility that never seems to occur to most people when discussing SPEC95. SPECint95 can be worse for this since one of the SPECint95 tests is JPEG compression, a function which many modern systems, eg. O2, can do in real-time via accelerated hardware.
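This 'no match' case can be detected with the same kind of profile comparison: if even the closest SPECfp95 profile is far from one's measured profile, SPEC figures are probably not a relevant guide at all. The profiles below are the two Power Challenge columns from the table above; the cutoff value is an arbitrary assumption of mine, not a SPEC figure:

```python
# Sketch of detecting the 'no match' case: return the best-matching
# SPECfp95 test, or None if nothing is close enough to be meaningful.
# Profiles are (2MB L2 PChall, 1MB L2 PChall) factors from the table.
PROFILES = {
    "tomcatv": (1.38, 1.33), "swim":  (1.40, 1.40), "su2cor": (1.37, 1.15),
    "hydro2d": (1.59, 1.25), "mgrid": (1.33, 1.21), "applu":  (1.17, 1.15),
    "turb3d":  (1.10, 1.09), "apsi":  (1.62, 1.07), "fpppp":  (0.99, 1.00),
    "wave5":   (1.23, 1.06),
}

def match_or_none(measured, cutoff=0.05):
    """Best-matching test by sum of squared differences, or None if the
    nearest profile is still further away than 'cutoff' (arbitrary)."""
    def dist(profile):
        return sum((m - p) ** 2 for m, p in zip(measured, profile))
    best = min(PROFILES, key=lambda t: dist(PROFILES[t]))
    return best if dist(PROFILES[best]) <= cutoff else None

print(match_or_none((1.59, 1.25)))  # prints "hydro2d"
print(match_or_none((3.0, 0.5)))    # prints "None": unlike any SPEC test
```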
Traditionally, people use absolute SPEC95 figures to judge performance levels between vendors, but I believe the available data for a particular vendor can also be used to help make decisions about upgrades in a manner that is somewhat more empirical than trying to guess which SPEC test is most similar to one's own application code.
Hence, when considering upgrades, one should examine the relative performance differences across a vendor's product line for those cases which involve: