SGI Performance Comparisons

Visual Effects with Flame 9.5.14

Last Change: 27/Jul/2010

With the aid of a friend who has some original Discreet Flame systems, plus a couple of original systems of my own (and assistance from a studio that has some as well), and some judicious parts swapping, I present here performance results comparing a range of SGIs running various Flame tests. Some tests are CPU intensive, others graphics intensive, and one taxes both CPU and graphics. The results are discussed after the main table.

All tests used HD video data (1920x1080, 8-bit). In each case the storage array offered plenty of bandwidth: 311MB/sec (53fps) on the Octane, 438MB/sec (74fps) on the Tezro. Alas I cannot yet include screenshots as the tests used commercial material, but suffice to say the clips were very much akin to the typical scenes in TV shows such as Bones, House, CSI, etc. The company that supplied the HD frames hopes to be able to give me some other data which I can make public at a later date.
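As a rough check on the frame rates quoted above, the sketch below converts array bandwidth into HD frames per second. It assumes 3 bytes per pixel (8-bit RGB, no alpha or padding) and that 1MB means 2^20 bytes; neither assumption is stated in the test setup, so treat the output as approximate.

```python
# Rough conversion from storage bandwidth to HD frames per second.
# Assumptions (not stated in the article): 3 bytes per pixel (8-bit RGB)
# and 1 MB = 2**20 bytes.

WIDTH, HEIGHT, BYTES_PER_PIXEL = 1920, 1080, 3
FRAME_BYTES = WIDTH * HEIGHT * BYTES_PER_PIXEL  # 6,220,800 bytes per frame


def frames_per_second(mb_per_sec: float) -> float:
    """Return how many uncompressed HD frames fit in the given bandwidth."""
    return mb_per_sec * 2**20 / FRAME_BYTES


if __name__ == "__main__":
    for system, bandwidth in (("Octane", 311), ("Tezro", 438)):
        fps = frames_per_second(bandwidth)
        print(f"{system}: {bandwidth}MB/sec is about {fps:.1f}fps")
```

Under these assumptions both figures land within one frame per second of the quoted values.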

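The tests are as follows:

  Test 1: Sparks (GLINT)
  Test 2: RGB Import
  Test 3: Targa Import
  Test 4: Colour Correction
  Test 5: Motion Blur
  Test 6: Geometry Rotation
  Test 7: Combined Test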

Here is the main table of results:


                                                Test 1  Test 2  Test 3  Test 4   Test 5   Test 6    Test 7
                                                Spark    RGB    Targa   Colour   Motion  Geometry  Combined
          CPU    CPU    L2   Num        Accum.  GLINT   Import  Import  Correct   Blur   Rotation    Test
System    Type  Clock  (MB)  CPUs  GFX  Buffer  mm:ss     ss     m:ss    m:ss     m:ss     m:ss     h:mm:ss

Tezro     R16K   1000   16    4    V12    SW    02:21     30     3:22    1:46     2:14     1:06     0:36:36
Tezro     R16K   1000   16    2    V12    SW    03:31     33     3:27    2:25     2:39     1:12     0:38:11
Tezro     R16K   1000   16    1    V12    SW    05:49     36     3:28    3:50     3:25     1:18     0:41:25

Tezro     R16K    700    4    4    V12    SW    03:03     39     4:45    2:16     2:27     1:14     0:36:42
Tezro     R16K    700    4    2    V12    SW    04:47     41     4:47    3:09     3:02     1:19     0:39:01
Tezro     R16K    700    4    1    V12    SW    08:35     45     4:57    4:56     4:12     1:26     0:43:55

Octane2   R14K    600    2    2    V12    SW    06:09     44     5:25    3:47     3:53     1:27     0:43:19
Octane2   R14K    600    2    2    V8     SW    06:11     46     5:27    3:36     3:59     1:33     0:50:37
Octane2   R14K    600    2    2    V10    SW    06:05     45     5:27    3:42     3:55     2:20     1:37:44
Octane2   R14K    600    2    2    V6     SW    06:10     45     5:25    3:50     4:03     2:27     1:45:38

Octane2   R12K    400    2    2    V12    SW    08:06     48     7:57    4:50     4:30     1:34     0:46:36
Octane2   R12K    400    2    2    V8     SW    08:06     49     7:57    4:51     4:34     1:40     0:54:35
Octane2   R12K    400    2    2    V10    SW    08:02     48     7:57    4:46     4:31     2:37     1:53:48
Octane2   R12K    400    2    2    V6     SW    08:03     48     7:57    4:52     4:40     2:45     1:58:24
Octane2   R12K    400    2    2    V12    HW    08:06     48     7:57    4:50     5:52     2:33     1:12:39

Octane2   R12K    350    1    2    V12    SW    07:37     49     9:03    5:19     5:07     1:42     0:50:31   [custom modded CPU]

Fuel      R14K    600    4    1    V12    SW      -       42     5:46    5:36     4:35     1:31     0:45:30

                                          Table 1. Main results.

Immediate observations: tests 2, 3, 4 and 5 very much depend on CPU speed. Test 6 depends heavily on graphics speed and available VRAM/TRAM. Test 7 involves both CPU and graphics processing, but the graphics being used is the key bottleneck, specifically the amount of VRAM/TRAM available.

Also, an extra set of dual-400 Octane2 V12 results was run to check whether setting the accumulation buffer to hardware mode was of any use (the row marked HW in the Accum. Buffer column); it was not, ie. tests 5, 6 and 7 ran much slower with a hardware accumulation buffer, so always set the accumulation buffer to software mode in xsetmon.


Conclusions

If you do a lot of Sparks processing, then a good spec 4-CPU Tezro will definitely give a substantial speedup over a dual-400 or dual-600 Octane2. However, for other tasks such as Motion Blur or Colour Correct, the speed improvement with better CPUs, or a greater number of CPUs, is not linear; there is a good gain from 1 to 2 CPUs, but less of a gain moving from 2 to 4 CPUs.
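The diminishing return from extra CPUs can be read straight off Table 1. The sketch below does that for the 1GHz Tezro rows; the helper names are purely illustrative, and speedup is taken simply as the ratio of the two times.

```python
# Stepwise speedup from adding CPUs, using the 1GHz Tezro rows of Table 1.
# Helper names are illustrative; speedup = slower time / faster time.


def to_seconds(stamp: str) -> int:
    """Convert 'ss', 'm:ss' or 'h:mm:ss' into seconds."""
    total = 0
    for part in stamp.split(":"):
        total = total * 60 + int(part)
    return total


# Times for 1, 2 and 4 CPUs (Tezro R16K 1000MHz, V12), copied from Table 1.
TEZRO_1GHZ = {
    "Colour Correct": {1: "3:50", 2: "2:25", 4: "1:46"},
    "Motion Blur": {1: "3:25", 2: "2:39", 4: "2:14"},
}


def step_speedup(times: dict, fewer: int, more: int) -> float:
    """Speedup when going from `fewer` CPUs to `more` CPUs."""
    return to_seconds(times[fewer]) / to_seconds(times[more])


if __name__ == "__main__":
    for test, times in TEZRO_1GHZ.items():
        print(f"{test}: 1 to 2 CPUs x{step_speedup(times, 1, 2):.2f}, "
              f"2 to 4 CPUs x{step_speedup(times, 2, 4):.2f}")
```

In both tests the second doubling gives a smaller factor than the first: roughly 1.59 then 1.37 for Colour Correct, and 1.29 then 1.19 for Motion Blur.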

The nasty surprise though is just how much of a bottleneck the V12 graphics is for running Flame: the performance difference for the Geometry Rotation test between a dual-400 Octane2 V12 and a quad-1GHz Tezro V12 is not that large, ie. this operation is highly graphics-bound. As a result, the Combined Test does not greatly benefit from better CPU power since the processing is being held up by the slowest aspect of the operation, namely the geometry rotation. Indeed, the quad-700 Tezro did the Combined Test in basically the same amount of time as the quad-1GHz Tezro: just a few seconds' difference, which is within the margin of error for these tests.

This situation is akin to Production Management theory of assembly line manufacture: a production process cannot complete any faster than the slowest task in a chain of dependent processes. In other words, the better CPU power in the Tezro configs will definitely be helping to speed up the motion blur and colour correction processing in the Combined Test, but the processing of every frame is held up waiting for the Geometry Rotation aspect to be completed, so for a combined task of this type it means there is little gain from having better CPUs. Any render involving multiple operations which employs an effect that is computed by the graphics hardware will be bottlenecked by the V12. This is unfortunate since most typical effects sequences involve some degree of rotation or other 3D manipulation of footage.
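A toy model makes the point concrete. The sketch below treats a render as a chain of stages and takes the chain's frame rate as that of its slowest stage. The stage rates are invented purely for illustration and are not measurements from these tests.

```python
# Toy model of a dependent processing chain: the chain can deliver frames
# no faster than its slowest stage. All rates below are hypothetical
# illustration values, not measurements.


def chain_rate(stage_rates: dict) -> float:
    """Frames per second of a chain limited by its slowest stage."""
    return min(stage_rates.values())


def bottleneck(stage_rates: dict) -> str:
    """Name of the stage that limits the chain."""
    return min(stage_rates, key=stage_rates.get)


# Hypothetical per-stage rates (frames/sec) for a slower and a faster CPU
# set, with the same graphics hardware handling the rotation in both cases.
slow_cpus = {"colour correct": 4.0, "motion blur": 3.0, "geometry rotation": 1.0}
fast_cpus = {"colour correct": 8.0, "motion blur": 6.0, "geometry rotation": 1.0}

if __name__ == "__main__":
    for name, rates in (("slow CPUs", slow_cpus), ("fast CPUs", fast_cpus)):
        print(f"{name}: {chain_rate(rates)} frames/sec, "
              f"limited by {bottleneck(rates)}")
```

Doubling the CPU-side rates leaves the chain rate unchanged, because the rotation stage still sets the pace.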

As for the Octane using different VPro graphics options, the results prove conclusively that using V6 or V10 for Flame is a bad idea. The lack of available spare VRAM/TRAM on these options badly slows down tasks such as Geometry Rotation which employ the graphics hardware. As a result, the Combined Test results for V6/V10 systems are much slower. I was once told that Fuel/V10 systems were used as entry Flame systems, but my results imply this is unwise. Fuel may indeed be used as a reasonable basic Flame system (as shown by the single-700 Tezro results which should be almost identical to a Fuel 700/V12) but only if the Fuel is fitted with a V12.

A remarkable revelation though is that V8 is quite good, because despite the slower GE speed compared to V12, it does have the same 128MB total RAM. As a result, it blows the socks off V10 and is not that much slower than V12. V8 was not officially supported for use with Flame and other Discreet apps because it doesn't support the various fancy video timings used for external digital video connections, but used as a standalone setup it works very nicely. V12's 2X faster GE speed does mean it's quicker than V8 for the Combined Test, but only 15% better, while for the basic Geometry Rotation test, V12 is only 6% faster than V8!

So what does this all mean? Simply that SGI made a big mistake in not increasing the total RAM for the Fuel/Tezro/O3K versions of V12. With the same RAM as the Octane version, it suffers from the same limits when used for Flame. The O3K-class version of V12 should have had at least 1GB VRAM, and a higher fill rate to match the increased GE speed. If this had been done, I believe the result would have been to offer major speedups over the Octane/V12 platform. As it stands, for any complex render involving 3D manipulations such as rotation, the better CPU power available with Tezro can easily be hidden by the identical gfx RAM and associated bottleneck. I expect this would apply to any Onyx3K system using V-Bricks as well, though note I don't think the two V12s in a V-Brick can be used in parallel for Discreet apps (if they could, that might speed things up a little). I can see why IR4 is still said to be quite decent for compositing given its much larger 10GB VRAM and 1GB TRAM.


Note that in case system disk speed was a factor, I performed some tests again using a much faster drive, but there was no difference in results. Having a faster system disk in an SGI/IRIX Flame system will not improve overall processing speed - at least not for the tests shown here.

With respect to desktop systems, the rest of this discussion always assumes the use of V12 graphics.

And so to the obvious question: is a quad-1GHz Tezro worth the extra cost compared to a quad-700 Tezro for using Flame? I guess it depends on what kind of operations one is doing. The quad-1GHz Tezro clearly offers good speed improvements for Sparks processing and other tasks such as colour correction and motion blur which depend on CPU speed, but if one is always performing multiple-effect processes which involve using the 3D graphics hardware then much of the benefit of the faster CPUs will not be seen. I expect though that most Flame users do both these things in daily work, ie. processing single effects on some material (such as those used for tests 4 and 5) and processing multiple effects. Of course though, if a multiple-effects process does not use the graphics hardware (eg. colour correction + motion blur, but no geometry rotation), then better CPU power would definitely help. When my friend visits again, I may run a second combined test using just a Colour Correction and Motion Blur to check this idea.

Here is the degree to which the quad-1GHz Tezro was faster than the quad-700 Tezro for each test:

   Sparks (GLINT): 32%
       RGB Import: 30%
     Targa Import: 41%
Colour Correction: 28%
      Motion Blur: 10%
Geometry Rotation: 11%
    Combined Test:  0%

Table 2. Speedup for Quad-1GHz Tezro over Quad-700 Tezro.
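For reference, here is a minimal sketch of how such percentages can be derived from the times in Table 1. It assumes the speedup is (slower time / faster time - 1) * 100, rounded to the nearest whole number. That formula is an assumption rather than something spelled out in this article, though it does reproduce the Colour Correction, Motion Blur and Combined Test entries of Table 2.

```python
# Percentage speedup between two timings, assuming the definition
# (slower / faster - 1) * 100. The formula itself is an assumption.


def to_seconds(stamp: str) -> int:
    """Convert 'ss', 'm:ss' or 'h:mm:ss' into seconds."""
    total = 0
    for part in stamp.split(":"):
        total = total * 60 + int(part)
    return total


def speedup_percent(faster: str, slower: str) -> int:
    """Whole-number percentage by which `faster` beats `slower`."""
    return round((to_seconds(slower) / to_seconds(faster) - 1) * 100)


# Quad-1GHz Tezro vs. quad-700 Tezro timings, copied from Table 1.
PAIRS = {
    "Colour Correction": ("1:46", "2:16"),
    "Motion Blur": ("2:14", "2:27"),
    "Combined Test": ("0:36:36", "0:36:42"),
}

if __name__ == "__main__":
    for test, (fast, slow) in PAIRS.items():
        print(f"{test}: {speedup_percent(fast, slow)}%")
```

For these three tests the sketch gives 28%, 10% and 0%.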

Further down the cost scale, how about the speedup for dual-1GHz Tezro over dual-700 Tezro? Remember that a dual-1GHz Origin350 fitted with VPro graphics is the same as a dual-1GHz Tezro (the CPU boards are interchangeable).

   Sparks (GLINT): 35%
       RGB Import: 41%
     Targa Import: 39%
Colour Correction: 30%
      Motion Blur: 15%
Geometry Rotation: 10%
    Combined Test:  2%

Table 3. Speedup for Dual-1GHz Tezro over Dual-700 Tezro.

The speed gain here is slightly better than for quad-1GHz vs. quad-700.

Perhaps the two most common upgrade questions I receive though are how a quad-700 or quad-800 Tezro compares to a dual-600 Octane2, and how a quad-700 or quad-800 Tezro compares to a dual-700 Tezro. I can't test 1/2/4-CPU 800MHz configs just yet, so for the moment here are the speedups for quad-700 Tezro over dual-700 Tezro (the latter configuration is much cheaper, so is the extra cost of a quad-700 worthwhile?...):

   Sparks (GLINT): 55%
       RGB Import:  5%
     Targa Import:  1%
Colour Correction: 39%
      Motion Blur: 24%
Geometry Rotation:  6%
    Combined Test:  6%

Table 4. Speedup for Quad-700 Tezro over Dual-700 Tezro.

Certainly some useful speedups in CPU-bound tasks, but as expected not so much in gfx-bound tasks.

Next, quad-700 Tezro vs. dual-600 Octane2, probably the most often considered upgrade for Flame users when considering newer SGI systems...

   Sparks (GLINT): 102%
       RGB Import:  15%
     Targa Import:  17%
Colour Correction:  67%
      Motion Blur:  59%
Geometry Rotation:  18%
    Combined Test:  18%

Table 5. Speedup for Quad-700 Tezro over Dual-600 Octane2.

Clearly a good speedup for Sparks and tasks that are CPU bound. Gfx-bound tasks also improve, though not by as much as users may have been led to believe in the past.


Next, I was once told by someone at SGI that if working with HD on an Octane2 then it is essential to have a dual-600, but just how much better is a dual-600 over a dual-400?...

   Sparks (GLINT): 30%
       RGB Import:  7%
     Targa Import: 43%
Colour Correction: 28%
      Motion Blur: 16%
Geometry Rotation:  8%
    Combined Test:  8%

Table 6. Speedup for Dual-600 Octane2 over Dual-400 Octane2.

CPU-bound tasks do speed up by a useful amount, but again notice how the complex combined test is bottlenecked by the graphics processing.

So for those using a dual-400 Octane2, what would the speedup be if one moved straight to a quad-700 Tezro? Here is the table:

   Sparks (GLINT): 163%
       RGB Import:  23%
     Targa Import:  67%
Colour Correction: 113%
      Motion Blur:  84%
Geometry Rotation:  27%
    Combined Test:  27%

Table 7. Speedup for Quad-700 Tezro over Dual-400 Octane2.

This time there are useful speedups even for graphics-bound tasks, but obviously major improvements for anything CPU dependent. Given the cost of a dual-600 upgrade for Octane, my suggestion would be: if you're using a dual-400 Octane and want better speed, move to a quad-CPU Tezro instead of upgrading the Octane, or of course switch to a Linux-based Core i7 or Xeon workstation with a Quadro FX card.

Last but not least, what about the best possible upgrade from a dual-600 Octane2? How does it compare to a quad-1GHz Tezro? Does the huge 16MB L2 in the Tezro make much difference? Here is the table, with the 2nd column showing the speedups for quad-700 Tezro vs. dual-600 Octane2 for comparison:

                    Quad-1GHz    Quad-700MHz
                     Speedup       Speedup

   Sparks (GLINT):    165%          102%
       RGB Import:     50%           15%
     Targa Import:     65%           17%
Colour Correction:    114%           67%
      Motion Blur:     74%           59%
Geometry Rotation:     32%           18%
    Combined Test:     18%           18%

Table 8. Speedup for Quad-1GHz and quad-700MHz Tezro over Dual-600 Octane2.

Again, for anything not dominated by gfx-centric tasks such as 3D rotations, the extra speedup from the quad-1GHz system is significant. But worth the extra cost compared to using a quad-700 instead? Hard to say.


At a later date I intend to obtain test results for Tezro using 800MHz CPUs (I need to borrow a quad-800 board and ask my friend to visit with his systems once more, at which point I intend to run the same tests with Octane MXI/MXE as well). Likewise, I know someone who uses Onyx2/Onyx3K Inferno systems and so at some point will test various configurations, mostly focusing on different graphics options (IR2E, IR3, IR4), though that's a more involved operation as it involves taking various parts with me (RM boards, etc.).


Caveats

Flame obviously has a vast range of processing functions available. The tests shown here have used a tiny fraction of them. However, I hope they prove useful by covering the main aspect of system performance I wanted to explore, ie. the way in which overall speed varies with different CPU/graphics options and especially the way in which V12 becomes a bottleneck on desktop systems. Is a high-end SGI with IRx graphics better than V12? Does it matter that Onyx2 systems are limited to 500MHz CPUs? Does it help to use an Onyx3K vs. an Onyx2 with the same gfx and similar CPUs? (eg. quad-600 Onyx3200 IR4 vs. quad-500 Onyx2 IR4, both with 2RM11). Time will tell.

One final comment: offhand, I do not know how typical GLINT is as a Spark, ie. the way in which it utilises multiple CPUs. Perhaps the nature of how it processes frames means it does not scale that well with multiple CPUs? Maybe other Sparks scale better? I may test this in the future when my friend visits again.
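One way to put a number on this from the existing data is the Karp-Flatt metric, which estimates the serial (non-parallel) fraction of a job from its measured speedup. The sketch below applies it to the 1GHz Tezro GLINT times in Table 1. It is only an estimate, since it lumps memory contention and any other overhead in with truly serial code.

```python
# Karp-Flatt estimate of the serial fraction of a job from measured speedup:
#   e = (1/S - 1/p) / (1 - 1/p), where S is the speedup on p CPUs.
# Applied to the GLINT times for the 1GHz Tezro from Table 1.


def to_seconds(stamp: str) -> int:
    """Convert 'mm:ss' into seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)


def serial_fraction(t_one_cpu: float, t_p_cpus: float, p: int) -> float:
    """Estimated fraction of the work that does not benefit from p CPUs."""
    speedup = t_one_cpu / t_p_cpus
    return (1 / speedup - 1 / p) / (1 - 1 / p)


GLINT_1GHZ = {1: "05:49", 2: "03:31", 4: "02:21"}  # CPUs -> mm:ss

if __name__ == "__main__":
    base = to_seconds(GLINT_1GHZ[1])
    for cpus in (2, 4):
        frac = serial_fraction(base, to_seconds(GLINT_1GHZ[cpus]), cpus)
        print(f"{cpus} CPUs: estimated serial fraction {frac:.0%}")
```

Both cases come out at roughly a fifth, which would suggest that a sizeable part of GLINT's run time does not benefit from extra CPUs. Whether other Sparks behave the same way is a separate question.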