SGI Performance Comparisons

Blender Benchmark Results

Last Change: 26/Oct/2015

Blender is a very popular 3D animation, modelling and rendering application available for a wide variety of systems. It runs quite nicely on SGIs, so Indigo2/O2/Octane/Fuel systems are popular choices for both learning Blender and getting into the world of SGIs.

Here I present the performance results of running the Blender Render Benchmark on various SGI systems, using Blender V2.44. The rendered test scene looks like this (click on the image for the full-size version):

Blender Test Scene

I am not using Blender V2.45 because V2.44 is 11% faster, for reasons as yet unknown. Feel free to send me your own results! You can download Blender 2.44 from my site here, and also the test data file, test.blend. If I receive results for systems using other versions of Blender, I will include them in separate tables to avoid confusion. For my own results, all O2, Octane and other newer systems were tested using IRIX 6.5.26m when possible, while all older systems were tested with IRIX 6.5.22m. I will start including results for V2.48 at some point, for testing systems with lots of CPUs (greater thread limit), but again the data will be in a separate table.

Note that in order to demonstrate CPU scalability, some systems with N CPUs were tested with a number of threads K that is less than N. Such a system is shown with its name in italics, ie. only K CPUs in that system are being used.


                        Cores /             Clock    L2/L3                         Time
Ref.  System       CPUs   CPU     Type       MHz    Per Core  Threads   O.S.    hh:mm:ss.ss      Tested By / Notes

 1    Ian's PC #2   1      4     i7 2700K    5000     8MB        8      Win7    00:00:13.09      I.M. Oc'd to 50 x 100, 16GB DDR3/2133 RAM, Win7 Ultimate 64bit, tiles = 16 x 16.
 2    Ian's PC #1   1      4     i7 870      4270     8MB        8      Win7    00:00:18.98      I.M. Oc'd to 203.3 x 21, 4GB DDR3/2030 RAM, Win7 Ultimate 32bit, tiles = 16 x 16.
 3    Dell T7500    1      4     X5570       2930     8MB        8      Win7    00:00:25.71      I.M. Tiles = 16 x 16, system at default settings, Win7/Pro/64Bit.
 4    Onyx300       8      1     R14000       600     4MB        8     6.5.30   00:00:49.66      recondas [hinv]
 5    Tezro         4      1     R16000      1000    16MB        8     6.5.26   00:00:58.71      I.M. Tiles increased to 24 x 24. [hinv]
 6    Origin300     8      1     R14000       500     2MB        8     6.5.26   00:00:59.72      I.M. Tiles increased to 16 x 16. [hinv]
 7    Origin3200    8      1     R14000       500     8MB        8     6.5.26   00:01:01.26      Toby Jennings. Tiles increased to 8 x 8.
 8    Ian's PC #3   1      2     6000+       3225     1MB        8      Win7    00:01:17.52      ASUS M3N-HT Deluxe, Athlon64 X2 6000+ (15x215, DDR2/800 CL5), Win7/Ult/64bit.
 8    Onyx300       4      1     R14000       600     4MB        8     6.5.30   00:01:39.59      recondas
 9    Origin300     4      1     R14000       600     4MB        8     6.5.26   00:01:48.05      I.M. [hinv]
10    Origin300     4      1     R14000       500     2MB        8     6.5.26   00:01:54.39      I.M. Tiles increased to 16 x 16. [hinv]
11    Onyx2         4      1     R14000       500     8MB        8     6.5.26   00:01:55.93      I.M. Tiles increased to 16 x 16. [hinv]
12    Origin350     2      1     R16000      1000    16MB        8     6.5.30   00:01:59.78      bri3d [hinv]
13    Onyx2         4      1     R12000       400     8MB        8     6.5.26   00:02:26.50      I.M. Tiles increased to 8 x 8.
14    Onyx          8      1     R10000       195     2MB        8     6.5.22   00:02:30.40      I.M. Tiles increased to 16 x 16.
15    Tezro         2      1     R16000       700     4MB        8     6.5.26   00:02:51.44      I.M. [hinv]
16    Octane2       2      1     R14000       600     2MB        8     6.5.26   00:03:14.72      I.M.
17    Tezro         4      1     R16000      1000    16MB        1     6.5.26   00:03:48.02      I.M. Tiles increased to 16 x 16. [hinv]
18    Origin350     1      1     R16000      1000    16MB        1     6.5.30   00:03:49.61      bri3d [hinv]
19    Challenge     8      1     R10000       195     1MB        8     6.5.22   00:03:59.85      I.M. (*)
20    Fuel          1      1     R16000       900     8MB        1     6.5.26   00:04:15.75      I.M. [hinv]
21    Fuel          1      1     R16000       800     4MB        1     6.5.26   00:04:42.97      I.M. [hinv]
22    VW540         4      1     PIII         500     2MB        8     Win2K    00:04:44.53      I.M. XEON CPUs. Standard version of Blender V2.44, system had 1GB RAM using all slots.
23    Octane2       2      1     R12000       400     2MB        8     6.5.26   00:04:51.31      I.M.
24    Onyx          4      1     R10000       195     2MB        8     6.5.22   00:04:57.26      I.M.
25    Fuel          1      1     R16000       700     4MB        1     6.5.26   00:05:29.14      I.M.
26    Origin200     2      1     R12000       360     4MB        8     6.5.26   00:05:30.85      I.M.
27    Octane2       2      1     R12000       360     2MB        8     6.5.30   00:05:31.73      I.M. 
28    Octane        2      1     R12000       350     1MB        8     6.5.30   00:05:42.85      I.M. [hinv]
29    Fuel          1      1     R14000       600     4MB        1     6.5.30   00:06:17.11      I.M.
30    Fuel          1      1     R14000       600     4MB        1     6.5.29   00:06:30.53      James Smyth [hinv]
31    Octane2       2      1     R12000       300     2MB        8     6.5.26   00:06:34.32      I.M. Tiles increased to 24 x 24.
32    Octane2       1      1     R14000       550     2MB        1     6.5.26   00:07:14.79      I.M.
33    Octane2       2      1     R10000       250     1MB        8     6.5.26   00:08:17.74      I.M.
34    VW320         1      1     PIII        1120      ?         8     Win2K    00:09:26.13      I.M. Standard version of Blender V2.44, system had 512MB RAM using all slots. Tiles = 16 x 16.
34    VW320         2      1     PIII         500     512K       8     Win2K    00:09:42.34      I.M. Standard version of Blender V2.44, system had 1GB RAM using all slots. Tiles = 16 x 16.
35    Octane2       1      1     R12000       400     2MB        1     6.5.26   00:09:50.28      I.M.
36    Octane        2      1     R10000       195     1MB        8     6.5.26   00:10:19.14      I.M.
37    O2            1      1     R12000       400     2MB        1     6.5.26   00:10:24.32      I.M.
38    O2            1      1     R7000        600   256K/1MB     1     6.5.26   00:10:53.15      I.M. Screen set to 800x600 @ 60Hz. [hinv]
39    O2            1      1     R7000        600   256K/1MB     1     6.5.26   00:10:58.76      tomo [hinv]
40    Octane2       1      1     R12000       360     2MB        1     6.5.26   00:10:59.83      I.M.
41    Octane2       1      1     R12000       300     2MB        1     6.5.26   00:13:20.75      I.M.
42    O2            1      1     R12000       300     1MB        1     6.5.26   00:14:18.22      I.M.
43    Octane        1      1     R10000       250     1MB        1     6.5.26   00:15:33.62      I.M.
44    O2            1      1     R12000       270     1MB        1     6.5.26   00:15:50.04      I.M.
45    O2            1      1     R10000       250     1MB        1     6.5.26   00:17:31.13      I.M.
46    O2            1      1     R7000        350   256K/1MB     1     6.5.26   00:18:40.76      I.M.
47    Octane        1      1     R10000       225     1MB        1     6.5.26   00:19:14.72      I.M.
48    Octane        1      1     R10000       195     1MB        1     6.5.26   00:19:26.31      I.M.
49    VW320         1      1     PIII         500     512K       1     Win2K    00:19:27.76      I.M. Standard version of Blender V2.44, system had 512MB RAM using all slots.
50    O2            1      1     R10000       225     1MB        1     6.5.26   00:19:40.92      I.M.
51    Indigo2       1      1     R10000       195     1MB        1     6.5.22   00:20:02.92      I.M.
52    O2            1      1     R10000       195     1MB        1     6.5.26   00:21:48.37      I.M.
53    O2            1      1     R10000       175     1MB        1     6.5.26   00:24:24.07      I.M.
54    O2            1      1     R5200        300     1MB        1     6.5.26   00:27:11.90      I.M.
55    O2            1      1     R10000       150     1MB        1     6.5.26   00:29:06.06      I.M.
56    O2            1      1     R5000        200     1MB        1     6.5.26   00:40:23.92      I.M.
57    O2            1      1     R5000        180     512K       1     6.5.26   00:46:20.08      I.M.
58    Indy          1      1     R5000        180     512K       1     6.5.22   00:47:14.55      I.M.
59    Indy          1      1     R5000        150     512K       1     6.5.22   00:55:04.42      I.M.
60    O2            1      1     R5000        180     -          1     6.5.26   00:56:39.88      I.M.
61    Indigo2       1      1     R8000         75     2MB        1     6.5.22   01:40:39.19      I.M.

(*) This system actually has 24 CPUs, but only 8 are used for the test of course since Blender
V2.44 can't issue more than 8 threads. This does mean though that if rendering multiple frames,
ie. more than one render instance going on at any one time, then the overall throughput of the
system would be 3X faster, ie. effectively 1 frame every 1 min 20 sec.
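
The throughput figure in the footnote above can be checked with a few lines of Python. This is only a sketch of the arithmetic: it assumes three independent 8-thread render instances scale perfectly across the 24 CPUs, which a real system may not quite manage. The names `parse_time`, `instances` and `seconds_per_frame` are just for illustration.

```python
def parse_time(text):
    """Convert 'hh:mm:ss.ss' (or 'mm:ss.ss') into seconds."""
    seconds = 0.0
    for field in text.split(":"):
        seconds = seconds * 60 + float(field)
    return seconds

frame_time = parse_time("00:03:59.85")  # 8-thread Challenge result from the table
total_cpus = 24                         # CPUs actually installed in that system
threads_per_render = 8                  # Blender V2.44 thread limit

# Number of render instances that can run side by side, one CPU per thread.
instances = total_cpus // threads_per_render

# Assumption: the instances do not slow each other down at all.
seconds_per_frame = frame_time / instances

print(f"{instances} instances, roughly {seconds_per_frame:.0f} s per frame")
```

Under that assumption the result is a little under 80 seconds per frame, which matches the "1 min 20 sec" quoted above.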

PC Reference Example: My Dual-Core Athlon64 X2 3.225GHz 6000+ PC (full spec) does this test in
1 min 14.61 secs, ie. as a rough guide, an Athlon64 X2 6000+ is about the same speed as four or
five R14K/600 CPUs, depending on the task. Thus, clock for clock, MIPS holds up rather well!
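
One crude way to put a number on "clock for clock" is to multiply cores x MHz x seconds for each system, ie. the total number of core clock cycles spent on the frame. The sketch below does that for the Athlon64 figures quoted above and the quad R14K/600 Origin300 entry from the table. It ignores memory speed, cache size, compiler differences and so on, so treat the result as a rough guide only; the metric and the function name are my own choice for illustration.

```python
# Figures quoted in the text and the results table above.
athlon = {"cores": 2, "mhz": 3225, "seconds": 74.61}     # Athlon64 X2 6000+
origin300 = {"cores": 4, "mhz": 600, "seconds": 108.05}  # quad R14K/600 Origin300

def core_mhz_seconds(system):
    """Millions of clock cycles spent across all cores during the render."""
    return system["cores"] * system["mhz"] * system["seconds"]

# How many times more core clock cycles the Athlon64 needs for the same frame.
ratio = core_mhz_seconds(athlon) / core_mhz_seconds(origin300)

print(f"Athlon64 uses {ratio:.2f}x the core clock cycles of the quad R14K/600")
```

On these figures the Athlon64 needs somewhat under twice as many core clock cycles as the R14K/600s to render the same frame.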


Observations

The main results table for all systems on eofw.org shows old SGIs perform rather well for this test, outperforming x86 systems with much higher clock speeds. A reasonable approximation is that four R14K/600 CPUs are about the same speed as a modern dual-core 3GHz Athlon64. Thus, for example, a dual-600MHz Octane2 can beat an old-style 2.4GHz P4, though of course modern dual-core/quad-core x86 CPUs are much faster, especially if using SSEx versions of Blender. Still, given MIPS CPUs do not have SSEx-type instructions, SGIs are not too bad really given their age, and quite nice to work with for a beginner, especially given the high responsiveness of Octane and Fuel. (O2 is more useful when it comes to capturing frames, creating final movies, etc.; it is significantly less powerful for the main 3D work.) In fact, even an old dual-R10K/250 Octane can do this test faster than a sub-2GHz P4, which is quite surprising.

The results clearly show the usefulness of dual-CPUs in Octane, but also reveal how weak the R5000 is in O2, with the R10K being twice as fast as an R5K for this test at the same clock speed. The R7K is a slight improvement, but doesn't really shine until the best 600MHz CPU is used, at which point it's quite good, though still not as fast as the R12K/400 O2. Also, the O2 results show how an R10K or R12K is not as fast as the same CPU in Octane, or even in Indigo2, though of course O2 can use R10K/R12K options that are not available for Indigo2. Perhaps O2's main advantage is its much lower power consumption, eg. even though the best O2 is half the speed of a dual-400 Octane for this test, overall it would use less power to complete the render. However, if power consumption is important then the best systems to use are the newer O3K designs.

The really interesting results are those for older dual-CPU Octanes, eg. the dual-195 and dual-250 entries. Dual-195 and dual-250 Octanes are normally very cheap 2nd-hand, yet for Blender rendering they're only slightly slower than single-CPU Octanes at 400MHz and 550MHz respectively. A dual-300 does beat a single-550 and is significantly faster than a single-400. Since dual-CPU Octanes are more responsive in general anyway, this means that (given the low cost) something like a dual-250 with SSE graphics is actually quite a nice entry SGI system for fiddling with Blender, though of course SSE doesn't have hardware texture. Those with a budget can usually afford something better anyway, eg. a 400/V6 is a common option (V8 for those who can afford it), but these results do show that for someone who has such a system, getting a very cheap dual-250 SI as an offline renderer would give a faster render box than their main system, while leaving the main system free to continue modelling on.

For serious render speeds with SGIs though, one can use Onyx, Challenge and the newer Origin3000 series systems, including Fuel and Tezro. Sadly, Blender's 8-thread limit means SGIs with lots of CPUs (eg. Onyx/Challenge racks, newer Origin/Onyx2/Onyx3 systems) will not be faster with more than 8 CPUs, unless running more than one render task at the same time. Bit of a shame really - I'd been looking forward to seeing how well a 24-CPU Onyx rack would do the test. :D In reality, what one can say is that, assuming a typical animation involves rendering multiple frames, the overall throughput of such an Onyx is pretty good, averaging one frame every 80 seconds (that's faster than a quad-600 Origin300).

About the Indigo2 R8K/75 entry: I suspect this result is so slow because the program is not compiled to take proper advantage of the R8K design. Blender is built with GCC, but GCC knows pretty much nothing about how to optimise for the R8K. I expect the test would run much faster if Blender were compiled with MIPSpro using the R8K flags, but this might be difficult. Has anyone been able to compile Blender using MIPSpro? If so, please contact me. One person told me Blender could be made to run much faster if built using properly optimised math libs like ATLAS, but that's a whole separate problem.

The Origin300 result is interesting: it shows there is some overhead with displaying the Blender application on a remote system, though the ethernet link was only 100Mbit. I might try the test again with a Gbit connection to see if that makes any difference.


Reference Data

I did some initial testing to find out which version of Blender was the fastest, using a dual-300MHz Octane2, checking with all versions of Blender I could find for IRIX. Here are the results, in order of speed:

Blender     Time
Version   mm:ss.ss

 2.44     06:37.37
 2.45     07:21.19
 2.43     07:45.19
 2.42a    08:41.00
 2.40     09:33.73
 2.41     10:02.82

Older versions did not support more than 2 threads anyway, but clearly something has happened since V2.44, which is 11% faster than 2.45, so I am using 2.44 for testing. Thus, unless newer features are more important to you, I would recommend sticking with 2.44 until the performance issue is fixed, whatever it might be.
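
For anyone wanting to check the 11% figure, here is a small Python sketch that converts the times from the version table into seconds and shows how much longer each version takes than 2.44. The helper name `to_seconds` is just for illustration.

```python
def to_seconds(text):
    """Convert an 'mm:ss.ss' string into seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + float(seconds)

# Times copied from the version table above (dual-300MHz Octane2).
times = {
    "2.44": "06:37.37",
    "2.45": "07:21.19",
    "2.43": "07:45.19",
    "2.42a": "08:41.00",
    "2.40": "09:33.73",
    "2.41": "10:02.82",
}

best = to_seconds(times["2.44"])

# Extra render time of each version relative to 2.44, as a percentage.
extra_percent = {v: (to_seconds(t) / best - 1) * 100 for v, t in times.items()}

for version, extra in extra_percent.items():
    print(f"{version:6s} {times[version]}  {extra:5.1f}% longer than 2.44")
```

For 2.45 this works out to about 11% more render time than 2.44.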

Next, here is a table showing how performance scales with the number of threads, in this case using a dual-600MHz Octane2.

No. of      Time
Threads   mm:ss.ss

   8      03:14.72
   4      03:16.53
   2      03:23.19
   1      06:16.59

Using the maximum number of threads is clearly the best option on multi-CPU SGIs.
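
The same table can be turned into speedup figures relative to the single-thread run. This is a minimal sketch; `CPUS`, `to_seconds` and `speedup` are names chosen here for illustration, and the "ideal" is simply assumed to be 2x on a dual-CPU machine.

```python
def to_seconds(text):
    """Convert an 'mm:ss.ss' string into seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + float(seconds)

CPUS = 2  # the dual-600MHz Octane2 used for this table

# Times copied from the thread-scaling table above.
times = {8: "03:14.72", 4: "03:16.53", 2: "03:23.19", 1: "06:16.59"}

baseline = to_seconds(times[1])

# Speedup of each run relative to the single-thread time.
speedup = {n: baseline / to_seconds(t) for n, t in times.items()}

for threads in sorted(speedup):
    share = speedup[threads] / CPUS  # fraction of the assumed ideal 2x
    print(f"{threads} thread(s): speedup {speedup[threads]:.2f}x, "
          f"{share:.0%} of the ideal {CPUS}x")
```

On these numbers, 2 threads give roughly a 1.85x speedup and 8 threads roughly 1.93x, so the extra threads beyond the CPU count help a little, most likely by keeping both CPUs busy when areas finish at different times.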

Lastly, some thoughts about how Blender's multithreaded rendering operates, adapted from a post I made on Nekochan about the C-Ray benchmark...

Watching Blender work, it seems like there's a bit of a delay whenever an area is completed and a new one started. Worse, assuming the use of N threads, if there are fewer than N areas remaining (call it K) then some threads go unused, so the tail end of the rendering is not as fast. The worst case is if the final area happens to be a complex one: only one thread is running and it takes much longer than normal.

What I like about C-Ray's method is the way the remaining unprocessed area continues to be split [with the maximum number of specified threads] as long as it's possible to do so, thus the parallelism remains high right to the very end. With Blender's method, if there are N threads, the parallelism drops off as soon as there are N-1 areas left to render. Unless the overhead kills it, I would have thought it would be better once K < N to halve the width/height of the remaining K areas, which would mean being able to use N threads again. Depending on the resolution of the render, this could be done once or twice and should speed up the rendering of the final N-1 pieces quite a lot.

Example: 8 threads (very common these days with the latest dual/quad-core CPUs). Image split into the default 4 x 4 pieces. When 7 pieces remain, halve the width/height of the pieces, so 28 remain. 8 threads can be used again. As before, when only 7 pieces of this smaller size remain, the efficiency will slide, but the final result will be quicker than without. If the image was large enough (eg. HD), a further resolution-halving would still be effective. At some point the thread-management overhead would make resplitting the remaining pieces not worthwhile (perhaps this could be monitored in some way and dealt with automatically), but even 2 splitting stages would be very beneficial I reckon.

Alternatively, start the render with a larger no. of pieces, but Blender's overhead when pieces/threads start/stop looks kinda highish (if so, better to start with say 4 x 4 and then subdivide at the end). Just a thought!
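
The re-splitting idea described above can be tried out with a toy simulation. To be clear, this is not how Blender's renderer works internally; it is a sketch under several assumptions of mine: each piece has a known cost in seconds, an idle thread always takes the next waiting piece, a piece that is split into four gives four pieces of a quarter the cost each, and only pieces that have not yet been started can be split. The function `makespan` and its `max_splits` and `overhead` parameters are hypothetical names for this example.

```python
import heapq

def makespan(tile_costs, threads, max_splits=0, overhead=0.0):
    """Return the total render time for a simple tile scheduler.

    tile_costs: seconds of work per tile, in render order.
    max_splits: how many times waiting tiles may be quartered.
    overhead:   fixed start-up cost added to every tile (an assumption).
    """
    pending = list(tile_costs)    # tiles not yet started
    free_at = [0.0] * threads     # time at which each thread becomes idle
    heapq.heapify(free_at)
    splits_done = 0
    while pending:
        # Once fewer tiles than threads are waiting, halve their
        # width/height, ie. turn each one into four quarter-cost tiles.
        if len(pending) < threads and splits_done < max_splits:
            pending = [cost / 4 for cost in pending for _ in range(4)]
            splits_done += 1
        start = heapq.heappop(free_at)  # earliest idle thread
        heapq.heappush(free_at, start + overhead + pending.pop(0))
    return max(free_at)

# 4 x 4 tiles, 8 threads; the final tile is a complex one (the worst case).
costs = [10.0] * 15 + [40.0]

no_split = makespan(costs, threads=8)                 # 50.0
one_split = makespan(costs, threads=8, max_splits=1)  # 27.5

print(f"no re-splitting: {no_split:.1f} s, one re-split: {one_split:.1f} s")
```

With these made-up costs, a single re-split when 7 pieces are waiting (giving 28 smaller ones) cuts the total from 50 to 27.5 seconds, because the expensive final piece is shared among four threads instead of one. Setting `overhead` to a non-zero value gives a feel for the point at which the extra start/stop cost of many small pieces cancels out the gain.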