Saturday, 5 September 2009

Ramblings about your own, fast C# Maths library

Why not build your own C# Maths library instead of using the ones included in APIs like DirectX or XNA? Doing so gives you several advantages when developing a game or anything similar. Keep reading…
It is late at night. Very late. In fact it is so late that I should say it's almost early in the morning… but I'm immersed in a task I have put off for too long: extending my own Math library, written directly in C#. What's that, you say? It can't possibly be fast enough? Ha ha… Some FAQs first:

Intro

Why write my own library?

Because I think it's good to depend on the graphics API as little as possible. Why?
  • Because APIs get deprecated from time to time (as happened with Managed DirectX), or evolve, or change, or re-appear in some form (as seems to be happening with some parts of Managed DirectX, via the Windows 7 Code Pack); or you may just decide to use another API, or want your code to be compatible with both XNA and DirectX.
  • Because I want to be able to debug the internals of my math code.
  • Because by writing your own library (even if you copy most of it), you will learn a lot.
  • Because by seeing how things work internally, you will realize how much work some operations take, and you will think twice before calling them…
Do you need any more reasons?

Wait a minute. Faster code, using no API?

For CPU-related stuff, YES, it is possible.
XNA (and DirectX) perform no magic. They just calculate things the same way you would, though of course you must choose the best algorithm for each task. If you do so, you will end up with code as fast as the API's. And the point is that, since you know how things work internally, you will have a much more detailed idea of the cost of each operation, and you will optimize your code accordingly. In addition, you will be able to tune things for your specific needs, squeezing out a bit more speed.
However, keep in mind that companies like Microsoft have hundreds or thousands of excellent engineers working on their products, so if there is a better way of doing something, the API will have it. You need to be that good too!
Let's see an example of the kind of optimization you need in order to make your own Maths library as fast as (or faster than) the ones included in the APIs. We will implement some matrix operations and run speed tests with different algorithms.

A note about Affine Transformation Matrices

In 3D graphics applications, transformations are typically built by combining elementary matrices (translations, rotations, scales and the identity matrix). Matrices built this way always result in affine transformations: matrices whose right column is (0, 0, 0, 1), if using the row-vector representation.
Some of the algorithm optimizations in this article are only valid for affine transformation matrices, and you should avoid using them for non-affine matrices. This is the case for perspective projection matrices, which DO NOT belong to the affine transformation group.
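As a quick sanity check before applying one of the affine-only optimizations below, you can verify that the right column really is (0, 0, 0, 1). This is just a sketch of mine (the method name and the epsilon tolerance are assumptions; it takes the four right-column entries directly so it works with any matrix representation):

```csharp
using System;

static class AffineCheck
{
    // Returns true when the right column of a row-vector-convention
    // matrix is (0, 0, 0, 1), i.e. the matrix is an affine transformation.
    // A small tolerance absorbs floating-point noise.
    public static bool IsAffine(float m14, float m24, float m34, float m44,
                                float epsilon = 1e-6f)
    {
        return Math.Abs(m14) <= epsilon &&
               Math.Abs(m24) <= epsilon &&
               Math.Abs(m34) <= epsilon &&
               Math.Abs(m44 - 1f) <= epsilon;
    }
}
```

For instance, the matrices used in the reliability tests below pass this check, while a perspective projection matrix does not.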

A note about JIT compilation and .NET maths performance

This is a very interesting question: Are local variables faster than member variables?
At first one could say that they aren't, since they have to be re-created each time the method is called. BUT, as this interesting blog post points out, while the JIT compiler can enregister local variables, it cannot do the same with member fields. So we should say that YES, THEY ARE, at least for math calculations. As you can see in the test, it makes a remarkable difference. So:
If you are developing in .NET and you have a performance-critical math calculation method, you should use local copies of member variables to speed up calculations
In fact, if you take a glimpse inside the XNA code, every performance-critical non-static method inside the Matrix class makes a local copy of each member field.
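The claim above can be checked with a tiny benchmark of this shape (a sketch of mine: the class, field values and iteration count are made up; run a Release build outside the debugger for meaningful numbers):

```csharp
using System;
using System.Diagnostics;

class FieldVsLocalBench
{
    float a = 0.9999f, b = 0.5f, c = 0.25f, d = 0.125f;

    // Reads the four member fields on every iteration.
    public float SumFields(int n)
    {
        float acc = 0f;
        for (int i = 0; i < n; i++)
            acc = acc * a + b * c + d;
        return acc;
    }

    // Copies the fields to locals first, which the JIT can enregister.
    public float SumLocals(int n)
    {
        float la = a, lb = b, lc = c, ld = d;
        float acc = 0f;
        for (int i = 0; i < n; i++)
            acc = acc * la + lb * lc + ld;
        return acc;
    }

    static void Main()
    {
        var bench = new FieldVsLocalBench();
        var sw = Stopwatch.StartNew();
        bench.SumFields(100000000);
        Console.WriteLine("member fields: {0} ms", sw.ElapsedMilliseconds);
        sw.Restart();
        bench.SumLocals(100000000);
        Console.WriteLine("local copies:  {0} ms", sw.ElapsedMilliseconds);
    }
}
```

Both methods perform exactly the same arithmetic, so any timing gap comes purely from field vs. local access.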

Part #1: Matrix Determinant Calculation

The general Method (Based on Graphics Gems I)

The following is a pretty straightforward algorithm for calculating the determinant of a matrix. It is mostly intended for non-realtime applications, as it is quite slow and uses double precision. It calculates the 4x4 determinant by expanding into 3x3 determinants, then 2x2, and so on. It is based on an algorithm used in Graphics Gems I for matrix inversion.
        public double Determinant4x4()
        {
            double ans;
            double a1, a2, a3, a4, b1, b2, b3, b4, c1, c2, c3, c4, d1, d2, d3, d4;
 
            a1 = this.M11; b1 = this.M12;
            c1 = this.M13; d1 = this.M14;
 
            a2 = this.M21; b2 = this.M22;
            c2 = this.M23; d2 = this.M24;
 
            a3 = this.M31; b3 = this.M32;
            c3 = this.M33; d3 = this.M34;
 
            a4 = this.M41; b4 = this.M42;
            c4 = this.M43; d4 = this.M44;
 
            ans = a1 * Determinant3x3(b2, b3, b4, c2, c3, c4, d2, d3, d4)
                - b1 * Determinant3x3(a2, a3, a4, c2, c3, c4, d2, d3, d4)
                + c1 * Determinant3x3(a2, a3, a4, b2, b3, b4, d2, d3, d4)
                - d1 * Determinant3x3(a2, a3, a4, b2, b3, b4, c2, c3, c4);
            return ans;
        }      
        public double Determinant3x3(double a1, double a2, double a3, double b1, double b2, double b3, double c1, double c2, double c3)
        {
            double ans;
            ans = a1 * Determinant2x2(b2, b3, c2, c3)
                - b1 * Determinant2x2(a2, a3, c2, c3)
                + c1 * Determinant2x2(a2, a3, b2, b3);
            return ans;
        }
        public double Determinant2x2(double a, double b, double c, double d)
        {
            double ans;
            ans = a * d - b * c;
            return ans;
        }

The general Method (real-time applications optimized)

This second version is also a full, general method, but optimized for real-time applications. It is much faster, mainly because duplicated products are grouped into the numXX variables, saving multiplications and additions.
        public float Determinant()
        {
            float num18 = (M33 * M44) - (M34 * M43);
            float num17 = (M32 * M44) - (M34 * M42);
            float num16 = (M32 * M43) - (M33 * M42);
            float num15 = (M31 * M44) - (M34 * M41);
            float num14 = (M31 * M43) - (M33 * M41);
            float num13 = (M31 * M42) - (M32 * M41);
            return ((((M11 * (((M22 * num18) - (M23 * num17)) + (M24 * num16))) - (M12 * (((M21 * num18) -
                   (M23 * num15)) + (M24 * num14)))) + (M13 * (((M21 * num17) - (M22 * num15)) +
                   (M24 * num13)))) - (M14 * (((M21 * num16) - (M22 * num14)) + (M23 * num13))));
        }
This is the one used in XNA and SlimDX. It is the de facto standard.

Optimization I (faster Memory Accessing)

As stated in the intro regarding the performance of local variables vs. member fields, we'll optimize the above method using that idea:
        public float Determinant()
        {
            float L11 = this.M11;
            float L12 = this.M12;
            float L13 = this.M13;
            float L14 = this.M14;
            float L21 = this.M21;
            float L22 = this.M22;
            float L23 = this.M23;
            float L24 = this.M24;
            float L31 = this.M31;
            float L32 = this.M32;
            float L33 = this.M33;
            float L34 = this.M34;
            float L41 = this.M41;
            float L42 = this.M42;
            float L43 = this.M43;
            float L44 = this.M44;
            float C1 = (L33 * L44) - (L34 * L43);
            float C2 = (L32 * L44) - (L34 * L42);
            float C3 = (L32 * L43) - (L33 * L42);
            float C4 = (L31 * L44) - (L34 * L41);
            float C5 = (L31 * L43) - (L33 * L41);
            float C6 = (L31 * L42) - (L32 * L41);
            return ((((L11 * (((L22 * C1) - (L23 * C2)) + (L24 * C3))) - (L12 * (((L21 * C1) -
                   (L23 * C4)) + (L24 * C5)))) + (L13 * (((L21 * C2) - (L22 * C4)) + (L24 * C6)))) -
                   (L14 * (((L21 * C3) - (L22 * C5)) + (L23 * C6))));
        }
See the test for performance results.

Optimization II (Based on Graphics Gems II, for Affine transformations only)

The following algorithm is a direct translation of a piece of code found in an affine matrix inversion algorithm in Graphics Gems II (see the next post for the full algorithm). It uses only 12 multiplications and 7 additions. Keep in mind that this method is only valid for affine transformations (see the intro for further information):
        public float Determinant()
        {
            return (this.M11 * this.M22 * this.M33) +
                   (this.M12 * this.M23 * this.M31) +
                   (this.M13 * this.M21 * this.M32) -
                   (this.M13 * this.M22 * this.M31) -
                   (this.M12 * this.M21 * this.M33) -
                   (this.M11 * this.M23 * this.M32);
        }

 

The test

So, it's time to measure these little guys.

Performance

The test consisted of calculating a 4x4 affine matrix determinant 100 million times, on an Intel Core 2 Quad Q9450 2.66 GHz with 3 GB of RAM (single-threaded).
  • General method (Graphics Gems I, non real-time): 1 min 0.9 secs
  • General method (real-time): 2.6832 secs
  • Optimization I (memory access): 2.496 secs
  • Optimization II (affine transformations): 1.404 secs
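For reference, the timing loop itself is nothing fancy; it has roughly this shape (a sketch of mine using System.Diagnostics.Stopwatch, with the values of Matrix 1 from the reliability test hard-coded and the affine 3x3-block determinant written as a static method):

```csharp
using System;
using System.Diagnostics;

class DeterminantTiming
{
    // Determinant of the upper-left 3x3 block: for an affine matrix this
    // equals the full 4x4 determinant (Optimization II above).
    public static float AffineDeterminant(
        float m11, float m12, float m13,
        float m21, float m22, float m23,
        float m31, float m32, float m33)
    {
        return m11 * m22 * m33 + m12 * m23 * m31 + m13 * m21 * m32
             - m13 * m22 * m31 - m12 * m21 * m33 - m11 * m23 * m32;
    }

    static void Main()
    {
        float det = 0f;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 100000000; i++)        // 100 million iterations
            det = AffineDeterminant(
                 0.7071068f, -0.235702276f, 0.6666667f,
                 0.0f,        0.9428091f,   0.33333334f,
                -0.7071068f, -0.235702276f, 0.6666667f);
        sw.Stop();
        Console.WriteLine("det = {0}, elapsed = {1} ms", det, sw.ElapsedMilliseconds);
    }
}
```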

Reliability

The reliability test was made with two different matrices. There are no surprises here: all the algorithms work well. The following are the matrices used for the tests (affine transformations) and the determinant calculated by each algorithm.
Matrix 1:
0.7071068 -0.235702276 0.6666667 0.0
0.0 0.9428091 0.33333334 0.0
-0.7071068 -0.235702276 0.6666667 0.0
0.0 0.0 -15.0 1
  • General method (Graphics Gems I): 1.0000001641611829
  • General method (real-time): 1.00000012
  • Optimization I (memory access): 1.00000012
  • Optimization II (affine transformations): 1.00000012

Matrix 2:
9.848019 0.10061159 0.0 0.0
-3.50578356 0.282625765 0.0 0.0
0.0 0.0 94.22 0.0
0.0 0.0 0.0 1
  • General method (Graphics Gems I): 295.47639778127973
  • General method (real-time): 295.4764
  • Optimization I (memory access): 295.4764
  • Optimization II (affine transformations): 295.4764

Part #2: Matrix Inverse Calculation

One of the most expensive matrix operations is calculating the inverse of a matrix. We will start with the general algorithm and study a couple of optimizations later:

The general Method

This algorithm (explained here) goes something like this:
        public Matrix Inverse()
        {
            float determinant = this.Determinant();
            if (determinant == 0)
                throw new System.InvalidOperationException("Matrix has zero determinant (singular matrix). Inverse doesn't exist");
 
            Matrix b = new Matrix();
            b.M11 = M22 * M33 * M44 + M23 * M34 * M42 + M24 * M32 * M43 - M22 * M34 * M43 - M23 * M32 * M44 - M24 * M33 * M42;
            b.M12 = M12 * M34 * M43 + M13 * M32 * M44 + M14 * M33 * M42 - M12 * M33 * M44 - M13 * M34 * M42 - M14 * M32 * M43;
            b.M13 = M12 * M23 * M44 + M13 * M24 * M42 + M14 * M22 * M43 - M12 * M24 * M43 - M13 * M22 * M44 - M14 * M23 * M42;
            b.M14 = M12 * M24 * M33 + M13 * M22 * M34 + M14 * M23 * M32 - M12 * M23 * M34 - M13 * M24 * M32 - M14 * M22 * M33;
 
            b.M21 = M21 * M34 * M43 + M23 * M31 * M44 + M24 * M33 * M41 - M21 * M33 * M44 - M23 * M34 * M41 - M24 * M31 * M43;
            b.M22 = M11 * M33 * M44 + M13 * M34 * M41 + M14 * M31 * M43 - M11 * M34 * M43 - M13 * M31 * M44 - M14 * M33 * M41;
            b.M23 = M11 * M24 * M43 + M13 * M21 * M44 + M14 * M23 * M41 - M11 * M23 * M44 - M13 * M24 * M41 - M14 * M21 * M43;
            b.M24 = M11 * M23 * M34 + M13 * M24 * M31 + M14 * M21 * M33 - M11 * M24 * M33 - M13 * M21 * M34 - M14 * M23 * M31;
 
            b.M31 = M21 * M32 * M44 + M22 * M34 * M41 + M24 * M31 * M42 - M21 * M34 * M42 - M22 * M31 * M44 - M24 * M32 * M41;
            b.M32 = M11 * M34 * M42 + M12 * M31 * M44 + M14 * M32 * M41 - M11 * M32 * M44 - M12 * M34 * M41 - M14 * M31 * M42;
            b.M33 = M11 * M22 * M44 + M12 * M24 * M41 + M14 * M21 * M42 - M11 * M24 * M42 - M12 * M21 * M44 - M14 * M22 * M41;
            b.M34 = M11 * M24 * M32 + M12 * M21 * M34 + M14 * M22 * M31 - M11 * M22 * M34 - M12 * M24 * M31 - M14 * M21 * M32;
 
            b.M41 = M21 * M33 * M42 + M22 * M31 * M43 + M23 * M32 * M41 - M21 * M32 * M43 - M22 * M33 * M41 - M23 * M31 * M42;
            b.M42 = M11 * M32 * M43 + M12 * M33 * M41 + M13 * M31 * M42 - M11 * M33 * M42 - M12 * M31 * M43 - M13 * M32 * M41;
            b.M43 = M11 * M23 * M42 + M12 * M21 * M43 + M13 * M22 * M41 - M11 * M22 * M43 - M12 * M23 * M41 - M13 * M21 * M42;
            b.M44 = M11 * M22 * M33 + M12 * M23 * M31 + M13 * M21 * M32 - M11 * M23 * M32 - M12 * M21 * M33 - M13 * M22 * M31;
 
            // This last operation involves a float * Matrix multiplication
            return (1f / determinant) * b;
        }
You can use any of the methods presented in Part 1 of this article for calculating the Determinant.
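Whichever inversion routine you pick, it is cheap to sanity-check it: multiply the original matrix by its computed inverse and verify that the product is (approximately) the identity. A minimal sketch of mine, using plain 4x4 float arrays to stay self-contained (the names and the tolerance are my own):

```csharp
using System;

static class InverseCheck
{
    // Standard 4x4 row-by-column matrix product.
    public static float[,] Multiply(float[,] a, float[,] b)
    {
        var result = new float[4, 4];
        for (int r = 0; r < 4; r++)
            for (int c = 0; c < 4; c++)
                for (int k = 0; k < 4; k++)
                    result[r, c] += a[r, k] * b[k, c];
        return result;
    }

    // True when every entry is within 'tolerance' of the identity matrix.
    public static bool IsApproxIdentity(float[,] m, float tolerance = 1e-4f)
    {
        for (int r = 0; r < 4; r++)
            for (int c = 0; c < 4; c++)
            {
                float expected = (r == c) ? 1f : 0f;
                if (Math.Abs(m[r, c] - expected) > tolerance)
                    return false;
            }
        return true;
    }
}
```

So for any matrix m and its computed inverse, IsApproxIdentity(Multiply(m, inverse)) should return true.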

Optimization I (memory access and some operation savings)

This method applies the same concept, but working on local copies of the member fields and packing duplicated products into local variables too:
        public static Matrix InvertXNALocalValues(Matrix matrix)
        {
            Matrix outMatrix = new Matrix();
            float L1 = matrix.M11;
            float L2 = matrix.M12;
            float L3 = matrix.M13;
            float L4 = matrix.M14;
            float L5 = matrix.M21;
            float L6 = matrix.M22;
            float L7 = matrix.M23;
            float L8 = matrix.M24;
            float L9 = matrix.M31;
            float L10 = matrix.M32;
            float L11 = matrix.M33;
            float L12 = matrix.M34;
            float L13 = matrix.M41;
            float L14 = matrix.M42;
            float L15 = matrix.M43;
            float L16 = matrix.M44;
            float L17 = (L11 * L16) - (L12 * L15);
            float L18 = (L10 * L16) - (L12 * L14);
            float L19 = (L10 * L15) - (L11 * L14);
            float L20 = (L9 * L16) - (L12 * L13);
            float L21 = (L9 * L15) - (L11 * L13);
            float L22 = (L9 * L14) - (L10 * L13);
            float L23 = ((L6 * L17) - (L7 * L18)) + (L8 * L19);
            float L24 = -(((L5 * L17) - (L7 * L20)) + (L8 * L21));
            float L25 = ((L5 * L18) - (L6 * L20)) + (L8 * L22);
            float L26 = -(((L5 * L19) - (L6 * L21)) + (L7 * L22));
            float L27 = 1f / ((((L1 * L23) + (L2 * L24)) + (L3 * L25)) + (L4 * L26));
            float L28 = (L7 * L16) - (L8 * L15);
            float L29 = (L6 * L16) - (L8 * L14);
            float L30 = (L6 * L15) - (L7 * L14);
            float L31 = (L5 * L16) - (L8 * L13);
            float L32 = (L5 * L15) - (L7 * L13);
            float L33 = (L5 * L14) - (L6 * L13);
            float L34 = (L7 * L12) - (L8 * L11);
            float L35 = (L6 * L12) - (L8 * L10);
            float L36 = (L6 * L11) - (L7 * L10);
            float L37 = (L5 * L12) - (L8 * L9);
            float L38 = (L5 * L11) - (L7 * L9);
            float L39 = (L5 * L10) - (L6 * L9);
            outMatrix.M11 = L23 * L27;
            outMatrix.M21 = L24 * L27;
            outMatrix.M31 = L25 * L27;
            outMatrix.M41 = L26 * L27;
            outMatrix.M12 = -(((L2 * L17) - (L3 * L18)) + (L4 * L19)) * L27;
            outMatrix.M22 = (((L1 * L17) - (L3 * L20)) + (L4 * L21)) * L27;
            outMatrix.M32 = -(((L1 * L18) - (L2 * L20)) + (L4 * L22)) * L27;
            outMatrix.M42 = (((L1 * L19) - (L2 * L21)) + (L3 * L22)) * L27;         
            outMatrix.M13 = (((L2 * L28) - (L3 * L29)) + (L4 * L30)) * L27;
            outMatrix.M23 = -(((L1 * L28) - (L3 * L31)) + (L4 * L32)) * L27;
            outMatrix.M33 = (((L1 * L29) - (L2 * L31)) + (L4 * L33)) * L27;
            outMatrix.M43 = -(((L1 * L30) - (L2 * L32)) + (L3 * L33)) * L27;         
            outMatrix.M14 = -(((L2 * L34) - (L3 * L35)) + (L4 * L36)) * L27;
            outMatrix.M24 = (((L1 * L34) - (L3 * L37)) + (L4 * L38)) * L27;
            outMatrix.M34 = -(((L1 * L35) - (L2 * L37)) + (L4 * L39)) * L27;
            outMatrix.M44 = (((L1 * L36) - (L2 * L38)) + (L3 * L39)) * L27;
            return outMatrix;
        }

Optimization II (for Affine Transformations only, based on Kevin Wu's article in Graphics Gems II)

The Graphics Gems book series is a must on any graphics or game developer's bookshelf. You can buy the book I'm talking about here, or get more info about the Graphics Gems series here. Google has some excerpts of this book at the following link, though every other page is missing: Graphics Gems II - Page 342.
So, the algorithm proposed by Mr. Wu looks like the following (ported to C#, original in C++):
        // Note: mout is passed by ref so the caller sees the result
        // (Matrix being a value type, passing it by value would discard the output)
        public static bool Inverse(Matrix min, ref Matrix mout)
        {
            float det_1;
            float pos, neg, temp;
            double PRECISION_LIMIT = 1.0e-15;
            // Calculate the determinant, accumulating positive and
            // negative terms separately to estimate precision loss
            pos = neg = 0.0f;
            temp = min.M11 * min.M22 * min.M33;
            if (temp >= 0.0)
                pos += temp;
            else
                neg += temp;
            temp = min.M12 * min.M23 * min.M31;
            if (temp >= 0.0)
                pos += temp;
            else
                neg += temp;
            temp = min.M13 * min.M21 * min.M32;
            if (temp >= 0.0)
                pos += temp;
            else
                neg += temp;
            temp = -min.M13 * min.M22 * min.M31;
            if (temp >= 0.0)
                pos += temp;
            else
                neg += temp;
            temp = -min.M12 * min.M21 * min.M33;
            if (temp >= 0.0)
                pos += temp;
            else
                neg += temp;
            temp = -min.M11 * min.M23 * min.M32;
            if (temp >= 0.0)
                pos += temp;
            else
                neg += temp;
            det_1 = pos + neg;
 
            // Is the submatrix A singular? then Matrix M has no inverse
            if ((det_1 == 0.0) || (Math.Abs(det_1 / (pos - neg)) < PRECISION_LIMIT))
                throw new System.InvalidOperationException("Matrix has zero determinant (singular matrix). Inverse doesn't exist");
 
            // Calculate inverse(A) = adj(A) / det(A)
            det_1 = 1.0f / det_1;
            mout.M11 = (min.M22 * min.M33 - min.M23 * min.M32) * det_1;
            mout.M21 = -(min.M21 * min.M33 - min.M23 * min.M31) * det_1;
            mout.M31 = (min.M21 * min.M32 - min.M22 * min.M31) * det_1;
            mout.M12 = -(min.M12 * min.M33 - min.M13 * min.M32) * det_1;
            mout.M22 = (min.M11 * min.M33 - min.M13 * min.M31) * det_1;
            mout.M32 = -(min.M11 * min.M32 - min.M12 * min.M31) * det_1;
            mout.M13 = (min.M12 * min.M23 - min.M13 * min.M22) * det_1;
            mout.M23 = -(min.M11 * min.M23 - min.M13 * min.M21) * det_1;
            mout.M33 = (min.M11 * min.M22 - min.M12 * min.M21) * det_1;
 
            /* Calculate -C * inverse(A) */
            mout.M41 = -(min.M41 * mout.M11 + min.M42 * mout.M21 + min.M43 * mout.M31);
            mout.M42 = -(min.M41 * mout.M12 + min.M42 * mout.M22 + min.M43 * mout.M32);
            mout.M43 = -(min.M41 * mout.M13 + min.M42 * mout.M23 + min.M43 * mout.M33);
 
            /* Fill in last column */
            mout.M14 = mout.M24 = mout.M34 = 0.0f;
            mout.M44 = 1.0f;
            return true;
        }
Please keep in mind that this algorithm is only valid for affine transformations. Of course, some memory-access optimization could also be applied here, using local variables as explained in Part 1.

Optimization III (for matrices composed of Rotations and Translations only)

If you know that your matrix is composed of rotations and translations only, you can apply this (even faster) method, which avoids calculating the determinant and takes advantage of a property of this kind of matrix: the upper-left 3x3 rotation block is orthogonal, so its inverse is simply its transpose. You can read all the info here.
        public Matrix Inverse()
        {
            Matrix r = new Matrix();
            r.M11 = this.M11;
            r.M12 = this.M21;
            r.M13 = this.M31;
            r.M14 = 0f;
 
            r.M21 = this.M12;
            r.M22 = this.M22;
            r.M23 = this.M32;
            r.M24 = 0f;
 
            r.M31 = this.M13;
            r.M32 = this.M23;
            r.M33 = this.M33;
            r.M34 = 0f;
 
            r.M41 = -(r.M11 * this.M41 + r.M21 * this.M42 + r.M31 * this.M43);
            r.M42 = -(r.M12 * this.M41 + r.M22 * this.M42 + r.M32 * this.M43);
            r.M43 = -(r.M13 * this.M41 + r.M23 * this.M42 + r.M33 * this.M43);
            r.M44 = 1f;
            return r;
        }

Additional Optimizations

If you need it, other optimizations could be added to these algorithms by detecting special matrices whose inverse can be calculated even faster. This is the case for diagonal matrices (where only the diagonal entries need to be inverted, as 1/value), and orthogonal matrices (whose inverse equals their transpose).
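Both special cases are trivial to implement. These are sketches of mine, again on plain 4x4 float arrays rather than the Matrix struct used above:

```csharp
using System;

static class SpecialInverses
{
    // Diagonal matrix: the inverse just reciprocates each diagonal entry.
    // Assumes no diagonal entry is zero (otherwise the matrix is singular).
    public static float[,] InvertDiagonal(float[,] m)
    {
        var r = new float[4, 4];
        for (int i = 0; i < 4; i++)
            r[i, i] = 1f / m[i, i];
        return r;
    }

    // Orthogonal matrix: the inverse equals the transpose.
    public static float[,] Transpose(float[,] m)
    {
        var r = new float[4, 4];
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++)
                r[i, j] = m[j, i];
        return r;
    }
}
```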

The Test

We made a very simple test, with 50,000,000 (50 million) inverse calculations of affine transformations, on an Intel Core 2 Quad Q9450 2.66 GHz with 3 GB of RAM (single-threaded).
  • General method: 11.523 secs
  • Optimization I (memory accessing and op. savings): 5.4912 secs
  • Optimization II (affine transformations only): 2.7144 secs
  • Optimization III (rotations and translations only): 1.326 secs

Conclusions

With a bit of research, it is indeed possible to match the performance of your graphics API, and as mentioned above, the advantages of having (and having built) your own library are countless.
Take care!

3 comments:

Ultrahead said...

I don't see anything special in the XNA framework code: before doing the calculations, they simply create method-local copies of each of the "Mxy" elements of the matrix (which, despite the nomenclature, are defined in the struct as fields, not properties). In your code, are "M11", "M12", etc. fields or properties?

Iñaki Ayucar said...

Wow! I'm so used to Managed DirectX, SlimDX and the like calling the native DirectX libraries directly that I took it for granted that XNA did the same. I would never have imagined that they had implemented the methods directly in .NET. And people criticize Microsoft, yet here is the XNA code in plain sight… he he…

That performance difference is probably due to those products they compute before the big operation, because in my code the fields are also member variables (not properties, which would definitely hurt performance).

They must have noticed that those products are repeated more than once, and saved a few operations that way.

Tomorrow morning I'll look into it and update this post with the information.

Thank you very much!!!!

Ultrahead said...

You're welcome, Iñaki! I'm glad to have contributed, since I always read your blog with great interest. Thanks for the mention, by the way.

It's interesting what one sees and learns by studying the sources (hint! hint!). In fact, some operations do use unsafe code and native calls, but in this case the difference caught my attention, since at first glance your code looked like the one that would give the best results.

By the way, I fully agree with the conclusion. On many occasions "custom-made" implementations turn out to be superior to the "built-in" ones.

Greetings from the other side of the Atlantic!