25 May 2012 By Stuart Archibald
In my previous blog post I wrote about sparse matrix support in the OG-Maths library and how we are working on sparse direct decompositions. This is still true, and they are still in the pipeline. Whilst writing these sparse algorithms we found that having a reference implementation of equivalent dense matrix code massively aided debugging. So we wrote dense implementations of LU and Singular Value (SVD) decompositions, these being archetypal of the two fundamental decomposition strategies, direct permutation and orthogonal transform.
All went well until it came to the question of how we test the code, and the SVD results in particular. The Colt and Apache Commons Math SVDs often gave different answers (in cases when numerical issues wouldn't be present). Clearly, both couldn't be right! So we asked the question, how can we test our SVD code? Mathematically speaking we can do some testing by checking against known results, but that doesn't really catch the edge cases where the more difficult problems invariably occur. So we asked another question: could we compare to the work-hardened LAPACK/BLAS system libraries, and if so, how?
It is these questions that form the basis for an article tuple:
So we face the problem of wanting results from decent implementations of LAPACK/BLAS obtained through Java calls, which clearly screams for native interfacing via the Java Native Interface (JNI). The follow question was then asked, what is the penalty of jumping through the JNI to native code for the type of calls we are making? We therefore set up a simple test on the standard y:=Ax BLAS2 operation (see this post for details) called via the following: JNA, JNI non-critical access, JNI critical access, Java and, as a reference, direct calls from C code. Within this test we were looking for two things: ease of use and therefore implied faster development, and raw performance in terms of time taken to perform the call.
A little bit about each method:
We use the same harness and system as described in this post and test the y:=Ax operation again. We ignore column vs. row major storage differences to ensure the level of work is the same, and we aren't really bothered about the actual answer but rather the time/effort taken to get it. Therefore all data is assumed to be laid out in a manner that is conducive to just running the DGEMV call. We chose the following combinations of calling methods and both ATLAS (Netlib) and Intel MKL BLAS:
| Calling code language | Calling method | BLAS supplied by |
| Java | Java | OGMaths BLAS |
| Java | JNA | Netlib Fedora 16 default build (ATLAS) |
| Java | JNI non-critical calls | Netlib Fedora 16 default build (ATLAS) |
| Java | JNI critical calls | Netlib Fedora 16 default build (ATLAS) |
| C | Direct | Netlib Fedora 16 default build (ATLAS) |
| Java | JNI non-critical calls | Intel MKL 10.3 |
| Java | JNI critical calls | Intel MKL 10.3 |
| C | Direct | Intel MKL 10.3 |
Note: the expected call from JNA to Intel's MKL could not be made to work despite numerous efforts with regards to altering LD_PRELOAD, LD_LIBRARY_PATH, -Djava.library.path and -Djna.library.path. Static linking and loading was also attempted and failed. One day I have little doubt we will get this to work, but given the time cost required, we just wanted some results and so abandoned it. We also do not consider the lack of this result in any way detracts from our findings as they are replicated in full using Fedora 16 ATLAS BLAS.
One of the criteria for choosing the right method to make the jump to native code was the ease of coding to do so. The JNA wins outright here: it can be nicely interfaced and handled rather cleanly with little effort.
We first create an interface that extends the JNA Library class. For clarity we've not put in arguments to the dgemv_() call and have missed out the try{}catch{} on failure to load:
Then the call is made as
LAPACKLibrary.INSTANCE.dgemv_(...);
which is fantastically easy; there are even classes to aid with getting the right pointer types and mangling them coherently! Obviously the underscoring notation of the library needs to be accounted for, but that can be dealt with dynamically.
On the other hand, calls through the JNI require considerably more effort than JNA. First, we write a wrapper to our native library call, for example:
public static native void dgemv_critical_(...);
from which the header containing the corresponding signatures can be generated with the javah binary. We then have to implement the name mangled generated headers as functions, for example (in a very cut-down, pseudo-code-like version):
We then wrap this code into a shared library via:
and then write a static initialiser in our Java class to load the native library wrapper we just created:
Then, if we've done everything correctly, and the system library and java class paths are correct, we should just have to write the call to the function name in our wrapper library from the Java code:
dgemv_critical_(...);
This is evidently a lot more effort than using the JNA. However, in the case of LAPACK and BLAS it's likely the code could mostly be auto-generated (in fact, this is exactly what we do!).
Running the same tests as in this post we obtain the following performance-related results:
The calls labelled "safe" are via the JNI non-critical calls, those labelled unsafe are via the JNI "critical" calls.
What do we learn from this graph?
Note: It is nice that our DGEMV and ATLAS's run at approximately the same speed. This is down to the quality of the instruction stream generated. For the heavy-lifting part, our Java code is running a set of register-rotated scalar SSE instructions. After running objdump on the ATLAS's library, in this case, it appears the ATLAS code is similar at least in the instruction blend (register-rotated scalar SSE), hence similar speed.
What can we infer?
We confirmed what we expected: safe calls from Java to native libraries are expensive, but if done in a certain way, the cost is absolutely minimal. This means that the raw power of something like Intel's MKL can be harnessed from Java at very little overhead cost if systems require the performance. Finally we can conclude with the fact that at OpenGamma we need unit tests that compare against dependable results and so calling native libraries is going to have to happen. My next post will be about ways we test our maths libraries.