|
Contents
Abstract
Computing the product of two matrices is a common requirement in numerical
linear algebra. Consequently, algorithms for the calculations have been incorporated
into the Sun Performance Library software. Significant computational efficiency can
be realized by linking the Library into an application. In fact, implementing
routines from the Library is a virtual necessity for large problems.
A common practice is to hand-code the triply-nested loop for the matrix product into an application. If the size of the problem is small, the hand-coded alternative may be acceptable -- perhaps even preferable -- to the corresponding routine from the Library. However, if the array dimensions are greater than about 25 or 30, the DGEMM (double precision general matrix multiplication) routine is a performance requirement. The Performance Library has been optimized separately for the architectures of the various UltraSPARC processors. Hence, linking an application to the right version is important in many cases. Dynamic linking of a specific shared library to a platform-independent executable can be accomplished with the right parameters. While matrix multiplication using DGEMM is the focus of this document, the general conclusions can be extrapolated to other components of the Sun Performance Library software. Matrix Multiplication and the Basic Linear Algebra Subprograms (BLAS)
The product of two matrices is a computationally intensive task. While the
algorithm is conceptually straightforward, it represents a significant challenge
for high-performance technical computing primarily because of its memory-access
requirements.
The matrix product A B is embedded in the more general expression:
a A B + b C (1)
which occurs frequently in numerical linear algebra. If the variables a and b
are scalars and if the array A is an m by p matrix and B is p by n, then C must
be m by n and the Expression (1) represents an m by n matrix.
As the name implies, the Basic Linear Algebra Subprograms (BLAS) are a package of computational tools for numerical linear algebra. The BLAS include the SGEMM, DGEMM, CGEMM, and ZGEMM routines, which are defined specifically for evaluating Expression (1).
Components of the BLAS do exist for specific types of matrices. However,
the performance of the routines that are designed for the various symmetric or
sparse matrices is not part of the subject matter for the current report.
The BLAS are more than a software package. The specifications for the components of the BLAS include the definitions of the essential procedures that constitute numerical linear algebra. See Sun ONE Studio 7: Sun Performance Library Reference Manual for a complete list and for the specifications of the C, C++, and Fortran interfaces. It is the responsibility of hardware vendors to supply the users of their platforms with optimized code for the algorithms in the linear-algebra subprograms. All of the BLAS are implemented in the Sun Performance Library. In addition, Sun maintains different versions of the Library optimized for the instruction-set architecture (ISA) of several UltraSPARC processors. A table of the ISAs can be found in Appendix A, Section A.2.99, of the Sun ONE Studio 7: C++ User's Guide. An explanation and examples of how to use the versions of the Performance Library can be found in the section below called "Construction of the Code." With each advance in the design of Sun's hardware, more and better resources are built into the UltraSPARC processor. Corresponding additions to the instruction set are required for Sun's compilers to be able to take advantage of the improvements. The various versions of the Performance Library incorporate the appropriate instructions. In their build machinery, developers must specify the optimal version of the Library in order to realize the benefits on the platform for their applications. The purpose of the current report is to gauge the significance of matching the version of the Performance Library to the hardware when the computational task is matrix multiplication. The material presented in this paper supports the following conclusions:
Dynamic linking of a specific shared library to a platform-independent
executable can be accomplished with the right parameters.
The algorithm for the product of two matrices is representative of the routines in the Performance Library. Thus, the general conclusions stated above for the DGEMM subprogram also apply to other components of the Library. However, to get precise performance data, it would still be necessary to repeat the current investigation for any of the other routines. The Nested Loops in the Scientific Languages
The algorithm for Expression (1) is a triply-nested loop. In the syntax for
C++, the following code stores the entries for the result back in the array c:
for( int j = 0; j < n; j++ )
{
for( int i = 0; i < m; i++ )
{
double cji = a[0][i] * b[j][0];
for( int k = 1; k < p; k++ )
{
cji += a[k][i] * b[j][k];
}
c[j][i] *= beta;
c[j][i] += alfa * cji;
}
}
The array names a, b, and c in the source code correspond to the matrices A, B, and C, in Expression (1), respectively. The version written in the C language is essentially the same as the above.The justification for storing the result back in the c array is to reduce memory requirements. If memory limitations are not a problem, then additional compiler optimizations would be enabled by storing the values in a fourth array. The effect would be a more highly-tuned executable. The storage scheme for the arrays a, b, and c is the same for C and C++. In both languages, the pointers a[k], b[j], and c[j] are references to the kth and the jth columns of the corresponding matrices. Thus, the columns of the matrices are stored in the rows of the arrays. In the equivalent Fortran code, the indices are such that the columns of the matrices are stored in the columns of the arrays:
DO J = 1, N
DO I = 1, M
CIJ = A(I,1) * B(1,J)
DO K = 2, P
CIJ = CIJ + A(I,K) * B(K,J)
END DO
C(I,J) = ALFA * CIJ + BETA * C(I,J)
END DO
END DO
The indexing of the A and B arrays in the innermost loop of the calculations illustrates the memory-access difficulties associated with a matrix product. The loop counter can reference contiguous data in memory for only one of the two matrices. In the memory-allocation schemes for both the C++ (or C) and the Fortran codes, the innermost loop must stride across memory for entries in one or the other of the matrices A and B.
Operation Counts
By inspection of the code for the triply-nested loop, it can be seen that the operation count for Expression (1) includes
* p multiplies and p - 1 adds for each of the m n entries in A times B, * m n multiplies for a times the product of A and B, * m n multiplies for b times C, and * m n adds for the sum of a A B and b Cfor a total of * m n ( p + 2 ) scalar multiplicationsand * m n p scalar additionsThus, for real variables, the count is * 2 m n ( p + 1 ) floating-point operationsthat is, just the sum of the scalar operations enumerated above. If the scalars and the entries in the matrices are complex numbers, then the floating-point operation count is naturally much higher. Each addition requires two floating-point adds and each multiplication requires four floating-point multiplies plus two floating-point adds. Thus, the count totals * 4 m n ( 2 p + 3 ) floating-point operationsfor complex variables. From the formulas above, the computational performance is given in MFLOPS (millions of floating-point operations per second) by 2 m n ( p + 1 ) / ( T 10^6 ) (2)and 4 m n ( 2 p + 3 ) / ( T 10^6 ) (3)for the calculation with real numbers and complex numbers, respectively, where T is the elapsed time in seconds and the factor of 10^-6 converts the count per second to millions per second. Serial Performance
Given values for the processing time T, the Formula (2) can be used as a measure for the speed of the DGEMM routine in the Sun Performance Library. The timing measurements have been made for a range of values of the size of the arrays. The results on two of Sun's processors, the UltraSPARC-IIICu and the UltraSPARC-II, are displayed for calculations done in serial mode (see Figures 1 through 4).
The performance data were collected on a Sun Fire 6800 with the UltraSPARC-IIICu processor and on a Sun Enterprise 10,000 with the UltraSPARC-II processor. The timing runs were repeated multiple times (at least three) for each of the problem sizes, that is, for each value of the order of the matrices. All of the points are recorded on the graphs. The graphs were done with gnuplot, and the curves connecting the data points were plotted with the smooth unique option. The v9b and v8plusb versions of the Performance Library are for the 64-bit and 32-bit address spaces, respectively. They are optimized for the architecture of the UltraSPARC-IIICu processor. The v9a and v8plusa versions are the 64-bit and 32-bit Libraries for the generic UltraSPARC architecture. The legends on the plots show the correspondence between the data and the versions of the Library.
Compiler and Linker Options
There is an important distinction to be made between the choice of compiler options and the selection of a version of the Performance Library. With the
-xarch=v9 or -xarch=v9a options, the compiler generates 64-bit code for the the UltraSPARC processors. However, if a program is compiled with the -xarch=v9b option, the code will incorporate instructions that are specific to the UltraSPARC-III processors. The executable will not run on UltraSPARC-II platforms. Similarly, in the 32-bit address space, code compiled with the -xarch=v8plus or -xarch=v8plusa options is built for the UltraSPARC processors. If a program is compiled with the -xarch=v8plusb option, it will not run on an UltraSPARC-II
processor.While the components of the Sun Performance Library are optimized for the UltraSPARC-III processor, they were not compiled with the -xarch=v9b or the -xarch=v8plusb options. Thus, the UltraSPARC-III Libraries will still run on the UltraSPARC-II machines although suboptimally as discussed in the following section.
First and Subsequent Calls
Various initialization tasks are not repeated after the first invocation of the DGEMM routine. Hence, the initial call takes more CPU time than subsequent calls. The computation rates for the first call and for all successive calls are plotted separately in Figures 1 and 2 for the UltraSPARC-IIICu processor and in Figures 3 and 4 for the UltraSPARC-II processor.
For small arrays, the plots show that the requirements for the initialization can have a dramatic effect. As the dimensions of the arrays increase, the effect of the initialization penalty becomes less significant. In Figures 1 and 2, the performance graphs display rates that approach double the clock speed with increasing order of the matrices for the UltraSPARC-IIICu processor.
The higher computation rates for the v9b and v8plusb versions of Sun Performance Library are attributable to the optimization for the UltraSPARC-IIICu processor. The US-IIICu has a clock rate of 900 MHz.
In Figures 3 and 4, the performance graphs display computation rates in excess of the clock speed for the UltraSPARC-II processor.
The UltraSPARC-II processor has a clock rate of 400 MHz. The data indicate that optimization for the US-IIICu processor has a significant negative impact on performance on the US-II platform.
The calculations were done with square arrays in order to keep the number of variables in the investigation to a minimum. Thus, on the horizontal axis for each of the plots, the order N of the matrices is the value of N = m = p = n. The overall performance may be somewhat different if, for example, the inside dimension p were allowed to get much greater than m and n. However, the overall conclusions are still valid. That is, the relative values for the various performance measurements would remain pretty much the same.Performance measurements for the DGEMM routine were made with both Fortran and C interfaces. No differences were observed. Hence, only the data for the Fortran version are shown in the figures. A version of the driver routine was written in C++. The testing verified the implementation of the C interface in the C++ code. The oscillations in the graphs on Figures 1 through 4 are a distraction for which an explanation was not originally within the scope of this paper. However, the behavior certainly demands some attention. The plot in Figure 5 is an expansion of a section of the graph in Figure 1 for the 64-bit results on the UltraSPARC-IIICu. The peaks in performance occur when the order of the matrices is a multiple of four.
The pattern for the oscillations in Figure 5 is characteristic of cache-loading effects on performance. The peaks occur when the order of the matrices is a multiple of four. The values decay as the order is incremented up to the next multiple. Without further analysis, we might conclude that the maximum performance probably corresponds to the optimum utilization of cache on the US-III chip because the cache line is 512 bytes of data. Hence, each load and store between memory and cache consists of 64 double-precision words.
When a selection of data is not a multiple of four, the cache line will contain unused data locations which means less than optimum efficiency. However, it turns out that the oscillations in the DGEMM performance are not directly related to cache size. The behavior is attributable to cleanup code incorporated by the software engineers for the Performance Library. By design, the performance tuning for DGEMM involved blocking the inner computational kernels by multiples of four. Thus, the performance peaks track the structure of the code. When the dimensions of the problem are not a multiple of four, performance is degraded due to CPU resources spent in the cleanup code.
Hand Coding the Triply-Nested Loop
The performance data for the triply-nested loop are plotted along with two of the DGEMM curves in Figure 6. If an application contains a single matrix product, then a hand-coded version might be preferred for matrices of order N < 120. Of course, computational efficiency is probably not an issue for such a small task.
When the computation of two or more products is required, then the performance of the code in DGEMM is generally as good as the performance of the hand-coded matrix product for all values of N. For matrices of order N < 30, there is a significant DGEMM performance advantage.
Construction of the Code
The driver programs for the results presented in the previous section were written to compute values of T for the Formula (2). The code incorporates the high-resolution
gethrtime() procedure for the timing measurements. The Fortran source code for the program was compiled with the following code for the 64-bit or 32-bit address space, respectively:
f95 -V -c -xarch=v9a \
-o time_dgemm.o time_dgemm.f
or
f95 -V -c -xarch=v8plusa \
-o time_dgemm.o time_dgemm.f
The program was linked with the command
f95 -V time_dgemm.o pinfo_interface.o \
-xarch=$v -xlic_lib=sunperf \
-o time_dgemm.x
where $v was assigned the values for the various versions of the Performance Library as shown in the legends on the plots. The build commands were implemented in a script for the 64-bit executable and in another script for the 32-bit executable.As explained in the previous section, the xarch option has different meanings in the compile and link commands. On the one hand, the processor-specific instruction-set architecture is incorporated in compiled code. On the other hand, the option is used to specify the version of the Performance Library on the link command. The -xarch=v9b and -xarch=v8plusb versions of the Library are tuned for the UltraSPARC-IIICu processor but do not incorporate processor-exclusive instructions. Thus, while the Library runs on both platforms, if the driver routine were compiled with the -xarch=v9b or the -xarch=v8plusb option, the executable would not run on the UltraSPARC-II Platform.For convenience, the processing was automated with a script to invoke the executable, and two more C-shell scripts to loop through all of the cases shown on the above plots for the the 32-bit executable and the 64-bit executable. The Fortran source code for the driver was the same on both platforms. A subprogram written in C provided the interface to the system routine that returns information about the platform on which the program is running. In Figures 1 and 2, it is clear that the v9b and v8plusb versions of the Performance Library are real necessities for performance on the UltraSPARC-IIICu machine. However, in Figures 3 and 4, the data show that the optimization done for the US-III has a negative effect when the code is run on a US-II platform. As shown above, the source code time_dgemm.f was compiled with an xarch option that implements the generic 64-bit or 32-bit instruction set. With the xarch = v9b or xarch = v8plusb option, the link command specifies a particular version of the Sun Performance Library. Thus, at runtime, the time_dgemm.x executable is linked to the Library optimized for the US-III platform.With the Sun ONE Studio 7 Compiler Collection, it is possible to embed a platform-dependent name into the executable. At runtime, a symbolic link can be created with the name pointing to the appropriate shared library. The executable will be built for the generic 64-bit architecture when a code is compiled with: -xtarget=generic64and linked with: -xtarget=generic64 -xlic_lib=sunperfAdding -xarch=v9a to both the compile-xtarget=generic64 -xarch=v9aand the link -xtarget=generic64 -xarch=v9a -xlic_lib=sunperfoptions incorporates UltraSPARC extensions, including the visual-instruction set (VIS) that the compiler uses to enhance floating-point performance. The executable generated with the above compiler options should run on any UltraSPARC platform. In order to choose the optimum version at runtime, the Performance Library has to be linked in dynamically. That is, perflib cannot be statically linked with the -staticlib or -Bstatic options. To incorporate a shared library at runtime, a variable name can be specified with the -R option in the link step. The PLATFORM variable is useful because it is unique to each of the UltraSPARC processors. The Build script illustrates the process for a 64-bit code that will run on any US-III or US-II platform with the optimum choice for the version of the Performance Library.As explained in the text Techniques for Optimizing Applications, High Performance Computing, Rajat P. Garg and Ilya Sharapov, Prentice Hall (2002), starting on page 228, the LD_OPTIONS and PLATFORM variables provide one means for selecting a particular library. When the LD_OPTIONS variable is defined to be -R./$PLATFORM in the link step, the run script looks for a link in the local run directory. The name for the link is given by the PLATFORM variable. On the UltraSPARC machines, the PLATFORM variable contains the name of the processor, which can be found with this command:uname -iThe final step is to create links with all of the US-III platform names pointing to the v9b perflib.At runtime, if the link exists in the local directory, then the code will pick up the v9b library. If there is no such link matching the value of PLATFORM, then the executable will use the v9a library. (The v9 library name is just a link to the v9a library.) Thus, all of the names of the US-III processors should be included on a list in the run script. If the code is executing on a US-II platform, there will be no local link and it will run with the v9a perflib.The above approach is illustrated with the script in the attached Run file. The ./Run command executes the timing program discussed in the previous sections.Specifying the links to the optimum perflib will no longer be necessary with the next release of the compiler. In the Sun ONE Studio 8 build environment, if the executable is built dynamically with the right xtarget and xlic_lib options, the library corresponding to the platform on which the code is running will be picked up automatically. While it is important to keep up with Sun's current software, we cannot plan on the S1S8 compilers for production builds yet.The Build and Run scripts are illustrations of a working approach to linking the optimum version of perflib. Modifications may be necessary to make the commands work in a particular developer's environment. The Fortran source code for the Formula (1) contains a triply-nested loop. The program was compiled with the following commands:
f95 -V time_mm.o pinfo_interface.o \
-xarch=$v -xlic_lib=sunperf \
-o time_mm.x
and
cc -V -c \
-xarch=$v \
-o pinfo_interface.o pinfo_interface.c
The executable was built with this command:
f95 -V time_mm.o pinfo_interface.o \
-xarch=$v -xlic_lib=sunperf \
-o time_mm.x
As before, the variable $v was assigned the various values of the xarch option shown in the legends on the plots.
Parallel Performance
The components of the Sun Performance Library are multithreaded. If the environmental variable
PARALLEL is set equal to n, then n threads will be spawned to do the computational work.With no changes to the source code, the time_dgemm.x executable was compiled and built with the commands:
cc -V -c \
-xarch=v9a \
-o pinfo_interface.o pinfo_interface.c
f95 -V -c \
-xarch=v9a \
-o time_dgemm.o time_dgemm.f
f95 -V time_dgemm.o pinfo_interface.o \
-xautopar -stackvar \
-xarch=$v -xlic_lib=sunperf \
-o time_dgemm.x
The data points in Figures 7 and 8 were generated with these commands:
setenv PARALLEL K
time_dgemm.x < time_dgemm.inp
where the input file time_dgemm.inp contained the values of K, N, and R, the number of threads, the order of the matrices, and the number of repetitions for timing the call to DGEMM, respectively.
The scaling data were collected on a Sun Enterprise 10,000 Server with 64 PEs (processing elements, that is, CPUs) and on a Sun Fire 6800 with 16 PEs. The program was run in an iterative loop that invoked a
Build script and a Run script for the many data points on the plots. Thus, the performance results for the various numbers of processors were collected by setting PARALLEL, followed by running the executable.
Scripts and Source Code Examples
The Sample Code is being made available by Sun merely as a convenience to you. The Sample Code below is provided "AS IS." Sun makes no warranties of any kind whatsoever with respect to the Sample Code. ALL EXPRESS OR IMPLIED
CONDITIONS, REPRESENTATIONS AND WARRANTIES, INCLUDING ANY WARRANTY OF NON-INFRINGEMENT OR IMPLIED WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, ARE HEREBY DISCLAIMED AND EXCLUDED TO THE EXTENT ALLOWED BY APPLICABLE LAW. IN NO EVENT WILL SUN BE LIABLE FOR ANY LOST REVENUE, PROFIT OR DATA, OR FOR DIRECT, SPECIAL, INDIRECT, CONSEQUENTIAL, INCIDENTAL, OR PUNITIVE DAMAGES HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY ARISING OUT OF THE USE OF OR INABILITY TO USE THE SAMPLE CODE, EVEN IF SUN HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
Fortran Source Code
Script for the 64-Bit Sample Code
Script for the 32-Bit Sample Code
Script to Run the Sample Code
Loop for the 32-Bit Sample Code
Loop for the 64-Bit Sample Code
Subprogram Written in C
Build to Use
PLATFORM Variable
Run Using the PLATFORM Variable
Source Code for the Formula (1)
Script with an Iterative Loop
Build Script
Run Script
| |||||||||||||||||||||||||||||||||||||||||||||||||||
|
| ||||||||||||