Bryan Rasmussen's web page for CSE260 - Parallel Computation


separator

Project

For the class project, I developed an efficient parallel method for multiplying two 4-tensors with two contractions (thus producing another 4-tensor). Here are some files associated with it: You'll notice that the slides have a 4:3 aspect ratio and are very small. To print using Acrobat Reader, in the print dialog box under Scaling, set to "Fit to Printer Margins". Alternatively, you may print multiple pages on a sheet using the same dialog.


separator

Assignment 1

Problem #1: About myself

I am a postdoc working at Los Alamos National Laboratory. I have a BSE and MSE in mechanical engineering (1997, 1999 resp.) from the University of Alabama in Huntsville, and I have a PhD in mathematics from Georgia Tech (2003). My dissertation research was in numerical dynamical systems. (Click here for some personal information.)

I work in the newly-formed High-Performance Computing division at Los Alamos, but most of my research involves data analysis on relatively small scales, particularly image data. Thus, the class will not apply immediately to my current research, although I think it will help my research progress in the future.

As of now, I do not have a partner, nor do I have any ideas for a class project. I am confident in my basic programming skills, and I am also confident in my ability to learn new languages quickly. I do not have any experience with parallelization, but I do have significant experience with vectorization in scientific programming packages such at Matlab, MathCAD, Mathematica, etc. Hopefully, this experience will carry over to make learning the art of parallelization a little bit easier. Finally, I believe that I can work well with a team to accomplish a complex goal.

Problem #2: Application description: SuperLU

SuperLU is a set of C language routines for solving large, sparse, linear systems using direct decompositions. It comes in three flavors:

Recall that one common way to solve the linear system, Ax=b, is to decompose the system into the form PA=LU, where P is a permutation matrix (stored as a vector), L is unit lower-triangular, and U is upper-triangular. This decomposition requires O(2n3/3) floating-point operations for a general system, where n is the number of rows or columns in the matrix. Essentially, it is equivalent to Gaussian elimination, where the row-swaps stored in P reflect the process of "partial pivoting," which is necessary to prevent numerical instabilities that occur when a diagonal element of U may be small compared to the other elements of the matrix.

Given such a decomposition, one simply solves the triangular systems Ly=Pb, and Ux=y to obtain the solution to the equation. These require only O(n2) operations, so the majority of the work is in the decomposition.

The actual algorithm implemented in SuperLU is, as expected, much more sophisticated, especially for the distributed-memory version. In that version, the decomposition is both row- and column-permuted and scaled: PTcPrDrADcPc=LU, where Pr,c are permutation matrices and Dr,c are diagonal preconditioners that increase the numerical stability of the system. Unlike traditional pivoting which produces the row permutations on-the-fly, the SuperLU algorithm chooses Pr before the decomposition—a process that the authors call "static pivoting". If a small number appears on the diagonal during the decomposition, then the algorithm will simply replace the number with a larger one rather than swapping entire rows. This necessitates a post-processing step to refine the solution and estimate error bounds.

The algorithm also chooses the column permutations, Pc, before performing the decomposition. The motivation for computing the columns is twofold: 1) increase the sparsity of the triangular factors, and 2) increase the parallelism. Effect 2) is the reason for the pre-multiplication by PTc. (This feature is unique to the distributed-memory version.)

The primary advantage of static pivoting is that it eliminates the need to re-assign processors dynamically once the decomposition begins. The strategy therefore trades exactness for scalability. As mentioned above, the code uses MPI to distribute the computational workload. It allows the user to specify both the number of processors and their distribution, although from the user manual it appears that the topology is always block-cyclic. Additionally, the code takes advantage of "super-nodes" to speed the sparse computations regardless of whether the computation is on a sequential or parallel machine.

As for scalability, a series of tests done on a set of standard test matrices in 2002 indicates that the algorithm scales relatively well. In one experiment, the authors try an 11-point discretization of the Laplacian on a 3D grid. They increase the grid size at roughly the same rate as the number of processors, starting with one processor and going up to 128. The results are better for cubic grids than rectangular ones. Due to overhead, the computation time increases in both cases, but for cubic grids it increases about 26%, while for rectangular grids it increases about 68%. The authors list several future improvements to the algorithm, including the parallelization of some of the pre-processing steps and implementation of a topology other than 2D block cyclic.

In general, then, SuperLU seems to be a useful set of routines for solving large, sparse matrices on distributed machines. It has worked on several large-scale, practical problems in the past, and a quick Google search turns up several installations at academic and government facilities. A new version (2.0) came out in 2003. Beyond these simple checks, it is not clear if the routines have found frequent use for any particular application.

The most common techniques for large, sparse systems on sequential machines are actually not direct techniques at all. Rather, iterative techniques such as GMRES, BiCG, and, increasingly, multi-grid methods have been the most popular ways to deal with very large matrices. Multi-grid methods in particular should apply very directly to 2D multi-processor arrays, while distributed versions of GMRES have been around for as long as SuperLU. Therefore, direct solution techniques in parallel computers computers—while they have the advantage of being very general and very accurate—probably yield to iterative techniques when an optimal algorithm is necessary for a specific problem.


separator

Assignment 2

If you would like the writeup and/or code for this assignment, please email me: bryanras(at)lanl(dot)gov.


separator

Assignment 3

Same goes for Assignment 3 ...


separator

Assignment 4

and Assignment 4.


separator


Bryan Rasmussen's main page


Updated 1 December 2006