v53n1a5aB

Appendix B

In this section, we show our implementations of BiCGStab [33]. This iterative Krylov-subspace method requires a matrix A and a left-side vector b, and returns an approximate vector solution x of the system Ax = b. The algorithm (2) presents the general steps of the BiCGStab method.

Algorith 2 BiCGStab

1: r₀ ← b − Ax₀

2: r*₀ arbitrary

3: p₀ ←r₀

4: for j= 0, 1... j_max do:

5: α_j ← (r_j, r*₀)/(Ap_j, r*₀)

6: s_j ← r_j − α_j Ap_j

7: w_j ← (As_j, s_j)/(As_j, As_j)

8: x_j+1 ← x_j + α_j p_j + w_j s_j

9: r_j+1 ←s_j− w_jAs_j

10: β_j ← (r_j+1, r*₀)/(r_j, r*₀) × α_j /w_j

11: p_j+1 ← r_j+1 + β_j ( p_j + w_j Ap_j )

12: end for

The corresponding CUDA implementation used in this work is presented below.

init_res = cublasDnrm2(N,d_r,1);
if(init_res!=0.0) init_res=1./init_res;
else return false;

cusparseDcsrmv(handle,CUSPARSE_OPERATION_NON_TRANSPOSE, N, N,-1.,
descr, d_val, d_row, d_col, d_x, 1., d_r);
res = cublasDnrm2(N,d_r,1);
res *= init_res;
if (res < tol) return true;

cublasDcopy(N,d_r,1,d_rtilde,1);

while ( res > tol && k <= max_iter) {
ri_0 = cublasDdot(N,d_rtilde,1, d_r,1);
if (ri_0 == PrecType(0)) {
return false;
}

if (k!=1){
if (omega == 0.) {
return false;

}
beta = (ri_0 / r_ant) * (alpha / omega);
cublasDaxpy(N,-omega,d_Ap,1,d_p,1);
cublasDscal(N,beta,d_p,1);
cublasDaxpy(N,PrecType(1),d_r,1,d_p,1);
} else {
cublasDcopy(N,d_r,1, d_p,1);
}
cusparseDcsrmv(handle,CUSPARSE_OPERATION_NON_TRANSPOSE, N, N, 1.,
descr, d_val, d_row, d_col, d_p, 0., d_Ap);
alpha = ri_0 / cublasDdot(N,d_Ap,1, d_rtilde,1);
cublasDcopy(N,d_r,1,d_s,1);
cublasDaxpy(N,-alpha,d_Ap,1,d_s,1);
if ( cublasDnrm2(N,d_s,1)*init_res < tol ) {
cublasDaxpy(N,alpha,d_p,1,d_x,1);
return true;
}
cusparseDcsrmv(handle,CUSPARSE_OPERATION_NON_TRANSPOSE, N, N, 1.,
descr, d_val, d_row, d_col, d_s, 0., d_As);
omega = cublasDdot(N,d_As,1,d_s,1) / cublasDdot(N,d_As,1,d_As,1);
cublasDaxpy(N,alpha,d_p,1,d_x,1);
cublasDaxpy(N,omega,d_s,1,d_x,1);
cublasDcopy(N,d_s,1,d_r,1);
cublasDaxpy(N,-omega,d_As,1,d_r,1);
r_ant = ri_0;
res=cublasDnrm2(N,d_r,1)*init_res;
k++;
}
if(res < tol && k <= max_iter) return true;
else return max_iter+1;

The original matrix A was stored in the d_val, d_row and d_col arrays (format CRS), the left-side vector b in the d_r array, and the vector solution x was stored in the d_x array. For the PETSc implementation, only minimal changes must be done, changing for example, line with the matrix vector product in CUDA

cusparseDcsrmv(handle,CUSPARSE_OPERATION_NON_TRANSPOSE, N, N,-1.,
descr, d_val, d_row, d_col, d_x, 1., d_r);

by the line

ierr = MatMult(A,x,r);

where A, x, r are PETSc matrix and vectors.