Appendix B

 

In this section, we show our implementations of BiCGStab [33]. This iterative Krylov-subspace method requires a matrix A and a left-side vector b, and returns an approximate vector solution x of the system Ax = b. The algorithm (2) presents the general steps of the BiCGStab method.

Algorith 2 BiCGStab

1: r0b − Ax0

2: r*0 arbitrary

3: p0r0

4: for j= 0, 1... jmax do:

5: αj ← (rj, r*0)/(Apj, r*0)

6: sj rjαj Apj

7: wj ← (Asj, sj)/(Asj, Asj)

8: xj+1xj + αj pj + wj sj

9: rj+1sjwjAsj

10: βj ← (rj+1, r*0)/(rj, r*0) × αj /wj

11: pj+1rj+1 + βj ( pj + wj Apj )

12: end for

The corresponding CUDA implementation used in this work is presented below.

init_res = cublasDnrm2(N,d_r,1);
if(init_res!=0.0) init_res=1./init_res;
else return false;

cusparseDcsrmv(handle,CUSPARSE_OPERATION_NON_TRANSPOSE, N, N,-1.,
descr, d_val, d_row, d_col, d_x, 1., d_r);
res = cublasDnrm2(N,d_r,1);
res *= init_res;
if (res < tol) return true;

cublasDcopy(N,d_r,1,d_rtilde,1);

while ( res > tol && k <= max_iter) {
ri_0 = cublasDdot(N,d_rtilde,1, d_r,1);
if (ri_0 == PrecType(0)) {
return false;
}

if (k!=1){
if (omega == 0.) {
return false;

}
beta = (ri_0 / r_ant) * (alpha / omega);
cublasDaxpy(N,-omega,d_Ap,1,d_p,1);
cublasDscal(N,beta,d_p,1);
cublasDaxpy(N,PrecType(1),d_r,1,d_p,1);
} else {
cublasDcopy(N,d_r,1, d_p,1);
}
cusparseDcsrmv(handle,CUSPARSE_OPERATION_NON_TRANSPOSE, N, N, 1.,
descr, d_val, d_row, d_col, d_p, 0., d_Ap);
alpha = ri_0 / cublasDdot(N,d_Ap,1, d_rtilde,1);
cublasDcopy(N,d_r,1,d_s,1);
cublasDaxpy(N,-alpha,d_Ap,1,d_s,1);
if ( cublasDnrm2(N,d_s,1)*init_res < tol ) {
cublasDaxpy(N,alpha,d_p,1,d_x,1);
return true;
}
cusparseDcsrmv(handle,CUSPARSE_OPERATION_NON_TRANSPOSE, N, N, 1.,
descr, d_val, d_row, d_col, d_s, 0., d_As);
omega = cublasDdot(N,d_As,1,d_s,1) / cublasDdot(N,d_As,1,d_As,1);
cublasDaxpy(N,alpha,d_p,1,d_x,1);
cublasDaxpy(N,omega,d_s,1,d_x,1);
cublasDcopy(N,d_s,1,d_r,1);
cublasDaxpy(N,-omega,d_As,1,d_r,1);
r_ant = ri_0;
res=cublasDnrm2(N,d_r,1)*init_res;
k++;
}
if(res < tol && k <= max_iter) return true;
else return max_iter+1;

The original matrix A was stored in the d_val, d_row and d_col arrays (format CRS), the left-side vector b in the d_r array, and the vector solution x was stored in the d_x array. For the PETSc implementation, only minimal changes must be done, changing for example, line with the matrix vector product in CUDA

cusparseDcsrmv(handle,CUSPARSE_OPERATION_NON_TRANSPOSE, N, N,-1.,
descr, d_val, d_row, d_col, d_x, 1., d_r);

by the line

ierr = MatMult(A,x,r);

where A, x, r are PETSc matrix and vectors.