/*************************************************************************/
/*                                                                       */
/* Licensed Materials - Property of IBM                                  */
/*                                                                       */
/*                                                                       */
/* (C) Copyright IBM Corp. 2009, 2010                                    */
/* All Rights Reserved                                                   */
/*                                                                       */
/* US Government Users Restricted Rights - Use, duplication or           */
/* disclosure restricted by GSA ADP Schedule Contract with IBM Corp.     */
/*                                                                       */
/*************************************************************************/

================================================================================
OVERVIEW         
================================================================================

Name: Jacobi Iterative Solver
   This example application illustrates how to use OpenCL (TM)
   together with Open MPI.  Each MPI rank in the cluster runs the
   same program and computes part of the answer using the same
   OpenCL kernel.

   The problem to be solved is the standard 2D Laplace equation, an
   elliptic second-order partial differential equation in two
   variables.  Dirichlet (fixed) boundary conditions are used on a
   unit square "plate".  The boundary conditions used are:
       u(x,0) = sin(pi * x)
       u(x,1) = sin(pi * x) * exp(-pi)
       u(0,y) = 0
       u(1,y) = 0

   Although any boundary conditions could be specified the
   advantage of these conditions is that they have a known
   analytical solution as follows:
       u(x,y) = sin(pi * x) * exp(-pi * y)

   For the purposes of demonstrating OpenCL, this application uses
   the Jacobi iteration method, which stops once the maximum change
   at any node between successive iterations is less than a given
   tolerance.  The Jacobi method uses a simple 5-point
   finite-difference stencil in which each node's new value is the
   average of its 4 neighboring nodes, i.e.
       unew(i,j) = 0.25 * (u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1))

   For the purposes of demonstrating the use of MPI, the number of
   compute nodes can be specified in both the x and y dimensions
   which allows for horizontal strips, vertical strips, and
   rectangles.

   A bitmap representation of the final matrix is output in PPM
   format.  You can use netpbm to convert the file to a different
   format or to view the output.

================================================================================
PREREQUISITES    
================================================================================

   IBM OpenCL Dev Kit version 0.3 or later is required to run this sample.

   Also, Open MPI 1.3.2 or later is required.
        
   This example is built the same way as other OpenCL applications,
   by linking with the OpenCL library.  In addition, the example
   needs the MPI header and library files.  The simplest method on
   Linux is to install the openmpi and openmpi-devel RPMs.

   Note: the makefiles attempt to locate the MPI libraries and header
   files in known standard paths.  These paths may not be correct for
   all OS installations and may need to be adjusted.
    
   If you are planning to execute this example on an MPI cluster,
   you may need to configure MPI for the cluster first.

================================================================================
HOW TO BUILD     
================================================================================

   To build a 32-bit binary,
   cd to the ppc directory in the sample and type "make".

   To build a 64-bit binary,
   cd to the ppc64 directory in the sample and type "make".

================================================================================
HOW TO RUN       
================================================================================

Stand-alone:

   This example can be executed either as a stand-alone executable or 
   using the mpirun command which can start one or more MPI 
   processes.  On Linux the executable needs to dynamically link with 
   the MPI library and this is most easily done by setting the 
   LD_LIBRARY_PATH to /usr/lib/openmpi/1.3.2-gcc/lib (for example; 
   the correct path may differ on your system).  
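
   For example (the exact path depends on your Open MPI
   installation):

```shell
# Point the dynamic linker at the MPI shared libraries before running
# the example; adjust the path to match your installation.
export LD_LIBRARY_PATH=/usr/lib/openmpi/1.3.2-gcc/lib:$LD_LIBRARY_PATH
```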

   To run on the accelerator (Cell SPU):
     ./jacsolver --accel

   To run on the CPU:
     ./jacsolver --cpu

With MPI:

   The -p (--pdim) and -q (--qdim) options set the number of MPI ranks
   in the X and Y dimensions respectively.  The number of MPI ranks
   given to mpirun must match the total number of ranks specified by
   the p and q options (multiplied together).

   The MPI implementations for Power and CBEA do not set up a
   default configuration, so one needs to be specified on the
   mpirun command line.  For example:
       mpirun -np 4 --mca btl tcp,self ./jacsolver -r -p2 -q2
   executes the reference implementation on 4 MPI ranks.

   The default number of ranks in each dimension is (1,1). This is the
   same as executing without MPI.

================================================================================
COMMAND LINE SYNTAX
================================================================================

   A simple 2D iterative Jacobi solver.
   Usage: ./jacsolver [-h|--help] [-a|-c|-g|-e|-r] [OPTIONS...] [OPENCL OPTIONS...]

   Examples:
     ./jacsolver --accel -x128 -y128        Compute 128x128 array on accelerator
     mpirun -np 4 ./jacsolver -r -p2 -q2    Run reference computation on 4 MPI ranks

   Computation type (choose one, default OpenCL device is used if not specified):
     [ -a  | --accel ]     use OpenCL CBEA accelerator for compute
     [ -c  | --cpu ]       use OpenCL host CPU for compute
     [ -g  | --gpu ]       use OpenCL GPU for compute
     [ -e  | --exact ]     compute using analytical implementation (i.e. no OpenCL)
     [ -r  | --reference ] compute using reference implementation (i.e. no OpenCL)
 
   Options:
     [ -i# | --iter=# ]    number of iterations, default is 1000000
     [ -x# | --xdim=# ]    number of elements in x dimension, default is 8
     [ -y# | --ydim=# ]    number of elements in y dimension, default is 8
     [ -p# | --pdim=# ]    number of MPI ranks in x dimension, default is 1
     [ -q# | --qdim=# ]    number of MPI ranks in y dimension, default is 1
     [ -v  | --verify ]    verify computation against the reference implementation
 
   OpenCL specific options:
     [ -f  | --fullcopy ]  full copy of device memory, default is ghost cells only
     [ -m# | --mdim=# ]    size of OpenCL workblock in x dimension, default is 8
     [ -n# | --ndim=# ]    size of OpenCL workblock in y dimension, default is 8
 
   Notes:
   1. Computation types/devices (-a|-c|-g|-e|-r) are mutually exclusive.
   2. Data values are only shown if the size of the array is less than 14.
   3. Some MPI installations may require the following parameters to mpirun:
           --mca btl tcp,self
 
   The -a, -c, or -g options select the OpenCL device type, where -a
   means accelerator (e.g. Cell/B.E. SPE), -c means CPU (e.g. Power6,
   Power7 or Cell/B.E. PPE), and -g means GPU.  If -r is specified,
   the reference implementation is used instead of the OpenCL
   implementation.  If -e is specified, the exact analytical
   implementation is used.  Specifying -v additionally compares the
   result with the reference implementation; -r -v is a valid,
   although not very useful, combination of options.

   The -x and -y options specify the total size of the rectangular
   array.  The minimum and default value for x and y is 8.  The -p
   and -q options specify the number of MPI nodes in each dimension,
   and the product of p and q must equal the number of ranks given to
   the mpirun -np parameter.  The x and y dimensions must be
   multiples of p and q respectively.  For example, -x256 -y128 -p2
   -q1 specifies that each of two MPI nodes, placed side by side in
   the x dimension, calculates a 128x128 square.  If you specify an
   OpenCL device (-a, -c, -g), then the -m and -n options specify the
   workgroup size in the x and y dimensions, and each MPI node's
   local dimensions must be a multiple of these values.

   The -i option specifies the maximum number of iterations of the
   Jacobi solver.  The iterations end either when the iteration count
   is reached or when the maximum difference between two successive
   iterations falls below the tolerance.

   The -f option determines how to read and write the data between 
   the host and the device when exchanging ghost cells with MPI 
   neighbors.  See the included documentation for details on MPI 
   copies.  

================================================================================
END OF TEXT
================================================================================
