/*************************************************************************/
/*                                                                       */
/* Licensed Materials - Property of IBM                                  */
/*                                                                       */
/*                                                                       */
/* (C) Copyright IBM Corp. 2009, 2010                                    */
/* All Rights Reserved                                                   */
/*                                                                       */
/* US Government Users Restricted Rights - Use, duplication or           */
/* disclosure restricted by GSA ADP Schedule Contract with IBM Corp.     */
/*                                                                       */
/*************************************************************************/

================================================================================
OVERVIEW 
================================================================================

   Name:  Black-Scholes Options Pricing Sample
   This sample demonstrates the computation of European put and call options on 
   non-dividend-paying stocks, based on equations and formulae found in chapters 12 
   and 13 of the textbook "Options, Futures and Other Derivatives," fifth edition, 
   by John C. Hull.

   These option computations take six inputs and produce one output.

   The six inputs (and their variable names in the code) are:
      * stock price at time zero (S0), a floating-point value
      * strike price (K), a floating-point value
      * continuously compounded risk-free rate (r), a floating-point value
      * stock price volatility (sigma), a floating-point value
      * time to maturity, in years (T), a floating-point value
      * flag selecting a call or put option (cpflag), a boolean or integer value
   The output is the floating-point fair market price of the option (answer).

   Each of these seven values (six inputs and one output) takes the form of a 
   large linear array of data, and this OpenCL sample shows various ways to 
   approach this large problem.

   The sample program does the following: 
      * parses user command line parameters to determine the different runtime 
        parameters.
      * sets up the OpenCL environment using the CLU APIs. This includes acquiring 
        the platform and creating the context, device(s), and command queue(s).
      * initializes the input buffers and creates the appropriate OpenCL buffer 
        objects.   
      * decides which kernel is going to be used.
      * creates the appropriate OpenCL kernel.
      * runs the OpenCL kernel which processes the six input arrays and produces 
        the output array.
      * verifies correctness of the output array by comparing the kernel outputs
        with host computed outputs. 
      * frees up resources.
      * computes and displays performance statistics.

================================================================================
MOTIVATION 
================================================================================

   Black-Scholes is a computationally intensive and highly data-parallel 
   workload, which makes it an ideal example for demonstrating OpenCL. In this 
   sample, we show different ways programmers can use OpenCL to solve a 
   computationally intensive problem. 

================================================================================
PREREQUISITES
================================================================================

   IBM OpenCL Dev Kit version 0.3 is required to run this sample.

================================================================================
HOW TO BUILD 
================================================================================

   To build 32-bit binary, cd to the ppc directory in the sample and type "make".

   To build 64-bit binary, cd to the ppc64 directory in the sample and type "make".

================================================================================
HOW TO RUN   
================================================================================

   The binary is "bsop" and it will be in the directory where you typed "make".

   Type "bsop --help" to see useful information, detailed below.

================================================================================
COMMAND LINE OPTIONS
================================================================================

   Kernel options:

   The sample contains five different kernels which approach the problem in five
   different ways:

   * "rangeLS" kernel: this is an NDRange kernel that treats each set of 
     inputs as a work-item. The kernel loads the input data from global
     memory into registers, computes the output, and stores the output
     back to global memory. 

   * "rangeAWGC" kernel: this is also an NDRange kernel that treats each set
     of inputs as a work-item. The kernel reads the input data from global
     memory into local memory using "async_work_group_copy" for the entire
     work-group, computes the outputs, and writes the outputs back to 
     global memory using "async_work_group_copy".  
   
   * "taskDB" kernel: this is a Task kernel. The input arrays and output array 
     are divided into smaller subsections that are owned by a task.  The host 
     code looks at the number of compute units on the device to determine the 
     number of tasks that are going to run. Each task then iterates over 
     a smaller chunk of data that fits into local memory. Input and output data 
     are copied into and out of local memory using async_work_group_copy in a 
     double-buffered fashion. 

   * "taskSB" kernel: this is a Task kernel. The input arrays and output array 
     are divided into smaller subsections that are owned by a task.  The host 
     code looks at the number of compute units on the device to determine the 
     number of tasks that are going to run. Each task then iterates over 
     a smaller chunk of data that fits into local memory. Input and output data 
     are copied into and out of local memory using async_work_group_copy in a 
     single-buffered fashion. 

   * "taskLS" kernel: this is a Task kernel. The input arrays and output array 
     are divided into smaller subsections that are owned by a task.  The host 
     code looks at the number of compute units on the device to determine the 
     number of tasks that are going to run. Each task then iterates over 
     each data input in its subsection. The kernel loads the input data from 
     global memory into registers, computes the output, and stores the output
     back to global memory. 

   The two NDRange kernels tend to perform better on devices that support large 
   local work-group sizes, since each work-item performs only a small amount of 
   computation.
   Users can experiment with the --lwgsize input parameter in conjunction with the
   --rangeLS and --rangeAWGC input parameters to determine the best combination.  
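
   As a rough illustration of the rangeLS approach, a minimal NDRange kernel 
   might look like the following. This is a hypothetical sketch, not the 
   sample's actual kernel: the real kernels use the generic FLOAT vector types 
   described in the Vector width section, and cnd() stands in for a cumulative 
   normal distribution helper that the kernel source would have to define.

```c
/* Hypothetical sketch: one work-item prices one option, reading its
 * inputs from and writing its result to global memory. */
__kernel void bsop_rangeLS(__global const float *S0,
                           __global const float *K,
                           __global const float *r,
                           __global const float *sigma,
                           __global const float *T,
                           __global const int   *cpflag,
                           __global float       *answer)
{
    size_t i = get_global_id(0);

    float srt = sigma[i] * sqrt(T[i]);
    float d1  = (log(S0[i] / K[i])
                 + (r[i] + 0.5f * sigma[i] * sigma[i]) * T[i]) / srt;
    float d2  = d1 - srt;
    float df  = exp(-r[i] * T[i]);   /* discount factor */

    if (cpflag[i])
        answer[i] = S0[i] * cnd(d1) - K[i] * df * cnd(d2);
    else
        answer[i] = K[i] * df * cnd(-d2) - S0[i] * cnd(-d1);
}
```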

================================================================================
Vector width:
================================================================================

   OpenCL supports vector data types of different sizes: for example, float, 
   float2, float4, float8, and float16. In this sample, we show how the same 
   computational code can be used for different vector sizes. We use various 
   #define statements to instantiate generic "FLOAT" and "FIXED" types, making 
   the kernels easy to read yet very versatile.
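
   A hypothetical sketch of this #define pattern (the sample's actual macro 
   names and build options may differ), where the host compiles the kernel 
   with, for example, -DVECTOR_WIDTH=4:

```c
/* Sketch only: map the generic FLOAT and FIXED types onto the vector
 * width requested at kernel build time, so the same computational code
 * serves every width. */
#if VECTOR_WIDTH == 1
  #define FLOAT float
  #define FIXED int
#elif VECTOR_WIDTH == 4
  #define FLOAT float4
  #define FIXED int4
#elif VECTOR_WIDTH == 8
  #define FLOAT float8
  #define FIXED int8
#endif

FLOAT d1;   /* scalar or vector, depending on VECTOR_WIDTH */
```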
 
   Users can experiment with vectors of different sizes using the --vectorwidth input
   parameter. Different devices might support a particular vector size better 
   than others. By default, the sample queries the OpenCL device for the 
   preferred vector width that the device supports. 
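
   That query uses the standard clGetDeviceInfo call. A minimal host-side 
   sketch, assuming "device" is a valid cl_device_id:

```c
#include <CL/cl.h>

/* Ask the device for its preferred native vector width for floats. */
cl_uint width;
clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT,
                sizeof(width), &width, NULL);
/* width is now e.g. 1, 4, 8, or 16, depending on the device. */
```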

================================================================================
Buffering scheme:
================================================================================

   This sample demonstrates five different methods of using buffers in OpenCL. 
   Each of these buffering schemes has characteristics that might make the 
   overall performance of the sample better or worse on a particular platform.  
   The buffering schemes are as follows:

   * "none": In this buffering scheme, the OpenCL memory objects are allocated
     without any indication of where the objects are allocated. For example:

        cl_mem mem_object = clCreateBuffer (context, CL_MEM_READ_WRITE, 
                                            size, NULL, &err); 
     Buffers are allocated on the device and can be accessed via 
     clEnqueueReadBuffer() and clEnqueueWriteBuffer() only. 

     In this buffering scheme, the sample host code allocates and initializes temporary 
     buffers, the OpenCL runtime allocates OpenCL memory objects and initializes them 
     by writing from the temporary buffers into them, and the sample host code 
     frees the temporary buffers.


   * "use": In this buffering scheme, the OpenCL memory objects are allocated 
     with the CL_MEM_USE_HOST_PTR flag. For example:
         
        cl_mem mem_object = clCreateBuffer (context, 
                                            CL_MEM_READ_ONLY|CL_MEM_USE_HOST_PTR,
                                            size, host_ptr, &err);

     Buffers are allocated and initialized on host space and can be accessed
     via clEnqueueMapBuffer(). 

     In this buffering scheme, the sample host code allocates and initializes 
     buffers in host memory and calls clEnqueueMapBuffer() before using them and 
     clEnqueueUnmapMemObject() after the host is done using them. 

   * "alloc": In this buffering scheme, the OpenCL memory objects are allocated
     with the CL_MEM_ALLOC_HOST_PTR flag. For example:

        cl_mem mem_object = clCreateBuffer (context,
                                            CL_MEM_READ_ONLY|CL_MEM_ALLOC_HOST_PTR, 
                                            size, NULL, &err);
    
     In this buffering scheme, buffers are allocated in device memory and can be 
     accessed via clEnqueueReadBuffer(), clEnqueueWriteBuffer(), and clEnqueueMapBuffer().
     This is the default mode. 
   
   
   * "copy" or "alloc_copy": In this buffering scheme, the OpenCL memory objects
     are allocated with the CL_MEM_COPY_HOST_PTR flag or with 
     CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR. For example:

        cl_mem mem_object = clCreateBuffer (context, 
                                            CL_MEM_READ_ONLY|CL_MEM_ALLOC_HOST_PTR|CL_MEM_COPY_HOST_PTR,
                                            size, host_ptr, &err);

     In this buffering scheme, buffers are allocated in device memory and can be 
     accessed via clEnqueueReadBuffer(), clEnqueueWriteBuffer(), and, for buffers
     created with CL_MEM_ALLOC_HOST_PTR, clEnqueueMapBuffer(). The sample host code 
     allocates and initializes temporary host buffers, creates OpenCL memory objects,
     and initializes them by copying from the temporary buffers into them.
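
   For the schemes that map buffers into host memory, the host touches buffer 
   contents through the standard map/unmap calls. A minimal sketch, assuming 
   "queue", "mem_object", and "size" already exist:

```c
/* Sketch of the map / initialize / unmap flow (standard OpenCL calls). */
cl_int err;
float *p = (float *)clEnqueueMapBuffer(queue, mem_object, CL_TRUE,
                                       CL_MAP_WRITE, 0, size,
                                       0, NULL, NULL, &err);
for (size_t i = 0; i < size / sizeof(float); i++)
    p[i] = 0.0f;               /* fill in real starting values here */
clEnqueueUnmapMemObject(queue, mem_object, p, 0, NULL, NULL);
```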

================================================================================
NUMA
================================================================================

   The OpenCL device_fission and migrate_memobject extensions are used (when 
   available) to demonstrate how to utilize NUMA within OpenCL.   

   Using the IBM OpenCL runtime, one can see the effect of NUMA most clearly
   when running the sample with a large array, with fast math enabled (through the
   --fastmath command line input parameter) using the taskDB kernel on the 
   accelerator. 

================================================================================
Single or Double Precision
================================================================================

   This sample is coded to enable either single precision or double precision
   floating point computations.  The default is single precision.  You enable
   double precision computations by specifying "--double" on the command line.

   Note that "--fastmath" is not applicable to double precision computations
   and will be ignored if both are specified.
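
   In OpenCL C, double-precision kernels generally must enable the cl_khr_fp64 
   extension in the kernel source. A hypothetical sketch of how a kernel might 
   switch precision (the USE_DOUBLE macro and "real" typedef are ours, not 
   necessarily the sample's):

```c
/* Sketch only: enable fp64 and widen the working type when the host
 * builds the kernel with -DUSE_DOUBLE. */
#ifdef USE_DOUBLE
  #pragma OPENCL EXTENSION cl_khr_fp64 : enable
  typedef double real;
#else
  typedef float real;
#endif
```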

================================================================================
Performance measurement:
================================================================================

   Performance is measured and reported for six different phases of the code
   execution:
      * setup
      * buffer and data initialization
      * kernel program compilation
      * kernel execution
      * data verification
      * shutdown

================================================================================
Wrapper Script - bsop_wrapper.pl
================================================================================

   This script runs through all the different command line parameter options
   that the Black-Scholes sample offers and determines which of these option
   combinations would produce the best performance run on a particular 
   OpenCL platform. 
   
   The output of this script provides performance information on the following:
       * The best and worst command and the measured time for setting up the sample. 
         This includes parsing the command line arguments and setting up the OpenCL 
         platform, context, command queue(s), and device(s). 
   
       * The best and worst command and the time measured for initializing data 
         for the sample. This includes allocating the necessary data buffers on the 
         host (if any), creating the OpenCL memory buffer objects, and initializing 
         the buffers with a set of reasonable starting values.
   
       * The best and worst command and the time measured for executing the kernel. 
         This includes enqueueing the kernel on the appropriate OpenCL command 
         queue(s) and waiting for the execution to finish.
   
       * The best and worst command and the time measured for shutting down the sample. 
         This includes freeing host resources and releasing OpenCL resources.
   
       * The best and worst overall command and the accumulated
         time for running the entire sample. 
   
   The result does not include the time for kernel compilation and the time for
   data verification since those two steps tend to be uniform across all command line 
   options.  
   
   Parameters:
   
   The script accepts two parameters; the first is mandatory and the 
   second is optional.
   
   first parameter - device_type:  the device type input parameter can be one
   of the following:
      cpu -     different variations of the bsop program will be executed on the
                first CPU device found on the platform
   
      gpu -     different variations of the bsop program will be executed on the 
                first GPU device found on the platform
   
      accel -   different variations of the bsop program will be executed on the 
                first accelerator device found on the platform
   
   second parameter - output_filename: All output will be logged in this file. 

================================================================================
COMMAND LINE SYNTAX 
================================================================================

   Usage: bsop [DEVICE] [KERNEL] [OPTIONS]

   Device Type:

      -a, --accel              use CBEA Accelerator for compute
                               (default for Cell/B.E. machines)
      -c, --cpu                use CPU for compute
      -g, --gpu                use GPU for compute
                               (default for GPU-equipped machines)

   Kernel Type:

      --rangeLS                Use the NDRange kernel which directly indexes
                               into main memory.
      --rangeAWGC              Use the NDRange kernel which accesses the main
                               array using asynchronous workgroup copies.
                               (Intended for local work group sizes greater
                               than one.)
                               (default for GPU-equipped machines)
      --taskDB                 Use the Task kernel which performs double-buffered
                               copies of data into local memory.
                               (default for Cell/B.E. machines)
      --taskSB                 Use the Task kernel which performs single-buffered
                               copies of data into local memory.
      --taskLS                 Use the Task kernel which directly indexes into
                               main memory.

   General Kernel Options:

      -A N, --arraysize=N      Use N for the arraysize, where N is a power of 2
                               between 1 and 16777216.
                               (default: 524288)
      -w N, --vectorwidth=N
                               Number of elements (1, 2, 4, 8, or 16) to process
                               per kernel (or per loop within a Task).
                               (default: device preferred width)

      -u, --buffer [none|use|alloc|copy|alloc_copy] Selects the buffer scheme:
                               none - the sample host code allocates and initializes
                                  temporary buffers, the OpenCL runtime allocates OpenCL
                                  memory objects and initializes them by writing from the
                                  temporary buffers into them, and the sample host code frees 
                                  the temporary buffers.
                               use - the sample host code allocates and initializes buffers on
                                  the host and the OpenCL runtime uses them directly.
                               alloc - the OpenCL runtime allocates buffers, and the
                                  sample host code maps, initializes, and unmaps them. (default)
                               copy or alloc_copy - the sample host code allocates and initializes
                                  temporary host buffers, the OpenCL runtime allocates buffers and
                                  initializes them by copying from the temp buffers into them,
                                  and the sample host code frees the temp buffers.

      --fastmath/--nofastmath  Enable or disable the fast math version. The fast
                               math version enables the cl-fast-relaxed-math build
                               option and the native versions of the math functions.
                               (default: nofastmath)

   NDRange Kernel Options: (valid only with --rangeLS or --rangeAWGC)

      -l N, --lwgsize=N        Use a local workgroup size of N, where N is in:
                               { 1, 2, 4, 8, 16, 32, 64, 128, 256 }.
                               (default: 64)
   General Options:
      --numa / --nonuma        Enable or disable NUMA on systems that support
                               the extension.
                               (default: nonuma)

      --double / --single      Use double precision or single precision
                               (default: --single)
      --verify / --noverify    Verify or skip verification of computed output
                               (default: --verify)
      --verbose / --noverbose  Produce verbose output messages.
                               (default: --noverbose)
      -h, --help               This usage message

================================================================================
END OF TEXT               
================================================================================
