AstroPulse v7
-------------

Mac OS X, 64bit :
This executable works on OS X 10.9 or newer.
It is not guaranteed to work on older OS versions.
NO additional libraries should be necessary to run this application.


The AstroPulse OpenCL application is currently available in 3 editions on MacOSX 64bit: for AMD/ATi, nVidia and Intel GPUs.
On Linux 64bit it is currently available in 2 editions: for AMD/ATi and nVidia GPUs.
It is intended to process SETI@home's AstroPulse v7 tasks.

Source code repository: https://setisvn.ssl.berkeley.edu/svn/branches/sah_v7_opt
Build revision: r2696
Date of revision commit: September 2014

Available command line switches:

-v N :Sets the verbosity level of the app. N is an integer. -v 0 disables almost all output. The default corresponds to -v 1.
   Levels 2 to 5 are reserved for increasing verbosity; higher levels are reserved for specific uses. -v 2 enables output of all signals.
   -v 6 enables printing of delays where sleep loops are used.
   -v 7 enables printing of the oclFFT config for oclFFT fine tuning.

-ffa_block N :Sets how many different FFA period iterations are processed per kernel call. N should be an even integer less than 32768.

-ffa_block_fetch N :Sets how many different FFA period iterations are processed per "fetch" kernel call (the longest kernel in FFA).
   N should be a positive integer and a divisor of the -ffa_block value.
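
The two rules above can be checked before starting the app. The following is only a sketch: check_ffa_blocks is a hypothetical helper, not part of the application, and requiring -ffa_block to be positive is an assumption (the text above only says "even" and "less than 32768").

```python
def check_ffa_blocks(ffa_block: int, ffa_block_fetch: int) -> list:
    """Return a list of problems; an empty list means both values follow the README rules."""
    problems = []
    # Assumption: -ffa_block must be positive (the README only states even and < 32768).
    if ffa_block <= 0 or ffa_block % 2 != 0:
        problems.append("-ffa_block should be a positive even integer")
    if ffa_block >= 32768:
        problems.append("-ffa_block should be less than 32768")
    if ffa_block_fetch <= 0:
        problems.append("-ffa_block_fetch should be a positive integer")
    elif ffa_block > 0 and ffa_block % ffa_block_fetch != 0:
        problems.append("-ffa_block_fetch should be a divisor of -ffa_block")
    return problems


if __name__ == "__main__":
    # Values taken from the tuning tips further down in this README.
    for pair in [(16384, 8192), (12288, 6144), (2048, 1000)]:
        print(pair, check_ffa_blocks(*pair) or "ok")
```

An empty result only means the values follow the stated rules, not that they are fast on a given GPU.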

-unroll N :Sets the number of data chunks processed per kernel call in the main application loop. N should be an integer; the minimum possible value is 2.

-skip_ffa_precompute :Skips the FFA pre-compute kernel call. Affects performance. Experimentation is required to see whether it increases or decreases performance on a particular GPU/CPU combo.

-exit_check :Checks more often for exit requests from BOINC. Use this option if you experience problems with long app suspend/exit times.
   Can decrease performance though.
   (not available on 64bit linux or MacOSX)

-use_sleep :Adds Sleep() calls to yield the CPU to other processes. Can affect performance. Experimentation required.

-initial_ffa_sleep N M :In PC-FFA the app will sleep N ms for a short one and M ms for a large one before looking for results. Can decrease CPU usage.
	Affects performance. Experimentation is required for a particular CPU/GPU/GPU driver combo. N and M should be non-negative integers.
	Approximate useful values can be obtained by running the app with the -v 2 and -use_sleep switches enabled and analyzing the stderr.txt log file.
	(not available on 64bit linux or MacOSX)

-sbs N :Sets the maximum single buffer size for GPU memory allocations. N should be a positive integer giving that size in Mbytes.
	For now, if other options require a bigger buffer than this option allows, a warning is issued but the memory allocation is still attempted.

-hp :Raises the priority of the application process (normal priority class and above-normal thread priority).
	Can be used to increase GPU load; experimentation is required for a particular GPU/CPU/GPU driver combo. (not available on 64bit linux or MacOSX)

	On Linux and MacOSX :
	Due to OS permission rules, normal priority can instead be achieved by setting <no_priority_change>1</no_priority_change> in the <options>
	section of BOINC's "cc_config.xml" file. Check the BOINC manuals/wiki for details on how to set this up.
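
To see where this setting sits in the file, the sketch below builds a minimal cc_config.xml text. The <cc_config> root element is an assumption taken from BOINC's client configuration documentation and is not stated in this README; build_cc_config is a hypothetical helper. If a cc_config.xml already exists, merge the option into it by hand rather than replacing the file.

```python
import xml.etree.ElementTree as ET


def build_cc_config() -> str:
    """Build a minimal cc_config.xml body with no_priority_change enabled."""
    # Assumption: <cc_config> is the root element (per BOINC docs, not this README).
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    # The element name and value are the ones quoted in the -hp section above.
    ET.SubElement(options, "no_priority_change").text = "1"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_cc_config())
```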

-cpu_lock :Enables the CPUlock feature. Limits the number of CPUs available to a particular app instance. An attempt is also made to bind different instances to different CPU cores.
	Can increase performance under some specific conditions, but can decrease it in other cases. Experimentation required.
	Currently this option allows the GPU app to use only a single logical CPU.
	Different instances will use different CPUs as long as there are enough CPUs in the system.
	To use CPUlock in round-robin mode, the GPUlock feature will be enabled. Use the -instances_per_device N option if several instances per GPU device are needed.
	(not available on 64bit linux or MacOSX)

-cpu_lock_fixed_cpu N :Also enables CPUlock, but binds all app instances to the same N-th CPU (N=0,1,.., number of CPUs-1).
  (not available on 64bit linux or MacOSX)

-gpu_lock :Enables the old-style GPU lock. Use the -instances_per_device N switch to set the number of instances to run.
  (not available on 64bit linux or MacOSX)

-instances_per_device N :Sets the allowed number of simultaneously executing GPU app instances per GPU device (shared with MultiBeam app instances).
	N is the integer number of allowed instances.
  (not available on 64bit linux or MacOSX)

These 2 options (-gpu_lock and -instances_per_device) used together provide a BOINC-independent way to limit the number of simultaneously
executing GPU apps. Each SETI OpenCL GPU application with these switches enabled will create/check global mutexes and suspend its process
execution if the limit is reached. A waiting process consumes zero CPU/GPU and a rather low amount of memory until it can continue execution.

-disable_slot N :Can be used to exclude the N-th GPU (starting from zero) from use.
	Untested and obsolete feature; use BOINC's own facilities to exclude GPUs instead.
	(not available on 64bit linux or MacOSX)

Advanced level options (some app code reading and understanding of the algorithms used is recommended before use; these are even less
fool-proof than the options above):
-tune N Mx My Mz :Lets the user fine tune the kernel launch sizes of the most important kernels.
	N - kernel ID (see below)
	Mx My Mz - workgroup size of the kernel. For 1D workgroups Mx is the size of the first dimension and the other two should be My=Mz=1.
	N should be one of values from this list:
	FFA_FETCH_WG=1,
	FFA_COMPARE_WG=2
	For best tuning results it's recommended to launch the app under a profiler to see how a particular WG size choice affects a particular kernel.
	This option is mostly for developers and hardcore optimization enthusiasts wanting the absolute max from their setups.
	No big changes in speed are expected, but if you see a big positive change over the default please report it.
	Usage example: 
			-tune 2 32 1 1  (set workgroup size of 32 for 1D FFA comparison kernel).

-oclFFT_plan A B C :Overrides the defaults for FFT 32k plan generation. Read the oclFFT code and the explanations in its comments before any tweaking.
	A - global radix
	B - local radix
	C - max size of workgroup used by the oclFFT kernel generation algorithm (check stderr.txt for your GPU's max size)
	Usage example: 
			-oclFFT_plan 64 8 256 (this corresponds to the old defaults); 
			-oclFFT_plan 0 0 0 (this effectively means the option is not used and the hardwired defaults are in play).


These switches can also be placed into the file called ap_cmdline.txt.

For examples of app_info.xml entries, look into the text file with the .aistub extension provided in the corresponding Lunatics package.
	(not available on 64bit linux or MacOSX)

Known issues:
- With 12.x Catalyst drivers GPU usage can be low if the CPU is fully used by other loads.
  The same applies to NV drivers past 267.xx and to Intel SDK drivers. 
- If you see low GPU usage on zero blanked tasks, try to free one or more CPU cores. * 
- For overflowed tasks the sequence of found signals does not always match the CPU version.
- On Linux : If OpenCL reports your GPU as having only half the GPU RAM it actually has,
  try adding the following to the ".profile" (hidden text file) of the account that BOINC runs under :
  export GPU_MAX_ALLOC_PERCENT=90
  export GPU_MAX_HEAP_SIZE=90
  Log out and back into that account so the Catalyst driver picks up the new settings, or reboot the host.


Best usage tips:

For best performance it is important to free 2 CPU cores when running multiple instances.
Freeing at least 1 CPU core is necessary to get enough GPU usage.*

*: As an alternative, try the -cpu_lock / -cpu_lock_fixed_cpu N options (if available).
   This might only work on fast multicore CPUs.

Command line parameters :
Command line switches can be used either in app_info.xml or in ap_cmdline_x86_64-apple-darwin_SSE3_OpenCL_Intel.txt.
Params in ap_cmdline*.txt override switches in the <cmdline> tag of app_info.xml.
_______________________

You do not have to set any command line parameters to run this app. 
It will autoconfigure to default settings depending on your GPU device.

High end cards (more than 30 compute units)

-unroll 18 -ffa_block 16384 -ffa_block_fetch 8192 

* Bigger unroll values < 20 don't necessarily result in better run times.

Mid range cards (12 - 24 compute units)

-unroll 12 -ffa_block 12288 -ffa_block_fetch 6144  

Entry level GPUs (less than 6 compute units)

-unroll 4 -ffa_block 2048 -ffa_block_fetch 1024 
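
The three tiers above can be written down as a small lookup. This is only a sketch: suggest_switches is a hypothetical helper, the mid range is read as 12 to 24 compute units, and for compute unit counts that fall between the listed tiers (6 to 11 and 25 to 30) it returns nothing, because this README gives no suggestion for them.

```python
from typing import Optional

# Switch strings copied from the three tiers listed above.
PRESETS = {
    "high": "-unroll 18 -ffa_block 16384 -ffa_block_fetch 8192",
    "mid": "-unroll 12 -ffa_block 12288 -ffa_block_fetch 6144",
    "entry": "-unroll 4 -ffa_block 2048 -ffa_block_fetch 1024",
}


def suggest_switches(compute_units: int) -> Optional[str]:
    """Return the README starting point for a GPU, or None if no tier covers it."""
    if compute_units > 30:
        return PRESETS["high"]
    if 12 <= compute_units <= 24:
        return PRESETS["mid"]
    if compute_units < 6:
        return PRESETS["entry"]
    # Not covered by the README: leave the app on its autoconfigured defaults.
    return None


if __name__ == "__main__":
    for cu in (4, 8, 20, 28, 32):
        print(cu, suggest_switches(cu))
```

Treat the returned string as a starting point for experimentation, not as a guaranteed optimum.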


-tune switch

possible values:

-tune 1 256 1 1
-tune 1 128 2 1
-tune 1 64 4 1
-tune 1 32 8 1
-tune 1 16 16 1

Intensive testing showed -tune 1 64 4 1 and -tune 1 32 8 1 to be fastest on HD 7970 and R9 280X.
Further testing is required for other GPUs.
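
All five -tune values listed above have Mx * My = 256 with power-of-two dimensions and Mx >= My. The sketch below enumerates shapes of that form, which may help when trying the same pattern with another total. The function name tune_shapes and the idea of varying the total are assumptions for illustration; whether other totals are valid depends on your GPU's max workgroup size (see stderr.txt).

```python
def tune_shapes(total: int = 256) -> list:
    """List (Mx, My, Mz) shapes with Mx * My == total, My a power of two, Mx >= My."""
    shapes = []
    my = 1
    while my * my <= total:
        if total % my == 0:
            shapes.append((total // my, my, 1))
        my *= 2
    return shapes


if __name__ == "__main__":
    # Kernel ID 1 is FFA_FETCH_WG, as listed in the -tune description.
    for mx, my, mz in tune_shapes():
        print(f"-tune 1 {mx} {my} {mz}")
```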

-oclFFT_plan switch

 Use at your own risk !
------------------------

By default, FFTs are processed with 8 point FFT kernels.
Using a different FFT kernel plan can speed up processing significantly.
In most cases 16 point FFT kernels are fastest for AstroPulse v7.

-oclFFT_plan 256 16 256

Example:

High end cards
-unroll 18 -oclFFT_plan 256 16 256 -ffa_block 16384 -ffa_block_fetch 8192 -tune 1 64 4 1 -tune 2 64 4 1 

Mid range cards
-unroll 12 -oclFFT_plan 256 16 256 -ffa_block 12288 -ffa_block_fetch 6144 -tune 1 64 4 1 -tune 2 64 4 1 

Your mileage might vary. 
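
One way to apply such a switch line is to write it into the ap_cmdline*.txt file. The sketch below does that with the Intel/MacOSX file name used in this README. write_cmdline and its project_dir parameter are hypothetical; this README does not say in which directory the file has to live, so the directory is left to the caller.

```python
from pathlib import Path

# File name as given in this README; other editions use a different ap_cmdline*.txt name.
CMDLINE_FILE = "ap_cmdline_x86_64-apple-darwin_SSE3_OpenCL_Intel.txt"


def write_cmdline(project_dir: str, switches: str) -> Path:
    """Write the switches as a single line, collapsing repeated whitespace."""
    target = Path(project_dir) / CMDLINE_FILE
    one_line = " ".join(switches.split())
    target.write_text(one_line, encoding="utf-8")
    return target


if __name__ == "__main__":
    import tempfile

    # Demo only: writes into a throwaway directory, not into a BOINC installation.
    demo_dir = tempfile.mkdtemp()
    path = write_cmdline(demo_dir, "-unroll 12  -ffa_block 12288 -ffa_block_fetch 6144")
    print(path.read_text(encoding="utf-8"))
```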
-----------------------------------------------------

App instances.
______________

If you experience screen lags, reduce the unroll factor and/or the ffa_block_fetch value.


Addendum:
_________

Running multiple cards in a system may require freeing another CPU core.
