The AstroPulse OpenCL application is currently available in 3 editions: for AMD/ATi, nVidia and Intel GPUs.
It's intended to process SETI@home AstroPulse v7 tasks.

Source code repository: https://setisvn.ssl.berkeley.edu/svn/branches/sah_v7_opt
Build revision: 2721
Date of revision commit: 2014/10/06 16:50:40

Available command line switches:

-v N : Sets the verbosity level of the app. N is an integer. -v 0 disables almost all output. The default corresponds to -v 1.
	Levels 2 to 5 are reserved for increasing verbosity; higher levels are reserved for specific uses.
	-v 2 enables output of all signals.
	-v 3 in addition to level 2, enables output of simulated signals corresponding to the current threshold level (to easily detect near-threshold validation issues).
	-v 6 enables printing of delays where sleep loops are used.
	-v 7 enables printing of the oclFFT config for oclFFT fine tuning.

-ffa_block N : Sets how many different FFA period iterations are processed per kernel call. N should be an even integer less than 32768.

-ffa_block_fetch N : Sets how many different FFA period iterations are processed per "fetch" kernel call (the longest kernel in FFA).
	N should be a positive integer and a divisor of the N given to -ffa_block.
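
The constraints on -ffa_block and -ffa_block_fetch can be checked before starting the app.
Below is a minimal Python sketch, not part of the app. The function name check_ffa_switches is
hypothetical, and treating non-positive -ffa_block values as invalid is an assumption (the text
above only says "even" and "less than 32768").

```python
def check_ffa_switches(ffa_block, ffa_block_fetch):
    """Return a list of problems found; an empty list means the pair looks valid."""
    problems = []

    # -ffa_block: even integer less than 32768.
    # Rejecting values <= 0 is an assumption, not stated in the readme.
    if ffa_block <= 0 or ffa_block >= 32768 or ffa_block % 2 != 0:
        problems.append("-ffa_block should be an even integer less than 32768")

    # -ffa_block_fetch: positive integer that divides the -ffa_block value.
    if ffa_block_fetch <= 0:
        problems.append("-ffa_block_fetch should be a positive integer")
    elif ffa_block > 0 and ffa_block % ffa_block_fetch != 0:
        problems.append("-ffa_block_fetch should be a divisor of -ffa_block")

    return problems


if __name__ == "__main__":
    print(check_ffa_switches(8192, 4096))
    print(check_ffa_switches(8192, 3000))
```

An empty list means both values satisfy the rules above; otherwise each entry names the rule that was broken.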

-unroll N : Sets the number of data chunks processed per kernel call in the main application loop. N should be an integer; the minimum possible value is 2.

-skip_ffa_precompute : Skips the FFA pre-compute kernel call. Affects performance. Experimentation is required to find out whether it increases or decreases performance on a particular GPU/CPU combo.

-exit_check : Checks for exit requests from BOINC more often. Use this option if you experience problems with slow app suspend/exit.
	Can decrease performance though.

-use_sleep : Adds Sleep() calls to yield the CPU to other processes. Can affect performance. Experimentation required.

-initial_ffa_sleep N M : In PC-FFA, sleeps N ms for a short one and M ms for a large one before looking for results. Can decrease CPU usage.
	Affects performance. Experimentation is required for the particular CPU/GPU/GPU driver combo. N and M should be non-negative integers.
	An approximation of useful values can be obtained by running the app with the -v 6 and -use_sleep switches enabled and analyzing the stderr.txt log file.

-initial_single_pulse_sleep N : In the SingleFind search, sleeps N ms before looking for results. Can decrease CPU usage.
	Affects performance. Experimentation is required for the particular CPU/GPU/GPU driver combo. N should be a positive integer.
	An approximation of useful values can be obtained by running the app with the -v 6 and -use_sleep switches enabled and analyzing the stderr.txt log file.

-sbs N : Sets the maximum single buffer size for GPU memory allocations. N should be a positive integer; it gives the maximum size in Mbytes.
	For now, if other options require a bigger buffer than this option allows, a warning is issued but the memory allocation is still attempted.

-hp : Raises the priority of the application process (normal priority class and above normal thread priority).
	Can be used to increase GPU load; experimentation is required for the particular GPU/CPU/GPU driver combo.

-cpu_lock : Enables the CPUlock feature. Limits the number of CPUs available to a particular app instance. An attempt is also made to bind different instances to different CPU cores.
	Can increase performance under some specific conditions, but can decrease it in other cases. Experimentation required.
	Currently this option allows the GPU app to use only a single logical CPU.
	Different instances will use different CPUs as long as there are enough CPUs in the system.
	To use CPUlock in round-robin mode, the GPUlock feature will be enabled. Use the -instances_per_device N option if several instances per GPU device are needed.

-cpu_lock_fixed_cpu N : Also enables CPUlock, but binds all app instances to the same N-th CPU (N=0,1,.., number of CPUs-1).

-gpu_lock : Enables the old-style GPU lock. Use the -instances_per_device N switch to set the number of instances to run.

-instances_per_device N : Sets the allowed number of simultaneously executing GPU app instances per GPU device (shared with MultiBeam app instances).
	N is the integer number of allowed instances.
These 2 options used together provide a BOINC-independent way to limit the number of simultaneously
executing GPU apps. Each SETI OpenCL GPU application with these switches enabled will create/check global Mutexes and suspend its process
execution if the limit is reached. A waiting process consumes zero CPU/GPU and a rather low amount of memory until it can continue execution.
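
As a rough illustration of the limiting scheme only (this is not the app's code), the Python sketch
below models N slots per device, each guarded by a lock that stands in for a global Mutex.
The class name DeviceSlots, its methods and the one-lock-per-slot layout are assumptions made for
this example; the real app uses OS-level global Mutexes shared between processes, which
threading.Lock does not provide.

```python
import threading


class DeviceSlots:
    """Toy model of '-instances_per_device N': at most N holders at a time."""

    def __init__(self, instances_per_device):
        # One lock per allowed instance (hypothetical layout, for illustration).
        self._slots = [threading.Lock() for _ in range(instances_per_device)]

    def try_acquire(self):
        """Return the index of a free slot, or None if the limit is reached."""
        for index, lock in enumerate(self._slots):
            if lock.acquire(blocking=False):
                return index
        return None

    def release(self, index):
        """Free a slot so that a waiting instance can take it."""
        self._slots[index].release()


if __name__ == "__main__":
    slots = DeviceSlots(2)
    first = slots.try_acquire()
    second = slots.try_acquire()
    third = slots.try_acquire()   # limit reached, this instance would wait
    print(first, second, third)
```

In this model an instance that gets None would suspend itself and retry later, which corresponds to the waiting state described above.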

-disable_slot N : Can be used to exclude the N-th GPU (counting from zero) from use.
	Untested and obsolete feature; use BOINC's abilities to exclude GPUs instead.


Advanced level options for developers (some app code reading and understanding of the algorithms used is recommended before use; these options are not
fool-proof even to the same degree as the options above):
-tune N Mx My Mz : To make the app more tunable, this param allows the user to fine tune the kernel launch sizes of the most important kernels.
	N - kernel ID (see below)
	Mxyz - workgroup size of the kernel. For 1D workgroups Mx is the size of the first dimension and the other two should be My=Mz=1.
	N should be one of the values from this list:
	FFA_FETCH_WG=1,
	FFA_COMPARE_WG=2
	For best tuning results it's recommended to launch the app under a profiler to see how a particular WG size choice affects a particular kernel.
	This option is mostly for developers and hardcore optimization enthusiasts wanting the absolute max from their setups.
	No big changes in speed are expected, but if you see a big positive change over the default please report it.
	Usage example: -tune 2 32 1 1  (sets a workgroup size of 32 for the 1D FFA comparison kernel).
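
When trying several workgroup sizes under a profiler, it can help to generate the -tune switch
text from the kernel name. The Python sketch below is one possible way to do that; the helper name
tune_switch is hypothetical, and only the two kernel IDs listed above are assumed to exist.

```python
# Kernel IDs as listed in this readme.
KERNEL_IDS = {
    "FFA_FETCH_WG": 1,
    "FFA_COMPARE_WG": 2,
}


def tune_switch(kernel, mx, my=1, mz=1):
    """Build the '-tune N Mx My Mz' text for one kernel.

    For 1D workgroups leave my and mz at 1, as the readme requires.
    """
    if kernel not in KERNEL_IDS:
        raise ValueError("unknown kernel name: " + kernel)
    return "-tune {} {} {} {}".format(KERNEL_IDS[kernel], mx, my, mz)


if __name__ == "__main__":
    # Candidate sizes are arbitrary examples, not recommendations.
    for size in (32, 64, 128):
        print(tune_switch("FFA_COMPARE_WG", size))
```

Each printed line can be pasted into the command line for one profiling run.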
-oclFFT_plan A B C : Overrides the defaults for FFT 32k plan generation. Read the oclFFT code and the explanations in its comments before any tweaking.
	A - global radix
	B - local radix
	C - max size of the workgroup used by the oclFFT kernel generation algorithm
	Usage example: 	-oclFFT_plan 64 8 256 (this corresponds to the old defaults);
			-oclFFT_plan 0 0 0 (this effectively means the option is not used and the hardwired defaults are in play).


These switches can also be placed into ap_cmdline_win_x86_SSE2_OpenCL_Intel.txt.

For examples of app_info.xml entries, look into the text file with the .aistub extension provided in the corresponding package.

Known issues:

Best usage tips:

For best performance it is important to free 2 CPU cores when running multiple instances.
Freeing at least 1 CPU core is necessary to get enough GPU usage.*

* As an alternative solution, try the -cpu_lock / -cpu_lock_fixed_cpu N options.
   This might only work on fast multicore CPUs. Further testing required.

Command line parameters.
________________________
Command line switches can be used either in app_info.xml or in ap_cmdline_win_x86_SSE2_OpenCL_Intel.txt.
Params in ap_cmdline*.txt will override switches in the <cmdline> tag of app_info.xml.

High end cards (more than 12 compute units)

-unroll 12 -ffa_block 8192 -ffa_block_fetch 4096 

Mid range cards (less than 12 compute units)

-unroll 10 -ffa_block 6144 -ffa_block_fetch 1536

Entry level GPUs (less than 6 compute units)

-unroll 4 -ffa_block 2048 -ffa_block_fetch 1024 

 
Your mileage might vary.
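
The three starting points above can be written as a small lookup. This Python sketch is only an
illustration; the function name suggested_switches is hypothetical. The readme does not say which
tier a card with exactly 12 compute units belongs to, so placing it in the mid range is an
assumption.

```python
def suggested_switches(compute_units):
    """Return the starting switches from this readme for a GPU.

    compute_units is the number of compute units the GPU reports.
    """
    if compute_units > 12:
        # High end cards
        return "-unroll 12 -ffa_block 8192 -ffa_block_fetch 4096"
    if compute_units < 6:
        # Entry level GPUs
        return "-unroll 4 -ffa_block 2048 -ffa_block_fetch 1024"
    # Mid range cards (6 to 12 compute units; including 12 is an assumption)
    return "-unroll 10 -ffa_block 6144 -ffa_block_fetch 1536"


if __name__ == "__main__":
    for units in (4, 8, 20):
        print(units, suggested_switches(units))
```

Treat the result as a first guess and adjust it by experiment on the actual card.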
-----------------------------------------------------

App instances.
______________
If you experience screen lags, reduce the unroll factor and the ffa_block_fetch value.

Addendum:
_________

Running multiple cards in a system requires freeing another CPU core.

