mbcuda.chg for x41zc question

edwardpf · April 10, 2013, 04:54:01 AM

Where can I get info on the meaning of:

" pulsefinding blocks per multiprocessor"

and

" pulsefinding blocks per multiprocessor"?

Understanding what they mean MAY help in setting them to "correct" values.

Thanks

Ed F

Jason G · April 10, 2013, 06:54:00 AM

Quote from: edwardpf on April 10, 2013, 04:54:01 AM
Where can I get info on the meaning of:

" pulsefinding blocks per multiprocessor"

and

" pulsefinding blocks per multiprocessor"?

Understanding what they mean MAY help in setting them to "correct" values.

Thanks

Ed F

Hi Ed,

"pulsefinding blocks per multiprocessor"
Before the technical part: this particular (non-critical) value only has 16 possible settings, so once you've tried the defaults or recommendations from others, the best approach is pretty much trial and error or, even better, testing with the same task under bench, as every card model & system responds a little differently. Keep in mind these are only fine-tuning parameters, as opposed to 'you must have this tweaked correctly'.

IOW there isn't necessarily a right or wrong here, it depends on your usage & goals (as well as the system).
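To make the trial-and-error approach concrete, here is a minimal Python sketch of such a sweep. It is only an illustration: `run_bench` is a hypothetical callable that you would write yourself to run the same task under your bench at a given setting and return the elapsed seconds, and the 1 to 16 range is an assumption based on the "16 possible settings" mentioned above.

```python
from typing import Callable, Dict, Iterable, Tuple


def sweep_settings(
    run_bench: Callable[[int], float],
    settings: Iterable[int] = range(1, 17),  # assumed range: 1..16
    repeats: int = 3,
) -> Tuple[int, Dict[int, float]]:
    """Time the same task at each setting; return (fastest_setting, timings)."""
    timings: Dict[int, float] = {}
    for value in settings:
        # Keep the fastest of a few repeats to damp run-to-run noise.
        timings[value] = min(run_bench(value) for _ in range(repeats))
    fastest = min(timings, key=timings.get)
    return fastest, timings


if __name__ == "__main__":
    # Stand-in for a real bench run: a made-up curve, not measured data.
    def fake_bench(value: int) -> float:
        return 100.0 + (value - 6) ** 2

    fastest, _ = sweep_settings(fake_bench)
    print("fastest setting in this fake run:", fastest)
```

The fastest setting is not automatically the "right" one; if the display becomes laggy at that value, a slightly slower setting may suit your usage better.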

The defaults, 1 for Pre-Fermi & 4 for Fermi & newer, have been chosen to be conservative for mid to low-end cards, to keep the impact on display usability at or around that of the prior 6.09 application. A setting of around 4 to 8 on Fermi class cards, around 15 for the newest (Kepler/Kepler2) GPUs, and ~1 or maybe 2 for Pre-Fermis would give a (very) small cumulative performance improvement that shouldn't impact usability much. It may, though, be noticeable in times on many systems.
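For quick reference, those figures can be written down as a small lookup. This is just a sketch: the numbers come from this post, while the generation labels and the function name `starting_point` are made up for illustration and are not anything the application itself reads.

```python
# Defaults and suggested values as quoted in this post.
# The dictionary keys are hypothetical labels, not application settings.
DEFAULTS = {"pre-fermi": 1, "fermi": 4, "kepler": 4}
SUGGESTED = {"pre-fermi": (1, 2), "fermi": (4, 8), "kepler": (15, 15)}


def starting_point(generation: str) -> dict:
    """Return the default and the suggested range to start testing from."""
    key = generation.strip().lower()
    if key not in DEFAULTS:
        raise ValueError(f"unknown GPU generation label: {generation!r}")
    low, high = SUGGESTED[key]
    return {"default": DEFAULTS[key], "try_from": low, "try_to": high}


if __name__ == "__main__":
    for label in ("Pre-Fermi", "Fermi", "Kepler"):
        print(label, starting_point(label))
```

Treat the result as a place to start testing from rather than a final answer, since every card model & system responds a little differently.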

As for the technical explanation of what it does, broadly speaking it determines how much work is stuffed into each multiprocessor in the GPU at once, specifically in the pulsefinding. Newer GPU driver models optimise for multiple blocks being executed in quick succession, and newer GPUs have some superscalar execution capabilities; both are forms of latency hiding that will not show on earlier generation cards or drivers.

Expressed more succinctly: launching more pulsefinding blocks at once allows some video memory latency hiding, at the risk of running too long, or of too many blocks causing different forms of thrashing.
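As a rough illustration of the scale involved, here is a tiny Python sketch. It assumes that a launch is sized as the setting multiplied by the number of multiprocessors on the card, which is what the setting's name suggests but is not spelled out in this thread; the function name and the multiprocessor counts used are purely illustrative.

```python
def blocks_in_flight(blocks_per_sm: int, sm_count: int) -> int:
    """Estimate blocks launched at once, ASSUMING blocks_per_sm * sm_count."""
    if not 1 <= blocks_per_sm <= 16:  # assumed valid range: 1..16
        raise ValueError("blocks_per_sm is assumed to lie in 1..16")
    if sm_count < 1:
        raise ValueError("sm_count must be at least 1")
    return blocks_per_sm * sm_count


if __name__ == "__main__":
    # Example multiprocessor counts only; check your own card's figure.
    for sm_count in (8, 16):
        for setting in (1, 4, 8, 15):
            total = blocks_in_flight(setting, sm_count)
            print(f"{sm_count} multiprocessors x {setting} per SM -> {total} blocks")
```

Under that assumption, raising the setting scales the amount of work per launch linearly, which is why higher values can hide more latency but also run long enough to affect the display.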

"pulsefinding periods per launch":
As for the other pulsefinding setting, that's more usability oriented, directly affecting display lag/stuttering. Assuming a relatively recent Fermi or Kepler class card, getting the process priority right for your system would likely have a more noticeable throughput & utilisation impact, though all three parameters tuned 'correctly' would indeed make small throughput or usability gains that combine.

HTH,

Jason

Last Edit: April 10, 2013, 07:00:40 AM by Jason G