Crunchers Anonymous

Help Desk => Questions => Topic started by: tommcg on May 20, 2014, 01:09:28 PM

Title: 64-bit package still runs 32-bit cuda50
Post by: tommcg on May 20, 2014, 01:09:28 PM
Both the 0.41 64-bit Windows installer package and the individual 64-bit cuda50 package contain only a 32-bit executable.  Where can I find the real 64-bit package or binaries?

Thx.
Title: Re: 64-bit package still runs 32-bit cuda50
Post by: arkayn on May 20, 2014, 03:27:17 PM
During testing, it was found that the 64-bit executables were slower than the 32-bit versions. The choice was made to release 32-bit versions for all packages.
Title: Re: 64-bit package still runs 32-bit cuda50
Post by: William on May 20, 2014, 06:32:13 PM
The difference is in the app_info.xml file created by the installer, as it contains entries expected by 64-bit BOINC and necessary to retain work in progress.
Title: Re: 64-bit package still runs 32-bit cuda50
Post by: tommcg on May 20, 2014, 10:56:40 PM
Quote from: arkayn on May 20, 2014, 03:27:17 PM
During testing, it was found that the 64-bit executables were slower than the 32-bit versions.

That seems really odd, unless a large portion of the in-memory data consists mostly of pointers, like a pointer-based B-tree index or such.  Or if the code has x86-specific asm instead of using SSE intrinsics that work on both platforms.  I've written compression code using SSE intrinsics, and it is at least 30% faster as a 64-bit app than as a 32-bit app.
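
To make the pointer-heavy case concrete, here is a minimal sketch in C. The struct and function names are hypothetical and are not taken from the SETI@home sources. It compares a tree node that links to its children with raw pointers against one that links by 32-bit array index. On typical 64-bit builds the pointer version comes out at 24 bytes per node (4 for the key, 4 of padding, 8 per pointer), against 12 bytes on typical 32-bit builds; the index version stays at 12 bytes either way.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tree node that links to its children with raw pointers.
 * Its size follows sizeof(void *), so it grows in a 64-bit build. */
struct ptr_node {
    int32_t key;
    struct ptr_node *left;
    struct ptr_node *right;
};

/* Same node, but children are 32-bit indices into a node array.
 * Its size does not depend on the pointer width of the build. */
struct idx_node {
    int32_t key;
    uint32_t left;
    uint32_t right;
};

/* Total bytes needed for n_nodes nodes of the given per-node size. */
static size_t tree_bytes(size_t n_nodes, size_t node_size) {
    assert(node_size != 0 && n_nodes <= SIZE_MAX / node_size);
    return n_nodes * node_size;
}
```

Comparing `tree_bytes(n, sizeof(struct ptr_node))` with `tree_bytes(n, sizeof(struct idx_node))` in a 32-bit and a 64-bit build shows how much extra memory traffic the wider pointers would add for such a structure.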

Is the source code available somewhere to browse?

Thx.

Title: Re: 64-bit package still runs 32-bit cuda50
Post by: Claggy on May 20, 2014, 11:17:47 PM
For Cuda it's the extra address space that makes Cuda64 apps slower.

Stock is in seti_boinc; Optimised and xbranch are in branches/sah_v7_opt:

Porting and optimizing SETI@home (http://setiathome.berkeley.edu/sah_porting.php)

https://setisvn.ssl.berkeley.edu/trac/browser

Claggy
Title: Re: 64-bit package still runs 32-bit cuda50
Post by: Jason G on May 26, 2014, 07:20:22 PM
Correct.  Simply put, with a lot of memory-bound operations at this time (meaning mostly pointer arithmetic) and few latency-hiding mechanisms used, pointers being double the size means double the size of code.  Since loading code induces various latencies, and larger pointers sap precious GPU register space, 32-bit GPU code is just faster on Windows (Linux is a different, special case, where 32-bit won't build due to OS and Cuda toolkit limitations).
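
A back-of-the-envelope sketch of the register-space point, in plain C rather than Cuda. It assumes 32-bit-wide GPU registers, so that a 64-bit pointer occupies two of them. The function name and the example counts are hypothetical; this is not code from the SETI@home applications.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: each register holds 32 bits, so wider values span several. */
enum { REGISTER_BYTES = 4 };

/* Number of 32-bit registers needed to keep n_pointers pointers live at
 * once, for a given pointer width in bytes (4 for 32-bit, 8 for 64-bit). */
static size_t registers_for_pointers(size_t n_pointers, size_t pointer_bytes) {
    assert(pointer_bytes > 0);
    /* Round up to whole registers per pointer. */
    size_t regs_per_pointer =
        (pointer_bytes + REGISTER_BYTES - 1) / REGISTER_BYTES;
    return n_pointers * regs_per_pointer;
}
```

Under that assumption, a kernel that keeps six pointers live needs `registers_for_pointers(6, 4)` = 6 registers for them in a 32-bit build and `registers_for_pointers(6, 8)` = 12 in a 64-bit build, leaving fewer registers for everything else.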

As with everything though, things can change and evolve.  As we have no use whatsoever for huge amounts of GPU memory within one application instance (yet!), focussing on making native 64-bit Cuda binaries for Windows isn't high on any priority list.  That will possibly change as newer hardware, drivers, and toolkits arrive and latency-hiding techniques are employed.

In general though, bear in mind that using huge amounts of memory (either host or GPU) tends to be an indicator of poor optimisation, not good optimisation.