Hardware question on cooling

Started by Richard Haselgrove, August 16, 2013, 02:07:29 AM

Anyone mind if I ask for suggestions on this problem, please?

I mentioned recently that I was suffering from 'desktop freezes' on my big machine (about a year old this weekend): BOINC stops updating on my BoincView remote, and when I look at the machine locally I see the Windows desktop intact (no artefacts or error messages), but the system clock hasn't updated for half an hour or more.

I built the machine with a single GPU (GTX 670), but added a second identical one. I think I've narrowed the problem down to the first GPU overheating.

The cards are "Gainward Phantom" models. They have a full-size heatsink with three heat pipes, and two enclosed fans which suck in air from underneath, and exhaust it through the rear of the case through a slotted mounting bracket.

The GPUs are about 2.5 slots thick (with a double mounting bracket), and the motherboard has the PCIe connectors three case-slots apart. So the upper GPU only has about half a slot's worth of space to get cool air to its fan intakes - and that air passes over the exposed PCB surface of the lower GPU, so it'll heat up a lot on the way.

I have plenty of space in a big HAF case, and plenty of fans oriented correctly to cool the CPU - but how can I best get extra cool air to the upper GPU?

I'll try to take a photograph and mark the existing airflows, while I test the theory by running with a single GPU for a while.

Any ideas welcome, but preferably not ones which require advanced machining/fabrication skills!

August 16, 2013, 02:38:23 AM #1 Last Edit: August 16, 2013, 02:40:53 AM by Richard Haselgrove
OK, here's the beast.

[photo: the open case]

Main fans and airflow (I did the 'tissue test'):

Huge intake fan at bottom right, sucks room air in through the front panel.
CPU fans both blow towards the left, as does a rear case fan level with the CPU.
There's an upper case fan above the CPU, blowing CPU heat up through the roof.
The PSU has its own intake grille underneath, and exhausts to the left through the back panel. Shouldn't be any heat escape from the PSU into the case.

The current GPU is in position 2 - This is the one I've removed from the upper slot, next to the CPU.




I have 2 fans located in my side panel that put additional air directly onto the GPU's and CPU.

Quote from: arkayn on August 16, 2013, 02:40:38 AM
I have 2 fans located in my side panel that put additional air directly onto the GPU's and CPU.

Ah - good point. My side panel is drilled and grilled, too - 2x150 mm, or 1x180mm, if I've got my measurements right.

Another possibility is a second fan in the roof, at the front blowing down to set up a clockwise circulation. Just have to remember not to put a pile of papers down on the top of the box...

My mother's machine had some similar freezes when I first replaced its dead motherboard (both Gigabyte brand, midrange, Z77 replacing the Z68 chipset). The replacement would hard-freeze as described about once every 3-4 days. I had temperature monitors in the tray too, so I could see it wasn't temps in my case.

What got rid of it completely (been no freezes for a couple of months) was some combination of the following:
- Motherboard BIOS update & reset BIOS to defaults.
- Force-update the Intel chipset drivers, checking the dates & versions to confirm each Intel(R) device really does update, because installing the inf isn't enough.
Get the zip form, unzip it & point the driver updates to the 'All' folder.
- Updated the video driver for the sake of it (unlikely culprit though).

Probably a few more minor things in there too, like updating some other drivers.
It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change.
Charles Darwin
---
Chaos: When the present determines the future, but the approximate present does not approximately determine the future.
Edward Lorenz

Thanks - mine's a Gigabyte Z77 too, as you probably saw. I got the latest BIOS when I built the machine, but there might have been an update since then - I'll check.

What led me to suspect an overheating GPU was a succession of freezes, and finally, this morning, discovering that GPU 0 was running but had downclocked (running 2x x41zc/cuda50). I cleared that, but went on with my planned experiment of re-trying GPUGrid, which puts a much more continuous strain on the GPUs - something like 8 hours continuous, without even a break between tasks. It froze again within five minutes.

When I only had one card in the machine (and again this afternoon, while I only have one in for testing), it seems to run fine. I've also tried moving it to a potentially cooler location, but I think the weather has conspired against that.
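
(In case it helps anyone watching for the same thing: a quick-and-dirty logger along these lines - a rough sketch, assuming nvidia-smi is on the PATH and the driver is new enough to support --query-gpu - records each card's temperature and SM clock, so the last entries before a freeze show whether a card was hot or had downclocked.)

```python
# Rough sketch: log per-GPU temperature and SM clock every 30 seconds, so the
# last few lines before a freeze show whether a card was hot or downclocked.
# Assumes nvidia-smi is on the PATH and the driver supports --query-gpu.
import subprocess, time
from datetime import datetime

QUERY = ["nvidia-smi",
         "--query-gpu=index,temperature.gpu,clocks.sm",
         "--format=csv,noheader,nounits"]

with open("gpu_log.csv", "a") as log:
    while True:
        stamp = datetime.now().isoformat(timespec="seconds")
        try:
            out = subprocess.check_output(QUERY, text=True)
        except (OSError, subprocess.CalledProcessError) as err:
            out = f"query failed: {err}\n"
        for line in out.strip().splitlines():
            log.write(f"{stamp}, {line}\n")
        log.flush()
        time.sleep(30)
```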

August 16, 2013, 03:36:43 AM #6 Last Edit: August 16, 2013, 03:44:01 AM by Jason G
Yeah, temps are still definitely the first likely suspect, though they turned out to be a red herring in my case. Setting EVGA Precision to show the GPU temp in the tray confirmed it was running cool.

If the temps issue doesn't turn out to be the culprit, then the next most likely is specifically the PCI Express driver component of the Intel chipset inf update utility. Forcing that to update is a bit of a dance, but it ultimately dropped DPC latencies as well, implying improved driver quality.

All my 'system devices' marked Intel(R) now show:
Driver Date: 9/07/2013
Driver Version: 9.1.9.1004

except for one legacy PCI bridge, which uses an MS driver.
It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change.
Charles Darwin
---
Chaos: When the present determines the future, but the approximate present does not approximately determine the future.
Edward Lorenz

I was running TThrottle - for monitoring, rather than control. Usually showed GPU 0 was 15 - 20*C warmer than GPU 1, though I thought still within tolerance.

Quote from: Richard Haselgrove on August 16, 2013, 03:42:20 AM
I was running TThrottle - for monitoring, rather than control. Usually showed GPU 0 was 15 - 20*C warmer than GPU 1, though I thought still within tolerance.

< 70 degrees C? Or substantially warmer?
It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change.
Charles Darwin
---
Chaos: When the present determines the future, but the approximate present does not approximately determine the future.
Edward Lorenz

The single GPU running GPUGrid now (with the case side panel off, so the air flow isn't so well guided) is showing 63.0*C on TThrottle. From memory, the previous readings were well into the 70s for GPU 0, high 50s for GPU 1.

What fans would you suggest for the side panel? Manual confirms ready for 2x 140mm, or 1x 200mm (I think I'd go for 2x140). I think I'm out of motherboard power points, so they would need a Molex connector or adapter - do most manufacturers include those? (some mention them, most don't). I'd want to preserve the current - very quiet - running: that was another reason for the move, I literally couldn't hear it running in the workroom with other computers. And I can still hardly hear it in a room by itself!

Yeah, 70 degrees C is where the turbo boost mechanism takes over, fiddling with the clocks, so for max output you want to stay under that (working with it instead of against it).

I personally like Noctua fans, because they are quiet (terrible colour though), but any decent 120-140mm choice should really be pretty quiet unless you go for ridiculous CFM/rpm models.

For numbers of big fans I'd avoid using the mobo power altogether - it's just a tad easier on the mobo voltage regulators, which tend to be the first thing to fail (if anything) these days. Depending on the fan model/brand they come with some sort of Molex adaptor... at least my last 120mm & 140mm Noctua ones did. Find a good picture of the box contents & there should be a clear picture showing the screws/adaptors & the optional silicone shock-mount alternatives.

It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change.
Charles Darwin
---
Chaos: When the present determines the future, but the approximate present does not approximately determine the future.
Edward Lorenz

August 16, 2013, 06:45:30 PM #11 Last Edit: August 16, 2013, 09:39:39 PM by Richard Haselgrove
It's a cooler day today, so the solo card is showing 57*C. Once the coffee's kicked in, I'll put the second card back in, and - assuming I haven't broken it - take some more readings, mess about with the airflow, try an external room fan - you know the sort of thing.

Local supplier carries Noctua, but doesn't have any true 140mm in stock - and I don't like the colour, either. I'll look around, following your tips - ta.

Later - put the original card back in, and the temps are around 78*C / 58*C (open case). I tried sliding an air baffle (sheet of corrugated cardboard) between the two cards, to prevent the heat from the lower one reaching the upper one - but the upper temp went to 82*C and rising, so I pulled it out quickly. I think the obstructed air flow made things worse instead of better.

So, extra fans it is, then. Supplier has a brace of http://www.corsair.com/cpu-cooling-kits/air-series-fans/air-series-af140-quiet-edition-high-airflow-140mm-fan.html in stock, and a cable adapter - fetching now. I'll let you know how I get on.

(oh, and the motherboard does have another full-length PCIe slot, and the case would take a double-width card there - but I'd be obstructing audio and USB connectors, and the second card would be drawing in air from very close to the PSU - not worth it just to get a better airflow gap for the upper card. And the slot is only wired x4, anyway)

August 17, 2013, 12:54:31 AM #12 Last Edit: August 17, 2013, 01:15:05 AM by Jason G
Yeah, I've also heard that a stick-on rubber foot or similar, increasing the separation a tiny bit, might help too - depending on how much you can stretch things with those massive custom heatsinks.

[Later:] Looking again at the pics, I'm inclined to believe you might be running under a slight vacuum as well, as opposed to the desirable positive-pressure situation. The GPUs, rear fan & PSU are all exhausting, and there's one intake at the front (?). Adding more intake, to the point where it overtakes the exhaust, should help quite a bit, since less air mass moves when it's rarefied, and a denser mass of air will carry more heat.
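
To put rough numbers on that (back-of-the-envelope only, nothing specific to this case or these fans): the heat an airstream carries off is roughly density x volume flow x specific heat x temperature rise, so at the same fan CFM, denser air removes proportionally more heat. A quick sketch:

```python
# Back-of-the-envelope: heat carried away by an airstream.
# Q = rho * V_dot * c_p * dT  (density * volume flow * specific heat * temperature rise)

CFM_TO_M3S = 0.000471947  # 1 cubic foot per minute in m^3/s
C_P_AIR = 1005.0          # specific heat of air at constant pressure, J/(kg*K)

def heat_removed_watts(cfm, delta_t_c, rho=1.20):
    """Watts carried off by 'cfm' of airflow that warms by delta_t_c degrees C.

    rho is the air density in kg/m^3 (~1.20 at room temperature and sea level,
    lower if the air inside the case is rarefied)."""
    volume_flow = cfm * CFM_TO_M3S   # m^3/s
    mass_flow = rho * volume_flow    # kg/s
    return mass_flow * C_P_AIR * delta_t_c

# 100 CFM warming by 10 C carries off roughly 570 W;
# the same flow at 10% lower density carries off roughly 10% less.
print(heat_removed_watts(100, 10))            # ~569 W
print(heat_removed_watts(100, 10, rho=1.08))  # ~512 W
```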
It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change.
Charles Darwin
---
Chaos: When the present determines the future, but the approximate present does not approximately determine the future.
Edward Lorenz

Well, she's been running for a couple of hours now with those two extra side-panel fans, both blowing inwards to create overpressure. The key temperature has dropped a couple of degrees - the GPU cards themselves have active fan control, and the fan speeds have dropped too, so there's probably more headroom than those two degrees imply.

I doubt there was ever anything approaching vacuum - the case leaks like a sieve (literally - the only solid wall is the side panel behind the motherboard). There's a 200mm fan in the lower front panel (bottom right of photograph), blowing air over the hard disk cage and into the case. We can leave the PSU out of the pressure equation, because it takes inlet air direct from the room (underneath - outside the case), and exhausts to the rear (again outside the case).

I think my next task is to tie back those four 6-pin GPU power cables, so they can't obstruct the air flow.

Sounds good :) Yeah, the fans dropping's a good sign. Personally I'd make a custom fan profile (in EVGA Precision or equivalent) slightly more aggressive than the default, knowing there are two stacked cards. Depends on how much air-rushing sound you can tolerate, though. Sustaining an overclock on my (single) 480 sometimes gave the feeling of sitting in a Learjet taking off, while the 680 is near ambient noise levels at all times. With EVGA's custom cooler on the 780, at stock frequencies, I do hear some slight white noise, but really only if I stay still enough to focus on it.
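
By 'slightly more aggressive' I just mean shifting the temperature-to-duty points up a little. Purely as an illustration (made-up numbers, not anyone's actual defaults - in EVGA Precision or similar you'd drag points on a curve rather than write code), the mapping amounts to something like:

```python
# Illustrative custom fan curve: map GPU temperature (C) to fan duty (%).
# The points are made-up examples, biased a little cooler than a typical
# stock curve so the card never sits right on its boost/throttle threshold.
CURVE = [(30, 30), (50, 45), (60, 60), (70, 80), (80, 100)]

def fan_duty(temp_c):
    """Linearly interpolate fan duty (%) between the curve points."""
    if temp_c <= CURVE[0][0]:
        return CURVE[0][1]
    for (t0, d0), (t1, d1) in zip(CURVE, CURVE[1:]):
        if temp_c <= t1:
            return d0 + (d1 - d0) * (temp_c - t0) / (t1 - t0)
    return CURVE[-1][1]

# e.g. at 65 C the fans would run at about 70% duty.
print(fan_duty(65))
```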
It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change.
Charles Darwin
---
Chaos: When the present determines the future, but the approximate present does not approximately determine the future.
Edward Lorenz

I think I have at least 2 slots between my 2 cards; the 660 runs at 58*C and the 670 is around 63*C.

OK, so it's started doing it again - lockup with a static system clock, no response to mouse/keyboard, no BOINC activity on remote BoincView or Manager. I don't think it's heat this time - the weather is getting cooler, and the 'frozen' desktop this time showed TThrottle reporting the heat-prone GPU at 75*C, the cooler one at 55*C-ish.

Been doing some error-recovery app testing at GPUGrid recently - no lockup problems during that. Tried switching back to production-mode, but one of those failed so horribly this morning - multiple driver restarts on reboot, ending up with a black screen and repeated system boot beeps - that I suspended that task in safe mode, and aborted it when I got the machine back up. Since then, been running SETI only (x41zc) on the GPUs - but it still happened again. Before that, I got three of those false 'bad workunit header' job errors - http://setiathome.berkeley.edu/results.php?hostid=7070962&state=6 - though that didn't kill either BOINC or Windows.

Host has an SSD boot drive, and two SATA HDs as a RAID 1 mirrored data drive, using the Intel motherboard RAID controller and drivers. Every time the host crashes, Intel schedules a disk consistency check, and I have a *slight* suspicion the next crash is correlated with Intel reporting that the consistency check was successful and no errors were found. Nevertheless, usually one or two tasks rewind to zero %age progress (but significant elapsed time), suggesting that the checkpoint files couldn't be read after the crash.
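
(For what it's worth, a checkpoint only survives a crash cleanly if it's written atomically - the usual pattern is write-to-a-temp-file-then-rename, roughly as in the generic sketch below; whether the apps in question actually do that, I can't say.)

```python
# Generic sketch of crash-safe checkpointing (write-then-rename), not any
# project's actual code: if power drops mid-write, the old checkpoint survives.
import json, os, tempfile

def save_checkpoint(path, state):
    """Write 'state' to a temp file in the same directory, flush it to disk,
    then atomically rename it over the old checkpoint."""
    d = os.path.dirname(os.path.abspath(path)) or "."
    fd, tmp = tempfile.mkstemp(dir=d)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)   # atomic on both Windows and POSIX
    except Exception:
        os.remove(tmp)          # clean up the partial temp file on failure
        raise

save_checkpoint("task_checkpoint.json", {"progress": 0.42, "elapsed_s": 5130})
```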

The other significant factor may be that I installed the September Windows security updates last night, and the problems (re-)started after that. These are the ones I installed:

[screenshot: list of installed Windows updates]

Note that Office 2007 updates 2760411 and 2760588 were (successfully) installed twice each - Microsoft have acknowledged this to be a known problem, and they're working on it. But is anything else in that list known (or likely) to have started causing stability problems?

September 13, 2013, 04:05:59 AM #17 Last Edit: September 13, 2013, 04:19:48 AM by Claggy
As it happens, my i7-2600K has been having stability problems (again) too: my system either locks up totally, mouse doesn't move, etc.,
then blue-screens with a hardware problem, or with an 'A clock interrupt was not received on a secondary processor within the allocated time interval' blue screen.
It normally happens if the CPU load is near or above 100 watts; cleaning the Corsair cooler generally returns it to stability (I've just done that).
All summer I've been limiting it to only 50% of cores too; at the moment it's doing CPU Astropulse, which nudges the CPU power usage up to 100 watts though.
(Possible solutions for me: a BIOS update (but I'm locked into this BIOS if I want to keep my overclock), a better PSU (I doubt it), better M/B VRM cooling (I doubt it);
I've already tried a better Corsair cooler, which hasn't made any real difference.)

Claggy

September 13, 2013, 04:17:45 AM #18 Last Edit: September 13, 2013, 04:25:21 AM by Richard Haselgrove
I'm running 6 cores out of 8, and TThrottle temps are 75 - 78 - 75 - 68 as I type. I don't have an easy input power monitor to hand for the CPU only - she's drawing 440W from the wall, on a 1200AX PSU. Where did you see 100W?

CPU-Z thinks my max should be 77W:
[screenshot: CPU-Z showing Max TDP of 77W]

September 13, 2013, 04:24:36 AM #19 Last Edit: September 13, 2013, 04:29:17 AM by Claggy
Quote from: Richard Haselgrove on September 13, 2013, 04:17:45 AM
I'm running 6 cores out of 8, and TThrottle temps are 75 - 78 - 75 - 68 as I type. I don't have an easy input power monitor to hand for the CPU only - she's drawing 440W from the wall, on a 1200AX PSU. Where did you see 100W?
CPUID Hardware Monitor has power monitoring on my M/B: with two AP and one SETI v7 running, Package is at ~91 watts, IA Cores at ~85 watts, GT at 0.17 watts, and Uncore at ~5.5 watts.

CPU-Z reports Max TDP as 95 watts here.

SIV also displays the same values as a pop-up, bottom right.
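
(Those per-domain figures come from the CPU's built-in RAPL energy counters, which, as far as I know, is what HWMonitor and SIV are reading. If anyone wants to sanity-check the package number on a Linux box, something along these lines works - a rough sketch only; the sysfs path is an assumption and varies by kernel and CPU:)

```python
# Rough sketch: estimate CPU package power from the RAPL energy counter
# exposed by the Linux powercap interface. The path is an assumption; it may
# differ by kernel/CPU, reading it can require root, and counter wraparound
# is not handled here.
import time

RAPL = "/sys/class/powercap/intel-rapl:0/energy_uj"  # package domain (assumed)

def read_uj():
    with open(RAPL) as f:
        return int(f.read())

e0, t0 = read_uj(), time.time()
time.sleep(5)
e1, t1 = read_uj(), time.time()

# energy_uj is cumulative microjoules; power = delta energy / delta time
watts = (e1 - e0) / 1e6 / (t1 - t0)
print(f"average package power: {watts:.1f} W")
```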

Claggy
