temp warning (too high and too cold)

Started by JStateson, May 25, 2010, 05:15:03 AM


JStateson

I have had Collatz tasks hang twice in the last week on a pair of 9800gtx+ GPUs (both tasks hung each time).  It was not obvious, because one task was about 75% done and the other about 50% when they hung.  I finally noticed that the temps for both GPUs were in the low 50s instead of the mid to high 70s.  In both cases several days went by before I noticed the problem.

Anyway, it would be nice to be able to highlight GPU temps that are either too high or too low.  It is true that some CUDA projects use very little of the GPU, so there might be some false positives.

Pepo

Quote from: BeemerBiker on May 25, 2010, 05:15:03 AM
Anyway, it would be nice to be able to highlight gpu temps that are either too high or too low.

There could be an additional problem that BT could be monitoring X machines and these could contain 1..n GPUs, while each of them could have a very different idea of "too high or too low". Thus the highlight temps would have to be stored in BT per-GPU.
Peter
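
For illustration only, here is a minimal Python sketch of the per-GPU idea above: limits are stored per (computer, GPU index) pair rather than globally. The table layout, the `TempLimits` and `classify` names, the second host name and all threshold values are assumptions made for this example; nothing here reflects how BT actually stores its settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TempLimits:
    low: float   # below this, the GPU is probably idle or its task has hung
    high: float  # above this, the GPU is running too hot


# Hypothetical table keyed by (computer name, GPU index); values are made up.
LIMITS = {
    ("jys2x290", 0): TempLimits(low=55.0, high=85.0),
    ("jys2x290", 1): TempLimits(low=55.0, high=85.0),
    ("otherbox", 0): TempLimits(low=40.0, high=70.0),
}


def classify(computer: str, gpu: int, temp: float) -> str:
    """Return 'too_low', 'too_high' or 'normal' for one temperature reading."""
    limits = LIMITS.get((computer, gpu))
    if limits is None:
        return "normal"  # no limits configured for this GPU, so never highlight
    if temp < limits.low:
        return "too_low"
    if temp > limits.high:
        return "too_high"
    return "normal"
```

The same 52 degree reading is flagged on one machine and treated as normal on another, which is the reason a single global limit would not be enough.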

fred

Quote from: BeemerBiker on May 25, 2010, 05:15:03 AM
I have had Collatz tasks hang twice in the last week on a pair of 9600gtx+ GPUs (both tasks hung each time).  It was not obvious, because one task was about 75% done and the other about 50% when they hung.  I finally noticed that the temps for both GPUs were in the low 50s instead of the mid to high 70s.  In both cases several days went by before I noticed the problem.

Anyway, it would be nice to be able to highlight GPU temps that are either too high or too low.  It is true that some CUDA projects use very little of the GPU, so there might be some false positives.
You can set up TThrottle to execute a batch file or send an email, if that's Windows of course.
I made an entry in the todo list. Warning for GPU temperature low or high.

fred

Quote from: Pepo on May 25, 2010, 08:46:50 AM
Quote from: BeemerBiker on May 25, 2010, 05:15:03 AM
Anyway, it would be nice to be able to highlight gpu temps that are either too high or too low.

There could be an additional problem that BT could be monitoring X machines and these could contain 1..n GPUs, while each of them could have a very different idea of "too high or too low". Thus the highlight temps would have to be stored in BT per-GPU.
Hmm, you have a point there. Makes it somewhat more difficult.
I may have to change the warning rules, make it a row/column list with additional options like temperature and computer.

JStateson

Quote from: Pepo on May 25, 2010, 08:46:50 AM
Quote from: BeemerBiker on May 25, 2010, 05:15:03 AM
Anyway, it would be nice to be able to highlight gpu temps that are either too high or too low.

There could be an additional problem that BT could be monitoring X machines and these could contain 1..n GPUs, while each of them could have a very different idea of "too high or too low". Thus the highlight temps would have to be stored in BT per-GPU.

This could be solved by adding rules or a scripting capability to BT, much like the rules in TT.  The user could then come up with project-specific actions.  For example, assuming the BT "Tasks" column titles are objects that could be parsed by a scripting mechanism built into BT, the user could write quite complex rules such as

Rule 1:  On (Temperature.gpu.value < 55 && Temperature.gpu.samples > 10) && Project.value=="Collatz Conjecture" then launch(whatever.exe) && html(mailto:someone@example.com?subject=temp too low&body=Collatz) && stop_processing_rule

Rule 2:  If (Temperature.gpu.value < 55 && Temperature.gpu.samples > 10) && Project=="Collatz Conjecture" then Temperature.gpu.highlight=WARNING else Temperature.gpu.highlight=NORMAL


This could be done as an add-in to BT assuming an API was available.
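
As a rough sketch of how Rule 2 might behave, the Python below keeps a counter of low readings and only switches the highlight to WARNING once that counter passes a limit. It assumes that "samples" means the number of consecutive readings below the threshold, which the rule text does not actually say; the `LowTempRule` class and its method names are hypothetical and are not part of BT or TThrottle.

```python
class LowTempRule:
    """Highlight a task when its GPU stays cold for too many readings in a row."""

    def __init__(self, project: str, threshold: float, min_samples: int):
        self.project = project
        self.threshold = threshold
        self.min_samples = min_samples
        self.count = 0  # consecutive readings below the threshold (assumed meaning)

    def update(self, project: str, gpu_temp: float) -> str:
        """Feed one reading and return the highlight state for the temperature cell."""
        if project == self.project and gpu_temp < self.threshold:
            self.count += 1
        else:
            self.count = 0  # a normal reading or another project resets the streak
        return "WARNING" if self.count > self.min_samples else "NORMAL"


# Mirrors the numbers in Rule 2: below 55 for more than 10 samples.
rule = LowTempRule("Collatz Conjecture", threshold=55.0, min_samples=10)
```

Requiring a streak rather than a single reading is one way to reduce the false positives mentioned in the first post, since a task that is merely starting up only produces a few cold samples.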

fred

Something like that, but complex rules == a lot of work, or more overhead. The former is no problem, but I would like to avoid the latter.

JStateson

Quote from: fred on May 25, 2010, 09:46:32 AM
Quote from: BeemerBiker on May 25, 2010, 05:15:03 AM
I have had Collatz tasks hang twice in the last week on a pair of 9600gtx+ GPUs (both tasks hung each time).  It was not obvious, because one task was about 75% done and the other about 50% when they hung.  I finally noticed that the temps for both GPUs were in the low 50s instead of the mid to high 70s.  In both cases several days went by before I noticed the problem.

Anyway, it would be nice to be able to highlight GPU temps that are either too high or too low.  It is true that some CUDA projects use very little of the GPU, so there might be some false positives.
You can set up TThrottle to execute a batch file or send an email, if that's Windows of course.
I made an entry in the todo list. Warning for GPU temperature low or high.


The problem with this is that TThrottle has to be set up on each system, and TThrottle only runs on Windows.  Now, if TThrottle could accept packets from other systems running TThrottle (or my modification to Linux's sensors-applet), then the user would need to set up rules on only one system, and TThrottle would display temps from all systems that sent it packets.

Currently, TThrottle monitors port 31417 and looks for "BT\0" then routes, for example,

"<TThrottle><PV 1.74><AC 0><TC 1><TG 1><DC 5><DG 45><CT0 72.0><CT1 73.0><CT2 74.0><CT3 81.0><GT0 79.0>\0"

back to the source ip (BT)


If it could look for and process, for example,
"TT<TThrottle jys2x290><PV 1.74><AC 0><TC 1><TG 1><DC 5><DG 45><CT0 72.0><CT1 73.0><CT2 74.0><CT3 81.0><GT0 79.0>\0"

it would know that jys2x290 was the hostname of the system whose IP address was the source of the incoming TCP packet, and that the temperatures that follow were collected using TThrottle <PV 1.74> or some other application, such as Linux's sensors-applet <SA 1.74>.

TThrottle would treat the incoming temperatures as if it had measured them itself and display them on the graph.  The hostname would have to be a new property for TThrottle to maintain.

I suspect this would best be done in BT and not TThrottle but TThrottle has all the rules processing and temp graph capability.
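
To make the packet layout above concrete, here is a small Python sketch that pulls the tags out of such a string. It is based only on the two sample strings in this post. Reading `CT0`..`CT3` as CPU core temperatures and `GT0` as a GPU temperature is an assumption, and the `parse_packet` function and its result keys are hypothetical names, not anything TThrottle or BT provides.

```python
import re

# One tag looks like <NAME> or <NAME value>.
TAG = re.compile(r"<(\w+)(?: ([^<>]*))?>")


def parse_packet(text: str) -> dict:
    """Split a TThrottle-style string into host, version and temperature lists."""
    # Drop the trailing NUL terminator; any prefix such as "TT" is simply skipped.
    fields = dict(TAG.findall(text.rstrip("\0")))

    def temps(prefix: str) -> list:
        found = {}
        for name, value in fields.items():
            match = re.fullmatch(prefix + r"(\d+)", name)
            if match:
                found[int(match.group(1))] = float(value)
        return [found[i] for i in sorted(found)]

    return {
        "host": fields.get("TThrottle") or None,  # only present in the proposed format
        "version": fields.get("PV"),
        "cpu_temps": temps("CT"),  # assumed meaning of the CTn tags
        "gpu_temps": temps("GT"),  # assumed meaning of the GTn tags
    }
```

With the proposed format the hostname travels inside the packet itself, so the receiver would not need to work it out from the source IP address.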


fred

A remote feature is planned, but I'm not sure whether to include it in BT or TThrottle.
Adding the host name is in the todo list, but BT already knows which computer the packet is coming from.