need help making up a rule

Started by JStateson, August 24, 2019, 11:08:24 AM

JStateson

This is either a hardware or a software problem, but it would be nice if I could spot it when it first occurs.  I am going to post over at BOINC also, since the problem could possibly be debugged better if I knew more about what was happening.

---once every couple of days---

On a 5 GPU rig, one of the GPUs crunches for 4-5 seconds then goes on to another work unit.  A queue of "waiting to run" starts building up.  Because there are 4 other working GPUs, they pull from this queue, so the queue grows only slowly.  After an hour or two there might be 40 items in the queue.

sudo /etc/init.d/boinc-client restart  => does not always work
sudo shutdown now => looks like it works but I generally cycle the power after a few minutes of waiting

When the system boots back up I run a script to set the fans to 100%, otherwise temps get up past 80 on a pair of GTX 1060s.

I failed to make a note of which GPU had the problem, if indeed the problem is a single GPU.  The only way to tell is to stop the fan, see which one reports 0 speed, then look up the bus id and see which GPU it matches in coproc_info.xml.  I have not done this yet but will the next time this happens.  It would be nice if BOINC reported the same GPU# that nvidia reports in their diagnostics.  BOINC assigns 0 to the best GPU (like a GTX 1070 or RTX 2080) and larger numbers to weaker GPUs.  Not sure why they bother to rank GPUs in the first place.
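
A possible shortcut for the bus id lookup, as a rough sketch only: nvidia-smi can print its own GPU index next to the PCI bus id and fan speed, and that bus id can then be compared by hand against coproc_info.xml. The function names below are made up for illustration, and the sample line in the comment is only what I would expect the CSV output to look like, not output captured from this rig.

```python
import csv
import io
import subprocess

# Query fields are standard nvidia-smi --query-gpu properties.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,pci.bus_id,fan.speed",
    "--format=csv,noheader",
]


def parse_gpu_rows(text):
    """Parse CSV rows such as '0, GeForce GTX 1070, 00000000:01:00.0, 45 %'."""
    rows = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(fields) != 4:
            continue  # skip blank or unexpected lines
        index, name, bus_id, fan = (f.strip() for f in fields)
        rows.append({"index": int(index), "name": name, "bus_id": bus_id, "fan": fan})
    return rows


def list_gpus():
    """Run nvidia-smi and return one dict per GPU (hypothetical helper)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_rows(out)
```

Calling list_gpus() on the rig and printing each row would give the nvidia index and bus id side by side, so a GPU whose fan has been stopped should show up by its fan reading without any guessing about BOINC's ranking.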

---back to the rule---

The most obvious thing is to see if there are more than X items in the "waiting to run" queue and then run a script that sends me a text message.  I already have a script that sends the text, but there is no "waiting to run" condition to trigger on, and I am pretty sure the CPU% was 99 percent, so I can't use that as a trigger either.  The CPU% is always 99 because I need to run "-nobs" to force the system to dedicate a thread 100% to the GPU.  So possibly the CPU is really idle and the 99 is simply a "busy polling all the time" symptom, which is a feature of the "-nobs" parameter.
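
If no built-in rule fits, the queue check could be done outside the rule system. The sketch below counts tasks in the text printed by "boinccmd --get_tasks" and runs an alert command when the count passes a limit. It rests on assumptions I have not verified: that "waiting to run" shows up in that output as a line like "scheduler state: preempted" (both the field name and the value are guesses and are left as parameters), and that alert_cmd is the existing text-message script. All function names are hypothetical.

```python
import subprocess


def count_tasks(text, field="scheduler state", value="preempted"):
    """Count lines of the form '<field>: <value>', ignoring case.

    The default field and value are assumptions; check them against
    real 'boinccmd --get_tasks' output before relying on them.
    """
    count = 0
    for line in text.splitlines():
        key, sep, val = line.strip().partition(":")
        if sep and key.strip().lower() == field.lower() \
                and val.strip().lower() == value.lower():
            count += 1
    return count


def queue_too_long(text, limit, **match):
    """True when more than 'limit' tasks match the field/value pair."""
    return count_tasks(text, **match) > limit


def check_and_alert(limit, alert_cmd):
    """Poll the BOINC client once; run alert_cmd if the queue is too long."""
    out = subprocess.run(
        ["boinccmd", "--get_tasks"],
        capture_output=True, text=True, check=True,
    ).stdout
    if queue_too_long(out, limit):
        subprocess.run(alert_cmd, check=False)  # e.g. the text-message script
        return True
    return False
```

Run from cron every few minutes with something like check_and_alert(10, ["/path/to/send_text.sh"]), where the path is a placeholder. Since the queue only grows slowly, a limit well above the normal backlog should avoid false alarms while still firing long before 40 items pile up.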