This could be a hardware or a software problem, but it would be nice if I could spot it when it first occurs. I'm going to post over at the BOINC forums as well, since the problem could probably be debugged better if I knew more about what is happening.
---once every couple of days---
On a 5-GPU rig, one of the GPUs crunches for 4-5 seconds and then moves on to another work unit. A queue of "waiting to run" tasks starts building up. Because the 4 other working GPUs keep pulling from this queue, it grows only slowly. After about an hour or two there might be 40 items in the queue.
sudo /etc/init.d/boinc-client restart => does not always work
sudo shutdown now => looks like it works but I generally cycle the power after a few minutes of waiting
When the system boots back up I run a script to set the fans to 100%, otherwise temps get up past 80 C on a pair of GTX 1060s.
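A rough sketch of what that fan script can look like (assuming the driver's Coolbits option is enabled, an X server is reachable, and one fan object per GPU; adjust the indices if a card exposes more than one fan):

#!/bin/bash
# Force every GPU fan to 100% after boot (manual fan control).
# On a headless rig, DISPLAY/XAUTHORITY may need to point at the X server.
export DISPLAY=${DISPLAY:-:0}
NUM_GPUS=$(nvidia-smi --query-gpu=count --format=csv,noheader | head -n1)
for ((i = 0; i < NUM_GPUS; i++)); do
    # Take manual control of the fan, then pin it at 100%.
    nvidia-settings -a "[gpu:${i}]/GPUFanControlState=1" \
                    -a "[fan:${i}]/GPUTargetFanSpeed=100"
done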
I failed to make a note of which GPU had the problem, if indeed the problem is a single GPU. The only way to tell is to stop the fan, see which one reports 0 speed, then look up the bus ID and see which GPU it matches in coproc_info.xml. I have not done this yet but will the next time this happens. It would be nice if BOINC reported the same GPU# that nvidia reports in its diagnostics. BOINC assigns 0 to the best GPU (like a GTX 1070 or RTX 2080) and larger numbers to weaker GPUs. Not sure why they bother to rank GPUs in the first place.
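Next time it happens, something along these lines should map the two numberings (assuming the stock Debian/Ubuntu data directory /var/lib/boinc-client, and that this client version records a <pci_info> block in coproc_info.xml; if the grep comes back empty the bus ID may simply not be recorded):

#!/bin/bash
# Print the PCI bus ID for each GPU as nvidia-smi sees it, then dump the
# names and bus IDs BOINC recorded, so the two numberings can be matched.
echo "=== nvidia-smi ordering ==="
nvidia-smi --query-gpu=index,name,pci.bus_id --format=csv,noheader
echo "=== BOINC coproc_info.xml ==="
grep -iE "<name>|bus_id" /var/lib/boinc-client/coproc_info.xml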
---back to the rule---
The most obvious thing is to check whether there are more than X items in the "waiting to run" queue and, if so, run a script that sends me a text message. I already have a script that sends the text, but it has nothing that counts "waiting to run" tasks, and I am pretty sure the %CPU was at 99 percent, so I can't use that as a trigger. However, the CPU% is always 99 because I run "-nobs" to force the system to dedicate a thread 100% to the GPU. So possibly the CPU is really idle and the 99 is simply a "busy polling all the time" symptom, which is a feature of the "-nobs" parameter.
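What I am thinking of trying is something along these lines (the pattern, threshold, and notify-script path are placeholders; the exact string that marks a "waiting to run" task needs to be confirmed by running boinccmd --get_tasks while the queue is actually building, since the wording varies by client version):

#!/bin/bash
# Watchdog: count backed-up tasks via boinccmd and send a text if too many.
WAIT_PATTERN="active_task_state: SUSPENDED"   # placeholder, confirm on the rig
THRESHOLD=20                                  # alert once the backlog passes this
NOTIFY=/home/me/send_text.sh                  # placeholder for the existing SMS script

count=$(boinccmd --get_tasks | grep -c "$WAIT_PATTERN")
if [ "$count" -gt "$THRESHOLD" ]; then
    "$NOTIFY" "BOINC: $count tasks waiting to run on $(hostname)"
fi

Dropped into cron every few minutes this would catch the backlog within an hour or so instead of a day. Depending on how gui_rpc_auth.cfg is set up, boinccmd may need --passwd or to be run from the BOINC data directory.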