Pass useful parameters like stuck GPU # to a batch file

Started by JStateson, November 17, 2019, 03:29:40 PM

JStateson

Occasionally a GPU hangs and never finishes a job, or it rejects a job within seconds of receiving it.  These events are quickly discovered using the rules mechanism.  Currently, a rule can execute a batch file, so an email or text message can easily be sent.  However, it would be advantageous to both the project and the user to be able to handle the situation automatically.  This can only be implemented if identifying parameters can be passed from BoincTasks to the handler.  At a minimum, the following parameters might be needed:

$temp---------temperature of the device, assuming TThrottle is running, or "none"
$device-------device id of the GPU (D0, D1, etc.) or just "CPU" if not a co-processor
$ip_address---identifies which system has the problem
$port---------if needed to communicate with the client, since some systems have multiple clients
$password-----if needed to communicate with the client
$rule_name----the name of the rule could have an identifying phrase useful to the handler
$computer-----name of the system
$platform-----handler might need to know which OS: Linux, Mac, Windows
$project------name of project would be useful to handler
$app----------name of app
$rule_count---number of times rule has been applied

Example of rule usage

if Elapsed time > 5 minutes,  project "SETI@home",  app "8.01 setiathome_v8 (cuda90)", run program:
d:\ProgramData\boinc\scripts\HandleRule.bat $rule_name $ $ip_address $device
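
To illustrate what a handler could do with such parameters, here is a minimal Python sketch that the batch file might hand its arguments to. It assumes the handler receives exactly three values in the order rule name, IP address, device; BoincTasks does not currently substitute these, and the names RuleEvent, parse_rule_args and describe are hypothetical.

```python
import sys
from typing import NamedTuple


class RuleEvent(NamedTuple):
    """Values a rule would pass to the handler (hypothetical layout)."""
    rule_name: str
    ip_address: str
    device: str


def parse_rule_args(argv):
    """Map three positional arguments onto named fields."""
    if len(argv) != 3:
        raise ValueError("expected: rule_name ip_address device")
    return RuleEvent(*argv)


def describe(event):
    """Build a one-line summary suitable for a log or an email body."""
    return (f"rule '{event.rule_name}' fired on {event.ip_address}, "
            f"device {event.device}")


# Only act when arguments were actually supplied on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    print(describe(parse_rule_args(sys.argv[1:])))
```

With named fields in hand, the handler can decide per device and per system what to do, rather than only reporting that some rule fired somewhere.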

With these additions, more useful rules can be contributed, as well as 3rd-party scripts or apps that reset the GPU, exclude it from use by the BOINC client, or shut down the client or system.

There is a discussion from January 2019 by the BOINC principals here, where they are considering adding XML files that basically duplicate a few of the BoincTasks rules.  Their XML includes, for example, instructions to enable or disable a particular NVIDIA board.
This functionality is partially present in BoincTasks, but it is missing the parameters required to identify the device and system having the problem.  Even if their "Computing prefs 2.0" is implemented, it would require those XML files to be present on each system.

The device_id can be 0, 1, 2, etc. for each type of GPU, so it must include a type such as nvidia, intel, amd, etc.
It needs to be consistent with the naming used by exclude_gpu, which appears to be
  [<type>NVIDIA|ATI|intel_gpu</type>]
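
As a rough sketch of why the type matters, the following Python turns a typed device string into an exclude_gpu element. The "TYPE:N" format (for example "NVIDIA:1") is purely an assumption made for illustration, as are the function names; the project URL is also assumed to be available to the handler, although only the project name appears in the parameter list above.

```python
import xml.etree.ElementTree as ET

# Type names as they appear in the exclude_gpu <type> element.
GPU_TYPES = {"NVIDIA", "ATI", "intel_gpu"}


def parse_device(device):
    """Split a hypothetical 'TYPE:N' device string into (type, number)."""
    gpu_type, sep, num = device.partition(":")
    if not sep or gpu_type not in GPU_TYPES or not num.isdecimal():
        raise ValueError(f"unrecognized device string: {device!r}")
    return gpu_type, int(num)


def exclude_gpu_xml(project_url, device):
    """Return an <exclude_gpu> element as text for the given device."""
    gpu_type, num = parse_device(device)
    root = ET.Element("exclude_gpu")
    ET.SubElement(root, "url").text = project_url
    ET.SubElement(root, "device_num").text = str(num)
    ET.SubElement(root, "type").text = gpu_type
    return ET.tostring(root, encoding="unicode")
```

A bare number such as "1" is rejected here, because without the type it could refer to the second NVIDIA, ATI, or Intel GPU.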