Rules & Actions

Started by John C, June 01, 2010, 11:56:06 PM

John C

Monitoring is great, but I'd love to have rules in BoincTasks to automate managing the farm in addition to watching it.

1.  The ability to take an action (such as running a script) when a computer loses its connection or goes a certain amount of time with no work.  When a computer freezes up, I want to be able to issue an SNMP command to the PDU to reboot it (power off & on).  If BoincTasks could do that, great!  But if that's too much to ask for, then let it at least run a batch file/script and I'll embed the SNMP command there.

2.  If a task goes over a threshold (more than 150% done, for example), then I'd love to have a rule that aborts it.  Collatz has a bad habit of not erroring out and having a task consume every GPU forever.  At a minimum, allow a rule where we can suspend a task that runs longer than a given maximum allowable time, so that we can later decide if it should be aborted or allowed to continue.

I am sure there are also other benefits of rules.  I'd love to be able to send myself an email (or a text using the email gateway) if a machine goes too long with no work.  But the two above are the ones that I most need.
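
To make request 1 concrete, here is a rough Python sketch of how such a rule could be evaluated. Everything in it is hypothetical (`HostStatus`, `needs_action`, `build_action_command`, the reason strings); it is not how BoincTasks works internally, just an illustration of "detect the condition, then hand off to a script".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class HostStatus:
    """Hypothetical snapshot of one computer in the farm."""
    name: str
    connected: bool
    last_work_seen: Optional[datetime]  # None if no work was ever observed


def needs_action(host: HostStatus, now: datetime,
                 max_idle: timedelta) -> Optional[str]:
    """Return a reason string if the rule fires, otherwise None."""
    if not host.connected:
        return "lost_connection"
    if host.last_work_seen is None or now - host.last_work_seen > max_idle:
        return "no_work"
    return None


def build_action_command(script: str, host: HostStatus,
                         reason: str) -> List[str]:
    """Build the argument list for a user-supplied script.

    The script itself (assumed to exist) would hold the SNMP call
    to the PDU, an email notification, or whatever else is wanted.
    """
    return [script, host.name, reason]
```

The returned list could be passed to `subprocess.run(...)`, so the rule engine never needs to know anything about SNMP or mail; the script path and its two arguments are the whole interface.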

BIG THANKS!

fred

Ok it's on the todo list.

fred

#2
Quote from: John C on June 01, 2010, 11:56:06 PM
If a task goes over a threshold (more than 150% done, for example), then I'd love to have a rule that aborts it.  Collatz has a bad habit of not erroring out and having a task consume every GPU forever.

Have you seen this in BoincTasks? It should be impossible, as the time left is adjusted all the time.

John C

Nope.  I have seen it in the regular BOINC Manager, but never in BT.  In BT, the task is pegged at 100% complete but then just keeps running.  I assumed you were getting the original value and then truncating it, but that might have been a wrong assumption.  BOINC Manager, however, is somehow calculating tasks as 200% complete.

jjwhalen

#4
My 2 cents:

1) It's true that BT doesn't count CPU % past 100%.  What does happen, however, is that "Time Left" may reach zero (-) and the task just keeps on truckin'.  A rule could certainly be set to detect this condition and place a curfew on the task after <interval>.

2) John C is correct that Collatz has a problem with run-on tasks.  I've aborted dozens of Collatz results (out of ~700) for this sort of behavior.  For one thing, Collatz does not like to be suspended and restarted, although that's not the only reason.

3) On the other hand, there are several projects whose tasks may legitimately run beyond "Time Left=0" (and theoretically >100%) and NOT be broken.  Rosetta is one, Climate Prediction is another.  Some of the wrapper applications behave this way, since they may not checkpoint very well, if at all.  (Yoyo's Evo comes to mind.)
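
Points 2) and 3) together suggest the rule needs a per-project policy rather than one global limit. A minimal Python sketch, assuming a hypothetical policy table (the names `OVERRUN_GRACE`, `DEFAULT_GRACE`, and `is_suspect` and all the time values are made up for illustration):

```python
from datetime import timedelta
from typing import Optional

# Hypothetical policy: how long a task may keep running after
# "Time Left" has reached zero. None means "trusted, never flag".
OVERRUN_GRACE = {
    "Rosetta": None,
    "Climate Prediction": None,
    "Collatz": timedelta(minutes=10),
}

# Illustrative fallback for projects not listed above.
DEFAULT_GRACE = timedelta(hours=1)


def is_suspect(project: str, time_past_zero: timedelta) -> bool:
    """True if a task has overrun longer than its project allows."""
    grace: Optional[timedelta] = OVERRUN_GRACE.get(project, DEFAULT_GRACE)
    if grace is None:
        return False  # trusted project: running long is expected
    return time_past_zero > grace
```

With this table, a Rosetta task that is a full day past zero is left alone, while a Collatz task is flagged after ten minutes.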


fred

1) A rule on the progress % or a max time rule should catch these.
2) Did you see time left < 0 in BoincTasks? A rule on progress % should catch these as well.

jjwhalen

#6
1) Agreed.
2) Nope--Time Left decrements to zero & thereafter just indicates  "-", the same as a completed task waiting to be reported.  But ET & (CPU Time) keep incrementing.

By coincidence I was watching a CPDN task finish about 3 days ago on my quad, and it ran well past "CPU %=100" and "Time Left='-'".  I believe they run until the climate cycle in progress finishes, however long that takes.  I figure in their case a few extra minutes on a WU that runs 10~12 CPU-DAYS isn't all that much.

But in the case of Collatz I've watched a GPU task that's supposed to finish in 35~37 minutes run 2 hours with no end in sight.  Those need to get trashed.  My "rule" is that Rosetta & CPDN I trust to run long, but Collatz I do not.  So ideally I would tell BT that if a Collatz task reaches "CPU %=100" AND "Time Left='-'" AND "Elapsed Time=00:45:00", the task gets suspended pending operator review.  (The first AND might be an OR.)  That way I could go out to dinner and not have my GPU crunching away on garbage.
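
Written out as a Python sketch, the rule might look like the following. The `TaskRow` fields and function names are hypothetical and only mirror the columns mentioned above; elapsed time is assumed to arrive as a plain "HH:MM:SS" string, and "reaches 00:45:00" is read as "at least 45 minutes".

```python
from dataclasses import dataclass


def parse_hms(text: str) -> int:
    """Convert "HH:MM:SS" to seconds (assumes no day prefix)."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


@dataclass
class TaskRow:
    """Hypothetical view of one row in the task list."""
    project: str
    name: str
    cpu_percent: float
    time_left: str  # "-" once the estimate has run out
    elapsed: str    # "HH:MM:SS"


def should_suspend(task: TaskRow, project: str = "Collatz",
                   max_elapsed: str = "00:45:00",
                   require_both: bool = True) -> bool:
    """Apply the proposed rule to one task.

    require_both=True  -> CPU %=100 AND Time Left='-'
    require_both=False -> CPU %=100 OR  Time Left='-'
    Either way the elapsed time must also have reached max_elapsed.
    """
    if task.project.lower() != project.lower():
        return False  # other projects are trusted to run long
    at_100 = task.cpu_percent >= 100.0
    no_time_left = task.time_left == "-"
    if require_both:
        stalled = at_100 and no_time_left
    else:
        stalled = at_100 or no_time_left
    return stalled and parse_hms(task.elapsed) >= parse_hms(max_elapsed)
```

The function only returns a decision; the suspend itself is left to whatever mechanism the tool already has, so the task stays available for operator review.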

Best wishes.