Identify "stuck" tasks

Started by frankwall, January 21, 2010, 11:26:07 PM

frankwall

Occasionally BOINC WU tasks get "stuck" and make no progress for several hours.
Could the BoincTasks "Warnings" be adapted to flag a task that makes no progress for a specified period in minutes?
The CPU% column still shows 100% for these tasks even though they are actually consuming 0% CPU according to Windows Task Manager.
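
A rough sketch of the kind of check being asked for, written in Python purely for illustration. This is not BoincTasks code; the class name `StuckDetector`, its parameters, and the task name in the usage note are all hypothetical.

```python
class StuckDetector:
    """Flags a task whose reported progress has not changed for a set period."""

    def __init__(self, limit_minutes):
        self.limit_s = limit_minutes * 60
        # task name -> (last fraction done, time in seconds when it last changed)
        self._last = {}

    def update(self, task, fraction_done, now_s):
        """Record one reading; return True if the task looks stuck."""
        seen = self._last.get(task)
        if seen is None or fraction_done != seen[0]:
            # First reading, or progress moved: restart the clock for this task.
            self._last[task] = (fraction_done, now_s)
            return False
        return now_s - seen[1] >= self.limit_s
```

With a 30-minute limit, `update("wu1", 0.42, t)` would start returning True once the same 0.42 has been reported for 1800 seconds. A task that legitimately reports progress only rarely would be flagged as well, so the limit would need to be chosen generously.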

fred

What BOINC version are you using?
The project?

Pepo

Quote from: frankwall on January 21, 2010, 11:26:07 PM
The CPU% column still shows 100% for these tasks even though they are actually consuming 0% CPU according to Windows Task Manager.
???

I remember having seen cases where

  • a regular CPU-intensive application consumes no CPU in the running state (unlike non-CPU-intensive apps such as the 0.01-CPU QCN or DepSpid, which progress regardless of the CPU used), does not progress percentage-wise, and never reaches a checkpoint (so it is never automatically preempted). This case is already caught by the existing built-in CPU warning.
    The usual solution is to notice the case, then temporarily suspend or kill the application, or restart the client.
  • a regular CPU-intensive application consumes its usual nearly full CPU in the running state but updates its progress percentage rarely, or only when finished. Such an application may neither checkpoint nor advance its progress during a run lasting many hours. The only solution is to wait and pray not to lose electricity.
    I have no idea what BT would display in such a case.
Peter

fred

Added it to the wish list

frankwall

I connect to the World Community Grid project and accept all of their WUs.
I have several machines running BOINC 6.2.28 or 5.10.45 on Windows XP or 2008 R2, and occasionally on Ubuntu.
This "stuck" condition is fixed by stopping and restarting the remote client. It has happened on FAAH and HFCC once or twice in the past month, and from reading WCG posts it is not exactly common but also not unheard of. The WUs eventually error out if they are not noticed, but it can take many hours.
I am trying to migrate from BoincView to BoincTasks but find that BoincTasks does not tell me when all CPUs are busy doing BOINC work.
When an application (WU) is in this state, BoincView shows 0.000 in CPU Efficiency, while BoincTasks still shows CPU% as 100% for that WU. (I realize from other posts that I may not understand what CPU% is intended to report. I don't have any GPUs.)
I think if BT could report the actual CPU% being used, I could trigger a warning (which is what I do in BoincView) when it is below a threshold.

frankwall

OK, thanks for adding it to the wish list.
Also thanks for the product itself.

fred

The problem is with the older clients, which always show 100%; the newer BOINC client versions report a percentage.
It is not necessarily a GPU thing. It's the difference between the clock time and the task run time. It should work with the latest stable BOINC release.
But I added this request to the wish list.

Pepo

Quote from: frankwall on January 22, 2010, 12:57:44 PM
I am trying to migrate from BoincView to BoincTasks but find that BoincTasks does not tell me when all CPUs are busy doing BOINC work.
When an application (WU) is in this state, BoincView shows 0.000 in CPU Efficiency, while BoincTasks still shows CPU% as 100% for that WU. [...]
I think if BT could report the actual CPU% being used, I could trigger a warning (which is what I do in BoincView) when it is below a threshold.

I guess the problem is that the BoincTasks value decreases rather slowly (is it an average over the whole run time since the last start?). I noticed this after throttling to e.g. 2%: the decay is extremely slow. BoincView, by contrast, calculates its Efficiency value over a pretty short period.
But it is possible that these are actually two different numbers and that we would need both of them.
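
To illustrate the guess with made-up numbers, here is a small Python sketch. The function names and figures are assumptions for the example only, not how BoincTasks or BoincView actually compute their values.

```python
def whole_run_percent(cpu_s, elapsed_s):
    """CPU % averaged over the whole run since the task started."""
    return 100.0 * cpu_s / elapsed_s


def window_percent(cpu_delta_s, window_s):
    """CPU % over a short recent window only."""
    return 100.0 * cpu_delta_s / window_s


# Hypothetical task: 10 hours at full CPU, then a 10-minute stall.
cpu_before_stall = 10 * 3600
stall = 10 * 60

long_avg = whole_run_percent(cpu_before_stall, cpu_before_stall + stall)
short = window_percent(0, 60)  # no CPU time gained in the last 60 seconds
```

In this example the whole-run average is still a little above 98% ten minutes into the stall, while the 60-second window has already dropped to 0%, which would match the slow decay described above.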
Peter

fred

BT now uses the values from the BOINC client, but only the "newer" clients report the CPU run time, so only with those do you get something other than 100%.
Computed this way, the CPU % can take a long time to respond to changes.
What I can do is: Delta = run time now - run time previous. Delta = 58:35 - 57:51 = 44 sec. Wall clock = 60 sec, CPU % = 44/60 = 73%.
Do it every refresh cycle instead of over a minute, and average it a bit over time to smooth things out.
This way there should be a more immediate response, and it works with older clients as well.
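
The delta calculation can be sketched roughly as below. This is only a Python illustration under stated assumptions, not the BoincTasks implementation: the function names, the exponential smoothing, and the factor `alpha` are hypothetical choices standing in for "average it a bit over time".

```python
def mmss(text):
    """Convert a 'minutes:seconds' run time such as '58:35' to seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


def interval_cpu_percent(prev_cpu_s, now_cpu_s, wall_s):
    """CPU % for one interval: run-time delta divided by wall-clock time."""
    if wall_s <= 0:
        raise ValueError("wall_s must be positive")
    delta = now_cpu_s - prev_cpu_s
    # Guard against a negative delta, e.g. after a restart from a checkpoint.
    return max(0.0, 100.0 * delta / wall_s)


def smooth(previous, sample, alpha=0.5):
    """Exponential moving average; alpha is an assumed smoothing factor."""
    if previous is None:
        return sample
    return alpha * sample + (1.0 - alpha) * previous


# The numbers from the post: 57:51 -> 58:35 over 60 seconds of wall clock.
example = interval_cpu_percent(mmss("57:51"), mmss("58:35"), 60)
```

Here `example` comes out at about 73%. A stalled task gives a delta of 0 and therefore 0% for that interval, and the smoothing only delays how quickly the displayed value follows.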

frankwall

This sounds like it would address my needs and also help identify when other work is pre-empting WUs and reducing their CPU usage ("other work" may be legitimate or not).
Thanks again.

fred

Give V 0.40 a try. Extra -> BoincTasks settings, tab Tasks: clear the checkbox in front of "CPU % Long". This should resolve the 100% problem.

Pepo

Worked immediately.

Could the checkbox please apply also to the Gadget miniWindow? It is showing the long-time average regardless of the "CPU % Long-time average" settings.
Peter

fred

Oops, forgot that one. Next release.

frankwall

Thanks - works like a charm!!