BT 1.25

Started by Pepo, October 26, 2011, 07:39:27 AM

fred

Quote from: idahofisherman on November 03, 2011, 08:35:00 PM
Is there a time limit for missed?
Missed is not seen in an upload or reported state.
It depends on the project. The history only works properly on projects with a dependable time to completion.
If that estimate is more than 50% off, it may cause a missed history entry.
A project going from 60 to 0 in one second will probably produce a missed.
Setting 120 seconds means that the tasks shouldn't be gone within 120 seconds.
The sampling time is: time to completion / 2.

On the other hand, a 15-second setting shouldn't give you 98% load.
Make sure BT is closed and running in the background. The history should be as short as possible and be moved to long-term memory A.S.A.P.
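A rough sketch of why a badly estimated task ends up as "Missed", assuming only the rule quoted above (sampling time = time to completion / 2). The function names and the one-step model are hypothetical illustrations, not BT's actual code:

```python
def sample_delay(estimated_left_s: float) -> float:
    """Sampling rule quoted in the post: time to completion / 2."""
    return estimated_left_s / 2.0


def is_missed(estimated_left_s: float, actual_left_s: float, gone_after_s: float) -> bool:
    """Return True if the task is gone before the next sample (simplified, one step only).

    estimated_left_s: time to completion as estimated at the last sample
    actual_left_s:    time the task really still needs
    gone_after_s:     time from finishing until the task is uploaded, reported and gone
    """
    gone_at = actual_left_s + gone_after_s
    return gone_at < sample_delay(estimated_left_s)
```

With an estimate of 60 seconds, the next look at the task is 30 seconds away. A task that really finishes after 1 second and is gone 1 second later (`is_missed(60, 1, 1)`) is never seen in an upload or reported state, while a task that really needs the full 60 seconds is still there at the next sample.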


fred

Quote from: 3216842 on November 04, 2011, 11:56:04 AM
Hello to all and congratulations on this great software.
I have a slightly annoying problem with BT: when stopping the BOINC client via the menu, BT freezes briefly and after ~15-20 sec. shows the error message "The BOINC client couldn't be shut down". A further ~10 sec. later the client stops, and after that all running tasks/WUs stop (in this order) :-( .
I have seen this behavior in BT 1.23/.24/.25. Running XP Pro SP3, BOINC 6.12.34; BOINC Manager is not running.

Log:
04 November 2011 - 12:18:00 Shut down BoincTasks ---- The BOINC client is shutting down
04 November 2011 - 12:18:30 Shut down BoincTasks ---- The BOINC client has shut down
04 November 2011 - 12:18:30 Shut down BoincTasks ---- Der BOINC-Client konnte nicht beendet werden
04 November 2011 - 12:18:30 Connect ---- The connection was lost, because the client stopped

Hope for a fix.

__W__
I will fix the freeze of the main program while it's waiting for the client to shut down.
Der BOINC-Client konnte nicht beendet werden ("The BOINC client couldn't be shut down"): this means the client is still running even though it was ordered to shut down.
The reason may be a running BOINC Manager that restarted the client, or a failure of the client itself to shut down properly.
This is by no means a BT problem.

Pepo

Quote from: 3216842 on November 04, 2011, 11:56:04 AM
When stopping the BOINC client via the menu, BT freezes briefly and after ~15-20 sec. shows the error message "The BOINC client couldn't be shut down". A further ~10 sec. later the client stops, and after that all running tasks/WUs stop (in this order) :-(
You could also check the BOINC event log to see whether it took the client some additional unexpected time to stop the task applications. (But I've already complained to D.A. that the client is not fair and verbose enough about this.)

Quote from: fred on November 04, 2011, 12:30:57 PM
The reason may be [...] a failure on the client itself to properly shut down.
I no longer remember how responsive the client was while it was waiting for unresponsive tasks to finish, until it finally killed them. This could have been the 20-30 sec. communication delay.
Peter

fred

Quote from: Pepo on November 04, 2011, 01:32:58 PM
I no longer remember how responsive the client was while it was waiting for unresponsive tasks to finish, until it finally killed them. This could have been the 20-30 sec. communication delay.
That could be it, I raised the timeout to 1 minute.
As it is in a separate thread now, a wait isn't an issue.
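A minimal sketch of the idea of waiting for the client in a separate thread with a timeout, so the main program does not freeze. This is not BT's code: `client_stopped` and `on_result` are hypothetical callbacks standing in for "is the client process gone?" and "log or show the result".

```python
import threading
import time


def wait_for_shutdown(client_stopped, timeout_s, on_result, poll_s=0.05):
    """Poll client_stopped() until it returns True or timeout_s has elapsed.

    Calls on_result(True) if the client stopped in time, on_result(False) otherwise.
    """
    deadline = time.monotonic() + timeout_s
    stopped = client_stopped()
    while not stopped and time.monotonic() < deadline:
        time.sleep(poll_s)
        stopped = client_stopped()
    on_result(stopped)


def start_shutdown_watch(client_stopped, timeout_s, on_result):
    """Run the wait in a background thread so the caller returns immediately."""
    worker = threading.Thread(
        target=wait_for_shutdown,
        args=(client_stopped, timeout_s, on_result),
        daemon=True,
    )
    worker.start()
    return worker
```

Calling `start_shutdown_watch(check, 60, report)` would correspond to the 1 minute timeout mentioned above; because the wait happens off the main thread, a longer timeout costs nothing in responsiveness.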

3216842

Quote from: fred on November 04, 2011, 12:30:57 PM
I will fix the freeze of the main program while it's waiting for the client to shut down.
Der BOINC-Client konnte nicht beendet werden ("The BOINC client couldn't be shut down"): this means the client is still running even though it was ordered to shut down.
The reason may be a running BOINC Manager that restarted the client, or a failure of the client itself to shut down properly.
This is by no means a BT problem.
To clarify some points, I have done some testing:
- BOINC Manager is not running
- the BOINC client shuts down much later than expected, with the error message shown as a popup and in the logs; no other related log entries are listed
- the shutdown problem shows up even if no WUs/tasks are running (all tasks halted)
- shutting down the BOINC client from BOINC Manager is no problem and works fast (no matter whether any tasks are running or not), so I don't think this is a problem of the client shutting down ?!
- no problems with shutting down the client with BT pre .23

Happy debugging
__W__


Pepo

Quote from: 3216842 on November 04, 2011, 10:07:47 PM
- shutting down the BOINC client from BOINC Manager is no problem and works fast (no matter whether any tasks are running or not), so I don't think this is a problem of the client shutting down ?!
Maybe a side note - recent BOINC versions appear to shut down very fast - in reality, as soon as the Manager gets a notification from the client that "it understood it should shut down", it simply disappears into nirvana. The client then slowly tries to finish the tasks, while no one is aware or bothers anymore :-X :-\
Peter

3216842

Quote from: Pepo on November 04, 2011, 11:26:16 PM
Maybe a side note - recent BOINC versions appear to shut down very fast - in reality, as soon as the Manager gets a notification from the client that "it understood it should shut down", it simply disappears into nirvana. The client then slowly tries to finish the tasks, while no one is aware or bothers anymore :-X :-\
The shutdown via the BOINC Manager is much faster than the shutdown via BT.
I know about Windows "fooling" around with running/not running programs, and besides it's only XP Pro, so I cross-checked it with some system tools like Sysinternals Process Explorer and some other tools  ;D .

__W__

fred

Quote from: 3216842 on November 05, 2011, 01:22:02 AM
The shutdown over the BOINC Manager is much faster than the shutdown over BT.
Let's see how you like 1.26.

3216842

Quote from: fred on November 05, 2011, 09:49:49 AM
Let's see how you like 1.26.
I am sure that I will like it 8) .
Just a little cosmetic thing: the "WWW" rollout menu is twice as long as it should be, and at the end of the "Extras" menu there is a surplus separator line :o .
;D ;D ;D Uhhh, what a horror, this is looking bad ;D ;D ;D
:P but much better than the BOINC Manager :P

__W__

fred

Quote from: idahofisherman on November 03, 2011, 08:35:00 PM
This seems to have fixed the cpu problem, but now I have increased "Missed" instead of Report OK. 
Which project(s) are causing these problems?

fred

This is an example:

11777   PrimeGrid   06-11-2011 19:16   Computation for task llrCUL_104726483_4 finished   
11778   PrimeGrid   06-11-2011 19:16   Restarting task llrCUL_105073299_1 using llrCUL version 609   
11779   PrimeGrid   06-11-2011 19:16   Started upload of llrCUL_104726483_4_0   
11780   PrimeGrid   06-11-2011 19:16   Finished upload of llrCUL_104726483_4_0   
11781   PrimeGrid   06-11-2011 19:16   Sending scheduler request: To report completed tasks.   
11782   PrimeGrid   06-11-2011 19:16   Reporting 1 completed tasks, not requesting new tasks   
11783   PrimeGrid   06-11-2011 19:16   Scheduler request completed   

PrimeGrid   6.09 Cullen (LLR)   llrCUL_104726483_4   03d,15:44:09 (02d,04:17:26)   06-11-2011 19:17   06-11-2011 19:17      Missed

Missed by 1 second  :-X

As you can see, the upload->ready->gone sequence happens within 1 second. So this is quite impossible to catch, at least not without extreme overhead.

Pepo

Quote from: fred on November 06, 2011, 06:35:16 PM
This is an example:

11777   PrimeGrid   06-11-2011 19:16   Computation for task llrCUL_104726483_4 finished      
11779   PrimeGrid   06-11-2011 19:16   Started upload of llrCUL_104726483_4_0   
11781   PrimeGrid   06-11-2011 19:16   Sending scheduler request: To report completed tasks.   
11783   PrimeGrid   06-11-2011 19:16   Scheduler request completed   

PrimeGrid   6.09 Cullen (LLR)   llrCUL_104726483_4   03d,15:44:09 (02d,04:17:26)   06-11-2011 19:17   06-11-2011 19:17      Missed

Missed by 1 second  :-X

As you can see, the upload->ready->gone sequence happens within 1 second. So this is quite impossible to catch, at least not without extreme overhead.
Unfortunately, Fred, I can personally not see it - your example is not that obvious, it is missing ":seconds" in the time values  ;)
(The same happens to me when posting logs - I often have to turn seconds on and repost them.)


BTW, if the upload phase started a minute after the task finished, what would the History guess when looking at the task some 10 seconds after it finished: that it is already in the Uploading phase? (I would guess so.)
Peter

fred

Quote from: Pepo on November 06, 2011, 11:11:35 PM
Unfortunately, Fred, I can personally not see it - your example is not that obvious, it is missing ":seconds" in the time values  ;)
I will make 2 changes for 1.26.
1) If a task's running state has changed (running -> upload), the next history fetch will happen without any delay. Right now it happens after the minimum cycle. This gains 4 seconds.
2) A check box "Time left not very accurate". In this mode the interval will change from timeleft / 2 to timeleft / 4. This shouldn't add too much extra overhead, as only the running tasks are read back and not everything.
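The two changes above could be sketched like this. It is only an illustration of the described rules: the function name, the parameter names and the 4-second minimum cycle (inferred from "gaining 4 seconds") are assumptions, not BT's real settings.

```python
def history_fetch_delay(
    time_left_s: float,
    state_changed: bool = False,
    time_left_inaccurate: bool = False,
    min_cycle_s: float = 4.0,  # assumed minimum cycle
) -> float:
    """Seconds to wait before the next history fetch of the running tasks."""
    if state_changed:
        # Change 1: a state change (running -> upload) triggers a fetch without delay.
        return 0.0
    # Change 2: with "Time left not very accurate" the interval is timeleft / 4
    # instead of timeleft / 2.
    divisor = 4.0 if time_left_inaccurate else 2.0
    return max(time_left_s / divisor, min_cycle_s)
```

For a task with 120 seconds left this gives 60 seconds normally, 30 seconds with the check enabled, and 0 right after a state change.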

Pepo

Yes, both could help to catch them. It's just that these Surveill tasks are monitoring-unfriendly ;D - their last-second ETA (at some 87%) is often more than 2 minutes :-X so apparently no tricks are possible.



Just seen something different regarding timing in Messages:
5298 PrimeGrid 07.11.11 10:48:18 [checkpoint] result LLR_SGS_107255402_0 checkpointed
5301 PrimeGrid 07.11.11 10:49:07 Computation for task LLR_SGS_107255402_0 finished
5302 PrimeGrid 07.11.11 10:49:07 Starting task LLR_SGS_106902280_0 using llrTPS version 609
5303 PrimeGrid 07.11.11 10:49:08 Started upload of LLR_SGS_107255402_0_0
5304 PrimeGrid 07.11.11 10:49:10 Finished upload of LLR_SGS_107255402_0_0
5305 PrimeGrid 07.11.11 10:49:15 Sending scheduler request: To report completed tasks.
5307 PrimeGrid 07.11.11 10:49:18 Scheduler request completed: got 0 new tasks

and there are no more notes on LLR_SGS_107255402_0 - completed.
But the History says
PrimeGrid LLR_SGS_107255402_0 00:31:58 (00:26:39) 07.11.11 10:49:14 07.11.11 10:51:19 Reported: OK
where the times are Elapsed / Finished / Reported: it is shown as reported 2 minutes after the scheduler request finished in the event log.

Actually, when I now look at tasks that have just finished, if the History notices a task's "Ready to report" state, they all get a "Reported" timestamp in the History 1-2 minutes after their scheduler report finished (although the tasks disappear immediately from the Tasks tab). I've also seen the transition from "Sending" to "Ready to report" happen with a similar delay of more than 1 minute.
5387 PrimeGrid 07.11.11 11:11:08 Computation for task LLR_SGS_106902280_0 finished
5390 PrimeGrid 07.11.11 11:11:10 Finished upload of LLR_SGS_106902280_0_0
5393 PrimeGrid 07.11.11 11:11:15 Scheduler request completed: got 0 new tasks
PrimeGrid LLR_SGS_106902280_0 00:22:00 (00:21:09) 07.11.11 11:11:14 07.11.11 11:12:55 Reported: OK

5475 surveill@home 07.11.11 11:29:14 Computation for task wu_1320320103_162664_0 finished
5476 surveill@home 07.11.11 11:29:15 Started upload of wu_1320320103_162664_0_data
5479 surveill@home 07.11.11 11:29:17 Finished upload of wu_1320320103_162664_0_data
5480 surveill@home 07.11.11 11:29:21 Sending scheduler request: To report completed tasks.
5482 surveill@home 07.11.11 11:29:22 Scheduler request completed: got 0 new tasks
surveill@home wu_1320320103_162664_0 00:16:13 (00:00:02) 07.11.11 11:29:21 07.11.11 11:31:29 Reported: OK

On the machine, History sampling is set to 4-10 seconds. Why the delay? If some task gets "missed", its timestamp is immediately (i.e. 1-5 sec.) after noticing it was gone.
Peter
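For reference, the delay between "Scheduler request completed" in the event log and the "Reported" time in the History can be worked out from the three examples above. A small sketch using only the timestamps quoted in the post (the helper name is made up for illustration):

```python
from datetime import datetime

FMT = "%d.%m.%y %H:%M:%S"  # format of the timestamps pasted above


def delay_seconds(request_completed: str, history_reported: str) -> int:
    """Seconds between the event-log time and the History 'Reported' time."""
    start = datetime.strptime(request_completed, FMT)
    end = datetime.strptime(history_reported, FMT)
    return int((end - start).total_seconds())


# task -> (Scheduler request completed, History "Reported")
EXAMPLES = {
    "LLR_SGS_107255402_0": ("07.11.11 10:49:18", "07.11.11 10:51:19"),
    "LLR_SGS_106902280_0": ("07.11.11 11:11:15", "07.11.11 11:12:55"),
    "wu_1320320103_162664_0": ("07.11.11 11:29:22", "07.11.11 11:31:29"),
}

DELAYS = {task: delay_seconds(*times) for task, times in EXAMPLES.items()}
```

This gives delays of 121, 100 and 127 seconds, i.e. roughly 1.5 to 2 minutes in each case.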

fred

Quote from: Pepo on November 07, 2011, 10:50:09 AM
On the machine, History sampling is set to 4-10 seconds. Why the delay? If some task gets "missed", its timestamp is immediately (i.e. 1-5 sec.) after noticing it was gone.
Normally only the running tasks are fetched, so a state change from Uploading -> Ready is not noticed.
Once every 120 seconds a full fetch is performed (very resource intensive, imagine a couple of thousand tasks). That's why there is a delay of up to 2 minutes.
A missed task shows up in the full fetch, so its time stamp is immediate.

But I did some tweaking and I think it should perform better in 1.26, but BT can't do the impossible.
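The "max 2 minute delay" follows directly from the full fetch period. A simplified sketch, assuming full fetches happen at fixed times 0, 120, 240, ... seconds and that a state change of a non-running task is only seen by a full fetch (names are hypothetical):

```python
import math

FULL_FETCH_PERIOD_S = 120  # from the post: a full fetch once every 120 seconds


def noticed_at(change_time_s: float, period_s: float = FULL_FETCH_PERIOD_S) -> float:
    """Time of the first full fetch at or after the state change."""
    return math.ceil(change_time_s / period_s) * period_s


def notice_delay(change_time_s: float, period_s: float = FULL_FETCH_PERIOD_S) -> float:
    """How long a state change goes unnoticed in this simplified model."""
    return noticed_at(change_time_s, period_s) - change_time_s
```

A change 1 second after a full fetch waits 119 seconds for the next one, while a change just before a full fetch is noticed almost at once, so in this model the delay is always below one period.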