BT 1.06

Started by Dirk, June 09, 2011, 08:23:16 PM



Dirk

I installed BoincTasks V1.06 BETA over BOINC DEV-V6.12.28.

I created a config.xml with the UL/DL/report+request refresh set to 600 seconds.

BOINC currently has ~120 results ready for UL.

These are the messages:
SETI@home   09.06.2011 21:53:58   Started upload of 04ap11aa.13021.12337.11.10.217_0_0   
SETI@home   09.06.2011 21:54:22   Temporarily failed upload of 04ap11aa.13021.12337.11.10.217_0_0: HTTP error   
SETI@home   09.06.2011 21:54:22   Started upload of 02ap11ab.26890.13564.16.10.208_1_0   
SETI@home   09.06.2011 21:54:34   Temporarily failed upload of 02ap11ab.26890.13564.16.10.208_1_0: HTTP error   
SETI@home   09.06.2011 21:54:34   Started upload of 04mr11af.26524.13564.6.10.198_1_0   
SETI@home   09.06.2011 21:54:57   Temporarily failed upload of 04mr11af.26524.13564.6.10.198_1_0: connect() failed   
   
SETI@home   09.06.2011 22:10:41   Started upload of 04ap11aa.13021.12337.11.10.217_0_0   
SETI@home   09.06.2011 22:11:05   Temporarily failed upload of 04ap11aa.13021.12337.11.10.217_0_0: connect() failed   
SETI@home   09.06.2011 22:11:05   Started upload of 02ap11ab.26890.13564.16.10.208_1_0   
SETI@home   09.06.2011 22:11:29   Temporarily failed upload of 02ap11ab.26890.13564.16.10.208_1_0: connect() failed   
SETI@home   09.06.2011 22:11:29   Started upload of 04mr11af.26524.13564.6.10.198_1_0   
SETI@home   09.06.2011 22:11:46   Temporarily failed upload of 04mr11af.26524.13564.6.10.198_1_0: HTTP error   
   
(I have a cc_config.xml with <max_file_xfers_per_project>1)

That is ~15 minutes between the last failure and the next try.
Is that O.K.?

[EDIT: Maybe it needs some time to adjust. This was the first try. The next few retries came after ~10 mins.]
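
For anyone who wants to check the interval directly from the messages above, here is a minimal Python sketch. It assumes the message format shown in this post (project, date, time, text, separated by spaces); the function names are hypothetical and only the last failure of the first round and the first start of the second round are used.

```python
from datetime import datetime

# Two message lines copied from the post above.
LOG = """\
SETI@home   09.06.2011 21:54:57   Temporarily failed upload of 04mr11af.26524.13564.6.10.198_1_0: connect() failed
SETI@home   09.06.2011 22:10:41   Started upload of 04ap11aa.13021.12337.11.10.217_0_0
"""

def parse_time(line):
    # Fields are separated by runs of spaces; date and time are fields 1 and 2.
    parts = line.split()
    return datetime.strptime(parts[1] + " " + parts[2], "%d.%m.%Y %H:%M:%S")

def retry_gap_seconds(log_text):
    # Seconds between the last failed upload and the first upload started after it.
    lines = [l for l in log_text.splitlines() if l.strip()]
    failures = [parse_time(l) for l in lines if "failed upload" in l]
    starts = [parse_time(l) for l in lines if "Started upload" in l]
    last_failure = max(failures)
    next_start = min(t for t in starts if t > last_failure)
    return (next_start - last_failure).total_seconds()

print(retry_gap_seconds(LOG))  # 944.0, i.e. 15 min 44 s
```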


BTW.
Would it be possible (I can't test it at the moment because of the current server problems at S@h) for BT to trigger an automatic 'update' retry and, if any ULs/DLs are present, reset/retry them all, so that BOINC could ask for new work if needed?
And/or: AFAIK, if you have ULs ('> CPUs x2') and/or DLs backlogged in BOINC/transfers, BOINC doesn't ask for new work.
But in my eyes this is nonsense: if UL/DL is currently not possible but the scheduler is available, why not request new work and do the UL/DL later, when it is possible?
Could you make it possible for BT/BOINC to ask for new work even though ULs/DLs are backlogged?
(aka Sutaru Tsureku)

Best regards! :-)


Pepo

Quote from: Sutaru Tsureku on June 09, 2011, 08:23:16 PM
I installed BoincTasks V1.06 BETA over BOINC DEV-V6.12.28.
[...]
Would it be possible (I can't test it at the moment because of the current server problems at S@h) for BT to trigger an automatic 'update' retry and, if any ULs/DLs are present, reset/retry them all, so that BOINC could ask for new work if needed?
And/or: AFAIK, if you have ULs ('> CPUs x2') and/or DLs backlogged in BOINC/transfers, BOINC doesn't ask for new work.
But in my eyes this is nonsense: if UL/DL is currently not possible but the scheduler is available, why not request new work and do the UL/DL later, when it is possible?
Could you make it possible for BT/BOINC to ask for new work even though ULs/DLs are backlogged?
I suspect that BT will not be able to argue the client into "nevertheless ask for some new work", if the client refuses to do so. There is no known backdoor yet. The client would need to be changed.
Peter

Pepo

Quote from: fred on June 08, 2011, 05:39:36 AM
Quote from: Pepo on June 08, 2011, 12:48:18 AM
Again the nasty hidden bug with incorrect times, while an active task does not run. This time I noticed it immediately because of the -23.198% progress (nothing wrong with correctly displaying an eventually incorrect value) and the progress bar going to the left of the Progress% column (not so correct). (Screenshots at hand. I've suspended this task.)
Could I somehow help with localizing the reason for the wrong times' bug?
I will do a check and a partial rewrite for 1.06, if nothing helps, I will add some debugging options.
When running, the task's Progress is at 186.040% (thanks for still keeping the unusual reported % values even over 0-100%), when suspended, then it is set to 100.000% (why the difference? does the client report it this way?). In both cases the progress bar correctly does not exceed its box. Interestingly, in both cases the task's Properties dialog reports 0.000% (??)

Elapsed time is now identical in both states, but, well, it sometimes was correct and occasionally not - I'll have to watch again whether it sometimes becomes incorrect ::)

Remaining time estimate is "-" in the Tasks tab, but "-01d,03:24:38" in Properties dialog. I'm possibly asking and forgetting this again and again - the estimate comes directly from the client, or co-calculated by BT? The Manager displays always "-" - possibly just its confused reaction on a negative calculated estimate? (Sorry, I'm a "debugging guy" - I'd rather like to directly see all incorrect values, possibly with warnings, than "nice clean display with all unexpected values obscured" :-X)
Peter

fred

Quote from: Pepo on June 10, 2011, 07:18:56 AM
When running, the task's Progress is at 186.040% (thanks for still keeping the unusual reported % values even over 0-100%), when suspended, then it is set to 100.000% (why the difference? does the client report it this way?). In both cases the progress bar correctly does not exceed its box. Interestingly, in both cases the task's Properties dialog reports 0.000% (??)

Elapsed time is now identical in both states, but, well, it sometimes was correct and occasionally not - I'll have to watch again whether it sometimes becomes incorrect ::)

Remaining time estimate is "-" in the Tasks tab, but "-01d,03:24:38" in Properties dialog. I'm possibly asking and forgetting this again and again - the estimate comes directly from the client, or co-calculated by BT? The Manager displays always "-" - possibly just its confused reaction on a negative calculated estimate? (Sorry, I'm a "debugging guy" - I'd rather like to directly see all incorrect values, possibly with warnings, than "nice clean display with all unexpected values obscured" :-X)
The values come from the client and are most likely sent incorrectly by the project application. It probably relies on the Manager to adjust all values above 100 to 100.
If the remaining time is invalid it's shown as -, again a project application error.
It's impossible to have a remaining time < 0 :o.
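
To illustrate the two display policies discussed here (clamping values into a "valid" range versus showing the raw value), a minimal Python sketch follows. The function names and the clamp flag are hypothetical; this is not the actual code of BT or of the BOINC Manager, only a picture of the idea.

```python
def format_progress(percent, clamp=False):
    """Return the progress text; optionally clamp to 0..100 as a sanitizing display would."""
    if clamp:
        percent = max(0.0, min(100.0, percent))
    return f"{percent:.3f}%"

def format_remaining(seconds):
    """Show '-' for an invalid (negative) estimate, otherwise DDd,HH:MM:SS."""
    if seconds < 0:
        return "-"
    days, rem = divmod(int(seconds), 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{days:02d}d,{hours:02d}:{minutes:02d}:{secs:02d}"

print(format_progress(186.04))              # 186.040%  (raw value kept)
print(format_progress(186.04, clamp=True))  # 100.000%  (sanitized)
print(format_remaining(-98678))             # -
```

Keeping the raw value visible makes an application bug easier to spot, while clamping gives a tidier display; which one is preferable is exactly the disagreement in this thread.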

fred

Quote from: Sutaru Tsureku on June 09, 2011, 08:23:16 PM
I installed BoincTasks V1.06 BETA over BOINC DEV-V6.12.28.

I created a config.xml with the UL/DL/report+request refresh set to 600 seconds.

BOINC currently has ~120 results ready for UL.
BT only retries the uploads and downloads.
Look for refresh messages in the log: Show->Show log.
To avoid overloading the server, only 100 uploads/downloads are retried at a time. But this will break the project backoff in any case.
But when the connection is overloaded, the uploads and downloads will be back in retry, or backoff, within seconds.

Better to post these problems on the project or BOINC message boards; there is nothing I can do.
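
The "only 100 at a time" limit can be pictured roughly like this. It is a minimal Python sketch under assumptions: the function name and the plain list of transfer names are hypothetical, and this is not BT's actual implementation.

```python
MAX_RETRIES_PER_REFRESH = 100  # limit quoted in the post above

def select_transfers_to_retry(pending, limit=MAX_RETRIES_PER_REFRESH):
    """Split pending transfers into (retry_now, left_for_later) without modifying the input."""
    return pending[:limit], pending[limit:]

# With ~120 results waiting, one refresh would retry 100 and leave 20 for the next one.
pending = [f"result_{i}" for i in range(120)]
retry_now, later = select_transfers_to_retry(pending)
print(len(retry_now), len(later))  # 100 20
```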

Dirk

Quote from: fred on June 10, 2011, 10:39:20 AM
Quote from: Pepo on June 10, 2011, 07:18:56 AM
When running, the task's Progress is at 186.040% (thanks for still keeping the unusual reported % values even over 0-100%), when suspended, then it is set to 100.000% (why the difference? does the client report it this way?). In both cases the progress bar correctly does not exceed its box. Interestingly, in both cases the task's Properties dialog reports 0.000% (??)

Elapsed time is now identical in both states, but, well, it sometimes was correct and occasionally not - I'll have to watch again whether it sometimes becomes incorrect ::)

Remaining time estimate is "-" in the Tasks tab, but "-01d,03:24:38" in Properties dialog. I'm possibly asking and forgetting this again and again - the estimate comes directly from the client, or co-calculated by BT? The Manager displays always "-" - possibly just its confused reaction on a negative calculated estimate? (Sorry, I'm a "debugging guy" - I'd rather like to directly see all incorrect values, possibly with warnings, than "nice clean display with all unexpected values obscured" :-X)
The values come from the client and are most likely sent incorrectly by the project application. It probably relies on the Manager to adjust all values above 100 to 100.
If the remaining time is invalid it's shown as -, again a project application error.
It's impossible to have a remaining time < 0 :o.

I have never seen progress > 100%.

Pepo, which CPU is it - AMD or Intel?
IIRC, AMDs have problems with the test at the start of the stock S@h app. Sometimes they get stuck in a loop.
And/or maybe the WUs in question are ones that were restarted?
(aka Sutaru Tsureku)

Best regards! :-)


Dirk

Quote from: Pepo on June 10, 2011, 05:59:08 AM
Quote from: Sutaru Tsureku on June 09, 2011, 08:23:16 PM
I installed BoincTasks V1.06 BETA over BOINC DEV-V6.12.28.
[...]
Would it be possible (I can't test it at the moment because of the current server problems at S@h) for BT to trigger an automatic 'update' retry and, if any ULs/DLs are present, reset/retry them all, so that BOINC could ask for new work if needed?
And/or: AFAIK, if you have ULs ('> CPUs x2') and/or DLs backlogged in BOINC/transfers, BOINC doesn't ask for new work.
But in my eyes this is nonsense: if UL/DL is currently not possible but the scheduler is available, why not request new work and do the UL/DL later, when it is possible?
Could you make it possible for BT/BOINC to ask for new work even though ULs/DLs are backlogged?
I suspect that BT will not be able to argue the client into "nevertheless ask for some new work", if the client refuses to do so. There is no known backdoor yet. The client would need to be changed.

Fred, just a question. :)
I'm not a coder. ;)

AFAIK, BOINC is open-source software, so anyone can edit it.
E.g., if you take the original BOINC V6.12.26 (last recommended) client, could you change this rule?
This is the first client which has the nice <max_tasks_reported> feature.

Maybe it's like (going by a few code examples here in the forum ???)..
don't request work, if backlogged downloads/uploads
you change it to..
also request work, if backlogged downloads/uploads

But I worry it's not so easy.. :(

If you are uncertain whether it would be worth doing..
Maybe start a poll to see if others would like to have this client too. ;)
(aka Sutaru Tsureku)

Best regards! :-)


fred

Quote from: Sutaru Tsureku on June 10, 2011, 01:39:17 PM
don't request work, if backlogged downloads/uploads
you change it to..
also request work, if backlogged downloads/uploads
No, I will not help release code that will surely be considered cheating. :o
Don't try to convince me it's not cheating....
Simply wait for work like everybody else. ;D

Pepo

Quote from: Sutaru Tsureku on June 10, 2011, 01:22:32 PM
Quote from: fred on June 10, 2011, 10:39:20 AM
Quote from: Pepo on June 10, 2011, 07:18:56 AM
When running, the task's Progress is at 186.040% [...]
The values come from the client and are most likely sent incorrectly by the project application. It probably relies on the Manager to adjust all values above 100 to 100.
I have never seen progress > 100%.
Pepo, which CPU is it - AMD or Intel?
IIRC, AMDs have problems with the test at the start of the stock S@h app. [...]
It's not about a particular CPU. It's about bugs in the chain: apps (confirmed) - client (rejected) - Manager/BT (rejected).

In this case the culprit is a QCN task, which ended up in an unexpected condition (running without an attached accelerometer sensor for more than a couple of weeks), and its progress simply continues past 100.00% towards infinity (this happens on any CPU).

In the past, there were many projects (and presumably more will appear) whose apps occasionally showed such a problem. I'm just sad that BOINC Manager obscures such bugs, instead of pointing at them and helping to resolve them. And glad that Fred is still willing to display (and red-highlight? ;)) such bugs.
Peter

Pepo

Quote from: fred on June 10, 2011, 10:39:20 AM
Quote from: Pepo on June 10, 2011, 07:18:56 AM
Remaining time estimate is "-" in the Tasks tab, but "-01d,03:24:38" in Properties dialog. I'm possibly asking and forgetting this again and again - the estimate comes directly from the client, or co-calculated by BT? The Manager displays always "-" - possibly just its confused reaction on a negative calculated estimate? (Sorry, I'm a "debugging guy" - I'd rather like to directly see all incorrect values, possibly with warnings, than "nice clean display with all unexpected values obscured" :-X)
The values come from the client and are most likely sent incorrectly by the project application. It probably relies on the Manager to adjust all values above 100 to 100.
If the remaining time is invalid it's shown as -, again a project application error.
It's impossible to have a remaining time < 0 :o.
Nothing is impossible :D especially the booogs are very easily possible - one just has to find the offender! If the B.Manager did not sanitize displayed values to "valid ranges", then far more people would notice issues and point their fingers at... (If at least unreleased testing versions would do that...)
Peter

fred

Quote from: Pepo on June 10, 2011, 02:29:21 PM
In the past, there were many projects (and presumably more will appear) whose apps occasionally showed such a problem. I'm just sad that BOINC Manager obscures such bugs, instead of pointing at them and helping to resolve them. And glad that Fred is still willing to display (and red-highlight? ;)) such bugs.
See wish list.

Pepo

Yesterday (11.6. at 11:33), while looking at BT, I suddenly saw a few running tasks with warnings about their (overlong) checkpoint times (the machine named Vandus):



A minute later suddenly all was fine:

Note that FreeHALs do checkpoint very often, occasionally even each second! (In my opinion this app is brain-damaged, sorry!)

Today, shortly after a new BOINC start, the same happened again (snapshot saved at 19:39:00):


and a half minute later (snapshot saved at 19:39:16, BT's update frequency around 3-4 seconds)


Unfortunately I have no exact idea when the snapshots were taken - possibly some 10-15 seconds prior to their storage time.

I took a look at BOINC log - nothing extraordinary:
Quote
12. 6. 2011 19:37:50 |  | Suspending computation - initial delay
12. 6. 2011 19:38:01 | Quake-Catcher Network | Restarting task qcnac_097909_0 using qcnsensor version 652
12. 6. 2011 19:38:01 | SETI@home | Restarting task 14mr11ac.9707.7838.7.10.153_0 using setiathome_enhanced version 610
12. 6. 2011 19:38:01 | FreeHAL@home | Restarting task fh_nci_0_30579475_52_0 using newFreeHAL version 193
12. 6. 2011 19:38:01 | FreeHAL@home | Restarting task fh_nci_0_30579475_15_0 using newFreeHAL version 193
12. 6. 2011 19:38:01 | FreeHAL@home | Restarting task fh_nci_0_30579424_101_0 using newFreeHAL version 193
12. 6. 2011 19:38:01 | FreeHAL@home | Restarting task fh_nci_0_30579475_5_0 using newFreeHAL version 193
12. 6. 2011 19:38:01 | WUProp@Home | Restarting task wu_1307636806_24822_0 using data_collect version 243
12. 6. 2011 19:38:03 | FreeHAL@home | [checkpoint] result fh_nci_0_30579475_52_0 checkpointed
12. 6. 2011 19:38:03 | FreeHAL@home | [checkpoint] result fh_nci_0_30579475_15_0 checkpointed
12. 6. 2011 19:38:03 | FreeHAL@home | [checkpoint] result fh_nci_0_30579424_101_0 checkpointed
12. 6. 2011 19:38:03 | FreeHAL@home | [checkpoint] result fh_nci_0_30579475_5_0 checkpointed
12. 6. 2011 19:38:04 | FreeHAL@home | [checkpoint] result fh_nci_0_30579475_52_0 checkpointed
12. 6. 2011 19:38:04 | FreeHAL@home | [checkpoint] result fh_nci_0_30579475_15_0 checkpointed
12. 6. 2011 19:38:04 | FreeHAL@home | [checkpoint] result fh_nci_0_30579424_101_0 checkpointed
12. 6. 2011 19:38:04 | FreeHAL@home | [checkpoint] result fh_nci_0_30579475_5_0 checkpointed
12. 6. 2011 19:38:04 | WUProp@Home | [checkpoint] result wu_1307636806_24822_0 checkpointed
12. 6. 2011 19:38:09 | FreeHAL@home | [checkpoint] result fh_nci_0_30579424_101_0 checkpointed
12. 6. 2011 19:38:11 | FreeHAL@home | [checkpoint] result fh_nci_0_30579424_101_0 checkpointed
12. 6. 2011 19:38:16 | FreeHAL@home | [checkpoint] result fh_nci_0_30579424_101_0 checkpointed
12. 6. 2011 19:38:17 | FreeHAL@home | [checkpoint] result fh_nci_0_30579424_101_0 checkpointed
12. 6. 2011 19:38:18 | FreeHAL@home | [checkpoint] result fh_nci_0_30579424_101_0 checkpointed
12. 6. 2011 19:38:18 | FreeHAL@home | [checkpoint] result fh_nci_0_30579475_5_0 checkpointed
12. 6. 2011 19:38:19 | FreeHAL@home | [checkpoint] result fh_nci_0_30579475_52_0 checkpointed
12. 6. 2011 19:38:19 | FreeHAL@home | [checkpoint] result fh_nci_0_30579424_101_0 checkpointed
12. 6. 2011 19:38:19 | FreeHAL@home | [checkpoint] result fh_nci_0_30579475_5_0 checkpointed
12. 6. 2011 19:38:20 | FreeHAL@home | [checkpoint] result fh_nci_0_30579424_101_0 checkpointed
12. 6. 2011 19:38:20 | FreeHAL@home | [checkpoint] result fh_nci_0_30579475_5_0 checkpointed
12. 6. 2011 19:38:37 | FreeHAL@home | [checkpoint] result fh_nci_0_30579475_15_0 checkpointed
12. 6. 2011 19:38:38 | FreeHAL@home | [checkpoint] result fh_nci_0_30579475_5_0 checkpointed
12. 6. 2011 19:38:40 | FreeHAL@home | [checkpoint] result fh_nci_0_30579475_15_0 checkpointed
12. 6. 2011 19:38:41 | FreeHAL@home | [checkpoint] result fh_nci_0_30579475_5_0 checkpointed
12. 6. 2011 19:38:45 | FreeHAL@home | [checkpoint] result fh_nci_0_30579475_15_0 checkpointed
12. 6. 2011 19:38:46 | FreeHAL@home | [checkpoint] result fh_nci_0_30579475_15_0 checkpointed
12. 6. 2011 19:38:47 | FreeHAL@home | [checkpoint] result fh_nci_0_30579475_52_0 checkpointed
12. 6. 2011 19:38:47 | FreeHAL@home | [checkpoint] result fh_nci_0_30579424_101_0 checkpointed
12. 6. 2011 19:38:47 | FreeHAL@home | [checkpoint] result fh_nci_0_30579475_5_0 checkpointed
12. 6. 2011 19:38:49 | FreeHAL@home | [checkpoint] result fh_nci_0_30579475_5_0 checkpointed
12. 6. 2011 19:38:51 | FreeHAL@home | [checkpoint] result fh_nci_0_30579475_5_0 checkpointed
12. 6. 2011 19:38:53 | FreeHAL@home | [checkpoint] result fh_nci_0_30579475_5_0 checkpointed
12. 6. 2011 19:38:56 | FreeHAL@home | [checkpoint] result fh_nci_0_30579424_101_0 checkpointed
12. 6. 2011 19:38:56 | FreeHAL@home | [checkpoint] result fh_nci_0_30579475_5_0 checkpointed
12. 6. 2011 19:38:58 | FreeHAL@home | [checkpoint] result fh_nci_0_30579475_5_0 checkpointed
12. 6. 2011 19:38:59 | FreeHAL@home | [checkpoint] result fh_nci_0_30579424_101_0 checkpointed
12. 6. 2011 19:39:01 | FreeHAL@home | [checkpoint] result fh_nci_0_30579475_5_0 checkpointed
12. 6. 2011 19:39:02 | FreeHAL@home | [checkpoint] result fh_nci_0_30579424_101_0 checkpointed
12. 6. 2011 19:39:04 | FreeHAL@home | [checkpoint] result fh_nci_0_30579424_101_0 checkpointed
12. 6. 2011 19:39:05 | FreeHAL@home | [checkpoint] result fh_nci_0_30579475_5_0 checkpointed
12. 6. 2011 19:39:06 | FreeHAL@home | [checkpoint] result fh_nci_0_30579475_52_0 checkpointed
12. 6. 2011 19:39:06 | FreeHAL@home | [checkpoint] result fh_nci_0_30579475_5_0 checkpointed
12. 6. 2011 19:39:07 | FreeHAL@home | [checkpoint] result fh_nci_0_30579424_101_0 checkpointed
12. 6. 2011 19:39:17 | FreeHAL@home | [checkpoint] result fh_nci_0_30579475_52_0 checkpointed
12. 6. 2011 19:39:17 | FreeHAL@home | [checkpoint] result fh_nci_0_30579475_15_0 checkpointed
12. 6. 2011 19:39:17 | FreeHAL@home | [checkpoint] result fh_nci_0_30579424_101_0 checkpointed
12. 6. 2011 19:39:17 | FreeHAL@home | [checkpoint] result fh_nci_0_30579475_5_0 checkpointed

I've tried to guess the exact snapshot times according to the tasks' checkpoint counts in [brackets]; snapshot 1 must have been taken between 19:38:04-19:38:09 && 19:38:03-19:38:20 && 19:38:01-19:38:03 && 19:38:01-19:38:04 - simply not possible. The checkpoint counts in [brackets] somehow do not match the checkpoints in the log? ??? Never mind...

The question to answer is whether the client occasionally delivers incorrect checkpoint times (for a period of a couple of updates), or whether BT is replacing the correct values with incorrect ones?
Peter
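
Matching the per-task checkpoint counts against the log can be done mechanically. Below is a minimal Python sketch, assuming the "date time | project | message" layout of the log quoted above; the function name is hypothetical, and the sample lines are a short excerpt of that log.

```python
from collections import Counter
from datetime import datetime

# A few lines taken from the log quoted above.
SAMPLE = """\
12. 6. 2011 19:38:03 | FreeHAL@home | [checkpoint] result fh_nci_0_30579475_52_0 checkpointed
12. 6. 2011 19:38:03 | FreeHAL@home | [checkpoint] result fh_nci_0_30579475_15_0 checkpointed
12. 6. 2011 19:38:04 | FreeHAL@home | [checkpoint] result fh_nci_0_30579475_52_0 checkpointed
12. 6. 2011 19:38:04 | WUProp@Home | [checkpoint] result wu_1307636806_24822_0 checkpointed
12. 6. 2011 19:38:09 | FreeHAL@home | [checkpoint] result fh_nci_0_30579424_101_0 checkpointed
"""

def checkpoint_counts(log_text, until):
    """Count [checkpoint] messages per task with a timestamp <= until."""
    counts = Counter()
    for line in log_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 3 or not fields[2].startswith("[checkpoint]"):
            continue  # skip restart messages and anything else
        stamp = datetime.strptime(fields[0], "%d. %m. %Y %H:%M:%S")
        if stamp <= until:
            task = fields[2].split()[2]  # "[checkpoint] result <task> checkpointed"
            counts[task] += 1
    return counts

counts = checkpoint_counts(SAMPLE, datetime(2011, 6, 12, 19, 38, 4))
print(counts["fh_nci_0_30579475_52_0"])  # 2
```

Running this over the full log for a series of candidate times would show at which moment, if any, all counts agree with the numbers displayed in a snapshot.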

Corsair

From time to time, in the Tasks window, the % CPU column is not shown for the GPU tasks only, and after some cycles it is shown again.
Roses don't bloom on the sailor's grave

Corsair.

Pepo

Quote from: Pepo on June 12, 2011, 09:48:49 PM
Yesterday (11.6. at 11:33) while looking at BT, I've suddenly seen a few running tasks with warnings about their (overlong) checkpoint times. A minute later suddenly all was fine.
(Note that FreeHALs do checkpoint very often, occasionally even each second! (In my opinion this app is brain-damaged, sorry!))

Today, shortly after a new BOINC start, the same happened again (snapshot saved at 19:39:00), and a half minute later (snapshot saved at 19:39:16, BT's update frequency around 3-4 seconds) it was again gone.

The question to answer is whether the client occasionally delivers incorrect checkpoint times (for a period of a couple of updates), or whether BT is replacing the correct values with incorrect ones?
I think the answer is that the tasks' CPU time is freely jumping (observed in both BT and BM) between just a few minutes:seconds and more hours than the elapsed time (although FreeHAL claims to be an nCi app ;D) - such astronomical CPU times for a single-threaded nCi task can also be seen on my screenshots (I had not noticed it before). (Correction to my opinion - this app is not brain-damaged, it has no brain, it just claims to have some virtual intelligence (sorry for being this biased!))
Peter