BT 0.87

Started by glennaxl, November 12, 2010, 02:05:07 PM



fred

Quote from: Beyond on November 14, 2010, 05:10:14 PM
1)  Maybe a new indicator under Projects/Tasks so it looks like 0/0/1? :) ;) ;D ;) :)
2)  MW tasks run for 3-4 minutes on my systems.  My checkpoint warning triggers are set at 5, 10 & 20 minutes.  About 40% of the time the last BT update period shows a checkpoint time of over 20 minutes and thus flashes the > 20 minute warning color.  Don't know where this erroneous checkpoint time is coming from but it can't be correct to show a > 20 minute checkpoint time on a 3 minute WU.  I have set custom global checkpoints in config.xml.  Here's my checkpoints:
1) On the list.
2) I will investigate.

Beyond

Quote from: fred on November 14, 2010, 05:32:50 PM
Quote from: Beyond on November 14, 2010, 05:10:14 PM
1)  Maybe a new indicator under Projects/Tasks so it looks like 0/0/1? :) ;) ;D ;) :)
2)  MW tasks run for 3-4 minutes on my systems.  My checkpoint warning triggers are set at 5, 10 & 20 minutes.  About 40% of the time the last BT update period shows a checkpoint time of over 20 minutes and thus flashes the > 20 minute warning color.  Don't know where this erroneous checkpoint time is coming from but it can't be correct to show a > 20 minute checkpoint time on a 3 minute WU.  I have set custom global checkpoints in config.xml.  Here's my checkpoints:
1) On the list.
2) I will investigate.

Thanks!

2) After watching the tasks a lot more it appears to be showing this erroneous checkpoint time only on the LAST update period before the WU exits.  I thought that it happened both at the start and the finish of the WU, but it seems to be only at the WU finish.  As far as the behavior of the config.xml checkpoint settings, shouldn't the individual project/app settings override the global settings instead of the other way around?


Pepo

To add my bit to the flashing Christmas trees ;)
Well I've noticed that my CPDN task shows red highlighted "[0] 14d,02:58:24" from last checkpoint (which is pretty unusual for CPDN), but its elapsed times were just around 18:38:20 (19:11:47) hours (snapshots taken if necessary). So I've looked in BT's task properties
Name famous_s303_599_200_000222072_1
CPU time at last checkpoint 14d,02:58:24
CPU time 18:38:20
Elapsed time 19:11:47
Estimated time remaining 01d,02:36:25
Fraction done 92.972 %

and BOINC Mgr's task properties
CPU time for last checkpoint 338:58:24
CPU time 339:04:33
Elapsed time 350:50:31

then took the numbers from client_state.xml
<result>
   <name>famous_s303_599_200_000222072_1</name>
   <final_cpu_time>67100.070000</final_cpu_time> (18:38:20)
   <final_elapsed_time>69107.507910</final_elapsed_time> (19:11:48)
   <exit_status>0</exit_status>
   <state>2</state>
   <platform>windows_intelx86</platform>
   <version_num>611</version_num>
   <wu_name>famous_s303_599_200_000222072</wu_name>
   <report_deadline>1296099703.000000</report_deadline>
   <received_time>1288210643.950953</received_time>
   ....
</result>

<active_task>
   <project_master_url>http://cpdnbeta.oerc.ox.ac.uk/</project_master_url>
   <result_name>famous_s303_599_200_000222072_1</result_name>
   <active_task_state>9</active_task_state>
   <app_version_num>611</app_version_num>
   <slot>1</slot>
   <checkpoint_cpu_time>1220304.000000</checkpoint_cpu_time> (338:58:24)
   <checkpoint_elapsed_time>1262638.891236</checkpoint_elapsed_time> (350:43:59)
   <fraction_done>0.929725</fraction_done>
   <current_cpu_time>1220673.000000</current_cpu_time> (339:04:33)
   <once_ran_edf>0</once_ran_edf>
   <swap_size>102150144.000000</swap_size>
   <working_set_size>59330560.000000</working_set_size>
   <working_set_size_smoothed>59330560.000000</working_set_size_smoothed>
   <page_fault_rate>0.000000</page_fault_rate>
</active_task>
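
The parenthesised times above can be re-derived from the raw seconds. Below is a minimal Python sketch, not BT's or BOINC's actual code: the helper names `fmt_hms` and `fmt_dhms` are made up for illustration, and the two output styles are only inferred from the strings quoted in this post. Fractional seconds are truncated, which is an assumption.

```python
def fmt_hms(seconds):
    """Format seconds as H:MM:SS with unbounded hours (style of the BOINC Mgr values above)."""
    total = int(seconds)  # assumption: fractional seconds are truncated
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


def fmt_dhms(seconds):
    """Format seconds as DDd,HH:MM:SS, dropping the day part when it is zero (style of the BT values above)."""
    total = int(seconds)
    days, rem = divmod(total, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    clock = f"{hours:02d}:{minutes:02d}:{secs:02d}"
    return f"{days:02d}d,{clock}" if days else clock


# Values copied from the client_state.xml excerpt above.
final_cpu_time = 67100.07          # <result><final_cpu_time>
checkpoint_cpu_time = 1220304.0    # <active_task><checkpoint_cpu_time>
current_cpu_time = 1220673.0       # <active_task><current_cpu_time>

# CPU seconds spent since the last checkpoint.
since_checkpoint = current_cpu_time - checkpoint_cpu_time

print(fmt_hms(checkpoint_cpu_time))   # 338:58:24
print(fmt_dhms(checkpoint_cpu_time))  # 14d,02:58:24
print(fmt_hms(current_cpu_time))      # 339:04:33
print(fmt_dhms(final_cpu_time))       # 18:38:20
print(since_checkpoint)               # 369.0
```

With these inputs, the checkpoint value BT displays (14d,02:58:24) corresponds to `checkpoint_cpu_time` in `<active_task>`, while the CPU time it displays (18:38:20) corresponds to `final_cpu_time` in `<result>`. The difference of the two `<active_task>` values is 369 seconds, roughly six minutes.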


No idea what could have gone wrong. The task was just waiting, preempted. A few hours later, when the task was running again, BT was correctly displaying Elapsed times of around 14 days and the checkpoint delay in minutes.

Finally one snapshot:
Peter

Pepo

Quote from: Pepo on November 15, 2010, 09:38:53 AM
Well I've noticed that my CPDN task shows red highlighted "[0] 14d,02:58:24" from last checkpoint (which is pretty unusual for CPDN), but its elapsed times were just around 18:38:20 (19:11:47) hours (snapshots taken if necessary). So I've looked in BT's task properties
Name famous_s303_599_200_000222072_1
CPU time at last checkpoint 14d,02:58:24
CPU time 18:38:20
Elapsed time 19:11:47
Estimated time remaining 01d,02:36:25
Fraction done 92.972 %

And again now:
CPU time at last checkpoint 14d,21:02:11
CPU time 18:38:20
Elapsed time 19:11:47
Estimated time remaining 12:13:06
Fraction done 96.805 %

I'll try to let it run...
...suddenly it's fine, at least in the Tasks tab's Elapsed column (like 15d,10:02:43 (14d,21:09:47) a moment later), but not really in the Properties window:
CPU time at last checkpoint 14d,21:02:11
CPU time 18:38:20
Elapsed time 19:11:47
Estimated time remaining 12:12:59
Fraction done 96.806 %
and again wrong in the Tasks tab after suspending the task.

OK, it seems like the CPU time is constantly wrong in the Properties, and also wrong in the Elapsed column, while the task is not running.
Peter

fred

Quote from: Pepo on November 15, 2010, 11:37:41 AM
OK, it seems like the CPU time is constantly wrong in the Properties, and also wrong in the Elapsed column, while the task is not running.
That's why I proposed to show checkpoints only on running tasks. Values in other states may be unpredictable.

Pepo

Quote from: fred on November 15, 2010, 11:49:41 AM
Quote from: Pepo on November 15, 2010, 11:37:41 AM
OK, it seems like the CPU time is constantly wrong in the Properties, and also wrong in the Elapsed column, while the task is not running.
That's why I proposed to show checkpoints only on running tasks. Values in other states may be unpredictable.
How should the (current_cpu_time-checkpoint_cpu_time) value differ between running and other states? What should be predicted there?
The Properties' values are interestingly wrong too...
Peter

Pepo

0.86 behaves similarly. I've checked also the 0.84 and 0.83: Properties wrong the same way, but in the Tasks tab, CPU and checkpoint are fine, just the elapsed wall-clock time is wrong (19:11:47). Interestingly a Collatz task (around 1d2h) and PrimeGrid task (3 1/2 days) are everywhere displayed correctly, also in Properties.
Peter

Pepo

Quote from: Pepo on November 15, 2010, 12:11:52 PM
Interestingly a Collatz task (around 1d2h) and PrimeGrid task (3 1/2 days) are everywhere displayed correctly, also in Properties.
OK, to be fair to the CPDN task, I can now also see a waiting Collatz task with a highlighted [0] 01d,02:37:36 in the Checkpoint column. The following (apparently correct?) values can be seen in the Properties:
CPU time at last checkpoint 01d,02:37:36
CPU time 01d,00:53:03
Elapsed time 01d,02:20:55
Estimated time remaining 01:03:18
Fraction done 96.388 %

Perhaps an overflow by a number of days, or what?
Peter

Beyond

Quote from: fred on November 15, 2010, 11:49:41 AM
Quote from: Pepo on November 15, 2010, 11:37:41 AM
OK, it seems like the CPU time is constantly wrong in the Properties, and also wrong in the Elapsed column, while the task is not running.
That's why I proposed to show checkpoints only on running tasks. Values in other states may be unpredictable.
This makes sense to me.  Maybe that's why the last value of a task as it's ending gives such a crazy checkpoint value: because it's not actually running anymore?

Pepo

Quote from: Beyond on November 15, 2010, 05:08:06 PM
Quote from: fred on November 15, 2010, 11:49:41 AM
Quote from: Pepo on November 15, 2010, 11:37:41 AM
OK, it seems like the CPU time is constantly wrong in the Properties, and also wrong in the Elapsed column, while the task is not running.
That's why I proposed to show checkpoints only on running tasks. Values in other states may be unpredictable.
This makes sense to me.  Maybe that's why the last value of a task as it's ending gives such a crazy checkpoint value: because it's not actually running anymore?
NO! Because it is incorrectly calculated. Or incorrectly read from the protocol data. Or the RPC protocol contains wrong data. Or whatever else? Because I cannot imagine how the (current_cpu_time-checkpoint_cpu_time) value could be calculated incorrectly.

Note that the BOINC Manager has both values correct at the same moment in the Task Properties dialog, while BT's Properties sometimes show crazy elapsed + CPU values, which are the source of the Checkpoint column. And both BM and BT read the same GUI RPC data streams.
Peter

Pepo

Quote from: Pepo on November 15, 2010, 12:11:52 PM
Interestingly a Collatz task (around 1d2h) and PrimeGrid task (3 1/2 days) are everywhere displayed correctly, also in Properties.
OK, to be even more fair to the CPDN task, I can now also see the PrimeGrid task waiting with a highlighted [0] 03d,16:30:02 in the Checkpoint column. The following (apparently correct?) values can be seen in the Properties:
CPU time at last checkpoint 03d,16:30:02
CPU time 03d,13:06:42
Elapsed time 03d,18:24:29
Estimated time remaining 01d,18:55:47
Fraction done 75.260 %

No idea what the cause could be.
Peter

fred

Quote from: Pepo on November 16, 2010, 09:35:48 AM
Quote from: Pepo on November 15, 2010, 12:11:52 PM
Interestingly a Collatz task (around 1d2h) and PrimeGrid task (3 1/2 days) are everywhere displayed correctly, also in Properties.
OK, to be even more fair to the CPDN task, I can now also see the PrimeGrid task waiting with a highlighted [0] 03d,16:30:02 in the Checkpoint column. The following (apparently correct?) values can be seen in the Properties:
CPU time at last checkpoint 03d,16:30:02
CPU time 03d,13:06:42
Elapsed time 03d,18:24:29
Estimated time remaining 01d,18:55:47
Fraction done 75.260 %

No idea what the cause could be.
But as you can see, the CPU time at last checkpoint is higher than the CPU time. ???
This should be impossible, as the checkpoint is always in the past. The value should be the CPU time at the last checkpoint, and that can never be more than the CPU time.
The value shown in this case is the Checkpoint time.
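
The inconsistency can be checked mechanically from the quoted Properties values. This is only an illustrative Python sketch: `parse_duration` is a hypothetical helper written for the "DDd,HH:MM:SS" strings that appear in this thread, not part of BT or BOINC.

```python
import re

# Matches "03d,16:30:02" as well as "18:38:20" (the day part is optional).
_DURATION = re.compile(r"(?:(\d+)d,)?(\d+):(\d{2}):(\d{2})")


def parse_duration(text):
    """Convert a 'DDd,HH:MM:SS' or 'HH:MM:SS' string to whole seconds."""
    match = _DURATION.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    days, hours, minutes, seconds = match.groups()
    return int(days or 0) * 86400 + int(hours) * 3600 + int(minutes) * 60 + int(seconds)


# Values quoted from the PrimeGrid task's Properties.
checkpoint_cpu = parse_duration("03d,16:30:02")
cpu = parse_duration("03d,13:06:42")

# A checkpoint lies in the past, so its CPU time should not exceed the current CPU time.
plausible = checkpoint_cpu <= cpu

print(plausible)             # False
print(checkpoint_cpu - cpu)  # 12200
```

For the quoted values, the "CPU time at last checkpoint" exceeds the CPU time by 12200 seconds (3:23:20), so at least one of the two displayed fields does not hold what its label says.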

fred

Quote from: Beyond on November 14, 2010, 05:10:14 PM
2)  MW tasks run for 3-4 minutes on my systems.  My checkpoint warning triggers are set at 5, 10 & 20 minutes.  About 40% of the time the last BT update period shows a checkpoint time of over 20 minutes and thus flashes the > 20 minute warning color.  Don't know where this erroneous checkpoint time is coming from but it can't be correct to show a > 20 minute checkpoint time on a 3 minute WU.  I have set custom global checkpoints in config.xml.
I just checked the checkpoints:

The xml values will override the default settings.
The xml is read sequentially, so the LAST command in the list that matches will be the one used.

So don't place global expressions at the end, but at the beginning. See example: http://www.efmer.eu/forum_tt/index.php?topic=503.0
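
The "last match wins" behaviour can be sketched as follows. This is a hedged Python illustration, not BT's parser: the rule layout (project pattern, application pattern, three warning thresholds in minutes), the wildcard syntax, the default values, and all names such as `resolve_triggers` are assumptions made for the example. The real config.xml format is described in the linked topic.

```python
from fnmatch import fnmatchcase

# Hypothetical built-in thresholds in minutes, used when no rule matches.
DEFAULT_TRIGGERS = (5, 10, 20)


def resolve_triggers(rules, project, app):
    """Walk the rules in file order; every matching rule overwrites the previous one."""
    chosen = DEFAULT_TRIGGERS
    for project_pattern, app_pattern, triggers in rules:
        if fnmatchcase(project, project_pattern) and fnmatchcase(app, app_pattern):
            chosen = triggers  # a later match replaces an earlier one
    return chosen


# Global rule first, specific rule last: the specific rule is the last match.
good_order = [
    ("*", "*", (10, 20, 40)),
    ("milkyway*", "*", (1, 2, 3)),
]
# Global rule last: it matches everything and overrides the specific rule.
bad_order = list(reversed(good_order))

print(resolve_triggers(good_order, "milkyway@home", "any_app"))  # (1, 2, 3)
print(resolve_triggers(bad_order, "milkyway@home", "any_app"))   # (10, 20, 40)
```

Under this model a global expression placed at the end always wins, which is why it belongs at the beginning of the list.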

Corsair

In previous versions I reported a nasty bug that closes BT without saying anything; now I suspect what triggers it.

Two computers are running BT: one is x64 (the OS makes no difference, XP or Win 7 Pro) and the other is x32 (XP Pro SP3 x32). Both work marvellously, each one monitors the other, and the other attached computers are controlled from both of them as well.

Sometimes I have to restart the x64 machine. When this happens, and while the x64 machine is starting up, the program on the x32 machine crashes silently, without any window, message, etc.

If I restart BT immediately after the crash, it sometimes crashes again, silently as before. If I wait until BT on the x64 machine is up and running and then restart BT on the x32 machine, there is no problem.

Strange problem ?? ?? ?? ??
Roses don't bloom on the sailor's grave

Corsair.

fred

Quote from: Corsair on November 16, 2010, 10:26:18 AM
In previous versions I reported a nasty bug that closes BT without saying anything; now I suspect what triggers it.

Two computers are running BT: one is x64 (the OS makes no difference, XP or Win 7 Pro) and the other is x32 (XP Pro SP3 x32). Both work marvellously, each one monitors the other, and the other attached computers are controlled from both of them as well.

Sometimes I have to restart the x64 machine. When this happens, and while the x64 machine is starting up, the program on the x32 machine crashes silently, without any window, message, etc.

If I restart BT immediately after the crash, it sometimes crashes again, silently as before. If I wait until BT on the x64 machine is up and running and then restart BT on the x32 machine, there is no problem.

Strange problem ?? ?? ?? ??

No dump file at all?