History behavior during client's rapid upload+report behavior

Pepo · April 07, 2011, 12:09:04 PM

Because I still can't remember the difference between History's normal mode and Smart mode (and BT's Manual page does not explain it either), I'll leave it unmentioned.

We know that if BT's History fetching notices a task's progress approaching (or exceeding) 100%, it fetches the task's state far more often, in order not to miss the moment the task finishes. At that moment the task is forgotten, and often by the next history refresh it has already been uploaded and reported (often within some 3-7 seconds) -> "Reported: OK *".

I'd like to ask for an extension of this behavior to the time while any task is seen to be actively uploading. Such a state usually lasts just a few seconds per task (except for some multi-MB beasties - BTW, is there any known estimate of when an upload or download will finish?). I believe the number of missed reports would be far lower this way. Of course, there is no need for faster History fetching if networking is currently disabled or all uploads are postponed due to connection problems.

[Edit] As soon as a task is noticed to be finished, its result files are expected to be uploaded immediately... thus the History fetching could immediately start checking the uploads.

Pepo · April 07, 2011, 08:04:27 PM

Quote from: Pepo on April 07, 2011, 12:09:04 PM
We know that if BT's History fetching notices a task's progress approaching (or exceeding) 100%, it fetches the task's state far more often, in order not to miss the moment the task finishes. At that moment the task is forgotten...

...but the task's state continues to be fast-fetched if the task's progress continues above 100% - I believe that in such a case the task should be ignored.

fred · April 07, 2011, 08:12:34 PM

Quote from: Pepo on April 07, 2011, 12:09:04 PM
Because I still can't remember the difference between History's normal mode and Smart mode (and BT's Manual page does not explain it either), I'll leave it unmentioned.

We know that if BT's History fetching notices a task's progress approaching (or exceeding) 100%, it fetches the task's state far more often, in order not to miss the moment the task finishes. At that moment the task is forgotten, and often by the next history refresh it has already been uploaded and reported (often within some 3-7 seconds) -> "Reported: OK *".

The difference is that normally the history is fetched at a regular interval.
In Smart mode the interval depends on the time left (time left / 2), with a maximum given by the "Maximum update time" setting.
Only the computers that are within 1 minute will update.
And even when there is an update, only the running tasks are read, so the overhead should still be quite low.
A complete fetch is only done once per "Maximum update time".
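
As a rough illustration of the Smart mode rule described above, here is a minimal Python sketch. It is not BT's actual code: the function name, the parameter names and the 1-second lower bound are assumptions made for the example. Only the "time left / 2" rule and the "Maximum update time" cap come from the description.

```python
def smart_refresh_interval(time_left_s, max_update_s, min_update_s=1.0):
    """Seconds to wait before the next history fetch for one task.

    Hypothetical helper: half of the estimated time left, capped at the
    "Maximum update time" setting and never below an assumed lower bound.
    """
    half_left = time_left_s / 2.0
    # Clamp into [min_update_s, max_update_s].
    return max(min_update_s, min(half_left, max_update_s))


if __name__ == "__main__":
    for left in (600, 60, 1, -5):
        print(left, smart_refresh_interval(left, max_update_s=120))
```

Note that in this sketch a zero or negative "time left" always lands on the lower bound, so such a task would keep being fetched at the fastest rate. That would be consistent with the fast fetching reported in this thread for tasks above 100%, but it is only a guess at the cause.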

The * isn't very relevant for the data. The * is shown during upload, so all the data is already there. Without the *, the WU is seen as completed = upload completed.

fred · April 07, 2011, 08:14:02 PM

Quote from: Pepo on April 07, 2011, 08:04:27 PM
Quote from: Pepo on April 07, 2011, 12:09:04 PM
We know that if BT's History fetching notices a task's progress approaching (or exceeding) 100%, it fetches the task's state far more often, in order not to miss the moment the task finishes. At that moment the task is forgotten...
...but the task's state continues to be fast-fetched if the task's progress continues above 100% - I believe that in such a case the task should be ignored.

What is the time left at that moment?
The impact should still be low as only the running tasks are read.

Pepo · April 08, 2011, 10:30:07 AM

Quote from: fred on April 07, 2011, 08:14:02 PM
Quote from: Pepo on April 07, 2011, 08:04:27 PM
Quote from: Pepo on April 07, 2011, 12:09:04 PM
We know that if BT's History fetching notices a task's progress approaching (or exceeding) 100%, it fetches the task's state far more often, in order not to miss the moment the task finishes. At that moment the task is forgotten...
...but the task's state continues to be fast-fetched if the task's progress continues above 100% - I believe that in such a case the task should be ignored.
What is the time left at that moment?
The impact should still be low as only the running tasks are read.

The task in question is a QCN task, running on a machine without any supported acceleration sensor. Normally, the application monitors the sensor, reports measured quake data to the server and receives lists of known recent earthquakes to show in its graphics application, and finishes after 24 hours, reaching 100%.

When it runs without a sensor attached, the application just receives the lists of known recent earthquakes (to be displayed) and continues to run for maybe a week. Then it terminates automatically and is replaced by a new task (and possibly an updated application). The progress can reach a few hundred percent (it was already over 400% yesterday).

Unfortunately I do not remember its "time left" estimate at all (maybe negative??), will tell some half day later.

The impact of additional checking indeed seems to be low (at least for a local client, I've had no opportunity to check network transfers and delays for a remote client), but as can be seen, in case of some apps (and unfortunately there will always be some broken apps) it can (unnecessarily) take hours-to-days. Unnecessarily, because if some task does not finish at 100%, it probably intends to continue to some other value...

Pepo · April 08, 2011, 11:48:40 AM

Quote from: Pepo on April 08, 2011, 10:30:07 AM
Quote from: fred on April 07, 2011, 08:14:02 PM
What is the time left at that moment?
Unfortunately I do not remember its "time left" estimate at all (maybe negative??), will tell some half day later.

OK, I've managed it sooner: The "Time left" is "-" or "-4d,16:36", depending on where you look (there was a small time difference of 2-3 minutes between the snapshots):

Application		Elapsed time		Progress	CPU %	Checkpoint	Use	Time left	Deadline
6.50 QCN Sensor (nci)	05d,23:00:53 (02:05:27)	471.276		0.020	[1] 00:00:00	0.01C	-		-10d,07:01:00

or

Project			Quake-Catcher Network
Application		QCN Sensor 6.50 (nci)
Workunit name		qcnac_053854
State			Running
Received		15.03.11 05:35
Report deadline		29.03.11 06:35
Resources		0.01 CPUs
CPU time at last check.	02:05:27
CPU time		02:05:27
Elapsed time		05d,22:57:29
Estimated time remain.	-04d,16:36:43
Fraction done		0.000 %
Working set size	1.89 MB

BOINC Manager tells "---" in both cases.

fred · April 08, 2011, 07:29:36 PM

Quote from: Pepo on April 08, 2011, 11:48:40 AM
BOINC Manager tells "---" in both cases.

on the bug list: Bug: Don't show negative time left and deadline.

Pepo · April 09, 2011, 04:28:59 PM

Quote from: fred on April 08, 2011, 07:29:36 PM
Quote from: Pepo on April 08, 2011, 11:48:40 AM
BOINC Manager tells "---" in both cases.
on the bug list: Bug: Don't show negative time left and deadline.

No nooooo please hold on! Both are good indications of either a buggy application or a task past its deadline.

Why not just stop fast refreshing on tasks over 100%?
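
To make the suggestion concrete, here is a small Python sketch of such a check. It is purely illustrative: the `Task` fields, the function name and the 99% "approaching" threshold are assumptions, not anything taken from BT.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    progress_pct: float  # progress as reported by the client, in percent
    running: bool


def wants_fast_refresh(task, near_done_pct=99.0):
    """True only while a running task is close to, but not past, 100 %.

    near_done_pct is a hypothetical threshold for "approaching 100 %".
    """
    return task.running and near_done_pct <= task.progress_pct <= 100.0


if __name__ == "__main__":
    qcn = Task("qcnac_053854", 471.276, True)
    print(wants_fast_refresh(qcn))  # a task far above 100 % is skipped
```

A task sitting at exactly 100% would still qualify under this check, so on its own it does not cover a task that stays at 100% for a long time.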

fred · April 09, 2011, 05:49:40 PM

Quote from: Pepo on April 09, 2011, 04:28:59 PM
Quote from: fred on April 08, 2011, 07:29:36 PM
Quote from: Pepo on April 08, 2011, 11:48:40 AM
BOINC Manager tells "---" in both cases.
on the bug list: Bug: Don't show negative time left and deadline.
No nooooo please hold on! Both are good indications of either a buggy application or a task past its deadline.

Why not just stop fast refreshing on tasks over 100%?

OK, I'll change this to ??, indicating an error in the number.
The values are way out of range and the actual numbers mean nothing.

I changed the minimum refresh time from 1 to 2 seconds, and only the running tasks are read, so the impact is really low.

Pepo · April 09, 2011, 08:43:19 PM

Quote from: fred on April 09, 2011, 05:49:40 PM
Quote from: Pepo on April 09, 2011, 04:28:59 PM
Quote from: fred on April 08, 2011, 07:29:36 PM
on the bug list: Bug: Don't show negative time left and deadline.
No nooooo please hold on! Both are good indications of either a buggy application or a task past its deadline.
OK, I'll change this to ??, indicating an error in the number.
The values are way out of range and the actual numbers mean nothing.

With negative Time left you're probably right - meaningless value calculated out of meaningless values? "??" could be a good replacement.
But the red negative Deadline value is still correct, whether in "date+time" or relative "dd,hh:mm" format, isn't it?
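
A minimal Python sketch of that distinction, assuming both values are available as signed seconds. The function names are invented for the example, and the "dd,hh:mm:ss" layout simply mimics the values in the snapshots above; none of this is BT's code.

```python
def format_duration(seconds):
    """Format signed seconds as [-]DDd,HH:MM:SS (or [-]HH:MM:SS under a day)."""
    sign = "-" if seconds < 0 else ""
    total = abs(int(seconds))
    days, rest = divmod(total, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, secs = divmod(rest, 60)
    if days:
        return f"{sign}{days:02d}d,{hours:02d}:{minutes:02d}:{secs:02d}"
    return f"{sign}{hours:02d}:{minutes:02d}:{secs:02d}"


def format_time_left(seconds):
    """A negative estimate is meaningless, so show "??" instead."""
    return "??" if seconds < 0 else format_duration(seconds)


def format_deadline(seconds_until_deadline):
    """A negative value is still correct: the deadline has passed."""
    return format_duration(seconds_until_deadline)


if __name__ == "__main__":
    print(format_time_left(-405403))   # ??
    print(format_deadline(-889260))    # -10d,07:01:00
```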

Pepo · May 24, 2011, 04:40:16 PM

Let's continue here with BT 1.03.

Quote from: Pepo on May 20, 2011, 06:23:18 PM
Quote from: fred on May 20, 2011, 04:12:20 PM
Quote from: Pepo on May 20, 2011, 04:07:16 PM
I assume it is related to the Progress% and running state (fast history fetching kicks in again?) - 3 minutes after I resumed the EVO task, BT's CPU usage jumped to 2/3 of a core and stayed there for hours, except for 3 minutes while the EVO task was temporarily suspended.
The progress comes from the BOINC client and is wrong....
I believe it, and I'm sure that the client gets it directly from EVO's wrapper (which behaves anything but correctly) - BOINC Manager sees the same. But the times are wrong.

Quote
The CPU usage comes from the BOINC client as well. Is the time difference over a couple of seconds?
How so?? What time difference of a few seconds? BT simply consumes that much CPU time and is totally unresponsive (as if glue had been poured over it, the same feeling as when GPU tasks are blocking everything visible)... When I suspend the EVO 100% task, BT's CPU usage goes down to 1-2% within a few seconds. When I resume the EVO task, BT's CPU usage goes up again in approx. 30 seconds and it becomes sluggish again. And it suddenly does 4x more I/O.

A DNETC 2.02 task dnetc_cpu_normal_4071970_0 approached 100% progress a quarter of an hour ago. For BT this means 60% load on one core, a higher I/O rate and an unresponsive GUI - indefinitely...

Whether it (Progress >= 100%) is related or not, the task's line in the Tasks tab did not display the task's Elapsed time. Two subsequent snapshots of the main window and the task's Properties dialog:

Application	Elapsed time		Progress	CPU %	Checkpoint	Use
2.02 DNETC@Home	00:02:19 (02:48:56)	100.000		100.00	[0] 02:48:56	-

or (there was a small time difference of a few seconds between the snapshots)

Project			DNETC@HOME
Application		DNETC@Home 2.02
State			Running
CPU time at last check.	-
CPU time		02:48:42
Elapsed time		04:10:33
Estimated time remain.	-
Fraction done		100.000 %

and

Application	Elapsed time		Progress	CPU %	Checkpoint	Use
2.02 DNETC@Home	00:02:19 (02:49:54)	100.000		100.00	[0] 02:49:54	-

or (there was a small time difference of a few seconds between the snapshots)

Project			DNETC@HOME
Application		DNETC@Home 2.02
State			Running
CPU time at last check.	-
CPU time		02:49:58
Elapsed time		04:12:37
Estimated time remain.	-
Fraction done		100.000 %

I think that if the 100% progress situation lasts longer than expected for a particular task (and the remaining time is also no longer available instead of approaching zero), BT should slowly lengthen the rapid History refresh period, ideally up to the normal predefined interval.
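
One way such a gradual lengthening could look, as a Python sketch only. The function name, the doubling factor and the default intervals are assumptions picked for the example; BT's real refresh logic is not shown anywhere in this thread.

```python
def stalled_interval(refreshes_at_100, fast_s=2.0, normal_s=60.0, factor=2.0):
    """Refresh interval for a task that has stayed at 100 % for a while.

    Starts at the fast interval and multiplies it by `factor` for every
    refresh that still finds the task at 100 %, never exceeding the
    normal interval. All default values are hypothetical.
    """
    interval = fast_s
    for _ in range(refreshes_at_100):
        interval *= factor
        if interval >= normal_s:
            # Cap reached: stay at the normal predefined interval.
            return normal_s
    return min(interval, normal_s)


if __name__ == "__main__":
    print([stalled_interval(n) for n in range(7)])
```

With these defaults the fast fetching fades out after five refreshes, so a task stuck at 100% for hours would cost no more than an ordinary one.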

I have even tried to switch off the "Smart mode" for History fetching, but it had no influence!

Apparently it might be some very different problem...

Quote from: Pepo on May 20, 2011, 06:23:18 PM
One more weird problem: when in such slowly responsive state, BT is able to suspend/resume any of the running or ready-to-run tasks, but sometimes somehow can not modify the suspend/resume state of the not-yet-started (ready to start) tasks (I've tried to restart it a couple of times, but no joy - BOINC Manager had to help.)

I've checked this again - yes, suspending or resuming any of the ready-to-run tasks was no problem, but out of many attempts on 4 ready-to-start (not yet active) tasks, I managed to suspend just one of them and was then not able to resume it anymore. BOINC Manager had to do it afterwards.
