A gap in Messages list (v0.63)

Pepo · July 09, 2010, 12:43:48 PM

I started the currently running BT v0.63 on 2.7.2010 at 19:11:18, after rebooting my machine. At around the same time I started the BOINC 6.10.57 client; the first messages (1, 2, 3, ...) displayed in BT are from 2.7.2010 19:11:14.

1			02.07.2010 19:11:14	Unrecognized tag in cc_config.xml: <heartbeat_debug>	
2			02.07.2010 19:11:14	Unrecognized tag in cc_config.xml: <force_auth_off>	
3			02.07.2010 19:11:15	Starting BOINC client version 6.10.57 for windows_x86_64	
4			02.07.2010 19:11:15	Config: don't compute while VCBuildHelper.exe__ is running	
5			02.07.2010 19:11:15	Config: don't compute while link.exe__ is running	
6			02.07.2010 19:11:15	Config: GUI RPC allowed from:

On 7.7.2010 I downloaded the 6.10.58 client, reinstalled it and, after a delay of maybe half an hour, started it at 11:12:18; it is still running. BT was running the whole time; I was observing the tasks' behavior before and after the installation (client alpha testing blah blah blah...).

Just now I've noticed that there is a large gap around the client's restart point, between messages Nr. 12280 and 12281:

12274	ralph@home		06.07.2010 3:42:14	[task_debug] result celldiv_zipA_2ozf_ProteinInterfaceDesign_2Jul2010_14839_4_0 checkpointed	
12275	Quake-Catcher Network	06.07.2010 3:43:01	[sched_op_debug] Starting scheduler request	
12276	Quake-Catcher Network	06.07.2010 3:43:01	Sending scheduler request: To send trickle-up message.	
12277	Quake-Catcher Network	06.07.2010 3:43:01	Not reporting or requesting tasks	
12278	Quake-Catcher Network	06.07.2010 3:43:01	[sched_op_debug] CPU work request: 0.00 seconds; 0.00 CPUs	
12279	Quake-Catcher Network	06.07.2010 3:43:01	[sched_op_debug] NVIDIA GPU work request: 0.00 seconds; 0.00 GPUs	
12280	Quake-Catcher Network	06.07.2010 3:43:05	Scheduler request completed	
---gap--- missing messages ---
12281	climateprediction.net	09.07.2010 12:45:46	[task_debug] result hadsm3dhet2_k4g1_006615411_8 checkpointed	
12282	Einstein@Home		09.07.2010 12:48:04	[task_debug] result p2030_53650_82622_0056_G62.31+00.80.N_6.dm_620_2 checkpointed	
12283	Quake-Catcher Network	09.07.2010 12:48:59	[task_debug] result qcnk_sc300_sta200_024526_0 checkpointed	
12284	WUProp@Home		09.07.2010 12:49:21	[task_debug] result wu_1274387826_252720_0 checkpointed

Could it be that BT somehow matched the IDs of the newer client's messages to the IDs of the older client's? (I remember that each message carries a unique ID, which starts from 1 when the client starts.)

The recent client's first messages (lines 1-6) are these, as seen by BOINC Manager:

07.07.2010 11:12:18		Unrecognized tag in cc_config.xml: <heartbeat_debug>
07.07.2010 11:12:18		Unrecognized tag in cc_config.xml: <force_auth_off>
07.07.2010 11:12:19		Starting BOINC client version 6.10.58 for windows_x86_64
07.07.2010 11:12:19		Config: don't compute while VCBuildHelper.exe__ is running
07.07.2010 11:12:19		Config: don't compute while link.exe__ is running
07.07.2010 11:12:19		Config: GUI RPC allowed from:

and a few messages around the end of the gap (lines 12278-12284):

09.07.2010 12:44:09	Quake-Catcher Network	Scheduler request completed
09.07.2010 12:44:09	Quake-Catcher Network	[sched_op_debug] Server version 609
09.07.2010 12:44:21	WUProp@Home	[task_debug] result wu_1274387826_252720_0 checkpointed
09.07.2010 12:45:46	climateprediction.net	[task_debug] result hadsm3dhet2_k4g1_006615411_8 checkpointed
09.07.2010 12:48:04	Einstein@Home	[task_debug] result p2030_53650_82622_0056_G62.31+00.80.N_6.dm_620_2 checkpointed
09.07.2010 12:48:59	Quake-Catcher Network	[task_debug] result qcnk_sc300_sta200_024526_0 checkpointed
09.07.2010 12:49:21	WUProp@Home	[task_debug] result wu_1274387826_252720_0 checkpointed

Just to note, the most recent messages are not far from the break point:

12695	WUProp@Home		09.07.2010 14:15:48	Starting task wu_1274387826_255522_0 using data_collect version 132	
12696	WUProp@Home		09.07.2010 14:15:50	[task_debug] result wu_1274387826_255522_0 checkpointed	
12697	Quake-Catcher Network	09.07.2010 14:16:12	[task_debug] result qcnk_sc300_sta000_024948_0 checkpointed	
12698	Einstein@Home		09.07.2010 14:18:30	[task_debug] result p2030_53650_82622_0056_G62.31+00.80.N_6.dm_620_2 checkpointed

Another note: out of curiosity, I counted the messages displayed in BOINC Manager. The line numbers from the gap until now exactly match BT's message numbers!!

MODIFIED:
Approx. 30 minutes after the gap's end time (possibly only a coincidence?), I unlocked my locked Windows session. The computer had been running and had not been put to sleep or hibernated; only the session had been locked since yesterday evening. I've been interacting with BT, just not looking closely at the messages, so I have no idea what was there...

Pepo · July 21, 2010, 01:12:51 PM

Quote from: Pepo on July 09, 2010, 12:43:48 PM
Just now I've noticed that there is a large gap around the client's restart point, between messages Nr. 12280 and 12281:
[...]
Could it be that BT somehow matched the IDs of the newer client's messages to the IDs of the older client? (I remember that each message contains its unique ID, which is started from 1 on the client start.)

For self-confirmation: the displayed message IDs match the output of "boinccmd.exe --get_message_count" and "boinccmd.exe --get_messages [ seqno ]".

Quote
Another note: out of curiosity, I've counted the messages, displayed in the BOINC Manager. The line numbers since the gap until now are exactly matching the BT's message Nr's!!

fred · July 21, 2010, 02:49:36 PM

Quote from: Pepo on July 21, 2010, 01:12:51 PM
Quote from: Pepo on July 09, 2010, 12:43:48 PM
Just now I've noticed that there is a large gap around the client's restart point, between messages Nr. 12280 and 12281:
[...]
Could it be that BT somehow matched the IDs of the newer client's messages to the IDs of the older client? (I remember that each message contains its unique ID, which is started from 1 on the client start.)
For self-confirmation - displayed message ID match the output of "boinccmd.exe --get_message_count" and "boinccmd.exe --get_messages [ seqno ]".
Quote
Another note: out of curiosity, I've counted the messages, displayed in the BOINC Manager. The line numbers since the gap until now are exactly matching the BT's message Nr's!!

BT considers the message log valid when the message numbers are sequential. If not, the log is cleared and read again.
The client only sends out the last 2500 lines; these are the only lines shown in the BOINC Manager.
BT appends new lines to the existing ones in memory.
The message numbers are generated by the BOINC client, and numbering starts over when the client starts.
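
The rule described above can be sketched roughly as follows. This is only an illustration of the described behavior, not BoincTasks' actual code; the names `is_sequential` and `merge` are made up for the example, and a real viewer would re-read the whole log from the client instead of keeping only the new lines.

```python
def is_sequential(numbers):
    """True when every number is exactly one higher than the previous one."""
    return all(b == a + 1 for a, b in zip(numbers, numbers[1:]))


def merge(log, new_lines):
    """Append new (number, text) lines; start over if the numbering breaks.

    Returns (log, cleared), where cleared is True when the old log was dropped.
    """
    combined = log + new_lines
    if is_sequential([number for number, _ in combined]):
        return combined, False
    # Numbering broke: drop the old log (hypothetical stand-in for "read again").
    return list(new_lines), True


log, cleared = merge([], [(1, "a"), (2, "b")])
log, cleared = merge(log, [(3, "c")])             # still sequential: appended
restarted, was_cleared = merge(log, [(1, "new")])  # numbering restarted: log rebuilt
```

Note that a check of this kind accepts any run of consecutive numbers, so a step from 12280 to 12281 passes even if the two lines were logged days apart.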

Pepo · July 21, 2010, 05:13:25 PM

Quote from: fred on July 21, 2010, 02:49:36 PM
1) BT considers the message log valid when the message numbers are sequential. If not the log is cleared and read again.
2) The client only sends out the last 2500 lines,
3) these are the only lines shown in the BOINC Manager.
4) BT appends new lines to the existing in memory.
5) The messages numbers are generated by the BOINC client and start numbering when the client starts.

So, if BT has logged some 17000 message lines from host xyz123 (beginning from ID=1), and the BOINC client on that host is then restarted and again generates messages starting with ID=1 (because of 5), what will happen? According to 1), the whole displayed log should have been cleared. That did not happen in my case. I will have to test it again sometime...
2) I know (wasn't this around 2000 some time ago?),
3) not completely true: BOINC Manager also keeps the complete message log (possibly many thousands of lines) until it loses its connection to the client; IIRC only then is the message log cleared and just the available messages reread,
4) ...if the newly arriving message IDs are larger than those already stored? I suspect that in my case some messages were skipped, some discarded and some appended,
5) I know.

Pepo · July 22, 2010, 11:14:01 AM

Quote from: Pepo on July 21, 2010, 05:13:25 PM
Quote from: fred on July 21, 2010, 02:49:36 PM
1) BT considers the message log valid when the message numbers are sequential. If not the log is cleared and read again.
4) BT appends new lines to the existing in memory.
So, if BT logs some 17000 message lines from host xyz123 (beginning from ID=1), then BOINC client on the same host is restarted and again generates messages starting with ID=1, what will happen? According to 1) the whole displayed log should have been cleared. This was not in my case. I have to test it sometimes again...

After accumulating some 12000 messages, I restarted the client. This time it happened exactly according to your description. One red empty message with ID=0 appeared at the end, then the remaining messages were highlighted in red, IDs zeroed, message texts emptied, messages discarded...

Quote
4) ...if the newly coming message IDs are larger than these already stored? I suspect that in my case, it seemed like some messages were skipped, some discarded and some appended

So, the case remains mysterious, placed ad acta...

Pepo · August 04, 2010, 06:49:19 PM

Today the bug reappeared. Yesterday, 03-Aug-2010 at 23:28:56, I upgraded BOINC on BT's localhost from 6.10.58 to 6.11.4. The initial messages were:

1			03.08.2010 23:28:56	Unrecognized tag in cc_config.xml: <guirpc_debug>	
2			03.08.2010 23:28:56	Unrecognized tag in cc_config.xml: <heartbeat_debug>	
3			03.08.2010 23:28:56	Unrecognized tag in cc_config.xml: <force_auth_off>	
4			03.08.2010 23:28:58	Starting BOINC client version 6.11.4 for windows_x86_64	
5			03.08.2010 23:28:58	This a development version of BOINC and may not function properly	
6			03.08.2010 23:28:58	Config: don't compute while VCBuildHelper.exe__ is running	
7			03.08.2010 23:28:58	Config: don't compute while link.exe__ is running	
8			03.08.2010 23:28:58	Config: GUI RPC allowed from:

Today, 04-Aug-2010, at 12:15:07 and 12:17:06 I had two short (20 seconds and 1 minute) unintentional client restarts (I blame the new Manager!). BT kept running (it has been running since 29.07.2010). And now I have again noticed a gap in the middle of the messages (spacing edited):

706	SETI@home		04.08.2010 4:47:09	[task] result 26my10aa.19507.4740.13.10.235_0 checkpointed	
707	SETI@home Beta Test	04.08.2010 4:48:37	[task] result 18no09aj.3284.25021.3.13.81_0 checkpointed	
708				04.08.2010 4:49:24	Can't resolve hostname in remote_hosts.cfg: pippi	
---gap--- missing messages ---
709	PrimeGrid		04.08.2010 19:57:18	[task] result psp_llr_59148376_1 checkpointed	
710	Collatz Conjecture	04.08.2010 19:58:58	[task] result collatz_1280384749_206427_1 checkpointed	
711	WUProp@Home		04.08.2010 19:59:51	[task] result wu_1280862948_5661_0 checkpointed	
712	Quake-Catcher Network	04.08.2010 20:00:22	[task] result qcnk_sc300_sta000_042472_0 checkpointed

The message counter is now at around 850.

Note: I see that it happened shortly after writing my report:

Quote from: Pepo on August 04, 2010, 10:12:07 AM
Now I'm back again in a "stable" situation, with a complete History, but my computers' names are displayed (luckily just in the Computers tab) as
SMachine1«Machine1
SMachine2«Machine2

While dealing with that, I displayed the Messages tab a LOT of times, but possibly not at all afterwards. (If it matters.)

fred · August 05, 2010, 07:57:44 AM

Quote from: Pepo on August 04, 2010, 06:49:19 PM

706	SETI@home		04.08.2010 4:47:09	[task] result 26my10aa.19507.4740.13.10.235_0 checkpointed
707	SETI@home Beta Test	04.08.2010 4:48:37	[task] result 18no09aj.3284.25021.3.13.81_0 checkpointed
708				04.08.2010 4:49:24	Can't resolve hostname in remote_hosts.cfg: pippi
---gap--- missing messages ---
709	PrimeGrid		04.08.2010 19:57:18	[task] result psp_llr_59148376_1 checkpointed
710	Collatz Conjecture	04.08.2010 19:58:58	[task] result collatz_1280384749_206427_1 checkpointed
711	WUProp@Home		04.08.2010 19:59:51	[task] result wu_1280862948_5661_0 checkpointed
712	Quake-Catcher Network	04.08.2010 20:00:22	[task] result qcnk_sc300_sta000_042472_0 checkpointed

The message counter is now at around 850.

As there is no gap in the numbering, it's a BOINC thing.
When the client restarts it should start numbering at 1 again.
And the time is sequential as well.

Pepo · August 05, 2010, 09:29:24 AM

Quote from: fred on August 05, 2010, 07:57:44 AM
Quote from: Pepo on August 04, 2010, 06:49:19 PM
707	SETI@home Beta Test	04.08.2010 4:48:37	[task] result 18no09aj.3284.25021.3.13.81_0 checkpointed
708				04.08.2010 4:49:24	Can't resolve hostname in remote_hosts.cfg: pippi
---gap--- missing messages ---
709	PrimeGrid		04.08.2010 19:57:18	[task] result psp_llr_59148376_1 checkpointed
710	Collatz Conjecture	04.08.2010 19:58:58	[task] result collatz_1280384749_206427_1 checkpointed
As there is no gap in the numbering, it's a BOINC thing.
When the client restarts it should start numbering at 1 again.
And the time is sequential as well.

Of course there is no gap in the numbering. And the client always starts numbering at 1: while testing the previous case, I also checked that the numbers requested and sent over GUI RPC match across boinccmd, BoincTasks, BOINC Manager and the message lines in the log file.
But even if there is no gap in the numbering and the times are sequential, it does not mean that the displayed fragments match seamlessly.

fred · August 05, 2010, 10:12:36 AM

Quote from: Pepo on August 05, 2010, 09:29:24 AM
Of course there is no gap in the numbering. And the client always starts numbering at 1: while testing the previous case, I also checked that the numbers requested and sent over GUI RPC match across boinccmd, BoincTasks, BOINC Manager and the message lines in the log file.
But even if there is no gap in the numbering and the times are sequential, it does not mean that the displayed fragments match seamlessly.

If there is no gap in the numbering, I can't do anything about it. The problem is in the BOINC client.

wicked · August 06, 2010, 04:12:29 PM

Quote from: fred on August 05, 2010, 10:12:36 AM
If there is no gap in the numbering, I can't do anything about it. The problem is in the BOINC client.

Do you always transfer all messages from the start, or does BoincTasks have some kind of cache for the messages? If it does, maybe there's a missing "clear cache" call somewhere?

fred · August 06, 2010, 05:21:16 PM

Quote from: wicked on August 06, 2010, 04:12:29 PM
Quote from: fred on August 05, 2010, 10:12:36 AM
If there is no gap in the numbering, I can't do anything about it. The problem is in the BOINC client.

Do you always transfer all messages from the start, or does BoincTasks have some kind of cache for the messages? If it does, maybe there's a missing "clear cache" call somewhere?

Messages are read from the last known number and up. So only the updated messages are read.
They are stored in stdoutdae.txt, but only the BOINC client reads it.

Pepo · August 06, 2010, 08:34:53 PM

Quote from: fred on August 06, 2010, 05:21:16 PM
Quote from: wicked on August 06, 2010, 04:12:29 PM
Quote from: fred on August 05, 2010, 10:12:36 AM
If there is no gap in the numbering, I can't do anything about it. The problem is in the BOINC client.
Do you always transfer all messages from the start, or does BoincTasks have some kind of cache for the messages? If it does, maybe there's a missing "clear cache" call somewhere?
Messages are read from the last known number and up. So only the updated messages are read.

wicked, there are two ways to get messages from the client: a) request all available messages (the client keeps up to the last 2000 cached in memory in a FIFO buffer); they are numbered from 1 since launch, and the available ones have numbers n ... n+1999; b) if you know the ID of your most recently read message (in our case n+1999), you can request the following ones, i.e. from ID=n+2000, and you will get that one and any newer ones (provided they already exist).
Thus the usual way (for apps like BOINC Manager, BoincTasks, BoincView, etc.) is to initially get all available messages and then repeatedly (every couple of seconds) ask just for the newer ones.
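
As a rough illustration of the two request modes, here is a toy model of the client side. It is a sketch only: the class `ClientMessages` and its methods are invented for the example and are not the BOINC GUI RPC API, and the buffer size is simply a parameter (this thread mentions both 2000 and 2500).

```python
from collections import deque


class ClientMessages:
    """Toy model of the client: a bounded FIFO of numbered messages."""

    def __init__(self, capacity=2000):
        self.buffer = deque(maxlen=capacity)  # oldest entries fall out first
        self.next_id = 1  # numbering starts at 1 on every client start

    def log(self, text):
        self.buffer.append((self.next_id, text))
        self.next_id += 1

    def get_messages(self, after_id=0):
        """Mode a) after_id=0: everything still buffered.
        Mode b) after_id=N: only messages with an ID above N."""
        return [m for m in self.buffer if m[0] > after_id]


client = ClientMessages(capacity=3)  # tiny buffer to keep the example short
for i in range(1, 6):
    client.log(f"msg {i}")

everything = client.get_messages()      # initial read: whatever is still cached
last_seen = everything[-1][0]
client.log("msg 6")
newer = client.get_messages(last_seen)  # later polls: only what came after
```

With a buffer of three, the initial read returns only IDs 3 to 5, which mirrors the "n ... n+1999" window described above.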

Quote from: fred on August 06, 2010, 05:21:16 PM
They are stored in stdoutdae.txt, but only the BOINC client reads it.

I doubt this... I think the client immediately logs them into the file and never ever reads them back again.

wicked · August 08, 2010, 09:17:25 AM

Okay, but how does BoincTasks know when to discard the Messages list and start over? I mean when something like this happens:

Computer B has messages 1-10, which BoincTasks running on Computer A is showing on its Messages tab.
Computer A hangs and doesn't retrieve messages for a few minutes.
Meanwhile Computer B gets restarted and starts storing messages from 1. It now has a new buffer of messages 1-15.
Computer A recovers and BoincTasks asks for messages starting from 11 and gets messages 11-15 from the new message buffer. It happily appends them to the old buffer of 1-10, thus creating the original problem.

So basically I'm wondering whether there is really no way for BoincTasks to detect a client restart and refresh its message buffer in a situation like this?

What happens if BoincTasks asks for messages from 11 and instead gets a response that there are only messages 1-5 in the buffer? Does it clear its old messages in this case?
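
The scenario can be played through in a few lines. This is a simulation under the assumptions stated in the scenario, not a description of what BoincTasks really does; `poll` and `restart_suspected` are hypothetical helpers, and the count passed to the latter stands in for something like the value reported by `boinccmd --get_message_count`.

```python
def poll(viewer_log, client_messages):
    """Incremental poll: ask only for IDs above the last one already shown."""
    last_seen = viewer_log[-1][0] if viewer_log else 0
    return viewer_log + [m for m in client_messages if m[0] > last_seen]


def restart_suspected(viewer_log, client_message_count):
    """Hypothetical check: the client reports fewer messages than already seen."""
    return bool(viewer_log) and client_message_count < viewer_log[-1][0]


# Computer B before the restart: messages 1-10, all shown by the viewer.
old_run = [(i, f"old run, message {i}") for i in range(1, 11)]
viewer = poll([], old_run)

# B restarts while the viewer hangs, then logs 15 new messages.
new_run = [(i, f"new run, message {i}") for i in range(1, 16)]
spliced = poll(viewer, new_run)

ids = [i for i, _ in spliced]          # 1..15 with no gap in the numbering
missed_restart = restart_suspected(viewer, 15)  # new run is already past ID 10
caught_restart = restart_suspected(viewer, 5)   # new run has only 5 messages
```

In this sketch the spliced list is numbered 1 to 15 without a gap, yet entries 1-10 come from the old run and 11-15 from the new one. The count comparison catches the "only messages 1-5" case but not the 1-15 case.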

fred · August 08, 2010, 12:04:36 PM

Quote from: wicked on August 08, 2010, 09:17:25 AM
Okay, but how does BoincTasks know when to discard the Messages list and start over? I mean when something like this happens:

Computer B has messages 1-10, which BoincTasks running on Computer A is showing on its Messages tab.
Computer A hangs and doesn't retrieve messages for a few minutes.
Meanwhile Computer B gets restarted and starts storing messages from 1. It now has a new buffer of messages 1-15.
Computer A recovers and BoincTasks asks for messages starting from 11 and gets messages 11-15 from the new messages buffer. It happily appends them to the old buffer of 1-10 and thus creating the original problem.

So basically I'm wondering whether there is really no way for BoincTasks to detect a client restart and refresh its message buffer in a situation like this?

What happens if BoincTasks asks for messages from 11 and instead gets a response that there are only messages 1-5 in the buffer? Does it clear its old messages in this case?

It has all these checks and more.
A disconnect clears the buffer. But a short restart of a few seconds may go unnoticed. Then again, it's highly unlikely that the buffer would be in the 600 range after a few seconds.
I've tried very hard to reproduce any kind of problem.

But 100% is hard to get; I don't want to slow the messages down to a crawl just to catch every unlikely problem.
