A gap in Messages list (v0.63)

Started by Pepo, July 09, 2010, 12:43:48 PM

Previous topic - Next topic

0 Members and 6 Guests are viewing this topic.

Pepo

I've started the currently running BT v0.63 on 2.7.2010 at 19:11:18, after rebooting my machine. At around the same time I've started the BOINC 6.10.57 client, first messages (1, 2, 3, ...) displayed in BT are from 2.7.2010 19:11:14.

1 02.07.2010 19:11:14 Unrecognized tag in cc_config.xml: <heartbeat_debug>
2 02.07.2010 19:11:14 Unrecognized tag in cc_config.xml: <force_auth_off>
3 02.07.2010 19:11:15 Starting BOINC client version 6.10.57 for windows_x86_64
4 02.07.2010 19:11:15 Config: don't compute while VCBuildHelper.exe__ is running
5 02.07.2010 19:11:15 Config: don't compute while link.exe__ is running
6 02.07.2010 19:11:15 Config: GUI RPC allowed from:


On 7.7.2010 I've downloaded the 6.10.58 client, reinstalled it and after a maybe 1/2 hour delay I've started it at 11:12:18 and it still runs. BT was still running, I was observing the tasks'behavior prior to and after the installation (client alpha testing blah blah blah...).

Just now I've noticed that there is a large gap around the client's restart point, between messages Nr. 12280 and 12281:

12274 ralph@home 06.07.2010 3:42:14 [task_debug] result celldiv_zipA_2ozf_ProteinInterfaceDesign_2Jul2010_14839_4_0 checkpointed
12275 Quake-Catcher Network 06.07.2010 3:43:01 [sched_op_debug] Starting scheduler request
12276 Quake-Catcher Network 06.07.2010 3:43:01 Sending scheduler request: To send trickle-up message.
12277 Quake-Catcher Network 06.07.2010 3:43:01 Not reporting or requesting tasks
12278 Quake-Catcher Network 06.07.2010 3:43:01 [sched_op_debug] CPU work request: 0.00 seconds; 0.00 CPUs
12279 Quake-Catcher Network 06.07.2010 3:43:01 [sched_op_debug] NVIDIA GPU work request: 0.00 seconds; 0.00 GPUs
12280 Quake-Catcher Network 06.07.2010 3:43:05 Scheduler request completed
---gap--- missing messages ---
12281 climateprediction.net 09.07.2010 12:45:46 [task_debug] result hadsm3dhet2_k4g1_006615411_8 checkpointed
12282 Einstein@Home 09.07.2010 12:48:04 [task_debug] result p2030_53650_82622_0056_G62.31+00.80.N_6.dm_620_2 checkpointed
12283 Quake-Catcher Network 09.07.2010 12:48:59 [task_debug] result qcnk_sc300_sta200_024526_0 checkpointed
12284 WUProp@Home 09.07.2010 12:49:21 [task_debug] result wu_1274387826_252720_0 checkpointed


Could it be that BT somehow matched the IDs of the newer client's messages to the IDs of the older client? (I remember that each message contains its unique ID, which is started from 1 on the client start.)

The recent client's first messages (lines 1-6) are these, as seen by BOINC Manager:

07.07.2010 11:12:18 Unrecognized tag in cc_config.xml: <heartbeat_debug>
07.07.2010 11:12:18 Unrecognized tag in cc_config.xml: <force_auth_off>
07.07.2010 11:12:19 Starting BOINC client version 6.10.58 for windows_x86_64
07.07.2010 11:12:19 Config: don't compute while VCBuildHelper.exe__ is running
07.07.2010 11:12:19 Config: don't compute while link.exe__ is running
07.07.2010 11:12:19 Config: GUI RPC allowed from:


and a few messages around the end of the gap (lines 12278-12284):

09.07.2010 12:44:09 Quake-Catcher Network Scheduler request completed
09.07.2010 12:44:09 Quake-Catcher Network [sched_op_debug] Server version 609
09.07.2010 12:44:21 WUProp@Home [task_debug] result wu_1274387826_252720_0 checkpointed
09.07.2010 12:45:46 climateprediction.net [task_debug] result hadsm3dhet2_k4g1_006615411_8 checkpointed
09.07.2010 12:48:04 Einstein@Home [task_debug] result p2030_53650_82622_0056_G62.31+00.80.N_6.dm_620_2 checkpointed
09.07.2010 12:48:59 Quake-Catcher Network [task_debug] result qcnk_sc300_sta200_024526_0 checkpointed
09.07.2010 12:49:21 WUProp@Home [task_debug] result wu_1274387826_252720_0 checkpointed


Just for note, the most recent messages are not far from the break point:

12695 WUProp@Home 09.07.2010 14:15:48 Starting task wu_1274387826_255522_0 using data_collect version 132
12696 WUProp@Home 09.07.2010 14:15:50 [task_debug] result wu_1274387826_255522_0 checkpointed
12697 Quake-Catcher Network 09.07.2010 14:16:12 [task_debug] result qcnk_sc300_sta000_024948_0 checkpointed
12698 Einstein@Home 09.07.2010 14:18:30 [task_debug] result p2030_53650_82622_0056_G62.31+00.80.N_6.dm_620_2 checkpointed


Another note: out of curiosity, I've counted the messages, displayed in the BOINC Manager. The line numbers since the gap until now are exactly matching the BT's message Nr's!!

MODIFIED:
At around Approx. 30 minutes after the gap end time (possibly only a coincidence?), I've just unlocked my locked Windows session, the computer was running and was not sleeped or hibernated, just the session was locked since yesterday evening. I've been interacting with BT, just not looked exactly at the messages, so have no idea what was there...
Peter

Pepo

Quote from: Pepo on July 09, 2010, 12:43:48 PM
Just now I've noticed that there is a large gap around the client's restart point, between messages Nr. 12280 and 12281:
[...]
Could it be that BT somehow matched the IDs of the newer client's messages to the IDs of the older client? (I remember that each message contains its unique ID, which is started from 1 on the client start.)
For self-confirmation - displayed message ID match the output of "boinccmd.exe --get_message_count" and "boinccmd.exe --get_messages [ seqno ]".
Quote
Another note: out of curiosity, I've counted the messages, displayed in the BOINC Manager. The line numbers since the gap until now are exactly matching the BT's message Nr's!!
Peter

fred

Quote from: Pepo on July 21, 2010, 01:12:51 PM
Quote from: Pepo on July 09, 2010, 12:43:48 PM
Just now I've noticed that there is a large gap around the client's restart point, between messages Nr. 12280 and 12281:
[...]
Could it be that BT somehow matched the IDs of the newer client's messages to the IDs of the older client? (I remember that each message contains its unique ID, which is started from 1 on the client start.)
For self-confirmation - displayed message ID match the output of "boinccmd.exe --get_message_count" and "boinccmd.exe --get_messages [ seqno ]".
Quote
Another note: out of curiosity, I've counted the messages, displayed in the BOINC Manager. The line numbers since the gap until now are exactly matching the BT's message Nr's!!
BT considers the message log valid when the message numbers are sequential. If not the log is cleared and read again.
The client only sends out the last 2500 lines, these are the only lines shown in the BOINC Manager.
BT appends new lines to the existing in memory.
The messages numbers are generated by the BOINC client and start numbering when the client starts.

Pepo

Quote from: fred on July 21, 2010, 02:49:36 PM
1) BT considers the message log valid when the message numbers are sequential. If not the log is cleared and read again.
2) The client only sends out the last 2500 lines,
3) these are the only lines shown in the BOINC Manager.
4) BT appends new lines to the existing in memory.
5) The messages numbers are generated by the BOINC client and start numbering when the client starts.
So, if BT logs some 17000 message lines from host xyz123 (beginning from ID=1), then BOINC client on the same host is restarted and again generates messages starting with ID=1 (because of 5), what will happen? According to 1) the whole displayed log should have been cleared. This was not in my case. I have to test it sometimes again...
2) I know (wasn't this around 2000 some time ago?),
3) not completely true, BOINC manager is also keeping the complete (may be many many thousands) message log until it looses connection to the client, IIRC just then the message log is cleared and only the available messages are reread,
4) ...if the newly coming message IDs are larger than these already stored? I suspect that in my case, it seemed like some messages were skipped, some discarded and some appended ???
5) I know.
Peter

Pepo

Quote from: Pepo on July 21, 2010, 05:13:25 PM
Quote from: fred on July 21, 2010, 02:49:36 PM
1) BT considers the message log valid when the message numbers are sequential. If not the log is cleared and read again.
4) BT appends new lines to the existing in memory.
So, if BT logs some 17000 message lines from host xyz123 (beginning from ID=1), then BOINC client on the same host is restarted and again generates messages starting with ID=1, what will happen? According to 1) the whole displayed log should have been cleared. This was not in my case. I have to test it sometimes again...
After having some 12000 messages, I've restarted the client. This time it happened exactly according to your description. One red empty message with ID=0 appeared at the end, then the remaining messages were red highlighted, IDs zeroed, message texts emptied, messages discarded...
Quote
4) ...if the newly coming message IDs are larger than these already stored? I suspect that in my case, it seemed like some messages were skipped, some discarded and some appended ???
So, the case remains mysterious, placed ad acta...
Peter

Pepo

Today the bug reappeared. Yesterday at 03-Aug-2010 23:28:56 I've upgraded BOINC on BT's localhost from 6.10.58 to 6.11.4. The initial messages were:

1 03.08.2010 23:28:56 Unrecognized tag in cc_config.xml: <guirpc_debug>
2 03.08.2010 23:28:56 Unrecognized tag in cc_config.xml: <heartbeat_debug>
3 03.08.2010 23:28:56 Unrecognized tag in cc_config.xml: <force_auth_off>
4 03.08.2010 23:28:58 Starting BOINC client version 6.11.4 for windows_x86_64
5 03.08.2010 23:28:58 This a development version of BOINC and may not function properly
6 03.08.2010 23:28:58 Config: don't compute while VCBuildHelper.exe__ is running
7 03.08.2010 23:28:58 Config: don't compute while link.exe__ is running
8 03.08.2010 23:28:58 Config: GUI RPC allowed from:


Today 04-Aug-2010 at 12:15:07 and 12:17:06 I've had two short (20 seconds and 1 minute) unintentional client restarts (I blame the new Manager!). BT kept running (since 29.07.2010). And now I've again noticed a gap in the middle of messages (edited the spacing):


706 SETI@home 04.08.2010 4:47:09 [task] result 26my10aa.19507.4740.13.10.235_0 checkpointed
707 SETI@home Beta Test 04.08.2010 4:48:37 [task] result 18no09aj.3284.25021.3.13.81_0 checkpointed
708 04.08.2010 4:49:24 Can't resolve hostname in remote_hosts.cfg: pippi
---gap--- missing messages ---
709 PrimeGrid 04.08.2010 19:57:18 [task] result psp_llr_59148376_1 checkpointed
710 Collatz Conjecture 04.08.2010 19:58:58 [task] result collatz_1280384749_206427_1 checkpointed
711 WUProp@Home 04.08.2010 19:59:51 [task] result wu_1280862948_5661_0 checkpointed
712 Quake-Catcher Network 04.08.2010 20:00:22 [task] result qcnk_sc300_sta000_042472_0 checkpointed


The message counter is now at around 850.

Note: I see that it happened shortly after writing my report:
Quote from: Pepo on August 04, 2010, 10:12:07 AM
Now I'm back again in a "stable" situation, with a complete History, but my computers' names are displayed (luckily just in the Computers tab) as
  SMachine1«Machine1
  SMachine2«Machine2
During it I've displayed the Messages tab a LOT of times. But possibly not anymore afterwards. (If it matters.)
Peter

fred

Quote from: Pepo on August 04, 2010, 06:49:19 PM


706 SETI@home 04.08.2010 4:47:09 [task] result 26my10aa.19507.4740.13.10.235_0 checkpointed
707 SETI@home Beta Test 04.08.2010 4:48:37 [task] result 18no09aj.3284.25021.3.13.81_0 checkpointed
708 04.08.2010 4:49:24 Can't resolve hostname in remote_hosts.cfg: pippi
---gap--- missing messages ---
709 PrimeGrid 04.08.2010 19:57:18 [task] result psp_llr_59148376_1 checkpointed
710 Collatz Conjecture 04.08.2010 19:58:58 [task] result collatz_1280384749_206427_1 checkpointed
711 WUProp@Home 04.08.2010 19:59:51 [task] result wu_1280862948_5661_0 checkpointed
712 Quake-Catcher Network 04.08.2010 20:00:22 [task] result qcnk_sc300_sta000_042472_0 checkpointed


The message counter is now at around 850.

As there is no gap in the numbering, it's a BOINC thing.
When the client restarts it should start numbering at 1 again.
And the time is sequential as well.

Pepo

Quote from: fred on August 05, 2010, 07:57:44 AM
Quote from: Pepo on August 04, 2010, 06:49:19 PM

707 SETI@home Beta Test 04.08.2010 4:48:37 [task] result 18no09aj.3284.25021.3.13.81_0 checkpointed
708 04.08.2010 4:49:24 Can't resolve hostname in remote_hosts.cfg: pippi
---gap--- missing messages ---
709 PrimeGrid 04.08.2010 19:57:18 [task] result psp_llr_59148376_1 checkpointed
710 Collatz Conjecture 04.08.2010 19:58:58 [task] result collatz_1280384749_206427_1 checkpointed

As there is no gap in the numbering, it's a BOINC thing.
When the client restarts it should start numbering at 1 again.
And the time is sequential as well.
Off course there is no gap in numbering. And the client always starts numbering at 1 - while testing the previous case, I've also checked that the numbers requested/sent over GUI RPC do match, with both boinccmd and BoincTasks and BOINC Manager and mesage lines in log file.
But even if there is no gap in numbering and the times are sequential, it does no mean that the displayed fragments do seamlessly match ;)
Peter

fred

Quote from: Pepo on August 05, 2010, 09:29:24 AM
Off course there is no gap in numbering. And the client always starts numbering at 1 - while testing the previous case, I've also checked that the numbers requested/sent over GUI RPC do match, with both boinccmd and BoincTasks and BOINC Manager and mesage lines in log file.
But even if there is no gap in numbering and the times are sequential, it does no mean that the displayed fragments do seamlessly match ;)
If there is no gap in the numbering, I can't do anything about it. The problem is in the BOINC client.

wicked

Quote from: fred on August 05, 2010, 10:12:36 AM
If there is no gap in the numbering, I can't do anything about it. The problem is in the BOINC client.

Do you always transfers all messages from the start or does BoincTasks have some kind of cache for the messages? If there is, maybe there's a missing "clear cache" call somewhere?

fred

Quote from: wicked on August 06, 2010, 04:12:29 PM
Quote from: fred on August 05, 2010, 10:12:36 AM
If there is no gap in the numbering, I can't do anything about it. The problem is in the BOINC client.

Do you always transfers all messages from the start or does BoincTasks have some kind of cache for the messages? If there is, maybe there's a missing "clear cache" call somewhere?
Messages are read from the last know number and up. So only the updated messages are read.
They are stored in stdoutdae.txt, but only the BOINC client reads it.

Pepo

Quote from: fred on August 06, 2010, 05:21:16 PM
Quote from: wicked on August 06, 2010, 04:12:29 PM
Quote from: fred on August 05, 2010, 10:12:36 AM
If there is no gap in the numbering, I can't do anything about it. The problem is in the BOINC client.
Do you always transfers all messages from the start or does BoincTasks have some kind of cache for the messages? If there is, maybe there's a missing "clear cache" call somewhere?
Messages are read from the last know number and up. So only the updated messages are read.
wicked, there are 2 modes how to get messages from client: a) request all available messages (the client keeps up to last 2000 cached in memory in a FIFO buffer), they are numbered from 1 since the launch and the available ones have numbers n ... n+1999, b) if you know the ID if your most recently read message (in our case it was n+1999) then you can request the following ones i.e. ID=n+2000 and you will get this and the newer ones (sure if they already exist).
Thus the usual way (for apps like B.Manager, BoincTasks, BoincView, etc.) is to initially get all available messages and then repeatedly (each couple of seconds) ask just for the newer ones.
Quote from: fred on August 06, 2010, 05:21:16 PM
They are stored in stdoutdae.txt, but only the BOINC client reads it.
I doubt this... I think the client immediately logs them into the file and never ever reads them back again.
Peter

wicked

Okay, but how does BoincTasks know when to discard the Messages list and start over? I mean when something like this happens:

  • Computer B has messages 1-10, which BoincTasks running on Computer A is showing on it's Messages tab.
  • Computer A gets hang and doesn't retrieve messages for few minutes.
  • Meanwhile Computer B gets restarted and starts storing messages from 1. It now has a new buffer of messages 1-15.
  • Computer A recovers and BoincTasks asks for messages starting from 11 and gets messages 11-15 from the new messages buffer. It happily appends them to the old buffer of 1-10 and thus creating the original problem.

So basically I'm wondering that there's no way for BoincTasks to detect a client restart and refresh it's message buffer in a situation like this?

What happens if BoincTasks asks for messages from 11 and instead gets a response that there are only messages 1-5 in the buffer? Does it clear it's old messages in this case?

fred

Quote from: wicked on August 08, 2010, 09:17:25 AM
Okay, but how does BoincTasks know when to discard the Messages list and start over? I mean when something like this happens:

  • Computer B has messages 1-10, which BoincTasks running on Computer A is showing on it's Messages tab.
  • Computer A gets hang and doesn't retrieve messages for few minutes.
  • Meanwhile Computer B gets restarted and starts storing messages from 1. It now has a new buffer of messages 1-15.
  • Computer A recovers and BoincTasks asks for messages starting from 11 and gets messages 11-15 from the new messages buffer. It happily appends them to the old buffer of 1-10 and thus creating the original problem.

So basically I'm wondering that there's no way for BoincTasks to detect a client restart and refresh it's message buffer in a situation like this?

What happens if BoincTasks asks for messages from 11 and instead gets a response that there are only messages 1-5 in the buffer? Does it clear it's old messages in this case?
It has all these checks and more.
A disconnect clears the buffer. But a short, few second reboot may go unnoticed. But than again it's highly unlikely that the buffer is in the 600 range after a few seconds.
I've tried very hard to reproduce any kind of problem. ;D
But 100% is hard to get, I don't want to slow down the messages to a crawl, to catch every unlikely problem.