Hi,
I'm running a lot of LHC-Tasks. Sometimes a task gets "postponed: xxxxxxxx".
BOINCTasks only shows "waiting to run", so I sit there with no idea why the task doesn't start. When I log on to the client, I see "postponed ...".
It would be very helpful if BOINCTasks could show this status as well.
Thanks in advance.
Quote from: Yeti on January 07, 2019, 07:34:34 PM
I'm running a lot of LHC-Tasks. Sometimes a task gets "postponed: xxxxxxxx".
Added to the wish list.
I tried to find it in the BOINC source code but couldn't find anything.
Next time, right-click the task and select Properties.
Maybe the State line tells us more.
It's a very rare thing, and I think it is related to overloading a computer. I get weird stuff happening with LHC tasks, as they use a lot of RAM, disk space, and internet bandwidth, and they run in VirtualBox, which is taxing on the CPU. I think the "postponed" error only occurs when the client can't allocate a slot directory for a task because it's already occupied. All I could find is this:
https://boinc.mundayweb.com/wiki/index.php?title=What_do_Suspended,_Waiting_and_Postponed_mean%3F
"Postponed: waiting to acquire lock
This message means that a previous task is still occupying the lock file in the slot directory. BOINC cannot continue with this task until that slot directory has been vacated. Try to reboot, that usually stirs things loose. Used in BOINC 7."
Maybe you can find something in the source code to do with slot allocation and lock files?
Quote from: hucker on November 27, 2020, 12:46:06 PM
It's a very rare thing, and I think it is related to overloading a computer. I get weird stuff happening with LHC tasks, as they use a lot of RAM, disk space, and internet bandwidth, and they run in VirtualBox, which is taxing on the CPU. I think the "postponed" error only occurs when the client can't allocate a slot directory for a task because it's already occupied. All I could find is this:
A lot of things can go wrong in VirtualBox. It's probably easier to run a Linux program that way.
This is something you have to ask the project about. They are probably doing something wrong, or maybe something crashes.
Quote from: fred on November 27, 2020, 06:27:14 PM
Quote from: hucker on November 27, 2020, 12:46:06 PM
It's a very rare thing, and I think it is related to overloading a computer. I get weird stuff happening with LHC tasks, as they use a lot of RAM, disk space, and internet bandwidth, and they run in VirtualBox, which is taxing on the CPU. I think the "postponed" error only occurs when the client can't allocate a slot directory for a task because it's already occupied. All I could find is this:
A lot of things can go wrong in VirtualBox. It's probably easier to run a Linux program that way.
This is something you have to ask the project about. They are probably doing something wrong, or maybe something crashes.
LHC uses VirtualBox so that everyone runs precisely the same OS version and so on; otherwise the work units would give differing results that can't be verified easily. I only get the weird stuff happening if I ask too much of the computer. One time it was because an old hard disk had a couple of correctable errors on it, so it wasn't responding fast enough and the LHC tasks got fed up waiting. Another time it was simply not enough RAM: the tasks can need 2 GB each, and with 24 cores and only 36 GB, it ran out.
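The RAM arithmetic in the post above can be sketched as a quick check. This is only a rough estimate under stated assumptions: one task per core by default, and no memory reserved for the OS or VirtualBox itself. The function name and parameters are made up for illustration.

```python
def max_concurrent_tasks(cores, ram_gb, ram_per_task_gb, cores_per_task=1):
    """Rough upper bound on tasks that fit, limited by CPU cores and by RAM.

    Assumption: ignores memory used by the OS and by VirtualBox itself,
    so the real limit is lower.
    """
    by_cpu = cores // cores_per_task
    by_ram = int(ram_gb // ram_per_task_gb)
    return min(by_cpu, by_ram)


# Figures from the post: 24 cores, 36 GB RAM, up to 2 GB per task.
limit = max_concurrent_tasks(cores=24, ram_gb=36, ram_per_task_gb=2)
print(f"Tasks that fit in RAM: {limit} of 24 cores")
```

With those figures, RAM rather than the core count is the limiting factor, which matches the "it ran out" observation.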
Okay, I found some more details.
From client_state.xml:
<active_task>
<project_master_url>https://lhcathome.cern.ch/lhcathome/</project_master_url>
<result_name>xNvKDmk6p7xn9Rq4apoT9bVoABFKDmABFKDmxh5NDmABFKDmMCdTan_1</result_name>
<active_task_state>0</active_task_state>
<app_version_num>200</app_version_num>
<slot>2</slot>
<checkpoint_cpu_time>0.000000</checkpoint_cpu_time>
<checkpoint_elapsed_time>0.000000</checkpoint_elapsed_time>
<checkpoint_fraction_done>0.000000</checkpoint_fraction_done>
<checkpoint_fraction_done_elapsed_time>0.000000</checkpoint_fraction_done_elapsed_time>
<current_cpu_time>0.000000</current_cpu_time>
<once_ran_edf>0</once_ran_edf>
<swap_size>6565888.000000</swap_size>
<working_set_size>11542528.000000</working_set_size>
<working_set_size_smoothed>7864320000.000000</working_set_size_smoothed>
<page_fault_rate>0.000000</page_fault_rate>
<bytes_sent>0.000000</bytes_sent>
<bytes_received>0.000000</bytes_received>
</active_task>
From the task properties in BOINCTasks:
Computer: Manni
Project: LHC@home
Name: xNvKDmk6p7xn9Rq4apoT9bVoABFKDmABFKDmxh5NDmABFKDmMCdTan_1
Application: ATLAS Simulation 2.00 (vbox64_mt_mcore_atlas)
Workunit name: xNvKDmk6p7xn9Rq4apoT9bVoABFKDmABFKDmxh5NDmABFKDmMCdTan
State: Waiting to run
Received: 14-12-2020 01:00
Report deadline: 21-12-2020 01:00
Estimated app speed: 4,73 GFLOPs/sec
Estimated task size: 43.200 GFLOPs
Resources: 4 CPUs
CPU time at last checkpoint: 00:00:00
CPU time: 00:00:00
Elapsed time: 00:00:36
Estimated time remaining: 02:31:52
Fraction done: 0,000%
Virtual memory size: 6,26 MB
Working set size: 7.500,00 MB
Directory: slots/2
Process ID: 18860
Debug State: 2 - Scheduler: 1
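For anyone who wants to pull the same fields out without reading the file by hand, here is a minimal Python sketch. It assumes the file parses as ordinary XML, which may not hold for every client version, and it only uses tag names that appear in the fragment above. The function name and the trimmed sample are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the post above, trimmed to the fields used here.
ACTIVE_TASK_SAMPLE = """
<active_task>
    <project_master_url>https://lhcathome.cern.ch/lhcathome/</project_master_url>
    <result_name>xNvKDmk6p7xn9Rq4apoT9bVoABFKDmABFKDmxh5NDmABFKDmMCdTan_1</result_name>
    <active_task_state>0</active_task_state>
    <slot>2</slot>
    <working_set_size_smoothed>7864320000.000000</working_set_size_smoothed>
</active_task>
"""

MIB = 1024 * 1024


def summarize_active_tasks(xml_text):
    """Return one summary dict per <active_task> element found in xml_text."""
    root = ET.fromstring(xml_text.strip())
    summaries = []
    # iter() also yields the root element, so a bare fragment works as well
    # as a whole file.
    for task in root.iter("active_task"):
        summaries.append({
            "result_name": task.findtext("result_name", default=""),
            "active_task_state": int(task.findtext("active_task_state", default="-1")),
            "slot": int(task.findtext("slot", default="-1")),
            # Bytes converted to MiB for comparison with the properties dialog.
            "working_set_mib": float(
                task.findtext("working_set_size_smoothed", default="0")
            ) / MIB,
        })
    return summaries


for summary in summarize_active_tasks(ACTIVE_TASK_SAMPLE):
    print(summary)
```

The smoothed working set of 7864320000 bytes comes out as 7500 MiB, which lines up with the "Working set size 7.500,00 MB" line in the properties. Nothing in this fragment carries the "postponed" text.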
Added to the to-do list.
Resolved in V 1.84.
Sorry, but it doesn't seem to work.
BOINCTasks shows this WU as "Waiting to run":
Application: ATLAS Simulation 2.00 (vbox64_mt_mcore_atlas)
Name: cTKMDmz3kIynsSi4apGgGQJmABFKDmABFKDm7slRDmABFKDm1WN9No
State: Postponed: VM environment needed to be cleaned up.
Received: 11/01/2021 08:56:55
Report deadline: 18/01/2021 08:56:54
Resources: 4 CPUs
Estimated computation size: 43,200 GFLOPs
CPU time: ---
CPU time since checkpoint: ---
Elapsed time: 00:00:43
Estimated time remaining: 02:58:17
Fraction done: 0.000%
Virtual memory size: 0 bytes
Working set size: 7.32 GB
Directory: slots/0
Process ID: 13552
Executable: vboxwrapper_26198ab7_windows_x86_64.exe
----------------------------------------
This is from client_state.xml:
<active_task>
<project_master_url>https://lhcathome.cern.ch/lhcathome/</project_master_url>
<result_name>cTKMDmz3kIynsSi4apGgGQJmABFKDmABFKDm7slRDmABFKDm1WN9No_1</result_name>
<active_task_state>0</active_task_state>
<app_version_num>200</app_version_num>
<slot>0</slot>
<checkpoint_cpu_time>0.000000</checkpoint_cpu_time>
<checkpoint_elapsed_time>0.000000</checkpoint_elapsed_time>
<checkpoint_fraction_done>0.000000</checkpoint_fraction_done>
<checkpoint_fraction_done_elapsed_time>0.000000</checkpoint_fraction_done_elapsed_time>
<current_cpu_time>0.000000</current_cpu_time>
<once_ran_edf>0</once_ran_edf>
<swap_size>0.000000</swap_size>
<working_set_size>0.000000</working_set_size>
<working_set_size_smoothed>7864320000.000000</working_set_size_smoothed>
<page_fault_rate>0.000000</page_fault_rate>
<bytes_sent>0.000000</bytes_sent>
<bytes_received>0.000000</bytes_received>
</active_task>
Quote from: Yeti on January 12, 2021, 03:52:30 PM
Sorry, but it doesn't seem to work.
That isn't enough information to see what's going on.
The <active_task> block is only part of the information, and this WU isn't active.
The <result> block has more, and maybe <workunit> does too.
But maybe this "Virtual memory size 0 bytes" triggers the message.
The message "VM environment needed to be cleaned up" should be somewhere in the source code. I will try to find it.
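To collect the extra blocks mentioned above in one go, a sketch along these lines might help. It is not taken from BOINCTasks or the BOINC source. It assumes client_state.xml parses as ordinary XML and that each <result> carries its task name in a <name> child; that tag, the <wu_name> tag, and the sample layout are assumptions for illustration, not copied from the thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed client_state.xml. Only <active_task>,
# <result_name> and <slot> are taken from the posts above; the <result>
# layout (<name>, <wu_name>) is an assumption.
STATE_SAMPLE = """
<client_state>
    <result>
        <name>cTKMDmz3kIynsSi4apGgGQJmABFKDmABFKDm7slRDmABFKDm1WN9No_1</name>
        <wu_name>cTKMDmz3kIynsSi4apGgGQJmABFKDmABFKDm7slRDmABFKDm1WN9No</wu_name>
    </result>
    <active_task>
        <result_name>cTKMDmz3kIynsSi4apGgGQJmABFKDmABFKDm7slRDmABFKDm1WN9No_1</result_name>
        <slot>0</slot>
    </active_task>
</client_state>
"""


def children_as_dict(element):
    """Flatten an element's direct children into {tag: text}."""
    return {child.tag: (child.text or "").strip() for child in element}


def collect_task_info(xml_text, result_name):
    """Gather the <result> and <active_task> blocks that belong to one task."""
    root = ET.fromstring(xml_text.strip())
    info = {"result": None, "active_task": None}
    for result in root.iter("result"):
        if result.findtext("name", default="").strip() == result_name:
            info["result"] = children_as_dict(result)
    for task in root.iter("active_task"):
        if task.findtext("result_name", default="").strip() == result_name:
            info["active_task"] = children_as_dict(task)
    return info


name = "cTKMDmz3kIynsSi4apGgGQJmABFKDmABFKDm7slRDmABFKDm1WN9No_1"
print(collect_task_info(STATE_SAMPLE, name))
```

Posting the whole <result> block for the postponed task, next to the <active_task> block, would show whether the "postponed" reason is stored there at all.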