Please add Status Postponed

Started by Yeti, January 07, 2019, 07:34:34 PM

Yeti

Hi,

I'm running a lot of LHC-Tasks. Sometimes a task gets "postponed: xxxxxxxx".

BOINCTasks only shows "waiting to run", so I sit there with no idea why the task doesn't start. When I log on to the client itself, I see "postponed .....".

It would be very helpful if BOINCTasks could show this status as well.

Thanks in advance

fred

Quote from: Yeti on January 07, 2019, 07:34:34 PM
I'm running a lot of LHC-Tasks. Sometimes a task gets "postponed: xxxxxxxx".
Added to the wish list.

fred

I tried to find it in the BOINC source code but couldn't find anything.
Next time, right-click the task and select Properties.
Maybe the State line tells us more.

hucker

It's a very rare thing, and I think it's related to overloading a computer. I get weird stuff happening with LHC tasks because they use a lot of RAM, disk space, and internet bandwidth, and they run in VirtualBox, which is taxing on the CPU. I think the "postponed" status only occurs when the client can't allocate a slot directory for a task because it's already occupied. All I could find is this:

https://boinc.mundayweb.com/wiki/index.php?title=What_do_Suspended,_Waiting_and_Postponed_mean%3F

"Postponed: waiting to acquire lock
This message means that a previous task is still occupying the lock file in the slot directory. BOINC cannot continue with this task until that slot directory has been vacated. Try to reboot, that usually stirs things loose. Used in BOINC 7."

Maybe you can find something in the source code to do with slot allocation and lock files?

fred

Quote from: hucker on November 27, 2020, 12:46:06 PM
It's a very rare thing, and I think it's related to overloading a computer. I get weird stuff happening with LHC tasks because they use a lot of RAM, disk space, and internet bandwidth, and they run in VirtualBox, which is taxing on the CPU. I think the "postponed" status only occurs when the client can't allocate a slot directory for a task because it's already occupied. All I could find is this:
A lot of things can go wrong in VirtualBox. It's probably easier for them to run a Linux program that way.
This is something you'll have to ask the project about. They are probably doing something wrong; maybe something crashes.

hucker

Quote from: fred on November 27, 2020, 06:27:14 PM
Quote from: hucker on November 27, 2020, 12:46:06 PM
It's a very rare thing, and I think it's related to overloading a computer. I get weird stuff happening with LHC tasks because they use a lot of RAM, disk space, and internet bandwidth, and they run in VirtualBox, which is taxing on the CPU. I think the "postponed" status only occurs when the client can't allocate a slot directory for a task because it's already occupied. All I could find is this:
A lot of things can go wrong in VirtualBox. It's probably easier for them to run a Linux program that way.
This is something you'll have to ask the project about. They are probably doing something wrong; maybe something crashes.
LHC uses VirtualBox so that everyone runs precisely the same OS version and so on; otherwise they would get differing results from work units and couldn't verify them easily. I only see the weird stuff when I ask too much of the computer. One time it was because an old hard disk had a couple of correctable errors, so it wasn't responding fast enough and the LHC tasks gave up waiting. Another time there was simply not enough RAM: the tasks can need 2 GB each, and with 24 cores and only 36 GB, it ran out.
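
The RAM figures in that last case can be checked with a quick back-of-the-envelope calculation. This is only a sketch: it assumes one task per core and that every task reaches the 2 GB peak at the same time, neither of which is stated in the post. The function name is made up for illustration.

```python
def ram_shortfall_gb(tasks, gb_per_task, installed_gb):
    """Return how many GB are missing if all tasks peak at once (0 if they fit)."""
    needed = tasks * gb_per_task
    return max(0, needed - installed_gb)


# Assumption: 24 cores means 24 concurrent tasks at up to 2 GB each.
print(ram_shortfall_gb(24, 2, 36))
```

Under those assumptions the machine would need 48 GB, which is more than the 36 GB installed.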

Yeti

Okay, I found some more details.

From client_state.xml:
<active_task>
    <project_master_url>https://lhcathome.cern.ch/lhcathome/</project_master_url>
    <result_name>xNvKDmk6p7xn9Rq4apoT9bVoABFKDmABFKDmxh5NDmABFKDmMCdTan_1</result_name>
    <active_task_state>0</active_task_state>
    <app_version_num>200</app_version_num>
    <slot>2</slot>
    <checkpoint_cpu_time>0.000000</checkpoint_cpu_time>
    <checkpoint_elapsed_time>0.000000</checkpoint_elapsed_time>
    <checkpoint_fraction_done>0.000000</checkpoint_fraction_done>
    <checkpoint_fraction_done_elapsed_time>0.000000</checkpoint_fraction_done_elapsed_time>
    <current_cpu_time>0.000000</current_cpu_time>
    <once_ran_edf>0</once_ran_edf>
    <swap_size>6565888.000000</swap_size>
    <working_set_size>11542528.000000</working_set_size>
    <working_set_size_smoothed>7864320000.000000</working_set_size_smoothed>
    <page_fault_rate>0.000000</page_fault_rate>
    <bytes_sent>0.000000</bytes_sent>
    <bytes_received>0.000000</bytes_received>
</active_task>

From the task properties in BOINCTasks:

Computer:   Manni
Project   LHC@home
   
Name   xNvKDmk6p7xn9Rq4apoT9bVoABFKDmABFKDmxh5NDmABFKDmMCdTan_1
   
Application   ATLAS Simulation 2.00 (vbox64_mt_mcore_atlas)
Workunit name   xNvKDmk6p7xn9Rq4apoT9bVoABFKDmABFKDmxh5NDmABFKDmMCdTan
State   Waiting to run
Received   14-12-2020 01:00
Report deadline   21-12-2020 01:00
Estimated app speed   4,73 GFLOPs/sec
Estimated task size   43.200 GFLOPs
Resources   4 CPUs
CPU time at last checkpoint   00:00:00
CPU time   00:00:00
Elapsed time   00:00:36
Estimated time remaining   02:31:52
Fraction done   0,000%
Virtual memory size   6,26 MB
Working set size   7.500,00 MB
Directory   slots/2
Process ID   18860
   
Debug   State: 2 - Scheduler: 1
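
For anyone who wants to compare the two views, here is a minimal Python sketch that reads a trimmed copy of the `<active_task>` element above and converts the byte counts to MiB. The function name and the choice of fields are illustrative only; that BOINCTasks derives its display from exactly these fields is an assumption based on the numbers lining up.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <active_task> element quoted above.
SNIPPET = """<active_task>
    <result_name>xNvKDmk6p7xn9Rq4apoT9bVoABFKDmABFKDmxh5NDmABFKDmMCdTan_1</result_name>
    <active_task_state>0</active_task_state>
    <slot>2</slot>
    <swap_size>6565888.000000</swap_size>
    <working_set_size_smoothed>7864320000.000000</working_set_size_smoothed>
</active_task>"""

MIB = 1024 * 1024


def summarize_active_task(xml_text):
    """Pull out the fields of one <active_task> element that matter here."""
    task = ET.fromstring(xml_text)
    return {
        "result_name": task.findtext("result_name"),
        "active_task_state": int(task.findtext("active_task_state")),
        "slot": int(task.findtext("slot")),
        "swap_mib": float(task.findtext("swap_size")) / MIB,
        "working_set_mib": float(task.findtext("working_set_size_smoothed")) / MIB,
    }


summary = summarize_active_task(SNIPPET)
print(summary)
```

The smoothed working set comes out at 7500 MiB and the swap size at about 6.26 MiB, which appears to match the "Working set size" and "Virtual memory size" lines in the properties. Nothing in this element mentions a postponed state or a reason for it.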
   

fred


fred


Yeti

Sorry, but it seems not to work.

BOINCTasks shows this WU as "Waiting to run":

Application   ATLAS Simulation 2.00 (vbox64_mt_mcore_atlas)
Name   cTKMDmz3kIynsSi4apGgGQJmABFKDmABFKDm7slRDmABFKDm1WN9No
State   Postponed: VM environment needed to be cleaned up.
Received   11/01/2021 08:56:55
Report deadline   18/01/2021 08:56:54
Resources   4 CPUs
Estimated computation size   43,200 GFLOPs
CPU time   ---
CPU time since checkpoint   ---
Elapsed time   00:00:43
Estimated time remaining   02:58:17
Fraction done   0.000%
Virtual memory size   0 bytes
Working set size   7.32 GB
Directory   slots/0
Process ID   13552
Executable   vboxwrapper_26198ab7_windows_x86_64.exe

----------------------------------------

This is from client_state.xml:

<active_task>
    <project_master_url>https://lhcathome.cern.ch/lhcathome/</project_master_url>
    <result_name>cTKMDmz3kIynsSi4apGgGQJmABFKDmABFKDm7slRDmABFKDm1WN9No_1</result_name>
    <active_task_state>0</active_task_state>
    <app_version_num>200</app_version_num>
    <slot>0</slot>
    <checkpoint_cpu_time>0.000000</checkpoint_cpu_time>
    <checkpoint_elapsed_time>0.000000</checkpoint_elapsed_time>
    <checkpoint_fraction_done>0.000000</checkpoint_fraction_done>
    <checkpoint_fraction_done_elapsed_time>0.000000</checkpoint_fraction_done_elapsed_time>
    <current_cpu_time>0.000000</current_cpu_time>
    <once_ran_edf>0</once_ran_edf>
    <swap_size>0.000000</swap_size>
    <working_set_size>0.000000</working_set_size>
    <working_set_size_smoothed>7864320000.000000</working_set_size_smoothed>
    <page_fault_rate>0.000000</page_fault_rate>
    <bytes_sent>0.000000</bytes_sent>
    <bytes_received>0.000000</bytes_received>
</active_task>

fred

Quote from: Yeti on January 12, 2021, 03:52:30 PM
Sorry, but it seems not to work.
This info isn't enough to see what's going on.
The active-task state is only part of the information, and this WU isn't active.
The <result> element has more, and maybe <workunit> does too.

But maybe this "Virtual memory size 0 bytes" triggers the message.

The message "VM environment needed to be cleaned up" should be somewhere in the source code. I will try to find it....
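
To make the request concrete, here is a rough Python sketch of how a monitoring tool could let a postponed reason take priority over the generic "Waiting to run" label. It is not taken from the BOINC or BOINCTasks source. The `TaskInfo` class, its field names, and the numeric scheduler codes are all hypothetical; the only grounding is that "Scheduler: 1" appeared together with "Waiting to run" in the debug line earlier in this thread.

```python
from dataclasses import dataclass


@dataclass
class TaskInfo:
    """Hypothetical subset of what a client might report for one task."""
    scheduler_state: int             # assumed codes: 1 = preempted, 2 = scheduled
    scheduler_wait: bool = False     # assumed flag: the task asked to be retried later
    scheduler_wait_reason: str = ""  # assumed free-text reason supplied by the task


def display_status(task):
    """Pick the status text, letting a postponed task override the generic wait."""
    if task.scheduler_wait:
        if task.scheduler_wait_reason:
            return "Postponed: " + task.scheduler_wait_reason
        return "Postponed"
    if task.scheduler_state == 2:
        return "Running"
    return "Waiting to run"


postponed = TaskInfo(1, True, "VM environment needed to be cleaned up.")
print(display_status(postponed))
print(display_status(TaskInfo(1)))
```

The point of the ordering is that the postponed check comes first, so a task that is both preempted and postponed is shown with its reason instead of the less informative waiting state.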