Checkpoints for multithread projects

Started by jjwhalen, September 14, 2010, 12:39:30 PM

jjwhalen

Re: wishlist item
Quote: BUG: Checkpoints for multi thread projects are messed up. But ... there is no multi thread project that has work...

AQUA's IQUANA [IQUANA (mt1) v1.14] has work and to spare.  I just completed 2 WUs and downloaded 2 more.  And the checkpoint column is as funky as ever ;D

Reminder -- IQUANA still has a problem of randomly locking up a CPU core when in state Waiting, especially on a quad.  So it's a good idea to Suspend Project when it isn't actually running ;)


Pepo

Quote from: jjwhalen on September 14, 2010, 12:39:30 PM
AQUA's IQUANA [IQUANA (mt1) v1.14] has work and to spare.  I just completed 2 WUs and downloaded 2 more.  And the checkpoint column is as funky as ever ;D

Reminder -- IQUANA still has a problem of randomly locking up a CPU core when in state Waiting, especially on a quad.  So it's a good idea to Suspend Project when it isn't actually running ;)
[rant]A couple of weeks ago (on 12 Aug 2010 2:22:12 UTC) I received a 3-core task. Since then, my AQUA STD has gone from some -86 400 (or whatever it was) to +100 000 (or whatever it really is now - OK, +31 565), but the task has not started yet. My AQUA resource share is 2.80%. What is the client still waiting for? Enough free cores? The machine has restarted a couple of times since. And I wanted to check exactly the checkpointing.

One more week till the deadline, so I can expect the task to run in 5-6 days.[/rant]
Peter

jjwhalen

#2
Quote from: Pepo on September 14, 2010, 01:35:08 PM
[rant]A couple of weeks ago (on 12 Aug 2010 2:22:12 UTC) I received a 3-core task. Since then, my AQUA STD has gone from some -86 400 (or whatever it was) to +100 000 (or whatever it really is now - OK, +31 565), but the task has not started yet. My AQUA resource share is 2.80%. What is the client still waiting for? Enough free cores? The machine has restarted a couple of times since. And I wanted to check exactly the checkpointing.

One more week till the deadline, so I can expect the task to run in 5-6 days.[/rant]


Strange.

By coincidence I'm running a 2 core IQUANA right now, on my slow machine.  I have my AQUA Resource Share at 11.43%.  I'm not sure how the Scheduler actually applies the RS to a multicore task.  I guarantee to you that they aren't running 11.43% of the time, but they do get done ;)  It might well be (RS/# of cores), so for you 2.8/3=0.93%.  They do seem to follow the "Switch between applications every..." (in my case 150 minutes) in linear wallclock time.

It may be in your case that the low RS is just making the task wait until the deadline approaches, when it will go into panic mode.  You could try forcing it to run by momentarily suspending other projects on that host.  AQUA won't give more than 2 WUs at a time, so you won't get a year's worth of work by mistake.
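
The guess above can be put in numbers. This is a minimal Python sketch, assuming (as the post itself only speculates) that a multithreaded task's effective share is the project's resource share divided by its thread count. `effective_share` is a hypothetical helper for illustration, not anything in the BOINC client.

```python
def effective_share(resource_share_pct: float, n_threads: int) -> float:
    """Hypothetical model: the project's resource share (in percent)
    spread evenly over the threads of one multithreaded task."""
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    return resource_share_pct / n_threads


# Pepo's case: a 3-core task on a project with a 2.80% resource share.
print(round(effective_share(2.80, 3), 2))  # 0.93
```

Whether the scheduler really divides the share this way is not established anywhere in this thread; the sketch only reproduces the 2.8/3 arithmetic.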


Pepo

#3
Quote from: jjwhalen on September 14, 2010, 10:05:26 PM
Quote from: Pepo on September 14, 2010, 01:35:08 PM
[rant]A couple of weeks ago (on 12 Aug 2010 2:22:12 UTC) I received a 3-core task. Since then, my AQUA STD has gone from some -86 400 (or whatever it was) to +100 000 (or whatever it really is now - OK, +31 565), but the task has not started yet. My AQUA resource share is 2.80%. What is the client still waiting for? Enough free cores? The machine has restarted a couple of times since. And I wanted to check exactly the checkpointing.

One more week till the deadline, so I can expect the task to run in 5-6 days.[/rant]


Strange.

By coincidence I'm running a 2 core IQUANA right now, on my slow machine.  I have my AQUA Resource Share at 11.43%.  I'm not sure how the Scheduler actually applies the RS to a multicore task.  I guarantee to you that they aren't running 11.43% of the time, but they do get done ;)  It might well be (RS/# of cores), so for you 2.8/3=0.93%.  They do seem to follow the "Switch between applications every..." (in my case 150 minutes) in linear wallclock time.

It may be in your case that the low RS is just making the task wait until the deadline approaches, when it will go into panic mode.  You could try forcing it to run by momentarily suspending other projects on that host.  AQUA won't give more than 2 WUs at a time, so you won't get a year's worth of work by mistake.
I'm aware of everything you wrote. The RS does indeed seem to work out to some 0.93% for a single-threaded task. But in the end the task started a bit sooner than I expected - already some 5 1/2 days before its deadline (of course in High Priority mode ;D). I wanted to let the task start on its own...

And now the Checkpoint column - during some brief observations, with both short- and long-term CPU% averages (both around 68%) and different refresh times, it seemed to move in sync with the Elapsed column's (CPU time) value. I guess your funky observations may be caused by two things: a) the "Time Left" value often stays unchanged or jumps back, and b) the "Checkpoint" value possibly does not display the true "used CPU time since checkpoint", but is calculated out of .....?

[rant]And another note: a 1CPU+1NV Einstein task was running together with a 3-CPU AQUA task and two more 0.01CPU tasks, while the client's limit is 85% of 4 cores = 3.4 cores - how much overcommitment is the client allowed ??? (OK, maybe it is indeed within limits.)[/rant]
Peter
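
The overcommitment question in the rant above comes down to simple arithmetic. This is a rough Python sketch that just adds up the nominal CPU reservations Pepo lists and compares them with the 85%-of-4-cores budget. The function names are hypothetical, and how the real client counts a GPU task's CPU share or the 0.01CPU tasks against the limit is not established in this thread.

```python
from typing import Iterable


def cpu_budget(n_cores: int, limit_pct: float) -> float:
    """Cores the client may use under a 'use at most N% of processors' limit."""
    return n_cores * limit_pct / 100.0


def committed_cpus(reservations: Iterable[float]) -> float:
    """Sum of the nominal CPU reservations of all running tasks."""
    return sum(reservations)


budget = cpu_budget(4, 85)
# Einstein (1 CPU + 1 NV GPU), AQUA (3 CPUs), two 0.01CPU tasks
total = committed_cpus([1.0, 3.0, 0.01, 0.01])
print(f"budget {budget:.2f} cores, nominally committed {total:.2f} cores")
# budget 3.40 cores, nominally committed 4.02 cores
```

Taken at face value, the listed tasks reserve about 0.62 cores more than the budget; whether the client treats that as within its limits is exactly the open question.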

Pepo

Quote from: jjwhalen on September 14, 2010, 12:39:30 PM
Reminder -- IQUANA still has a problem of randomly locking up a CPU core when in state Waiting, especially on a quad.  So it's a good idea to Suspend Project when it isn't actually running ;)
In my case, after a 3-thread task is Waiting, just 1 core is left available for other tasks. But according to the list of tasks, really just one of them is getting "a green" to run. Thus I think the problem is in the client.

Apparently it is sufficient to briefly suspend (and then resume) the waiting Aqua task - the remaining (nCPU-1) tasks start immediately. I do again blame the client. Let's report it on the Alpha channel.
Peter

jjwhalen

#5
Quote from: Pepo on September 17, 2010, 10:33:33 PM
Quote from: jjwhalen on September 14, 2010, 12:39:30 PM
Reminder -- IQUANA still has a problem of randomly locking up a CPU core when in state Waiting, especially on a quad.  So it's a good idea to Suspend Project when it isn't actually running ;)
In my case, after a 3-thread task is Waiting, just 1 core is left available for other tasks. But according to the list of tasks, really just one of them is getting "a green" to run. Thus I think the problem is in the client.

Apparently it is sufficient to briefly suspend (and then resume) the waiting Aqua task - the remaining (nCPU-1) tasks start immediately. I do again blame the client. Let's report it on the Alpha channel.

I won't disagree with your assessment.  I don't think the core client is quite ready for multicore/multithreaded tasks.

But I'll also add that I haven't tried limiting IQUANA to less than <ncpus=all of them> even though the app is supposed to allow that.  I've had reasonably good results with the default 4.00CPUs & 2.00CPUs using the RS=100(11.43%).  But progress on the CoreDuo machines is quite slow as you might expect.  I stick with the project because I'm interested in the developer's goal of trying algorithms for theoretical Quantum computers--too cool for an old Star Trek fan to resist.

I probably need to pay more attention to AQUA's user forum.  The only threads I've really followed are about the problem of locking up 1 core when in state Waiting (which they are blaming on the current BOINC Scheduler implementation).

[edit]I see from the wishlist that Fred still can't get any work from Aqua--I guess the folks there must not like him :([/edit]


Pepo

#6
Quote from: jjwhalen on September 17, 2010, 11:24:08 PM
But I'll also add that I haven't tried limiting IQUANA to less than <ncpus=all of them>
I'm limiting BOINC as a whole - to keep heat low and have some instantly available CPU headroom. (For years I ran BOINC at 100%.) I did not find any option to limit IQUANA itself (except what I did some months ago - lowering BOINC's CPU limit to 2, downloading one AQUA task, and restoring BOINC's former CPU limit).

Quote: I probably need to pay more attention to AQUA's user forum.  The only threads I've really followed are about the problem of locking up 1 core when in state Waiting (which they are blaming on the current BOINC Scheduler implementation).
I suddenly felt this need too, and 90 minutes ago started to read the forum ;D (No, not really, I see I've already posted a few messages there some 2 years ago.)
Peter

fred

Quote from: jjwhalen on September 17, 2010, 11:24:08 PM
[edit]I see from the wishlist that Fred still can't get any work from Aqua--I guess the folks there must not like him :([/edit]
Got 2 of them for 8 CPUs.

fred

Fixed 2 problems.
1) Elapsed time () and the checkpoint time.

Pepo

Quote from: fred on September 18, 2010, 01:34:54 PM
Fixed 2 problems.
1) Elapsed time () and the checkpoint time.
What was actually going on with the funky Checkpoint time on AQUA? I did not experience anything.
Peter

fred

Quote from: Pepo on September 18, 2010, 01:39:49 PM
Quote from: fred on September 18, 2010, 01:34:54 PM
Fixed 2 problems.
1) Elapsed time () and the checkpoint time.
What was actually going on with the funky Checkpoint time on AQUA? I did not experience anything.
For some reason they use something like the elapsed CPU time, and when running on e.g. 8 cores that is 8 times too much.
The elapsed CPU time that is used as a baseline is also 8 times too much.

dCheckpointRelative = dElapsedCheck/dCheckpointCpuFactor  - dCheckpoint/dCheckpointCpuFactor;
dCheckpointRelative /= dCheckpointRatio;

dCheckpointCpuFactor = number of cores, i.e. ceil(cores in use).
dCheckpoint = relative checkpoint time.
dElapsedCheck = elapsed CPU time = time left ().
dCheckpointRatio = CPU %, only used on the GPU, otherwise 1.
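
To make the correction above concrete, here is a small Python sketch of the same two-line calculation. It is an illustration only, not the BoincTasks source: the function and argument names are made up, the meaning of each input is taken from fred's descriptions, and the guard against a core count below 1 is an added assumption.

```python
import math


def checkpoint_relative(elapsed_cpu: float, checkpoint_cpu: float,
                        cores_in_use: float, cpu_ratio: float = 1.0) -> float:
    """Time since the last checkpoint, scaled back to per-core time.

    elapsed_cpu    -- dElapsedCheck: CPU time summed over all cores
    checkpoint_cpu -- dCheckpoint: CPU time at the last checkpoint
    cores_in_use   -- rounded up to give dCheckpointCpuFactor
    cpu_ratio      -- dCheckpointRatio: 1 for CPU tasks (per the post)
    """
    cpu_factor = max(1, math.ceil(cores_in_use))  # guard is an assumption
    relative = elapsed_cpu / cpu_factor - checkpoint_cpu / cpu_factor
    return relative / cpu_ratio


# 8 cores: 960 s of summed CPU time, last checkpoint at 800 s of CPU time.
print(checkpoint_relative(960.0, 800.0, 8))  # 20.0
```

Without the division, the same task would show 160 seconds since the last checkpoint, which is the "8 times too much" described above.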

jjwhalen

Quote from: Pepo on September 18, 2010, 12:13:41 AM
Quote from: jjwhalen on September 17, 2010, 11:24:08 PM
But I'll also add that I haven't tried limiting IQUANA to less than <ncpus=all of them>
I'm limiting BOINC as a whole - to keep heat low and have some instantly available CPU headroom. (For years I ran BOINC at 100%.) I did not find any option to limit IQUANA itself (except what I did some months ago - lowering BOINC's CPU limit to 2, downloading one AQUA task, and restoring BOINC's former CPU limit).


I don't want to sound like a know-it-all, especially since I haven't tried it myself, but apparently you can manually edit the <app_version> section of client_state.xml, which gets written automatically to allow all CPU cores to be used:

Quote:
<?example from a quad-core host?>

<app_version>
    <app_name>IQUANA</app_name>
    <version_num>114</version_num>
    <platform>windows_x86_64</platform>
    <avg_ncpus>4.000000</avg_ncpus>
    <max_ncpus>4.000000</max_ncpus>
    <flops>15448699211.215364</flops>
    <plan_class>mt1</plan_class>
    <api_version>6.9.0</api_version>
    <cmdline>--nthreads 4</cmdline>
    <file_ref>
        <file_name>iquana_1.14_windows_x86_64__mt1.exe</file_name>
        <main_program/>
    </file_ref>
    <file_ref>
        <file_name>vcomp90_64bit</file_name>
        <open_name>vcomp90.dll</open_name>
        <copy_file/>
    </file_ref>
    <file_ref>
        <file_name>logo.jpg</file_name>
        <open_name>logo.jpg</open_name>
    </file_ref>
    <file_ref>
        <file_name>Helvetica.txf</file_name>
        <open_name>Helvetica.txf</open_name>
    </file_ref>
    <file_ref>
        <file_name>gfx_4.15_x86_64.exe</file_name>
        <open_name>graphics_app</open_name>
        <copy_file/>
    </file_ref>
</app_version>


<?example from a dual-core host?>

<app_version>
    <app_name>IQUANA</app_name>
    <version_num>114</version_num>
    <platform>windows_x86_64</platform>
    <avg_ncpus>2.000000</avg_ncpus>
    <max_ncpus>2.000000</max_ncpus>
    <flops>6523948620.146737</flops>
    <plan_class>mt1</plan_class>
    <api_version>6.9.0</api_version>
    <cmdline>--nthreads 2</cmdline>
    <file_ref>
        <file_name>iquana_1.14_windows_x86_64__mt1.exe</file_name>
        <main_program/>
    </file_ref>
    <file_ref>
        <file_name>vcomp90_64bit</file_name>
        <open_name>vcomp90.dll</open_name>
        <copy_file/>
    </file_ref>
    <file_ref>
        <file_name>logo.jpg</file_name>
        <open_name>logo.jpg</open_name>
    </file_ref>
    <file_ref>
        <file_name>Helvetica.txf</file_name>
        <open_name>Helvetica.txf</open_name>
    </file_ref>
    <file_ref>
        <file_name>gfx_4.15_x86_64.exe</file_name>
        <open_name>graphics_app</open_name>
        <copy_file/>
    </file_ref>
</app_version>

The structure looks straightforward and I expect all you need do is amend the <cmdline>--nthreads value.  When I get a slack period I'll play around and see what happens.  What could go wrong ??? ;D
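
For anyone who would rather script the edit than do it by hand, this is a minimal Python sketch of the change described above. It works on a trimmed copy of the quad-core <app_version> block held in a string. `set_nthreads` is a hypothetical helper, and nothing in this thread confirms that the client honours the edited value or whether <avg_ncpus> and <max_ncpus> would need to change as well, so the sketch leaves them alone.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed copy of the quad-core example; the <file_ref> entries are omitted.
FRAGMENT = """<app_version>
    <app_name>IQUANA</app_name>
    <avg_ncpus>4.000000</avg_ncpus>
    <max_ncpus>4.000000</max_ncpus>
    <cmdline>--nthreads 4</cmdline>
</app_version>"""


def set_nthreads(app_version_xml: str, n: int) -> str:
    """Return the fragment with the --nthreads value in <cmdline> set to n."""
    root = ET.fromstring(app_version_xml)
    cmdline = root.find("cmdline")
    if cmdline is None or cmdline.text is None:
        raise ValueError("no <cmdline> text found")
    new_text, count = re.subn(r"(--nthreads\s+)\d+", rf"\g<1>{n}", cmdline.text)
    if count == 0:
        raise ValueError("no --nthreads option in <cmdline>")
    cmdline.text = new_text
    return ET.tostring(root, encoding="unicode")


print(set_nthreads(FRAGMENT, 2))
```

The sketch deliberately never touches a live file; any experiment on the real client_state.xml is best done on a backup copy with the client stopped.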


jjwhalen

#12
Quote from: fred on September 18, 2010, 02:11:00 PM
Quote from: Pepo on September 18, 2010, 01:39:49 PM
Quote from: fred on September 18, 2010, 01:34:54 PM
Fixed 2 problems.
1) Elapsed time () and the checkpoint time.
What was actually going on with the funky Checkpoint time on AQUA? I did not experience anything.
For some reason they use something like the elapsed CPU time, and when running on e.g. 8 cores that is 8 times too much.
The elapsed CPU time that is used as a baseline is also 8 times too much.

dCheckpointRelative = dElapsedCheck/dCheckpointCpuFactor  - dCheckpoint/dCheckpointCpuFactor;
dCheckpointRelative /= dCheckpointRatio;

dCheckpointCpuFactor = number of cores, i.e. ceil(cores in use).
dCheckpoint = relative checkpoint time.
dElapsedCheck = elapsed CPU time = time left ().
dCheckpointRatio = CPU %, only used on the GPU, otherwise 1.


Yup, that makes perfect sense, consistent with the visual indications.  This is excellent news! :)


Pepo

Quote from: fred on September 18, 2010, 02:11:00 PM
Quote from: Pepo on September 18, 2010, 01:39:49 PM
Quote from: fred on September 18, 2010, 01:34:54 PM
Fixed 2 problems.
1) Elapsed time () and the checkpoint time.
What was actually going on with the funky Checkpoint time on AQUA? I did not experience anything.
For some reason they use something like the elapsed CPU time, and when running on e.g. 8 cores that is 8 times too much.
The elapsed CPU time that is used as a baseline is also 8 times too much.
Well, you can call me ignorant, but... both BOINC Manager and BoincTasks say in tandem (in the task's Properties window) "CPU time at last checkpoint". Thus, on an 8-core rig it HAS TO flow (up to) 8 times faster than the wall-clock elapsed time. Doesn't it??
Peter

fred

Quote from: Pepo on September 19, 2010, 05:17:02 PM
Well, you can call me ignorant, but... both BOINC Manager and BoincTasks say in tandem (in the task's Properties window) "CPU time at last checkpoint". Thus, on an 8-core rig it HAS TO flow (up to) 8 times faster than the wall-clock elapsed time. Doesn't it??
CPU time counts the time of each core. So 12 seconds of wall-clock time gives 1*12 = 12 seconds of CPU time on one core and 8*12 = 96 seconds of CPU time on 8 cores.
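
The relationship in this reply can be written as a pair of one-line conversions. This is a simple Python sketch, assuming every core is fully busy for the whole interval; the function names are illustrative only.

```python
def cpu_time(wall_seconds: float, busy_cores: int) -> float:
    """CPU time accumulated when busy_cores cores each run for wall_seconds."""
    return wall_seconds * busy_cores


def wall_time(cpu_seconds: float, busy_cores: int) -> float:
    """Inverse: the wall-clock time behind a summed CPU-time figure."""
    return cpu_seconds / busy_cores


print(cpu_time(12, 1))   # 12
print(cpu_time(12, 8))   # 96
print(wall_time(96, 8))  # 12.0
```

This is also why Pepo's expectation holds: a "CPU time at last checkpoint" figure on a fully loaded 8-core task advances up to 8 times faster than the wall clock.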