Messages - JStateson

#16
Occasionally a GPU gets hung and never finishes a job, or it can reject a job within seconds of receiving it.  These events are quickly discovered using the rules mechanism.  Currently, a batch file can be executed and an email or text message can easily be sent.  However, it would be advantageous to the project and the user to be able to handle the situation automatically.  This can only be implemented if identifying parameters can be passed from BoincTasks to the handler.  At a minimum, the following parameters might be needed:

$temp---------temperature of the device assuming tthrottle running or "none"
$device-------device id of GPU (D0, D1, etc) or just "CPU" if not a co-processor
$ip_address---need to know which system has problem
$port---------if needed to communicate with client and some systems have multiple clients
$password-----if needed to communicate with the client
$rule_name----the name of the rule could have an identifying phrase useful to the handler
$computer-----name of the system
$platform-----handler might need to know which OS: Linux, mac, windows
$project------name of project would be useful to handler
$app----------name of app
$rule_count---number of times rule has been applied

Example of rule usage

if Elapsed time > 5 minutes,  project "SETI@home",  app "8.01 setiathome_v8 (cuda90)", run program:
d:\ProgramData\boinc\scripts\HandleRule.bat $rule_name $ $ip_address $device

With these additions, more useful rules can be contributed as well as 3rd party scripts or apps such as resetting the GPU, excluding it from use by the Boinc client, or shutting down the client or system.
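A minimal sketch of what a handler on the receiving end could look like, written in Python rather than as a batch file. The parameter substitution is only the proposal above and is not something BoincTasks does today; the function name `describe_event` and the argument order (rule name, IP address, device) are assumptions made for illustration.

```python
def describe_event(args):
    """Summarize the arguments a rule might pass to a handler.

    Hypothetical order: rule_name, ip_address, device.
    """
    if len(args) < 3:
        raise ValueError("expected: rule_name ip_address device")
    rule_name, ip_address, device = args[:3]
    # A real handler would act here (reset the GPU, exclude it, send a text).
    return f"rule={rule_name} host={ip_address} device={device}"


# Example with made-up values standing in for what the rule would pass.
print(describe_event(["StopNocal", "192.168.1.20", "D1"]))
```

A real handler would read the same values from `sys.argv[1:]`; keeping the parsing in one function makes it easy to check without triggering a rule.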

There is a discussion from Jan 2019 by the BOINC principals here where they are considering adding XML files that basically duplicate a few of the BoincTasks rules.  Their XML includes, for example, instructions to enable or disable a particular nvidia board.
This functionality is partially present in BoincTasks but is missing the parameters required to identify the device and system having the problem.  Even if their "Computing prefs 2.0" is implemented, it would require those XML files to be present on each system.

The device_id can be 0, 1, 2, etc. for each type of GPU, so it must include a type such as nvidia, intel, amd, etc.
The naming needs to be consistent with that used by exclude_gpu, which appears to be
  [<type>NVIDIA|ATI|intel_gpu</type>]
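
To show why the type has to travel with the device number, here is a small Python sketch that builds an exclude_gpu fragment from a device number and a type. The element names (`url`, `device_num`, `type`) follow the BOINC client configuration format as far as I understand it; the function name and the example URL are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

ALLOWED_TYPES = {"NVIDIA", "ATI", "intel_gpu"}


def exclude_gpu_xml(project_url, device_num, gpu_type):
    """Return an <exclude_gpu> fragment for one device of one GPU type."""
    if gpu_type not in ALLOWED_TYPES:
        raise ValueError(f"unknown GPU type: {gpu_type}")
    node = ET.Element("exclude_gpu")
    ET.SubElement(node, "url").text = project_url
    ET.SubElement(node, "device_num").text = str(device_num)
    ET.SubElement(node, "type").text = gpu_type
    return ET.tostring(node, encoding="unicode")


# Device 1 alone is ambiguous; device 1 of type NVIDIA is not.
print(exclude_gpu_xml("http://setiathome.berkeley.edu/", 1, "NVIDIA"))
```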
#17
Right click on a computer in the "All Computers" tree brings up a select list of apps to run.

Apps would be on a tab similar to "Extra" -> "BoincTask settings" -> "Messages"

Buttons such as ADD, DEL, TEST etc.

Example of what it might look like.  Instead of "Project" and "Message"
  "Name"                    "Command"
PuttyLinux            "C:\Program Files\PuTTY\putty.exe" username@$(IP_ADDRESS) -pw password
IssueMWUpdate          "D:\RUN_MW_RPC_APP.BAT" $(IP_ADDRESS) $(PORT) $(PASSWORD)

etc


The names would show up in the dropdown box
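
A rough sketch of the substitution step, assuming the `$(NAME)` placeholder syntax from the example rows above. Nothing here reflects how BoincTasks works internally; `expand_command` and the sample values are hypothetical.

```python
import re


def expand_command(template, values):
    """Replace $(NAME) placeholders with values for the selected computer.

    Unknown placeholders are left untouched so a typo stays visible.
    """
    def substitute(match):
        return values.get(match.group(1), match.group(0))

    return re.sub(r"\$\(([A-Z_]+)\)", substitute, template)


# Made-up values for one computer in the "All Computers" tree.
computer = {"IP_ADDRESS": "192.168.1.20", "PORT": "31416", "PASSWORD": "secret"}
print(expand_command(r'"D:\RUN_MW_RPC_APP.BAT" $(IP_ADDRESS) $(PORT) $(PASSWORD)', computer))
```

Leaving unknown placeholders in place, rather than replacing them with an empty string, means a TEST button would show exactly which name was not recognized.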
#18
Questions / need help making up a rule
August 24, 2019, 11:08:24 AM
This is a hardware or software problem but it would be nice if I could spot the problem when it first occurs.  Going to post over at BOINC also as possibly the problem could be debugged better if I knew more about what was happening.

---once every couple of days----

On a 5 GPU rig, one of the GPUs crunches for 4-5 seconds then goes on to another work unit.  A queue of "waiting to run" starts building up.  Because there are 4 other working GPUs, they pull from this queue so the queue grows only slowly.  After about an hour or two there might be 40 items in the queue.

sudo /etc/init.d/boinc-client restart  => does not always work
sudo shutdown now => looks like it works but I generally cycle the power after a few minutes of waiting

When the system boots back up I run a script to set the fans to 100%, otherwise temps get up past 80 for a pair of gtx1060s.

I failed to make a note of which GPU had the problem, if indeed the problem is a single GPU.  The only way to tell is to stop the fan and see which one reports 0 speed and then look up the bus id and see which GPU it matches in coproc-info.xml.  Have not done this yet but will the next time this happens.  It would be nice if BOINC reported the same GPU# that nvidia reports in their diagnostics.  BOINC assigns 0 to the best GPU (like a 1070 or gtx 2080) and larger numbers to weaker GPUs.  Not sure why they bother to rank GPUs in the first place.

---back to the rule---

The most obvious thing is to see if there are more than X items in the "waiting to run" queue and then run a script that sends me a text message.  I already have a script that does that, but there is no "waiting to run" condition, and I am pretty sure the CPU% was 99 percent so I can't use that as a trigger.  However, the CPU% is always 99 because I need to run "-nobs" to force the system to dedicate a thread 100% to the GPU.  So possibly the CPU is really idle and the 99 is simply a "busy polling all the time" symptom, which is a feature of the "-nobs" parameter.
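
One alternative trigger, sketched in Python under the assumption that per-task results are available as (device, elapsed seconds) pairs, for example from a history file: flag any GPU that keeps finishing work units within a few seconds. The function name, the thresholds, and the sample data are all made up for illustration.

```python
from collections import Counter


def suspect_devices(results, max_seconds=10, min_count=3):
    """Return devices with at least min_count results of max_seconds or less."""
    short = Counter(dev for dev, secs in results if secs <= max_seconds)
    return sorted(dev for dev, count in short.items() if count >= min_count)


# Made-up sample: D3 gives up after 4-5 seconds, the others run normally.
sample = [("D0", 900), ("D3", 4), ("D1", 880), ("D3", 5), ("D3", 4), ("D2", 910)]
print(suspect_devices(sample))
```

Requiring several short results before flagging a device avoids reacting to a single bad work unit.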
#19
Questions / Re: Rule not continuous?
August 23, 2019, 06:37:10 PM
Yes, I think your observation is correct.  I have a similar problem with  the SETI NoCal app:  It seems to get stuck.  Instead of finishing in 14 - 15 minutes it drags on for hours until it times out.  I use a rule to suspend the task after 24 minutes.  The following is the log:

23 August 2019 - 13:04:15 Rule(s) ---- Active: 1
23 August 2019 - 13:04:15 Rule: StopNocal ---- rx560, SETI@home, 8.22 SETI@home v8 (opencl_ati5_SoG_nocal),  | Elapsed Time > 00d,00:24:00
23 August 2019 - 13:04:15 ============================================================================== ----
23 August 2019 - 13:05:16 Rule: StopNocal, trigger ---- rx560, SETI@home, SETI@home v8, (Elapsed >00d,03:26:24),
23 August 2019 - 13:05:16 Rule: StopNocal, from que ---- Suspend task
23 August 2019 - 13:05:17 Rule: StopNocal ---- Activated: OK, Project: SETI@home, rx560, Suspend task
23 August 2019 - 13:07:19 Rule: StopNocal ---- No longer active: rx560, SETI@home, SETI@home v8, (Elapsed >00d,03:26:24)


Note that the rule is NO LONGER ACTIVE, even though the task still exists and is beyond the time limit.  The problem is that the task needs to be aborted.  I don't see an easy way to do that, especially since the task is on a remote system.  I will look at this problem but I suspect it is not an easy fix, as a simple re-activate will just try to suspend the same task again.  Just a guess.  Possibly Fred could add "Abort" as a rule option.

[EDIT] This may not apply to your case, but I discovered that if I resume the task it finishes within a few minutes.  Obviously there is a problem of some type, hardware most likely.   For this unique case I can write an additional rule that if a task is suspended for x minutes it could be resumed.  However, I do not see a "resume task" option. I think an abort option is best.
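
For anyone scripting against these log lines, a small Python sketch that parses the elapsed-time format shown in the log above (for example `00d,03:26:24`) so it can be compared against a limit. Only the format comes from the log; the function name `parse_elapsed` is an assumption.

```python
import re
from datetime import timedelta


def parse_elapsed(text):
    """Parse a 'DDd,HH:MM:SS' string into a timedelta."""
    match = re.fullmatch(r"(\d+)d,(\d+):(\d+):(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized elapsed time: {text!r}")
    days, hours, minutes, seconds = (int(part) for part in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


limit = parse_elapsed("00d,00:24:00")    # the rule's threshold
elapsed = parse_elapsed("00d,03:26:24")  # value reported when it triggered
print(elapsed > limit)
```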
#20
I added a feature to compare GPU boards.  GRC mining frequently has a mining rack with 6 or more GPUs.

As shown below, there are 10 assorted nVidia boards recognized by BOINC.  The system TB85-nvidia is running the Linux app "cuda90" and shows 9351 work units completed successfully.  The "Type Analysis" shows the GPU option; the last 5.27 hours were analyzed and organized by GPU#.  Note that the gtx1070 Ti had the best performance, with the GTX1060s slower.  The display is elapsed time in minutes.

Source to build the app (windows c#) and links to the executables as well as instructions for running the program are in the Boinctasks History Analyzer & Project performance post.  Feel free to email or PM me any questions, bugs, suggestions, etc.  I assume you can also post here.  I put a zip file with 32 and 64 bit executables here.  This app uses dotnet framework 4.6.1 but I assume anything later could also be installed to allow it to run.



The following shows how many credits were earned in the last 1.2 days.  Note that the system needs to be mining continuously, with BoincTasks also running constantly, during the 1.2 days.




Added feature:  can plot wall time -vs- elapsed time to see changes in GPU performance.




All, or individual, GPUs can be scatter graphed to show differences in elapsed time.







In addition, one can offset each GPU to see if there are differences in the processing of the data.
For example, the first graph shows a 5 GPU system processing Milkyway datasets.  Once offset, it is
obvious that there are two different types of datasets, some of which take longer to get the same
credits.

#21
Rules / Re: Members that will be removed
August 07, 2019, 01:41:38 PM
Quote from: fred on June 11, 2010, 04:33:15 PM
The following members will be removed:

All members that have never posted and are on http://www.stopforumspam.com/

I once got on that StopForumSpam list and it took me 3 weeks to get off.  It was the result of complaining on a climate forum that skeptics should not be called deniers and that satellite studies show the earth is greening as a result of CO2 increases.  It had to be a moderator or project admin who did it.  I will never crunch on any BOINC climate projects and have not since about 2012 when it happened.  I also avoid any discussions where possible.  I do not think there is any accountability there, and one's email can be easily spoofed.
#22
I have seen this exact problem before.  It is not related to BoincTasks and can happen with any windows form that tries to restore its previous position when the monitor layout has been changed or the multi-display driver has "forgotten" that two monitors exist.

The problem can happen if one of 2 (or more) video boards has a reset.  This shows up in the event viewer as an "nVidia driver reset" or the ATI equivalent.  It can also show up (lost program) if a monitor is turned off and/or not recognized when the system is powered up.  Possibly compounded by older drivers and the lack of Microsoft updates to the video handler in windows 7.

I do not know how you got the problem but I can give you my procedure for fixing it.

1. The icon for the missing program is in the task bar but the program is nowhere to be seen.
2. Select the icon so that the program would have popped up on the screen (this is trial and error).
3. Press ALT+SPACE-M.  Assuming the program is "on screen" somewhere, this gets windows to focus on it and enables the cursor keys to control the movement ("m") of the form.
4. Using the cursor keys, move left, right, up, down until the form shows up.
5. Once the form shows up, click on the maximize button so it maximizes to the current display.  This is important because if you don't maximize it then next time it will revert to its hidden location.

This problem went away for me when I switched to windows 10.  I suggest you use the free upgrade to get onto 10.

ALT+SPACE-M = Hold down ALT and the Space bar at the same time, then release both and press the "m" key.  Do not use the mouse until the form is visible on one of the displays.
#23
Thanks Fred!

I will include the setup part about 2 days and 30 days in my app.
#24
I thought if I set "remove history after 2 days" and also "move to long term storage after 2 days" that I could save history files that have older data and would get (3*long) *2 days + 1 cvs worth of old and fresh data (7 days).  That didn't happen, as it is not working like I guessed.

Looking at the help info here I see the recommended is after 7 days.  The files at \appdata\roaming\efmer\BoincTasks\history that are named "long" never seem to have anything in them other than headers.

With 2 and 7 days respectively it seems like I would lose 5 days of data after the 7th day but that does not happen as nothing gets into the "long" files.

I get cvs1 & cvs2 filling even though I have not checked "Make  a  backup ..... (not recommended)".

I did not see an explanation for that option.   Why is it not recommended?

Asking about this as my app "BTHistoryReader" is more useful if there is more data to analyze.  I am losing data after 2 days and would like to have at least a week's worth of data to analyze.
#25
If you signed up for the insider programs you can disable those early releases in windows by clicking on start -> settings -> update and security -> insider

I just got bit by that update.  One of my systems had an old driver for the nVidia card and the 1809 feature update put in a newer driver.  A known problem is that the newer drivers are just updates to the graphics and do not include opencl, so all the GPU tasks on that system stopped.  However, I should have updated the driver much earlier.

#26
Not sure if this is a real problem as I was easily able to fix it.  I do not know when it started, only when I first noticed the problem (Jun 2).

I upgraded to BT 1.8 on May 28 and did not notice any problems but probably was not looking at temps.

This morning I look at BT and see a strange problem:  Temperatures for the CPU are way too low on a system that I know runs hot.  I reboot that system, same problem.  BT shows temps in the low 30c which I know is wrong.  I go over to the system and bring up tthrottle to do a re-cal but I notice it shows temps in the mid 50c which is correct.  I go back to BT to see if I was looking at the wrong system but it is the correct one.  The tasks and details match exactly; just the reported temperatures are not correct.  I unchecked [X] the system in "computers" and then put it back in, and the temperatures are correct.  It was getting temps from somewhere else or was confused.
#27
Fred has graciously allowed my program BThistory to be promoted as an add-on to Boinctasks.
This post is frequently updated, please use your browser refresh feature to get changes and make
sure you have the latest build date as shown on main menu under the Open History button.
The location of the executables is listed at the end of this post.  I put a zip file with 32 and 64 bit executables here.
I do not have an install package for windows so you will have to answer a lot of "are you sure" questions from windows and your anti-virus when downloading or executing the app the first time.

BTHistory reads one or more BoincTasks history files and allows data analysis for elapsed time, throughput and idle time.  If more than one file is opened, then comparisons can be made between different systems.  New or unknown applications are reported, highlighted and can be compared.  The program is written in C# and compiled under Visual Studio 2017.  One can download the executables or build the sources at location GitHub/JStateson.  Additional utility programs are included in the VS2017 solution and are explained below.  This app uses dotnet framework 4.6.1 but I assume anything later could also be installed to allow it to run.

1.   BTHistory main form and throughput analysis.


The history file "z400-4-s9x00" has been opened and there are 2 project names available.  One has been selected (milkyway) and that project has only one app running on this system.  The number 5517 indicates the number of results.  A figure over 20,000 may take a while to load but that can be restricted to only more recent results.  The throughput filter is selected and the last hours of data was fetched (no problems are shown in info window).  The continuity check was done, indicating at most 1.48 minutes between results.  Knowing that this system had 4 boards, with 5 concurrent WUs, and that the average credit was 224 points, the results show about 5 credits per second per device.  The "1" adjacent to the App Name box indicates there is only 1 app associated with Milkyway, at least on this particular system.

2.   BThistory and Elapsed Time



Elapsed time is in minutes, but the plot parameters were changed to show the effective ET since four tasks were being run concurrently on each GPU.  With 5 GPUs, the mean of 50 seconds (as shown above) indicates that the system will produce about one work unit every 10 seconds (50 / 5 = 10).

2.1.   BThistory Idle Time


The Idle Time analysis is useful to show when projects run out of data or, as in the case above, the project fails to provide data until all the data has been processed.   In that case, where the system is waiting for data to arrive, the gap is considered idle time.  The above data shows that about every two hours there is a 6 to 9 minute gap before the server provides data.
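
The gap idea can be sketched in a few lines of Python. This is not the BThistory code (which is C#); it simply treats any interval between consecutive result completion times that exceeds a threshold as idle time. The function name, the 5 minute threshold, and the sample timestamps are assumptions.

```python
from datetime import datetime, timedelta


def idle_gaps(completion_times, threshold=timedelta(minutes=5)):
    """Return (gap_start, gap_length) for gaps longer than threshold."""
    ordered = sorted(completion_times)
    return [
        (earlier, later - earlier)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > threshold
    ]


# Made-up completion times with one 8 minute hole in the middle.
times = [
    datetime(2019, 8, 23, 13, 0),
    datetime(2019, 8, 23, 13, 1),
    datetime(2019, 8, 23, 13, 9),
    datetime(2019, 8, 23, 13, 10),
]
for start, length in idle_gaps(times):
    print(start, length)
```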

2.2.  BThistory Dataset Scatter Plot


This graph can be used to observe how different data sets compare to one another in elapsed time.  Once a project is selected, then all the data sets can be displayed or any subset of them.  The application can also be changed to see how that compares.

2.3.  BThistory Dataset Member


It is possible to find which dataset (name) a data point belongs to by clicking on or near the point.  Restrictions: under 250 points and only 1 series.


3.   BThistory Project Structure


This shows which projects are in the BThistory database or are in the history file

4.   BThistory Select Multiple Systems

If more than one history file is opened, then BThistory produces a comparison of the different systems.  Typically the files of interest end in CVS and do not have the phrase "_long_" in the filename.  In the event that the "long" files do contain data, you should uncheck that exclusion.
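
The selection rule just described can be sketched in Python as follows. This is an illustration of the filter only, not the BThistory code; the function name, the exact ".cvs" suffix match, and the sample file names are assumptions.

```python
def select_history_files(names, include_long=False):
    """Keep names ending in .cvs, skipping "_long_" files unless asked."""
    picked = []
    for name in names:
        lower = name.lower()
        if not lower.endswith(".cvs"):
            continue
        if "_long_" in lower and not include_long:
            continue
        picked.append(name)
    return picked


# Hypothetical file names for illustration.
files = ["history_z400.cvs", "history_long_z400.cvs", "history_z400.cvs1", "notes.txt"]
print(select_history_files(files))
```

Passing `include_long=True` corresponds to unchecking the exclusion when the "long" files do contain data.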



4.1   BThistory Comparison


This feature allows comparison of systems across the same project and app.  Currently SELECT and REPORT are the only defined operations.  You can use this tool to compare, for example, the computation of SETI using NVidia or AMD boards.  Once the project and app are selected only those systems will be shown.  Select the app (example above shows sse2) and the system you want to use in the comparison and click "save" to copy the results into notepad.  Then select the avx app and the system that is desired and you can examine the statistics and compare to what you saved in notepad.


4.2  BThistory Scatter Plots

There are two scatter plot options.  The first one compares Elapsed time  between the same applications.  This can show the difference between nVidia, ATI , sse2 or avx for example.



The second scatter plot shows only the selected app but each system is represented.  This shows how a particular app performs on different systems.





5   Other programs peripheral to BThistory

   The BThistory program resides at
https://github.com/JStateson/Gridcoin/tree/master/BTHistoryReader
The executables are stored in a .7z file at this GitHub location or you can obtain from my web site as
listed below

However, the actual VS2017 solution is at GitHub/JStateson/GridCoin which will cause one or more of the following programs to be built in addition to BTHistory.   All programs listed were built using VS2017 C# except for the RPC library, in C, which is only used by BoincRpc.

5.1  HostProjectStats

This is an aspx program that creates a web page.  It can be compiled and executed on your windows system or you can run the program using most browsers by clicking on the link below
http://stateson.net/HostProjectStats

This program obtains elapsed time information from most Boinc Projects and, if you optionally know the load and idle wattage, it will calculate the average credit and wattage used to produce those credits for the system and the boards.  This program requires that the data be available so it may not work with anonymous access unless the project had allowed it for the specified HOSTID.

As shown below, the project Milkyway has been browsed to and hostID 705276 selected.  This is one of the top systems and is listed by default when first bringing up the program.  It may not always be available.



Inexpensive watt meters are available, but you can build your own as shown

Here

https://github.com/JStateson/Gridcoin/blob/master/HostProjectStats/wmeter_wiring.jpg

and assembled here

https://github.com/JStateson/Gridcoin/blob/master/HostProjectStats/wmeter_assembled.jpg

Full load results of a pair of GPUs running 4 concurrent threads are shown here
https://github.com/JStateson/Gridcoin/blob/master/HostProjectStats/e5620_s9000_milkyway_4t_load.png

   HostProjectStats produced the following results based on the above data
https://github.com/JStateson/Gridcoin/blob/master/HostProjectStats/e5620_s9000_milkyway_4t.png


The 32 and 64 bit executables are here.  I do not have an install package so it may not run unless you have the latest visual c run-times, dot net modules and windows 10.  Visual studio 2017 is a free download and has all the stuff needed to build this program.  PM me if there is a problem.
#28
Wish List / Re: test for out of work
May 03, 2019, 03:48:01 PM
Found a way to test for out of work - it was in BT already, just was unaware of it
Discussion is here
https://boinc.berkeley.edu/forum_user_posts.php?userid=3749

Unfortunately, BT cannot do an update as that is not one of the options.  However, it can run a batch file which can then do an update.
I put together a tool to do an update across the internet from the batch file.
https://github.com/BeemerBiker/MilkywayRPCupdate/tree/master/RPCupdate
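
For comparison, the same update can be issued with the stock `boinccmd` tool that ships with BOINC, which accepts `--host`, `--passwd` and `--project URL update`. The Python sketch below only builds and (optionally) runs that command; the helper names, the IP address and the password are made up, and the project URL is given as an example.

```python
import subprocess


def update_command(host, port, password, project_url):
    """Build the boinccmd argument list for a remote project update."""
    return [
        "boinccmd",
        "--host", f"{host}:{port}",
        "--passwd", password,
        "--project", project_url, "update",
    ]


def run_update(host, port, password, project_url):
    """Run the update; raises CalledProcessError if boinccmd fails."""
    subprocess.run(update_command(host, port, password, project_url), check=True)


# Printing the command first is a cheap way to check it before a rule runs it.
print(update_command("192.168.1.20", 31416, "secret", "http://milkyway.cs.rpi.edu/milkyway/"))
```

The remote client has to allow GUI RPC connections from the machine running the script for this to work.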
#29
Wish List / test for out of work
April 25, 2019, 01:28:45 AM
Would like to know when there are no work units left along with a parameter such as time elapsed since queue was empty.  Currently there is a problem being discussed at milkyway where fast computers complete too soon and ask too often so they get banned and end up not getting any work until a manual update is issued.  Maybe the problem can be fixed at their end, no telling when though.

I would use the rule to issue a manual update and the rule would then be disabled and only re-enabled if more work actually arrived.

Some projects go offline for maintenance, but I have resource shares set to allow other projects to start.  The problem is fast systems getting "banned" and then needing an update to get more work.

Obviously this is a problem on the project server, but you might consider adding a rule to cover this.

[EDIT] Looks like there is a fix on the server side as discussed HERE.  Not the same fix as the topic suggests as one fix requires another as is usual.
#30
Questions / Re: NCI Tasks not showing
April 08, 2019, 04:23:23 AM
Thanks, had the same question.  I attached wuprop so I could monitor CPU temps as I was running only GPU tasks