Crunchers Anonymous

Cafe => General Discussion => Topic started by: William on June 12, 2012, 11:12:11 PM

Title: dont_use_dcf active at SETI main
Post by: William on June 12, 2012, 11:12:11 PM
Attention all SETI users:

as of the maintenance on Tuesday 12th June 2012 <dont_use_dcf/> is active on SETI main.

That means BOINC 7 clients are instructed to ignore DCF. To be fair since APR works now again, DCF isn't really needed.

The only problem with it is that the 'recommended' 7.0.25 client has a known bug that causes tasks to run in High Priority mode (HP/EDF) all the time.

So if you are on a 7.0.25 client, upgrade to 7.0.28 (http://boinc.berkeley.edu/download_all.php) ASAP.
Title: Re: dont_use_dcf active at SETI main
Post by: arkayn on June 12, 2012, 11:20:53 PM
I still have flops set in my app_infos, but I am using 7.0.28 on the FX4100/GTX460.

The i3/GTX560 is still running 6.12.43.
Title: Re: dont_use_dcf active at SETI main
Post by: Jason G on June 12, 2012, 11:22:01 PM
This is truly psychotic.


Quote from: Barry on June 12, 2012, 11:12:11 PM
... To be fair since APR works now again, DCF isn't really needed. ...

I'd better point out that this assertion isn't strictly true. It's one I've heard mentioned by Joe Segur and, I think, by David in changesets.

APR controls long term general estimates, but provides no short term facility for tracking machine hardware or performance changes, therefore relying on APR alone will result in headaches for those up (or down)-grading.

In engineering control theory, cascade control systems in distinct transient domains (as DCF with a single app + APR were) are perfectly fine, provided they are operational, contain no faulty heuristics/limits that unduly influence or destabilise normal operation, and one is intended for quick response & the other for slower response.

Removing the faulty DCF thing is probably good, given that it doesn't cope with multiple applications well, but APR can't cope gracefully with most of the things DCF did well... So the net effect is we lose the 'good parts' of DCF & gain wonky estimates that take even longer to correct when hardware (or software) changes.
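
A minimal sketch of the cascade idea described above, not BOINC's actual code: a slow loop (standing in for APR) averages the observed processing rate over many tasks, while a fast loop (standing in for DCF) corrects the estimate from the most recent runtimes. The struct, its member names, and the weights 0.01 and 0.5 are all assumptions chosen for illustration.

```cpp
#include <algorithm>

// Hypothetical two-loop runtime estimator (illustrative only).
// The slow loop tracks a long-term processing rate (APR-like);
// the fast loop keeps a short-term correction factor (DCF-like).
struct CascadeEstimator {
    double slow_rate;           // long-term flops estimate
    double fast_factor = 1.0;   // short-term multiplier on the estimate
    double slow_weight = 0.01;  // assumed: slow loop moves 1% per task
    double fast_weight = 0.5;   // assumed: fast loop moves 50% per task

    explicit CascadeEstimator(double initial_rate) : slow_rate(initial_rate) {}

    // Estimated runtime in seconds for a task of `fpops` operations.
    double estimate(double fpops) const {
        return fpops / slow_rate * fast_factor;
    }

    // Feed back the measured runtime of a finished task.
    void observe(double fpops, double actual_seconds) {
        // Fast loop: move the factor toward actual / uncorrected estimate.
        double uncorrected = fpops / slow_rate;
        double ratio = actual_seconds / uncorrected;
        fast_factor += fast_weight * (ratio - fast_factor);
        fast_factor = std::max(0.01, std::min(100.0, fast_factor));

        // Slow loop: move the long-term rate toward the observed rate.
        double observed_rate = fpops / actual_seconds;
        slow_rate += slow_weight * (observed_rate - slow_rate);
    }
};
```

With these assumed weights, a host that suddenly takes twice as long per task gets an estimate close to the new runtime after three tasks, while the slow rate on its own has barely moved. As the slow rate catches up, the fast factor drifts back toward 1.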

Jason
Title: Re: dont_use_dcf active at SETI main
Post by: William on June 12, 2012, 11:25:10 PM
Don't shoot the messenger. But I can amend the 'isn't really needed' later.

edit: amended in the only place that has a time limit on edit ::)
Title: Re: dont_use_dcf active at SETI main
Post by: Jason G on June 12, 2012, 11:44:20 PM
Thanks! That'll do quite well.

[Edit:] hmm, I left one out that I can think of now.  Using the machine for an extended period while crunching could be a problem for some as well.
Title: Re: dont_use_dcf active at SETI main
Post by: Richard Haselgrove on June 13, 2012, 01:05:30 AM
This particular bug is really only a big issue for people who run multiple BOINC projects on the same hardware.

If you run SETI only, the only difference you'll see is that your tasks run in 'Earliest Deadline First' mode. That is marked as 'high priority' in BOINC Manager's display, but it isn't the sort of raised priority which places any extra thermal stress on your CPU or GPU.

If you run other projects alongside SETI, and you're running v7.0.25, you may find that SETI appears to monopolise the CPU or GPU, and not allow other projects to take their turn at the trough. I think that's a big-enough show-stopper for BOINC as a whole that we should be able to get it turned off within a very few hours.
Title: Re: dont_use_dcf active at SETI main
Post by: Jason G on June 13, 2012, 01:13:11 AM
Sounds reasonable.  Given that the current recommended BOINC client is 7.0.25, I had been looking at migrating my mods to something newer than 6.10.58/60... I think I'll put that idea off and revisit it once the 7 series is stable-ish.
Title: Re: dont_use_dcf active at SETI main
Post by: Richard Haselgrove on June 13, 2012, 01:22:28 AM
v7.0.28 is pretty good, and far more eligible for 'recommended' status than v7.0.25 ever was. There's been very little client development work since v7.0.28 was tagged - they're mainly doing server, API, and git work.
Title: Re: dont_use_dcf active at SETI main
Post by: Jason G on June 13, 2012, 01:38:17 AM
Quote from: Richard Haselgrove on June 13, 2012, 01:22:28 AM
v7.0.28 is pretty good, and far more eligible for 'recommended' status than v7.0.25 ever was. There's been very little client development work since v7.0.28 was tagged - they're mainly doing server, API, and git work.
OK.  The API is the one that mostly concerns my end for the apps themselves, since I've never been able to use it in unmodified form, even for older CPU apps.  Probably I'll assess the newer client once I have the chance to examine the requirements for a v6<->v7 data pump.
Title: Re: dont_use_dcf active at SETI main
Post by: Richard Haselgrove on June 13, 2012, 01:40:36 AM
Fair enough. Your call whether you follow the 'workround' or the 'fix at source' route.
Title: Re: dont_use_dcf active at SETI main
Post by: Jason G on June 13, 2012, 01:47:28 AM
Quote from: Richard Haselgrove on June 13, 2012, 01:40:36 AM
Fair enough. Your call whether you follow the 'workround' or the 'fix at source' route.
For the time being 'workaround' as has been, since there are loose threads still to yank out with seti apps.  Once those are completely unravelled though, gutting Boinc sources for the root design issues becomes more viable.
Title: Re: dont_use_dcf active at SETI main
Post by: Josef W. Segur on June 13, 2012, 03:03:36 AM
Quote from: Jason G on June 12, 2012, 11:22:01 PM
Quote from: Barry on June 12, 2012, 11:12:11 PM
... To be fair since APR works now again, DCF isn't really needed. ...

I'd better point out that this assertion isn't strictly true. It's one I've heard mentioned by Joe Segur and, I think, by David in changesets.

APR controls long term general estimates, but provides no short term facility for tracking machine hardware or performance changes, therefore relying on APR alone will result in headaches for those up (or down)-grading.
...
Jason

It "works" in the sense that a carpenter's clawhammer can be used to drive tacks or railroad spikes, though not as efficiently as the common nails it's designed for.

For the ~50% of hosts attached to S@H which do 1 task a day or less, APR is truly a long-term average and the faster adaptation of DCF was probably better. For a top 500 host which does over 200 tasks a day with an application, APR isn't what I'd call long-term. If there weren't delays caused by cache and whatever fraction of reported tasks have to wait for a wingmate's return, APR would adapt to changes within 1.5 days for such hosts. For hosts running stock, cache doesn't have any effect since the feedback is in the form of <flops> adjustment.
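
A rough worked example of that adaptation time, assuming APR behaves like an exponentially weighted average in which each new sample carries a weight of 0.01. That weight is an assumption for illustration, not a value taken from this thread. It ignores cache and wingmate delays, and the function names are hypothetical.

```cpp
// Count how many samples an exponentially weighted average needs before
// the remaining error after a step change falls to `residual` or below.
// Each sample shrinks the error by a factor of (1 - weight).
int samples_to_settle(double weight, double residual) {
    int n = 0;
    double error = 1.0;
    while (error > residual) {
        error *= (1.0 - weight);
        ++n;
    }
    return n;
}

// Convert a sample count into days at a given task throughput.
double days_to_settle(int samples, double tasks_per_day) {
    return samples / tasks_per_day;
}
```

Under that assumption, getting within 5% of a new speed takes roughly 300 samples: about a day and a half at 200 tasks a day, but the better part of a year at 1 task a day.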

There are certainly conditions where it would be better to reset APR and let it do the early quick building of a new average, for instance if a high end GPU dies and an older one is inserted as a temporary replacement. It's awkward that the only way to accomplish that now is to force a new hostID.
Joe
Title: Re: dont_use_dcf active at SETI main
Post by: Richard Haselgrove on June 13, 2012, 04:09:17 AM
An 'at source' fix is under way - dont_use_dcf is being removed from SETI as we speak.

Don't feel under any pressure to upgrade from v7.0.25 just because of this - although v7.0.28 is undeniably better, and "should have already been able to be promoted to a public release before now" (Rom Walton)  ;) ::) ;D
Title: Re: dont_use_dcf active at SETI main
Post by: Jason G on June 13, 2012, 04:38:08 PM
Phew! That 7.0.25 thing was looking like a forum thread nightmare, glad that's averted  :)

With DCFs, I guess I'm spoiled by having more or less real-time tracking during machine usage on a per app basis, which 'feels' 'more right' on the faster GPUs.  Someday, after the apps are deemed tidy enough, I will end up transferring that operation to local <flops> adjustment, with a few other extensions (toward mixed devices)... back to v7 multibeam for now  :)

Jason
Title: Re: dont_use_dcf active at SETI main
Post by: arkayn on June 14, 2012, 01:45:43 AM
Both of my machines are out of high-priority mode, but they have a dcf of 0.01.
Title: Re: dont_use_dcf active at SETI main
Post by: William on June 14, 2012, 01:52:06 AM
Quote from: arkayn on June 14, 2012, 01:45:43 AM
Both of my machines are out of high-priority mode, but they have a dcf of 0.01.

?! the 6.12.34 shouldn't have been affected by the dont_use_dcf? and the other is running 7.0.27 (why not .28?)
Title: Re: dont_use_dcf active at SETI main
Post by: Claggy on June 14, 2012, 02:22:56 AM
Quote from: Barry on June 14, 2012, 01:52:06 AM
Quote from: arkayn on June 14, 2012, 01:45:43 AM
Both of my machines are out of high-priority mode, but they have a dcf of 0.01.

?! the 6.12.34 shouldn't have been affected by the dont_use_dcf? and the other is running 7.0.27 (why not .28?)
Or even 6.12.43, that host only has ~1600 Seti tasks, does it also have tasks from other projects with short deadlines?

and/or a large 'Maintain enough tasks to keep busy for at least' value in its preferences?

If those two hosts are sharing the same venue, you might want to split them so they use preferences suited to Boinc 6 or Boinc 7.

Claggy
Title: Re: dont_use_dcf active at SETI main
Post by: arkayn on June 14, 2012, 02:28:22 AM
Quote from: Barry on June 14, 2012, 01:52:06 AM
Quote from: arkayn on June 14, 2012, 01:45:43 AM
Both of my machines are out of high-priority mode, but they have a dcf of 0.01.

?! the 6.12.34 shouldn't have been affected by the dont_use_dcf? and the other is running 7.0.27 (why not .28?)

It was not the dont_use_dcf that caused it, but the removal of the <flops> from my app_info.

I have just not gotten around to updating to .28 yet, and the other machines are running 6.12.43.

Quote from: Claggy on June 14, 2012, 02:22:56 AM

Or even 6.12.43, that host only has ~1600 Seti tasks, does it also have tasks from other projects with short deadlines?

Claggy

I have NNT set for a little while to let my systems stabilize.
Title: Re: dont_use_dcf active at SETI main
Post by: William on June 14, 2012, 03:35:53 AM
Quote from: arkayn on June 14, 2012, 02:28:22 AM
It was not the dont_use_dcf that caused it, but the removal of the <flops> from my app_info.

I have just not gotten around to updating to .28 yet, and the other machines are running 6.12.43.

you have set nnt? as soon as the cache is empty, edit dcf to 1 before you let new tasks in.
Title: Re: dont_use_dcf active at SETI main
Post by: arkayn on June 14, 2012, 04:04:32 AM
Quote from: Barry on June 14, 2012, 03:35:53 AM
Quote from: arkayn on June 14, 2012, 02:28:22 AM
It was not the dont_use_dcf that caused it, but the removal of the <flops> from my app_info.

I have just not gotten around to updating to .28 yet, and the other machines are running 6.12.43.

you have set nnt? as soon as the cache is empty, edit dcf to 1 before you let new tasks in.

Yes it is on NNT.

I also just suspended my 9 AP GPU tasks as they would most likely end up with a -177 error with the current dcf.
Title: Re: dont_use_dcf active at SETI main
Post by: Jason G on June 16, 2012, 02:01:09 AM
Is this righting itself with Claggy's & Barry's recommendations etc.? Or is there still lots of cache to burn through?

Jason
Title: Re: dont_use_dcf active at SETI main
Post by: arkayn on June 16, 2012, 04:38:16 AM
The 460 is still at 356 tasks and the 560 has 1134.

So there is a ways to go yet.

Both machines still have a 0.01 dcf.
Title: Re: dont_use_dcf active at SETI main
Post by: William on June 21, 2012, 12:04:29 AM
Is anybody else running a V7 boinc on SETI beta under anonymous platform?

I'm seeing a bug - it's not picking up APR after the 10th valid here.
Can anybody confirm or dispute (?) this?
I'm doing an initial bugrep to David,  but I'd like to know if it's just me...
I doubt it, but you never know: the next task that should have been APR-estimated might just have been sent too early after the 10th validation came through.
Title: Re: dont_use_dcf active at SETI main
Post by: Richard Haselgrove on June 21, 2012, 12:38:44 AM
Quote from: Barry on June 21, 2012, 12:04:29 AM
Is anybody else running a V7 boinc on SETI beta under anonymous platform?

Not at the moment, but it would only take a few moments to set one running - when it would reach the 10th validation would be subject to the availability of wingmates.
Title: Re: dont_use_dcf active at SETI main
Post by: William on June 21, 2012, 01:25:38 AM
Quote from: Richard Haselgrove on June 21, 2012, 12:38:44 AM
Quote from: Barry on June 21, 2012, 12:04:29 AM
Is anybody else running a V7 boinc on SETI beta under anonymous platform?

Not at the moment, but it would only take a few moments to set one running - when it would reach the 10th validation would be subject to the availability of wingmates.

that would be helpful thanks. Maybe it just needs 11 validations and not 10.
Validating on beta is always extremely slow.
I'm doing the report - if it's just a case of >10 instead of the >=10 we have been assuming, David should be telling me...
Title: Re: dont_use_dcf active at SETI main
Post by: Jason G on June 21, 2012, 02:17:20 AM
Had a quick look at the server trunk code (at least up until the move to Git).

You'll need 10 results complete for the app, AND 10 consecutive valids as well, before it'll kick in credit / estimate scaling.

There are some 'interesting' kludges in the same scheduler file (credit.cpp) including one that sticks out as a bit weird... namely use of a random number generator to determine if a host is 'trusted' under certain cases of single replication (which we don't use at this stage) ... might look at this file a bit closer when I get time.
Title: Re: dont_use_dcf active at SETI main
Post by: William on June 21, 2012, 02:34:15 AM
I have ten valid and 13 consecutive valid :P

x41x - that tells you that 'valid' actually only counts the validations that have gone into APR (i.e. without the -9 in this case)
Title: Re: dont_use_dcf active at SETI main
Post by: Jason G on June 21, 2012, 02:41:10 AM
Correct, it only stores the calculated scaling if it all passes sanity checks & isn't regarded as a runtime outlier:
Quote
if (!r.runtime_outlier && is_pfc_sane(x, wu, app)) {
    avp->pfc_samples.push_back(x);
}

[Edit:] if looking at it with a view to optimisation, you'd probably not bother doing all the scaling factor calculation if you already know it's a runtime outlier, but hey, at least it's avoiding some unnecessary sanity checks for overflows with this.
Title: Re: dont_use_dcf active at SETI main
Post by: Jason G on June 21, 2012, 02:55:18 AM
Walked through again. Yeah, you'll need 10 pre-existing, and 10 or more consecutive valid (that went right through to APR).

So the scaling should start with the 11th consecutive valid that isn't a runtime outlier.
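
A sketch of that gating as read from this walk-through, not a copy of credit.cpp: the struct, field names, and constants are hypothetical, and the treatment of runtime outliers (counted as valid but kept out of the samples) is an assumption based on the posts above.

```cpp
// Hypothetical per-host, per-app-version bookkeeping; the field names
// are illustrative and not taken from the BOINC sources.
struct HostAppStats {
    int pfc_samples = 0;        // results that made it into the average
    int consecutive_valid = 0;  // current run of valid results
};

const int kMinSamples = 10;
const int kMinConsecutiveValid = 10;

// True if host-level scaling would apply to the NEXT result, given the
// history accumulated so far.
bool scaling_active(const HostAppStats& s) {
    return s.pfc_samples >= kMinSamples &&
           s.consecutive_valid >= kMinConsecutiveValid;
}

// Record one validated result. An invalid result resets the run; a
// runtime outlier extends the run but is kept out of the samples.
void record_result(HostAppStats& s, bool valid, bool runtime_outlier) {
    if (!valid) {
        s.consecutive_valid = 0;
        return;
    }
    ++s.consecutive_valid;
    if (!runtime_outlier) {
        ++s.pfc_samples;
    }
}
```

In this sketch the check is false after each of the first nine clean results and becomes true after the tenth, so the eleventh is the first to be scaled. A host with more than ten consecutive valids can still be waiting if some of them were runtime outliers.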

[Edit:]  Hmmm, I'll have to look at that runtime outlier code too, when I can.... my modded client 'lets go' of its own breed of suspected runtime outlier thing after about 5 consecutive 90% overestimates... If the server code doesn't let go there could be some stuck hosts that never scale, especially on major hardware + app upgrades... will have to walk through that logic to make sure it's got some sort of escape hatch as well (which can change the logic as to when scaling kicks in, or inhibit it completely .... i.e. won't necessarily start scaling on the 11th consecutive valid).
Title: Re: dont_use_dcf active at SETI main
Post by: Jason G on June 21, 2012, 04:02:19 AM
having just looked through the validator code,  the only time a runtime outlier is marked, is when "result_overflow" appears in the stderr text ... so watch out for those valid overflow results with truncated or missing stderr ... they'll scale APR.
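
A minimal illustration of why a stderr-only test is fragile, assuming the check amounts to a substring search. The function name and the sample stderr strings in the usage note are made up for this sketch and are not the validator's actual code.

```cpp
#include <string>

// Sketch of a stderr-only outlier test as described above: the flag is
// set only if the marker text appears in the returned stderr.
bool is_runtime_outlier(const std::string& stderr_text) {
    return stderr_text.find("result_overflow") != std::string::npos;
}
```

A complete stderr such as "message -9 result_overflow" is flagged, but the same overflow task returned with an empty stderr, or one cut off before the marker, is not, and so would be treated as a normal sample.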
Title: Re: dont_use_dcf active at SETI main
Post by: Josef W. Segur on June 22, 2012, 10:51:25 AM
Quote from: Jason G on June 21, 2012, 04:02:19 AM
having just looked through the validator code,  the only time a runtime outlier is marked, is when "result_overflow" appears in the stderr text ... so watch out for those valid overflow results with truncated or missing stderr ... they'll scale APR.

When the focus shifts to S@H v7 we could suggest Eric implement a check on the result file rather than totally relying on stderr. It wouldn't be particularly difficult, every task which exits at completion puts a best_gaussian in the result (though it may just have initialization values), but a result_overflow does not report any best_xxxxx signals. The best_autocorr, best_pulse or best_spike are not guaranteed to be present for a normal exit, though if not it would be strong evidence of a corrupted WU, and best_triplet is only there when there's a reportable triplet.
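
A hedged sketch of what such a result-file check might look like. It treats the result file as plain text and only looks for the name best_gaussian; the real file layout, and both function names, are assumptions for illustration.

```cpp
#include <string>

// Hypothetical result-file check following the suggestion above: a task
// that ran to completion always writes a best_gaussian entry, while an
// overflow exit reports no best_* signals at all.
bool result_looks_like_overflow(const std::string& result_text) {
    return result_text.find("best_gaussian") == std::string::npos;
}

// Combine the result-file check with the stderr marker, so that either
// signal alone is enough to treat the task as a runtime outlier.
bool treat_as_runtime_outlier(const std::string& stderr_text,
                              const std::string& result_text) {
    bool stderr_flag =
        stderr_text.find("result_overflow") != std::string::npos;
    return stderr_flag || result_looks_like_overflow(result_text);
}
```

Checked this way, an overflow task is still caught when its stderr comes back truncated or missing, because the result file carries the evidence independently.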

Another idea would be to get David to think again about adding code to make sure the stderr is properly handled. Last January one of the Milkyway@home devs submitted a patch, but was unable to convince David it was needed.
Joe
Title: Re: dont_use_dcf active at SETI main
Post by: Jason G on June 22, 2012, 06:39:37 PM
What I'll do at some point (pretty busy at this time with work) is ask for permission from Eric, then ask Claggy and Arkayn, or someone skilled like that, to help test some possible vulnerabilities in that mechanism.  That might help prompt a better look down the road as V7 & GBT get closer to coming online.

Jason
Title: Re: dont_use_dcf active at SETI main
Post by: William on August 14, 2012, 07:54:54 PM
It appears that the latest attempts to fix the scheduler on the VLAR to NV issue have resulted in 'dont_use_dcf' getting reactivated.
Title: Re: dont_use_dcf active at SETI main
Post by: Jason G on August 14, 2012, 09:06:05 PM
Quote from: Barry on August 14, 2012, 07:54:54 PM
It appears that the latest attempts to fix the scheduler on the VLAR to NV issue have resulted in 'dont_use_dcf' getting reactivated.

Which will be good for long-term crunchers indeed... but my intuition-based assessment suggests that 90% of users attach, experience problems, then abandon the work... so when I can I will reassess the move of modified Boinc from aDCFs to adaptive flops as Joe suggested a while back, embedding into newer builds, and in a form that could be suitable for stock Boinc... we'll see, needs another look after x41z is hammered flat, at least in third party form.  Probably won't be waiting for Beta project anymore, and will move on to quick XAK & XAP test builds on the weekends.