Re: Replace motherboard ?



On Fri, 15 Sep 2006 01:39:50 +0100, "RJK"
"cquirke (MVP Windows shell/user)"
On Mon, 11 Sep 2006 20:45:57 +0100, "RJK"

On the odd occasion one assembles a system box with serious TLC, hardware
and bios are correctly configured and usually Windows flies in without a
hitch. ...perhaps drivers go in beautifully - then several apps. ...and
then BANG something is unhappy with something else, one uninstalls that
last app. or driver and drops back to the restore point made before whatever
it was that started playing up, that usually fixes it but, again - on the odd
occasion - things get worse !

...depends what the cause is ... more besides apps and drivers.

...agreed :-)

...I should have said "...eventually proven robust hardware."

I always re-test hardware every time a PC comes back with nebulous
problems, in case it's a rare intermittent fault that I've missed so
far, or it's a subtle problem that's progressed enough to detect.

I supplied a laptop once that was always flaky but always passed all
tests, including a parallel OS install to exclude missed issues within
the software installation. Always passed all tests, until a few
months down the line, when the same old tests lit up bad RAM. Always
test 5 on the MemTest of the time; all else OK.

RAM swap no joy, so they swapped for a more upmarket brand of RAM that
might be more compatible (it's a laptop, so one can't avoid the risk
of proprietary BS quite as easily). No joy, so I suggested the
possibility of bad RAM within the CPU cache, and we swapped that.
Still no joy, so the suppliers swapped the mobo. Fine ever since.

I later saw a completely unrelated desktop PC do the same thing;
always test 5, prompt failure. Also went away when mobo was replaced.
Common to both; SiS 650 chipset. I'm (even) less keen on SiS, now.

...Malware ? ...personally, I never knowingly install it ! ...and there's
also lots of very poorly written application software that should be
labelled "malware" in my view !!

Agreed - sware vendor behavior is bounded purely by the rules of the
market in which they operate (so set rules wisely, and police with a
big stick). The business case for sware was proven back in the DOS
days, and the business case for malware was proven quite a few years
later. Programmers in both fields now have plenty of opportunities
for contract, hire or employment; few do either just for fun.

Every inexplicable PC problem requires the exclusion of malware, IMO.
That's what the "just wipe and re-install" bozos just don't get; it's
not just the "known infected" systems that require malware exclusion,
so there *has* to be a way to attain this.

To paraphrase an ad: "You can't take every cough to the abattoir"

...in addition, a year or so ago I did discover a Trojan that tried to
install itself along with Adobe Photoshop 7.0.1.

Hmm... warez? The tradition of bundling trojans with warez goes all
the way back to one of the very first viruses ever known.

...if memory serves, it was attached to some crappy 3rd party program on the
Adobe cd that tried to monitor installation of Adobe and pre-detect dll
incompatibilities (DLL hell)

Yukk

That's the OEM approach, where they're either dealing with virgin
systems, or they simply don't care about collateral damage.
When you work in a client-centric manner, you'd never do this.
Instead, you'd verify each layer of abstraction before standing on it
to reach the next, like this... ***

That's a humdinger of a paragraph ! ..and I may study it some more !
...largely agreed ...though I would suspect that MS would be concerned about
any problems with their "hidden" OS algorithms which could reveal their very
presence:-

Hmm... I'm not too sure whether there's much fire under that smoke.

My view is that after a quick or full format, NOTHING should interfere with
OS/GUI installation, and nothing should be going on in addition to OS/GUI
installation.

This used to be what I'd test in the form of a parallel install.
Those circumstances seem rarer in the XP age, possibly because it's a
more solid OS (after all, it's not obliged to compromise stability for
compatibility with DOS or Win3.yuk apps).

I didn't often have to do it with Win9x either, but often enough to
semi-automate the process of setting up the parallel installs,
switching between them, and cross-updating or cross-comparing registry
and file content between them.

I'd either use the fresh install as an "organ donor" for existing
install (e.g. Vmm32.vxd) or cross-build the existing install over the
new one, manually, piece by piece.

The usual context for this was stability issues following a
poorly-logged cleanup of certain intra-file code infectors
(PrettyPark, Magistr, some others) that can leave the code files
clean, but damaged. I wrote a less-drastic tool to extract and
compare files according to a list entered via CLI parameters, but that
depended on the list of "fixed" files being known.

(Someone will blame this on a speck of dust off the cd ! ).

Generic way to test that...
- copy \i386 to two locations on HD
- FC the file sets against each other
- FC against the CD
- repeat and cross-test using different drive
- repeat and cross-test using different disk of same content
- repeat and cross-test using different PC (on-the-fly infection?)
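
For anyone without FC to hand, the "compare one file set against another"
step can be sketched in Python. This is only a rough stand-in, not what FC
does internally; compare_trees and the example paths are made-up names,
and shallow=False is there so that contents get compared rather than
just sizes and date stamps.

```python
import filecmp
import os


def compare_trees(left, right):
    """Compare two directory trees by file content.

    Returns three sorted lists of relative paths:
    files only under left, files only under right, and
    files present in both whose contents differ.
    """
    def relative_files(root):
        found = set()
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                full = os.path.join(dirpath, name)
                found.add(os.path.relpath(full, root))
        return found

    left_files = relative_files(left)
    right_files = relative_files(right)
    common = sorted(left_files & right_files)

    # shallow=False makes filecmp read the bytes instead of
    # trusting matching size and modification time.
    different = [
        rel for rel in common
        if not filecmp.cmp(os.path.join(left, rel),
                           os.path.join(right, rel),
                           shallow=False)
    ]
    return (sorted(left_files - right_files),
            sorted(right_files - left_files),
            different)
```

Something like compare_trees(r"C:\test1\i386", r"D:\i386") would cover the
HD-vs-CD step (both paths hypothetical); repeating it with the other copy,
drive or PC gives the cross-tests.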

...haven't used FC.exe or COMP.exe for years, and when I did everything
compared was always the same ...differences with the /b switch / binary
compares got me thinking years ago for a few minutes, til I twigged there
was a difference with file date/time stamps !

FC can be the bottom line when deep tshooting "real bustard" problems.
I pinned down errors between loose HDs this way, by FC'ing 500M of
transferred data and detecting a single 32-bit sequence that was
wrong. Problem went away as soon as both HDs were in shell to chassis
contact; came back when this was broken (A/B/A repro).
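
To show what pinning down a single wrong 32-bit sequence looks like,
here is a minimal Python sketch that reports the byte offsets where two
copies of a file disagree. It is not how FC /B works internally;
diff_offsets, chunk_size and limit are hypothetical names chosen for the
example.

```python
def diff_offsets(path_a, path_b, chunk_size=1 << 20, limit=64):
    """Return up to `limit` byte offsets at which two files differ.

    If one file is shorter, every offset past its end counts
    as a difference.
    """
    offsets = []
    position = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while len(offsets) < limit:
            block_a = fa.read(chunk_size)
            block_b = fb.read(chunk_size)
            if not block_a and not block_b:
                break  # both files exhausted
            if block_a != block_b:
                longest = max(len(block_a), len(block_b))
                for i in range(longest):
                    byte_a = block_a[i] if i < len(block_a) else None
                    byte_b = block_b[i] if i < len(block_b) else None
                    if byte_a != byte_b:
                        offsets.append(position + i)
                        if len(offsets) >= limit:
                            break
            position += max(len(block_a), len(block_b))
    return offsets
```

Four consecutive offsets in the result would be one corrupted 32-bit
word; rerunning after changing one thing at a time (grounding, cable,
drive) gives the A/B/A repro.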

Makes one wonder about all-plastic hard drive trays.

...having said that, I did pin down a cd-reader drive, years ago, that was
suffering read errors way in excess of the manufacturer's "several in a
million years," or whatever it is they specify ! :-)

Yep. This may be more of an issue with DVDs, in that AFAIK they lack
some of the robust error-detection that's built into CD spec.

Zero fill the hard-disk for half a day ! and all is again lovely and
Windows will fly in beautifully !
:-)

Erm... I'd rather identify and replace bad hardware or detect and kill
the malware, tho I can see how caressing a dying HD may cause the HD's
firmware to relocate failing sectors so the HD "works" again ;-)

How did "bad hardware" and a "dying hd" get in here ?

Everything conspires to hide the dying of hard drives.


At the OS level, Win9x became more and more aggressive about "fixing"
errors automatically, when Scandisk ran automatically after a bad exit
or detected HD failures. Win98 defaulted to fixing with no prompt,
and WinME did away with the ability to fine-control this.

XP doesn't even offer interactive control - the brain-dead DOS-era
ChkDsk has only "fix nothing" or "fix everything without asking"
modes, and AutoChk forces you to accept automatic fixing of
everything. In Vista, the whole process is hidden completely.

Another change came with NTFS, which now checks for bad sectors and,
if found, they are silently "fixed" on the fly. Where do all these
fixes get logged? Deep in the jungle of Event Viewer, under something
seemingly unrelated like "Winlogon" or something. Really SEPTIC.


At the HD level, HD vendors started adding logic to the drive's
firmware to remap failing sectors on the fly. If the data couldn't be
recovered, well... <shrug> ...just write 512 bytes of whatever was in
the buffer. So folks started noticing silent disappearance of data,
and I saw this myself when copying files from sick to good HD.

Normally, I check the total byte size and file count on source and
destination after a recovery transfer. If the destination has less
material, then there's something that got missed. But in this case,
it was the source drive that had less and less material in the same
selected subtree, and I figured it was lost contents from failing
sectors within the directory's clusters. Subsequent DiskEdit found
full-sector slabs of zeros in the middle of the dir concerned. No
error messages or alerts whatsoever.
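
The file count and byte total check is easy to script. A minimal Python
sketch, assuming a plain directory walk is good enough (tree_totals is a
made-up name, and anything the OS won't let you read will raise an
error rather than be skipped):

```python
import os


def tree_totals(root):
    """Return (file_count, total_bytes) for all files under root."""
    count = 0
    total = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            count += 1
            total += os.path.getsize(os.path.join(dirpath, name))
    return count, total
```

Run it on source and destination after the copy and compare the two
results; run it on the source again later, and a total that has shrunk
on its own points to the kind of silent loss described above.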

SMART gives a window into this process, and may have been in response
to OEMs kicking the HD vendors' asses. SMART has been around for ages
now, but Windows STILL has no UI to it whatsoever, and many BIOSs
default to switching it off.

HD vendors' SMART tools generally give you a simple "OK" or "Fail".
Deeper in, you find there are raw event counters that can clock
thousands of events before the data counter decrements by a single
click. Between 100 and 255 such clicks may be accepted as "OK" before
the threshold is reached and the HD finally concedes all is not well.
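
A toy model of that, in Python, to show why "OK" can hide a lot. It
assumes the usual SMART convention that an attribute only counts as
failed once its normalized value drops to or below the vendor's
threshold; the class, its field names and the numbers in the usage note
are invented for illustration and are not read from any real drive.

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    name: str
    raw: int        # raw event counter (e.g. sectors remapped so far)
    value: int      # normalized value; counts down as things get worse
    threshold: int  # vendor-chosen failure threshold

    def verdict(self):
        """The simple answer a vendor tool would show."""
        return "Fail" if self.value <= self.threshold else "OK"

    def margin(self):
        """How many 'clicks' remain before the verdict flips."""
        return self.value - self.threshold
```

For example, SmartAttribute("Reallocated_Sector_Ct", raw=1800, value=99,
threshold=36) still reports "OK" with 63 clicks of margin, even though
the raw counter shows 1800 events. That is why the raw detail is worth
reading rather than the summary.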


Then there's the question of warranty fulfilment. HD vendors swim in
unison, like a cartel of sharks; when Seagate dropped warranty from
the industry-standard 3 years to 1 year, most vendors followed suit
within the month. Now warranties are back to 3 years, or in "butter
wouldn't melt in my mouse" Seagate, 5 years.

Seagate were the first to replace ill HDs with "refurbished" drives
rather than new ones, way back when storage was measured in Megs.
That's common practice, tho the terminology varies from
"remanufactured" to "reconditioned" to "recertified".

I don't see dozens of engineers putting new heads and platters into
old HD casings in some huge clean room, do you? No; I suspect all
they do is wind back the SMART counters, test it a bit, and ship it
out again. The replacement HD you get may well be the sick HD I had
replaced a few months ago, with fresh lipstick on the corpse.


That's how "bad hardware" and a "dying hd" can get in to just about
any "why is the PC not working properly?" scenario.

...at outset perhaps I should have said, "...one assembles a NEW
system box with TLC ...one that proves over years to be hardware
robust and problem free.

I have had DoA, first-week failures, as well as the expected in-life
failures. The way I build PCs takes about 3-4 days clock time and
about 1 full day interactive, so most DoA stuff gets caught before
shipping, but not all. I don't bother to start the build process
until the rig has passed 12 hours in MemTest plus HD Tune surface scan
and SMART detail verification. I do it that way because I've found it
cheaper for me, in terms of not wasting time building in soft mud
:-)

Ideally, manufacturers want a bath-shaped failure-over-time curve;
failed manufacture failing early during pre-shipping testing, then a
low rate of failures, then stuff fails when it's "too old" anyway.

This is generally how electronic parts fail (except when bad caps
failed way earlier than planned). But mechanical parts (as well as
monitors, which have some pseudo-mechanical characteristics such as
hot/cold cycles and the pull of magnetic field changes) usually have a
tick-shaped curve; initial early failures, then a baseline rate of
failure that starts low but steadily increases over time and with use,
possibly accelerated by heat (ideally, keep HD < 40C).
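
The two curve shapes can be sketched numerically. The Python below is
only an illustration, assuming Weibull hazard rates added together (a
common textbook way to draw such curves); every shape, scale and
baseline number is invented, and the time unit is arbitrary.

```python
def weibull_hazard(t, shape, scale):
    """Weibull failure rate at time t > 0.

    shape < 1 gives a falling rate (early failures),
    shape > 1 gives a rising rate (wear-out).
    """
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t):
    """Early failures + flat baseline + late wear-out (electronics)."""
    early = weibull_hazard(t, shape=0.5, scale=1.0)
    baseline = 0.02
    wearout = weibull_hazard(t, shape=5.0, scale=8.0)
    return early + baseline + wearout


def tick_hazard(t):
    """Early failures + steadily rising wear (mechanical parts)."""
    early = weibull_hazard(t, shape=0.5, scale=1.0)
    wear = weibull_hazard(t, shape=2.0, scale=10.0)
    return early + wear
```

Printing either function over a range of t shows the shape: the bathtub
drops, stays low, then climbs steeply at the end, while the tick drops
briefly and then climbs more or less linearly from there.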

Handling and electrical damage can poke spikes through those smooth
failure curves, and I cringe when I see despatch counters handling RAM
and other electronic parts with bare hands, not even trying to avoid
touching the shiny bits. It's like watching a fast-food operator
picking his nose and then stirring your coffee with the same finger!



------------ ----- --- -- - - - -
Drugs are usually safe. Inject? (Y/n)
------------ ----- --- -- - - - -
.

