Re: 2008 SBS no longer boots



Hi Cliff,

Thanks again for the reply. Think I'll go with Les' advice to take it to
the other newsgroup.

Any thermal issues and a lot of other hardware problems should show up
in HP's system log (this is hardware based, not event viewer). There are
no errors there whatsoever.

If it was a voltage problem, that would explain the reboot etc., but not
the inability to boot. Especially since it's exactly the same error
in very early boot stages for the third time.

Reason I'm hunting for SATA first amongst other things is that MS
specifically said this error had to do with SATA mode and that they had
seen it before on this hardware. Then again, they also insisted on using
listsvc, fdisk /mbr and several other tools that simply don't exist any
more on 2008.

Happen to know a tool to determine driver load order by just having the
registry/files (obviously can't run it in windows)? Would be really
curious what would come after crcdisk.sys. Also don't get the safemode
output... if it only prints drivers that loaded successfully (thus after
loading) it's more useless than useful. I'd rather have it print what it
is *going* to load; that way it would be much easier to determine what
makes it crash.

Sysinternals has a tool called loadorder, but it's only for 32-bit and 2008
isn't on the list either.
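
A rough sketch of how the boot-start order could be worked out offline, assuming the Start, Group and Tag values of each service key, plus the ServiceGroupOrder list and the per-group tag lists from GroupOrderList, have already been read out of the offline SYSTEM hive and decoded (that step is not shown). The Driver type, the function name and the sample values are made up for illustration; the real loader has more special cases than this, so treat the result as an approximation.

```python
from typing import Dict, List, NamedTuple, Optional


class Driver(NamedTuple):
    name: str             # service key name, e.g. "crcdisk"
    start: int            # Start value; 0 means boot-start
    group: Optional[str]  # Group value, or None if the value is absent
    tag: Optional[int]    # Tag value, or None if the value is absent


def boot_load_order(
    drivers: List[Driver],
    group_order: List[str],
    tag_order: Dict[str, List[int]],
) -> List[str]:
    """Return boot-start driver names in approximate load order.

    Drivers are sorted by the position of their group in group_order,
    then by the position of their tag in that group's tag list.
    """
    group_rank = {g.lower(): i for i, g in enumerate(group_order)}
    tags_by_group = {g.lower(): tags for g, tags in tag_order.items()}

    def sort_key(d: Driver):
        group = (d.group or "").lower()
        # A missing or unlisted group sorts after every listed group.
        g_rank = group_rank.get(group, len(group_order))
        tags = tags_by_group.get(group, [])
        # A missing or unlisted tag sorts after the listed tags.
        t_rank = tags.index(d.tag) if d.tag in tags else len(tags)
        return (g_rank, t_rank, d.name.lower())

    boot = [d for d in drivers if d.start == 0]
    return [d.name for d in sorted(boot, key=sort_key)]


if __name__ == "__main__":
    # Illustrative values only, not taken from a real hive.
    sample = [
        Driver("crcdisk", 0, "Filter", None),
        Driver("disk", 0, "SCSI Class", None),
        Driver("atapi", 0, "SCSI Miniport", 25),
        Driver("cdrom", 1, "SCSI CDROM Class", None),
    ]
    order = ["SCSI Miniport", "SCSI Class", "Filter"]
    print(boot_load_order(sample, order, {"SCSI Miniport": [25]}))
```

Whatever sorts directly after crcdisk in such a list would be the first candidate to look at.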

Regards

Cliff Galiher wrote:
Back to inline (just easier to compose my thoughts that way)

-Cliff


"Freaky" <wontsay@xxxxxxxxxx> wrote in message
news:eHsWhuBuJHA.5836@xxxxxxxxxxxxxxxxxxxxxxx
Cliff Galiher wrote:
Short answer is if it rebooted with updates, but then failed over the
weekend, then it is *VERY very very very* (did I stress that enough?)
likely hardware.

Agreed, but some things just don't add up to it.

The fact that another server had the same problem? Coincidence.
Servers don't just reboot for no reason. Although I would've bet on a
driver issue initially, the fact that it seems to happen randomly now
has me leaning towards a physical failure. Some random thoughts, in no
particular order.

Hardware tests don't turn up anything, SBS 2008 CD boots fine, running
chkdsk from it turns up nothing, all disks/cd drives etc are recognized
by the boot CD. This issue is gone after reinstalling, which is very
weird with a hardware problem that consistently doesn't get through boot
with the current installation. After restoring system state with the
other customer, even a 2 week old one, the problem immediately returned.
So that server was reinstalled with a new domain etc. and is running for
about 3 months now. According to my colleagues it had updates installed,
but wasn't rebooted for quite a while, so I can't really say it's not
the updates, can't really say it is either.

Thermal and voltage related problems usually won't show up with a
chkdsk. An old-school practice that has fallen out of favor is burn-in
testing. There was a time when a server (even one that had been "burned
in" by the manufacturer) was fired up with a specialized OS (I still use
knoppix) and had a battery of tests run against each component to
purposefully stress it. If thermal or electrical problems exist, those
would usually cause enough problems to force a failure. Although it has
fallen out of practice these days because of the low failure rates
modern manufacturers release, I wouldn't count the process out. Low
failure is not 0% failure yet....

That is what I'd recommend now...



1) Is the server on a good battery backup? Dirty power coupled with a
poorly regulating power supply == random data corruption. A good UPS
can provide clean power to the server and resolve the issue.

Definitely, APC UPSes. Normally the Smart-UPS.

Good enough. Move on... :)

2) RAID controllers should have their own on-card battery. If power is
failing in the machine for any reason (overheating CPU, power supply,
etc) and the RAID is not able to flush its cache...game over. Corrupt
array and no amount of RAIDness will protect you.

The smartarray E200i's come with battery backed cache. They have 128M
cache. Any corruption should have shown up in chkdsk. Furthermore if
it's a driver that's damaged it would be very odd because drivers
usually don't have anything written to them and so they usually don't
become corrupt during power outages etc. These usually are the open
files like registries, AD database etc.

Again...good enough. Move on...


3) Does the server have thermal monitoring? You should know if a CPU,
chipset, or other component is running hot well before the thing
actually reboots. Sure, by default these errors write to the event log,
but since the OS gets corrupted, you can't *view* the event log. Your
next install should have some sort of alerting system (email to another
server, pager, cell phone) set up so you at least know what failures
occurred before the data corruption.

Yeah, the HP servers will shut down at certain temperatures. If the
server is cold there is the same issue though, once it occurs on an
installation it remains unbootable. If it were thermal then booting from
CD etc. should have issues too.

Without off-server monitoring, you can't guarantee what is happening.
How do you *know* the server was cold? All you've reported here is that
the server starts off okay and fails.....could be thermal. Without a
cooling solution, processors can go from room temperatures to
metal-fracturing temperatures in less than one second. If heat-sink
compound was not applied properly (happens even in the most stringent
quality-controlled environments, and HP is not stringently controlled)
or another flaw exists, it is not unreasonable for a part to fail rather
quickly while not actually observed.

Don't *guess* that this isn't the same issue. *KNOW* it. Monitor it.
Otherwise you are pursuing ghosts...
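
A minimal sketch of the kind of off-server alerting meant here, assuming temperature readings are already available from somewhere (how they are collected is not shown and depends on the hardware). The sensor names, limits, host name and addresses are hypothetical.

```python
from email.message import EmailMessage
from typing import Dict, List, Tuple

Hot = Tuple[str, float, float]  # (sensor, reading in C, limit in C)


def overheated(readings: Dict[str, float], limits: Dict[str, float]) -> List[Hot]:
    """Return every sensor whose reading is at or above its limit."""
    return [
        (sensor, temp, limits[sensor])
        for sensor, temp in sorted(readings.items())
        if sensor in limits and temp >= limits[sensor]
    ]


def build_alert(host: str, hot: List[Hot], sender: str, recipient: str) -> EmailMessage:
    """Build (but do not send) an alert mail listing the hot sensors."""
    msg = EmailMessage()
    msg["Subject"] = f"[{host}] {len(hot)} sensor(s) over temperature"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(
        "\n".join(f"{s}: {t:.1f} C (limit {l:.1f} C)" for s, t, l in hot)
    )
    return msg


if __name__ == "__main__":
    # Hypothetical readings and limits.
    hot = overheated({"cpu1": 82.0, "ambient": 24.0},
                     {"cpu1": 80.0, "ambient": 40.0})
    if hot:
        print(build_alert("sbs01", hot, "monitor@example.com", "admin@example.com"))
```

The message would then be handed to a mail server on a different machine (for instance with smtplib's SMTP.send_message), so the record survives even if the monitored server's own disk does not.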


4) Ultimately the data on the HD is getting corrupted, hence the boot
issue. Could be a failing HD. Could be a bad RAID controller. Or could
be both of these are fine, but another failure higher up (see points 1
through 3) is causing the problem. Regardless, focus on fixing the
hardware and the OS issue will resolve itself. If you focus on fixing
the OS through a registry fix, driver update, or rollback, then the
problem will only continue to re-occur. In other words, I agree with
you that your first obligation (a) is correct.

You have a very valid point here, and normally I'd agree with you in an
instance, but this just doesn't seem (purely) hardware based. It looks
more like some driver remembering some chip registers or something like
that causing the issue. Otherwise reinstalling the entire server and
having no issues until systemstate is restored simply doesn't make sense
either.

Except that you've already admitted that the system-state is not
retained long enough to verify that this is the case. If you could,
with certainty, restore the system-state from a time from *BEFORE* the
last system reboot then you could say this. But right now I see your
reboot policies and data retention policies exist in direct conflict.
Restoring a system-state with a corrupt registry provides no value if
the corruption was caused by one of the failures mentioned above...

5) Obligation (b) is unrealistic. You pointed out that you know how
things should be but sometimes customers don't have the money or you got
the customer in an already crappy situation and you have to make do.
So you justify not making local backups. But then you want a 5-minute
fix for what is obviously a hardware problem. Let me be clear; IT
DOESN'T EXIST. Sometimes there is no 5 minute fix PERIOD. Just like the
customer sometimes tells you that they won't spend money to do something
properly, sometimes you have to tell the customer that there is no quick
fix. That is the reality of the situation. Period. If everything
could be fixed in 5 minutes with a registry fix then nobody would make
backups. So you can't ask us to accept that you are dealing with a
customer who won't do the right thing, but then make the same demand of
us and expect us to do right by it. It will never work.

I don't really justify not having local backups. I advise all my
customers to use tape backups with specialized software for them. If
they want online I consider it extra (I don't like it at all).
Unfortunately the customer often disagrees or doesn't want to afford it.
For most it's very simple math. Online backup is very cheap, on-site
backup with rotation is much more expensive than me having to
troubleshoot say 4 hours more in case it does happen. These are usually
the really small customers, the larger ones have a larger impact on
employees not being able to do anything and tend to buy these things.

Hard drives are cheap. USB hard drives, nominally more expensive. If a
terabyte hard drive is more expensive than 4 hrs of your labor then you
aren't charging enough. I know that doesn't help you...but it does mean
your customers are taking advantage of you. And they probably already
know it, which is why they are putting up with these outages when they
could've gone to another contractor by now. Your rates are *way* below
industry standard....


And to be clear, I agree with one of your sentiments. I am not trying
to be short, rude, or argumentative here. It is not my intention to
offend or be harsh with the observations above, but sometimes the facts
do not line up with the goals, and the goals need to be adjusted. This
is clearly one of those times.

Not quite about to give up yet. Troubleshooting much longer isn't really
financially viable either. It is what I like most though :).

I love troubleshooting too... It is why I let my junior techs handle
the mundane stuff. It is nice being the boss.....

We however
do have the risk of it returning, especially if it does turn out to be
software.

It isn't. I will put money down NOW. I'm not there. I don't know all
the facts. But SBS (as a product) is solid and I've installed
everything from BackOffice 4 to SBS 2008. *DRIVERS* are technically
software, but since they tie the hardware to the OS, they don't count.
Especially if you actually were sticking to WHQL drivers and not
installing after-market, which it sounds like you did avoid during the
latest install. It is a HW problem...through and through.

It can take months for the issue to reappear and we're about
to setup the same ML350G5/2008 SBS at some larger customers.

Perhaps I misinterpreted your posts, but you made it sound like it
reappears quickly. A problem spanning *months* is rarely SBS...but can
be tracked to anything from failing RAM to a third-party app. I
re-read your previous posts after reading this...and everything you've
indicated (until now) is that this develops rather quickly. Changing
the symptoms will obviously change the diagnosis.


We really
don't want to run into the issue there (or anywhere else for that
matter, unfortunately fact is it's better to lose a small customer than
a large customer, albeit it's better to lose neither :D).

Nobody likes to lose customers.

I'd at least
need some certainty it won't return, is software based or we should look
for other hardware.

There is never certainty in this business. But if one specific
configuration gives you problems then it is time to give it away. SBS
2k8 went through a thorough beta and release candidate period. It is
deployed (by now) in many places, and this is the first I've heard of any
consistently repeatable problem. Without more details, I lean towards
hardware. *MAYBE* it is a specific HP driver. Either way, if it is
repeatedly happening on more than one site, then it is time to change
things up.

We try to have the same hardware at customers as
much as possible, not only because it eases work, but it makes it
affordable to have some spare parts for it at our office etc. too.

Not an uncommon practice. But if you stick to name-brand vendors (HP,
Dell, etc) then keeping parts at your office becomes moot. They can
drop-ship parts faster than you can schedule a visit in most cases.
Having parts "on hand" is an advertising slogan that I've only heard
from white-box vendors that know they had a reputation they needed to
fix...


Looking up the error code it seems to be INACCESSIBLE_BOOT_DEVICE,
however, this doesn't show in the blue screen. It seems to be often
related to the SATA mode (mainly vista posts about this). This is also
what the Microsoft engineer told us, who specifically said SATA had to
be in AHCI mode. Which I don't think it is.

An inaccessible boot device is a failure during kernel startup. Not
specific to Vista or SATA. It just happens to be that 90% of computer
sales are desktops, and 99% of those have sata drives....so if a kernel
failure happens, that is the reported error. Don't make assumptions.


If I unhook the CD player, which is on the only SATA port on the
mainboard (which is also oddly called SATA Optical Port by HP
description) the boot progresses a bit further than with it connected.
It however does crash with the same error.

Again, you are pursuing a SATA path based on pretty thin circumstantial
evidence. And the fact that it still crashes shows that it was wrong.
It is crashing at some point during a kernel-driver load...and since
windows boot logging only displays *successful* loads, viewing the load
process doesn't illuminate much. You have a driver doing something bad,
and that is caused because you have hardware doing something bad. Maybe
the driver is corrupt, or maybe the hardware is...I don't know. I'm not
there. I get paid to be there......you have to use the tools at YOUR
disposal.
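
For what the boot log does offer, here is a small sketch that pulls the last successfully loaded driver out of the text of %SystemRoot%\ntbtlog.txt. It assumes the log was written at all before the crash, that the file has already been read and decoded into a string (it is typically stored as UTF-16), and that entries use the usual "Loaded driver" / "Did not load driver" prefixes. The function names are illustrative.

```python
from typing import List, Optional, Tuple

LOADED = "Loaded driver "
NOT_LOADED = "Did not load driver "


def parse_boot_log(text: str) -> Tuple[List[str], List[str]]:
    """Split boot-log text into (loaded, not_loaded) driver paths, in file order."""
    loaded: List[str] = []
    not_loaded: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(LOADED):
            loaded.append(line[len(LOADED):])
        elif line.startswith(NOT_LOADED):
            not_loaded.append(line[len(NOT_LOADED):])
    return loaded, not_loaded


def last_loaded(text: str) -> Optional[str]:
    """Return the last driver recorded as loaded, or None if there is none."""
    loaded, _ = parse_boot_log(text)
    return loaded[-1] if loaded else None
```

The last entry only narrows things down: the driver that actually brings the boot to a halt is, by definition, the one that never gets logged.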


Flushed/reset BIOS settings, tried an older BIOS version, tried
disabling IDE/SATA but this isn't possible (verified by HP). Looked up
if I could find the SATA driver on the other customers server. However,
all I see is 2 IDE controllers. According to device management the CD
player should be connected to one of the IDE controllers. I have the
exact driver names for these at home (at another customer now) but I
recall atapi.sys, some other one with ata in it, pciidx.sys and another
one starting with pci. Renamed these drivers to .sys.org. Server will
not boot then and says it's missing the driver atapi.sys. So disabled
them in registry by setting their services' start registry value to 0x4.
Same BSOD. Odd thing is, if I disconnect the DVD it still boots a bit
further, which I find odd as with removing the drivers I'd assume it's
no longer talking with the DVD at all. So I seem to be missing at least
a part of the SATA driver. Perhaps this is because it's probably in IDE
mode. It's also the last thing I'd want to know before giving up if it
isn't the issue.
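
For reference, the meaning of the Start value mentioned above (0x4 = disabled) can be written down as a small lookup. The value meanings are the standard Windows service start types; the helper names and the sample service names and values are only illustrative.

```python
from typing import Dict, List

# Standard Windows service/driver start types (the "Start" registry value).
START_TYPES = {
    0: "boot",       # loaded by the boot loader
    1: "system",     # loaded during kernel initialization
    2: "automatic",  # started automatically at system startup
    3: "manual",     # started on demand
    4: "disabled",   # never started
}


def describe_start(value: int) -> str:
    """Translate a Start value into a readable name."""
    return START_TYPES.get(value, f"unknown ({value:#x})")


def boot_start_services(services: Dict[str, int]) -> List[str]:
    """Names of services still marked boot-start (Start == 0), sorted."""
    return sorted(name for name, start in services.items() if start == 0)


if __name__ == "__main__":
    # Hypothetical state after disabling the IDE-related drivers.
    state = {"atapi": 4, "pciide": 4, "disk": 0, "crcdisk": 0}
    for name, start in sorted(state.items()):
        print(name, describe_start(start))
    print(boot_start_services(state))
```

Listing what is still boot-start after the change is a quick way to see which storage-related drivers remain in play.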

Still chasing SATA...still wrong...


Another thing that doesn't add up to corruption btw, is that
virtualizing the server from a bootable CD works fine. The resulting
guest has no issues whatsoever, albeit it obviously does find new
hardware amongst which there are no SATA controllers.

I will, again, refer to my comment above, that you indicated that the
problem "may not turn up for months." Unless you've been virtualizing
SBS2k8 for longer than its release (I envision bad Back To The Future
references here) you can't say that virtualizing works fine. Either the
problems occur quickly or they don't. There is also the fact that,
ultimately, a virtual machine (even Hyper-V) is a file (just a really
big .vhd file), so a driver corruption that would occur on a *real*
machine would not happen on a .vhd because the .vhd file would never be
committed and the RAID array would correct. That is some pretty
low-level file-write talk to explain it...but if you want to go there,
I'd be happy to privately explain it.


Probably will give up soon, not worth the time, albeit that will have to
be seen if the error returns, at whatever customer. I'm also not looking
for a 5 minute fix now, I'm looking for the answer if this is a software
based issue so I can fix it in 5, 10 maybe 30 minutes *next* time.

Onsite backups. With full rotation. WORST CASE. Even better if you
find the bad hardware/driver and fix it (of course I know you already
want that...but still...worth stating...)

If it
even returns if we know what's causing it. Still thinking it's a driver
problem, the question is just which one and why after so long (and then
very consistently thus, we do reinstall all the same (latest on HP
site/windowsupdate) after reinstallation) so one would expect the issue
to be back immediately but it doesn't (but it thus does with restoring
systemstate which should have all the same drivers).

A driver issue would show on reboot. If you are *SURE* that the server
is being rebooted after essential updates then I don't think it is a
driver.

As you said...it just doesn't add up...



Thanks a bunch :)



