Re: 2008 SBS no longer boots
- From: Freaky <wontsay@xxxxxxxxxx>
- Date: Wed, 08 Apr 2009 09:28:34 +0200
Cliff Galiher wrote:
Short answer is if it rebooted with updates, but then failed over the
weekend, then it is *VERY very very very* (did I stress that enough?)
likely hardware.
Agreed, but some things just don't add up.
The fact that another server had the same problem? Coincidence.
Servers don't just reboot for no reason. Although I would've bet on a
driver issue initially, the fact that it seems to happen randomly now
has me leaning towards a physical failure. Some random thoughts, in no
particular order.
Hardware tests don't turn up anything, the SBS 2008 CD boots fine, running
chkdsk from it turns up nothing, and all disks, CD drives, etc. are
recognized by the boot CD. The issue is gone after reinstalling, which is
very odd for a hardware problem that consistently fails to get through
boot with the current installation. At the other customer, the problem
returned immediately after restoring system state, even from a 2-week-old
backup. So that server was reinstalled with a new domain etc. and has been
running for about 3 months now. According to my colleagues it had updates
installed but hadn't been rebooted for quite a while, so I can't really
say it's not the updates, and can't really say it is either.
1) Is the server on a good battery backup? Dirty power coupled with a
poorly regulating power supply == random data corruption. A good UPS
can provide clean power to the server and resolve the issue.
Definitely, APC UPSes. Normally the Smart-UPS line.
2) RAID controllers should have their own on-card battery. If power is
failing in the machine for any reason (overheating CPU, power supply,
etc) and the RAID is not able to flush its cache...game over. Corrupt
array and no amount of RAIDness will protect you.
The Smart Array E200i controllers come with battery-backed cache (128 MB).
Any corruption should have shown up in chkdsk. Furthermore, if it's a
driver that's damaged, that would be very odd, because drivers usually
don't have anything written to them and so usually don't become corrupt
during power outages etc. What usually gets corrupted are the open files,
like the registry hives, the AD database, etc.
3) Does the server have thermal monitoring? You should know if a CPU,
chipset, or other component is running hot well before the thing
actually reboots. Sure, by default these errors write to the event log,
but since the OS gets corrupted, you can't *view* the event log. Your
next install should have some sort of alerting system (email to another
server, pager, cell phone) set up so you at least know what failures
occurred before the data corruption.
Yes, the HP servers will shut down at certain temperatures. The issue is
the same when the server is cold, though: once it occurs on an
installation, that installation remains unbootable. If it were thermal,
then booting from CD etc. should have issues too.
4) Ultimately the data on the HD is getting corrupted, hence the boot
issue. Could be a failing HD. Could be a bad RAID controller. Or could
be both of these are fine, but another failure higher up (see points 1
through 3) is causing the problem. Regardless, focus on fixing the
hardware and the OS issue will resolve itself. If you focus on fixing
the OS through a registry fix, driver update, or rollback, then the
problem will only continue to re-occur. In other words, I agree with
you that your first obligation (a) is correct.
You have a very valid point here, and normally I'd agree with you in an
instant, but this just doesn't seem (purely) hardware based. It looks
more like some driver remembering some chip register state or something
like that is causing the issue. Otherwise, reinstalling the entire server
and having no issues until system state is restored simply doesn't make
sense either.
5) Obligation (b) is unrealistic. You pointed out that you know how
things should be but sometimes customers don't have the money or you got
the customer in an already crappy situation and you have to make do.
So you justify not making local backups. But then you want a 5-minute
fix for what is obviously a hardware problem. Let me be clear; IT
DOESN'T EXIST. Sometimes there is no 5 minute fix PERIOD. Just like the
customer sometimes tells you that they won't spend money to do something
properly, sometimes you have to tell the customer that there is no quick
fix. That is the reality of the situation. Period. If everything
could be fixed in 5 minutes with a registry fix then nobody would make
backups. So you can't ask us to accept that you are dealing with a
customer who won't do the right thing, but then make the same demand of
us and expect us to do right by it. It will never work.
I don't really justify not having local backups. I advise all my
customers to use tape backups with specialized backup software. If they
want online backup, I consider it an extra (I don't like it at all).
Unfortunately the customer often disagrees or doesn't want to pay for it.
For most it's very simple math: online backup is very cheap, and on-site
backup with tape rotation is much more expensive than having me
troubleshoot, say, 4 hours longer in case something does happen. These
are usually the really small customers; for the larger ones, employees
not being able to work has a bigger impact, so they tend to buy these
things.
And to be clear, I agree with one of your sentiments. I am not trying
to be short, rude, or argumentative here. It is not my intention to
offend or be harsh with the observations above, but sometimes the facts
do not line up with the goals, and the goals need to be adjusted. This
is clearly one of those times.
Not quite ready to give up yet, although troubleshooting much longer
isn't really financially viable either. It is what I like most though :).
We do, however, run the risk of it returning, especially if it does turn
out to be software. It can take months for the issue to reappear, and
we're about to set up the same ML350 G5 / SBS 2008 combination at some
larger customers. We really don't want to run into the issue there (or
anywhere else for that matter; unfortunately, the fact is it's better to
lose a small customer than a large one, although it's better to lose
neither :D). I'd at least need some certainty that it won't return, or
to know whether it's software based or whether we should look for other
hardware. We try to have the same hardware at customers as much as
possible, not only because it eases the work, but also because it makes
it affordable to keep some spare parts for it at our office.
Looking up the error code, it seems to be INACCESSIBLE_BOOT_DEVICE,
although that name doesn't show on the blue screen itself. It seems to be
often related to the SATA mode (mainly Vista posts about this). This is
also what the Microsoft engineer told us, who specifically said SATA had
to be in AHCI mode. Which I don't think it is.
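
For anyone repeating the lookup: a minimal sketch of translating a stop
code into its symbolic name. The table is a small hand-picked subset of
documented Windows bug check codes, and the function name is made up for
illustration. The exact code this server showed is not quoted in this
post; 0x0000007B is simply the documented value for
INACCESSIBLE_BOOT_DEVICE.

```python
# Small, hand-picked subset of documented Windows bug check codes.
# Illustrative only: this is not a complete table.
BUGCHECK_NAMES = {
    0x00000024: "NTFS_FILE_SYSTEM",
    0x0000007A: "KERNEL_DATA_INPAGE_ERROR",
    0x0000007B: "INACCESSIBLE_BOOT_DEVICE",
    0x000000ED: "UNMOUNTABLE_BOOT_VOLUME",
}


def bugcheck_name(stop_code):
    """Translate a stop code as printed on the blue screen, e.g. '0x0000007B'.

    Hypothetical helper: accepts hex text with or without the '0x' prefix.
    """
    value = int(stop_code, 16)
    return BUGCHECK_NAMES.get(value, "UNKNOWN (0x%08X)" % value)


if __name__ == "__main__":
    print(bugcheck_name("0x0000007B"))
```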
If I unplug the CD drive, which is on the only SATA port on the
mainboard (oddly called "SATA Optical Port" in HP's description), the
boot progresses a bit further than with it connected. It still crashes
with the same error, however.
Reset the BIOS settings to defaults, tried an older BIOS version, and
tried disabling IDE/SATA, but that isn't possible (verified by HP). I
looked for the SATA driver on the other customer's server, but all I see
there is 2 IDE controllers. According to Device Manager the CD drive
should be connected to one of the IDE controllers. I have the exact
driver names at home (I'm at another customer now), but I recall
atapi.sys, another one with "ata" in it, pciidex.sys and another one
starting with "pci". I renamed these drivers to .sys.org. The server
won't boot then and says it's missing the driver atapi.sys. So instead I
disabled them in the registry by setting their services' Start value to
0x4. Same BSOD. The odd thing is that if I disconnect the DVD drive it
still boots a bit further; with the drivers disabled I'd assume it's no
longer talking to the DVD drive at all. So I seem to be missing at least
part of the SATA driver. Perhaps this is because the controller is
probably in IDE mode. It's also the last thing I want to rule out before
giving up.
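
To make the Start values concrete, here is a rough sketch of how they
decode and how one might flag storage drivers that are not set to load
at boot. The Start meanings (0 to 4) are the standard Windows service
start types. The watch list and function names are assumptions for
illustration only, and which drivers actually matter depends on whether
the controller runs in IDE or AHCI mode. The input is a plain dict, for
example values copied by hand from an offline registry hive.

```python
# Meaning of the "Start" value under
# HKLM\SYSTEM\CurrentControlSet\Services\<name>
START_TYPES = {
    0: "boot",
    1: "system",
    2: "automatic",
    3: "manual",
    4: "disabled",
}

# Hypothetical watch list of storage drivers that usually have to load
# at boot time. Adjust to the hardware in question.
BOOT_STORAGE_SERVICES = ("atapi", "pciide", "intelide", "msahci")


def describe(start_values):
    """Map {service: Start value} to {service: readable start type}."""
    return {name: START_TYPES.get(value, "unknown")
            for name, value in start_values.items()}


def not_boot_start(start_values):
    """Return watched services present in the input whose Start is not 0."""
    return sorted(name for name in BOOT_STORAGE_SERVICES
                  if name in start_values and start_values[name] != 0)


if __name__ == "__main__":
    # Made-up sample values, not taken from the failing server.
    sample = {"atapi": 4, "pciide": 0, "msahci": 3}
    print(describe(sample))
    print(not_boot_start(sample))
```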
Another thing that doesn't fit with corruption, by the way, is that
virtualizing the server from a bootable CD works fine. The resulting
guest has no issues whatsoever, although it obviously does find new
hardware, among which there are no SATA controllers.
I'll probably give up soon since it's not worth the time, although that
will have to be reconsidered if the error returns, at whatever customer.
I'm also not looking for a 5-minute fix now. I'm looking for an answer to
whether this is a software issue, so I can fix it in 5, 10, maybe 30
minutes *next* time, if it even returns once we know what's causing it.
I still think it's a driver problem; the question is just which one, and
why only after so long and then so consistently. After a reinstall we
install all the same drivers (the latest from the HP site and Windows
Update), so one would expect the issue to be back immediately, but it
isn't. It does come back after restoring system state, which should
contain all the same drivers.
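
One way to narrow this down would be to compare the driver service
settings of the fresh install against those after the system state
restore. The sketch below assumes both snapshots have already been
collected into dicts of the form {service name: (image path, Start
value)}; that format, the function name and the sample data are all made
up for illustration.

```python
def diff_driver_settings(fresh, restored):
    """Return {service: (fresh_entry, restored_entry)} for every difference.

    An entry is None when the service is missing from that snapshot.
    """
    changes = {}
    for name in sorted(set(fresh) | set(restored)):
        before = fresh.get(name)
        after = restored.get(name)
        if before != after:
            changes[name] = (before, after)
    return changes


if __name__ == "__main__":
    # Made-up sample data, not taken from the failing server.
    fresh = {"atapi": ("drivers\\atapi.sys", 0),
             "msahci": ("drivers\\msahci.sys", 3)}
    restored = {"atapi": ("drivers\\atapi.sys", 0),
                "msahci": ("drivers\\msahci.sys", 4)}
    for name, (before, after) in diff_driver_settings(fresh, restored).items():
        print(name, before, "->", after)
```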
Thanks a bunch :)