Re: 2008 SBS no longer boots
- From: "Cliff Galiher" <cgaliher@xxxxxxxxx>
- Date: Wed, 8 Apr 2009 03:39:55 -0600
Back to inline (just easier to compose my thoughts that way)
-Cliff
"Freaky" <wontsay@xxxxxxxxxx> wrote in message news:eHsWhuBuJHA.5836@xxxxxxxxxxxxxxxxxxxxxxx
Cliff Galiher wrote:
Short answer is if it rebooted with updates, but then failed over the
weekend, then it is *VERY very very very* (did I stress that enough?)
likely hardware.
Agreed, but some things just don't add up to it.
The fact that another server had the same problem? Coincidence.
Servers don't just reboot for no reason. Although I would've bet on a
driver issue initially, the fact that it seems to happen randomly now
has me leaning towards a physical failure. Some random thoughts, in no
particular order.
Hardware tests don't turn up anything, SBS 2008 CD boots fine, running
chkdsk from it turns up nothing, all disks/cd drives etc are recognized
by the boot CD. This issue is gone after reinstalling, which is very
weird with a hardware problem that consistently doesn't get through boot
with the current installation. After restoring system state with the
other customer, even a 2 week old one, the problem immediately returned.
So that server was reinstalled with a new domain etc. and has been running for
about 3 months now. According to my colleagues it had updates installed,
but wasn't rebooted for quite a while, so I can't really say it's not
the updates, can't really say it is either.
Thermal and voltage related problems usually won't show up with a chkdsk. An old-school practice that has fallen out of favor is burn-in testing. There was a time when a server (even one that had been "burned in" by the manufacturer) was fired up with a specialized OS (I still use Knoppix) and had a battery of tests run against each component to purposefully stress it. If thermal or electrical problems exist, that stress will usually be enough to force a failure. Although it has fallen out of practice these days because of the low failure rates of modern hardware, I wouldn't count the process out. Low failure is not 0% failure yet....
That is what I'd recommend now...
1) Is the server on a good battery backup? Dirty power coupled with a
poorly regulating power supply == random data corruption. A good UPS
can provide clean power to the server and resolve the issue.
Definitely, APC UPSes. Normally the Smart-UPS.
Good enough. Move on... :)
2) RAID controllers should have their own on-card battery. If power is
failing in the machine for any reason (overheating CPU, power supply,
etc) and the RAID is not able to flush its cache...game over. Corrupt
array and no amount of RAIDness will protect you.
The Smart Array E200i's come with battery-backed cache. They have 128 MB of
cache. Any corruption should have shown up in chkdsk. Furthermore if
it's a driver that's damaged it would be very odd because drivers
usually don't have anything written to them and so they usually don't
become corrupt during power outages etc. These usually are the open
files like registries, AD database etc.
Again...good enough. Move on...
3) Does the server have thermal monitoring? You should know if a CPU,
chipset, or other component is running hot well before the thing
actually reboots. Sure, by default these errors write to the event log,
but since the OS gets corrupted, you can't *view* the event log. Your
next install should have some sort of alerting system (email to another
server, pager, cell phone) set up so you at least know what failures
occurred before the data corruption.
Yeah, the HP servers will shut down at certain temperatures. If the
server is cold there is the same issue though, once it occurs on an
installation it remains unbootable. If it were thermal then booting from
CD etc. should have issues too.
Without off-server monitoring, you can't guarantee what is happening. How do you *know* the server was cold? All you've reported here is that the server starts off okay and fails...could be thermal. Without a cooling solution, processors can go from room temperature to metal-fracturing temperatures in less than one second. If heat-sink compound was not applied properly (it happens even in the most stringent quality-controlled environments, and HP is not stringently controlled) or another flaw exists, it is not unreasonable for a part to fail rather quickly while not actually being observed.
Don't *guess* that this isn't the same issue. *KNOW* it. Monitor it. Otherwise you are pursuing ghosts...
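To make the monitoring point concrete, here is a minimal Python sketch of the alerting half. It assumes you already have some way to collect temperature readings (vendor management agents, IPMI, whatever the box offers); the sensor names, the 70 C limit, the server name, and the addresses are all made up for illustration. Actually sending the message (for example with smtplib through a relay on a *different* machine) is deliberately left out.

```python
from email.message import EmailMessage

# Hypothetical limit in degrees Celsius; use the vendor's documented thresholds.
TEMP_LIMIT_C = 70.0


def find_hot_sensors(readings, limit_c=TEMP_LIMIT_C):
    """Return (sensor, temperature) pairs at or above the limit, sorted by name."""
    return sorted((name, temp) for name, temp in readings.items() if temp >= limit_c)


def build_alert(server, hot_sensors, sender, recipient):
    """Build an email listing the hot sensors, or return None if there are none."""
    if not hot_sensors:
        return None
    msg = EmailMessage()
    msg["Subject"] = f"[ALERT] {server}: {len(hot_sensors)} sensor(s) over limit"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(f"{name}: {temp:.1f} C" for name, temp in hot_sensors))
    return msg


if __name__ == "__main__":
    # Invented sample readings, not data from the server in this thread.
    sample = {"cpu1": 82.5, "cpu2": 41.0, "chipset": 55.0}
    alert = build_alert("sbs01", find_hot_sensors(sample),
                        "monitor@example.com", "oncall@example.com")
    if alert is not None:
        print(alert["Subject"])
        print(alert.get_content())
```

The important design choice is that the alert leaves the server before anything gets corrupted, so the evidence survives even when the local event log does not.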
4) Ultimately the data on the HD is getting corrupted, hence the boot
issue. Could be a failing HD. Could be a bad RAID controller. Or could
be both of these are fine, but another failure higher up (see points 1
through 3) is causing the problem. Regardless, focus on fixing the
hardware and the OS issue will resolve itself. If you focus on fixing
the OS through a registry fix, driver update, or rollback, then the
problem will only continue to re-occur. In other words, I agree with
you that your first obligation (a) is correct.
You have a very valid point here, and normally I'd agree with you in an
instant, but this just doesn't seem (purely) hardware based. It looks
more like some driver remembering some chip registers or something like
that causing the issue. Otherwise reinstalling the entire server and
having no issues until systemstate is restored simply doesn't make sense
either.
Except that you've already admitted that the system-state is not retained long enough to verify that this is the case. If you could, with certainty, restore the system-state from a time *BEFORE* the last system reboot then you could say this. But right now I see your reboot policies and data retention policies are in direct conflict. Restoring a system-state with a corrupt registry provides no value if the corruption was caused by one of the failures mentioned above...
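To spell out the retention point, a minimal Python sketch: given the system-state backups still on hand and the time of the last reboot that is known to have succeeded, keep only the backups that predate that reboot. The function name and the dates in the usage example are invented for illustration and are not taken from either server.

```python
from datetime import datetime


def usable_system_states(backup_times, last_good_reboot):
    """Return backups taken before the last known-good reboot, newest first.

    A backup taken after that reboot may already contain whatever broke the
    install, so restoring it cannot separate a software cause from a hardware one.
    """
    candidates = [t for t in backup_times if t < last_good_reboot]
    return sorted(candidates, reverse=True)


if __name__ == "__main__":
    # Hypothetical retention window: only the last few weeks are kept.
    backups = [datetime(2009, 3, 20), datetime(2009, 3, 27), datetime(2009, 4, 3)]
    # Hypothetical last successful reboot, older than every retained backup.
    last_good_reboot = datetime(2009, 2, 1)

    usable = usable_system_states(backups, last_good_reboot)
    if not usable:
        print("No retained backup predates the last good reboot.")
    else:
        print("Newest known-good candidate:", usable[0].date())
```

If the list comes back empty, the retention window is shorter than the time between reboots, which is exactly the conflict described above.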
5) Obligation (b) is unrealistic. You pointed out that you know how
things should be but sometimes customers don't have the money or you got
the customer in an already crappy situation and you have to make do.
So you justify not making local backups. But then you want a 5-minute
fix for what is obviously a hardware problem. Let me be clear; IT
DOESN'T EXIST. Sometimes there is no 5 minute fix PERIOD. Just like the
customer sometimes tells you that they won't spend money to do something
properly, sometimes you have to tell the customer that there is no quick
fix. That is the reality of the situation. Period. If everything
could be fixed in 5 minutes with a registry fix then nobody would make
backups. So you can't ask us to accept that you are dealing with a
customer who won't do the right thing, but then make the same demand of
us and expect us to do right by it. It will never work.
I don't really justify not having local backups. I advise all my
customers to use tape backups with specialized software for them. If
they want online I consider it extra (I don't like it at all).
Unfortunately the customer often disagrees or doesn't want to afford it.
For most it's very simple math. Online backup is very cheap, on-site
backup with rotation is much more expensive than me having to
troubleshoot say 4 hours more in case it does happen. These are usually
the really small customers, the larger ones have a larger impact on
employees not being able to do anything and tend to buy these things.
Hard drives are cheap. USB hard drives, nominally more expensive. If a terabyte hard drive is more expensive than 4 hrs of your labor then you aren't charging enough. I know that doesn't help you...but it does mean your customers are taking advantage of you. And they probably already know it, which is why they are putting up with these outages when they could've gone to another contractor by now. Your rates are *way* below industry standard....
And to be clear, I agree with one of your sentiments. I am not trying
to be short, rude, or argumentative here. It is not my intention to
offend or be harsh with the observations above, but sometimes the facts
do not line up with the goals, and the goals need to be adjusted. This
is clearly one of those times.
Not quite about to give up yet. Troubleshooting much longer isn't really
financially viable either. It is what I like most though :).
I love troubleshooting too... It is why I let my junior techs handle the mundane stuff. It is nice being the boss.....
We however
do have the risk of it returning, especially if it does turn out to be
software.
It isn't. I will put money down NOW. I'm not there. I don't know all the facts. But SBS (as a product) is solid and I've installed everything from BackOffice 4 to SBS 2008. *DRIVERS* are technically software, but since they tie the hardware to the OS, they don't count. Especially if you actually were sticking to WHQL drivers and not installing after-market, which it sounds like you did avoid during the latest install. It is a HW problem...through and through.
It can take months for the issue to reappear and we're about
to setup the same ML350G5/2008 SBS at some larger customers.
Perhaps I misinterpreted your posts, but you made it sound like it reappears quickly. A problem spanning *months* is rarely SBS...but can be tracked to anything from failing RAM to a third-party app. I re-read your previous posts after reading this...and everything you've indicated (until now) is that this develops rather quickly. Changing the symptoms will obviously change the diagnosis.
We really
don't want to run into the issue there (or anywhere else for that
matter, unfortunately fact is it's better to lose a small customer than
a large customer, albeit it's better to lose neither :D).
Nobody likes to lose customers.
I'd at least
need some certainty it won't return, is software based or we should look
for other hardware.
There is never certainty in this business. But if one specific configuration gives you problems then it is time to give it up. SBS 2k8 went through a thorough beta and release candidate period. It is deployed (by now) in many places, and this is the first I've heard of any consistently repeatable problem. Without more details, I lean towards hardware. *MAYBE* it is a specific HP driver. Either way, if it is repeatedly happening on more than one site, then it is time to change things up.
We try to have the same hardware at customers as
much as possible, not only because it eases work, but it makes it
affordable to have some spare parts for it at our office etc. too.
Not an uncommon practice. But if you stick to name-brand vendors (HP, Dell, etc) then keeping parts at your office becomes moot. They can drop-ship parts faster than you can schedule a visit in most cases. Having parts "on hand" is an advertising slogan that I've only heard from white-box vendors that know they had a reputation they needed to fix...
Looking up the error code it seems to be INACCESSIBLE_BOOT_DEVICE,
however, this doesn't show in the blue screen. It seems to be often
related to the SATA mode (mainly Vista posts about this). This is also
what the Microsoft engineer told us, who specifically said SATA had to
be in AHCI mode. Which I don't think it is.
An inaccessible boot device is a failure during kernel startup. Not specific to Vista or SATA. It just happens to be that 90% of computer sales are desktops, and 99% of those have SATA drives...so if a kernel failure happens, that is the reported error. Don't make assumptions.
If I unhook the CD player, which is on the only SATA port on the
mainboard (which is also oddly called SATA Optical Port by HP
description) the boot progresses a bit further than with it connected.
It however does crash with the same error.
Again, you are pursuing a SATA path based on pretty thin circumstantial evidence. And the fact that it still crashes shows that it was wrong. It is crashing at some point during a kernel-driver load...and since Windows boot logging only displays *successful* loads, viewing the load process doesn't illuminate much. You have a driver doing something bad, and that is caused because you have hardware doing something bad. Maybe the driver is corrupt, or maybe the hardware is...I don't know. I'm not there. I get paid to be there...you have to use the tools at YOUR disposal.
Flushed/reset BIOS settings, tried an older BIOS version, tried
disabling IDE/SATA but this isn't possible (verified by HP). Looked up
if I could find the SATA driver on the other customers server. However,
all I see is 2 IDE controllers. According to device management the CD
player should be connected to one of the IDE controllers. I have the
exact drivernames for these at home (at another customer now) but I
recall atapi.sys, some other one with ata in it, pciidx.sys and another
one starting with pci. Renamed these drivers to .sys.org. Server will
not boot then and says it's missing the driver atapi.sys. So disabled
them in registry by setting their services' start registry value to 0x4.
Same BSOD. Odd thing is, if I disconnect the DVD it still boots a bit
further, which I find odd as with removing the drivers I'd assume it's
no longer talking with the DVD at all. So I seem to be missing, at least
a part, of the SATA driver. Perhaps this is because it's probably in IDE
mode. It's also the last thing I'd want to know before giving up if it
isn't the issue.
Still chasing SATA...still wrong...
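For anyone following along on the registry change described above: the Start value under a driver's Services key decides when (or whether) Windows loads it, and 0x4 means disabled. A small Python sketch of the standard values, using hypothetical helper names, just to make the mapping explicit:

```python
# Standard meanings of the Start value under
# HKLM\SYSTEM\CurrentControlSet\Services\<name>.
START_TYPES = {
    0x0: "boot",       # loaded by the boot loader
    0x1: "system",     # loaded during kernel initialization
    0x2: "automatic",  # started by the Service Control Manager at startup
    0x3: "manual",     # started on demand
    0x4: "disabled",   # never started
}


def describe_start(value):
    """Translate a Start value into its name, or 'unknown' for anything else."""
    return START_TYPES.get(value, "unknown")


def loads_during_boot(value):
    """True for boot-start and system-start drivers, which load before services."""
    return value in (0x0, 0x1)


if __name__ == "__main__":
    for value in sorted(START_TYPES):
        print(hex(value), describe_start(value), loads_during_boot(value))
```

Keep in mind that a boot-start driver is one the loader brings in before the kernel takes over, so disabling one that the boot volume depends on can by itself leave that volume unreachable. That makes this kind of experiment a poor way to isolate the original fault.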
Another thing that doesn't add up to corruption btw, is that
virtualizing the server from a bootable CD works fine. The resulting
guest has no issues whatsoever, albeit it obviously does find new
hardware amongst which there are no SATA controllers.
I will, again, refer to my comment above, that you indicated that the problem "may not turn up for months." Unless you've been virtualizing SBS2k8 for longer than its release (I envision bad Back To The Future references here) you can't say that virtualizing works fine. Either the problems occur quickly or they don't. There is also the fact that, ultimately, a virtual machine (even Hyper-V) is a file (just a really big .vhd file), so a driver corruption that would occur on a *real* machine would not happen on a .vhd because the .vhd file would never be committed and the RAID array would correct. That is some pretty low-level file-write talk to explain it...but if you want to go there, I'd be happy to privately explain it.
Probably will give up soon, not worth the time, albeit that will have to
be seen if the error returns, at whatever customer. I'm also not looking
for a 5 minute fix now, I'm looking for the answer if this is a software
based issue so I can fix it in 5, 10 maybe 30 minutes *next* time.
Onsite backups. With full rotation. WORST CASE. Even better if you find the bad hardware/driver and fix it (of course I know you already want that...but still...worth stating).
If it
even returns if we know what's causing it. Still thinking it's a driver
problem, the question is just which one and why after so long (and then
very consistently thus, we do reinstall all the same (latest on HP
site/windowsupdate) after reinstallation) so one would expect the issue
to be back immediately but it doesn't (but it thus does with restoring
systemstate which should have all the same drivers).
A driver issue would show on reboot. If you are *SURE* that the server is being rebooted after essential updates then I don't think it is a driver.
As you said...it just doesn't add up...
Thanks a bunch :)