SBS 2000 SP1 (/SP4) system hang
From: William Kennedy (williamkennedy_at_vigilant.it)
Date: 02/22/04
Date: Sun, 22 Feb 2004 23:27:46 +1100
I added a Smart Array 431 RAID controller to a client's ProLiant ML370 G1 SBS
2000 server.
Windows 2000 SP4
Exchange 2000 SP3
ISA 2000 SP1
Before the addition the client had the following configuration.
The C:\ and D:\ drives were located across two 9 GB disks using the O/S
software mirroring.
C:\ was the system disk.
D:\ contained several directories, of which the only significant one was
D:\Program Files\, which contained the SQL, Exchange and Office
directories.
E:\ was a 36 GB hard disk that contained all the company data as well as an
E:\Program Files\ directory containing an Exchange directory and an SQL
directory, which used to be the location of the SQL databases.
Upon adding the Smart Array card we configured four hard disks in a RAID 5
configuration. Each hard disk was a 72 GB Ultra 3 hard disk and the result
was a 205 GB drive. I copied all company data to the new disks. I then
attempted to migrate ISA's system cache temporarily to the D:\ drive, but
while I believe I knew where to change the cache configuration/drives,
I could not determine how to move the cache, so I just reset it to zero as
they do not use ISA Server anyway. (After the completion of the migration I
did attempt to recreate the cache at its original size on the new E:\ but
was unable to do so.)
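As a rough sanity check on the array size, here is a minimal sketch of the usual RAID 5 arithmetic (one disk's worth of space is given up to parity). It assumes the disks are exactly 72 decimal GB each and that the operating system reports sizes in binary gigabytes; the function names are only illustrative.

```python
def raid5_usable_gb(disk_count, disk_size_gb):
    """Usable capacity of a RAID 5 set: one disk's worth of space holds parity."""
    if disk_count < 3:
        raise ValueError("RAID 5 needs at least three disks")
    return (disk_count - 1) * disk_size_gb


def decimal_gb_to_binary_gb(size_gb):
    """Convert vendor (10^9 byte) gigabytes to binary (2^30 byte) gigabytes."""
    return size_gb * 10**9 / 2**30


# Assumption: four disks of exactly 72 decimal GB each.
usable_gb = raid5_usable_gb(4, 72)
reported_gb = decimal_gb_to_binary_gb(usable_gb)
print(usable_gb, round(reported_gb, 1))
```

Under those assumptions the result is in the same range as the 205 GB drive described above; the exact figure depends on the true capacity of the drives, which is not given here.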
After stopping all non-essential services I copied E:\Program Files\ and
all its contents to the new drive and relabelled the new drive as E:\. This
went off without a hitch, and when the system was rebooted it came up
cleanly with no issues at all.
Once we were satisfied that the system was running OK we removed the old E:\
drive. We then uninstalled SQL, as they had previously migrated their
GoldMine and Heat applications to a separate server but had not removed the
SQL installation on the SBS server.
Having only just taken over the network, I had a few weeks ago added a
Windows 2003 Server using terminal services to replace the SBS server. This
allowed me to remove Office from the domain controller.
At this time I also moved the Exchange databases from D:\ to E:\.
This work was completed over two nights and according to the logs and the
fact that all services started we felt the process had been successful.
It is difficult to pinpoint when the system began to hang, as the users' only
direct contact with it now is file access: they use GoldMine as their mail
client, and internet access comes through their WatchGuard firewall.
I have just sampled the system hang a total of ten times, and the hang
occurs every 5 minutes and 35 seconds to 5 minutes and 40 seconds
(00:05:35 to 00:05:40). The hang lasts for between 1 minute and 27 seconds
and 1 minute and 32 seconds (00:01:27 to 00:01:33).
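To put those samples in perspective, here is a small sketch of the arithmetic. It assumes the 5:35 to 5:40 figure is the period of normal operation between hangs (it could instead be measured from the start of one hang to the start of the next), and it uses the 1:27 to 1:32 duration quoted in words.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds pair to whole seconds."""
    return minutes * 60 + seconds


# Assumption: this is the responsive time between the end of one hang
# and the start of the next.
gap_min, gap_max = to_seconds(5, 35), to_seconds(5, 40)
hang_min, hang_max = to_seconds(1, 27), to_seconds(1, 32)

# Length of one full run-then-hang cycle.
cycle_min = gap_min + hang_min
cycle_max = gap_max + hang_max

# Share of each cycle during which the server is unresponsive.
unavailable_low = hang_min / (gap_max + hang_min)
unavailable_high = hang_max / (gap_min + hang_max)

print(cycle_min, cycle_max)
print(f"{unavailable_low:.1%} to {unavailable_high:.1%}")
```

On that reading the cycle repeats roughly every seven minutes, and the server is unavailable for about a fifth of the time. A period that regular suggests something timer-driven rather than user load.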
After reviewing the logs there were no errors that we could directly
attribute to the hang, with the exception of 'Information' logs from
Exchange which seem to be related to an Exchange service failing to get
access to the database (which is now located on the E:\ drive.)
In an attempt to isolate the issue we rebooted into Safe Mode, where the issue
was not replicated, and so we concluded that it was not a hardware issue.
However, many of the services that rely on the E:\ drive, like Exchange, are
not started in Safe Mode.
We rebooted the server and began to stop services to see if the hang was
related to a particular service. After stopping all but what we considered
the core services the problem was still occurring. We then stopped all
services except those which the system does not allow you to stop.
When we did this the hanging stopped.
We then restarted the core services again and the server still did not hang.
These core services included: Alerter, COM+, System Event Notification,
Computer Browser, Logical Disk Manager, Net Logon, Network Connections, RPC
Locator, SNMP Service, TCP/IP NetBIOS Helper, Windows Management
Instrumentation, Windows Management Instrumentation Driver Extensions,
Workstation, IIS Admin Service, Protected Storage, Remote Access Connection
Manager, Telephony, Plug and Play. I may have missed one or two of the
services that cannot be stopped, but this is quite a comprehensive list.
We guessed that a non-core service which depends on one of the core services
may have affected the behaviour of that core service, and that it was only
once the core service had been stopped and restarted manually, without the
non-core service then starting, that the system stopped hanging.
At this point we began to restart services. About two thirds of the way
through we restarted the 'Microsoft Scheduled Cache Content Download'
service, and in the middle of starting the service the system hung.
When the system hang was over the service finished starting. We waited to
see if the system would hang again, which it did. We then stopped the
'Microsoft Scheduled Cache Content Download' service in the hope that this
was our offending service. When the hanging still kept occurring we stopped
the related ISA services, but the hanging remained.
We then stopped all services so as to stop the hanging and began restarting
services again. Around a third of the way through the restarting the system
hung again. The service being started at that point (which, stupidly, I did
not note down) was not related to ISA, and from what I recall seemed to be
quite an unremarkable service (if you can ever say that about a service).
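The start-one-service-and-watch procedure described above can be sketched as follows, partly so that the service name gets recorded next time. This is only an outline: `start_service` and `hang_observed` are hypothetical callables that would have to be supplied (for example a wrapper around `net start`, and a check that waits at least one full hang cycle), and the service names are placeholders.

```python
def first_service_triggering_hang(services, start_service, hang_observed):
    """Start services one at a time; return the first after which a hang is seen.

    start_service and hang_observed are supplied by the caller (hypothetical
    helpers here), so the procedure itself stays independent of the platform.
    """
    for name in services:
        start_service(name)
        if hang_observed():
            return name
    return None


# Simulated run with placeholder names, standing in for the real server.
started = []
culprit = "Hypothetical Service C"
candidates = [
    "Hypothetical Service A",
    "Hypothetical Service B",
    culprit,
    "Hypothetical Service D",
]
result = first_service_triggering_hang(
    candidates,
    start_service=started.append,
    hang_observed=lambda: culprit in started,
)
print(result)
```

As the two attempts above show, the service that happens to be starting when the hang appears is not necessarily the cause, so a result from this kind of loop is a lead to confirm by stopping that service again, not a conclusion.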
We also noted that at the time of the hang the lights on the RAID disks (the
E:\ drive, which contains the page file and Exchange) would go solid,
and when the system came back from the hang the lights would go back to
flashing in unison. For this reason we downloaded the latest firmware for
the RAID card and installed it; this had no effect.
Throughout the investigation I have been monitoring a few Performance
Monitor counters, where a pattern develops before the system hangs
(unfortunately no data is collected during the hang). Please note that this
data was collected over the weekend, so not much activity by other users
is expected on the system. The patterns are as follows.
Between 20 and 22 seconds before the system hangs the counter % Processor
Time moves from a resting value of about 8% to an average of about 55% to 60%.
Not surprisingly, the counter Processor Queue Length goes from an average of
0% to an average of around 20%, and the counter Pages/sec increases from
occasional activity spiking from 0% to 30% to frequent activity around 20% to
30%. The counter % Time increases from an average of around 6% to about 20%.
All this activity occurs about 20 to 22 seconds before the hang.
Upon returning from the hang the counters Processor Queue Length, Context
Switches/sec and Pages/sec come back at around 100%.
I am sorry for the verbosity of the posting, but as the issue does not
offer any obvious error message I have tried to give as complete a summary
of the issue as possible.
Thank you for your time.
Regards,
William Kennedy