RE: Server 2008 Cluster Issues
- From: Priit Vosu <PriitVosu@xxxxxxxxxxxxxxxxxxxxxxxxx>
- Date: Sun, 7 Dec 2008 16:54:01 -0800
I upgraded the IBM DS4700 firmware from 6.23.05.02 to 7.36.08.00.
I also upgraded the storage MPIO drivers on the Windows 2008 server to
version 10.36, and since then the system has been stable for a week. That is
not a very long time, but previously the error occurred much sooner than that.
I am also running standby replication off that server and I still get
occasional Event ID: 2082 errors, but the replication itself is not breaking
and the servers are working correctly so far.
We are planning to make the same upgrade in the production environment in a
couple of days if everything is still OK by then. So far it does seem that
the firmware and MPIO driver updates have made the system more stable and
maybe even fixed it.
"Dale Kiefer" wrote:
We have the exact same issues you are experiencing with the cluster becoming
non-responsive. I have also posted about this in the Exchange forums at:
http://www.microsoft.com/communities/newsgroups/en-us/default.aspx?&lang=&cr=&guid=&sloc=en-us&dg=microsoft.public.exchange.clustering&p=1&tid=f9ee94b3-d92c-4273-a962-f8f2de77d68f
Please let me know how you make out with the storage firmware upgrade or any
other changes. I am not running in a test environment so I don't have the
ability to make changes as easily.
What version of firmware are you upgrading from/to?
I'm glad to see a few more people are bumping into these same problems.
Hopefully we can find a resolution soon.
"Priit Vosu" wrote:
I seem to have come across similar problems:
We have 2 datacenters, both running identical hardware, and we are testing
an upgrade of Exchange to Exchange 2007.
I built 2 x Windows 2008 SP1 clusters, one in each DC, and put the Exchange
2007 clustered mailbox role on them, with standby continuous replication
between them.
Hardware is running on IBM BladeCenter HS21 blades.
Storage is on IBM System Storage DS4700 storage boxes.
We have the latest drivers and patches applied, and also Exchange SP1 Update
Rollup 5.
The system that is on the receiving end of the standby replication does not
seem to suffer from the issues; however, the Exchange box that is running
live gets the following errors after a couple of days:
Source: Microsoft-Windows-FailoverClustering
Event ID: 1069
Level: Error
Description: Cluster resource 'FileServer-(E2K7CLUS02)(Cluster Disk 9)' in
clustered service or application 'E2K7CLUS02' failed.
Source: Microsoft-Windows-FailoverClustering
Event ID: 1230
Level: Error
User: SYSTEM
Computer: E2K7NODE2.domain.local
Description: Cluster resource 'FileServer-(E2K7CLUS02)(Cluster Disk 10)'
(resource type '', DLL 'clusres.dll') either crashed or deadlocked. The
Resource Hosting Subsystem (RHS) process will now attempt to terminate, and
the resource will be marked to run in a separate monitor.
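When comparing nodes or time periods, it can help to tally how often each event ID shows up. The following is only a minimal sketch: it assumes the events have been saved as plain text in the same "Event ID: N" layout as quoted above, and the function name and sample text are made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as "Event ID: 1069" in a plain-text event dump.
# The layout is an assumption based on the records quoted in this post.
EVENT_ID = re.compile(r"^Event ID:\s*(\d+)", re.MULTILINE)


def count_event_ids(text: str) -> Counter:
    """Return how many times each event ID appears in the text."""
    return Counter(int(event_id) for event_id in EVENT_ID.findall(text))


# Hypothetical sample built from the two records quoted above.
sample = """Source: Microsoft-Windows-FailoverClustering
Event ID: 1069
Level: Error
Source: Microsoft-Windows-FailoverClustering
Event ID: 1230
Level: Error
"""

for event_id, count in sorted(count_event_ids(sample).items()):
    print(event_id, count)
```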
After these errors appear, Cluster Administrator hangs, and Windows Explorer
also hangs and is not able to browse the drives.
Exchange will, however, continue to work: mail keeps flowing and clients can
still access mailboxes. To restore the functionality of the admin tools, so
far I have found nothing short of a reboot that does the trick.
When I move the cluster over to the standby cluster in the other DC the same
problems start there.
I will try the firmware upgrade of the storage in the next couple of days in
one of the DCs to see if it has any impact. It does look like it is some sort
of issue between the IBM storage or storage drivers and the Windows 2008 SP1
cluster. We have multiple Windows 2003 clusters running off the same storage
boxes and so far we have had no errors on any of them.
"Dale Kiefer" wrote:
There was an identical incident posted at:
http://www.microsoft.com/communities/newsgroups/en-us/default.aspx?&lang=&cr=&guid=&sloc=en-us&dg=microsoft.public.windows.server.clustering&p=1&tid=c086d541-b2f7-4db1-899a-a0ff03f55183
However, I thought I would pose it as a separate question in the hope of
bringing additional attention to this problem. I am basically copying and
editing the previous poster's information as it is so close to our issue.
We have a problem with a Windows 2008 SP1 cluster. It has Exchange 2007 SP1
with Cumulative Rollup 4 installed. The second node is continually going
down. Here is what is going on:
• Cluster servers are IBM xSeries 3650s
• Using an IBM DS4800 SAN for shared storage
• The NIC configuration on the nodes is as follows:
  o Onboard Broadcom adapter - v4.4.16.0
  o 2 Intel PCI-X adapters
  o 3 network connections set up:
    public - 10.2.105.x Intel switch VLAN 1
    private - 10.2.109.x Intel switch VLAN 2
    private - 192.168.1.x Broadcom crossover cable
  o We set up the 3 network connections to help eliminate the network as
    the issue.
  o IPv4 connectivity only, no teaming
  o Windows cluster validation does not report any issues.
The issue that we are seeing is that intermittently Node 2 gets kicked out
of the cluster and shuts down the cluster service generating an 1177 error in
the event log. Basically, this means that it lost quorum due to losing
connectivity with the cluster nodes. This sometimes happens 3 times an hour,
but might not happen for a few hours. The cluster service will always
automatically restart and everything is fine again for a period of time.
The problem is NOT isolated to Node 2, however. If we make Node 2 the cluster
owner, then Node 1 exhibits the problem; if Node 1 is the owner, then Node 2
exhibits the problem. We are using Node and Disk Majority for the quorum
setting.
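For reference, the vote arithmetic behind Node and Disk Majority can be sketched as below. This is a simplified illustration, not how the cluster service is implemented: it assumes one vote per node plus one vote for the witness disk, and the function and parameter names are invented for the example.

```python
def votes_needed(total_votes: int) -> int:
    """Smallest number of votes that is a strict majority."""
    return total_votes // 2 + 1


def has_quorum(nodes_reachable: int, witness_disk_reachable: bool,
               total_nodes: int) -> bool:
    """Check quorum under a simplified Node and Disk Majority model.

    Assumes one vote per node plus one vote for the witness disk.
    nodes_reachable counts the local node itself.
    """
    total_votes = total_nodes + 1  # all nodes plus the witness disk
    votes_present = nodes_reachable + (1 if witness_disk_reachable else 0)
    return votes_present >= votes_needed(total_votes)


# Two-node cluster: 3 votes in total, 2 needed.
print(has_quorum(1, True, 2))   # one node that still holds the disk
print(has_quorum(1, False, 2))  # one node cut off from peer and disk
```

In this model a two-node cluster has 3 votes and needs 2, so a node that loses contact with both its partner and the witness disk is left with a single vote and drops out. That would fit a storage-side cause as well as a network one.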
It looks like the nodes are losing network connectivity to each other based
on the cluster logs indicating the routes as down, but we now have 3 network
connections between the 2 nodes using 3 different adapters from 2 different
vendors. So I doubt this is the issue.
MS believes the issue to be storage related due to "error 170" appearances
in the cluster logs and indicates these are related to persistent reservation
problems. We have installed the latest MPIO from IBM which supposedly
resolves some of these types of issues. However, the problem continues. IBM
is also looking into this, but we await a solution.
Has anyone else run into this problem? Suggestions? Any help is greatly
appreciated.