RE: Server 2008 Cluster Issues


It has been a while, and we have tried pretty much everything. We updated the
storage firmware to v7.36.12, updated the Brocade fiber switch firmware to
v6.1.1d, and updated all drivers to the latest versions as soon as new ones
came out. We updated the firmware of every server component that could be
updated, and applied every Microsoft patch for Win2008 clustering and
storage/MPIO that we could find. We have opened 3 cases with IBM so far and
the problem still remains.

So finally, after all that work and time, IBM told us today that there appears
to be an issue in the IBM MPIO drivers for Windows 2008 that is likely to be
the cause of this error. They do not have new drivers ready yet, but they are
working on them and will let us know as soon as they are available. It turns
out that they have been getting more cases similar to this one, and they
believe they have identified the issue.


"Priit Vosu" wrote:

It appears the problems have not gone away. The system was up and running
without errors for 2 weeks, and this morning it bluescreened and rebooted with
Bugcheck: 0x00000018
After the server came up again, 2 hours later I got Event ID: 1230 again with
the message:
Cluster resource 'FileServer-(EXCLUSTER02)(Cluster Disk 1)' (resource type
'', DLL 'clusres.dll') either crashed or deadlocked.
So the storage issues seem to be back. By now we already have a few hundred
users on that server, so it is somewhat of an annoyance. I will open a case
with IBM and see where we go from there.


"Dale Kiefer" wrote:

Thanks for letting me know. On Friday, we also upgraded our SAN firmware at
IBM's recommendation to version 7.36.08.00 (from 6.60). We are running the
latest MPIO as well. The issue still exists though.

I sure hope your issues have disappeared, but I would recommend doing some
failovers on your cluster if you haven't already. I had thought the issue
had been resolved after an HBA BIOS upgrade. However, after a few failovers
(using both the management console and simply rebooting the server), the
issue returned.

Keep me posted as to how things are going on your end.
Thanks.

"Priit Vosu" wrote:

I upgraded the IBM DS4700 firmware from 6.23.05.02 to 7.36.08.00.
I also upgraded the storage MPIO drivers on the Windows 2008 server to version
10.36, and since then the system has been stable for a week. That is not a
very long time, but previously the error occurred much sooner than that.

I am also running standby replication off that server and I still get
occasional Event ID: 2082 errors, but the replication itself is not breaking
and the servers are working correctly so far.

We are planning to make the same upgrade in the production environment in a
couple of days if all is still OK at that time. So far it does seem that
the firmware and MPIO driver updates have made the system more stable and
maybe even fixed it.


"Dale Kiefer" wrote:

We have the exact same issues you are experiencing with the cluster becoming
non-responsive. I have also posted about this in the Exchange forums at:

http://www.microsoft.com/communities/newsgroups/en-us/default.aspx?&lang=&cr=&guid=&sloc=en-us&dg=microsoft.public.exchange.clustering&p=1&tid=f9ee94b3-d92c-4273-a962-f8f2de77d68f

Please let me know how you make out with the storage firmware upgrade or any
other changes. I am not running in a test environment so I don't have the
ability to make changes as easily.

What version of firmware are you upgrading from/to?

I'm glad to see a few more people are bumping into these same problems.
Hopefully we can find a resolution soon.

"Priit Vosu" wrote:

I seem to have come across similar problems:

We have 2 datacenters, both running identical hardware, and we are testing an
Exchange upgrade to Exchange 2007.
I built 2 Windows 2008 SP1 clusters, one in each DC, and put the Exchange 2007
clustered mailbox role on them, with standby continuous replication between
them.

Hardware is running on IBM BladeCenter HS21 blades
Storage is on IBM System Storage DS4700 storage boxes.

We have the latest drivers and patches applied, and also Exchange SP1 Rollup
Update 5. The system on the receiving end of the standby replication does not
seem to suffer from the issues; however, the Exchange box that is running live
gets the following errors after a couple of days:

Source: Microsoft-Windows-FailoverClustering
Event ID: 1069
Level: Error
Description: Cluster resource 'FileServer-(E2K7CLUS02)(Cluster Disk 9)' in
clustered service or application 'E2K7CLUS02' failed.

Source: Microsoft-Windows-FailoverClustering
Event ID: 1230
Level: Error
User: SYSTEM
Computer: E2K7NODE2.domain.local
Description: Cluster resource 'FileServer-(E2K7CLUS02)(Cluster Disk 10)'
(resource type '', DLL 'clusres.dll') either crashed or deadlocked. The
Resource Hosting Subsystem (RHS) process will now attempt to terminate, and
the resource will be marked to run in a separate monitor.

After these errors appear, Cluster Administrator hangs, and Windows Explorer
also hangs and is not able to browse the drives.
Exchange, however, continues to work: mail keeps flowing and clients can still
access mailboxes. So far I have found nothing short of a reboot that restores
the functionality of the admin tools.

When I move the cluster over to the standby cluster in the other DC, the same
problems start there.
I will try the storage firmware upgrade in one of the DCs in the next couple
of days to see if it has any impact. It does look like some sort of issue
between the IBM storage or storage drivers and the Windows 2008 SP1 cluster.
We have multiple Windows 2003 clusters running off the same storage boxes,
and so far we have had no errors on any of them.



"Dale Kiefer" wrote:

There was an identical incident posted at:
http://www.microsoft.com/communities/newsgroups/en-us/default.aspx?&lang=&cr=&guid=&sloc=en-us&dg=microsoft.public.windows.server.clustering&p=1&tid=c086d541-b2f7-4db1-899a-a0ff03f55183

However, I thought I would pose it as a separate question in the hope of
bringing additional attention to this problem. I am basically copying and
editing the previous poster's information as it is so close to our issue.

We have a problem with a Windows 2008 SP1 cluster. It has Exchange 2007 SP1
with Cumulative Rollup 4 installed. The second node is continually going
down. Here is what is going on:

• Cluster servers are IBM xSeries 3650s
• Using an IBM DS4800 SAN for shared storage
• The NIC configuration on the nodes is as follows:
    o Onboard Broadcom adapter - v4.4.16.0
    o 2 Intel PCI-X adapters
    o 3 network connections set up: public - 10.2.105.x Intel switch
      VLAN 1; private - 10.2.109.x Intel switch VLAN 2; private -
      192.168.1.x Broadcom crossover cable
    o We set up the 3 network connections to help eliminate the
      network as the issue.
    o IPv4 connectivity only, no teaming
    o Windows cluster validation does not report any issues.

The issue that we are seeing is that intermittently Node 2 gets kicked out
of the cluster and shuts down the cluster service generating an 1177 error in
the event log. Basically, this means that it lost quorum due to losing
connectivity with the cluster nodes. This sometimes happens 3 times an hour,
but might not happen for a few hours. The cluster service will always
automatically restart and everything is fine again for a period of time.

The problem is NOT isolated to Node 2, however. If we make Node 2 the cluster
owner, then Node 1 exhibits the problem; if Node 1 is the owner, then Node 2
exhibits the problem. We are using the Node and Disk Majority quorum setting.
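
To illustrate the quorum arithmetic behind the 1177 error, here is a minimal
sketch in Python. It assumes the usual Node and Disk Majority rule (one vote
per node plus one vote for the disk witness, with more than half of all votes
needed to keep running). The function and parameter names are made up for
illustration and are not part of any cluster API.

```python
def has_quorum(nodes_up: int, total_nodes: int, witness_reachable: bool) -> bool:
    """Node and Disk Majority: every node and the disk witness hold one vote."""
    total_votes = total_nodes + 1  # all nodes plus the disk witness
    votes_present = nodes_up + (1 if witness_reachable else 0)
    # Quorum requires strictly more than half of all votes.
    return votes_present > total_votes // 2


# Two-node cluster with a disk witness: 3 votes in total, 2 needed.
print(has_quorum(2, 2, True))   # both nodes and the witness are present
print(has_quorum(1, 2, True))   # one node that can still reach the witness
print(has_quorum(1, 2, False))  # one isolated node without the witness
```

In a two-node cluster, a node that loses contact with both its partner and
the witness disk holds only 1 of 3 votes, so a storage problem alone could be
enough to cost that node quorum.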

It looks like the nodes are losing network connectivity to each other based
on the cluster logs indicating the routes as down, but we now have 3 network
connections between the 2 nodes using 3 different adapters from 2 different
vendors. So I doubt this is the issue.

MS believes the issue to be storage related due to "error 170" appearances
in the cluster logs and indicates these are related to persistent reservation
problems. We have installed the latest MPIO from IBM which supposedly
resolves some of these types of issues. However, the problem continues. IBM
is also looking into this, but we await a solution.
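
For anyone wanting to check their own logs for the same pattern, here is a
rough sketch in Python that lists the lines mentioning error 170. The search
pattern and the sample lines are assumptions made for illustration; real
cluster log lines may be worded differently, so the pattern will likely need
adjusting.

```python
import re

# Assumed pattern: the word "error" or "status" followed by the number 170.
# Real cluster log lines may be worded differently; adjust as needed.
ERROR_170 = re.compile(r"\b(?:error|status)\s+170\b", re.IGNORECASE)


def find_error_170(lines):
    """Return (line_number, text) for every line that mentions error 170."""
    return [
        (number, line.rstrip())
        for number, line in enumerate(lines, start=1)
        if ERROR_170.search(line)
    ]


# Made-up sample lines, for illustration only (not from a real cluster log).
sample = [
    "INFO  [RES] Physical Disk: reservation refreshed",
    "ERR   [RES] Physical Disk: reserve failed, error 170",
    "WARN  [RES] Physical Disk: retry, status 1700",
]

for number, text in find_error_170(sample):
    print(number, text)
```

The function accepts any iterable of lines, so an open file object can be
passed in directly. Comparing the matching timestamps with the times of the
1177 events would show whether the two line up.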

Has anyone else run into this problem? Suggestions? Any help is greatly
appreciated.