Re: Cluster Testing - Failure and Recovery taking longer than expected



"Brad Baker" <brad@xxxxxxxxxxxxx> wrote in message news:uGmcDhL9GHA.4224@xxxxxxxxxxxxxxxxxxxxxxx
We've just set up a 4-node Windows 2003 R2 cluster following this Microsoft whitepaper: http://www.microsoft.com/downloads/details.aspx?familyid=96F76ED7-9634-4300-9159-89638F4B4EF7&displaylang=en

We're running through the advanced testing in the appendix (page 38). Specifically this test:

Test:
Move all resources to one node, and then remove the public network cable from that node. The IP Address resources should fail, and the groups will all failover to another node in the cluster.

Expected Result:
Resources should failover to another node in the cluster

When we place all groups on one node and unplug the public network cable, the cluster does not immediately fail over to the other nodes. It takes 7 minutes for the failure to register and for the cluster to recover.

Why does the failure take so long? Is there any way we can make the failure and recovery happen faster?

See KB 242600 and KB 176320 for more information on network failures and recovery.

If you are really concerned about network failures, you should use teaming for your public network interfaces. Set the teams to fault-tolerant mode, and connect each port to a different switch.
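
If you want a repeatable number for how long the cable-pull test takes to recover, one option is to probe a clustered IP address from a client on the public network and record how long it stays unreachable. The following is only a rough sketch in Python, not anything from the whitepaper or the KB articles. The names probe_tcp, longest_outage, and watch are made up for illustration, and the host and port you pass in are assumptions you would replace with one of your own virtual IP addresses and a port that a clustered resource actually listens on.

```python
import socket
import time


def probe_tcp(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def longest_outage(samples):
    """Return the longest outage, in seconds, found in the samples.

    samples is a time-ordered list of (timestamp_seconds, reachable) pairs.
    An outage runs from the first failed probe to the next successful one.
    An outage that is still in progress at the last sample is not counted.
    """
    longest = 0.0
    down_since = None
    for timestamp, reachable in samples:
        if not reachable and down_since is None:
            down_since = timestamp  # outage starts here
        elif reachable and down_since is not None:
            longest = max(longest, timestamp - down_since)
            down_since = None  # outage is over
    return longest


def watch(host, port, duration=900.0, interval=1.0):
    """Probe host:port every `interval` seconds for `duration` seconds."""
    samples = []
    end = time.monotonic() + duration
    while time.monotonic() < end:
        samples.append((time.monotonic(), probe_tcp(host, port)))
        time.sleep(interval)
    return samples
```

Start watch() against the virtual IP just before pulling the cable, then pass its result to longest_outage(). The measured time includes both failure detection and resource restart on the new node, so it is an upper bound on either one alone, and its resolution is limited by the probe interval plus the connection timeout.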


--
Russ Kaufmann
MVP - Windows Server - Clustering
ClusterHelp.com, a Microsoft Certified Gold Partner
Web http://www.clusterhelp.com
Blog http://msmvps.com/clusterhelp

The next ClusterHelp classes are:
Denver starting Nov 14th
NYC starting Dec 4th.




