Network fails = manually restart resource groups always?



I've been pulling my hair out over this all day.

We have a 4-node cluster we're building for file sharing. Each node is
connected to four switches. 2 of the switches are used for the private
heartbeat, 2 are used for serving files to clients.

We're testing a variety of hypothetical network failures: a bad network
cable on one of the nodes, individual nodes losing all network
connectivity, one of the public switches going down, etc. In all cases
the cluster recovers and is able to serve files -- except one scenario.

If both public switches go down simultaneously, the nodes try
desperately to shuffle the resource groups around. All the resource
groups then fail, and can only be brought up manually. As a real-world
example, a person on our network team needed to reboot both switches
last night. This morning, all of the resource groups were down (no one
could connect via Cluster Administrator, no file shares were up, etc).
The only fix is to bring up the resource groups manually.

I've spent the day reading newsgroups and sites on the web. A number
of people seem to have this problem but I don't see any real solutions.
A few people say it's "by design" (if the cluster can't get any
network connection for a while, it should fail). Considering our
regular file servers handle a couple switches rebooting just fine, this
new setup's "design" is less than acceptable.

My questions:

1.) Is there any way to NOT have the resource groups fail to the
point they need to be manually restarted if all nodes lose connectivity
for a few minutes? Is there a setting I can change somewhere (say,
"check every five minutes to see if network connectivity is back; if
it's down for 6 hours, fail the resources totally")?

2.) If they have to fail, is there a way to have the cluster try to
start up the failed resources when network connectivity is returned?
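
To make the behavior asked about in both questions concrete, here is a
minimal sketch of the retry policy in Python. It is only an illustration
of the desired logic, not an existing cluster setting: the function names
(next_action, simulate) and the action labels are hypothetical, and the
five-minute and six-hour values are just the figures from question 1.

```python
# Hypothetical sketch of the retry policy described above.
# None of these names correspond to a real cluster API.

CHECK_INTERVAL_S = 5 * 60       # "check every five minutes"
GIVE_UP_AFTER_S = 6 * 60 * 60   # "down for 6 hours, fail the resources totally"


def next_action(network_up, outage_seconds):
    """Decide what to do after one connectivity check."""
    if network_up:
        # Question 2: connectivity is back, so restart the groups.
        return "bring_online"
    if outage_seconds >= GIVE_UP_AFTER_S:
        # Question 1: only give up after a long, sustained outage.
        return "fail"
    return "wait"


def simulate(probe_results, interval=CHECK_INTERVAL_S):
    """Replay a series of checks (True = network up), one interval apart.

    Returns the action chosen after each check and stops once the
    policy decides to fail the resources for good.
    """
    actions = []
    outage = 0
    for up in probe_results:
        outage = 0 if up else outage + interval
        action = next_action(up, outage)
        actions.append(action)
        if action == "fail":
            break
    return actions


if __name__ == "__main__":
    # A ten-minute outage followed by recovery.
    print(simulate([False, False, True]))
```

In a real script, the "bring_online" step would presumably have to call
whatever actually brings a group online (for example the cluster.exe
command line or Cluster Administrator); that part is an assumption and
is deliberately left out of the sketch.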

Basically, how have other people with clusters gotten around this? Do
they just manually restart the groups every time it happens?

Thanks in advance for any help.
