RE: cluster completely unavailable




I tried this solution:

The NICs were teamed in "load balancing and failover" mode.
I read somewhere that Microsoft does not support load balancing, so now the
NICs are teamed for failover only, with no load balancing.
Could this have been the cause of the problem?

Please advise.



"Frank" wrote:

On 30.04.2009, during the night, when nobody is working and only the backup is
running, all 3 cluster nodes became completely unavailable.
The network team says that the switches were OK, the network was OK, and DNS
and WINS were also OK, so the network was up and running fine for all our 30
servers. Only the cluster was affected.

The Windows system event log says:
The node lost communication with cluster node 'SRV0002A' on network
'CLS00002_Public'.
The node lost communication with cluster node 'SRV0002B' on network
'CLS00002_Public'.
The node lost communication with cluster node 'SRV0002C' on network
'CLS00002_Public'.
and then:
The interface for cluster node 'SRV0002C' on network 'CLS00002_Public' is
unreachable by at least one other cluster node attached to the network. the
server cluster was not able to determine the location of the failure. Look
for additional entries in the system event log indicating which other nodes
have lost communication with node SRV0002C. If the condition persists, check
the cable connecting the node to the network. Next, check for hardware or
software errors in the node's network adapter. Finally, check for failures in
any other network components to which the node is connected such as hubs,
switches, or bridges.

then:
Cluster network 'CLS00002_Public' is partitioned. Some attached server
cluster nodes cannot communicate with each other over the network. The server
cluster was not able to determine the location of the failure. Look for
additional entries in the system event log indicating which nodes have lost
communication. If the condition persists, check for failures in any network
components to which the nodes are connected such as hubs, switches, or
bridges. Also check for hardware or software errors in the adapters that
attach the nodes to the network.
then:
The interface for cluster node 'SRV0002C' on network 'CLS00002_Public' is
operational (up). The node can communicate with all other available cluster
nodes on the network.
The interface for cluster node 'SRV0002A' on network 'CLS00002_Public'
failed. If the condition persists, check the cable connecting the node to the
network. Next, check for hardware or software errors in node's network
adapter. Finally, check for failures in any network components to which the
node is connected such as hubs, switches, or bridges.
The interface for cluster node 'SRV0002B' on network 'CLS00002_Public' is
operational (up). The node can communicate with all other available cluster
nodes on the network.
Cluster network 'CLS00002_Public' is operational (up). All available server
cluster nodes attached to the network can communicate using it.

and finally the first RED ERROR:
The TCP/IP interface for Cluster IP Address 'SQL IP Address1(VSRVSQL)' has
failed.

after 1 minute:
This computer was not able to set up a secure session with a domain
controller in domain LUGANO due to the following:
There are currently no logon servers available to service the logon request.
This may lead to authentication problems. Make sure that this computer is
connected to the network. If the problem persists, please contact your domain
administrator.

and:

The master browser has received a server announcement from the computer
SRV0AD01 that believes that it is the master browser for the domain on
transport NetBT_Tcpip_{8E9F3304-6211-4472-. The master browser is stopping or
an election is being forced.

Here is a part of the cluster log:


00000848.000009c0::2009/04/30-00:34:09.222 INFO [ClMsg] Received interface
unreachable event for node 1 network 1
00000848.00000944::2009/04/30-00:34:09.222 WARN [NM] Communication was lost
with interface 4358ed8c-0534-4a6b-b396-2fb47e46baad (node: SRV0002A, network:
CLS00002_Public)
00000848.00000fb4::2009/04/30-00:34:12.035 WARN [NM] Interface
0d770674-50a5-4dd6-9d40-ee7d643fe932 is unreachable (node: SRV0002C, network:
CLS00002_Public).
00000848.00000fb4::2009/04/30-00:34:12.035 WARN [NM] Interface
4358ed8c-0534-4a6b-b396-2fb47e46baad is unreachable (node: SRV0002A, network:
CLS00002_Public).
00000848.00000fb4::2009/04/30-00:34:12.035 WARN [NM] Interface
1acfb388-0ad4-4b78-8cb2-39c04a6b888a is unreachable (node: SRV0002B, network:
CLS00002_Public).
00000848.00000fb4::2009/04/30-00:34:12.035 WARN [NM] Network
c120d86b-291f-45b7-bed6-39eda87acc33 (CLS00002_Public) is partitioned.
00000848.00000fb4::2009/04/30-00:34:12.035 INFO [GUM] s_GumUpdateNode:
completed update seq 225906 type 2 context 15
00000a08.00000cb0::2009/04/30-00:34:12.035 WARN IP Address <Cluster IP
Address>: WorkerThread: NetInterface 1acfb388-0ad4-4b78-8cb2-39c04a6b888a
changed to state 2.
00000a08.00000cb0::2009/04/30-00:34:12.035 WARN IP Address <IP Address
10.1.0.153>: WorkerThread: NetInterface 1acfb388-0ad4-4b78-8cb2-39c04a6b888a
changed to state 2.
00000848.00000fb4::2009/04/30-00:34:15.785 INFO [GUM] s_GumUpdateNode:
dispatching seq 225907 type 2 context 15
00000848.00000fb4::2009/04/30-00:34:15.785 INFO [NM] Received update to set
state for network c120d86b-291f-45b7-bed6-39eda87acc33.
00000848.00000fb4::2009/04/30-00:34:15.785 INFO [NM] Interface
0d770674-50a5-4dd6-9d40-ee7d643fe932 is up (node: SRV0002C, network:
CLS00002_Public).
00000848.00000fb4::2009/04/30-00:34:15.785 WARN [NM] Interface
4358ed8c-0534-4a6b-b396-2fb47e46baad failed (node: SRV0002A, network:
CLS00002_Public).
00000848.00000fb4::2009/04/30-00:34:15.785 INFO [NM] Interface
1acfb388-0ad4-4b78-8cb2-39c04a6b888a is up (node: SRV0002B, network:
CLS00002_Public).
00000848.00000fb4::2009/04/30-00:34:15.785 WARN [NM] Network
c120d86b-291f-45b7-bed6-39eda87acc33 (CLS00002_Public) is up.
00000848.00000fb4::2009/04/30-00:34:15.785 INFO [GUM] s_GumUpdateNode:
completed update seq 225907 type 2 context 15
00000a08.00000cb0::2009/04/30-00:34:15.785 WARN IP Address <Cluster IP
Address>: WorkerThread: NetInterface 1acfb388-0ad4-4b78-8cb2-39c04a6b888a
changed to state 3.
00000a08.00000cb0::2009/04/30-00:34:15.785 WARN IP Address <IP Address
10.1.0.153>: WorkerThread: NetInterface 1acfb388-0ad4-4b78-8cb2-39c04a6b888a
changed to state 3.
00000848.00000944::2009/04/30-05:51:55.747 WARN [NM] Communication was lost
with interface 0d770674-50a5-4dd6-9d40-ee7d643fe932 (node: SRV0002C, network:
CLS00002_Public)
00000848.000009bc::2009/04/30-05:51:55.747 INFO [RGP] Node 2: RGP
Unicast: 0x2, 0x0, 0x0, 0x0.
00000848.000009bc::2009/04/30-05:51:55.747 INFO [RGP] Node 2: RGP Incoming
pkt: 0x3fff, 0x44, 0x3, 0x2.
00000848.000009bc::2009/04/30-05:51:55.747 INFO [RGP] Node 2: RGP recv
pkt : 0x440003, 0xc000c000, 0xc0000000, 0x1.
00000848.000009bc::2009/04/30-05:51:55.747 INFO [RGP] Node 2: RGP
Unicast: 0x3, 0x0, 0x0, 0x0.
00000848.00000944::2009/04/30-05:51:55.747 INFO [NM] Started connectivity
report timer (600ms) for network c120d86b-291f-45b7-bed6-39eda87acc33
(CLS00002_Public)
00000848.00000944::2009/04/30-05:51:55.747 WARN [NM] Communication was lost
with interface 8fd55b06-7512-4a1f-a231-fe6f7c406c26 (node: SRV0002C, network:
Private)
00000848.000009c8::2009/04/30-05:52:00.497 WARN [EVT] EvtBroadcaster:
EvPropEvents for node 3 failed. status 1818
00000848.00000644::2009/04/30-05:52:01.622 WARN [ClNet] Tcpip is not bound
to adapter 2ADFF0E1-B750-4142-8C0E-9FEACE63A57D.
00000848.00000644::2009/04/30-05:52:01.622 WARN [ClNet] Tcpip is not bound
to adapter 7BFD012D-60C5-4CEF-8E9F-04809ADFADBF.
00000848.00000644::2009/04/30-05:52:01.622 WARN [ClNet] Tcpip is not bound
to adapter 45B8DC3C-D526-4467-8191-A1F877E01FC4.
00000848.00000ef8::2009/04/30-05:52:01.638 WARN [NM] Interface
8fd55b06-7512-4a1f-a231-fe6f7c406c26 is unavailable (node: SRV0002C, network:
Private).
00000848.00000ef8::2009/04/30-05:52:01.638 INFO [GUM] s_GumUpdateNode:
completed update seq 225916 type 2 context 15
00000848.00000ef8::2009/04/30-05:52:01.653 INFO [NM] Received request to get
ping address enum for interface 1acfb388-0ad4-4b78-8cb2-39c04a6b888a.
00000848.00000ef8::2009/04/30-05:52:01.669 INFO [NM] Received request to
ping targets for interface 1acfb388-0ad4-4b78-8cb2-39c04a6b888a.
00000848.00000ef8::2009/04/30-05:52:01.669 INFO [NM] Pinging targets for
interface 1acfb388-0ad4-4b78-8cb2-39c04a6b888a.
00000848.00000ef8::2009/04/30-05:52:01.669 INFO [NM] Pinging host 10.1.0.1
00000848.00000ef8::2009/04/30-05:52:01.669 INFO [NM] Ping of host 10.1.0.1
succeeded.
00000848.00000ef8::2009/04/30-05:52:01.669 INFO [NM] Finished pinging
targets for interface 1acfb388-0ad4-4b78-8cb2-39c04a6b888a.
00000848.00000ef8::2009/04/30-05:52:05.419 INFO [GUM] s_GumUpdateNode:
dispatching seq 225917 type 0 context 18
00000848.00000ef8::2009/04/30-05:52:05.419 INFO [FM]
FmpUpdateUseRandomizedNodeListForGroups: 3 node down has been processed
already...
00000848.00000ef8::2009/04/30-05:52:05.419 INFO [GUM] s_GumUpdateNode:
completed update seq 225917 type 0 context 18
00000848.00000ef8::2009/04/30-05:52:17.419 INFO [GUM] s_GumUpdateNode:
dispatching seq 225918 type 2 context 15
00000848.00000ef8::2009/04/30-05:52:17.419 INFO [NM] Received update to set
state for network c120d86b-291f-45b7-bed6-39eda87acc33.
00000848.00000ef8::2009/04/30-05:52:17.419 WARN [NM] Interface
0d770674-50a5-4dd6-9d40-ee7d643fe932 is unavailable (node: SRV0002C, network:
CLS00002_Public).
00000848.00000ef8::2009/04/30-05:52:17.419 INFO [GUM] s_GumUpdateNode:
completed update seq 225918 type 2 context 15
00000848.00000980::2009/04/30-05:55:00.190 INFO [Qfs] GetDiskFreeSpaceEx
Z:\MSCS\, status 0
00000848.0000084c::2009/04/30-05:59:19.448 INFO [CS] Received service
shutdown command
00000848.0000092c::2009/04/30-05:59:19.464 WARN [INIT] The cluster service
is shutting down.
00000848.0000092c::2009/04/30-05:59:19.464 INFO [EVT] EvShutdown
00000848.0000092c::2009/04/30-05:59:19.464 WARN [FM] Shutdown: Failover
Manager requested to shutdown groups.
00000848.0000092c::2009/04/30-05:59:19.464 INFO [FM] FmpCleanupGroups: Entry
00000848.00000408::2009/04/30-05:59:19.464 INFO [FM] FmpCleanupGroupsWorker:
Entry
00000848.00000408::2009/04/30-05:59:19.464 INFO [FM] FmpCleanupGroupsPhase1:
Entry, Group = 887f74f8-9777-4ace-97d9-dddfb6b15b4a
00000848.00000408::2009/04/30-05:59:19.464 INFO [FM]
FmpNotifyGroupStateChangeReason: Notifying group Cluster Group
[887f74f8-9777-4ace-97d9-dddfb6b15b4a] of state change reason 4...
00000848.00000408::2009/04/30-05:59:19.464 INFO [FM] FmpOfflineGroup,
Group=887f74f8-9777-4ace-97d9-dddfb6b15b4a
00000848.00000408::2009/04/30-05:59:19.464 INFO [FM] FmpOfflineResource:
Cluster Name depends on Cluster IP Address. Shut down first.
00000848.00000408::2009/04/30-05:59:19.464 INFO [FM] FmpRmOfflineResource:
InterlockedIncrement on gdwQuoBlockingResources for resource
866483a6-4c2d-49c7-b143-2d26d08b86b7
00000a08.00000a24::2009/04/30-05:59:19.464 INFO Network Name <Cluster Name>:
Taking resource offline...
00000a08.00000b40::2009/04/30-05:59:19.464 INFO Network Name <Cluster Name>:
Offline of resource continuing...
00000848.00000408::2009/04/30-05:59:19.464 INFO [FM] FmpRmOfflineResource:
RmOffline() for 866483a6-4c2d-49c7-b143-2d26d08b86b7 returned error 997
00000a08.00000c34::2009/04/30-05:59:19.464 INFO Network Name: time until
next DNS reg: 2009/04/30-15:07:48 (128855776686664899)
00000848.00000408::2009/04/30-05:59:19.464 INFO [GUM] GumSendUpdate: queuing
update type 0 context 8
00000a08.00000b40::2009/04/30-05:59:19.464 WARN Network Name <Cluster Name>:
Failed to delete server name CLS00002, status 5.
00000a08.00000b40::2009/04/30-05:59:19.464 WARN Network Name <Cluster Name>:
Failed to delete server name CLS00002, status 5.
00000a08.00000b40::2009/04/30-05:59:19.464 INFO Network Name <Cluster Name>:
Deleted workstation name CLS00002 from transport 0.
00000848.00000408::2009/04/30-05:59:19.464 INFO [GUM] GumSendUpdate:
Dispatching seq 225919 type 0 context 8 to node 2
00000848.00000408::2009/04/30-05:59:19.464 INFO [GUM] GumSendUpdate:
completed update seq 225919 type 0 context 8
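
For anyone sifting through a log like the one above, a small script can help pull out just the WARN entries with their timestamps. The following is only a rough sketch in Python: it assumes each entry starts with the `<process id>.<thread id>::<date>-<time> <LEVEL>` header seen in the excerpt, and that wrapped or run-together lines (as pasted here) belong to the preceding header. The function name `parse_entries` and the `SAMPLE` text are illustrative, not part of any cluster tool.

```python
import re
from datetime import datetime

# Split wherever a new entry header begins, even if it was pasted
# onto the end of the previous line (requires Python 3.7+).
HEADER_SPLIT = re.compile(r"(?=[0-9a-f]{8}\.[0-9a-f]{8}::\d{4}/)")

# Header layout assumed from the excerpt:
# <pid>.<tid>::<YYYY/MM/DD-HH:MM:SS.mmm> <LEVEL> <message>
ENTRY_RE = re.compile(
    r"(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+(?P<msg>.*)"
)


def parse_entries(text):
    """Return a list of dicts (time, level, msg), one per log entry."""
    entries = []
    for chunk in HEADER_SPLIT.split(text):
        # Re-join lines that were wrapped by the mail client.
        flat = " ".join(chunk.split())
        m = ENTRY_RE.match(flat)
        if m is None:
            continue  # empty chunk or text before the first header
        entries.append({
            "time": datetime.strptime(m["ts"], "%Y/%m/%d-%H:%M:%S.%f"),
            "level": m["level"],
            "msg": m["msg"],
        })
    return entries


# A few entries copied from the log above, including the run-together one.
SAMPLE = """\
00000848.00000944::2009/04/30-05:51:55.747 WARN [NM] Communication was lost
with interface 8fd55b06-7512-4a1f-a231-fe6f7c406c26 (node: SRV0002C, network:
Private)00000848.000009c8::2009/04/30-05:52:00.497 WARN [EVT] EvtBroadcaster:
EvPropEvents for node 3 failed. status 1818
00000848.00000ef8::2009/04/30-05:52:01.669 INFO [NM] Pinging host 10.1.0.1
"""

warnings = [e for e in parse_entries(SAMPLE) if e["level"] == "WARN"]
for e in warnings:
    print(e["time"], e["msg"])
```

Running the same filter over the cluster.log of each node and comparing the timestamps side by side may make it easier to see which node lost which interface first.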

What could cause such a situation?
Where would you suggest looking for the cause?

Please advise.
