Re: Windows 2008 bizarre cluster issue
- From: "laurent" <laurentV@xxxxxxxxxx>
- Date: Tue, 3 Mar 2009 01:18:33 +0100
Hello,
I read the discussion and I'm really interested in your solution.
I have the same problem on my cluster: a Windows 2008 SCC connected to an IBM DS 5100.
After a few days, on the active node we are unable to browse using Explorer and can't run PowerShell commands, but Exchange mail flow is working correctly.
Can you give me more details about the settings MS told you to change?
You said: "MS then had us disable DMA, RSS, and TCP offloading on both the OS and NICs"
I can see how to change the settings on the NICs, but how do I change them in the OS?
Thanks a lot
Laurent
"Dale Kiefer" <DaleKiefer@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote in newsgroup message: 5A03281D-DD78-41BE-81C4-EB1A80BF159C@xxxxxxxxxxxxxxxx
Mike and previously John speak the truth.
We had this issue and went through a lengthy troubleshooting process with
both Microsoft and our SAN vendor. Based on our cluster logs, MS stated that
we had an issue with persistent reservations and the problem was with our
storage. After much troubleshooting with IBM (MPIO updates, controller
firmware upgrades, continuous forwarding of logs), IBM stated that the
controller was working as expected. MS then had us disable DMA, RSS, and TCP
offloading on both the OS and NICs. This was the key for us as we didn't
have all the settings disabled on our NICs.
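For the OS side that Laurent asks about, here is a minimal sketch, assuming "DMA" means NetDMA and "TCP offloading" means TCP Chimney Offload (the thread does not name the exact features). On Windows Server 2008, RSS and TCP Chimney are global TCP settings changed with `netsh`, while NetDMA is controlled by the `EnableTCPA` registry value. The helper name `os_offload_disable_commands` is hypothetical; it only prints the commands to run from an elevated command prompt, and a restart may be needed before the NetDMA change takes effect.

```python
# Sketch only: prints the commands that turn off RSS, TCP Chimney Offload and
# NetDMA at the OS level on Windows Server 2008. Run them from an elevated
# command prompt. Assumption: "DMA" = NetDMA, "TCP offloading" = TCP Chimney.
# NIC-level offloads still have to be changed in each adapter's Advanced tab.
from typing import List

TCPIP_PARAMS = r"HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"


def os_offload_disable_commands() -> List[str]:
    """Return the command lines, in the order they should be run."""
    return [
        # Receive Side Scaling (global TCP setting)
        "netsh int tcp set global rss=disabled",
        # TCP Chimney Offload (global TCP setting)
        "netsh int tcp set global chimney=disabled",
        # NetDMA on Windows Server 2008 is governed by EnableTCPA (0 = off);
        # a restart may be needed before this takes effect.
        f'reg add "{TCPIP_PARAMS}" /v EnableTCPA /t REG_DWORD /d 0 /f',
        # Show the resulting global TCP settings for verification
        "netsh int tcp show global",
    ]


if __name__ == "__main__":
    for cmd in os_offload_disable_commands():
        print(cmd)
```

The last command lets you confirm that RSS and Chimney now show as disabled; the registry value has to be checked separately (for example in regedit).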
We did not change any flow control settings.
This was a very frustrating experience for us. We disabled these 3 settings
as per MS because they've seen "all kinds of networking/intra cluster
communication issues" with them. It would have been nice to have been told
this before they passed us off and we spent all our time with our SAN vendor.
Hopefully these postings will help save others a lot of time.
"Mike Gentile" wrote:
Hey guys,
I understand your pain and frustration on this item, as we were also
experiencing this issue. We tried absolutely everything, including phone
calls to Microsoft, and everyone always wants you to upgrade either the
drivers or the firmware to solve your problem. What seemed to stabilize our
environment was FLOW CONTROL. Here is how our environment is set up and what
you need to do:
We have 2 nodes in the cluster. Each node is an IBM blade, and they are
located in independent chassis. Each node has 6 NICs, 5 of which we are
using: we team 2 on the public side, we use 2 NICs on the iSCSI side
(SAN) with MPIO, and 1 NIC is on the private network for the heartbeat. We
use EqualLogic for our SAN storage. Inside the IBM chassis we are using
Cisco switches, and our core switch stack is composed of 3750s.
This is how we arrived at our resolution (we did have an issue with our
cluster yesterday, but it doesn't look related). On our chassis we noticed
informational log messages about duplicate routes. So we decided to call
our SAN vendor, and they immediately told us that because we didn't have
flow control enabled on the iSCSI NICs or on the switchports, packets were
being dropped. This would explain the disconnects from the quorum drive,
which was located on the SAN.
To solve this issue, make sure of the following:
-Enable flow control on the virtual ports of the internal chassis switches
that your iSCSI NICs are connected to. (You can do this for the entire iSCSI
VLAN.)
-Enable flow control on the ports of your core switch that your blade
switches plug into.
-In your iSCSI NIC settings, make sure that Flow Control is set to
Rx & Tx Enabled, Checksum Offload is set to None, and Large Send Offload is
set to Disabled.
-How you set up flow control on the SAN side depends on your SAN.
Since we are using EqualLogic with the latest firmware, it automatically
adjusts to the network settings of the switch port.
-We use the iSCSI Initiator. We made sure that we had 2 target portals
set up under the Discovery tab, 1 for each iSCSI NIC, each pointing to the
group address of our SAN. Then for each of the targets listed on the
Targets tab we made 2 connections.
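The switch-side steps in the list above can be sketched as a small generator of Cisco IOS interface configuration. This is a sketch under stated assumptions: the interface names are hypothetical placeholders, and it uses `flowcontrol receive on` because Catalyst 3750-class switches typically only honour received pause frames, so there is no `send` setting to change. In IOS, flow control is an interface command rather than a VLAN setting, so "the entire iSCSI VLAN" in practice means every switchport in that VLAN.

```python
# Sketch: generate Cisco IOS config lines that enable receive-side flow
# control on the switchports carrying iSCSI traffic.
# All interface names used here are hypothetical placeholders.
from typing import Iterable, List


def flowcontrol_config(ports: Iterable[str]) -> List[str]:
    """Return IOS config lines enabling 'flowcontrol receive on' per port."""
    lines = ["configure terminal"]
    for port in ports:
        lines.append(f"interface {port}")
        lines.append(" flowcontrol receive on")
    lines.append("end")
    return lines


if __name__ == "__main__":
    # Hypothetical blade-switch internal ports facing the two iSCSI NICs
    blade_ports = ["GigabitEthernet0/1", "GigabitEthernet0/2"]
    # Hypothetical core-switch (3750 stack) ports the blade switches uplink to
    core_uplinks = ["GigabitEthernet1/0/47", "GigabitEthernet2/0/47"]
    print("\n".join(flowcontrol_config(blade_ports)))
    print("\n".join(flowcontrol_config(core_uplinks)))
```

Paste the output into a console session on each switch and save it with `copy running-config startup-config`.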
This is what helped us, and I would like to thank Jill Mansfield of Dell for
providing us with this fix. She saved us. Hopefully this will save you too.
As an FYI, you do not need to enable flow control on your public-facing NICs
or virtual switchports.
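The NIC-side advice (the iSCSI adapter settings in the list, plus the note that public-facing NICs need no flow control) can be written as a per-role plan. This is a sketch: the role labels, adapter names and the `settings_plan` helper are hypothetical, the setting names follow the wording in the post rather than any particular driver's property names, and leaving the heartbeat NIC unchanged is an assumption, since the post only mentions the iSCSI and public NICs.

```python
# Sketch: per-NIC advanced settings from the post, keyed by NIC role.
# Role labels, adapter names and setting keys are hypothetical; real driver
# property names vary by NIC vendor and driver version.
from typing import Dict, List

ISCSI_SETTINGS: Dict[str, str] = {
    "Flow Control": "Rx & Tx Enabled",
    "Checksum Offload": "None",
    "Large Send Offload": "Disabled",
}

ROLE_SETTINGS: Dict[str, Dict[str, str]] = {
    "iscsi": ISCSI_SETTINGS,
    "public": {},     # per the post: no flow control needed on public NICs
    "heartbeat": {},  # assumption: left unchanged, not covered by the post
}


def settings_plan(nic_roles: Dict[str, str]) -> List[str]:
    """List the changes to make, one line per NIC/setting pair."""
    plan = []
    for nic, role in sorted(nic_roles.items()):
        for setting, value in ROLE_SETTINGS[role].items():
            plan.append(f"{nic}: {setting} = {value}")
    return plan


if __name__ == "__main__":
    # Hypothetical adapter names matching the layout described in the post:
    # 2 teamed public NICs, 2 iSCSI NICs (MPIO), 1 heartbeat NIC.
    layout = {
        "Public1": "public",
        "Public2": "public",
        "iSCSI1": "iscsi",
        "iSCSI2": "iscsi",
        "Heartbeat": "heartbeat",
    }
    for line in settings_plan(layout):
        print(line)
```

Keeping the plan keyed by role makes it easy to apply the same iSCSI settings on both cluster nodes and to see at a glance that the public and heartbeat adapters are untouched.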
Merry Christmas