Re: Microsoft Network Load Balancing is Unreliable!!
- From: "George" <georgefilippakos@xxxxxxxxxxx>
- Date: Fri, 2 Mar 2007 19:11:32 +0200
Hi again,
Many many thanks for your help.
I paid for a support request from Microsoft and they have helped me find
something that is different on the 3 servers.
On Server A under the Advanced NIC Driver Settings:
Receive Buffer Count = 128
On Server B & C under the Advanced NIC Driver Settings:
Receive Buffer Count = 40
I have changed the Receive Buffer Count to 128 on Servers B & C.
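A difference like this is easy to miss by eye, so here is a minimal sketch of how the advanced NIC settings could be compared across hosts. It assumes the settings have already been copied by hand into plain dictionaries; the helper name `find_setting_drift` and the dictionary layout are illustrative only, and only the Receive Buffer Count values come from this thread.

```python
def find_setting_drift(settings_by_host):
    """Return {setting: {host: value}} for every setting whose value
    is not identical on all hosts (a missing setting counts as None)."""
    all_keys = set()
    for settings in settings_by_host.values():
        all_keys.update(settings)

    drift = {}
    for key in sorted(all_keys):
        values = {host: s.get(key) for host, s in settings_by_host.items()}
        if len(set(values.values())) > 1:
            drift[key] = values
    return drift


# Values as they were before the change; the layout is an assumption.
nic_settings = {
    "Server A": {"Receive Buffer Count": 128},
    "Server B": {"Receive Buffer Count": 40},
    "Server C": {"Receive Buffer Count": 40},
}

for setting, values in find_setting_drift(nic_settings).items():
    print(setting, values)
```

Any setting that prints here differs on at least one host and is worth a second look before blaming the cluster software.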
So far the cluster has not failed! It has been 24 hours so far! Fingers &
toes are crossed :)
What is the Buffer Count and how does it relate? Why should it cause the NLB
to fail?
Feature Request: It would be great if NLB periodically tried to reestablish
itself with the old cluster if communications fail. So after splitting into
2 separate clusters, it tries to reconnect periodically.
On another note:
My hosting company have been very helpful and provided logs and error
reports from the switches.
There are no errors or maxed out connections so the switch and ports look
fine.
They reported that they were *shocked* with the way MS NLB works, they said
something about sending lots of broadcast traffic and suggested that we
should use another switch just for the cluster connections.
Jeff Hughes suggested something about putting the cluster NICs on a Hub and
uplinking it to the switch.
I may look into this if we are still getting problems.
Thanks for all your help!!
"Rodney R. Fournier" <rod@xxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote in message
news:373553B0-FD74-41E2-9E6F-0BA375821C20@xxxxxxxxxxxxxxxx
Ok, because you stated something different earlier or I misunderstood it.
Do you see anything different on the non-working server?
Cheers,
Rodney R. Fournier
MVP - Windows Server - Clustering
http://www.nw-america.com - Clustering Website
http://msmvps.com/clustering - Blog
http://www.clusterhelp.com - Cluster Training
ClusterHelp.com is a Microsoft Certified Gold Partner
"George" <georgefilippakos@xxxxxxxxxxx> wrote in message
news:es4ick$2hv$1@xxxxxxxxxxxxxxxxxx
All 3 NLB servers are DC's and are connected to the Internet. How else
would they be connected?
Yes, we have 3 DC's because we wanted all 3 NLB servers to have exactly
the same setup.
All 5 servers just run one website application - that's all.
"Rodney R. Fournier" <rod@xxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote in message
news:97875483-0098-4965-97AE-FE91CB1A57D2@xxxxxxxxxxxxxxxx
3 NLB servers but only 2 are DC's? You know that your DCs are directly
connected to the Internet using NLB in this way?
Cheers,
Rodney R. Fournier
MVP - Windows Server - Clustering
"George" <georgefilippakos@xxxxxxxxxxx> wrote in message
news:es4gpa$1sa$1@xxxxxxxxxxxxxxxxxx
Dear Rodney
The WAN NIC comes before the CLUSTER NIC.
It had all been working well until now with this strange problem with
one server only.
Are you saying that you should not use NLB on Domain Controllers?
What is annoying is that we did not want a domain controller but we
wanted to use DFS-R in our webfarm - so we had to set up a domain
controller just to use DFS-R. Then you cannot have just one DC - so we
had to have another DC as a backup.
Now, if we can't use a DC and NLB together then that means we need to
buy 2 additional servers just for Domain Controllers (one primary and
one backup) . That means to run our website we need:
3 NLB Applications Servers
1 Database server
1 Email/Ads Server
2 Domain Controllers
= a grand total of 7 Servers to run a website!
Please tell me this is not the case. I firmly believe that we can run
NLB on the DCs - Like I said it had all been working well up until now.
And even now, the problem is with only 1 server.
A net diag on one of the DC's shows:
Computer Name: AP02
DNS Host Name: AP02.MYDOMAIN.COM
System info : Microsoft Windows Server 2003 R2 (Build 3790)
Processor : x86 Family 15 Model 2 Stepping 9, GenuineIntel
List of installed hotfixes :
KB890046
KB893756
KB896358
KB896424
KB896428
KB898715
KB899587
KB899588
KB899589
KB899591
KB900725
KB901017
KB901214
KB902400
KB904706
KB905414
KB908519
KB908521
KB908531
KB910437
KB911164
KB911280
KB911562
KB911567
KB911897
KB911927
KB912919
KB914388
KB914389
KB917159
KB917344
KB917422
KB917537
KB917734
KB917953
KB918118
KB918439
KB918899
KB920213
KB920214
KB920670
KB920683
KB920685
KB921398
KB921883
KB922582
KB922616
KB922760
KB922819
KB923191
KB923414
KB923689
KB923694
KB923980
KB924191
KB924496
KB924667
KB925398_WMP64
KB925454
KB925486
KB926436
KB928090
KB928255
KB928843
KB929969
KB931836
Q147222
Netcard queries test . . . . . . . : Passed
Per interface results:
Adapter : WAN
Netcard queries test . . . : Passed
Host Name. . . . . . . . . : AP02
IP Address . . . . . . . . : X.Y.Z.151
Subnet Mask. . . . . . . . : 255.255.255.192
Default Gateway. . . . . . : X.Y.Z.129
Dns Servers. . . . . . . . : X.Y.Z.150
X.Y.Z.151
X.Y.Z.152
AutoConfiguration results. . . . . . : Passed
Default gateway test . . . : Passed
NetBT name test. . . . . . : Passed
[WARNING] At least one of the <00> 'WorkStation Service', <03> 'Messenger Service', <20> 'WINS' names is missing.
WINS service test. . . . . : Skipped
There are no WINS servers configured for this interface.
Adapter : CLUSTER
Netcard queries test . . . : Passed
Host Name. . . . . . . . . : AP02
IP Address . . . . . . . . : X.Y.Z.149
Subnet Mask. . . . . . . . : 255.255.255.192
Default Gateway. . . . . . :
NetBIOS over Tcpip . . . . : Disabled
Dns Servers. . . . . . . . :
AutoConfiguration results. . . . . . : Passed
Default gateway test . . . : Skipped
[WARNING] No gateways defined for this adapter.
NetBT name test. . . . . . : Skipped
NetBT is disabled on this interface. [Test skipped]
WINS service test. . . . . : Skipped
NetBT is disable on this interface. [Test skipped].
Global results:
Domain membership test . . . . . . : Passed
NetBT transports test. . . . . . . : Passed
List of NetBt transports currently configured:
NetBT_Tcpip_{SID_NUMBER_HIDDEN}
1 NetBt transport currently configured.
Autonet address test . . . . . . . : Passed
IP loopback ping test. . . . . . . : Passed
Default gateway test . . . . . . . : Passed
NetBT name test. . . . . . . . . . : Passed
[WARNING] You don't have a single interface with the <00> 'WorkStation Service', <03> 'Messenger Service', <20> 'WINS' names defined.
Winsock test . . . . . . . . . . . : Passed
DNS test . . . . . . . . . . . . . : Passed
PASS - All the DNS entries for DC are registered on DNS server 'X.Y.Z.150' and other DCs also have some of the names registered.
PASS - All the DNS entries for DC are registered on DNS server 'X.Y.Z.151' and other DCs also have some of the names registered.
PASS - All the DNS entries for DC are registered on DNS server 'X.Y.Z.152' and other DCs also have some of the names registered.
Redir and Browser test . . . . . . : Passed
List of NetBt transports currently bound to the Redir
NetBT_Tcpip_{SID_NUMBER_HIDDEN}
The redir is bound to 1 NetBt transport.
List of NetBt transports currently bound to the browser
NetBT_Tcpip_{SID_NUMBER_HIDDEN}
The browser is bound to 1 NetBt transport.
DC discovery test. . . . . . . . . : Passed
DC list test . . . . . . . . . . . : Passed
Trust relationship test. . . . . . : Passed
Secure channel for domain 'MYDOMAIN' is to '\\AP01.MYDOMAIN.COM'.
Kerberos test. . . . . . . . . . . : Passed
LDAP test. . . . . . . . . . . . . : Passed
Bindings test. . . . . . . . . . . : Passed
WAN configuration test . . . . . . : Skipped
No active remote access connections.
Modem diagnostics test . . . . . . : Passed
IP Security test . . . . . . . . . : Skipped
Note: run "netsh ipsec dynamic show /?" for more detailed information
The command completed successfully
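For anyone reading the output above: the WAN and CLUSTER adapters sit in the same subnet, and only the WAN adapter has a default gateway. Here is a minimal sketch that confirms the subnet arithmetic with Python's standard `ipaddress` module. The `192.0.2.` prefix is a documentation range assumed here in place of the masked X.Y.Z; only the last octets and the mask come from the output.

```python
import ipaddress

# Assumption: 192.0.2. stands in for the masked X.Y.Z prefix.
PREFIX = "192.0.2."
MASK = "255.255.255.192"

wan = ipaddress.ip_interface(PREFIX + "151/" + MASK)
cluster = ipaddress.ip_interface(PREFIX + "149/" + MASK)
gateway = ipaddress.ip_address(PREFIX + "129")

print(wan.network)                     # 192.0.2.128/26
print(wan.network == cluster.network)  # True
print(gateway in wan.network)          # True
```

Both adapters and the gateway fall in the same /26, which matters for the binding order and default gateway questions raised earlier in the thread.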
Any ideas? The server seems to be working correctly.
If I run a net diag on the server that is having problems, the only
difference is that it skips the Trust relationship test
Trust relationship test. . . . . . : Skipped
Are you saying that you can't use NLB on a DC?
"Rodney R. Fournier" <rod@xxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote in message
news:8122A811-10DC-4563-8128-2F9A6972B3CC@xxxxxxxxxxxxxxxx
Sounds like a recipe for disaster to me. What does your binding order
look like? NLB or regular NIC first? Which has the DG? Which has DNS?
What DNS does it use? Are you doing DNS forwarding?
I am not shocked that your NLBed DCs are having issues. I firmly
believe this is a very poor design.
Cheers,
Rodney R. Fournier
MVP - Windows Server - Clustering
"George" <georgefilippakos@xxxxxxxxxxx> wrote in message
news:es4brc$vfa$1@xxxxxxxxxxxxxxxxxx
Yes, we are using NLB to create a Webfarm Cluster on Domain
Controllers
The 3 Domain Controllers all have 2 NICs
In DC1:
NIC-1 is the WAN NIC - it has an external IP address (X.Y.Z.101)
NIC-2 is the CLUSTER NIC - it has an external IP address (X.Y.Z.100)
In DC2:
NIC-1 is the WAN NIC - it has an external IP address (X.Y.Z.102)
NIC-2 is the CLUSTER NIC - it has an external IP address (X.Y.Z.100)
In DC3:
NIC-1 is the WAN NIC - it has an external IP address (X.Y.Z.103)
NIC-2 is the CLUSTER NIC - it has an external IP address (X.Y.Z.100)
I don't think there is anything wrong with the configuration as the
whole setup has been working without any problems for the last 3
months. It is only recently that the NLB has started to fail on one
server.
We have not changed any configurations other than installing the latest
updates from Microsoft Windows Update. (We keep all servers up to date
with patches etc.)
"Rodney R. Fournier" <rod@xxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote in
message news:4B38A2FF-B0D1-4596-B8FD-0D86241604B5@xxxxxxxxxxxxxxxx
You are using NLB on a DC?
--
Cheers,
Rodney R. Fournier
MVP - Windows Server - Clustering
"George" <georgefilippakos@xxxxxxxxxxx> wrote in message
news:es4abl$usp$1@xxxxxxxxxxxxxxxxxx
Dear Jeff,
Many many thanks for your response.
1) Unfortunately, I am unable to use 1 NIC in the Clustered servers
as you suggest because each server in the cluster is a domain
controller and each server in the cluster must also communicate
with a Database server and an email server. There are 5 servers in
total running my website application. 3 of these servers are
clustered using NLB and run an ASP.NET application. Please see:
Single network interface card limitations with WLBS:
http://support.microsoft.com/kb/197999/EN-US/
2) I can assume that the switch is working correctly because Server
B and Server C (the other 2 servers in the Webfarm cluster) have
never had any communication problems with each other.
We also thought that it might be a hardware problem, so I asked
my hosting provider to change the network cables, switch ports and
network cards. Doing this did not fix the problem so we then
replaced the entire server. But this has still not fixed the
problem.
All 3 servers are exactly the same and have the same NIC hardware
and drivers.
3) I will start to work through the document you have provided to
try and find if there is a communication problem with Server A.
4) The reason why I say that NLB is unreliable is because if
communication problems occur and heartbeats are dropped (like you
suggest) then the offending server should be taken out of the
cluster and no more traffic sent to that server.
NLB should really do this: Detect that heartbeats have been lost in
Server A and suspend Server A in the cluster.
Instead, Server A, after losing communications, sets up a whole
new conflicting cluster. It then makes itself the 'default host'
and the only host in that conflicting cluster and then decides to
reject all HTTP traffic. So a third of our customers think the
website no longer exists. DNS: Error Server Not Found!
Here is our current solution:
a.. We wait till we notice a 33% drop in our website traffic (i.e.
a third of our customers can no longer access our website)
b.. We connect to Server A and load up NLB manager.
c.. We then add Host B or Host C to the Cluster; it converges and
starts working again.
d.. Once it is back up and running, it can run without problems
for maybe a few hours, sometimes a few days and then Server A drops
out of the cluster all over again.
Surely NLB should detect communication problems and suspend a
Server rather than set up a whole new conflicting cluster. NLB is
unreliable because it should gracefully fail or recover from a
communications error. Instead it creates a conflicting setup.
(Splits the webfarm into 2 conflicting clusters)
Here is that sequence of events again:
a.. Server A reports that Servers B and C are 'leaving the
cluster'
b.. At the same time Server B & C report that Server A is
'converging for an unknown reason'
c.. Server A has now lost track of Servers B & C and makes itself
the 'default host' and reports that it is the only host in the
Cluster.
d.. Server B and Server C both start to converge - they continue
to converge forever. They also report that all 3 Servers (A, B & C)
are still in the Cluster. They report that Server A has converged
and B & C are converging.
e.. Server A can still communicate with servers B & C - in fact
none of the servers have lost communication at all.
f.. NLB has now split itself into 2 separate Clusters, with both
Clusters sharing the same IP address!
g.. NLB continues to send HTTP requests/traffic to all 3 servers,
yet Server A refuses to respond to the requests and just times out,
even though there are no communication problems. (It is the NLB
software that is preventing communication, not the hardware as we
can still communicate from the outside world to the server using a
ping and from the inside world to the other servers.)
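The split described in the steps above can be spotted mechanically by comparing what each host reports as the cluster membership. This is only a sketch: how the views are collected is not covered in this thread (here they were read off NLB Manager on each server), and the function name `find_split` is made up for illustration.

```python
def find_split(views):
    """Group hosts by the membership each one reports.

    `views` maps a host name to the set of hosts it believes are in
    the cluster. More than one group means the hosts disagree.
    """
    groups = {}
    for host, members in views.items():
        groups.setdefault(frozenset(members), []).append(host)
    return groups


# Membership as reported by each server during the failure.
views = {
    "A": {"A"},
    "B": {"A", "B", "C"},
    "C": {"A", "B", "C"},
}

groups = find_split(views)
if len(groups) > 1:
    for members, reporters in groups.items():
        print(sorted(reporters), "report members", sorted(members))
```

With the views above there are two groups: A on its own, and B and C still listing all three hosts, which is exactly the two-cluster state.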
"Jeff Hughes [Microsoft]" <jeffhugh@xxxxxxxxxxxxxxxxxxxx> wrote in
message news:82C33BD0-148F-44C2-8DFB-43F174B229C0@xxxxxxxxxxxxxxxx
NLB is actually quite reliable. I know I work for the company
so I "have" to say that ;) but most of the issues I've seen with
NLB have to do with one of only a handful of areas:
NIC teaming
Old NIC drivers
Servers attached directly to layer 3 switch
Bad hardware at the network layer
Since you mentioned trying other ports, you've ruled out a bad
port so I'd start with the first 3 items as something to look at.
Also, you don't need a NIC for the cluster, just the WAN. I'd
remove that too to get rid of all the variables here. The reason
they are reconverging like you describe is that they aren't
reliably communicating with each other so I'd approach this just
like troubleshooting a point to point network issue.
There are some more troubleshooting steps in this KB:
812870 Network Load Balancing cluster node does not successfully
converge
http://support.microsoft.com/default.aspx?scid=kb;EN-US;812870
--
Jeff Hughes, MCSE
Support Escalation Engineer
Microsoft Enterprise Platforms Support (Server Core/Cluster)
"George" <georgefilippakos@xxxxxxxxxxx> wrote in message
news:es3mnn$mj5$1@xxxxxxxxxxxxxxxxxx
Can anybody please help me with this?
We are using Microsoft Network Load Balancing to load balance 3
web servers running 1 website application. It is very unreliable
and is now causing us to lose business.
I have 3 Servers, A B & C, each server is running Windows 2003
Release 2.
Each server has 2 NICs one for the CLUSTER and one for WAN.
The 3 servers have been configured to work together as a Webfarm
Cluster.
The servers all run perfectly together, sometimes for days and
then, for no apparent reason:
1) Server A will re-converge
2) Server A reports that Server B & C are leaving the
cluster
3) Server A then stops responding to all external HTTP
traffic (yet the server can be pinged)
When this happens, the log files on both Servers B & C report
that Server A is 'converging for an unknown reason'.
Servers B & C then start converging themselves - they never stop
converging.
If I open Network Load Balancing Manager on Server A, it shows
Server A is the only server in the cluster.
If I open Network Load Balancing Manager on Server B or Server C,
it shows all 3 Servers in the cluster. Server A has converged,
yet servers B & C are constantly converging.
Website visitors are being directed to all 3 servers, yet any
visitor sent to Server A gets a DNS error "host not found" even
though the website is running perfectly on this server and the
server can be pinged!!
To fix the problem, I have to delete the Webfarm Cluster on
Server A. Then connect to the Webfarm Cluster on Server B or
Server C and it works again for a day or two.
Then for an unknown reason, it stops working again.
The weird thing is that server B & C work perfectly, it is only
Server A that has the problem. We have swapped out server A for a
new server, we have checked the cables and switches, changed the
connections on the switch etc but we still get this problem.
Why does Microsoft Network Load Balancing do this? Surely if
there is an error on one server, it should not split into two
separate clusters and reject 1/3 of HTTP traffic!
NLB is supposed to increase reliability using fail over, yet it
causes unreliability problems itself by splitting a cluster into
2 and then refusing traffic and causing a DNS server not found
error!
How can I automatically get Server A to rejoin the Cluster after
it "converges for an unknown reason"
Could I set up some sort of monitoring so that if the cluster
splits, I could automatically force it to rejoin?
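As a rough idea of what such monitoring might look like, the sketch below probes each server over HTTP on its own dedicated address and lists the ones that do not answer. The addresses are placeholders from a documentation range, and the names `http_ok`, `failing_hosts` and `SERVERS` are invented for illustration. The rejoin step is deliberately left out, since nothing in this thread says how to script it.

```python
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if the URL answers with an HTTP status below 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 400
    except OSError:
        # URLError, HTTPError and socket timeouts all derive from OSError.
        return False


def failing_hosts(urls_by_host, probe=http_ok):
    """Return the hosts whose probe failed, in sorted order."""
    return sorted(h for h, url in urls_by_host.items() if not probe(url))


# Hypothetical per-server URLs; replace with the real dedicated addresses.
SERVERS = {
    "Server A": "http://192.0.2.10/",
    "Server B": "http://192.0.2.11/",
    "Server C": "http://192.0.2.12/",
}
```

Calling `failing_hosts(SERVERS)` from a scheduled job and alerting on a non-empty result would at least replace waiting for a 33% drop in traffic. The `probe` argument is there so the check can be swapped out or tested without touching the network.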
Also what other software or hardware solutions are available that
are more reliable than Microsoft NLB?
Can anyone please help me with this?