Re: Cluster will not fail over.



You certainly need to look into their findings.

1. Could not recover the master database! The log startup parameter has a
missing colon ":" in the registry settings that locates the master
transaction log file.

2. TCP Bind error. This usually occurs because the IP was not reset on
startup; however, in this case, it may just be an artifact of the
non-recovery of the master database.

3. Geoff is right about the SP level: it is OS not the SQL Server that they
were referring to. However, you do show that you are only running SS SP3.
If you intend to stay on SP3 (for SQL Server), you should at least apply the
latest publicly available build, 878. Just ignore that the KB was directed
at the SQL Server supporting BizTalk Server.


You must install the SQL Server 2000 update that KB article 831950 describes
to run BizTalk Server 2004
http://support.microsoft.com/?kbid=838166


Also since I last responded, it looks like Microsoft has posted a Cumulative
post-SP4 hotfix that is publicly available:


A cumulative hotfix package is available for SQL Server 2000 Service Pack 4
build 2162
http://support.microsoft.com/?Kbid=904660


Sincerely,


Anthony Thomas


--

"Geoff N. Hiten" <SQLCraftsman@xxxxxxxxx> wrote in message
news:%234NP$hwCGHA.516@xxxxxxxxxxxxxxxxxxxxxxx
> The SP4 information refers to the OS, not SQL.
>
> The differences appear to be related to different patches on the OS and to
a
> posible mismatch in MDAC binaries. I would download MDAC 2.8 and install
it
> on each node. MDAC installs are not cluster aware.
>
> Procedures actually help with growth. They are a way of sharing
> institutional knowledge without long, mistake-filled training periods.
> Start with the biggest pain points and work outward from there. They
don't
> have to be complex or involved. They should help guide a professional
> towards the correct solution for a given situation. The entire concept
> needs high level approval and buy-in. Believe me, a few more episodes
like
> this and top management will buy into the idea.
>
>
> --
> Geoff N. Hiten
> Senior Database Administrator
> Microsoft SQL Server MVP
>
>
>
> "Admiral" <admiral@xxxxxxxxxxxxxxxxxxx> wrote in message
> news:%23sVLqVwCGHA.3920@xxxxxxxxxxxxxxxxxxxxxxx
> > Below is the response from Microsoft as promised. I haven't had a
chance
> > to look into their findings, but one thing is for certain, I did not
> > install SP4 for SQL, which they seem to think I have. I will be
> > scheduling to apply these fixes for this upcoming Friday night and let
you
> > guys know how things turn out.
> >
> > Jeff, the current growth of our company has been extremely fast. We are
> > definitely changing from a (mom-n-pop) setup to corporate. I will be
the
> > first to tell you that I am no-seasoned SQL veteran, but have been doing
> > my best to keep things afloat. I truly thank you all for your
responses.
> >
> >
> >
> > ANALYSIS OF MPS REPORTS
> >
> > =========================
> >
> > Node 1
> >
> > ======
> >
> > Node Name --------- EMPOWER01-SAN
> >
> > Node OS Version --- 2195 Windows 2000 Advanced Server
> >
> > Node Service Pack - Service Pack 4
> >
> >
> >
> > Node 2
> >
> > ======
> >
> > Node Name --------- EMPOWER02-SAN
> >
> > Node OS Version --- 2195 Windows 2000 Advanced Server
> >
> > Node Service Pack - Service Pack 4
> >
> >
> >
> > Default instance
> >
> > SQL Network Name(EMPOWER01)
> >
> >
> >
> > -- PAE enabled.
> >
> >
> >
> > NODE 1
> >
> > =======
> >
> >
> >
> > Network Name Network Priority
> >
> > ============ ================
> >
> > Encore LAN GIG 1
> >
> > Heartbeat 2
> >
> >
> >
> > 1. ACTION -- REVERSE THE NETWORK PRIORITY. First Heartbeat followed by
> > Encore.
> >
> >
> >
> >
============================================================================
==============
> >
> > Network Interfaces Binding
Order -
> > NO CHANGE REQUIRED.
> >
> >
============================================================================
==============
> >
> >
> >
> > 1. Encore LAN GIG
> >
> > 2. Heartbeat
> >
> > 3. NdisWanIp
> >
> >
> >
> >
> >
> >
> >
> > NODE 2
> >
> > =======
> >
> >
> >
> >
> >
> > Network Name Network Priority
> >
> > ============ ================
> >
> > Encore LAN GIG 1
> >
> > Heartbeat 2
> >
> >
> >
> >
> >
> >
============================================================================
==============
> >
> > Network Interfaces Binding
> > Order -- NO CHANGE REQUIRED.
> >
> >
============================================================================
==============
> >
> >
> >
> > 1. Encore LAN GIG
> >
> > 2. Heartbeat
> >
> > 3. NdisWanIp
> >
> >
> >
> >
> >
> > -- There are some hotfixes difference between the nodes. - We recommend
> > same hotfixes on both the nodes. There might be unexpected outages if we
> > have these differences.
> >
> >
> >
> > KB32911 -- missing from EMPOWER01-SAN_SQL
> >
> > KB82283 -- missing from EMPOWER02-SAN_SQL
> >
> >
> >
> > -- Found some files missing or version mismatches. - We recommend same
> > version of system files on both the nodes. There might be unexpected
> > outages if we have these differences.
> >
> >
> >
> > C:\WINNT\SYSTEM32\DLLCACHE\OLEDB32R.DLL
> >
> > has different versions on both the nodes
> >
> > On node 1.. file version is (2.70:9001.0)
> >
> > On node 2.. file version is (2.60:6526.0)
> >
> >
> >
> > C:\WINNT\SYSTEM32\DRIVERS\B57W2K.SYS
> >
> > On node 1.. file version is (7.43:0.0)
> >
> > On node 2.. file version is (6.34:4.0)
> >
> >
> >
> > C:\WINNT\SYSTEM32\MACROMED\FLASH\GETFLASH.EXE
> >
> > On node 1.. FILE NOT FOUND.
> >
> > On node 2.. file version is (7.0:19.0)
> >
> >
> >
> > C:\WINNT\SYSTEM32\NEWDEV.DLL
> >
> > On node 1.. file version is(5.0:2195.6754)
> >
> > On node 2.. file version is (5.0:2195.6666)
> >
> >
> >
> > C:\WINNT\SYSTEM32\SEAREACH4.DLL
> >
> > On node 1.. file version is (4.15:1022.0)
> >
> > On node 2.. FILE NOT FOUND.
> >
> >
> >
> > C:\WINNT\SYSTEM32\SKYLON.DLL
> >
> > On node 1.. file version is
> >
> > On node 2.. FILE NOT FOUND.
> >
> >
> >
> > C:\WINNT\SYSTEM32\SKYLON2.DLL
> >
> > On node 1.. file version is (2.11:1020.0)
> >
> > On node 2.. FILE NOT FOUND.
> >
> >
> >
> > C:\WINNT\SYSTEM32\WRITERPDF.DLL
> >
> > On node 1.. file version is (1.20:1047.0)
> >
> > On node 2.. FILE NOT FOUND.
> >
> >
> >
> > From SQL server logs
> >
> > log 6, 5,4,3,2,1
> >
> > -------------------
> >
> > 2005-12-19 01:19:43.26 server Failed to obtain
> > TransactionDispenserInterface: Result Code = 0x8004d01b
> >
> > 2005-12-19 01:19:43.26 spid3 Starting up database 'master'.
> >
> > 2005-12-19 01:19:43.28 spid3 Error: 9003, Severity: 20, State: 1.
> >
> > 2005-12-19 01:19:43.28 spid3 Cannot recover the master database.
> > Exiting.
> >
> >
> >
> > Current SQL logs
> >
> > -----------------------
> >
> > 2005-12-19 01:22:34.53 server Microsoft SQL Server 2000 - 8.00..760
> > (Intel X86)
> >
> > Enterprise Edition on Windows NT 5.0 (Build 2195: Service
Pack
> > 4)
> >
> > 2005-12-19 01:22:34.54 server Logging SQL Server messages in file
> > 'o:\logs\ERRORLOG'.
> >
> > 2005-12-19 01:22:34.54 server SQL Server is starting at priority
class
> > 'normal'(8 CPUs detected).
> >
> > 2005-12-19 01:22:34.98 server Using 'SSNETLIB.DLL' version
'8.0.766'..
> >
> > 2005-12-19 01:22:34.98 spid3 Server name is 'EMPOWER01'.
> >
> > 2005-12-19 01:22:35.03 server SuperSocket Info: Bind failed on TCP
port
> > 1433. --SQL is not listening on TCP --
> >
> > 2005-12-19 01:22:35.54 server SQL server listening on Shared Memory,
> > Named Pipes.
> >
> >
> >
> > SP4 setup
> >
> > 19:59:25 Installation Succeeded
> >
> >
> >
> > From Application and system logs.
> >
> >
> >
> > Application 12/19/2005 1:16 error 17052 MSSQLSERVER
> > "Failover" EMPOWER02-SAN [sqsrvres] OnlineThread:
> > Error 435 bringing resource online.\r\n\r\n
> >
> > Application 12/19/2005 1:16 error 17052 MSSQLSERVER
> > "Failover" EMPOWER02-SAN [sqsrvres] OnlineThread:
> > ResUtilsStartResourceService failed (status 435)\r\n\r\n
> >
> > Application 12/19/2005 1:16 error 17052 MSSQLSERVER
> > "Failover" EMPOWER02-SAN [sqsrvres]
> > StartResourceService: Failed to start MSSQLSERVER service.
CurrentState:
> > 1\r\n\r\n
> >
> > Application 12/19/2005 1:16 error 17055 MSSQLSERVER
> > "Server" EMPOWER01 18052 :\r\nError:
> > 5105, Severity: 16, State: 4.\r\n\r\n
> >
> > Application 12/19/2005 1:16 error 17055 MSSQLSERVER
> > "Server" EMPOWER01 17204 :\r\nFCB::Open
> > failed: Could not open device o\logs\mastlog.ldf for virtual device
number
> > (VDN) 2.\r\n\r\n
> >
> > Application 12/19/2005 1:16 error 17055 MSSQLSERVER
> > "Server" EMPOWER01 17207 :\r\nudopen:
> > Operating system error 3(The system cannot find the path specified.)
> > during the creation/opening of physical device
o\logs\mastlog.ldf.\r\n\r\n
> >
> >
> >
> > From node 2
> >
> > ==========
> >
> > HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\MSSQLServer\MSSQLServer\Parameters
> >
> > SQLArg0 REG_SZ -dn:\data\master.mdf
> >
> > SQLArg1 REG_SZ -eo:\logs\ERRORLOG
> >
> > SQLArg2 REG_SZ -lo\logs\mastlog.ldf
> >
> >
> >
> > From node 1
> >
> > ==========
> >
> >
> >
> > HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\MSSQLServer\MSSQLServer\Parameters
> >
> > SQLArg0 REG_SZ -dn:\data\master.mdf
> >
> > SQLArg1 REG_SZ -eo:\logs\ERRORLOG
> >
> > SQLArg2 REG_SZ -lo:\logs\mastlog.ldf
> >
> >
> >
> > RESOLUTION:- Add the colon in the path on node 2.
> >
> > regkey :-
> > HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\MSSQLServer\MSSQLServer\Parameters
> >
> > Value of SQLArg2 should be
> >
> > -lo:\logs\mastlog.ldf
> >
> >
> >
> >
> >
> > "Geoff N. Hiten" <SQLCraftsman@xxxxxxxxx> wrote in message
> > news:eTylq4vCGHA.1088@xxxxxxxxxxxxxxxxxxxxxxx
> >> Comments Inline
> >>
> >> "Admiral" <admiral@xxxxxxxxxxxxxxxxxxx> wrote in message
> >> news:%23C%23g6mvCGHA.1088@xxxxxxxxxxxxxxxxxxxxxxx
> >>> Anthony,
> >>>
> >>> I thank you for your response and have gone through the majority of
the
> >>> steps you've provided. I also almost applied SP4, but with the type
of
> >>> front end application we have that could have been disasterous if not
> >>> tested properly. Considering we are at month end, the powers above
have
> >>> blocked me from attempting any further fixes.
> >>
> >> Definitely the right idea. Service Packs generally help with
stability,
> >> but sometimes they break applications. Testing is important.
> >>
> >>>
> >>> I do believe I found the TCP/IP issue/fix, but to be safe it was time
to
> >>> call in Microsoft. I gathered all the information I had and have sent
> >>> it their way. I will definitely keep you all posted on what was found
> >>> and solution. I do agree and feel that this clustered environment
needs
> >>> a complete overhaul. It has had way too many cooks in the kitchen.
For
> >>> the time being, I am covering every avenue by keeping a warm standby

> >>> along with increasing my incrimentals throughout the day.
> >>>
> >> Your experience illustrates exactly where most organizations fall down
> >> when it comes to High Availability. They think a cluster does all the
> >> work and they can ignore the People and Process elements. I have found
> >> that strict procedures that place specific responsibilities on specific
> >> individuals are key to keeping user-generated problems from killing a
> >> system. Proper training and selection of these individuals is also
> >> important. Database operations isn't magic, we need to quit treating
it
> >> like it is. Develop procedures that support your operational goals.
> >> Make sure the people are familiar with these procedures and are
competent
> >> to implement them. Without that, the best technology in the world is
no
> >> more reliable than the cheapest piece of junk.
> >>
> >> --
> >> Geoff N. Hiten
> >> Senior Database Administrator
> >> Microsoft SQL Server MVP
> >>
> >>
> >>
> >>> Thanks Again and Happy Holidays!
> >>>
> >>>
> >>> "Anthony Thomas" <ALThomas@xxxxxxxxx> wrote in message
> >>> news:egJ0K5JCGHA.3812@xxxxxxxxxxxxxxxxxxxxxxx
> >>>> Yeah, that's where I was going, you can't even detach "accidentally,"
> >>>> or
> >>>> otherwise, the model database unless you are in single user mode or
> >>>> you've
> >>>> started the instance with the 3608 trace flag (by the way, just for
> >>>> grins,
> >>>> there is also 3607 and 3609, which behave the same way, but recover
> >>>> more or
> >>>> fewer databases).
> >>>>
> >>>> As far as the TCP/IP issue goes, you had to rebuild the cluster and
> >>>> were not
> >>>> able to restore the master database. That tells me that you are NOT
at
> >>>> the
> >>>> same service pack level as the master backup you were trying to use.
> >>>>
> >>>> Since you went ahead without the master database restored, go ahead
and
> >>>> try
> >>>> to reapply the SP3a service pack, or, better yet, apply SP4. If you
> >>>> are
> >>>> using AWE, you will need to apply the post-SP4, 2040 AWE hotfix as
> >>>> well.
> >>>> Also, SP4 breaks the SQLDiag (and it sounds like you will need this);
> >>>> so,
> >>>> you'll want to contact Microsoft Customer Support Services (yes,
> >>>> they've
> >>>> changed their name from MS PSS to MS CSS) to get the private 2148
> >>>> SQLDiag
> >>>> hotfix.
> >>>>
> >>>> Since you are technically on a brand new cluster (with a lot of old
> >>>> cluster
> >>>> settings laying around), I'd approach this as a fresh build and go
> >>>> through
> >>>> the entire installation configuration just as if nothing was running
on
> >>>> it.
> >>>>
> >>>> Geoff mentioned a few things, like comclust for MSDTC and validating
> >>>> the
> >>>> shared clustered disk configuration, but there are several other
items
> >>>> after
> >>>> a cluster installation you'll have to revisit. The new installation
> >>>> means
> >>>> that all of the registry settings have reverted back to a fresh
> >>>> installation. The fact you are on a new master database means a lot
of
> >>>> the
> >>>> sp_configure settings have reverted back to a fresh installation.
> >>>>
> >>>> You need to check two things, the version of the engine
(sqlservr.exe)
> >>>> and
> >>>> the version of the master, model, and msdb databases. You can
execute
> >>>> SELECT @@VERSION to get the prior and sp_helpdb master to get the
> >>>> later.
> >>>>
> >>>> Good luck. Keep us up to date with your progress.
> >>>>
> >>>> Sincerely,
> >>>>
> >>>>
> >>>> Anthony Thomas
> >>>>
> >>>>
> >>>> --
> >>>>
> >>>> "john clarke" <jclarke@xxxxxxxxxx> wrote in message
> >>>> news:u0PmezkBGHA.344@xxxxxxxxxxxxxxxxxxxxxxx
> >>>>> ouch...
> >>>>>
> >>>>> this situation happened a while ago for us. one of our techs
dettached
> >>>> both
> >>>>> the model AND msdb database (whilst in single user mode) and then
sql
> >>>> server
> >>>>> was shut down. Well it would not start up.
> >>>>>
> >>>>> I managed to execute the following command on the server itself:
> >>>>>
> >>>>> sqlservr -c -f -s <instancename> /T3608
> >>>>>
> >>>>> This worked bringing up the sql server in minimal mode. Thereupon i
> >>>>> used
> >>>>> query analyser to connect and reattached both the msdb and model
> >>>> databases.
> >>>>>
> >>>>> All was back to normal
> >>>>>
> >>>>> Best of luck!
> >>>>>
> >>>>> john
> >>>>>
> >>>>> "Admiral" <admiral@xxxxxxxxxxxxxxxxxxx> wrote in message
> >>>>> news:uXPvMDkBGHA.628@xxxxxxxxxxxxxxxxxxxxxxx
> >>>>> > Geoff,
> >>>>> >
> >>>>> > I truly appreciate your response. I actually did try to reattach
> >>>>> > the
> >>>>> > model using this method, unfortunately everytime I would attemp to
> >>>> attach
> >>>>> > the model db, SQL Server would immediately turn off. At the time
I
> >>>>> > was
> >>>>> > really pressed for time, which led to the decision for a
re-install.
> >>>>> > I
> >>>> do
> >>>>> > believe there is more to the story that was not told to me.
> >>>>> >
> >>>>> > Also, I do apologize if I led you to believe that the cluster is
on
> >>>>> > a
> >>>>> > Windows 2003 platform. It actually resides on a Windows 2000
> >>>>> > Advanced
> >>>>> > Server. The link you provided me, should still work for W2k?
Like
> >>>>> > I
> >>>>> > mentioned when it rains it pours and I've been putting out too
many
> >>>> fires
> >>>>> > for a Christmas week. I can't thank you enough for the response.
> >>>>> >
> >>>>> >
> >>>>> >
> >>>>> >
> >>>>> > "Geoff N. Hiten" <SQLCraftsman@xxxxxxxxx> wrote in message
> >>>>> > news:OFkKlhbBGHA.1088@xxxxxxxxxxxxxxxxxxxxxxx
> >>>>> > First, what you should have done with a blown model DB:
> >>>>> >
> >>>>> > Start SQL Server in single user mode with trace flag -T3608. This
> >>>>> > stops
> >>>>> > SQL from recovering anything except the master database. Reattach
> >>>> Model.
> >>>>> > If necessary, use files copied from another installation at the
> >>>>> > exact
> >>>> same
> >>>>> > SP and Hotfix level. Stop SQL Server and restart normally.
Sorry,
> >>>>> > but
> >>>> it
> >>>>> > really is that simple. Oh, and lock whoever detached "model" out
of
> >>>>> > the
> >>>>> > system. HE is too dangerous to allow near your system.
> >>>>> >
> >>>>> > You didn't mention whether you blew the cluster away or not or
just
> >>>>> > rebuilt SQL. If you blew the cluster away, make sure that each
disk
> >>>>> > resource has the same drive letter on all nodes and the disk
> >>>>> > resources
> >>>>> > fail over correctly from node to node. Stop the resource group,
> >>>>> > move
> >>>> it,
> >>>>> > and start each disk resource independently to test.
> >>>>> >
> >>>>> > The Named Pipes only issue sounds like an incomplete SP3a install.
> >>>>> > Windows 2003 will prevent TCP/IP access if it detects a pre-SP3a
SQL
> >>>>> > installation. Follow this article and re-apply SP3a.
> >>>>> >
> >>>>> > http://support.microsoft.com/default.aspx?scid=kb;en-us;815431
> >>>>> >
> >>>>> >
> >>>>> > --
> >>>>> > Geoff N. Hiten
> >>>>> > Senior Database Administrator
> >>>>> > Microsoft SQL Server MVP
> >>>>> >
> >>>>> >
> >>>>> >
> >>>>> >
> >>>>> > "Admiral" <admiral@xxxxxxxxxxxxxxxxxxx> wrote in message
> >>>>> > news:uoXFWWbBGHA.216@xxxxxxxxxxxxxxxxxxxxxxx
> >>>>> > We had an error over the weekend of mass porportions(Sunday 3pm
> >>>>> > PST).
> >>>>> > Long story short; the model database was detached and the SQL
Server
> >>>>> > was
> >>>>> > stopped, with it still detached. This happened to happen on our
> >>>>> > primary
> >>>>> > Production Database Clustered Server which is the bread-n-butter
of
> >>>>> > the
> >>>>> > compay. (OUCH!)
> >>>>> >
> >>>>> > It was time for some fast actions. We started the re-install SQL
> >>>> Server.
> >>>>> > In order to do so, the previous install had to be uninstalled.
This
> >>>>> > seemed to go smoothly enough, but when re-applying the SP3a, we
> >>>>> > encountered an error. After researching the error, apparently in
a
> >>>>> > clustered environment this will occur since the SP3a files still
> >>>>> > reside
> >>>> on
> >>>>> > the node(s). Microsoft states that if within a particular log file
> >>>>> > it
> >>>>> > results with an 'Installation was Successful', to disregard the
> >>>>> > error.
> >>>> I
> >>>>> > double checked the log file and sure enough the error was
> >>>>> > disregarded.
> >>>>> >
> >>>>> > We moved along with the installation. We were able to restore
all
> >>>>> > the
> >>>>> > user databases and all system databases with the exception of the
> >>>>> > master
> >>>>> > database. Unfortunately, even with starting SQL Server in
> >>>>> > single-user
> >>>>> > mode, the restore of the master database would not take. So it
was
> >>>>> > not
> >>>>> > restored, but all other databases were. Fortunately, I ran a
quick
> >>>> script
> >>>>> > to recover all the user logins previous to the disaster, which I
> >>>> reapplied
> >>>>> > to the new installation of SQL Server. Everything came back up
and
> >>>>> > the
> >>>> QA
> >>>>> > Team successfully tested the production Application (Monday 4am
> >>>>> > PST).
> >>>>> > (Fhweeh)
> >>>>> >
> >>>>> > After the succesful testing of the production environment, we
> >>>>> > tested
> >>>> the
> >>>>> > fail-over which resulted in SQL Server not starting on the
secondary
> >>>> node.
> >>>>> > All the resources came right up on it, but not SQL Server. The
only
> >>>> error
> >>>>> > that was that it was not able to locate the file on
> >>>> 'O\logs\mastlog.ldf'.
> >>>>> > This error did not make sense since SQL Server uses the same file
> >>>>> > for
> >>>> the
> >>>>> > primary node. We were pressed for time since it was closing to
> >>>>> > start of
> >>>>> > business East Coast time, so we left the server as is.
> >>>>> >
> >>>>> > Throughout the day there were other issues that arose, one in
> >>>> particular
> >>>>> > was certain systems were not able to connect to the server via
> >>>>> > TCP/IP.
> >>>> In
> >>>>> > order to have them connect they needed to create an alias of the
> >>>>> > server
> >>>>> > and use Name Pipes. This seems to be a rising concern because
there
> >>>>> > are
> >>>>> > users who need to connect via ODBC to a widely used particular
> >>>>> > Access
> >>>>> > Application, which seems to only like the TCP/IP route. I am
> >>>>> > somewhat
> >>>>> > sure this is related to the cluster failure.
> >>>>> >
> >>>>> > Anyway, this is the first time I've had to take a breathe to
> >>>>> > revisit
> >>>> the
> >>>>> > problem at hand. We have been dealing with another server that
> >>>>> > crashed
> >>>> on
> >>>>> > the same day, resulting in a brand new build of a SQL Server
Cluster
> >>>>> > environment (completely non related to the issue at hand).
> >>>>> >
> >>>>> > I'm sorry for the long winded story. Would you have any idea as
to
> >>>>> > why
> >>>>> > the cluster would fail on failover along with the TCP/IP issue?
> >>>>> >
> >>>>> > Thanks in Advanced..
> >>>>> >
> >>>>> >
> >>>>> >
> >>>>> >
> >>>>> >
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>>
> >>
> >>
> >
> >
>
>


.