Re: Cluster will not fail over.



The SP4 information refers to the OS, not SQL.

The differences appear to be related to different patches on the OS and to a
posible mismatch in MDAC binaries. I would download MDAC 2.8 and install it
on each node. MDAC installs are not cluster aware.

Procedures actually help with growth. They are a way of sharing
institutional knowledge without long, mistake-filled training periods.
Start with the biggest pain points and work outward from there. They don't
have to be complex or involved. They should help guide a professional
towards the correct solution for a given situation. The entire concept
needs high level approval and buy-in. Believe me, a few more episodes like
this and top management will buy into the idea.


--
Geoff N. Hiten
Senior Database Administrator
Microsoft SQL Server MVP



"Admiral" <admiral@xxxxxxxxxxxxxxxxxxx> wrote in message
news:%23sVLqVwCGHA.3920@xxxxxxxxxxxxxxxxxxxxxxx
> Below is the response from Microsoft as promised. I haven't had a chance
> to look into their findings, but one thing is for certain, I did not
> install SP4 for SQL, which they seem to think I have. I will be
> scheduling to apply these fixes for this upcoming Friday night and let you
> guys know how things turn out.
>
> Jeff, the current growth of our company has been extremely fast. We are
> definitely changing from a (mom-n-pop) setup to corporate. I will be the
> first to tell you that I am no-seasoned SQL veteran, but have been doing
> my best to keep things afloat. I truly thank you all for your responses.
>
>
>
> ANALYSIS OF MPS REPORTS
>
> =========================
>
> Node 1
>
> ======
>
> Node Name --------- EMPOWER01-SAN
>
> Node OS Version --- 2195 Windows 2000 Advanced Server
>
> Node Service Pack - Service Pack 4
>
>
>
> Node 2
>
> ======
>
> Node Name --------- EMPOWER02-SAN
>
> Node OS Version --- 2195 Windows 2000 Advanced Server
>
> Node Service Pack - Service Pack 4
>
>
>
> Default instance
>
> SQL Network Name(EMPOWER01)
>
>
>
> -- PAE enabled.
>
>
>
> NODE 1
>
> =======
>
>
>
> Network Name Network Priority
>
> ============ ================
>
> Encore LAN GIG 1
>
> Heartbeat 2
>
>
>
> 1. ACTION -- REVERSE THE NETWORK PRIORITY. First Heartbeat followed by
> Encore.
>
>
>
> ==========================================================================================
>
> Network Interfaces Binding Order -
> NO CHANGE REQUIRED.
>
> ==========================================================================================
>
>
>
> 1. Encore LAN GIG
>
> 2. Heartbeat
>
> 3. NdisWanIp
>
>
>
>
>
>
>
> NODE 2
>
> =======
>
>
>
>
>
> Network Name Network Priority
>
> ============ ================
>
> Encore LAN GIG 1
>
> Heartbeat 2
>
>
>
>
>
> ==========================================================================================
>
> Network Interfaces Binding
> Order -- NO CHANGE REQUIRED.
>
> ==========================================================================================
>
>
>
> 1. Encore LAN GIG
>
> 2. Heartbeat
>
> 3. NdisWanIp
>
>
>
>
>
> -- There are some hotfixes difference between the nodes. - We recommend
> same hotfixes on both the nodes. There might be unexpected outages if we
> have these differences.
>
>
>
> KB32911 -- missing from EMPOWER01-SAN_SQL
>
> KB82283 -- missing from EMPOWER02-SAN_SQL
>
>
>
> -- Found some files missing or version mismatches. - We recommend same
> version of system files on both the nodes. There might be unexpected
> outages if we have these differences.
>
>
>
> C:\WINNT\SYSTEM32\DLLCACHE\OLEDB32R.DLL
>
> has different versions on both the nodes
>
> On node 1.. file version is (2.70:9001.0)
>
> On node 2.. file version is (2.60:6526.0)
>
>
>
> C:\WINNT\SYSTEM32\DRIVERS\B57W2K.SYS
>
> On node 1.. file version is (7.43:0.0)
>
> On node 2.. file version is (6.34:4.0)
>
>
>
> C:\WINNT\SYSTEM32\MACROMED\FLASH\GETFLASH.EXE
>
> On node 1.. FILE NOT FOUND.
>
> On node 2.. file version is (7.0:19.0)
>
>
>
> C:\WINNT\SYSTEM32\NEWDEV.DLL
>
> On node 1.. file version is(5.0:2195.6754)
>
> On node 2.. file version is (5.0:2195.6666)
>
>
>
> C:\WINNT\SYSTEM32\SEAREACH4.DLL
>
> On node 1.. file version is (4.15:1022.0)
>
> On node 2.. FILE NOT FOUND.
>
>
>
> C:\WINNT\SYSTEM32\SKYLON.DLL
>
> On node 1.. file version is
>
> On node 2.. FILE NOT FOUND.
>
>
>
> C:\WINNT\SYSTEM32\SKYLON2.DLL
>
> On node 1.. file version is (2.11:1020.0)
>
> On node 2.. FILE NOT FOUND.
>
>
>
> C:\WINNT\SYSTEM32\WRITERPDF.DLL
>
> On node 1.. file version is (1.20:1047.0)
>
> On node 2.. FILE NOT FOUND.
>
>
>
> From SQL server logs
>
> log 6, 5,4,3,2,1
>
> -------------------
>
> 2005-12-19 01:19:43.26 server Failed to obtain
> TransactionDispenserInterface: Result Code = 0x8004d01b
>
> 2005-12-19 01:19:43.26 spid3 Starting up database 'master'.
>
> 2005-12-19 01:19:43.28 spid3 Error: 9003, Severity: 20, State: 1.
>
> 2005-12-19 01:19:43.28 spid3 Cannot recover the master database.
> Exiting.
>
>
>
> Current SQL logs
>
> -----------------------
>
> 2005-12-19 01:22:34.53 server Microsoft SQL Server 2000 - 8.00..760
> (Intel X86)
>
> Enterprise Edition on Windows NT 5.0 (Build 2195: Service Pack
> 4)
>
> 2005-12-19 01:22:34.54 server Logging SQL Server messages in file
> 'o:\logs\ERRORLOG'.
>
> 2005-12-19 01:22:34.54 server SQL Server is starting at priority class
> 'normal'(8 CPUs detected).
>
> 2005-12-19 01:22:34.98 server Using 'SSNETLIB.DLL' version '8.0.766'..
>
> 2005-12-19 01:22:34.98 spid3 Server name is 'EMPOWER01'.
>
> 2005-12-19 01:22:35.03 server SuperSocket Info: Bind failed on TCP port
> 1433. --SQL is not listening on TCP --
>
> 2005-12-19 01:22:35.54 server SQL server listening on Shared Memory,
> Named Pipes.
>
>
>
> SP4 setup
>
> 19:59:25 Installation Succeeded
>
>
>
> From Application and system logs.
>
>
>
> Application 12/19/2005 1:16 error 17052 MSSQLSERVER
> "Failover" EMPOWER02-SAN [sqsrvres] OnlineThread:
> Error 435 bringing resource online.\r\n\r\n
>
> Application 12/19/2005 1:16 error 17052 MSSQLSERVER
> "Failover" EMPOWER02-SAN [sqsrvres] OnlineThread:
> ResUtilsStartResourceService failed (status 435)\r\n\r\n
>
> Application 12/19/2005 1:16 error 17052 MSSQLSERVER
> "Failover" EMPOWER02-SAN [sqsrvres]
> StartResourceService: Failed to start MSSQLSERVER service. CurrentState:
> 1\r\n\r\n
>
> Application 12/19/2005 1:16 error 17055 MSSQLSERVER
> "Server" EMPOWER01 18052 :\r\nError:
> 5105, Severity: 16, State: 4.\r\n\r\n
>
> Application 12/19/2005 1:16 error 17055 MSSQLSERVER
> "Server" EMPOWER01 17204 :\r\nFCB::Open
> failed: Could not open device o\logs\mastlog.ldf for virtual device number
> (VDN) 2.\r\n\r\n
>
> Application 12/19/2005 1:16 error 17055 MSSQLSERVER
> "Server" EMPOWER01 17207 :\r\nudopen:
> Operating system error 3(The system cannot find the path specified.)
> during the creation/opening of physical device o\logs\mastlog.ldf.\r\n\r\n
>
>
>
> From node 2
>
> ==========
>
> HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\MSSQLServer\MSSQLServer\Parameters
>
> SQLArg0 REG_SZ -dn:\data\master.mdf
>
> SQLArg1 REG_SZ -eo:\logs\ERRORLOG
>
> SQLArg2 REG_SZ -lo\logs\mastlog.ldf
>
>
>
> From node 1
>
> ==========
>
>
>
> HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\MSSQLServer\MSSQLServer\Parameters
>
> SQLArg0 REG_SZ -dn:\data\master.mdf
>
> SQLArg1 REG_SZ -eo:\logs\ERRORLOG
>
> SQLArg2 REG_SZ -lo:\logs\mastlog.ldf
>
>
>
> RESOLUTION:- Add the colon in the path on node 2.
>
> regkey :-
> HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\MSSQLServer\MSSQLServer\Parameters
>
> Value of SQLArg2 should be
>
> -lo:\logs\mastlog.ldf
>
>
>
>
>
> "Geoff N. Hiten" <SQLCraftsman@xxxxxxxxx> wrote in message
> news:eTylq4vCGHA.1088@xxxxxxxxxxxxxxxxxxxxxxx
>> Comments Inline
>>
>> "Admiral" <admiral@xxxxxxxxxxxxxxxxxxx> wrote in message
>> news:%23C%23g6mvCGHA.1088@xxxxxxxxxxxxxxxxxxxxxxx
>>> Anthony,
>>>
>>> I thank you for your response and have gone through the majority of the
>>> steps you've provided. I also almost applied SP4, but with the type of
>>> front end application we have that could have been disasterous if not
>>> tested properly. Considering we are at month end, the powers above have
>>> blocked me from attempting any further fixes.
>>
>> Definitely the right idea. Service Packs generally help with stability,
>> but sometimes they break applications. Testing is important.
>>
>>>
>>> I do believe I found the TCP/IP issue/fix, but to be safe it was time to
>>> call in Microsoft. I gathered all the information I had and have sent
>>> it their way. I will definitely keep you all posted on what was found
>>> and solution. I do agree and feel that this clustered environment needs
>>> a complete overhaul. It has had way too many cooks in the kitchen. For
>>> the time being, I am covering every avenue by keeping a warm standby
>>> along with increasing my incrimentals throughout the day.
>>>
>> Your experience illustrates exactly where most organizations fall down
>> when it comes to High Availability. They think a cluster does all the
>> work and they can ignore the People and Process elements. I have found
>> that strict procedures that place specific responsibilities on specific
>> individuals are key to keeping user-generated problems from killing a
>> system. Proper training and selection of these individuals is also
>> important. Database operations isn't magic, we need to quit treating it
>> like it is. Develop procedures that support your operational goals.
>> Make sure the people are familiar with these procedures and are competent
>> to implement them. Without that, the best technology in the world is no
>> more reliable than the cheapest piece of junk.
>>
>> --
>> Geoff N. Hiten
>> Senior Database Administrator
>> Microsoft SQL Server MVP
>>
>>
>>
>>> Thanks Again and Happy Holidays!
>>>
>>>
>>> "Anthony Thomas" <ALThomas@xxxxxxxxx> wrote in message
>>> news:egJ0K5JCGHA.3812@xxxxxxxxxxxxxxxxxxxxxxx
>>>> Yeah, that's where I was going, you can't even detach "accidentally,"
>>>> or
>>>> otherwise, the model database unless you are in single user mode or
>>>> you've
>>>> started the instance with the 3608 trace flag (by the way, just for
>>>> grins,
>>>> there is also 3607 and 3609, which behave the same way, but recover
>>>> more or
>>>> fewer databases).
>>>>
>>>> As far as the TCP/IP issue goes, you had to rebuild the cluster and
>>>> were not
>>>> able to restore the master database. That tells me that you are NOT at
>>>> the
>>>> same service pack level as the master backup you were trying to use.
>>>>
>>>> Since you went ahead without the master database restored, go ahead and
>>>> try
>>>> to reapply the SP3a service pack, or, better yet, apply SP4. If you
>>>> are
>>>> using AWE, you will need to apply the post-SP4, 2040 AWE hotfix as
>>>> well.
>>>> Also, SP4 breaks the SQLDiag (and it sounds like you will need this);
>>>> so,
>>>> you'll want to contact Microsoft Customer Support Services (yes,
>>>> they've
>>>> changed their name from MS PSS to MS CSS) to get the private 2148
>>>> SQLDiag
>>>> hotfix.
>>>>
>>>> Since you are technically on a brand new cluster (with a lot of old
>>>> cluster
>>>> settings laying around), I'd approach this as a fresh build and go
>>>> through
>>>> the entire installation configuration just as if nothing was running on
>>>> it.
>>>>
>>>> Geoff mentioned a few things, like comclust for MSDTC and validating
>>>> the
>>>> shared clustered disk configuration, but there are several other items
>>>> after
>>>> a cluster installation you'll have to revisit. The new installation
>>>> means
>>>> that all of the registry settings have reverted back to a fresh
>>>> installation. The fact you are on a new master database means a lot of
>>>> the
>>>> sp_configure settings have reverted back to a fresh installation.
>>>>
>>>> You need to check two things, the version of the engine (sqlservr.exe)
>>>> and
>>>> the version of the master, model, and msdb databases. You can execute
>>>> SELECT @@VERSION to get the prior and sp_helpdb master to get the
>>>> later.
>>>>
>>>> Good luck. Keep us up to date with your progress.
>>>>
>>>> Sincerely,
>>>>
>>>>
>>>> Anthony Thomas
>>>>
>>>>
>>>> --
>>>>
>>>> "john clarke" <jclarke@xxxxxxxxxx> wrote in message
>>>> news:u0PmezkBGHA.344@xxxxxxxxxxxxxxxxxxxxxxx
>>>>> ouch...
>>>>>
>>>>> this situation happened a while ago for us. one of our techs dettached
>>>> both
>>>>> the model AND msdb database (whilst in single user mode) and then sql
>>>> server
>>>>> was shut down. Well it would not start up.
>>>>>
>>>>> I managed to execute the following command on the server itself:
>>>>>
>>>>> sqlservr -c -f -s <instancename> /T3608
>>>>>
>>>>> This worked bringing up the sql server in minimal mode. Thereupon i
>>>>> used
>>>>> query analyser to connect and reattached both the msdb and model
>>>> databases.
>>>>>
>>>>> All was back to normal
>>>>>
>>>>> Best of luck!
>>>>>
>>>>> john
>>>>>
>>>>> "Admiral" <admiral@xxxxxxxxxxxxxxxxxxx> wrote in message
>>>>> news:uXPvMDkBGHA.628@xxxxxxxxxxxxxxxxxxxxxxx
>>>>> > Geoff,
>>>>> >
>>>>> > I truly appreciate your response. I actually did try to reattach
>>>>> > the
>>>>> > model using this method, unfortunately everytime I would attemp to
>>>> attach
>>>>> > the model db, SQL Server would immediately turn off. At the time I
>>>>> > was
>>>>> > really pressed for time, which led to the decision for a re-install.
>>>>> > I
>>>> do
>>>>> > believe there is more to the story that was not told to me.
>>>>> >
>>>>> > Also, I do apologize if I led you to believe that the cluster is on
>>>>> > a
>>>>> > Windows 2003 platform. It actually resides on a Windows 2000
>>>>> > Advanced
>>>>> > Server. The link you provided me, should still work for W2k? Like
>>>>> > I
>>>>> > mentioned when it rains it pours and I've been putting out too many
>>>> fires
>>>>> > for a Christmas week. I can't thank you enough for the response.
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > "Geoff N. Hiten" <SQLCraftsman@xxxxxxxxx> wrote in message
>>>>> > news:OFkKlhbBGHA.1088@xxxxxxxxxxxxxxxxxxxxxxx
>>>>> > First, what you should have done with a blown model DB:
>>>>> >
>>>>> > Start SQL Server in single user mode with trace flag -T3608. This
>>>>> > stops
>>>>> > SQL from recovering anything except the master database. Reattach
>>>> Model.
>>>>> > If necessary, use files copied from another installation at the
>>>>> > exact
>>>> same
>>>>> > SP and Hotfix level. Stop SQL Server and restart normally. Sorry,
>>>>> > but
>>>> it
>>>>> > really is that simple. Oh, and lock whoever detached "model" out of
>>>>> > the
>>>>> > system. HE is too dangerous to allow near your system.
>>>>> >
>>>>> > You didn't mention whether you blew the cluster away or not or just
>>>>> > rebuilt SQL. If you blew the cluster away, make sure that each disk
>>>>> > resource has the same drive letter on all nodes and the disk
>>>>> > resources
>>>>> > fail over correctly from node to node. Stop the resource group,
>>>>> > move
>>>> it,
>>>>> > and start each disk resource independently to test.
>>>>> >
>>>>> > The Named Pipes only issue sounds like an incomplete SP3a install.
>>>>> > Windows 2003 will prevent TCP/IP access if it detects a pre-SP3a SQL
>>>>> > installation. Follow this article and re-apply SP3a.
>>>>> >
>>>>> > http://support.microsoft.com/default.aspx?scid=kb;en-us;815431
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Geoff N. Hiten
>>>>> > Senior Database Administrator
>>>>> > Microsoft SQL Server MVP
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > "Admiral" <admiral@xxxxxxxxxxxxxxxxxxx> wrote in message
>>>>> > news:uoXFWWbBGHA.216@xxxxxxxxxxxxxxxxxxxxxxx
>>>>> > We had an error over the weekend of mass porportions(Sunday 3pm
>>>>> > PST).
>>>>> > Long story short; the model database was detached and the SQL Server
>>>>> > was
>>>>> > stopped, with it still detached. This happened to happen on our
>>>>> > primary
>>>>> > Production Database Clustered Server which is the bread-n-butter of
>>>>> > the
>>>>> > compay. (OUCH!)
>>>>> >
>>>>> > It was time for some fast actions. We started the re-install SQL
>>>> Server.
>>>>> > In order to do so, the previous install had to be uninstalled. This
>>>>> > seemed to go smoothly enough, but when re-applying the SP3a, we
>>>>> > encountered an error. After researching the error, apparently in a
>>>>> > clustered environment this will occur since the SP3a files still
>>>>> > reside
>>>> on
>>>>> > the node(s). Microsoft states that if within a particular log file
>>>>> > it
>>>>> > results with an 'Installation was Successful', to disregard the
>>>>> > error.
>>>> I
>>>>> > double checked the log file and sure enough the error was
>>>>> > disregarded.
>>>>> >
>>>>> > We moved along with the installation. We were able to restore all
>>>>> > the
>>>>> > user databases and all system databases with the exception of the
>>>>> > master
>>>>> > database. Unfortunately, even with starting SQL Server in
>>>>> > single-user
>>>>> > mode, the restore of the master database would not take. So it was
>>>>> > not
>>>>> > restored, but all other databases were. Fortunately, I ran a quick
>>>> script
>>>>> > to recover all the user logins previous to the disaster, which I
>>>> reapplied
>>>>> > to the new installation of SQL Server. Everything came back up and
>>>>> > the
>>>> QA
>>>>> > Team successfully tested the production Application (Monday 4am
>>>>> > PST).
>>>>> > (Fhweeh)
>>>>> >
>>>>> > After the succesful testing of the production environment, we
>>>>> > tested
>>>> the
>>>>> > fail-over which resulted in SQL Server not starting on the secondary
>>>> node.
>>>>> > All the resources came right up on it, but not SQL Server. The only
>>>> error
>>>>> > that was that it was not able to locate the file on
>>>> 'O\logs\mastlog.ldf'.
>>>>> > This error did not make sense since SQL Server uses the same file
>>>>> > for
>>>> the
>>>>> > primary node. We were pressed for time since it was closing to
>>>>> > start of
>>>>> > business East Coast time, so we left the server as is.
>>>>> >
>>>>> > Throughout the day there were other issues that arose, one in
>>>> particular
>>>>> > was certain systems were not able to connect to the server via
>>>>> > TCP/IP.
>>>> In
>>>>> > order to have them connect they needed to create an alias of the
>>>>> > server
>>>>> > and use Name Pipes. This seems to be a rising concern because there
>>>>> > are
>>>>> > users who need to connect via ODBC to a widely used particular
>>>>> > Access
>>>>> > Application, which seems to only like the TCP/IP route. I am
>>>>> > somewhat
>>>>> > sure this is related to the cluster failure.
>>>>> >
>>>>> > Anyway, this is the first time I've had to take a breathe to
>>>>> > revisit
>>>> the
>>>>> > problem at hand. We have been dealing with another server that
>>>>> > crashed
>>>> on
>>>>> > the same day, resulting in a brand new build of a SQL Server Cluster
>>>>> > environment (completely non related to the issue at hand).
>>>>> >
>>>>> > I'm sorry for the long winded story. Would you have any idea as to
>>>>> > why
>>>>> > the cluster would fail on failover along with the TCP/IP issue?
>>>>> >
>>>>> > Thanks in Advanced..
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>


.