Re: Physical Disk goes offline when cluster node reboots




Edwin vMierlo wrote:
Darrek,

Just to confirm that we have the symptom right

- All groups are online on Node 1 (therefore all disks are online on Node 1)
- you reboot Node 2
- All disks on Node 1 go offline on Node 1 during reboot/Post of Node 2

Please confirm that this is what you are experiencing

and two questions:
Q: The disks that go offline on Node 2: do they fail, or do they go
offline? (Please specify, as there is a difference.)
Q: Do you see any "reservation lost" messages/events in the system event
log?

rgds,
Edwin.


Yes. All groups are online and running fine on Node 1. During Node 2's
POST, Node 1 reports errors like these in the event log:

(One for each LUN on the SAN)
Event Type: Error
Event Source: Disk
Event Category: None
Event ID: 15
Description:
The device, \Device\Harddisk1, is not ready for access yet.

And then...one of these...

Event Type: Error
Event Source: ClusSvc
Event Category: Physical Disk Resource
Event ID: 1038
Description:
Reservation of cluster disk 'Disk T - QASQLBTmp' has been lost. Please
check your system and disk configuration.


And then...several of these...

Event Type: Warning
Event Source: Ntfs
Event Category: None
Event ID: 50
Description:
{Delayed Write Failed} Windows was unable to save all the data for the
file . The data has been lost. This error may be caused by a failure of
your computer hardware or network connection. Please try to save this
file elsewhere.

More Event 15s and 1038s for other LUNs.

A couple of these mixed in...

Event Type: Information
Event Source: Application Popup
Event Category: None
Event ID: 26
Description:
Application popup: Windows - Delayed Write Failed : Windows was unable
to save all the data for the file Q:\$Mft. The data has been lost. This
error may be caused by a failure of your computer hardware or network
connection. Please try to save this file elsewhere.

One of these:

Event Type: Warning
Event Source: Ftdisk
Event Category: Disk
Event ID: 57
Description:
The system failed to flush data to the transaction log. Corruption may
occur.

At this point Cluster Admin begins sending service stop commands to
SQL, and I get these:

Event Type: Error
Event Source: ClusSvc
Event Category: Physical Disk Resource
Event ID: 1036
Description:
Cluster disk resource '' did not respond to a SCSI maintenance command.


Followed by several more 57s.

I even managed one of these:

Event Type: Error
Event Source: ClusSvc
Event Category: Physical Disk Resource
Event ID: 1034
Description:
The disk associated with cluster disk resource 'Disk Q:' could not be
found. The expected signature of the disk was BED1F8F9. If the disk was
removed from the server cluster, the resource should be deleted. If the
disk was replaced, the resource must be deleted and created again in
order to bring the disk online. If the disk has not been removed or
replaced, it may be inaccessible at this time because it is reserved by
another server cluster node.

Followed by one of these:

Event Type: Error
Event Source: ClusSvc
Event Category: Startup/Shutdown
Event ID: 1009
Description:
Cluster service could not join an existing server cluster and could not
form a new server cluster. Cluster service has terminated.
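
For anyone trying to line up a sequence like this, here is a minimal sketch, assuming the events are pasted as plain text in the same "Event Type / Event Source / Event Category / Event ID / Description" layout used in this post. It splits the text into individual events and counts them by source and ID. The function names (parse_events, tally) and the SAMPLE text are hypothetical, chosen only for illustration; this is not a tool that ships with Windows or the cluster service.

```python
import re
from collections import Counter

# Sample text in the same "Field: value" layout as events pasted from
# Event Viewer. Replace it with your own pasted log text.
SAMPLE = """\
Event Type: Error
Event Source: Disk
Event Category: None
Event ID: 15
Description:
The device, \\Device\\Harddisk1, is not ready for access yet.

Event Type: Error
Event Source: ClusSvc
Event Category: Physical Disk Resource
Event ID: 1038
Description:
Reservation of cluster disk 'Disk T - QASQLBTmp' has been lost. Please
check your system and disk configuration.

Event Type: Warning
Event Source: Ftdisk
Event Category: Disk
Event ID: 57
Description:
The system failed to flush data to the transaction log. Corruption may
occur.
"""

# Header fields that appear one per line as "Name: value".
FIELD = re.compile(r"^(Event Type|Event Source|Event Category|Event ID):\s*(.*)$")


def parse_events(text):
    """Split pasted event-log text into a list of dicts, one per event."""
    events = []
    current = None
    in_description = False
    for line in text.splitlines():
        match = FIELD.match(line)
        if match:
            key, value = match.group(1), match.group(2).strip()
            if key == "Event Type":
                # "Event Type" is treated as the start of a new event.
                current = {"Description": ""}
                events.append(current)
            if current is not None:
                current[key] = value
            in_description = False
        elif line.strip() == "Description:" and current is not None:
            in_description = True
        elif in_description and current is not None:
            if line.strip():
                # Join wrapped description lines into a single string.
                joined = current["Description"] + " " + line.strip()
                current["Description"] = joined.strip()
            else:
                # A blank line ends the description.
                in_description = False
    return events


def tally(events):
    """Count events by (source, event ID)."""
    return Counter((e.get("Event Source"), e.get("Event ID")) for e in events)


if __name__ == "__main__":
    for (source, event_id), count in tally(parse_events(SAMPLE)).items():
        print(f"{source} {event_id}: {count}")
```

Keeping the events in the order they were pasted (parse_events returns a list) makes it easier to see whether the Disk 15 errors come before the ClusSvc 1038 reservation losses for each LUN.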




The drivers I'm using are Emulex Storport FC2243
5-1.11X1 11/07/2005 WS2K3 32 bit (elxadjct.sys & elxstor.sys)
5.1.3.2 (elxstod.dll)

The MSA 1000 is on firmware 4.48.


Thanks for your help!

-DK
