Re: Disk failure and event id 1146



I haven't see this sequence of events before, but for troubleshooting
purposes, you might want to run your two disk resources in a seperate
resource monitor (you check this box in the properties of the resource in
cluadmin), and then see if this changes behaviour, in regards of what
resource monitor is failing.

Also, the log sniplet below details the following:
00000918.000016f0::2007/01/05-17:09:31.750 INFO [RM] RmpSetResourceStatus,
Posting state 3 notification for resource <Disk U:>

state 3 is not a failed state, it is an offline state, which follows after
the flushing and dismounting the disk, so this looks quite normal, and
doesn't look like a failure of the disk resource.

What is suspect is that you get the error about flushing buffers, you might
want to disable any hardware monitoring software which might have a handle
open to disk, and therefore preventing the exclusive volume lock for
flushing. Monitoring software both from your server manufacturer or
third-party could (not saying it is....) cause such behaviour.

HTH,
Edwin.


"Franco" <Franco@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote in message
news:3382D545-C8BF-4617-BEA2-091BF7B4A493@xxxxxxxxxxxxxxxx
Hello all, this is my environment:

2 nodes MSCS with Windows Server 2003 SP1.
Node 1 host SQL Server, Cluster group, File system and Lotus Domino
Node 2 host Oracle 9i in a failsafe confoguration.
IBM SAN DS4400 with firmware code and device drivers for all devices at
n-1.

This is not always happening but I would like to share this strange
scenario
with you:

When I decide to "take one group offline", in this case the Oracle group,
I
end up in this situation:

the 2 disks of the Oracle group (T:\ and U:\) gets failed.

In the system event log I receive the following errors:
"The cluster resource monitor died unexpectedly, an attempt will be made
to
restart it."

This error is followed by a series of errors like this one:

"The system failed to flush data to the transaction log. Corruption may
occur."

In the cluster log I have:
"00000918.000016f0::2007/01/05-17:09:30.516 INFO Physical Disk <Disk U:>:
Offline, ResourceEntry @ 000C4A90 Valid 1
000006c4.00000a34::2007/01/05-17:09:30.516 INFO [FM] FmpRmOfflineResource:
RmOffline() for 85f5dcef-4d92-4734-b330-01c0c9ba46ac returned error 997
000006c4.00000a34::2007/01/05-17:09:30.516 INFO [GUM] GumSendUpdate:
queuing
update type 0 context 8
00000918.000013c8::2007/01/05-17:09:30.516 INFO Physical Disk <Disk U:>:
[PnP] Stop watching PnP events for disk a184bf5a
00000918.000013c8::2007/01/05-17:09:30.516 INFO Physical Disk <Disk U:>:
[PnP] Stop watching disk a184bf5a - processed
00000918.000016f0::2007/01/05-17:09:30.516 INFO Physical Disk <Disk U:>:
Offline, Processing disk number 7.
00000918.000016f0::2007/01/05-17:09:30.516 INFO Physical Disk <Disk U:>:
Offline, Using handle array.
00000918.000016f0::2007/01/05-17:09:30.516 INFO Physical Disk <Disk U:>:
Offline, Opening device \Device\Harddisk7\Partition1.
00000918.000016f0::2007/01/05-17:09:30.516 INFO Physical Disk <Disk U:>:
Offline, FlushFileBuffers for \Device\Harddisk7\Partition1.
00000918.000016f0::2007/01/05-17:09:30.516 INFO Physical Disk <Disk U:>:
Offline, Locking volume for \Device\Harddisk7\Partition1.
000006c4.00000a34::2007/01/05-17:09:30.516 INFO [GUM] GumSendUpdate:
Dispatching seq 160634 type 0 context 8 to node 2
00000918.000013c8::2007/01/05-17:09:30.516 INFO Physical Disk: [PnP] Event
GUID_IO_VOLUME_LOCK for U (Partition1) - Received
00000918.000013c8::2007/01/05-17:09:30.516 INFO Physical Disk: [PnP] Event
GUID_IO_VOLUME_LOCK for U (Partition1) - Processed
000006c4.00000a34::2007/01/05-17:09:30.516 INFO [GUM] GumSendUpdate:
completed update seq 160634 type 0 context 8
000006c4.00000a34::2007/01/05-17:09:30.516 INFO [FM]
FmpPropagateResourceState: resource 85f5dcef-4d92-4734-b330-01c0c9ba46ac
pending event.
000006c4.00000a34::2007/01/05-17:09:30.516 INFO [FM]
FmpSetResourcePersistentState: Setting persistent state for resource
85f5dcef-4d92-4734-b330-01c0c9ba46ac...
000006c4.00000a34::2007/01/05-17:09:30.516 INFO [GUM] GumSendUpdate:
queuing
update type 1 context 4098
00000918.000013c8::2007/01/05-17:09:30.516 INFO Physical Disk: [PnP] Event
GUID_IO_VOLUME_LOCK_FAILED for U (Partition1) - Received
00000918.000013c8::2007/01/05-17:09:30.516 INFO Physical Disk: [PnP] Event
GUID_IO_VOLUME_LOCK_FAILED for U (Partition1) - Processed
00000918.000016f0::2007/01/05-17:09:30.516 WARN Physical Disk <Disk U:>:
Offline, Locking volume failed, error 5.
000006c4.00000a34::2007/01/05-17:09:30.516 INFO [GUM] GumSendUpdate:
Dispatching seq 160635 type 1 context 4098 to node 2
000006c4.00000a34::2007/01/05-17:09:30.516 INFO [DM] DmWriteToQuorumLog
Entry Seq#=160635 Type=4098 Size=152
000006c4.00000a34::2007/01/05-17:09:30.516 INFO [DM] DmUpdateSetValue
000006c4.00000a34::2007/01/05-17:09:30.516 INFO [DM] Setting value of
PersistentState for key Resources\85f5dcef-4d92-4734-b330-01c0c9ba46ac to
0x00000000
000006c4.00000a34::2007/01/05-17:09:30.516 INFO [DM] DmWriteToQuorumLog
Entry Seq#=160635 Type=4098 Size=152
000006c4.00000a34::2007/01/05-17:09:30.516 INFO [GUM] GumSendUpdate:
completed update seq 160635 type 1 context 4098
000006c4.0000084c::2007/01/05-17:09:30.516 INFO [GUM] GumSendUpdate:
queuing
update type 0 context 11
000006c4.0000084c::2007/01/05-17:09:30.516 INFO [GUM] GumSendUpdate:
Dispatching seq 160636 type 0 context 11 to node 2
000006c4.0000084c::2007/01/05-17:09:30.516 INFO [GUM] GumSendUpdate:
completed update seq 160636 type 0 context 11
000006c4.0000084c::2007/01/05-17:09:30.516 INFO [FM]
FmpRmChangeResourceParams for resource
<85f5dcef-4d92-4734-b330-01c0c9ba46ac>
called...
00000918.000013c8::2007/01/05-17:09:31.125 INFO Physical Disk: [PnP] Event
GUID_IO_VOLUME_LOCK for U (Partition1) - Received
00000918.000013c8::2007/01/05-17:09:31.125 INFO Physical Disk: [PnP] Event
GUID_IO_VOLUME_LOCK for U (Partition1) - Processed
00000918.000013c8::2007/01/05-17:09:31.125 INFO Physical Disk: [PnP] Event
GUID_IO_VOLUME_LOCK_FAILED for U (Partition1) - Received
00000918.000013c8::2007/01/05-17:09:31.125 INFO Physical Disk: [PnP] Event
GUID_IO_VOLUME_LOCK_FAILED for U (Partition1) - Processed
00000918.000016f0::2007/01/05-17:09:31.125 WARN Physical Disk <Disk U:>:
Offline, Locking volume failed, error 5.
00000918.000016f0::2007/01/05-17:09:31.719 INFO Physical Disk <Disk U:>:
Offline, Dismounting volume \Device\Harddisk7\Partition1.
00000918.000013c8::2007/01/05-17:09:31.719 INFO Physical Disk: [PnP] Event
GUID_IO_VOLUME_DISMOUNT for U (Partition1) - Received
00000918.000013c8::2007/01/05-17:09:31.719 INFO Physical Disk: [PnP] Event
GUID_IO_VOLUME_DISMOUNT for U (Partition1) - Processed
00000918.000016f0::2007/01/05-17:09:31.735 INFO Physical Disk <Disk U:>:
Offline, Dismount complete, volume \Device\Harddisk7\Partition1.
00000918.000016f0::2007/01/05-17:09:31.735 INFO Physical Disk <Disk U:>:
DiskCleanup started.
00000918.000016f0::2007/01/05-17:09:31.735 INFO Physical Disk <Disk U:>:
[DiskArb] StopPersistentReservations is called.
00000918.000016f0::2007/01/05-17:09:31.735 INFO Physical Disk <Disk U:>:
[DiskArb] Stopping reservation thread.
00000918.00000cc4::2007/01/05-17:09:31.735 INFO Physical Disk <Disk U:>:
[DiskArb] CompletionRoutine, status 0.
00000918.000016f0::2007/01/05-17:09:31.735 INFO Physical Disk <Disk U:>:
[ArbCleanup] Verifying sector size.
00000918.000016f0::2007/01/05-17:09:31.735 INFO Physical Disk <Disk U:>:
[ArbCleanup] Reading arbitration block.
00000918.000016f0::2007/01/05-17:09:31.735 INFO Physical Disk <Disk U:>:
[DiskArb] Successful read (sector 12) [SRV0002B:120604]
(0,5f5f2dea:01c72da1).
00000918.000016f0::2007/01/05-17:09:31.735 INFO Physical Disk <Disk U:>:
[ArbCleanup] Writing arbitration block.
00000918.000016f0::2007/01/05-17:09:31.735 INFO Physical Disk <Disk U:>:
[DiskArb] Successful write (sector 12) [:0] (0,00000000:00000000).
00000918.000016f0::2007/01/05-17:09:31.735 INFO Physical Disk <Disk U:>:
[ArbCleanup] Returning status 0.
00000918.000016f0::2007/01/05-17:09:31.735 INFO Physical Disk <Disk U:>:
[DiskArb] StopPersistentReservations is complete.
00000918.000016f0::2007/01/05-17:09:31.735 INFO Physical Disk <Disk U:>:
DisksDismountDrives: letter mask is 00100000.
00000918.000016f0::2007/01/05-17:09:31.735 INFO Physical Disk <Disk U:>:
DiskCleanup returning final error 0
00000918.000013c8::2007/01/05-17:09:31.735 INFO Physical Disk: [PnP] Event
GUID_IO_VOLUME_UNLOCK for U (Partition1) - Received
00000918.000013c8::2007/01/05-17:09:31.735 INFO Physical Disk: [PnP] Event
GUID_IO_VOLUME_UNLOCK for U (Partition1) - Processed
00000918.000016f0::2007/01/05-17:09:31.750 INFO Physical Disk <Disk U:>:
Offline, Returning final error 0.
00000918.000016f0::2007/01/05-17:09:31.750 INFO [RM] RmpSetResourceStatus,
Posting state 3 notification for resource <Disk U:>
000006c4.00001798::2007/01/05-17:09:31.750 INFO [FM]
NotifyCallBackRoutine:
enqueuing event
000006c4.00001798::2007/01/05-17:09:31.750 INFO [FM] Calling
RmNotifyChanges
in monitor 0918.
000006c4.00000848::2007/01/05-17:09:31.750 INFO [CP] CppResourceNotify for
resource Disk U:"

I do not have more than 100 disks,
(http://support.microsoft.com/kb/314753),
so this is not my situation.

I do not find in the cluster log the following note:
"WARN Network Name Enterprise Network Name: Failed to obtain
IDirectoryObject pointer to computer account ServerName, status 80005000"
so this is not also my case (http://support.microsoft.com/kb/886652)

Any clue?
Thank you very much and kind regards.





Franco


.



Relevant Pages

  • Re: Issue with Clustering
    ... If you are using a MNS quorum then you do not need a Quorum disk in your ... AddVolume: Adding Name ... waiting for notification from PnP thread that all volumes exist... ... Posting state 4 notification for resource ...
    (microsoft.public.windows.server.clustering)
  • RE: Phisical disk resource
    ... disk is reporting that it is busy. ... Status 170 = The requested resource is in use. ... I'm setting up a 2 node cluster under 2003 ent. ... INFO Physical Disk <Data Disk>: ...
    (microsoft.public.windows.server.clustering)
  • RE: Cluster IP Address Does not fail over
    ... Did you look at the network interfaces etc under cluster configuration in ... Node1 disk manager sees LUN5. ... Posting state 4 notification for resource ... Stop watching PnP events for disk cf2bda88 ...
    (microsoft.public.windows.server.clustering)
  • Re: 2003 Cluster Fails to start after a reboot
    ... It seems as though it cannot see the quorum disk on the ... Quorum resource is in the same ... FmpHandleResStateChangeProc: Exit... ... FmpNotifyGroupStateChangeReason: Notifying group Cluster ...
    (microsoft.public.windows.server.clustering)
  • Re: Cluster Freezes
    ... When testing the offline times I notice that the file shares go offline in a ... the Physical disk seem to take atleast 30-45 seconds to ... The failover time for all the resources is really good maybe 20 seconds tops ... the real problem is coming back online. ...
    (microsoft.public.windows.server.clustering)