Re: Correcting corrupt $MFT on shared clustered disk.



I had seen the "hangrecoveryaction" issue and set it after this happened one other time (the disk didn't get corrupted though). I'm not sure why it didn't kick in this time.

As for the other part, I saw the "extended maintenance mode" KB article, but I wasn't quite sure how it would work with running chkdsk. I was hoping maybe someone has done this already and could give me some pointers.

I hadn't thought about involving EMC for this. I figured it was more of a Windows issue than a hardware or SAN issue, so it wouldn't be in their area.


"Rodney R. Fournier [MVP]" <rod@xxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote in message news:uUb$R9HAIHA.4836@xxxxxxxxxxxxxxxxxxxxxxx
First of all, it sounds like you need to do this ASAP: http://support.microsoft.com/kb/815267

cluster.exe /cluster:clustername /prop hangrecoveryaction=3

That should take care of the server when it goes into a coma.
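
If you would rather script this than type it by hand, here is a minimal Python sketch that only builds the argument list for the command above; it does not run anything. The function name and the 0-3 range check are my own assumptions for illustration (the range comes from my reading of KB815267, so verify it against the article).

```python
def hang_recovery_command(cluster_name, action=3):
    """Build the cluster.exe argument list that sets hangrecoveryaction.

    Hypothetical helper for illustration; it only assembles the arguments.
    """
    # Assumption: valid values are 0 through 3, per my reading of KB815267.
    if action not in (0, 1, 2, 3):
        raise ValueError("hangrecoveryaction is assumed to be 0-3, got %r" % (action,))
    return [
        "cluster.exe",
        "/cluster:%s" % cluster_name,
        "/prop",
        "hangrecoveryaction=%d" % action,
    ]
```

The returned list can be handed to subprocess.run on a machine where cluster.exe is on the path.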

Next, I am not sure if you are aware, but Extended Maintenance mode (http://support.microsoft.com/kb/903650) is now available for when you need to run chkdsk on a drive.

Lastly, I would engage EMC on this issue. They have wonderfully talented support professionals who can help you!

Cheers,

Rodney R. Fournier

MVP - Windows Server - Clustering
http://www.nw-america.com - Clustering Website
http://msmvps.com/clustering - Blog
http://www.clusterhelp.com - Cluster Training
ClusterHelp.com is a Microsoft Certified Gold Partner


"Mike O." <MikeO@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote in message news:F3CB8885-569A-493A-BAE1-597ED71C3D4A@xxxxxxxxxxxxxxxx
I'm trying to find out some information about using CHKDSK on a clustered
drive.
We have a two node cluster (active/passive) running Windows 2003 R2
enterprise 32 bit with SP1. The cluster has three shared drives located on
an EMC CX700 SAN: a 500MB drive for the quorum and two data drives, 1.5TB
(drive "E") and 2.4TB (drive "W"). Drive E is a basic disk; the 2.4TB drive W
is a GPT disk. They're both about 70% full. The E drive has been active for
about a year; the W drive was added around June.

Yesterday the active node became sluggish and then stopped serving data. It
still responded to low level stuff like PING, but users were getting errors
from the server and logging in gave a blank screen. This has happened a
couple of times before (that's a separate issue we're looking into).

We went to the inactive node and did a "move group" in the cluster
administrator. We've done this before for various reasons with no problems,
it usually takes about 20 seconds to bring the resources up on the other node.

This time, when the resources came online on the 2nd node, we started
getting an application popup saying "Windows - Corrupt File : The file or
directory E:\$Mft is corrupt and unreadable. Please run the Chkdsk utility."
The drive seems to be running OK with users accessing the information
normally. I did some research and it appears that Windows will use the
duplicate copy of the MFT if the primary one is corrupted.

I know we need to run CHKDSK soon, but unfortunately, running chkdsk and
taking the drive off line for several hours is not something we can do
during daytime hours. If necessary we could run it overnight, but with that
size of drive I don't know if it would finish by the next morning.

The server has dual fiber connections (we're using the EMC PowerPath
software for SAN failover), and nothing happened with the SAN at that time.
Based on the timing, I'm assuming the MFT corruption was related to the
cluster failover, not a physical hardware issue, so I wasn't planning on
running the sector scan. I would imagine a sector scan on a 1.5TB "disk"
would run for a while.

At this point I'm planning on running CHKDSK over the weekend. I've never
run it on a clustered disk before and I'm looking for some information about
it. I've read Microsoft KB176970 and KB903650, but frankly they're a little
confusing with the issues about "maintenance mode".
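
One way to get a feel for how long the weekend run might take would be to time a read-only pass first and keep its output. Below is a minimal Python sketch of that idea. The function name and log file name are hypothetical, and it is only a generic "run a command, save the output, time it" wrapper, not anything taken from the KB articles.

```python
import subprocess
import time


def run_and_log(command, log_path):
    """Run a command, save stdout and stderr to log_path, return (exit_code, seconds)."""
    started = time.monotonic()
    # Binary mode: the child process writes its raw output straight to the file.
    with open(log_path, "wb") as log:
        completed = subprocess.run(command, stdout=log, stderr=subprocess.STDOUT)
    elapsed = time.monotonic() - started
    return completed.returncode, elapsed
```

Hypothetical usage on the node that owns the disk: run_and_log(["chkdsk", "E:"], "chkdsk_E_readonly.log"). With no switches chkdsk runs read-only, so nothing is repaired. I would treat the elapsed time as a rough lower bound only, since a repair pass does additional work.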

Also, is my understanding about the mirrored/secondary MFT valid? Since
users appear to be getting information correctly, can the CHKDSK wait until
the weekend? Our backup policy does a full backup each week and an
incremental daily, so if something really bad happens we should be able to
recover.

Any information on this would be appreciated.

Mike O.



