Re: Correcting a corrupted $MFT on a shared clustered disk



It finished around 1:00am. So it took 7-1/2 hours for the first 1%, then 90 minutes for the other 99%.

It produced what looked like a couple of hundred "resetting security id to default" messages, but I've spot-checked the drive and don't see anything out of the ordinary. Unfortunately I didn't redirect the output, and it doesn't look like chkdsk logs the errors (other than the event log entry), and that's not long enough to be useful.
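For a future run, one way to keep the messages is to wrap the command so that every output line is both shown and appended to a file. The following is only a minimal sketch: the helper name run_and_log and the log path are made up for illustration, and it has not been tried against chkdsk on a clustered disk.

```python
import subprocess
import sys


def run_and_log(cmd, log_path):
    """Run cmd, echo each output line to the console, and append it to log_path.

    Returns the exit code of the process.
    """
    with open(log_path, "a", encoding="utf-8", errors="replace") as log:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # keep error text in the same log
            text=True,
            errors="replace",
        )
        for line in proc.stdout:
            sys.stdout.write(line)
            log.write(line)
            log.flush()  # so the log survives if the run is cancelled
        return proc.wait()


# Hypothetical invocation (drive letter and log path are assumptions):
# run_and_log(["chkdsk", "E:", "/f"], r"C:\chkdsk_E.log")
```

Plain shell redirection (chkdsk E: /f > somefile.txt) would also keep the text, but then nothing is shown on screen while it runs.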

I was thinking about the restore option, and if it was still running this morning, I was going to start investigating that option.

Thanks for all the comments.


"Mathieu CHATEAU" <gollum123@xxxxxxx> wrote in message news:uWMYOe0AIHA.5960@xxxxxxxxxxxxxxxxxxxxxxx
Keep going!

Otherwise, can you format the drive and restore from backup? It may go faster, depending on your backup storage and type of files (the bigger the better).

--
Cordialement,
Mathieu CHATEAU
English blog: http://lordoftheping.blogspot.com
French blog: http://www.lotp.fr


"Mike O" <put_the_spam@xxxxxxx> wrote in message news:uCaiwtxAIHA.4612@xxxxxxxxxxxxxxxxxxxxxxx
It's gone from 1% to 76% in the last hour. As I'm watching it, it now seems to be moving about 1% every two minutes...

The optimism is starting to slowly come back..
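As a rough check on the numbers in this thread, the sketch below compares two naive estimates: extrapolating linearly from the start of phase 2, and extrapolating from the most recently observed rate. The function names are made up for illustration, and nothing here reflects how chkdsk itself computes or reports progress.

```python
def linear_hours_remaining(elapsed_hours, percent_done):
    """Extrapolate remaining hours, assuming progress was linear from the start."""
    if not 0 < percent_done <= 100:
        raise ValueError("percent_done must be in (0, 100]")
    return elapsed_hours * (100 - percent_done) / percent_done


def minutes_remaining_at_rate(percent_done, minutes_per_percent):
    """Estimate remaining minutes from the most recently observed rate."""
    if not 0 <= percent_done <= 100:
        raise ValueError("percent_done must be in [0, 100]")
    return (100 - percent_done) * minutes_per_percent


# 1% after 7.5 hours, extrapolated linearly from the start of phase 2
print(linear_hours_remaining(7.5, 1))
# 76% done, moving about 1% every two minutes
print(minutes_remaining_at_rate(76, 2))
```

The first estimate comes out at 742.5 hours (about a month), the second at 48 minutes, which is why a single early reading of a non-linear stage says very little about the finish time.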


"Mike O" <put_the_spam@xxxxxxx> wrote in message news:OE%23FbKxAIHA.4232@xxxxxxxxxxxxxxxxxxxxxxx
This is not looking good at all. It just jumped all the way to 1%. It's been running phase 2 since about 4:00pm today. 7-1/2 hours for 1% isn't a good sign.



"Mike O" <put_the_spam@xxxxxxx> wrote in message news:u5mYq7vAIHA.4984@xxxxxxxxxxxxxxxxxxxxxxx
It's not looking too good at the moment. I started it about 3:15 this afternoon. Phase 1 went through pretty fast, it found and corrected the 60 or so corrupted attribute & orphaned records that the read-only chkdsk passes were detecting.

However, it started phase 2 around 4:00, and now at 9:00 it's still at 0 percent... I seem to remember that the stage 2 steps go in 10% increments (at least I hope so!), and I know that this stage isn't linear, and that it might move erratically, but I was hoping to see something besides 0 by now.. According to task manager the chkdsk is running, it shows the process running about 30+cpu time.

Assuming it doesn't finish by the end of our maintenance window, do you know if there would be any problems cancelling the process? I know it wouldn't fix the main problem, but at least we could get the system up and running until another time (or relocate the data to another drive).


"Mathieu CHATEAU" <gollum123@xxxxxxx> wrote in message news:e4mBYyhAIHA.5124@xxxxxxxxxxxxxxxxxxxxxxx
It's OK, I understand that you are under pressure!

I was just trying to get you to think about the current pressure, which might be lower if you only had to take part of the data offline and not the whole cake ;)

Let us know how it's going after the chkdsk.
--
Cordialement,
Mathieu CHATEAU
http://lordoftheping.blogspot.com


"Mike O" <put_the_spam@xxxxxxx> wrote in message news:%23NFNazWAIHA.5960@xxxxxxxxxxxxxxxxxxxxxxx
By the way, after posting the message below, I realized some of my wording may have come off sounding a little cranky.. It's been a long, tiring week, in addition to this I've had a couple of other issues and I may have overreacted a little bit.



"Mike O." <MikeO@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote in message news:B655B129-B516-4C8A-8993-B045062AE633@xxxxxxxxxxxxxxxx
Per the KB284134, clustering supports GPT if you apply the hotfix, which we
did prior to connecting the GPT disk. I don't believe that applying the
Microsoft supported hotfix to correct the issue is "forcing" it.

Also, the problem I'm having is on the smaller basic disk, the GPT one is
fine.

We thought about breaking the "drive" into smaller partitions, but the
issues we run into are space allocation. Eventually we'll end up with one
partition running out of space and another one with space to spare. Our
backup system is an enterprise system, running over 1Gb ethernet (we're
looking at backing up over the SAN soon), so backing up 1-2 TB is not a
problem.
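For a sense of scale, here is a minimal sketch of the transfer-time arithmetic for a volume of this size over a 1Gb link. The efficiency parameter is an assumption, not a measured value for this backup system, and real throughput with many small user files could be well below line rate.

```python
def transfer_hours(data_tb, link_gbps, efficiency=1.0):
    """Hours needed to move data_tb terabytes (decimal) over a link_gbps link.

    efficiency is the assumed fraction of line rate actually achieved.
    """
    if data_tb < 0 or link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("invalid size, link speed, or efficiency")
    bits = data_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


# A 1.5 TB volume that is about 70% full holds roughly 1.05 TB of data.
print(round(transfer_hours(1.05, 1.0), 2))        # at full line rate
print(round(transfer_hours(1.05, 1.0, 0.5), 2))   # assuming half of line rate
```

At full line rate that is a little over two hours, and it doubles if only half of the link speed is achieved.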

As for your other questions, I'm not sure where you got the "performance
problems" part. The server was working fine, performance was acceptable then
it quickly (over 30 minutes) failed. I'm still investigating it, but I'm
wondering if a memory leak in one of the drivers or other processes running
on it caused the issue.

We can't exclude real time virus scanning since these are user files. We've
had McAfee products and a support contract with them for years. According to
the tech there are no problems with Virusscan 8.x on the cluster.

We don't have any large Access databases on this system. I'm sure there are
some, but it's primarily a user file server, not supposed to be for
applications.

As for the error I'm receiving, should I be able to wait until this weekend
for the CHKDSK, or is it something that's only going to get worse? From some
Microsoft KB articles (and other stuff I found), it seems that NTFS keeps two
copies of the MFT and will use the other one if the primary is corrupted. Is
this correct?

When I do run chkdsk, are there any special issues with the cluster? I
know Windows normally can't run chkdsk on an active disk and has to do it
when the server is rebooted. The problem is that when the server reboots it
doesn't see the clustered disk until the cluster service starts, so chkdsk
can't access the disk "pre" bootup.

We have a test cluster (with a 50G shared disk). I ran chkdsk /f on it and
it said the drive needed to be unmounted and offered to do that for me. I
told it yes and it seemed to work OK. Of course the disk was unavailable
while the chkdsk was running, but it came back online as soon as it finished.

"Mathieu CHATEAU" wrote:

Hello,

GPT disks and clusters are not friends by default; forcing them to be friends
may lead to issues...

By default, server clusters do not support GPT shared disks in Windows
Server 2003
http://support.microsoft.com/kb/284134/en-us

That's the problem with such big data volumes... You should keep data
recovery, defrag & chkdsk in mind when sizing data volumes.
You will start having issues when you reach 4 million files, too.

Now, it's clear you have to run the chkdsk. Downtime for downtime, run it on
both if you can

For the performance part:
- Did you exclude all shared data from real-time antivirus scanning on the
cluster nodes?
- Do you have a huge MS Access database?
- Any monitoring/graphing tool to get some history on RAM, CPU, and network usage?





--
Cordialement,
Mathieu CHATEAU
http://lordoftheping.blogspot.com


"Mike O." <MikeO@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote in message
news:3CDFA86D-D7A6-4512-AA64-276CB90144A2@xxxxxxxxxxxxxxxx
> (This was also posted on the server clustering group)
>
> I'm trying to find out some information about using CHKDSK on a clustered
> drive. We have a two node cluster (active/passive) running Windows 2003 R2
> enterprise 32 bit with SP1. The cluster has three shared drives located on
> an EMC CX700 SAN. The three drives are a 500MB for the quorum, and two data
> drives 1.5TB (drive "E") and 2.4TB (drive "W"). Drive E is a basic disk, the
> 2.4TB drive W is a GPT disk. They're both about 70% full. The E drive has
> been active for about a year, the W: one was added around June.
>
> Yesterday the active node became sluggish and then stopped serving data. It
> still responded to low level stuff like PING, users were getting errors on
> the server. Logging in gave a blank screen. This has happened a couple of
> times before (that's a separate issue we're looking into).
>
> We went to the inactive node and did a "move group" in the cluster
> administrator. We've done this before for various reasons with no problems,
> it usually takes about 20 seconds to bring the resources up on the other
> node.
>
> This time when the resources came on line on the 2nd node, we started
> getting an application popup that "Windows - Corrupt File : The file or
> directory E:\$Mft is corrupt and unreadable. Please run the Chkdsk
> utility."
> The drive seems to be running OK with users accessing the information
> normally. I did some research and it appears that Windows will use the
> duplicate copy of the MFT if the primary one is corrupted.
>
> I know we need to run CHKDSK soon, but unfortunately, running chkdsk and
> taking the drive off line for several hours is not something we can do
> during daytime hours. If necessary we could run it overnight, but with that
> size of drive I don't know if it would finish by the next morning.
>
> The server has dual fiber connections (we're using the EMC Powerpath
> software for SAN failover), and we didn't have anything happen with the SAN
> at that time, so based on the timing I'm assuming the MFT corruption was
> related to the cluster failover, not a physical hardware issue, so I wasn't
> planning on running the sector scan. I would imagine a sector scan on a
> 1.5TB "disk" would run for a while…
>
> At this point I'm planning on running CHKDSK over the weekend. I've never
> run it on a clustered disk before and I'm looking for some information about
> it. I've read Microsoft KB176970 and KB903650, but frankly they're a little
> confusing with the issues about "maintenance mode".
>
> Also, is my understanding about the mirrored/secondary MFT valid? Since
> users appear to be getting information correctly can the CHKDSK wait until
> the weekend? Our backup policy does a full backup each week and an
> incremental daily, so if something really bad happens we should be able to
> recover.
>
> Any information on this would be appreciated.
>
> Mike O.








