Re: How to verify/fix High Disk Read Latencies in Exch2003 ?
- From: "Marlon Brown" <MarlonBrown@xxxxxxxxxxxxxxxxxxxxxxxxx>
- Date: Fri, 20 Jan 2006 07:11:42 -0800
Thanks John ! Let me set the perfmon counters very soon.
However, one more observation is that MOM is returning the same High Latency
alerts against two other Exchange pre-production servers, which are
virtually with 0 mailboxes on it, no usage at all. Therefore I am wondering
how the disk could be highly utilized if I have no users accessing the
pre-production servers. It is also returning the same high latency against
the Exch2003 SMTP Connector server, which is not clustered.
"John Fullbright" <fullbrij@xxxxxxxxxxx> wrote in message
news:%23klicnYHGHA.1124@xxxxxxxxxxxxxxxxxxxxxxx
> Being on a SAN doesn't exempt you from performance problems. I guess the
> first thing to do is confirm that you have a performance problem and
> determine how severe it is.
>
> First, over a period of a few days, collect permon counters. Key counters
> to collect include physical disk - Avg. Disk sec/read, Avg. Disk
> sec/write, Avg. Disk sec/transfer, reads/sec, writes/sec,
> transactions/sec, and split IO/sec. You aslo want to include Database -
> log record stalls/sec and MSExchangeIS - Client Latency > 5 sec RPCs.
>
> The Microsoft Whitepaper, "Optimizing Storage for Exchange Server 2003"
> http://www.microsoft.com/technet/prodtechnol/exchange/guides/StoragePerformance/fa839f7d-f876-42c4-a335-338a1eb04d89.mspx,
> is an excellent place to start when interpreting the data you collect.
> The paper lists two specific criteria you can use to determine if your
> disk subsystem is performing poorly:
>
> 1. Average read and write latencies over 20ms
>
> 2. Latency spikes over 50ms that last more than a few seconds.
>
> The Avg. Disk sec/read, Avg. Disk sec/write, and Avg. Disk sec/transfer
> counters will give you the data to compare to this standard. I believe
> something close to this was the basis of the MOM alert; PhysicalDisk:
> Avg. Disk sec/Read: 0 C: value = 0.059590036231884. The average over last
> 10 samples is 0.05959. Once you confirm a problem exists, you need to
> determine what the impact of the problem is. This is where log record
> stalls/sec and client latency > 10 sec RPCs come in. Log record stalls
> happen when incoming data ills the log buffer in RAM to the high water
> mark. A forced commit begins writng the log buffers in RAM to the current
> log file on disk and continues until data in the log buffers falls to the
> low water mark. During the forced commit, all client IO is quiesed. This
> is known as a log stall. KB 328466 was one of the first to actually
> define when log stalls are a problem. The criteria in 328466 are:
>
> 1. Average value is more than 10 per second
>
> 2. Spikes (maximum values) are higher than 100 per second
>
> MSExchangeIS - Log Record stalls/sec will give you the data to evaluate
> these criteria. KB 839862 says:
>
> "When Outlook 2002 and later versions request data from an Exchange Server
> computer, Outlook calls a function that wraps the RPC to the server. This
> new wrapper is the Cancelable RPC wrapper. By default, the Cancelable RPC
> wrapper starts a timer and issues the RPC. When the RPC is complete, the
> wrapper closes the timer, cleans up, and quits. However, if the RPC for
> data takes more than 5 seconds to return the data, the wrapper produces
> the Cancel Request dialog box. The Cancel Request dialog box remains on
> the screen until the RPC is answered or until the user clicks Cancel. If
> the action that the user performs in Outlook causes multiple RPCs to be
> made, the user may receive a Cancel Request dialog box for each RPC."
>
> If we see a high number of client latencies over 5 seconds, clients are
> definitely seeing the dreaded "requesting data" dialog box and calling the
> helpdesk. MSExchangeIS - client latencies > 5 sec RPCs will give you the
> data to evaluate this criteria.
>
> Once you determine that you have a problem and the problem is impacting
> users, it's time to look for the source. KB 839869 is somewhat helpful,
> and does list a plethora possible causes, in my experience disk is the
> culprit 99% of the time. Slow IO times will be obious from the
> sec/transaction, sec/read, and sec/write. If spikes in log stalls
> correlate with spikes in slow disk access, the log stall is occuring
> because the log buffers cannot be flushed to disk fast enough. If there
> is no correlation, or only a weak correlation, it is most likly the impact
> of large messages. You can mitigate the impact of large messages somewhat
> by increasing the number of log buffers per storage group or the number of
> storage groups (and thereby the number of log buffers; log buffers are set
> on a per storage group basis) but no amount of buffers will solve the
> problem if the storage is just simply too slow.
>
> A storage subsystem is either intentionally or unintentionally designed to
> support a specific IO load. The closer you get to the maximum IOPS
> capacity, the longer it takes for each IO to complete. When the average
> read and write latencies exceed 20 ms, we say the disk subsystem is
> overloaded. If we believe the disk subsystem is overloaded, we need to
> know how much load we are placing on it. Avg. Disk reads/sec, writes/sec,
> and Transactions/sec for each LUN that Exchange reads or writes to will
> tell us that. This includes the OS volume, the location of the page file,
> the temp directory location, the location of the smtp directories, the
> MTA, the logs, the databases, the SRS database if you have one, the
> working directory, the MSSearch Service gatherer logs, and the Message
> tracking logs. After determining what load we are placing on the various
> components, we need to figure out what load the storage location of these
> components is capable of supporting. You can use the following formulats
> for the RAID type of each LUN:
>
> P = perfomance of a single spindle. For 10K RPM SCSI drives at a target
> 20 ms IO use 85, for 15K use 110.
> N = number of spindles in the RAID set.
> N' = number of data spindles in the RAID set
>
> For RAID 1/10/0+1
>
> read performance = P*N
> write performance = P*N/2
>
> For RAID 5
>
> read performance = P*N'
> write performance = P*N'/4
>
> (depending on the controller and the caching scheme used and the amount of
> write cache, you may be able to use P*N'/3 for performance calculation)
>
> For RAID 4/RAID DP (as Implemented by Network Appliance)
>
> read performance = P*N'
> write performance = P*N'
>
> In all cases except RAID 4/RAID DP, you will need to determine your
> read/write ratio in order to apply the "write penalty. Use the reads/sec,
> writes/sec and transactions/sec to determine your ratio. For example:
>
> A RAID 5 array consisting of 3 spindles.
> A workload with a 3:1 read/write ratio.
>
> Write performce = 85*2 = 170 IOPS
> Read performance = 85*2/4 = 42.5 IOPS
> Workload performance = ( .75*170) + (.25*42.5) = 127.5 + 10.625 = 138.125
> IOPS.
>
> You can now determine if the load you are placing on the disk subsystem is
> at or over the IOPS capacity of the disk subsystem. This works well for
> direct attached sorage, but you mentioned a SAN. When you consolidate
> storage on a SAN, you have to be very careful to minimize the impact of
> comingling. Comingling occurs whe two or more LUNs reside on the same set
> of physical spindles, and heavy IO activity against one LUN negatively
> impact IO activity on another LUN that shares the same set of spindles.
> Heavy database activity could impact the logs if both LUNs share the same
> set of physical spindles. Likewise, heavy IO against on server can impact
> another unrelated server if both servers have disk whose LUNs reside on
> the same set of physical spindles. Unless the SAN vendor has a feature to
> limit the IO against comingling LUNs, you only real choice is physical
> isolation of the spindles. This is the basis of the MS recommendation
> that log files and databases reside on seperate spindles. The smoke and
> mirrors of virtualization can result in unintended and difficult to
> troubleshoot comingling situations. The only vendor that I am aware of
> that has a feature to limit the impact of comingling is Network Appliance.
> In Data ONTAP 7.X there is logice to distribute IO between Flexvols that
> exist within a given Aggregate.
>
> Well, it's not quite a troubleshooting guide, but hopefully it will get
> you going in the right direction.
>
> John Fullbright
>
>
>
>
>
> "Marlon Brown" <MarlonBrown@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote in message
> news:uv4ruCVHGHA.1032@xxxxxxxxxxxxxxxxxxxxxxx
>> The Troubleshooting Analyzer tool just showed one single user which
>> presents higher RPC latency than normal, what doesn't indicate a broad
>> problem in the server according to the report.
>>
>> Can you confirm whether high Physical Disk/%Disk Time is a good
>> indication of bottlenecks ?
>>
>> "Marlon Brown" <MarlonBrown@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote in message
>> news:OTPdwvUHGHA.524@xxxxxxxxxxxxxxxxxxxxxxx
>>> Darn. I think this is an actual issue.
>>> I am running Perfmon Physical Disk/% DiskTime:
>>> The paritition in which the information store is mounted shows %DiskTime
>>> utilization of 80-100 steady. I will run the Troubleshooting Analyzer to
>>> see what I get.
>>>
>>> The servers have the databases installed on a SAN drive, Raid 1+0, with
>>> 143GB free disk space (70% free disk space). Not sure what could be
>>> causing such latency...
>>>
>>>
>>> "Andy David - MVP" <adavid@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote in
>>> message news:5flvs1lebbmao8t6ps1eufgmt1rncukoud@xxxxxxxxxx
>>>> On Thu, 19 Jan 2006 09:58:04 -0800, Marlon Brown
>>>> <MarlonBrown@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>>>>
>>>>>MOM2005SP1 keeps warning that various Exchange 2003 Servers experience
>>>>>the
>>>>>issue below:
>>>>>
>>>>>High Disk Read Latencies for the past 10 minutes
>>>>>PhysicalDisk: Avg. Disk sec/Read: 0 C: value = 0.059590036231884. The
>>>>>average over last 10 samples is 0.05959.
>>>>>
>>>>>Can you indicate what's the best way to troubleshoot and fix this ?
>>>>>I think the first step would be by running Perfmon and setting manual
>>>>>Disk
>>>>>counters to see whether this is an actual problem. Can you point me a
>>>>>document or tell me the counter parameters I should be looking for to
>>>>>determine whether this an actual issue ? Also, if you know how to
>>>>>improve
>>>>>this in case it is an actual issue, I would appreciate your input.
>>>>
>>>>
>>>> Start here:
>>>> http://www.microsoft.com/Downloads/details.aspx?familyid=4BDC1D6B-DE34-4F1C-AEBA-FED1256CAF9A&displaylang=en
>>>>
>>>> Microsoft Exchange Server Performance Troubleshooting Analyzer Tool
>>>> v1.0
>>>>
>>>>
>>>> Make sure you have the latest firmware and updates for your Raid
>>>> Controllers and disks.
>>>>
>>>>
>>>> Have you sized your servers appropriately?
>>>>
>>>> http://www.microsoft.com/technet/prodtechnol/exchange/2003/library/highavailgde.mspx
>>>>
>>>
>>>
>>
>>
>
>
.
- Follow-Ups:
- Re: How to verify/fix High Disk Read Latencies in Exch2003 ?
- From: John Fullbright
- Re: How to verify/fix High Disk Read Latencies in Exch2003 ?
- References:
- Re: How to verify/fix High Disk Read Latencies in Exch2003 ?
- From: Andy David - MVP
- Re: How to verify/fix High Disk Read Latencies in Exch2003 ?
- From: Marlon Brown
- Re: How to verify/fix High Disk Read Latencies in Exch2003 ?
- From: Marlon Brown
- Re: How to verify/fix High Disk Read Latencies in Exch2003 ?
- From: John Fullbright
- Re: How to verify/fix High Disk Read Latencies in Exch2003 ?
- Prev by Date: Re: Exchange2003 backup folder limits at 2GB
- Next by Date: Outlook reminder
- Previous by thread: Re: How to verify/fix High Disk Read Latencies in Exch2003 ?
- Next by thread: Re: How to verify/fix High Disk Read Latencies in Exch2003 ?
- Index(es):
Relevant Pages
|