Re: How to verify/fix High Disk Read Latencies in Exch2003 ?
- From: "John Fullbright" <fullbrij@xxxxxxxxxxx>
- Date: Thu, 19 Jan 2006 22:22:11 -0800
Being on a SAN doesn't exempt you from performance problems. I guess the
first thing to do is confirm that you have a performance problem and
determine how severe it is.
First, over a period of a few days, collect permon counters. Key counters
to collect include physical disk - Avg. Disk sec/read, Avg. Disk sec/write,
Avg. Disk sec/transfer, reads/sec, writes/sec, transactions/sec, and split
IO/sec. You aslo want to include Database - log record stalls/sec and
MSExchangeIS - Client Latency > 5 sec RPCs.
The Microsoft Whitepaper, "Optimizing Storage for Exchange Server 2003"
http://www.microsoft.com/technet/prodtechnol/exchange/guides/StoragePerformance/fa839f7d-f876-42c4-a335-338a1eb04d89.mspx,
is an excellent place to start when interpreting the data you collect. The
paper lists two specific criteria you can use to determine if your disk
subsystem is performing poorly:
1. Average read and write latencies over 20ms
2. Latency spikes over 50ms that last more than a few seconds.
The Avg. Disk sec/read, Avg. Disk sec/write, and Avg. Disk sec/transfer
counters will give you the data to compare to this standard. I believe
something close to this was the basis of the MOM alert; PhysicalDisk: Avg.
Disk sec/Read: 0 C: value = 0.059590036231884. The average over last 10
samples is 0.05959. Once you confirm a problem exists, you need to
determine what the impact of the problem is. This is where log record
stalls/sec and client latency > 10 sec RPCs come in. Log record stalls
happen when incoming data ills the log buffer in RAM to the high water mark.
A forced commit begins writng the log buffers in RAM to the current log file
on disk and continues until data in the log buffers falls to the low water
mark. During the forced commit, all client IO is quiesed. This is known as
a log stall. KB 328466 was one of the first to actually define when log
stalls are a problem. The criteria in 328466 are:
1. Average value is more than 10 per second
2. Spikes (maximum values) are higher than 100 per second
MSExchangeIS - Log Record stalls/sec will give you the data to evaluate
these criteria. KB 839862 says:
"When Outlook 2002 and later versions request data from an Exchange Server
computer, Outlook calls a function that wraps the RPC to the server. This
new wrapper is the Cancelable RPC wrapper. By default, the Cancelable RPC
wrapper starts a timer and issues the RPC. When the RPC is complete, the
wrapper closes the timer, cleans up, and quits. However, if the RPC for data
takes more than 5 seconds to return the data, the wrapper produces the
Cancel Request dialog box. The Cancel Request dialog box remains on the
screen until the RPC is answered or until the user clicks Cancel. If the
action that the user performs in Outlook causes multiple RPCs to be made,
the user may receive a Cancel Request dialog box for each RPC."
If we see a high number of client latencies over 5 seconds, clients are
definitely seeing the dreaded "requesting data" dialog box and calling the
helpdesk. MSExchangeIS - client latencies > 5 sec RPCs will give you the
data to evaluate this criteria.
Once you determine that you have a problem and the problem is impacting
users, it's time to look for the source. KB 839869 is somewhat helpful, and
does list a plethora possible causes, in my experience disk is the culprit
99% of the time. Slow IO times will be obious from the sec/transaction,
sec/read, and sec/write. If spikes in log stalls correlate with spikes in
slow disk access, the log stall is occuring because the log buffers cannot
be flushed to disk fast enough. If there is no correlation, or only a weak
correlation, it is most likly the impact of large messages. You can
mitigate the impact of large messages somewhat by increasing the number of
log buffers per storage group or the number of storage groups (and thereby
the number of log buffers; log buffers are set on a per storage group basis)
but no amount of buffers will solve the problem if the storage is just
simply too slow.
A storage subsystem is either intentionally or unintentionally designed to
support a specific IO load. The closer you get to the maximum IOPS
capacity, the longer it takes for each IO to complete. When the average
read and write latencies exceed 20 ms, we say the disk subsystem is
overloaded. If we believe the disk subsystem is overloaded, we need to know
how much load we are placing on it. Avg. Disk reads/sec, writes/sec, and
Transactions/sec for each LUN that Exchange reads or writes to will tell us
that. This includes the OS volume, the location of the page file, the temp
directory location, the location of the smtp directories, the MTA, the logs,
the databases, the SRS database if you have one, the working directory, the
MSSearch Service gatherer logs, and the Message tracking logs. After
determining what load we are placing on the various components, we need to
figure out what load the storage location of these components is capable of
supporting. You can use the following formulats for the RAID type of each
LUN:
P = perfomance of a single spindle. For 10K RPM SCSI drives at a target 20
ms IO use 85, for 15K use 110.
N = number of spindles in the RAID set.
N' = number of data spindles in the RAID set
For RAID 1/10/0+1
read performance = P*N
write performance = P*N/2
For RAID 5
read performance = P*N'
write performance = P*N'/4
(depending on the controller and the caching scheme used and the amount of
write cache, you may be able to use P*N'/3 for performance calculation)
For RAID 4/RAID DP (as Implemented by Network Appliance)
read performance = P*N'
write performance = P*N'
In all cases except RAID 4/RAID DP, you will need to determine your
read/write ratio in order to apply the "write penalty. Use the reads/sec,
writes/sec and transactions/sec to determine your ratio. For example:
A RAID 5 array consisting of 3 spindles.
A workload with a 3:1 read/write ratio.
Write performce = 85*2 = 170 IOPS
Read performance = 85*2/4 = 42.5 IOPS
Workload performance = ( .75*170) + (.25*42.5) = 127.5 + 10.625 = 138.125
IOPS.
You can now determine if the load you are placing on the disk subsystem is
at or over the IOPS capacity of the disk subsystem. This works well for
direct attached sorage, but you mentioned a SAN. When you consolidate
storage on a SAN, you have to be very careful to minimize the impact of
comingling. Comingling occurs whe two or more LUNs reside on the same set
of physical spindles, and heavy IO activity against one LUN negatively
impact IO activity on another LUN that shares the same set of spindles.
Heavy database activity could impact the logs if both LUNs share the same
set of physical spindles. Likewise, heavy IO against on server can impact
another unrelated server if both servers have disk whose LUNs reside on the
same set of physical spindles. Unless the SAN vendor has a feature to limit
the IO against comingling LUNs, you only real choice is physical isolation
of the spindles. This is the basis of the MS recommendation that log files
and databases reside on seperate spindles. The smoke and mirrors of
virtualization can result in unintended and difficult to troubleshoot
comingling situations. The only vendor that I am aware of that has a
feature to limit the impact of comingling is Network Appliance. In Data
ONTAP 7.X there is logice to distribute IO between Flexvols that exist
within a given Aggregate.
Well, it's not quite a troubleshooting guide, but hopefully it will get you
going in the right direction.
John Fullbright
"Marlon Brown" <MarlonBrown@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote in message
news:uv4ruCVHGHA.1032@xxxxxxxxxxxxxxxxxxxxxxx
> The Troubleshooting Analyzer tool just showed one single user which
> presents higher RPC latency than normal, what doesn't indicate a broad
> problem in the server according to the report.
>
> Can you confirm whether high Physical Disk/%Disk Time is a good indication
> of bottlenecks ?
>
> "Marlon Brown" <MarlonBrown@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote in message
> news:OTPdwvUHGHA.524@xxxxxxxxxxxxxxxxxxxxxxx
>> Darn. I think this is an actual issue.
>> I am running Perfmon Physical Disk/% DiskTime:
>> The paritition in which the information store is mounted shows %DiskTime
>> utilization of 80-100 steady. I will run the Troubleshooting Analyzer to
>> see what I get.
>>
>> The servers have the databases installed on a SAN drive, Raid 1+0, with
>> 143GB free disk space (70% free disk space). Not sure what could be
>> causing such latency...
>>
>>
>> "Andy David - MVP" <adavid@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote in
>> message news:5flvs1lebbmao8t6ps1eufgmt1rncukoud@xxxxxxxxxx
>>> On Thu, 19 Jan 2006 09:58:04 -0800, Marlon Brown
>>> <MarlonBrown@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>>>
>>>>MOM2005SP1 keeps warning that various Exchange 2003 Servers experience
>>>>the
>>>>issue below:
>>>>
>>>>High Disk Read Latencies for the past 10 minutes
>>>>PhysicalDisk: Avg. Disk sec/Read: 0 C: value = 0.059590036231884. The
>>>>average over last 10 samples is 0.05959.
>>>>
>>>>Can you indicate what's the best way to troubleshoot and fix this ?
>>>>I think the first step would be by running Perfmon and setting manual
>>>>Disk
>>>>counters to see whether this is an actual problem. Can you point me a
>>>>document or tell me the counter parameters I should be looking for to
>>>>determine whether this an actual issue ? Also, if you know how to
>>>>improve
>>>>this in case it is an actual issue, I would appreciate your input.
>>>
>>>
>>> Start here:
>>> http://www.microsoft.com/Downloads/details.aspx?familyid=4BDC1D6B-DE34-4F1C-AEBA-FED1256CAF9A&displaylang=en
>>>
>>> Microsoft Exchange Server Performance Troubleshooting Analyzer Tool
>>> v1.0
>>>
>>>
>>> Make sure you have the latest firmware and updates for your Raid
>>> Controllers and disks.
>>>
>>>
>>> Have you sized your servers appropriately?
>>>
>>> http://www.microsoft.com/technet/prodtechnol/exchange/2003/library/highavailgde.mspx
>>>
>>
>>
>
>
.
- Follow-Ups:
- Re: How to verify/fix High Disk Read Latencies in Exch2003 ?
- From: Marlon Brown
- Re: How to verify/fix High Disk Read Latencies in Exch2003 ?
- References:
- Re: How to verify/fix High Disk Read Latencies in Exch2003 ?
- From: Andy David - MVP
- Re: How to verify/fix High Disk Read Latencies in Exch2003 ?
- From: Marlon Brown
- Re: How to verify/fix High Disk Read Latencies in Exch2003 ?
- From: Marlon Brown
- Re: How to verify/fix High Disk Read Latencies in Exch2003 ?
- Prev by Date: OWA Admin Tool - Options Greyed out
- Next by Date: Re: OWA Admin Tool - Options Greyed out
- Previous by thread: Re: How to verify/fix High Disk Read Latencies in Exch2003 ?
- Next by thread: Re: How to verify/fix High Disk Read Latencies in Exch2003 ?
- Index(es):
Relevant Pages
|