8 Mar, 2009

iostat and disk utilization monitoring nirvana

In my never-ending quest for better performance monitoring, I have been constantly trying to find better ways to monitor disk utilization on a server. At Directi we use the usual medley of tools: iostat, sar, and the rest of the sysstat suite. I made serious progress last week, when Dushyanth from my team shared this post on IO Monitoring on Linux, by the folks over at Pythian, on our internal mailing list. Here are my notes on the subject.

Performance measurement and capacity planning are a science. It is common practice at Directi to determine where the performance bottlenecks in any given application lie. A usual generalization is to determine whether an application is cpu-bound, memory-bound or IO-bound.

IO-bound applications end up wasting cpu cycles, especially in the case of disk IO, since most programming languages do not have async disk IO support today. Therefore, in order to maximize performance and optimize resource utilization, one should try to reduce the iowait time of a CPU and tweak a deployment to make the application cpu-bound.

When your CPU seems to be spending a lot of time in iowait you need to make some changes. However, iowait can occur either because there is a lot of disk/network IO taking place, or because the disk subsystem is saturated and cannot provide greater throughput. iostat lets you determine which one it is. A regular iostat output consists of the following fields:

# iostat -dkx 60

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-0              0.00     0.00  611.40  414.23 20286.60  1656.93    42.79    17.50   17.33   0.96  98.57

Explanation of the above fields:

  • Device: the block device whose performance counters are being reported
  • r/s and w/s: number of read and write requests issued per second to the device (in this case 611.40 and 414.23)
  • rsec/s and wsec/s: number of sectors read/written per second (shown when iostat is run without the -k flag)
  • rkB/s and wkB/s: number of kilobytes read/written per second
  • avgrq-sz: average number of sectors per request (for both reads and writes), i.e. (rsec/s + wsec/s) / (r/s + w/s)
  • avgqu-sz: average length of the request queue during the monitoring interval (in this case 17.50)
  • await: average time each IO request took to complete. This includes the time the request spent waiting in the queue and the time the device took to service it
  • svctm: average time the device took to service each request during the monitoring interval (queue time excluded)
  • Note: await includes svctm. In fact await (the average time each IO request took to complete) = the average time each request spent in queue (let's call it qtime) PLUS the average time each request took to service (svctm)
  • %util: the percentage of time the device spent servicing requests. This can be calculated from the values above. In this example the total number of reads and writes issued per second is 611.40 + 414.23 => ~1025. Each request takes 0.96 ms to service, therefore 1025 requests take 1025 x 0.96 => 984 ms to service. So out of the 1 second in which these requests were sent to the device, 984 ms were spent servicing them. This means the device utilization is 984/1000 * 100 => ~98.4%. As you can see in the iostat output above, %util does show ~98.5%
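The derivation above is easy to reproduce as a back-of-the-envelope calculation. The helper below is purely illustrative (not part of any tool); the parameter names mirror the iostat columns, and the values come from the dm-0 sample above:

```python
# Back-of-the-envelope helper reproducing the %util derivation above.
# Parameter names mirror the iostat columns.

def derived_metrics(r_per_s, w_per_s, await_ms, svctm_ms):
    """Derive average queue time and estimated %util from iostat fields."""
    iops = r_per_s + w_per_s                  # total requests issued per second
    qtime_ms = await_ms - svctm_ms            # average time a request waits in queue
    util_pct = iops * svctm_ms / 1000 * 100   # fraction of each second the device was busy
    return qtime_ms, util_pct

# Values from the dm-0 sample above
qtime, util = derived_metrics(611.40, 414.23, 17.33, 0.96)
print(round(qtime, 2), round(util, 1))  # 16.37 98.5
```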

Interpreting iostat values

Let's take the above example:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-0              0.00     0.00  611.40  414.23 20286.60  1656.93    42.79    17.50   17.33   0.96  98.57

  • avg time each request spent in queue (qtime) = await – svctm = 17.33 – 0.96 => 16.37 ms
  • avg time each request spent being serviced = 0.96 ms
  • so on average each IO request took 17.33 ms to get processed, of which 16.37 ms were spent just waiting in queue
  • %util can be calculated as (r/s + w/s) * svctm / 1000 ms * 100 => 1025 * 0.96/1000 * 100 => ~98.5%
  • This simply means that in a 1-second interval, ~1025 requests were sent to the disk, each of which took 0.96 ms for the disk to service, resulting in 984 ms of disk utilization time in a period of 1 s (or 1000 ms). This means the disk is more than 98% utilized

On this disk subsystem, it is clear that the disk cannot process more IO requests than it is already getting.

Let's take another example:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               6.33   139.07   46.30   19.53   526.27   634.40    35.26     0.54    8.17   3.30  21.74

  • avg time each request spent in queue (qtime) = await – svctm = 8.17 – 3.30 => 4.87 ms
  • avg time each request spent being serviced = 3.30 ms
  • so on average each IO request took 8.17 ms to get processed, of which 4.87 ms (a little more than half) were spent waiting in queue
  • %util can be calculated as (r/s + w/s) * svctm / 1000 ms * 100 => 65.83 * 3.3/1000 * 100 => ~21.7%
  • This simply means that in a 1-second interval, ~66 requests were sent to the disk, each of which took 3.30 ms for the disk to service, resulting in ~217 ms of disk utilization time in a period of 1 s (or 1000 ms). This means the disk is around 21.7% utilized

On this disk subsystem, it is clear that the disk is not fully utilized. Due to the nature of the requests, each one is on average spending a little more than half its time in queue, but that is not alarming at this utilization level. This disk subsystem is capable of greater throughput.
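Putting the two examples side by side, the triage logic can be sketched as a tiny helper. The thresholds (80% utilization, 75% queue-wait ratio) are the rules of thumb used in this post, not hard limits, and the function is purely illustrative:

```python
# Illustrative triage of a single iostat device sample.
# Thresholds are rules of thumb, not hard limits.

def disk_headroom(await_ms, svctm_ms, util_pct):
    """Classify a device sample as saturated or as having headroom."""
    # share of await that was spent waiting in queue rather than being serviced
    queue_pct = (await_ms - svctm_ms) / await_ms * 100
    if util_pct > 80 or queue_pct > 75:
        return "saturated"     # disk cannot keep up: add RAM/cache/spindles or partition
    return "has headroom"      # capable of greater throughput

print(disk_headroom(17.33, 0.96, 98.57))  # dm-0 sample -> saturated
print(disk_headroom(8.17, 3.30, 21.74))   # sdb sample  -> has headroom
```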

Notes

On every Linux box the following should be graphed at 5-minute averages:

  • %util: When this figure is consistently above 80% you will need to take one of the following actions:
    • increase RAM so dependence on disk reduces
    • increase RAID controller cache so dependence on disk decreases
    • increase the number of disks so disk throughput increases (more spindles working in parallel)
    • partition the data horizontally
  • (await – svctm)/await * 100: The percentage of time that IO operations spent waiting in queue compared to actually being serviced. If this figure goes above 50%, each IO request is spending more time waiting in queue than being processed. If the ratio skews heavily upwards (into the >75% range), your disk subsystem is not able to keep up with the IO requests and most of them spend a long time waiting in queue. In this scenario you will again need to take one of the actions above
  • %iowait: This number shows the percentage of time the CPU wastes waiting for IO. Part of this number can result from network IO, which can be avoided by using an async IO library. The rest is simply an indication of how IO-bound your application is. You can reduce this number by ensuring that disk IO operations take less time, keeping more data in RAM, increasing disk throughput by adding disks to a RAID array, or using SSDs (check my post on Solid State Drives vs Hard Drives) for some or all of the data
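To graph these numbers you first have to parse the iostat output. The sketch below assumes the column layout shown in the samples above (the line ends in await, svctm, %util); field positions differ across sysstat versions, so adjust before relying on it:

```python
# Sketch: extract %util and the queue-wait ratio from one `iostat -dkx`
# device line. Assumes the column order shown in this post; adjust for
# your sysstat version.

SAMPLE = "dm-0  0.00  0.00  611.40  414.23  20286.60  1656.93  42.79  17.50  17.33  0.96  98.57"

def parse_device_line(line):
    fields = line.split()
    await_ms = float(fields[-3])
    svctm_ms = float(fields[-2])
    util_pct = float(fields[-1])
    # percentage of await spent waiting in queue rather than being serviced
    queue_wait_pct = (await_ms - svctm_ms) / await_ms * 100
    return {"device": fields[0], "util_pct": util_pct, "queue_wait_pct": queue_wait_pct}

row = parse_device_line(SAMPLE)
print(row["device"], row["util_pct"], round(row["queue_wait_pct"], 1))  # dm-0 98.57 94.5
```

Feeding these two numbers into any time-series grapher at 5-minute intervals gives the trend lines described above.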
Comments
N
March 9, 2009

Very few web hosting companies seem to do a detailed analysis. For shared hosting, you can see a lot of variation in response time with even the larger web hosting companies.

tinkertim
March 12, 2009

Have you checked out dm-ioband? Not only are the metrics easier to obtain, but great QoS can be achieved.

Atsushi
March 31, 2009

Very clear description of %util of a disk.
It is totally what I’ve looked for.
Thank you.

Abid
July 16, 2009

Good article though I’m confused whether or not it applies to an issue at my hand.

To put it simple, I’ve a java application wherein I spawn some 200 threads and each thread has to read/write its own file inside one common directory say “temp”. However I’ve a requirement that only one thread (and in terms of latest linux kernel I may say each process) do read/write operation on “temp” at a time. So, I streamline the operation by writing a code that can be represented by following pseudo code:

lock(“temp”){

do read/write

}

So, now if 200 threads concurrently try to execute this code they won’t be able to do so; one thread will execute this piece of code at a time and others will get blocked.

So, I was just wondering if thread-1 starts now and say after 3 seconds thread-200 gets to acquire the lock and perform read/write operation, will this waiting time of 3 seconds affect “await” of iostat?

Thiru
September 29, 2009

Great. . . Gud keep it up. . .

Travis Bear
April 27, 2010

Excellent explanation. Just what I needed. Thank you!

-Travis
