9 Jul, 2009
A Compendium of Load Balancing Strategies
Posted by Bhavin Turakhia | (1) Comments
I have drafted a detailed article comparing various load balancing strategies, including conventional load balancers, wackamole and NLB.
You can review the article on our Directi Wiki – A Compendium of Load Balancing Strategies. An interesting read for anyone who is involved with scalable and highly available applications.
15 Mar, 2009
Selecting a Solid State Drive technology
Posted by Bhavin Turakhia | (4) Comments
This is my 3rd post on Solid State drive technology (read Solid State Drives vs Hard disk drives). Of late I have been writing a lot about storage, given that in a high-performance, data-intensive environment, applications eventually bottleneck at the slowest component in the chain – the disk (no surprise there, considering it is the only device with moving parts).
With the numerous SSD options on the market, it is hard to figure out which way to go without adequate research. In fact, a wrong choice can slow down your application (check Solid State Hard drives have poor random write performance).
Here are my notes on selecting your SSD platform for maximum performance -
Check the random write IOPs
Random write performance varies widely between SSDs, and SSDs fare worst on random writes (4K or smaller blocks). Therefore, when evaluating an SSD, always check the maximum random write IOPs it delivers.
Single-level cell (SLC) vs Multi-level cell (MLC)
SSDs use either single-level cell (SLC) or multi-level cell (MLC) flash. SLC stores one bit per cell, while MLC stores two or more. SLC memory has the advantage of faster transfer speeds, lower power consumption and higher cell endurance. However, since it stores less data per cell, SLC costs more per megabyte of storage than MLC.
Managed Flash Technology
I first wrote about Managed Flash Technology in my earlier article on SSDs - Solid State Drives vs Hard disk drives. MFT was developed by EasyCo. MFT essentially converts multiple random writes into a single linear write by collating them and writing them out together. MFT can be used in conjunction with pretty much any SSD. EasyCo’s website has interesting benchmark comparisons of SSDs with and without MFT. For instance, on the Mtron PRO 7000, MFT increased the random IOPs for 4K blocks from 123 to a whopping 16,180.
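The idea of collating random writes into one linear write can be sketched with a toy model. This is only an illustration of the general technique and not EasyCo's implementation; the class name `CoalescingWriter`, its `batch_size` parameter and the in-memory "log" standing in for the device are all assumptions made for the example.

```python
class CoalescingWriter:
    """Toy model: buffer random block writes and emit them as one sequential run."""

    def __init__(self, batch_size=4):
        self.batch_size = batch_size   # hypothetical tuning knob
        self.pending = []              # (logical_block, data) not yet written out
        self.log = []                  # simulated device, written strictly in order
        self.block_map = {}            # logical block -> position in the log
        self.device_writes = 0         # write operations issued to the device

    def write(self, logical_block, data):
        """Accept a write to any logical block; flush once a batch is full."""
        self.pending.append((logical_block, data))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Append every pending block to the end of the log in one pass."""
        if not self.pending:
            return
        for logical_block, data in self.pending:
            self.block_map[logical_block] = len(self.log)
            self.log.append(data)
        self.pending.clear()
        self.device_writes += 1        # the whole batch counts as one linear write

    def read(self, logical_block):
        """Return the newest data for a block, checking unflushed writes first."""
        for block, data in reversed(self.pending):
            if block == logical_block:
                return data
        return self.log[self.block_map[logical_block]]


writer = CoalescingWriter(batch_size=4)
for block in (907, 12, 455, 63):       # scattered logical block numbers
    writer.write(block, f"data-{block}".encode())
print(writer.device_writes, writer.read(455))
```

Four writes to scattered block numbers reach the simulated device as a single sequential append, and the mapping table keeps each logical block readable at its new location.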
Fusion-io
Fusion-io claims to have produced the fastest SSD in the world. With Steve Wozniak on their team, I would be inclined to believe their claims. The spec sheet of their latest drive – the ioDrive Duo – boasts a random write performance of 180,530 IOPs. That is insane. Additionally, multiple ioDrives can be configured into a RAID array for additional performance and reliability.
8 Mar, 2009
iostat and disk utilization monitoring nirvana
Posted by Bhavin Turakhia | (6) Comments
In my never-ending quest for better performance monitoring, I have been constantly trying to find better ways to monitor disk utilization on a server. At Directi we use the usual medley of tools at our disposal viz. iostat, sar, sysstat. I made serious progress last week, when Dushyanth from my team shared this post on IO Monitoring on Linux, by the folks over at Pythian, on our internal mailing list. Here are my notes on the subject.
Performance measurement and Capacity planning are a science. It is common practice at Directi to attempt to determine what the performance bottlenecks in any given application are. A usual generalization is to determine whether an application is cpu-bound / memory-bound / IO bound.
IO bound applications end up wasting cpu cycles, especially in case of Disk IO, since most programming languages do not have Async Disk IO support today. Therefore, in order to maximize performance and optimize resource utilization, one should try to reduce the iowait time of a CPU and tweak a deployment to make an application cpu-bound.
When your CPU seems to be spending a lot of time on iowait you need to make some changes. However, iowait can occur either because there is a lot of Disk/Network IO taking place, or because the disk subsystem is saturated and cannot provide greater throughput. iostat allows you to determine which one it is. A regular iostat output consists of the following fields -
# iostat -dkx 60
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
dm-0 0.00 0.00 611.40 414.23 20286.60 1656.93 42.79 17.50 17.33 0.96 98.57
Explanation of the above fields:
- Device: the block device whose performance counters are being reported
- r/s and w/s: number of read and write requests issued per second to the device (in this case 611 and 414)
- rsec/s and wsec/s – number of sectors read/written per second (these columns appear in place of rkB/s and wkB/s when iostat is run without the -k flag)
- rkB/s and wkB/s – number of kilobytes read/written per second
- avgrq-sz – average number of sectors per request (for both reads and writes), i.e. (rsec + wsec) / (r + w) (in this case 42.79)
- avgqu-sz – average queue length in the monitoring interval (in this case 17.50)
- await – average time that each IO Request took to complete. This includes the time that the request was waiting in the queue and the time that the request took to be serviced by the device
- svctm – average time each IO request took to be serviced by the device during the monitoring interval, excluding the time it spent waiting in the queue
- Note: await includes svctm. In fact await (average time taken for each IO Request to complete) = the average time that each request was in queue (let's call it queuetime) PLUS the average time each request took to process (svctm)
- %util: This number depicts the percentage of time that the device spent in servicing requests. This can be calculated with the above values. In the above example the total number of reads and writes issued per second is 611 + 414 => 1025. Each request takes 0.96 ms to process. Therefore 1025 requests would take 1025 x 0.96 => 984 ms to process. So out of the 1 second that these requests were sent to the device in, 984 ms were taken to process the requests. This means the device utilization is 984/1000 * 100 => ~98.4%. As you can see in the above iostat output the %util does show ~ 98.5%
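To keep the columns straight, it can help to pair each header name with its value programmatically. The following is a minimal sketch that uses the sample line above; the function names are made up for illustration, and the 512-byte sector size used to recompute avgrq-sz from the kB/s columns is an assumption.

```python
# Header and sample row copied from the iostat -dkx output above.
HEADER = "Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util"
SAMPLE = "dm-0 0.00 0.00 611.40 414.23 20286.60 1656.93 42.79 17.50 17.33 0.96 98.57"


def parse_iostat_row(header, row):
    """Pair each column name with its value; the device name stays a string."""
    names = header.split()
    values = row.split()
    if len(names) != len(values):
        raise ValueError("header and row have different column counts")
    parsed = {"device": values[0]}
    for name, value in zip(names[1:], values[1:]):
        parsed[name] = float(value)
    return parsed


def avg_request_sectors(stats, sector_bytes=512):
    """Recompute avgrq-sz from throughput; sector_bytes=512 is an assumption."""
    sectors_per_kb = 1024 / sector_bytes
    total_kb = stats["rkB/s"] + stats["wkB/s"]
    total_requests = stats["r/s"] + stats["w/s"]
    return total_kb * sectors_per_kb / total_requests


stats = parse_iostat_row(HEADER, SAMPLE)
print(stats["avgqu-sz"])                      # the queue length column
print(round(avg_request_sectors(stats), 2))   # compare with the avgrq-sz column
```

Recomputing the request size from the throughput columns is a quick sanity check that a value is being read from the right column.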
Interpreting iostat values
Let's take the above example -
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
dm-0 0.00 0.00 611.40 414.23 20286.60 1656.93 42.79 17.50 17.33 0.96 98.57
- avg time that each request spent in queue (qtime) = await – svctm = 17.33 – 0.96 => 16.37 ms
- avg time that each request spent being serviced = 0.96 ms
- so on average each IO request took 17.33 ms to get processed, of which 16.37 ms were spent just waiting in queue
- %util can be calculated as (r/s + w/s) * svctm / 1000ms * 100 => 1025*0.96/1000 * 100 => 98.5%
- This simply means that in a 1 second interval, 1025 requests were sent to disk, each of which took 0.96 ms for the disk to process, resulting in 984 ms of disk utilization time in a period of 1 s (or 1000 ms). This means the disk is greater than 98% utilized
On this disk subsystem, it is clear that the disk cannot process more IO requests than it is already getting.
Let's take another example -
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sdb 6.33 139.07 46.30 19.53 526.27 634.40 35.26 0.54 8.17 3.30 21.74
- avg time that each request spent in queue (qtime) = await – svctm = 8.17 – 3.30 => 4.87 ms
- avg time that each request spent being serviced = 3.30 ms
- so on average each IO request took 8.17 ms to get processed, of which 4.87 ms (a little more than half) were spent waiting in queue
- %util can be calculated as (r/s + w/s) * svctm / 1000ms * 100 => 65.83 * 3.3/1000 * 100 => 21.72%
- This simply means that in a 1 second interval, 65 requests were sent to disk, each of which took 3.30 ms for the disk to process, resulting in 217 ms of disk utilization time in a period of 1 s (or 1000 ms). This means the disk is around 21.7% utilized
On this disk subsystem, it is clear that the disk is not fully utilized. While, due to the nature of the requests, requests are on average spending a little more than half their time in queue, that is not so bad. This disk subsystem is capable of greater throughput.
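The arithmetic in the two examples above can be reproduced with a few lines of code. This is a minimal sketch; the function name `derived_metrics` and the dictionary keys are illustrative choices, and the inputs are the r/s, w/s, await and svctm values from the two sample rows.

```python
def derived_metrics(reads_per_s, writes_per_s, await_ms, svctm_ms):
    """Derive queue time and utilization from one row of iostat -x output."""
    requests_per_s = reads_per_s + writes_per_s
    queue_ms = await_ms - svctm_ms                 # time spent waiting in queue
    busy_ms_per_s = requests_per_s * svctm_ms      # device busy time per second
    queue_pct = queue_ms / await_ms * 100 if await_ms > 0 else 0.0
    return {
        "queue_ms": queue_ms,
        "queue_pct": queue_pct,                    # share of await spent queued
        "util_pct": busy_ms_per_s / 1000 * 100,    # busy ms out of 1000 ms
    }


dm0 = derived_metrics(611.40, 414.23, 17.33, 0.96)
sdb = derived_metrics(46.30, 19.53, 8.17, 3.30)
for name, metrics in (("dm-0", dm0), ("sdb", sdb)):
    print(name, {key: round(value, 2) for key, value in metrics.items()})
```

The computed utilization differs slightly from the %util column because the inputs printed by iostat are already rounded to two decimal places.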
Notes
On every Linux box the following should be graphed at 5 minute averages
- %util: When this figure is consistently above 80% you will need to take one or more of the following actions -
- increasing RAM so dependence on disk reduces
- increasing RAID controller cache so disk dependence decreases
- increasing number of disks so disk throughput increases (more spindles working in parallel)
- horizontal partitioning
- horizontal partitioning
- (await-svctm)/await*100: The percentage of time that IO operations spent waiting in queue in comparison to actually being serviced. If this figure goes above 50% then each IO request is spending more time waiting in queue than being processed. If this ratio skews heavily upwards (in the >75% range) you know that your disk subsystem is not able to keep up with the IO requests and most IO requests are spending a lot of time waiting in queue. In this scenario you will again need to take any of the actions above
- %iowait: This number shows the % of time the CPU is wasting in waiting for IO. A part of this number can result from network IO, which can be avoided by using an Async IO library. The rest of it is simply an indication of how IO-bound your application is. You can reduce this number by ensuring that disk IO operations take less time, more data is available in RAM, increasing disk throughput by increasing number of disks in a RAID array, using SSD (Check my post on Solid State drives vs Hard Drives) for portions of the data or all of the data etc
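The thresholds in these notes translate directly into a simple check. This is a hedged sketch only: the function name, constant names and message strings are invented for illustration, and it inspects a single sample, whereas the notes recommend watching 5 minute averages for values that stay high consistently.

```python
UTIL_WARN_PCT = 80.0     # %util threshold from the notes above
QUEUE_WARN_PCT = 50.0    # more time queued than being serviced
QUEUE_CRIT_PCT = 75.0    # disk subsystem is not keeping up


def assess_disk(util_pct, await_ms, svctm_ms):
    """Return a list of findings for one averaged iostat sample."""
    findings = []
    if util_pct > UTIL_WARN_PCT:
        findings.append("util above 80%")
    # Percentage of each request's total time that was spent in the queue.
    queue_pct = (await_ms - svctm_ms) / await_ms * 100 if await_ms > 0 else 0.0
    if queue_pct > QUEUE_CRIT_PCT:
        findings.append("queue wait above 75%")
    elif queue_pct > QUEUE_WARN_PCT:
        findings.append("queue wait above 50%")
    return findings


print(assess_disk(98.57, 17.33, 0.96))   # the dm-0 sample
print(assess_disk(21.74, 8.17, 3.30))    # the sdb sample
```

In practice the inputs would come from whatever graphing or monitoring system already collects the 5 minute averages.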
15 Feb, 2009
Notes on Amazon EC2
Posted by Bhavin Turakhia | (4) Comments
Sandeep Shetty from our Products team introduced me to Scalr – an opensource self-scaling hosting platform based on the Amazon EC2 cloud. I decided to take a quick look under the hood and figure out how Amazon EC2 would function as a hosting platform. Here are my quick notes -
Intro
- EC2 offers the ability to instantly provision Virtual machines from images (called AMIs) through an API
- Each instance is like a VPS with a certain amount of RAM, CPU, Disk capacity
- CPU capacity of an instance is measured in the form of EC2 Compute Units. From their FAQ – each EC2 Compute Unit provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor
Pricing
- There are various types of instances
- For example, an instance with 4 EC2 Compute Units, 7.5 GB RAM and 850 GB storage would cost 40 cents per hour => ~$300 per month
- Data Transfer costs 10 cents per GB for inbound transfers and between 10-17 cents per GB for outbound, depending on volume. Assuming only outbound data transfer (typical case for a web app) and the lowest rate on EC2 (10 cents per GB) the cost per Mbps per month for EC2 works out to be approximately $32.
- For any persistent storage over and above that provided in the instance one can use Amazon Elastic Block Storage or Amazon S3.
- Amazon EBS costs are 10 cents per GB per month + 10 cents per million I/O requests
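The monthly figures above follow from simple arithmetic, sketched below. The 30-day month and decimal units (1 Mbps = 1,000,000 bits per second, 1 GB = 1,000,000,000 bytes) are assumptions made for the calculation, and the function names are illustrative.

```python
HOURS_PER_MONTH = 24 * 30                    # assumption: 30-day month
SECONDS_PER_MONTH = HOURS_PER_MONTH * 3600


def instance_cost_per_month(dollars_per_hour):
    """Cost of keeping one instance running for the whole month."""
    return dollars_per_hour * HOURS_PER_MONTH


def transfer_cost_per_mbps_month(dollars_per_gb):
    """Cost of pushing a sustained 1 Mbps for a month at a per-GB rate."""
    bytes_per_second = 1_000_000 / 8         # 1 Mbps expressed in bytes
    gb_per_month = bytes_per_second * SECONDS_PER_MONTH / 1_000_000_000
    return gb_per_month * dollars_per_gb


print(round(instance_cost_per_month(0.40), 2))
print(round(transfer_cost_per_mbps_month(0.10), 2))
```

A sustained 1 Mbps works out to 324 GB per month under these assumptions, which gives roughly $32 at 10 cents per GB, while the 40 cent instance comes to about $288 for a 30-day month.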
Notes
- Many people tend to wrongly assume that EC2 (which stands for Elastic Compute Cloud) allows you to provision resources in an elastic manner and scale your application ad infinitum without any changes to the application. While in theory you can provision instances dynamically upon need, each EC2 instance acts like an independent machine with an independent OS, memory, CPU etc. It is identical therefore to provisioning multiple hardware boxes and any partitioning / load balancing etc would need to be done by the application developer at the App layer
- The elasticity does have considerable advantages inasmuch as provisioning is fully automated and each instance can be added / removed at a moment’s notice (about 10 minutes to boot up a new instance according to their FAQ), thus taking care of peaks dynamically
- Additionally no hardware setup is required to add / remove an instance
- Instances are provisioned through images which take care of complete setup thus relieving any system administration effort in setting up a machine
- Amazon EC2 provides the ability to place instances in multiple locations. Amazon offers multiple regions (USA / Europe) and various Availability Zones within these regions. Availability Zones are distinct locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same Region. Using instances in separate Availability Zones, one can protect applications from failure of a single location.
- Setting up an EC2 instance is quite easy. Once you create your AWS account, you can use the online AWS Console or simply download the offline command line tools to start provisioning your instances. Check the AWS Console Video for more details on the AWS Console.
Applications
- I can think of using EC2 instances for DNS infrastructure. Easy to deploy, and can scale dynamically to manage high loads. Additionally DNS servers do not require lots of storage and are innately redundant by virtue of the DNS protocol. Lastly, since EC2 provides multiple regions and zones, the DNS infrastructure can be scaled out resulting in geographical redundancy
- EC2 is great for prototyping as well as benchmarking
- One can also use EC2 for deploying small web apps, reducing time to market and allowing quick setup
- Many applications have monthly report generation requirements, which too can be run off an EC2 instance. EC2 offers SQL Server instances in case you want a commercial database to run reports / crunch data.
- We have also been thinking of using EC2 for CodeChef. Since at CodeChef we plan on running a programming contest each month, visitors as well as the computing resources required to manage submissions increase considerably around the time each contest is announced. This makes Amazon EC2 a perfect candidate for dynamic resource deployment during contest-week
Other Amazon Web Services
- Amazon S3 provides a scalable, highly available and redundant network storage service that can be used to store and retrieve any amount of data
- Amazon CloudFront is their CDN layered on top of S3. It delivers content using a global network of edge locations. Requests for objects are automatically routed to the nearest edge location