21 May, 2011
Unix domain sockets vs TCP Sockets
Posted by Bhavin Turakhia | (0) Comments
Here are two interesting links I found that compare the features and performance of Unix domain sockets and TCP loopback sockets:
http://lists.freebsd.org/pipermail/freebsd-performance/2005-February/001143.html
Excerpt: IP sockets over localhost are basically looped back network on-the-wire IP. There is intentionally “no special knowledge” of the fact that the connection is to the same system, so no effort is made to bypass the normal IP stack mechanisms for performance reasons. For example, transmission over TCP will always involve two context switches to get to the remote socket, as you have to switch through the netisr, which occurs following the “loopback” of the packet through the synthetic loopback interface. Likewise, you get all the overhead of ACKs, TCP flow control, encapsulation/decapsulation, etc. Routing will be performed in order to decide if the packets go to the localhost. Large sends will have to be broken down into MTU-size datagrams, which also adds overhead for large writes. It’s really TCP, it just goes over a loopback interface by virtue of a special address, or discovering that the address requested is served locally rather than over an ethernet (etc).
UNIX domain sockets have explicit knowledge that they’re executing on the same system. They avoid the extra context switch through the netisr, and a sending thread will write the stream or datagrams directly into the receiving socket buffer. No checksums are calculated, no headers are inserted, no routing is performed, etc. Because they have access to the remote socket buffer, they can also directly provide feedback to the sender when it is filling, or more importantly, emptying, rather than having the added overhead of explicit acknowledgement and window changes. The one piece of functionality that UNIX domain sockets don’t provide that TCP does is out-of-band data. In practice, this is an issue for almost noone.
http://osnet.cs.binghamton.edu/publications/TR-20070820.pdf
Excerpt: It was hypothesized that pipes would have the highest throughput due to its limited functionality, since it is half-duplex, but this was not true. For almost all of the data sizes transferred, Unix domain sockets performed better than both TCP sockets and pipes, as can be seen in Figure 1 below. Figure 1 shows the transfer rates for the IPC mechanisms, but it should be noted that they do not represent the speeds obtained by all of the test machines. The transfer rates are consistent across the machines with similar hardware configurations though. On some machines, Unix domain sockets reached transfer rates as high as 1500 MB/s.
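If you want to try the comparison on your own machine, here is a minimal Python sketch. It is not the benchmark used in either link. It pushes a fixed number of bytes through a connected Unix domain socket pair and through a TCP connection over 127.0.0.1 and times both. The helper names (`tcp_loopback_pair`, `unix_domain_pair`, `timed_transfer`, `compare`), the 64 KB chunk size and the 32 MB payload are my own choices for illustration, and the numbers you get will depend heavily on the OS and hardware.

```python
import socket
import threading
import time


def tcp_loopback_pair():
    """Return two connected TCP sockets that talk over 127.0.0.1."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    server, _ = listener.accept()
    listener.close()
    return client, server


def unix_domain_pair():
    """Return two connected Unix domain stream sockets."""
    return socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)


def timed_transfer(sender, receiver, total_bytes, chunk_size=64 * 1024):
    """Push total_bytes from sender to receiver; return (bytes_received, seconds)."""
    chunk = b"x" * chunk_size

    def produce():
        remaining = total_bytes
        while remaining > 0:
            n = min(remaining, chunk_size)
            sender.sendall(chunk[:n])
            remaining -= n
        sender.shutdown(socket.SHUT_WR)  # tell the receiver the stream has ended

    start = time.perf_counter()
    worker = threading.Thread(target=produce)
    worker.start()
    received = 0
    while True:
        data = receiver.recv(chunk_size)
        if not data:  # empty read: sender has shut down its side
            break
        received += len(data)
    worker.join()
    elapsed = time.perf_counter() - start
    sender.close()
    receiver.close()
    return received, elapsed


def compare(total_bytes=32 * 1024 * 1024):
    """Run the same transfer over both socket types."""
    results = {}
    for name, make_pair in (("unix", unix_domain_pair), ("tcp", tcp_loopback_pair)):
        a, b = make_pair()
        results[name] = timed_transfer(a, b, total_bytes)
    return results


if __name__ == "__main__":
    for name, (received, elapsed) in compare().items():
        print(f"{name}: {received / elapsed / 1e6:.0f} MB/s")
```

A single run like this is noisy, so treat it as a quick sanity check and not as a measurement; repeat it several times and vary the payload size before drawing conclusions.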
14 Nov, 2009
RAM Speed
Posted by Bhavin Turakhia | (6) Comments
To test the speed of RAM, I got Ramki to write a small program that writes a set of bytes into memory a billion times. We ran four instances of it simultaneously on a dual-processor, quad-core machine. Below are the results.
Result
output.1: User time (seconds): 545.99
output.1: System time (seconds): 1.33
output.1: Elapsed (wall clock) time (h:mm:ss or m:ss): 9:07.38
output.1: Involuntary context switches: 820
output.2: User time (seconds): 250.90
output.2: System time (seconds): 1.18
output.2: Elapsed (wall clock) time (h:mm:ss or m:ss): 4:12.12
output.2: Involuntary context switches: 378
output.3: User time (seconds): 250.30
output.3: System time (seconds): 1.15
output.3: Elapsed (wall clock) time (h:mm:ss or m:ss): 4:11.49
output.3: Involuntary context switches: 373
output.4: User time (seconds): 563.62
output.4: System time (seconds): 1.31
output.4: Elapsed (wall clock) time (h:mm:ss or m:ss): 9:25.00
output.4: Involuntary context switches: 845
Observations
- The write speed ranged from roughly 0.25 to 0.56 seconds per million writes
- Output.2 and .3 took less than half the time of .1 and .4
- I don't have a specific theory on why two of the processes did better than the other two
- No processor affinity was set, and the processes were being scheduled on random processors after every context switch.
- Seemingly the processes were accessing RAM simultaneously. In my limited knowledge that could mean a few things: a multi-channel (dual) FSB, and additionally, while one process was computing stuff, the other processes could access the FSB. The program used lrand48 to generate a random number and wrote data to random locations, so that the results would not depend too much on the L1/L2 cache
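The per-million-writes figures can be rechecked with a few lines of Python. This sketch assumes that each instance performed exactly one billion writes and that the reported user time is the right number to divide; the names `WRITES`, `user_seconds` and `seconds_per_million_writes` are just labels I picked.

```python
# User times (seconds) copied from the four output files above.
user_seconds = {
    "output.1": 545.99,
    "output.2": 250.90,
    "output.3": 250.30,
    "output.4": 563.62,
}

WRITES = 1_000_000_000  # assumption: one billion writes per instance


def seconds_per_million_writes(user_time, writes=WRITES):
    """Convert total user time into seconds per million writes."""
    return user_time / (writes / 1_000_000)


per_million = {
    name: seconds_per_million_writes(t) for name, t in user_seconds.items()
}

# How much faster the quickest instance was than the slowest one.
fast_to_slow_ratio = min(user_seconds.values()) / max(user_seconds.values())

if __name__ == "__main__":
    for name, value in per_million.items():
        print(f"{name}: {value:.2f} s per million writes")
    print(f"fastest / slowest user time: {fast_to_slow_ratio:.2f}")
```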
22 Jul, 2009
Column Oriented DBMS
Posted by Bhavin Turakhia | (1) Comments
Conventionally we take a DBMS for granted as a structured data store that stores data in the form of rows. In fact, most application developers quite naturally begin by visualizing their data as rows in an RDBMS.
While RDBMSes serve the purposes of OLTP applications well, OLAP / data analytics type applications have conventionally not been able to obtain the performance they need from them. This is where a column oriented DBMS can help.
In the simplest form the difference between a conventional RDBMS and a column oriented database is that the latter stores data in a column form rather than a row form when persisted to disk. Another way to look at this is that the storage in a column oriented DBMS transposes the rows and columns of the storage in a conventional RDBMS.
For example:
| ID | Name | Age |
| 1 | Bhavin | 29 |
| 2 | Roger | 30 |
This would be persisted in a conventional RDBMS as follows -
1,Bhavin,29|2,Roger,30
In a column oriented DBMS this would be persisted as -
1,2|Bhavin,Roger|29,30
The slowest part of a DB query is usually its disk seek time. While a conventional RDBMS favors queries that fetch all the data of a given row, the column oriented model favors queries that compute aggregates, for instance the count of all users with age > 20, or the sum of the ages of all users. Queries of this type run much faster on a column oriented DBMS because the values of a column are stored together, so less seeking is needed to read them.
OLAP and BI applications mostly consist of data aggregation and would therefore run faster on column oriented databases.
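To make the two layouts concrete, here is a toy Python sketch that serializes the example table both ways and sums the ages from each form. It only illustrates the idea; real column stores use binary formats, compression and indexes, and the function names here are made up for the example.

```python
# The example table from above, one tuple per row: (ID, Name, Age).
rows = [(1, "Bhavin", 29), (2, "Roger", 30)]


def row_layout(rows):
    """Serialize row by row, the way a conventional RDBMS lays data out."""
    return "|".join(",".join(str(v) for v in row) for row in rows)


def column_layout(rows):
    """Serialize column by column: zip(*rows) transposes rows into columns."""
    return "|".join(",".join(str(v) for v in col) for col in zip(*rows))


def sum_ages_from_rows(serialized, age_index=2):
    """Every record has to be visited and split to reach its age field."""
    return sum(int(rec.split(",")[age_index]) for rec in serialized.split("|"))


def sum_ages_from_columns(serialized, age_index=2):
    """Only the single segment holding the ages has to be read."""
    age_segment = serialized.split("|")[age_index]
    return sum(int(v) for v in age_segment.split(","))


if __name__ == "__main__":
    print(row_layout(rows))
    print(column_layout(rows))
    print(sum_ages_from_columns(column_layout(rows)))
```

In the row layout the aggregate touches every record, while in the column layout it touches one contiguous segment, which is the effect the seek-time argument above relies on.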
For a list of column-oriented DBMSes refer to http://en.wikipedia.org/wiki/Column-oriented_DBMS
8 Mar, 2009
iostat and disk utilization monitoring nirvana
Posted by Bhavin Turakhia | (6) Comments
In my never-ending quest for better performance monitoring, I have constantly been trying to find better ways to monitor disk utilization on a server. At Directi we use the usual medley of tools at our disposal, viz. iostat and sar (both from the sysstat package). I made serious progress last week, when Dushyanth from my team shared this post on IO Monitoring on Linux, by the folks over at Pythian, on our internal mailing list. Here are my notes on the subject.
Performance measurement and Capacity planning are a science. It is common practice at Directi to attempt to determine what the performance bottlenecks in any given application are. A usual generalization is to determine whether an application is cpu-bound / memory-bound / IO bound.
IO bound applications end up wasting CPU cycles, especially in the case of disk IO, since most programming languages do not have async disk IO support today. Therefore, to maximize performance and optimize resource utilization, one should try to reduce the iowait time of the CPU and tweak the deployment to make the application CPU-bound.
When your CPU seems to be spending a lot of time on iowait you need to make some changes. However an iowait can occur either because there is a lot of Disk/Network IO taking place, or because the disk subsystem is saturated and cannot provide greater throughput. iostat allows you to determine which one it is. A regular iostat output consists of the following fields -
# iostat -dkx 60
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
dm-0 0.00 0.00 611.40 414.23 20286.60 1656.93 42.79 17.50 17.33 0.96 98.57
Explanation of the above fields:
- Device: the block device whose performance counters are being reported
- r/s and w/s: number of read and write requests issued per second to the device (in this case 611 and 414)
- rsec/s and wsec/s: number of sectors read/written per second (not shown in the output above, which was produced with -k and reports kilobytes instead)
- rkB/s and wkB/s: number of kilobytes read/written per second
- avgrq-sz: average number of sectors per request (for both reads and writes), i.e. (rsec + wsec) / (r + w). In this case 42.79
- avgqu-sz: average queue length in the monitoring interval (in this case 17.50)
- await: average time that each IO request took to complete. This includes the time that the request was waiting in the queue and the time that the request took to be serviced by the device
- svctm: average time each IO request took to be serviced by the device during the monitoring interval
- Note: await includes svctm. In fact await (average time taken for each IO request to complete) = the average time that each request was in queue (let's call it queuetime) PLUS the average time each request took to process (svctm)
- %util: the percentage of time that the device spent servicing requests. This can be calculated from the above values. In the above example the total number of reads and writes issued per second is 611 + 414 => 1025. Each request takes 0.96 ms to process. Therefore 1025 requests would take 1025 x 0.96 => 984 ms to process. So out of the 1 second in which these requests were sent to the device, 984 ms were spent processing them. This means the device utilization is 984/1000 * 100 => ~98.4%. As you can see, the %util in the above iostat output does show ~98.5%
Interpreting iostat values
Let's take the above example
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
dm-0 0.00 0.00 611.40 414.23 20286.60 1656.93 42.79 17.50 17.33 0.96 98.57
- avg time that each request spent in queue (qtime) = await – svctm = 17.33 – 0.96 => 16.37 ms
- avg time that each request spent being serviced = 0.96 ms
- so on average each IO request took 17.33 ms to get processed, of which 16.37 ms were spent just waiting in queue
- %util can be calculated as (r/s + w/s) * svctm / 1000ms * 100 => 1025*0.96/1000 * 100 => ~98.4%
- This simply means that in a 1 second interval, 1025 requests were sent to disk, each of which took 0.96 ms for the disk to process, resulting in 984 ms of disk utilization time in a period of 1 s (or 1000 ms). This means the disk is more than 98% utilized
On this disk subsystem, it is clear that the disk cannot process more IO requests than it is already getting
Let's take another example -
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sdb 6.33 139.07 46.30 19.53 526.27 634.40 35.26 0.54 8.17 3.30 21.74
- avg time that each request spent in queue (qtime) = await – svctm = 8.17 – 3.30 => 4.87 ms
- avg time that each request spent being serviced = 3.30 ms
- so on average each IO request took 8.17 ms to get processed, of which 4.87 ms (a little more than half) were spent waiting in queue
- %util can be calculated as (r/s + w/s) * svctm / 1000ms * 100 => 65.83 * 3.3/1000 * 100 => 21.72%
- This simply means that in a 1 second interval, about 66 requests were sent to disk, each of which took 3.30 ms for the disk to process, resulting in 217 ms of disk utilization time in a period of 1 s (or 1000 ms). This means the disk is around 21.7% utilized
On this disk subsystem, it is clear that the disk is not fully utilized. Due to the nature of the requests, they spend a little more than half their time in queue on average, but that is not so bad. This disk subsystem is capable of greater throughput.
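The arithmetic for both examples can be reproduced with a short Python sketch. The function names and the `samples` dictionary are mine; the numbers are copied from the two iostat lines above. The results differ slightly from the hand calculation because the sketch uses the unrounded r/s and w/s values.

```python
def queue_time_ms(await_ms, svctm_ms):
    """Average time a request waited in queue: await minus service time."""
    return await_ms - svctm_ms


def utilization_pct(reads_per_s, writes_per_s, svctm_ms):
    """Share of each second the device spent servicing requests."""
    busy_ms_per_second = (reads_per_s + writes_per_s) * svctm_ms
    return busy_ms_per_second / 1000 * 100


# Values taken from the dm-0 and sdb lines shown above.
samples = {
    "dm-0": {"r": 611.40, "w": 414.23, "await": 17.33, "svctm": 0.96},
    "sdb": {"r": 46.30, "w": 19.53, "await": 8.17, "svctm": 3.30},
}

if __name__ == "__main__":
    for device, s in samples.items():
        qtime = queue_time_ms(s["await"], s["svctm"])
        util = utilization_pct(s["r"], s["w"], s["svctm"])
        print(f"{device}: qtime={qtime:.2f} ms, util={util:.1f}%")
```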
Notes
On every Linux box the following should be graphed at 5 minute averages
- %util: When this figure is consistently above 80% you will need to take one or more of the following actions -
  - increasing RAM so dependence on disk reduces
  - increasing RAID controller cache so disk dependence decreases
  - increasing the number of disks so disk throughput increases (more spindles working in parallel)
  - horizontal partitioning
- (await-svctm)/await*100: The percentage of time that IO operations spent waiting in queue in comparison to actually being serviced. If this figure goes above 50% then each IO request is spending more time waiting in queue than being processed. If this ratio skews heavily upwards (in the >75% range) you know that your disk subsystem is not able to keep up with the IO requests and most IO requests are spending a lot of time waiting in queue. In this scenario you will again need to take any of the actions above
- %iowait: This number shows the % of time the CPU is wasting in waiting for IO. A part of this number can result from network IO, which can be avoided by using an Async IO library. The rest of it is simply an indication of how IO-bound your application is. You can reduce this number by ensuring that disk IO operations take less time, by keeping more data in RAM, by increasing disk throughput with more disks in the RAID array, by using SSDs (check my post on Solid State drives vs Hard Drives) for some or all of the data, etc.
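The thresholds in these notes can be turned into a simple check. The Python sketch below is only an illustration: `queue_wait_pct` and `disk_warnings` are hypothetical helper names, the limits (80% for %util, 50% and 75% for the queue-wait ratio) are the ones suggested above, and in practice you would feed it 5 minute averages and not a single sample.

```python
def queue_wait_pct(await_ms, svctm_ms):
    """(await - svctm) / await * 100: share of request time spent in queue."""
    if await_ms <= 0:
        return 0.0  # no completed requests in the interval
    return (await_ms - svctm_ms) / await_ms * 100


def disk_warnings(util_pct, await_ms, svctm_ms):
    """Return a list of warnings based on the thresholds from the notes."""
    warnings = []
    if util_pct > 80:
        warnings.append("util above 80%")
    waiting = queue_wait_pct(await_ms, svctm_ms)
    if waiting > 75:
        warnings.append("queue wait above 75%")
    elif waiting > 50:
        warnings.append("queue wait above 50%")
    return warnings


if __name__ == "__main__":
    # %util, await and svctm from the dm-0 and sdb examples earlier.
    print("dm-0:", disk_warnings(98.57, 17.33, 0.96))
    print("sdb:", disk_warnings(21.74, 8.17, 3.30))
```

For the two earlier examples this flags dm-0 on both counts, while sdb only crosses the 50% queue-wait mark, which matches the interpretation given for each device.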









