18 Jun, 2009

Nginx XSLT Module for REST Servers

Posted by Bhavin Turakhia | (0) Comments

I just came across the Nginx XSLT module and had an epiphany. The module essentially accepts an HTTP request, passes it through to the backend server, receives XML from the backend server, and converts the XML to HTML by applying XSLT transforms as per XSLT stylesheets available.

So now one can essentially focus solely on a REST-XML-HTTP API when building out an application, and expose the same as an API as well as a web app by simply creating XSLT files that transform the XML into HTML. Kickass!!!


12 Jun, 2009

Directi Launches Designchef.com

Posted by Bhavin Turakhia | (0) Comments

After the tremendous success of CodeChef - a one-of-its-kind, non-commercial, multi-platform online coding competition - we have launched another community contest initiative - DesignChef.com.

DesignChef will feature ongoing community contests targeted at design professionals, usability experts, UX specialists, interaction designers, front-end developers and the like.

Saunter by and check out our July contest at DesignChef - and win cash prizes. Like CodeChef, DesignChef is a non-profit initiative, currently managed and sponsored by Directi. We hope to grow it into a community of thousands of usability and design professionals worldwide.


3 May, 2009

Skills that make a good developer

Posted by Bhavin Turakhia | (3) Comments

Joel Spolsky captures the essence of a good (read: recruitment material) developer in his succinct mantra - “Smart and Gets Things Done”. My own personal part-plagiarised part-modified version has always been - “Smart, Takes Initiative, Gets things done, Paranoid about Perfection and is a Nice Person”. I believe Joel’s shorter version does not capture all these aspects - for instance, being Smart does not imply being Nice.

Both versions (mine and Joel’s) have a quotable charm in their brevity, but fail to provide a more detailed perspective. As a parallel effort, I wanted to put together a significantly more extensive document listing, in micro-detail, the skills that I find good developers possess.

The current work-in-progress version has been put up within our Wiki at - What skills doth a good developer possess? Granted, not all developers at Directi possess all the skills listed. However, the document serves as a “skills-to-acquire” list for our existing team, as well as a reference list for aspiring applicants. As someone who wants to join our organization, you should have several of these mastered, and be prepared to tackle the rest.

Excerpt from the document - What skills doth a good developer possess?

  • Algorithmic skills
    • Understand and dissect complex problems quickly
    • Understand trade-offs between space / time complexity
    • Come up with solutions with minimal space / time complexity
    • … <snip>
  • Data Structures
    • Basic Knowledge of data structures - Hashmaps, Binary tree, B-Tree, B+Tree, Linked Lists etc
    • Understanding of trade-offs between various data structures etc
    • … <snip>
  • RDBMS
  • Caching
  • Networking
  • … <snip>

For further details visit the complete document - What skills doth a good developer possess?

To apply for a tech position at Directi visit our Careers Portal


15 Mar, 2009

Selecting a Solid State Drive technology

Posted by Bhavin Turakhia | (1) Comments

This is my 3rd post on Solid State drive technology (read Solid State Drives vs Hard disk drives). Of late I have been making a ton of posts on storage, given that in a high-performance, data-intensive environment, applications eventually bottleneck at the slowest component in the chain - the disk (no surprise there considering it is the only device with moving parts).

With the numerous SSD options floating around in the market, it is hard to figure out which way to go without adequate research. In fact, a wrong choice can result in slowing down your application (check Solid State Hard drives have poor random write performance).

Here are my notes on selecting your SSD platform for maximum performance -

Check the random write IOPS
Different SSDs perform differently. SSDs fare worst on random writes (4K or smaller blocks). Therefore, when evaluating an SSD, always check the maximum random write IOPS it delivers.

Single-level cell (SLC) vs Multi-level cell (MLC)
SSDs use either single-level cell (SLC) or multi-level cell (MLC) flash. SLC memory has the advantage of faster transfer speeds, lower power consumption and higher cell endurance. However, since it stores less data per cell, SLC costs more per megabyte of storage than MLC.

Managed Flash Technology
I first wrote about Managed Flash Technology in my earlier article on SSDs - Solid State Drives vs Hard disk drives. MFT was developed by EasyCo. MFT essentially converts multiple random writes into a single linear write by collating them and writing them out together. MFT can pretty much be used in conjunction with any SSD. EasyCo’s website has interesting benchmark comparisons of SSDs with and without MFT. For instance, on the Mtron PRO 7000, MFT increased the random IOPS for 4K blocks from 123 to a whopping 16,180.
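
The idea of collating random writes into one linear write can be illustrated with a toy sketch. This is not EasyCo's implementation; the class name `CoalescingLog`, the `batch_size` parameter and the in-memory "device" are all assumptions made purely for illustration.

```python
class CoalescingLog:
    """Toy write-coalescing layer (illustrative only, not MFT itself)."""

    def __init__(self, batch_size=4):
        self.batch_size = batch_size   # hypothetical number of writes per flush
        self.pending = {}              # logical block -> data not yet flushed
        self.log = []                  # append-only stand-in for the device
        self.index = {}                # logical block -> position in the log
        self.linear_writes = 0         # how many sequential flushes were issued

    def write(self, block, data):
        # Random writes are only buffered here; nothing hits the "device" yet.
        self.pending[block] = data
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        # All buffered blocks are appended back to back in one linear write.
        if not self.pending:
            return
        for block, data in self.pending.items():
            self.index[block] = len(self.log)
            self.log.append((block, data))
        self.pending.clear()
        self.linear_writes += 1

    def read(self, block):
        # The newest copy wins: check the buffer first, then the index.
        if block in self.pending:
            return self.pending[block]
        return self.log[self.index[block]][1]


log = CoalescingLog(batch_size=4)
for block in [907, 12, 4410, 58]:      # four scattered logical blocks
    log.write(block, f"data-{block}")
print(log.linear_writes)               # one flush covered all four writes
```

The trade-off in a design like this is the extra index needed to find where each logical block currently lives, plus the space occupied by stale copies in the log.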

Fusion-io
Fusion-io claims to have produced the fastest SSD in the world. With Steve Wozniak on their team, I would be inclined to believe their claims. The spec sheet of their latest drive - the ioDrive Duo - boasts a random write performance of 180,530 IOPS. That is insane. Additionally, multiple ioDrives can be configured into a RAID array for additional performance and reliability.


8 Mar, 2009

iostat and disk utilization monitoring nirvana

Posted by Bhavin Turakhia | (3) Comments

In my never-ending quest of performance monitoring, I have been constantly trying to find better ways to monitor disk utilization on a server. At Directi we use the usual medley of tools at our disposal viz. iostat, sar, sysstat. I made serious progress last week, when Dushyanth from my team shared this post on IO Monitoring on Linux, by the folks over at Pythian, on our internal mailing list. Here are my notes on the subject.

Performance measurement and Capacity planning are a science. It is common practice at Directi to attempt to determine what the performance bottlenecks in any given application are. A usual generalization is to determine whether an application is cpu-bound / memory-bound / IO bound.

IO bound applications end up wasting cpu cycles, especially in case of Disk IO, since most programming languages do not have Async Disk IO support today. Therefore, in order to maximize performance and optimize resource utilization, one should try to reduce the iowait time of a CPU and tweak a deployment to make an application cpu-bound.

When your CPU seems to be spending a lot of time on iowait you need to make some changes. However an iowait can occur either because there is a lot of Disk/Network IO taking place, or because the disk subsystem is saturated and cannot provide greater throughput. iostat allows you to determine which one it is. A regular iostat output consists of the following fields -

# iostat -dkx 60

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-0              0.00     0.00  611.40  414.23 20286.60  1656.93 42.79    17.50   17.33   0.96  98.57

Explanation of the above fields:

  • Device: the block device whose performance counters are being reported
  • r/s and w/s: number of read and write requests issued per second to the device (in this case 611 and 414)
  • rsec/s and wsec/s – number of sectors read/written per second (not shown above, since the -k flag reports kilobytes instead)
  • rkB/s and wkB/s – number of kilobytes read/written per second
  • avgrq-sz – average number of sectors per request (for both reads and writes). ie (rsec + wsec) / (r + w)
  • avgqu-sz – average queue length in the monitoring interval (in this case 17.50)
  • await – average time that each IO Request took to complete. This includes the time that the request was waiting in the queue and the time that the request took to be serviced by the device
  • svctm – average time each IO request took to be serviced by the device during the monitoring interval
  • Note: await includes svctm. In fact await (average time taken for each IO Request to complete) = the average time that each request was in queue (let's call it queuetime) PLUS the average time each request took to process (svctm)
  • %util: This number depicts the percentage of time that the device spent in servicing requests. This can be calculated with the above values. In the above example the total number of reads and writes issued per second is 611 + 414 => 1025. Each request takes 0.96 ms to process. Therefore 1025 requests would take 1025 x 0.96 => 984 ms to process. So out of the 1 second that these requests were sent to the device in, 984 ms were taken to process the requests. This means the device utilization is 984/1000 * 100 => ~98.4%. As you can see in the above iostat output the %util does show ~ 98.5%
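
The %util arithmetic above can be checked with a few lines of Python. This is a minimal sketch that assumes the exact column layout of the sample output (other sysstat versions print different columns); the function names are made up for illustration, and the sector size is assumed to be 512 bytes.

```python
# Column names as printed by `iostat -dkx` in the sample output above.
FIELDS = ["rrqm/s", "wrqm/s", "r/s", "w/s", "rkB/s", "wkB/s",
          "avgrq-sz", "avgqu-sz", "await", "svctm", "%util"]


def parse_iostat_line(line):
    """Turn one device row of the sample output into a dict of floats."""
    parts = line.split()
    values = [float(v) for v in parts[1:]]
    if len(values) != len(FIELDS):
        raise ValueError("unexpected number of columns")
    stats = dict(zip(FIELDS, values))
    stats["device"] = parts[0]
    return stats


def estimated_util_percent(stats):
    """(r/s + w/s) * svctm gives busy milliseconds per 1000 ms of wall time."""
    busy_ms_per_second = (stats["r/s"] + stats["w/s"]) * stats["svctm"]
    return busy_ms_per_second / 1000.0 * 100.0


def avg_request_sectors(stats):
    """Average request size in sectors, assuming 512-byte sectors (2 per kB)."""
    requests = stats["r/s"] + stats["w/s"]
    return (stats["rkB/s"] + stats["wkB/s"]) / requests * 2


sample = ("dm-0              0.00     0.00  611.40  414.23 20286.60  1656.93 "
          "42.79    17.50   17.33   0.96  98.57")
stats = parse_iostat_line(sample)
print(round(estimated_util_percent(stats), 1), "vs reported", stats["%util"])
print(round(avg_request_sectors(stats), 2), "vs reported", stats["avgrq-sz"])
```

Both derived values land close to what iostat reports; small differences come from rounding in the printed columns.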

Interpreting iostat values

Let's take the above example

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-0              0.00     0.00  611.40  414.23 20286.60  1656.93 42.79    17.50   17.33   0.96  98.57

  • avg time that each request spent in queue (qtime) = await - svctm = 17.33 - 0.96 => 16.37 ms
  • avg time that each request spent being serviced = 0.96 ms
  • so on average each IO request took 17.33 ms to get processed, of which 16.37 ms were spent just waiting in queue
  • %util can be calculated as (r/s + w/s) * svctm / 1000ms * 100  => 1025.63*0.96/1000 * 100 => ~98.5%
  • This simply means that in a 1 second interval, about 1025 requests were sent to disk, each of which took 0.96 ms for the disk to process, resulting in roughly 984 ms of disk utilization time in a period of 1 s (or 1000 ms). This means the disk is more than 98% utilized

On this disk subsystem, it is clear that the disk cannot process more IO requests than it is already getting.

Let's take another example -

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               6.33   139.07   46.30   19.53   526.27   634.40    35.26 0.54    8.17   3.30  21.74

  • avg time that each request spent in queue (qtime) = await - svctm = 8.17 - 3.30 => 4.87 ms
  • avg time that each request spent being serviced = 3.30 ms
  • so on average each IO request took 8.17 ms to get processed, of which 4.87 ms (a little more than half) were spent waiting in queue
  • %util can be calculated as (r/s + w/s) * svctm / 1000ms * 100  => 65.83 * 3.3/1000 * 100 => 21.72%
  • This simply means that in a 1 second interval, about 65 requests were sent to disk, each of which took 3.30 ms for the disk to process, resulting in 217 ms of disk utilization time in a period of 1 s (or 1000 ms). This means the disk is around 21.7% utilized

On this disk subsystem, it is clear that the disk is not fully utilized. Although, due to the nature of the requests, they spend on average a little more than half their time in queue, that is not so bad. This disk subsystem is capable of greater throughput.

Notes

On every Linux box the following should be graphed at 5 minute averages

  • %util: When this figure is consistently above 80% you will need to take one or more of the following actions -
    • increasing RAM so dependence on disk reduces
    • increasing RAID controller cache so disk dependence decreases
    • increasing number of disks so disk throughput increases (more spindles working in parallel)
    • horizontal partitioning
  • (await-svctm)/await*100: The percentage of time that IO operations spent waiting in queue in comparison to actually being serviced. If this figure goes above 50% then each IO request is spending more time waiting in queue than being processed. If this ratio skews heavily upwards (in the >75% range) you know that your disk subsystem is unable to keep up with the IO requests and most IO requests are spending a lot of time waiting in queue. In this scenario you will again need to take any of the actions above
  • %iowait: This number shows the % of time the CPU is wasting in waiting for IO. A part of this number can result from network IO, which can be avoided by using an Async IO library. The rest of it is simply an indication of how IO-bound your application is. You can reduce this number by ensuring that disk IO operations take less time, keeping more data available in RAM, increasing disk throughput by increasing the number of disks in a RAID array, using SSDs (check my post on Solid State drives vs Hard Drives) for portions of the data or all of the data, etc
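
The thresholds in these notes can be turned into a small check. The sketch below is an assumption-laden illustration: the function names and the wording of the findings are invented, and it evaluates a single sample, whereas the notes call for 5 minute averages that are consistently above the thresholds.

```python
def queue_wait_percent(await_ms, svctm_ms):
    """(await - svctm) / await * 100: share of request time spent queued."""
    if await_ms <= 0:
        return 0.0
    return (await_ms - svctm_ms) / await_ms * 100.0


def assess_disk(util_percent, await_ms, svctm_ms):
    """Apply the 80% / 50% / 75% rules of thumb from the notes to one sample."""
    findings = []
    if util_percent > 80:
        findings.append("utilization above 80%")
    queued = queue_wait_percent(await_ms, svctm_ms)
    if queued > 75:
        findings.append("disk subsystem cannot keep up (queue share > 75%)")
    elif queued > 50:
        findings.append("more time queued than serviced (queue share > 50%)")
    return findings


# The two devices from the iostat examples above.
print(assess_disk(util_percent=98.57, await_ms=17.33, svctm_ms=0.96))
print(assess_disk(util_percent=21.74, await_ms=8.17, svctm_ms=3.30))
```

The first device trips both rules, while the second only shows a queue share a little above 50%, which matches the interpretation given earlier.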

15 Feb, 2009

Pros and Cons of Diskless Servers booting off a SAN

Posted by Bhavin Turakhia | (2) Comments

In our constant efforts towards Architecture nirvana we are often faced with the question of whether a cluster of application servers should have their own hard disks or PXE boot off a SAN. This short article explores the options.

Notes

  • If each machine in a cluster has its own OS hard drives, and one cannot afford a machine going down, then each machine would need a RAID 1 config, which requires a RAID card and 2 hard drives per machine, resulting in a considerable cost
  • In the scenario where multiple machines boot off a partition on a SAN device, the machines do not need any hard drives. However, if for any reason the connectivity to the SAN goes down or the SAN device itself crashes (rare) then all the machines in the cluster would be down (marginal redundancy concern)

Conclusion

  • In the scenario where the data partition of the cluster of machines resides on a SAN device, it makes sense to boot those machines off the SAN device too, since the SAN going down would render the entire cluster useless anyway, and this way one can save the cost of 2x’n’ hard drives and ‘n’ RAID cards (assuming we have ‘n’ machines in the cluster)
  • In the scenario where a cluster of machines does not have any data on a SAN device, one may want to invest in hard drives for the machines themselves, since a downtime of the SAN device will then not render the cluster inoperable. Additionally, if the cluster consists of 10-15 machines, the cost of 2 SATA drives and 1 RAID card per machine may not be much higher than the cost of a SAN device if one needs to be set up exclusively for these machines.
  • This may change however if one has spare and redundant SAN devices lying around, with spare capacity in their network
  • Ideally if a cluster of machines is to PXE boot off a SAN, one should try to ensure that the boot partitions are spread across separate SAN devices, each of which is accessible through a different network path, so that the downtime of any given SAN device does not compromise the cluster

15 Feb, 2009

Notes on Amazon EC2

Posted by Bhavin Turakhia | (2) Comments

Sandeep Shetty from our Products team introduced me to Scalr - an open source self-scaling hosting platform based on the Amazon EC2 cloud. I decided to take a quick look under the hood and figure out how Amazon EC2 would function as a hosting platform. Here are my quick notes -

Intro

  • EC2 offers the ability to instantly provision Virtual machines from images (called AMIs) through an API
  • Each instance is like a VPS with a certain amount of RAM, CPU, Disk capacity
  • CPU capacity of an instance is measured in the form of EC2 Compute Units. From their FAQ - each EC2 Compute Unit provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor

Pricing

  • There are various types of instances
  • As an example, an instance with 4 EC2 Compute Units, 7.5 GB RAM and 850 GB storage would cost 40 cents per hour => ~$300 per month
  • Data Transfer costs 10 cents per GB for inbound transfers and between 10-17 cents per GB for outbound. Assuming only outbound data transfer (typical case for a web app) and the lowest rate on EC2 (10 cents per GB) the cost per Mbps per month for EC2 works out to be approximately $32.
  • For any persistent storage over and above that provided in the instance one can use Amazon Elastic Block Storage or Amazon S3.
  • Amazon EBS costs are 10 cents per GB per month + 10 cents per million I/O requests
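
The monthly figures above follow from simple arithmetic, sketched here. The assumptions are a 30-day month, decimal gigabytes (1 GB = 10^9 bytes) and a link that is saturated for the whole month; the function names are invented for illustration.

```python
HOURS_PER_MONTH = 24 * 30                      # assumes a 30-day month
SECONDS_PER_MONTH = HOURS_PER_MONTH * 3600


def monthly_instance_cost(dollars_per_hour):
    """Cost of keeping one instance running for the whole month."""
    return dollars_per_hour * HOURS_PER_MONTH


def gb_per_mbps_month():
    """Gigabytes moved by a fully used 1 Mbps link in one month."""
    bits = 1_000_000 * SECONDS_PER_MONTH       # 1 Mbps = 10^6 bits per second
    return bits / 8 / 1_000_000_000            # bits -> bytes -> decimal GB


def cost_per_mbps_month(dollars_per_gb):
    """Transfer cost of sustaining 1 Mbps for a month at a per-GB rate."""
    return gb_per_mbps_month() * dollars_per_gb


print(monthly_instance_cost(0.40))             # the 40 cents per hour instance
print(gb_per_mbps_month())                     # GB transferred at 1 Mbps
print(cost_per_mbps_month(0.10))               # at 10 cents per GB
```

At 40 cents per hour the instance comes to a little under $300 per month, and 1 Mbps sustained for a month is about 324 GB, which at 10 cents per GB gives roughly the $32 quoted above.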

Notes

  • Many people tend to wrongly assume that EC2 (which stands for Elastic Compute Cloud) allows you to provision resources in an elastic manner and scale your application ad infinitum without any changes to the application. While in theory you can provision instances dynamically upon need, each EC2 instance acts like an independent machine with an independent OS, memory, CPU etc. It is identical therefore to provisioning multiple hardware boxes and any partitioning / load balancing etc would need to be done by the application developer at the App layer
  • The elasticity does have considerable advantages in as much as provisioning is fully automated and each instance can be added / removed at a moment’s notice (about 10 minutes to boot up a new instance according to their FAQ) thus taking care of peaks dynamically
  • Additionally no hardware setup is required to add / remove an instance
  • Instances are provisioned through images which take care of complete setup thus relieving any system administration effort in setting up a machine
  • Amazon EC2 provides the ability to place instances in multiple locations. Amazon offers multiple regions (USA / Europe) and various Availability Zones within these regions. Availability Zones are distinct locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same Region. Using instances in separate Availability Zones, one can protect applications from failure of a single location.
  • Setting up an EC2 instance is quite easy. Once you create your AWS account, you can use the online AWS Console or simply download the offline command line tools to start provisioning your instances. Check the AWS Console Video for more details on the AWS Console.

Applications

  • I can think of using EC2 instances for DNS infrastructure. Easy to deploy, and can scale dynamically to manage high loads. Additionally DNS servers do not require lots of storage and are innately redundant by virtue of the DNS protocol. Lastly, since EC2 provides multiple regions and zones, the DNS infrastructure can be scaled out resulting in geographical redundancy
  • EC2 is great for prototyping as well as benchmarking
  • One can also use EC2 for deploying small web apps, reducing time to market and allowing quick setup
  • Many applications have monthly report generation requirements which too can be run off an EC2 instance. EC2 offers SQL Server instances in case you want a commercial database to run reports / crunch data.
  • We have also been thinking of using EC2 for CodeChef. Since at CodeChef we plan on running a programming contest each month, visitors as well as the computing resources required to manage submissions increase considerably around the time each contest is announced. This makes Amazon EC2 a perfect candidate for dynamic resource deployment during contest-week

Other Amazon Web Services

  • Amazon S3 provides a scalable, highly available and redundant NAS that can be used to store and retrieve any amount of data
  • Amazon CloudFront is their CDN layered on top of S3. It delivers content using a global network of edge locations. Requests for objects are automatically routed to the nearest edge location


4 Feb, 2009

Introduction of New TLDs will NOT increase costs for Trademark Holders

Posted by Bhavin Turakhia | (2) Comments

As an ICANN accredited Registrar, a consultant to Registrars and Registries, and an erstwhile chair of the Registrars Constituency, I am very closely involved with the ICANN bottom-up consensus processes. Amongst the most interesting endeavors of ICANN, and a fundamental element of ICANN’s goal, is the creation of new gTLDs. Some of the recent comments on the new gTLD applicant guidebook seem to suggest that creation of new gTLDs will result in a cost increase to existing trademark holders who will have to register their trademark in various TLDs as a defensive mechanism.

Paul Stahura published a great report demonstrating that trademark holders have historically not been blocking their names across multiple Top-Level Domains (TLDs). I have always been a fan of number crunching - “numbers never lie”.

Since Paul has already done a remarkable job of statistical analysis, I am going to wear my theorist hat and prove a reworded form of the Hypothesis using logical deduction and common sense.

Hypothesis – Introduction of new TLDs will not increase the sum total registration cost that trademark holders need to spend on domain names.

Methodology – Logical deduction.

Fact:
There are currently over 280 TLDs of which a little over 250 are ccTLDs in the IANA root zone.

Assumptions:
Individuals and companies spend money for economic gain. Therefore whether a registrant is an organization, a speculator, a cyber squatter, or a phisher, their purpose in registering a domain name is to derive economic gain that outweighs the cost of the domain name.

Description:
Let us start by analyzing why one would want to register a domain name in each additional TLD outside of the primary TLD that they use for their business. Let's take the example of a company - Extra Cautious Inc. - which uses the domain name extracautious.com. They now need to evaluate whether it makes sense for them to register the string extracautious in other TLDs. Here is the reasoning that the CFO of Extra Cautious Inc. would go through.

Traffic expectation:
It makes sense for the CFO to register extracautious.biz or extracautious.info in case an adequate number of people are expected to type in extracautious.biz in their browser directly. The number of type-ins needed to make it worthwhile to register this domain name is negligible given that .biz and .info domains cost substantially under $10 per year. If there is a clear traffic value to be derived, then as Paul has pointed out in his elaborate report, the registration of this additional domain name is not a cost but rather a revenue generation opportunity for Extra Cautious Inc, who otherwise would have missed out on the hits. Therefore in case of a Traffic Expectancy the hypothesis above holds true.

Source of traffic:
A typical website gets traffic in two ways. Either through direct type-ins, or via hyperlinks. Both the former and the latter are primarily a function of the domain name that an organization promotes. When Extra Cautious Inc. promotes extracautious.com as its website on its stationery, advertising etc., it expects people to type in that domain name to reach their website. It also expects search engines to index that domain name, and other directories and websites to link to that domain name. In other words traffic through type-ins and hyperlinks would directly end up on their website.

Next let’s explore the possibility of direct type-in traffic on other TLDs. Users on the Internet type-in a domain name directly if they expect to find the website or information they were looking for. The most common case of this is appending a .com to a company/product name. It is common behavior on the Internet to append a “.com” to the end of a company name to look for its website. In some cases people even append a .net or a .org. However, given Google magic, that is the limit of a user’s patience. One does not have to be Einstein to conclude that no users are trying out 280+ TLD combinations to get to a company’s website. It can therefore be assumed that if 50 new TLDs, each quite different sounding from the other, were to be launched, that users on the internet would not begin to iterate through those 50 TLDs to find a company.

ccTLDs type-ins:
In fact the only other type of domain that tends to get type-in traffic is ccTLD equivalents. This is based on two behavior patterns. Users looking for a company that they know is based in India could try to reach that company’s website by appending “.in” to the company name as a last resort after attempting a .com / .net / .org search. Similarly, users from India, who are used to seeing “.in” domains, may append “.in” to a company name (e.g. dell.in) to find its local website. By this logic, many companies should ideally have registered their domain names in several ccTLDs, especially those of highly populated countries like India and China. Yet the TLD Zones of these ccTLDs have little overlap with the global trademark registry as well as with the .com zone, barring generics and some Fortune 500 companies.

Many new TLDs have a specific purpose:
Add to this the fact that many of the proposed new TLDs have varying creative purposes. We have heard of business models such as .wiki, .blog etc. which have such specific purposes. Type-in traffic on those TLDs for a specific trademark such as Extra Cautious Inc, is highly unlikely, since users would not expect Extra Cautious’ website to be available at extracautious.wiki.

No traffic expectation:
Going back to our first point—in case no one is expected to type in extracautious.newTLD, it makes little sense for Extra Cautious Inc. themselves to register extracautious.newTLD. This for instance applies to specific TLDs like .aero. Since extracautious is in the business of making fireworks ;) … they do not expect any of their existing or potential customers to type in extracautious.aero. Similarly since Extra Cautious Inc. largely operates in the US, it may block extracautious.us but chooses not to block extracautious.in. The likelihood of individuals typing in extracautious.biz and extracautious.info ad-hoc is ZERO so they do not need to block those domains. If there is a traffic expectancy on any TLD option, it is a no brainer to block those domains since the potential revenue would outweigh the cost.

What about cybersquatters:
The next argument typically made by IP constituencies is that if a speculator / cybersquatter / phisher were to register extracautious.newTLD then they could create nuisance value and the company may be prompted to block their domain name (defensive registrations) to prevent this nuisance value.

It is important to understand that CyberSquatters / Speculators / Phishers register non-generic trademark domain names for specific economic reasons. Let’s explore these.

Type-in traffic on trademark names:
If a trademarked domain gets type-in traffic, a speculator may be prompted to register this domain to monetize the traffic. However in this case, as we have discussed before, a trademark holder themselves would wish to register it prior to a speculator since the revenue outweighs the cost. If a speculator can earn more than the cost of the domain name by simply monetizing traffic to that domain name, then it is assumed that the actual trademark holder can earn significantly higher revenue and therefore is not bearing any cost by registering his domain name in that TLD. Therefore Extra Cautious Inc. chooses to register extracautious.au since it has an office in Australia and expects type-in traffic from Australia. This is not an extra cost for them since through this additional domain they get traffic that they would have otherwise not received.

Defensive registrations to prevent misrepresentation or blackmail:
Some folks argue that even if a domain name has no traffic potential, speculators can choose to register the same to either fraudulently pretend to be the trademark holder (phishing etc.) or otherwise to try and sell the domain name to the trademark holder for a premium. Let’s analyze both these arguments.

Mr. Scrupulous registers extracautious.info and puts up a website on it to sell fireworks. He intends to spam thousands of users, pretending to be Extra Cautious Inc. and leverage on the advertising campaign of Extra Cautious Inc. to earn money. It can be argued that if Extra Cautious Inc. had registered their .info domain name this could have been prevented. However this argument is flawed, since Mr. Scrupulous could have registered extracautiousweb.com, extracautiousonline.com, extracautiousfireworks.com, extracautiouscrackers.com, extracautiousoffers.com, extracautiousshop.com and a gazillion other variants within the .com space itself. By this logic the CFO of Extra Cautious Inc. would need to register every combination of extracautious in the .com and .net and .org TLD spaces. Therefore new TLDs are no more expensive than existing TLDs when it comes to protecting one’s trademark from identity theft/phishing. In fact I would go so far as to submit that phishers and spammers would rather register <company>online.com or <company>web.com or some such variant in the .com TLD space in order to commit identity theft, than to register a .info / .biz domain name, since .com domain names are easier to relate to for users. While I have conducted no statistical analysis, gut feeling tells me that one will find more variants of Fortune 500 company brand names in the .com TLD than defensive registrations of those trademarks in all other TLDs.

Let’s take a look at the second argument, wherein Mr. Scrupulous registers extracautious.info with the sole purpose of reselling it to Extra Cautious Inc. for a profit. This has already been covered in our previous assertion. The CFO of Extra Cautious Inc. would only buy extracautious.info at a certain price if the expected profit from the purchase was higher, in which case the purchase does not result in a cost increase. Additionally, Extra Cautious always has the option of filing a dispute, instead of purchasing the domain from Mr. Scrupulous, and this knowledge is by itself sufficient to prevent widespread blackmail of this form. If extracautious.info is getting no traffic, then Extra Cautious Inc. has no reason to purchase extracautious.info either directly or from Mr. Scrupulous.

Conclusions:

  • Trademark holders have no reason to register a domain name in a newTLD if the domain name is not going to get any traffic
  • Speculators have no reason to register a domain name in a newTLD if the domain name is not going to get any traffic, since they will be unable to generate revenue from it or sell it to the trademark holder
  • Spammers and phishers have adequate options for registering similar sounding domain names in existing TLDs without having to bother with new TLDs
  • Thus, it can be concluded that the introduction of new TLDs will not increase the sum total registration cost that trademark holders need to spend on domain names

20 Jan, 2009

HTTP vs REST vs SOAP

Posted by Bhavin Turakhia | (7) Comments

I have been an active proponent of SOAP since its inception. SOAP revolutionized RPC and loose coupling to a great extent. However, of late I have been giving APIs and interfaces considerable thought and am leaning a lot more towards simple HTTP based APIs with an XML or JSON response format as opposed to SOAP. In this post I pen down some random thoughts on the merits and demerits of each.

Introduction
Let me first clarify the terminology -

  • SOAP refers to Simple Object Access Protocol
  • HTTP based APIs refer to APIs that are exposed as one or more HTTP URIs and typical responses are in XML / JSON. Response schemas are custom per object
  • REST on the other hand adds an element of using standardized URIs, and also gives importance to the HTTP verb used (ie GET / POST / PUT etc)

Typing
SOAP provides relatively stronger typing since it has a fixed set of supported data types. It therefore guarantees that a return value will be available directly in the corresponding native type in a particular platform. In case of HTTP based APIs the return value needs to be de-serialized from XML, and then type-cast. This may not represent much effort, especially for dynamic languages. In fact, even in case of complex objects, traversing an object is very similar to traversing an XML tree, so there is no definitive advantage in terms of ease of client-side coding.
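
To show how little the de-serialize-and-cast step involves in a dynamic language, here is a minimal Python sketch. The `<domain>` response and its field names are a made-up schema used only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML response from an HTTP based API.
SAMPLE_RESPONSE = """<domain>
  <name>example.com</name>
  <years>2</years>
  <price>8.99</price>
  <locked>true</locked>
</domain>"""


def parse_domain(xml_text):
    """De-serialize the XML and cast each field to a native Python type."""
    root = ET.fromstring(xml_text)
    return {
        "name": root.findtext("name"),             # stays a string
        "years": int(root.findtext("years")),      # explicit cast to int
        "price": float(root.findtext("price")),    # explicit cast to float
        "locked": root.findtext("locked") == "true",
    }


domain = parse_domain(SAMPLE_RESPONSE)
print(domain["years"] + 1, domain["locked"])
```

The casts are explicit here, whereas a SOAP client library would typically derive them from the service description.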

Client-side effort
Making calls to an HTTP API is significantly easier than making calls to a SOAP API. The latter requires a client library, a stub and a learning curve. The former is native to all programming languages and simply involves constructing an HTTP request with appropriate parameters appended to it. Even psychologically the former seems like much less effort.
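
As a rough sketch of what "constructing an HTTP request with appropriate parameters appended to it" looks like, here is how it could be done with nothing but the Python standard library. The host, path and parameter names are hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
BASE_URL = "http://api.example.com/domains/search"


def build_request_url(base_url, params):
    """Append URL-encoded query parameters to the base URI."""
    return base_url + "?" + urlencode(params)


url = build_request_url(BASE_URL, {"name": "example.com", "format": "json"})
print(url)

# Sending the call is then a single line with urllib.request.urlopen(url).
# It is not executed here because the endpoint above is made up.
```

There is no stub or generated client involved. The same URL can be pasted into a browser.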

Testing and Troubleshooting
It is also easy to test and troubleshoot an HTTP API, since one can construct a call with nothing more than a browser and check the response inside the browser window itself. No troubleshooting tools are required to generate a request / response cycle. In this lies the primary power of HTTP-based APIs.

Server-side effort
Most programming languages make it extremely easy to expose a method using SOAP. The serialization and deserialization is handled by the SOAP server library. Exposing an object’s methods as an HTTP API can be relatively more challenging, since it may require serializing the output to XML. Making the API RESTful involves additional work to map URI paths to specific handlers and to take the meaning of the HTTP request (its verb) into account. Of course, many frameworks exist to make this task easier. Nevertheless, as of today, it is still easier to expose a set of methods using SOAP than it is to expose them using regular HTTP.
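
The extra server-side work described above (serializing output to XML, mapping URI paths to handlers, and respecting the HTTP verb) can be sketched in a few lines of Python. This is a toy dispatcher under my own assumptions, not how any particular framework does it. The /domains URIs, the handler names and the in-memory DOMAINS data are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

DOMAINS = {"example.com": {"years": 2}}  # in-memory stand-in for real data


def domain_element(name):
    """Serialize one record to XML (the step a SOAP library would hide)."""
    elem = ET.Element("domain", {"name": name})
    for key, value in DOMAINS[name].items():
        ET.SubElement(elem, key).text = str(value)
    return elem


def list_domains():
    root = ET.Element("domains")
    for name in DOMAINS:
        root.append(domain_element(name))
    return 200, ET.tostring(root, encoding="unicode")


def get_domain(name):
    if name not in DOMAINS:
        return 404, "<error>no such domain</error>"
    return 200, ET.tostring(domain_element(name), encoding="unicode")


def delete_domain(name):
    if DOMAINS.pop(name, None) is None:
        return 404, "<error>no such domain</error>"
    return 200, "<deleted/>"


# Each route pairs an HTTP verb and a URI pattern with a handler.
ROUTES = [
    ("GET", re.compile(r"^/domains$"), list_domains),
    ("GET", re.compile(r"^/domains/(?P<name>[^/]+)$"), get_domain),
    ("DELETE", re.compile(r"^/domains/(?P<name>[^/]+)$"), delete_domain),
]


def dispatch(method, path):
    """Return (status, body) for a request, honouring both verb and path."""
    path_known = False
    for verb, pattern, handler in ROUTES:
        match = pattern.match(path)
        if match is None:
            continue
        path_known = True
        if verb == method:
            return handler(**match.groupdict())
    if path_known:
        return 405, "<error>method not allowed</error>"
    return 404, "<error>not found</error>"


print(dispatch("GET", "/domains/example.com"))
```

Note that the same URI maps to different handlers depending on the verb, which is the part a plain HTTP API can skip and a RESTful one cannot.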

Caching
Since HTTP-based / RESTful APIs can be consumed using simple GET requests, intermediate proxy servers / reverse-proxies can cache their responses very easily. On the other hand, SOAP requests use POST and require a complex XML request to be created, which makes response-caching difficult.
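
A toy model of why this works: a GET response can be cached with the URL alone as the key, whereas the outcome of a POST depends on a request body the proxy would have to understand. The sketch below is a deliberate simplification under my own assumptions. Real proxies also honour cache headers and expiry, which are ignored here, and the class and backend function are hypothetical.

```python
class CachingProxy:
    """Toy stand-in for a caching proxy; not how any real proxy is implemented."""

    def __init__(self, backend):
        self.backend = backend   # callable(method, url, body) -> response text
        self.cache = {}          # URL -> cached response
        self.backend_calls = 0

    def _forward(self, method, url, body):
        self.backend_calls += 1
        return self.backend(method, url, body)

    def request(self, method, url, body=None):
        if method != "GET":
            # POST (and hence SOAP) calls always reach the backend: the result
            # depends on the request body, which this proxy does not inspect.
            return self._forward(method, url, body)
        if url not in self.cache:
            self.cache[url] = self._forward(method, url, body)
        return self.cache[url]


def fake_backend(method, url, body):
    """Hypothetical backend that just echoes what it was asked."""
    return "<response request='%s %s'/>" % (method, url)


proxy = CachingProxy(fake_backend)
proxy.request("GET", "/domains/example.com")
proxy.request("GET", "/domains/example.com")   # served from the cache
proxy.request("POST", "/soap", "<Envelope>...</Envelope>")
proxy.request("POST", "/soap", "<Envelope>...</Envelope>")
print(proxy.backend_calls)
```

The two GETs cost one backend call between them, while each POST costs one of its own.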

Conclusions
In the end I believe SOAP requires greater implementation effort and understanding on the client side, while HTTP-based or REST-based APIs require greater implementation effort on the server side. API adoption can increase considerably if an HTTP-based interface is provided. In fact, an HTTP-based API with XML/JSON responses represents the best of both worlds: it is easy to implement on the server as well as easy to consume from a client.


1 Jan, 2009

Solid State Drives vs Hard disk drives

Posted by Bhavin Turakhia | (8) Comments

Intro

  • A solid state drive stores its data in solid-state memory (Flash / SRAM / DRAM)
  • Flash does not require constant power and is non-volatile while SRAM and DRAM are volatile

Speeds

  • Flash may be slower than even traditional HDDs on large-file access
  • Flash is considerably slower than conventional disks for small writes. This is partly due to its large erase block size of 0.5-1 MB
  • SSDs are faster than HDDs for small random reads due to negligible seek time (no moving parts)
  • Check the comparison table at http://www.storagesearch.com/ssd-ram-v-flash.html. When Flash-based SSDs are used for equal reads and writes they are actually slower than HDDs. However, if small random reads far outweigh writes, the performance gains can be up to 100x
  • Download the paper - Comparison of Drive Technologies for High-Transaction Databases. Findings below -
    • HDDs: Small reads - 175 IOPS, Small writes - 280 IOPS
    • Flash SSDs: Small reads - 1075 IOPS (6x), Small writes - 21 IOPS (0.1x)
    • DRAM SSDs: Small reads - 4091 IOPS (23x), Small writes - 4184 IOPS (14x)
  • Another whitepaper on Flash vs HDDs is Understanding Flash SSD Performance. Findings below -
    • Read performance: Flash outperforms HDDs by a large margin for small block sizes
    • It is with write performance that Flash SSDs become problematic. The issue here is the internal structure used within the Flash storage array. This structure includes a collection of bytes called an “erase block”. When you write to a Flash SSD, the drive itself cannot just update the sectors you are changing, but must merge your changes with existing data to update a complete erase block. As Flash SSDs have gotten faster and larger, erase blocks have grown as well. Flash erase blocks used to be 16K in length. Now they are 1 Megabyte for small SSDs extending up to as large as 4 Megabytes for some models.
    • If you are doing pure reads, a Flash SSD will typically be 20x faster than a hard disk for small random reads. If you are doing pure random writes, the same drive might be 15x slower than a hard disk
    • Of pertinence is the table which shows how a small % of writes can destroy Flash SSD Performance. It is for this reason alone that Flash SSDs, by themselves, are not very effective with random update applications like on-line databases, mail queues, and other environments that involve a lot of small updates
  • One can improve write performance of a Flash SSD using the following methods -
    • OS Write caching - OS buffers writes which eventually get written to disk making the writes appear faster
    • File systems optimized to minimize random writes - YAFFS, JFFS2.
    • Managed Flash Technology - a patent-pending technology by EasyCo which enables Flash drives to write clusters of random data in linear streams
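
To get a feel for the numbers above, here is a back-of-envelope Python sketch. It rests on two assumptions of mine that are not from the papers: operations are served one at a time (so the average time per operation is the weighted average of read and write times), and a small write is 4 KB. The IOPS figures are the ones quoted above.

```python
def blended_iops(read_iops, write_iops, write_fraction):
    """Throughput for a read/write mix, assuming operations run one at a time."""
    seconds_per_op = (1 - write_fraction) / read_iops + write_fraction / write_iops
    return 1 / seconds_per_op


def erase_block_amplification(erase_block_bytes, write_bytes):
    """How many times more data is rewritten than was actually changed."""
    return erase_block_bytes / write_bytes


HDD = (175, 280)     # small reads, small writes (IOPS) from the first paper
FLASH = (1075, 21)

for percent_writes in (0, 5, 10, 25, 50):
    fraction = percent_writes / 100
    print("%2d%% writes: HDD %6.1f IOPS, Flash SSD %6.1f IOPS" % (
        percent_writes,
        blended_iops(HDD[0], HDD[1], fraction),
        blended_iops(FLASH[0], FLASH[1], fraction),
    ))

# Assumed 4 KB write landing in a 1 MB erase block.
print(erase_block_amplification(1024 * 1024, 4 * 1024))
```

Under this simple model the Flash SSD drops from 1075 IOPS to roughly 306 IOPS with just 5% writes, and falls behind the HDD at roughly 10% writes. A 4 KB change inside a 1 MB erase block means rewriting 256 times as much data as was changed.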

Costs

  • As of mid-2008, SSD prices are still considerably higher per gigabyte than those of comparable conventional hard drives: consumer-grade drives are typically US$2.00 to US$3.45 per GB for flash drives and over US$80.00 per GB for RAM-based drives, compared to about US$0.12 per gigabyte for hard drives
  • DRAM-based SSDs require more power than hard disks when operating, and need continuous power when not in use if the data needs to be persistent
  • Check article Flash vs DRAM Price Projections - for SSD Buyers
    • In the first half of 2007 the difference in user price between a RAM versus Flash SSD was about 45 to 1. A year later in the first half of 2008 that ratio had changed to 25 to 1
    • However NAND has been on a steeper price decline than DRAM for its entire existence. The price of a gigabyte of DRAM declines (on average) 32% per year. There are indications that this decline may slow. Meanwhile, NAND’s price per gigabyte declines faster, at an average of 50% per year
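
As a quick piece of arithmetic on those decline rates, the sketch below compounds them year over year. It assumes, and this is my assumption rather than the article's, that SSD user prices simply track the average memory price declines of 32% (DRAM) and 50% (NAND) per year, starting from the 25 to 1 ratio of early 2008.

```python
def projected_price_ratio(start_ratio, dram_decline, nand_decline, years):
    """RAM-to-Flash price ratio after compounding both annual declines."""
    dram_factor = (1 - dram_decline) ** years
    nand_factor = (1 - nand_decline) ** years
    return start_ratio * dram_factor / nand_factor


for years in range(0, 4):
    ratio = projected_price_ratio(25, 0.32, 0.50, years)
    print("after %d year(s): %.1f to 1" % (years, ratio))
```

With these inputs the ratio widens to 34 to 1 after one year. The 2007 to 2008 SSD figures above actually moved the other way, so treat this as arithmetic on the quoted averages and not as a forecast.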

My Conclusions

  • DRAM-based SSDs are crazy expensive. They serve best for volatile caches (e.g. memcached pools etc.). If you have servers dedicated to serving in-memory cache data, it may reduce your cost to add DRAM SSDs to these clusters, since they are unlikely to bottleneck on CPU anyway
  • Flash-based SSDs would work in an environment where the % of writes is low. As can be seen in some of the above benchmarks, a Flash-based SSD starts degrading in performance in comparison to HDDs in environments with just 5% writes. If one wants to use Flash-based SSDs in environments with substantial writes, one should use special filesystems (YAFFS / JFFS2) and/or use Managed Flash Technology
  • Flash based SSDs work like a charm in a read-only or mostly read environment