24 Oct, 2010
Achieving 100% uptime through the CRABS model
Posted by Bhavin Turakhia
As a web 2.0 company today, five nines no longer cuts it with respect to uptime. We do not have the luxury of providing only 99.999% availability. Users expect 100% uptime. This post is a macro model of the things that need to be taken care of to achieve 100% uptime. In keeping with the industry’s love for acronyms, I call it the CRABS model.
Capacity
You must be aware of the exact capacity that your infrastructure can handle, in terms of requests, number of users, amount of storage, number of transactions, network throughput and so on. This applies to every component within the system. If your architecture comprises a database, an app server, a queue, a mail server and a memory cache, each of these components has its own capacity limitations. Capacity also depends on the state of the system, the time of day, user patterns etc. For instance, if you are heavily dependent on memory caches, and your application design makes it possible to start out with a cold cache, then the number of requests your application can handle during that time will be lower than the number it can handle with a warm cache.
Knowing the capacity of every component in the system allows you to do the following -
* determine the peak load your system can handle
* put limits into place to ensure your system never gets more requests than it can handle
* determine when the system is approaching peak capacity and pre-emptively scale the infrastructure to account for growth
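The three points above can be sketched as a small calculation. This is only an illustration: the `Component` type, the per-component numbers, the `calls_per_request` fan-out and the 70% scaling threshold are all assumptions made for the example, not measurements from any real system.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    capacity_rps: float       # measured maximum calls per second (hypothetical figures)
    calls_per_request: float  # calls one user request makes to this component (assumed)


def effective_rps(component):
    # User requests per second this component can sustain on its own.
    return component.capacity_rps / component.calls_per_request


def system_peak_rps(components):
    # The system as a whole is limited by its tightest component.
    return min(effective_rps(c) for c in components)


def bottleneck(components):
    # The component that will saturate first as load grows.
    return min(components, key=effective_rps)


def needs_scaling(current_rps, peak_rps, threshold=0.7):
    # Flag the need to add capacity before the peak is reached.
    # The 0.7 threshold is an arbitrary example value.
    return current_rps >= threshold * peak_rps


parts = [
    Component("app server", 2000, 1),
    Component("database", 3000, 3),
    Component("memory cache", 20000, 5),
]
peak = system_peak_rps(parts)
print(bottleneck(parts).name, peak, needs_scaling(750, peak))
```

With these made-up figures the database is the limiting component, so any request limit for the whole system would have to be set at or below what the database can sustain. A cold-cache scenario could be modelled by rerunning the same calculation with a higher `calls_per_request` for the database.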
Redundancy
Every component must have adequate redundancy in an active-active model. These days a simple n+1 does not cut it, nor does a standby failover. Most redundant clusters carry capacity well beyond what is required during peak loads. Additionally, it is no longer acceptable to require even a few minutes of downtime for a standby to start up when the primary node goes down. And it is certainly not acceptable to lose any data. Downtime of any node or any component is expected to be completely transparent to end users. This becomes difficult when you take into account user sessions, state and data storage, and it requires thought at design time. Applications have to be designed from the ground up to be redundant to the extent that downtime of multiple hardware and software components does not impact the end user in any way. Larger applications take into account geo-redundancy and the possibility of entire datacenters or geographical locations being unavailable for a certain period of time. As many components as possible should run in active-active mode, where the failure of one member of a set has no impact on the end user. Think of every component (hardware and software) in your setup and allow for several of them to fail at the same time. Ensure adequate capacity and data redundancy.
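To make "allow for several of them to fail at the same time" concrete, here is a minimal sketch of the arithmetic for one active-active cluster. It assumes identical nodes and load that spreads evenly across the survivors, which real clusters only approximate; the function names and the figures are invented for the example.

```python
import math


def nodes_needed(peak_rps, node_capacity_rps):
    # Minimum number of identical nodes required to carry peak load.
    return math.ceil(peak_rps / node_capacity_rps)


def tolerable_failures(node_count, node_capacity_rps, peak_rps):
    # How many nodes can fail at once while the rest still carry peak load.
    # A negative result means the cluster is under-provisioned even with
    # every node healthy.
    return node_count - nodes_needed(peak_rps, node_capacity_rps)


def nodes_to_provision(peak_rps, node_capacity_rps, failures_to_survive):
    # Cluster size needed to survive the given number of simultaneous failures.
    return nodes_needed(peak_rps, node_capacity_rps) + failures_to_survive


# Hypothetical cluster: 6 nodes of 500 requests/second each, 1800 requests/second at peak.
print(tolerable_failures(6, 500, 1800))
print(nodes_to_provision(1800, 500, failures_to_survive=3))
```

A plain n+1 design corresponds to `failures_to_survive=1`. The same calculation can be repeated per datacenter when planning for geo-redundancy.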
Abuse mitigation
Expect users, hackers, customers, vendors, developers and unrelated 3rd parties to intentionally or unintentionally abuse your system. I divide abuse into the following categories -
- Denial of Service: Someone sending unwarranted requests to your system consumes its peak capacity, resulting in a denial of service to your other users. These can be application requests or network requests. The requests may be intentional or unintentional, and may be distributed. The requests may even be legitimate. For instance, one may legitimately use your mail system to send out a million emails. Preventing DoS requires identifying all potential scenarios and ensuring that none of the services and devices in your infrastructure permit any user or system to send more than a warranted number of requests. Network-based DDoS attacks must be mitigated by using special DDoS mitigation equipment that cleans the traffic.
- Security breaches: Someone accessing your system with the intention of damaging it, by exploiting a vulnerability in the network, application, OS etc to gain access and disrupt your services. One needs to employ server hardening, firewalls, strict security processes, access policies and intrusion detection systems, follow OWASP guidelines, ensure application security and much more to keep one’s services secure.
- Manual booboos: Many a downtime has been the result of an unsuspecting sysadmin running “rm -fr” or a fatigued developer running a “delete from table” without a where clause. One can prevent these by defining structured processes and policies.
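One common way to ensure that no single user or system sends more than a warranted number of requests is a per-client token bucket. The sketch below is an assumption about how such a limit could look inside an application, not a description of any particular product; the class names, the rate and the burst size are made up, and the `now` parameter exists only so the behaviour can be checked without waiting on a real clock.

```python
import time


class TokenBucket:
    """Allows `rate` requests per second on average, with bursts up to `burst`."""

    def __init__(self, rate, burst, now=None):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)  # start with a full bucket
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        elapsed = max(now - self.updated, 0.0)
        # Refill in proportion to elapsed time, never beyond the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerClientLimiter:
    """Keeps one bucket per client so a heavy client cannot starve the others."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.buckets = {}

    def allow(self, client_id, now=None):
        bucket = self.buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.burst, now)
            self.buckets[client_id] = bucket
        return bucket.allow(now)


limiter = PerClientLimiter(rate=1, burst=3)
# Four back-to-back requests from one client: the fourth exceeds the burst.
print([limiter.allow("client-a", now=0.0) for _ in range(4)])
```

A limiter like this only covers application-level requests from identifiable clients. It does nothing against network-level floods, which still need dedicated mitigation upstream.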
Bugs
Another frequent cause of downtime or service unavailability is bugs in the software. Heed the following tips to ensure zero defects in a live scenario -
- Adequate automated and manual unit and functional testing of the software
- Dogfooding and staggered releases, wherein new versions are always released to limited internal and external audiences before being released to the entire user base
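A staggered release needs some stable way of deciding which users get the new version. One possible approach, sketched below, hashes the user id into one of 100 buckets and compares it with the current rollout percentage. The function names, the feature label and the internal user list are hypothetical.

```python
import hashlib


def rollout_bucket(user_id, feature):
    # Map a user to a stable bucket in 0..99. Including the feature name
    # means different releases pick different early audiences.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def sees_new_version(user_id, feature, percent, internal_users=frozenset()):
    # Internal users (dogfooding) always get the new version first.
    if user_id in internal_users:
        return True
    # External users are admitted as the percentage is raised.
    return rollout_bucket(user_id, feature) < percent


staff = frozenset({"alice@example.com"})
print(sees_new_version("alice@example.com", "new-checkout", 0, staff))
print(sees_new_version("some-user", "new-checkout", 100, staff))
```

Because the bucket depends only on the user id and the feature name, a user who sees the new version at 10% still sees it at 50%, and rolling back is a matter of lowering the percentage.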
Scalability
Careful capacity planning does not protect you from getting TechCrunched, Slashdotted or Dugg. Your application design must support infinite scalability. This again requires careful planning with respect to application design and hardware selection. Vertical and horizontal partitioning, clustering, stateless configurations and more help in creating a design that scales linearly by adding nodes, without requiring any downtime. Always think of millions of users.
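As one illustration of horizontal partitioning that tolerates adding nodes, here is a minimal consistent-hashing sketch. It is a swapped-in technique chosen for the example, not something this post prescribes; the `HashRing` class, the node names and the number of virtual points per node are all assumptions.

```python
import bisect
import hashlib


def _hash(value):
    # 64-bit position on the ring derived from SHA-256.
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas  # virtual points per node, to even out the load
        self._points = []         # sorted ring positions
        self._owners = {}         # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.replicas):
            point = _hash(f"{node}#{i}")
            if point in self._owners:
                continue  # extremely unlikely collision: keep the existing owner
            self._owners[point] = node
            bisect.insort(self._points, point)

    def node_for(self, key):
        if not self._points:
            raise LookupError("ring has no nodes")
        # The first ring position clockwise from the key's hash owns the key.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owners[self._points[idx]]


ring = HashRing(["node-a", "node-b", "node-c"])
before = {f"user-{i}": ring.node_for(f"user-{i}") for i in range(1000)}
ring.add_node("node-d")
moved = sum(1 for key, node in before.items() if ring.node_for(key) != node)
print(f"{moved} of {len(before)} keys moved after adding a node")
```

With plain modulo hashing, adding a node would reassign most keys. Here only the keys that now belong to the new node move, which is what makes it feasible to grow the cluster while it keeps serving traffic.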









