21 May, 2011

Unix domain sockets vs TCP Sockets

Posted by Bhavin Turakhia | (0) Comments

Here are two interesting links I found comparing the features and performance of Unix domain sockets and TCP loopback sockets:

http://lists.freebsd.org/pipermail/freebsd-performance/2005-February/001143.html

Excerpt: IP sockets over localhost are basically looped back network on-the-wire IP. There is intentionally “no special knowledge” of the fact that the connection is to the same system, so no effort is made to bypass the normal IP stack mechanisms for performance reasons. For example, transmission over TCP will always involve two context switches to get to the remote socket, as you have to switch through the netisr, which occurs following the “loopback” of the packet through the synthetic loopback interface. Likewise, you get all the overhead of ACKs, TCP flow control, encapsulation/decapsulation, etc. Routing will be performed in order to decide if the packets go to the localhost. Large sends will have to be broken down into MTU-size datagrams, which also adds overhead for large writes. It’s really TCP, it just goes over a loopback interface by virtue of a special address, or discovering that the address requested is served locally rather than over an ethernet (etc).

UNIX domain sockets have explicit knowledge that they’re executing on the same system. They avoid the extra context switch through the netisr, and a sending thread will write the stream or datagrams directly into the receiving socket buffer. No checksums are calculated, no headers are inserted, no routing is performed, etc. Because they have access to the remote socket buffer, they can also directly provide feedback to the sender when it is filling, or more importantly, emptying, rather than having the added overhead of explicit acknowledgement and window changes. The one piece of functionality that UNIX domain sockets don’t provide that TCP does is out-of-band data. In practice, this is an issue for almost noone.

http://osnet.cs.binghamton.edu/publications/TR-20070820.pdf

Excerpt: It was hypothesized that pipes would have the highest throughput due to its limited functionality, since it is half-duplex, but this was not true. For almost all of the data sizes transferred, Unix domain sockets performed better than both TCP sockets and pipes, as can be seen in Figure 1 below. Figure 1 shows the transfer rates for the IPC mechanisms, but it should be noted that they do not represent the speeds obtained by all of the test machines. The transfer rates are consistent across the machines with similar hardware configurations though. On some machines, Unix domain sockets reached transfer rates as high as 1500 MB/s.
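
To get a feel for the difference on your own machine, here is a minimal Python sketch that pushes the same number of bytes through a Unix domain socket pair and through a TCP connection over 127.0.0.1. The helper names (pump, unix_pair, tcp_loopback_pair, compare) and the 64 KB chunk size are assumptions made for illustration; they do not come from either of the links above, and the numbers you get depend heavily on OS, kernel and hardware.

```python
import socket
import threading
import time

CHUNK = 64 * 1024  # illustrative chunk size, not taken from the linked benchmarks


def pump(sender, receiver, total_bytes):
    """Send total_bytes from sender to receiver; return (bytes_received, seconds)."""
    payload = b"x" * CHUNK

    def send_all():
        remaining = total_bytes
        while remaining > 0:
            n = min(remaining, CHUNK)
            sender.sendall(payload[:n])
            remaining -= n
        # Signal end-of-stream so the receiver's recv() returns b"".
        sender.shutdown(socket.SHUT_WR)

    writer = threading.Thread(target=send_all)
    start = time.perf_counter()
    writer.start()
    received = 0
    while True:
        data = receiver.recv(CHUNK)
        if not data:
            break
        received += len(data)
    writer.join()
    return received, time.perf_counter() - start


def unix_pair():
    """A connected pair of Unix domain stream sockets."""
    return socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)


def tcp_loopback_pair():
    """A connected pair of TCP sockets talking over 127.0.0.1."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    server, _ = listener.accept()
    listener.close()
    return client, server


def compare(total_bytes=8 * 1024 * 1024):
    """Run the same transfer over both transports and collect the results."""
    results = {}
    for name, make_pair in (("unix", unix_pair), ("tcp", tcp_loopback_pair)):
        a, b = make_pair()
        try:
            results[name] = pump(a, b, total_bytes)
        finally:
            a.close()
            b.close()
    return results


if __name__ == "__main__":
    for name, (received, seconds) in compare().items():
        mb_per_s = received / (1024 * 1024) / max(seconds, 1e-9)
        print(f"{name}: {received} bytes in {seconds:.4f}s ({mb_per_s:.0f} MB/s)")
```

A single short run like this is noisy; repeating the transfer several times and with different sizes gives a fairer picture.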

Category : Uncategorized

24 Oct, 2010

Achieving 100% uptime through the CRABS model

Posted by Bhavin Turakhia | (0) Comments

As a web 2.0 company today, five nines no longer cuts it with respect to uptime. We do not have the luxury of providing just 99.999% availability; users expect 100% uptime. This post is a macro model of the things that need to be taken care of to achieve 100% uptime. In keeping with the industry’s love for acronyms, I call it the CRABS model :)

Capacity

You must be aware of the exact capacity that your infrastructure can handle, in terms of requests, number of users, amount of storage, number of transactions, network throughput and so on. This applies to every component within the system; each service has its own capacity limitations. If your architecture comprises a database, an app server, a queue, a mail server, and a memory cache, each of these components has its own limits. Capacity also depends on the state of the system, time of day, user patterns etc. For instance, if you are heavily dependent on memory caches, and your application design allows for starting out with a cold cache, then the number of requests your application can handle during this time will differ from the number it can handle with a warm cache.

Knowing the capacity of every component in the system allows you to do the following -
* determine the peak load your system can handle
* put limits into place to ensure your system never gets more requests than it can handle
* determine when the system is reaching close to peak capacity and pre-emptively scale the infrastructure to account for growth
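
One common way to put such limits into place is a token bucket. The sketch below is a minimal, single-process illustration; the class name TokenBucket and its parameters are assumptions made for this example, and a real deployment would usually enforce limits at a shared layer such as a load balancer or gateway.

```python
import time


class TokenBucket:
    """Admit at most `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity      # start with a full bucket
        self.clock = clock          # injectable so the behaviour can be tested
        self.last = clock()

    def allow(self, cost=1):
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        # Refill in proportion to the time elapsed, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=100, capacity=200)  # numbers would come from your measured capacity
    print("admitted" if bucket.allow() else "rejected")
```

The rate and capacity values are where the measured capacity of the component goes; requests that are refused can be queued or rejected instead of overloading the system.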

Redundancy

Every component must have adequate redundancy in an active-active model. These days a simple n+1 does not cut it, nor does a standby failover. Most redundant clusters consist of capacity well beyond that required during peak loads. Additionally, it is no longer acceptable to require even a few minutes of downtime for a standby to start up in case the primary node goes down. And it is certainly not acceptable to lose any data. Downtime of any node or any component is expected to be completely transparent to end users. This starts becoming difficult when you take into account user sessions, state and data storage, and it requires thought at design time. Applications have to be designed from the ground up to be redundant to an extent where downtime of multiple hardware and software components does not impact the end user in any way. Larger applications take into account geo-redundancy and the possibility of entire datacenters or geographical locations being unavailable for a certain period of time. As many components as possible should run in active-active mode, where failure of one of a set does not result in any impact to the end user. Think of every component (hardware and software) in your setup and allow for several of them to fail at the same time. Ensure adequate capacity and data redundancy.

Abuse mitigation

Expect users, hackers, customers, vendors, developers and unrelated 3rd parties to intentionally or unintentionally abuse your system. I divide abuse into the following categories -

  • Denial of Service: Someone sending unwarranted requests to your system uses up its peak capacity, resulting in a denial of service to your other users. These can be application requests or network requests. The requests may be intentional or unintentional, and may be distributed. The requests may even be legitimate. For instance, one may legitimately use your mail system to send out a million emails. Preventing DoS requires identifying all potential scenarios and ensuring none of the services and devices in your infrastructure permit any user or system to send more than a warranted number of requests. Network-based DDoS attacks must be mitigated by using special DDoS mitigation equipment that cleans the traffic
  • Security breaches: Someone accessing your system with the intention of damaging it by exploiting a vulnerability in the network, application, OS etc to gain access and disrupt your services. One needs to employ server hardening, firewalls, strict security processes, access policies and intrusion detection systems, follow OWASP guidelines, ensure application security and much more to keep one’s services tightly secured.
  • Manual booboos: Many a downtime has been a result of an unsuspecting sysad running “rm -fr” or a fatigued developer running a “delete from table” without a where clause. One can prevent these by defining structured processes and policies.
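
Processes and policies can also be backed by tooling. As a rough illustration, the Python sketch below flags DELETE or UPDATE statements that have no WHERE clause before they reach the database. The function name is_unbounded_write is hypothetical, and the check is deliberately naive (a regex, not a SQL parser), so it can be fooled by comments or string literals.

```python
import re

# Statements that modify many rows when no WHERE clause is present.
_WRITE = re.compile(r"^\s*(delete\s+from|update)\b", re.IGNORECASE)
_WHERE = re.compile(r"\bwhere\b", re.IGNORECASE)


def is_unbounded_write(sql):
    """Return True for a DELETE/UPDATE that does not contain a WHERE clause."""
    return bool(_WRITE.match(sql)) and not _WHERE.search(sql)


if __name__ == "__main__":
    for statement in ("delete from users", "DELETE FROM users WHERE id = 1"):
        verdict = "BLOCK" if is_unbounded_write(statement) else "ok"
        print(f"{verdict}: {statement}")
```

A wrapper around the database client or an admin console could refuse such statements unless the operator confirms them explicitly.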

Bugs

Another frequent cause of downtime or service unavailability is bugs in the software. Heed the following tips to ensure zero defects in a live scenario -

  • Adequate automated and manual unit and functional testing of the software
  • Dog-fooding and staggered releases, wherein new versions are always released to limited internal and external audiences before being released to the entire user base
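
A staggered release is often implemented by deterministically mapping each user to a bucket and enabling the new version only for buckets below a rollout percentage. The following is a minimal sketch under that assumption; the function name in_rollout and the feature label are hypothetical.

```python
import hashlib


def in_rollout(user_id, feature, percent):
    """Return True if this user falls inside the first `percent` of 100 buckets."""
    # Hash feature and user together so different features get different cohorts.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    enabled = sum(in_rollout(u, "new-checkout", 10) for u in users)
    print(f"{enabled} of {len(users)} users see the new version")
```

Because the bucket depends only on the user and the feature, a user who is in the 10% cohort stays in it when the rollout is widened to 50% or 100%.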

Scalability

Careful capacity planning does not prevent getting tech-crunched, slash-dotted or dugg. Your application design must support infinite scalability. This again requires careful planning with respect to application design and hardware selection. Vertical and Horizontal partitioning, clustering, stateless configurations and more help in creating a design that scales linearly by adding additional nodes without requiring any downtime. Always think of millions of users.

Category : 0-cosmos | TechTalk

17 May, 2010

To Trie or not to Trie – a comparison of efficient data structures

Posted by Bhavin Turakhia | (5) Comments

Since my discussion thread on the efficiency of the in-memory data structure of ZeroMQ with Martin Sustrik, I have been reading up bit by bit on efficient data structures, primarily from the perspective of memory utilization. Data structures that provide constant lookup time with minimal memory utilization can give a significant performance boost, since access to CPU cache is considerably faster than access to RAM. This post is a compendium of a few data structures I came across and their salient aspects:

Judy arrays http://judy.sourceforge.net/doc/10minutes.htm
Excerpt: A Judy tree is generally faster than and uses less memory than contemporary forms of trees such as binary (AVL) trees, b-trees, and skip-lists. When used in the “Judy Scalable Hashing” configuration, Judy is generally faster then a hashing method at all populations. A (CPU) cache-line fill is additional time required to do a read reference from RAM when a word is not found in cache. In today’s computers the time for a cache-line fill is in the range of 50..2000 machine instructions. Therefore a cache-line fill should be avoided when fewer than 50 instructions can do the same job. Judy rarely compromises speed/space performance for simplicity (Judy will never be called simple except at the API). Judy is designed to avoid cache-line fills wherever possible. The Achilles heel of a simple digital tree is very poor memory utilization, especially when the N in N-ary (the degree or fanout of each branch) increases. The Judy tree design was able to solve this problem. In fact a Judy tree is more memory-efficient than almost any other competitive structure (including a simple linked list).

HAT-trie – a cache-conscious trie http://portal.acm.org/citation.cfm?id=1273761
Excerpt: Tries are the fastest tree-based data structures for managing strings in-memory, but are space-intensive. The burst-trie is almost as fast but reduces space by collapsing trie-chains into buckets. This is not however, a cache-conscious approach and can lead to poor performance on current processors. In this paper, we introduce the HAT-trie, a cache-conscious trie-based data structure that is formed by carefully combining existing components. We evaluate performance using several real-world datasets and against other high-performance data structures. We show strong improvements in both time and space; in most cases approaching that of the cache-conscious hash table. Our HAT-trie is shown to be the most efficient trie-based data structure for managing variable-length strings in-memory while maintaining sort order.

Burst Trie http://goanna.cs.rmit.edu.au/~jz/fulltext/acmtois02.pdf
Excerpt: Many applications depend on efficient management of large sets of distinct strings in memory. We propose a new data structure, the burst trie, that has significant advantages over existing options for such applications: it requires no more memory than a binary tree; it is as fast as a trie; and, while not as fast as a hash table, a burst trie maintains the strings in sorted or near-sorted order. These experiments show that the burst trie is particularly effective for the skewed frequency distributions common in text collections, and dramatically outperforms all other data structures for the task of managing strings while maintaining sort order.

Radix trie (aka Patricia trie) http://en.wikipedia.org/wiki/Radix_tree
Excerpt: The radix tree is easiest to understand as a space-optimized trie where each node with only one child is merged with its child. Unlike balanced trees, radix trees permit lookup, insertion, and deletion in O(k) time rather than O(log n)

Ternary Search Trees http://en.wikipedia.org/wiki/Ternary_search_tree
Excerpt: A trie is optimized for speed at the expense of size. The ternary search tree replaces each node of the trie with a modified binary search tree. For sparse tries, this binary tree will be smaller than a trie node. Each binary tree implements a single-character lookup. It has the typical left and right children which are checked if the lookup character is greater or less than the node’s character, respectively. A third child is used if the lookup character is found on that particular node. Unlike the other children, it links to the root of the binary search tree for the next character in the string
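
The ternary search tree described in the excerpt can be sketched in a few lines of Python. This is an illustrative implementation, not code from the linked article; the names Node and TernarySearchTree are assumptions, and the sketch supports only insertion and membership tests for non-empty strings.

```python
class Node:
    """One character of a key, with three children: lower, equal, higher."""
    __slots__ = ("char", "lo", "eq", "hi", "is_end")

    def __init__(self, char):
        self.char = char
        self.lo = self.eq = self.hi = None
        self.is_end = False  # True if a stored key ends at this node


class TernarySearchTree:
    def __init__(self):
        self.root = None

    def insert(self, key):
        if not key:
            raise ValueError("empty keys are not supported in this sketch")
        self.root = self._insert(self.root, key, 0)

    def _insert(self, node, key, i):
        ch = key[i]
        if node is None:
            node = Node(ch)
        if ch < node.char:
            node.lo = self._insert(node.lo, key, i)
        elif ch > node.char:
            node.hi = self._insert(node.hi, key, i)
        elif i + 1 < len(key):
            # Character matched: follow the middle child for the next character.
            node.eq = self._insert(node.eq, key, i + 1)
        else:
            node.is_end = True
        return node

    def __contains__(self, key):
        if not key:
            return False
        node, i = self.root, 0
        while node is not None:
            ch = key[i]
            if ch < node.char:
                node = node.lo
            elif ch > node.char:
                node = node.hi
            elif i + 1 < len(key):
                node, i = node.eq, i + 1
            else:
                return node.is_end
        return False


if __name__ == "__main__":
    tree = TernarySearchTree()
    for word in ("cat", "cap", "car", "dog"):
        tree.insert(word)
    print("cat" in tree, "ca" in tree)
```

Each node stores a single character and three pointers, which is where the space saving over a trie node with one slot per possible character comes from.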

Next steps: to trie ;) and set up benchmarks for some of these on a practical application

Category : 0-cosmos | TechTalk

6 May, 2010

RabbitMQ vs Apache ActiveMQ vs Apache qpid

Posted by Bhavin Turakhia | (20) Comments

We need a simple message queue to ensure asynchronous message passing across a bunch of our server side apps. The message volume is not intended to be very high, latency is not an issue, and order is not important, but we do need to guarantee that the message will be received and that there is no potential for failure irrespective of infrastructure downtime.

Dhruv from my team had taken up the task of researching various persistent message queue options and compiling notes on them. This is a compendium of his notes (disclaimer – this is an outline of our experience, there may be inaccuracies) -

RabbitMQ

General:

  • Some reading on clustering http://www.rabbitmq.com/clustering.html
  • DNS errors cause the DB(mnesia) to crash
  • A RabbitMQ instance won’t scale to LOTS of queues with each queue having fair load, since all queues (queue metadata) are stored in memory. Also, in a clustered setup, each queue’s metadata (but not the queue’s messages) is replicated on each node. Hence, there is the same amount of overhead due to queues on every node in a cluster
  • No ONCE-ONLY semantics. Messages may be sent twice by RabbitMQ to the consumer(s)
  • Multiple consumers can be configured for a single queue, and they will all get mutually exclusive messages
  • Unordered; not FIFO delivery
  • Single socket multiple connections. Each socket can have multiple channels and each channel can have multiple consumers
  • No provision for ETA
  • maybe auto-requeue (based on timeout) — needs investigation
  • Only closing the connection NACKs a message. Removing the consumer from that channel does NOT. Hence, all queues being listened to on that channel/connection are closed for the current consumer
  • NO EXPONENTIAL BACKOFF for failed consumers. Failed messages are retried almost immediately. Hence an error in the consumer logic that crashes the consumer while consuming a particular message may potentially block the whole queue. Hence, the consumer needs to be programmed well, i.e. error free. However, apps are, well, apps…
  • Consumer has to do rate limiting by not consuming messages too fast (if it wants to); no provision for this in RabbitMQ
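
Since the notes above say that messages may be delivered twice and that failed messages are retried immediately, the consumer has to compensate. Below is a broker-agnostic Python sketch of a consumer wrapper that skips duplicates by message id and retries a failing handler with exponential backoff. The class SafeConsumer, its parameters and the in-memory `seen` set are assumptions for illustration; it does not use any RabbitMQ client API, and a real consumer would need a persistent de-duplication store.

```python
import time


class SafeConsumer:
    """Wraps a message handler with de-duplication and exponential backoff."""

    def __init__(self, handler, max_attempts=5, base_delay=0.5,
                 max_delay=30.0, sleep=time.sleep):
        self.handler = handler
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.sleep = sleep          # injectable so tests do not really sleep
        self.seen = set()           # ids of messages already processed (in memory only)
        self.dead_letters = []      # messages that exhausted all attempts

    def backoff_delay(self, attempt):
        """Delay before the retry that follows failed attempt number `attempt` (0-based)."""
        return min(self.max_delay, self.base_delay * (2 ** attempt))

    def on_message(self, message_id, body):
        if message_id in self.seen:
            return "duplicate"      # redelivered message: safe to ignore
        for attempt in range(self.max_attempts):
            try:
                self.handler(body)
            except Exception:
                # Wait longer after each failure instead of retrying immediately.
                if attempt + 1 < self.max_attempts:
                    self.sleep(self.backoff_delay(attempt))
            else:
                self.seen.add(message_id)
                return "processed"
        # Park the poison message so it cannot block the rest of the queue.
        self.dead_letters.append((message_id, body))
        return "dead-lettered"


if __name__ == "__main__":
    consumer = SafeConsumer(handler=print)
    print(consumer.on_message("msg-1", "hello"))
    print(consumer.on_message("msg-1", "hello"))
```

Setting a poison message aside after a bounded number of attempts is what keeps one bad message from blocking the whole queue.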

Persistence:

  • It will use only its own DB; you can’t configure MySQL or any such thing

Clustering and Replication:

  • A RabbitMQ cluster is just a set of nodes running RabbitMQ. No master node is involved.
  • You need to specify the hostnames of the cluster nodes manually on the command line or in a config file.
  • Basic load balancing by nodes in a cluster by redirecting requests to other nodes
  • A node can be a RAM node or a disk node. RAM nodes keep their state only in memory (with the exception of the persistent contents of durable queues which are still stored safely on disc). Disk nodes keep state in memory and on disk.
  • Queue metadata shared across all nodes.
  • RabbitMQ brokers tolerate the failure of individual nodes. Nodes can be started and stopped at will
  • It is advisable to have at least 1 disk node in a cluster of nodes
  • You need to specify which nodes are part of a cluster during node startup. Hence, when A is the first one to start, it will think that it is the only one in the cluster. When B is started it will be told that A is also in the cluster and when C starts, it should be told that BOTH A and B are part of the cluster. This is because if A or B go down, C still knows one of the machines in the cluster. This is only required for RAM nodes, since they don’t persist metadata on disk. So, if C is a memory node and it goes down and comes up, it will have to be manually told which nodes to query for cluster membership (since it itself doesn’t store that state locally).
  • Replication needs to be investigated (check Addtl Resources); however, from initial reading, it seems queue data replication does not exist
  • FAQ: “How do you migrate an instance of RabbitMQ to another machine?”. Seems to be a very manual process.

Transactions:

  • Any number of queues can be involved in a transaction

Addtl Resources

Apache qpid

  • Supports transactions
  • Persistence using a pluggable layer — I believe the default is Apache Derby
  • Like the other Java-based product, this is HIGHLY configurable
  • Management using JMX and an Eclipse Management Console application - http://www.lahiru.org/2008/08/what-qpid-management-console-can-do.html
  • The management console is very feature rich
  • Supports message Priorities
  • Automatic client failover using configurable connection properties -
  • A cluster is nothing but a set of machines that have all the queues replicated
  • All queue data and metadata is replicated across all nodes that make up a cluster
  • All clients need to know in advance which nodes make up the cluster
  • Retry logic lies in the client code
  • Durable Queues/Subscriptions
  • Has bindings in many languages
  • For the curious: http://qpid.apache.org/current-architecture.html
  • In our tests -
    • Speed: Non-persistent mode: 5000 messages/sec (receive rate), Persistent mode: 1100 messages/sec (receive rate). The send rate will typically be a bit higher, but when you start off with an empty queue, the two are almost the same for most queue implementations. However, the interesting bit is that even in transacted mode, I saw a lot of message loss if I crashed the broker (by crash I mean Ctrl+C, not even the harsher kill -9 type of thing that I usually do). I stress this because apps can usually hook on to Ctrl+C and save data before quitting, but qpid didn’t think it prudent to do so. Out of 1265 messages sent (and committed), only 1218 were received by the consumer (before the inflicted crash). Even on restarting the broker and consumer, that didn’t change. We observed similar behaviour with RabbitMQ in our tests. However, the RabbitMQ docs mention that you need to run in TRANSACTED mode (not just durable/persistent) for guaranteed delivery. We haven’t run that test yet.
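
A crash test like the one above boils down to comparing the ids the producer committed with the ids the consumer actually saw. The Python sketch below shows one way to compute that; the function delivery_report is hypothetical and the ids in the usage example are synthetic stand-ins that merely reproduce the 1265 sent / 1218 received counts, not the actual test data.

```python
from collections import Counter


def delivery_report(sent_ids, received_ids):
    """Summarise which committed messages were lost or delivered more than once."""
    sent = set(sent_ids)
    counts = Counter(received_ids)
    lost = sorted(sent - set(counts))
    duplicated = sorted(m for m, c in counts.items() if c > 1)
    return {
        "sent": len(sent),
        "lost": lost,
        "duplicated": duplicated,
        "loss_rate": len(lost) / len(sent) if sent else 0.0,
    }


if __name__ == "__main__":
    # Synthetic ids: 1265 committed, the first 1218 of them received.
    report = delivery_report(range(1265), range(1218))
    print(len(report["lost"]), f"{report['loss_rate']:.2%}")
```

With those counts, 47 committed messages never arrived, which is roughly 3.7% of what was sent.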

Apache ActiveMQ

  • HIGHLY configurable. You can probably make it do anything you want
  • You can choose a message store. 4 are already available
  • Has lots of clustering options:
    • Shared nothing Master-Slave: ACK sent to client when master stores the message
    • Shared Database: Acquires a lock on the DB when any instance tries to access the DB
    • Shared Filesystem: Locks a file when accessing the FS. Issues when using NFS with file-locking; or basically any network based file system since file locking is generally buggy in network file systems
  • Network of brokers: This is an option that allows a lot of flexibility. However, it seems to be a very problematic/buggy way of doing things since people face a lot of issues with this configuration
  • Scaling:
    • Default transport is blocking I/O with a thread per connection. Can be changed to use nio
    • Horizontal scaling: Though they mention this, the way to achieve this is by using a network of brokers
    • Partitioning: We all know Mr. Partitioning, don’t we? The client decides where to route packets and hence must maintain multiple open connections to different brokers
  • Allows producer flow-control!!
  • Has issues wrt lost/duplicate messages, but there is an active community that fixes these issues
  • Active MQ crashes fairly frequently, at least once per month, and is rather slow - http://stackoverflow.com/questions/957507/lightweight-persistent-message-queue-for-linux
  • Seems to have bindings in many languages (just like RabbitMQ)
  • Has lots of tools built around it
  • JMS compliant; supports XA transactions: http://activemq.apache.org/how-do-transactions-work.html
  • Less performant as compared to RabbitMQ
  • We were able to perform some tests on Apache Active MQ today, and here are the results:
    • Non persistent mode: 5k messages/sec
    • Persistent mode: 22 messages/sec (yes that is correct)
  • There are multiple persisters that can be configured with ActiveMQ, so we are planning to run another set of tests with MySQL and file as the persisters. However, the current default (KahaDB) is said to be more scalable (and offers faster recoverability) as compared to the older default (file/AMQ Message Store: http://activemq.apache.org/amq-message-store.html).
  • The numbers are fair. Others on the net have observed similar results: http://www.mostly-useless.com/blog/2007/12/27/playing-with-activemq/
  • With MySQL, I get a throughput of 8 messages/sec. What is surprising is that it is possible to achieve much better results using MySQL but ActiveMQ uses the table quite unwisely.
  • ActiveMQ created the tables as InnoDB instead of MyISAM even though it doesn’t seem to be using any of the InnoDB features.
  • I tried changing the tables to MyISAM, but it didn’t help much. The messages table structure has 4 indexes!! Insert takes a lot of time because MySQL needs to update 4 indexes on every insert. That sort of kills performance. However, I don’t know if performance should be affected for small (< 1000) messages in the table. Either way, this structure won’t scale to millions of messages since everyone will block on this one table.
Category : 0-cosmos | TechTalk

14 Nov, 2009

RAM Speed

Posted by Bhavin Turakhia | (6) Comments

To test the speed of RAM, I got Ramki to run a small program that writes a set of bytes into memory a billion times and ran 4 instances of it on a dual proc quad core machine. Below are the results of running four instances of the program simultaneously.

Result

output.1:       User time (seconds): 545.99
output.1:       System time (seconds): 1.33
output.1:       Elapsed (wall clock) time (h:mm:ss or m:ss): 9:07.38
output.1:       Involuntary context switches: 820

output.2:       User time (seconds): 250.90
output.2:       System time (seconds): 1.18
output.2:       Elapsed (wall clock) time (h:mm:ss or m:ss): 4:12.12
output.2:       Involuntary context switches: 378

output.3:       User time (seconds): 250.30
output.3:       System time (seconds): 1.15
output.3:       Elapsed (wall clock) time (h:mm:ss or m:ss): 4:11.49
output.3:       Involuntary context switches: 373

output.4:       User time (seconds): 563.62
output.4:       System time (seconds): 1.31
output.4:       Elapsed (wall clock) time (h:mm:ss or m:ss): 9:25.00
output.4:       Involuntary context switches: 845

Observations

  • The write speed was between 0.25 and 0.55 seconds per million writes
  • Output.2 and .3 took roughly half the time taken by .1 or .4
  • Don’t have a specific theory on why 2 of the cores did better than the other two
  • No processor affinity was set, and the processes were being scheduled on random processors after every context switch.
  • Seemingly the processes were accessing RAM simultaneously. In my limited knowledge that could mean a few things – a multi-channel (dual) FSB and, additionally, while one process was computing stuff the other processes could access the FSB. The program was using lrand48 to generate a random number to write data to random locations so as to ensure that we do not rely too much on the L1/L2 cache
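
The original test program is not shown in this post. As a rough illustration of the idea (writing to random locations in a large buffer and converting the elapsed time into seconds per million writes), here is a small Python sketch. The function names, buffer size and write count are assumptions, and in Python the interpreter overhead dominates the measurement, so its timings are not comparable to the numbers above.

```python
import random
import time


def random_writes(buffer_size, writes, seed=0):
    """Write one byte to `writes` random offsets in a buffer; return (buffer, seconds)."""
    rng = random.Random(seed)       # stand-in for lrand48 in the original program
    buf = bytearray(buffer_size)
    start = time.perf_counter()
    for _ in range(writes):
        # Random offsets make it harder for the CPU caches to absorb the writes.
        buf[rng.randrange(buffer_size)] = 0xFF
    return buf, time.perf_counter() - start


def seconds_per_million(total_seconds, writes):
    """Normalise a total run time to seconds per million writes."""
    return total_seconds / (writes / 1_000_000)


if __name__ == "__main__":
    _, elapsed = random_writes(buffer_size=64 * 1024 * 1024, writes=1_000_000)
    print(f"{seconds_per_million(elapsed, 1_000_000):.3f} s per million writes")
    # The reported user times, normalised the same way for a billion writes:
    print(seconds_per_million(250.90, 10**9), seconds_per_million(545.99, 10**9))
```

The last line reproduces the arithmetic behind the first observation: about 250 seconds of user time for a billion writes is about 0.25 seconds per million writes, and about 546 seconds is about 0.55.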

Some reading

Category : 0-cosmos | TechTalk

17 Aug, 2009

A Compendium of solutions for scaling a Data Store

Posted by Bhavin Turakhia | (5) Comments

Achieving infinite scalability on your data store is the holy grail of scaling an application. App servers are typically stateless and therefore a cinch to scale. This document serves as a comprehensive compendium on my thoughts and research on scaling a data store.

Requirements

  • Infinite Scalability
  • High Availability (0% downtime)
  • Data Redundancy
  • High Performance
  • Storage Flexibility – ability to store any type of data
  • Query Flexibility – ability to perform simple gets, range based gets, range based updates, and possibly complex joins

Features that a solution must have to deliver the above Requirements

  • Replication – Each unit of data should be copied to multiple nodes so that if an underlying node crashes there is no data loss
  • Partitioning – Data should be divided across multiple nodes based on specific keys so that the data layer is infinitely scalable
  • Online node addition – Solution should support adding new nodes online, with automated data distribution upon new node addition
  • Load Balancing – Queries should be load balanced between nodes
  • Persistence – Should have a data persistence layer so that data is not volatile
  • Caching – Should support flexible in-memory caching for increased retrieval speeds
  • Tree based Indexing – To support range based queries, ideally one may need tree type indexes for keys on which range queries may be made
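
Partitioning, replication and online node addition are often combined through consistent hashing. The Python sketch below is a minimal illustration of that technique and is not taken from any of the products discussed in this post; the class HashRing, the node names and the parameter values are assumptions.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring mapping each key to `replicas` distinct nodes."""

    def __init__(self, nodes=(), vnodes=64, replicas=2):
        self.vnodes = vnodes        # virtual nodes per physical node, for smoother balance
        self.replicas = replicas    # number of copies kept of each key
        self._ring = []             # sorted list of (hash, node)
        self._nodes = set()
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(value):
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def add_node(self, node):
        """Add a node online; only keys that now fall on its ranges change owner."""
        self._nodes.add(node)
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def nodes_for(self, key):
        """Return the nodes holding this key, primary first."""
        if not self._ring:
            return []
        want = min(self.replicas, len(self._nodes))
        idx = bisect.bisect(self._ring, (self._hash(key), ""))
        owners = []
        while len(owners) < want:
            node = self._ring[idx % len(self._ring)][1]
            if node not in owners:   # skip further virtual nodes of a chosen node
                owners.append(node)
            idx += 1
        return owners


if __name__ == "__main__":
    ring = HashRing(["node-a", "node-b", "node-c"])
    print(ring.nodes_for("user:42"))
    ring.add_node("node-d")
    print(ring.nodes_for("user:42"))
```

When a node is added, only the keys whose ring positions now fall on the new node change their primary owner; everything else stays where it was, which is what makes online node addition practical.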
Proposed Solutions
The solutions below are the result of researching a ton of options (refer to the Research section below). They also represent solutions that are practical to deploy and are being used in production. There are exotic variants that I came up with during my research, but I have left them out.

Solution 1: Google App Engine
  • Provides BigTable – Google’s implementation of a scalable database
  • BigTable is not an RDBMS, but has a fairly flexible API that supports creative data fetching methods
  • BigTable is distributed and self-balancing – scaling is no longer the application developer’s problem
  • You will need to host your application on Google’s App Engine
  • Your application needs to use the BigTable API for data storage
Solution 2: MySQL NDB Cluster
  • MySQL NDB Cluster is a master-master, self-partitioned, replicated storage engine that technically seems to provide all the features listed above
  • It also offers access via SQL or a high-performance native NDB API
  • It seems like the holy grail of database scaling
  • It however stores indexes entirely in memory – and has less flexibility w.r.t persistence
  • I could not find much material on the performance of an NDB Cluster
Solution 3: HyperTable
  • HyperTable is an opensource BigTable clone and provides essentially the same features as described in the BigTable whitepaper
  • It is supported by Baidu, Zvents and Rediff
Solution 4: Project Voldemort
  • Voldemort is a big, distributed, persistent, fault-tolerant hash
  • Developed and used by Linkedin
  • Java based API
  • Reasonably performant (10-20k ops on commodity hardware)
  • Maintains replicated copies of data over multiple nodes and automatically handles server failures
  • Does not support range based querying
Solution 5: Tokyo Tyrant
  • Tokyo Tyrant is a layer on top of Tokyo Cabinet – a highly performant, persistent data store (site claims over 2 million qps)
  • Tokyo Tyrant itself claims to deliver upwards of 58,000 qps
  • It supports multiple language bindings (Java, Perl, PHP etc)
  • Supports various data structures – hash, tree, B+tree, array, table etc
  • Supports caching
  • Tokyo Tyrant does not support active-active master-master replication, thus falling short on redundancy. It also does not support data partitioning out of the box
Solution 6: Postgres + PL/Proxy + Replication (Slony / Continuent) + PGBouncer (connection pooler)
  • Postgres is an extremely mature RDBMS
  • Using PL/Proxy one can abstract horizontal partitioning concerns out of the database layer and into an underlying layer
  • Using Slony or Continuent one can ensure that multiple copies of any set of rows exist at any given point in time (Synchronous or Async replication)
  • PGBouncer provides a light-weight connection pooler for PL/Proxy
  • Together this will satisfy all our requirements above
  • You may club Voldemort or memcached or redis with this to provide a caching layer
  • Postgres also has hooks for connecting to memcached that make cache population and invalidation easier
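
The caching layer mentioned above is typically used in a cache-aside fashion: read from the cache first, fall back to the database on a miss, and invalidate on write. The Python sketch below illustrates that pattern with plain dictionaries standing in for memcached/redis and for the partitioned Postgres store; the class CachedStore is hypothetical and no real client library is used.

```python
class CachedStore:
    """Cache-aside wrapper: reads go through the cache, writes invalidate it."""

    def __init__(self, store, cache=None):
        self.store = store                          # stand-in for the database
        self.cache = {} if cache is None else cache # stand-in for memcached/redis
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        value = self.store.get(key)
        if value is not None:
            self.cache[key] = value                 # populate the cache on a miss
        return value

    def put(self, key, value):
        self.store[key] = value                     # the database stays the source of truth
        self.cache.pop(key, None)                   # invalidate so the next read refetches


if __name__ == "__main__":
    db = {"user:1": "Bhavin"}
    cached = CachedStore(db)
    cached.get("user:1")
    cached.get("user:1")
    print(cached.hits, cached.misses)
```

Invalidating on write rather than updating the cache keeps the sketch simple, at the cost of one extra miss after every write.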
Research
A lot of reading went into discovering the above solutions. The links below are a good start -
BigTable
Google App Engine
MySQL Cluster
Amazon SimpleDB
Cassandra
Hypertable
MongoDB
neo4j
Project Voldemort
Redis
Tokyo Cabinet
Lightcloud
Gigaspaces
Coherence
Postgres
Category : 0-cosmos | TechTalk

16 Aug, 2009

Infinitely Scalable Infrastructure and RDBMSes

Posted by Bhavin Turakhia | (1) Comments

For the last several months I have been spending part of my time conceptualizing an abstracted infrastructure layer that is highly scalable and can be leveraged by any application without having to worry too much about it. I have researched and continue to research conventional and unconventional techniques – partitioning, clustering, replication, shared-nothing architectures, grid computing and so on. This article represents a sliver of my thoughts concerning scalability and RDBMSes -

The holy grail of scalability is to be able to scale your data store. And as data stores go, RDBMSes seem to be the predominant choice (though that is changing – refer http://bit.ly/2lnRet). RDBMSes by their very nature, due to the features they provide (ACID compliance, Transaction safety etc) tend to be difficult to easily scale. This has resulted in the recent mushrooming of data storage options that are feature-poor but scalable out of the box (eg Voldemort, HyperTable etc)

I wanted to chronicle the list of features that a standard RDBMS provides, that we take for granted, so that I have a reference of the features that one may have to compromise on w.r.t application development in favor of easier scalability -

  • Range based selects and updates – Being able to fire queries on a table specifying a range of values (eg where age >35). Typically RDBMSes use B+ Tree based indexes which support range based row selection. This in turn allows one to fire range based queries.
  • Transactions – In an RDBMS one can perform multiple operations within a transaction and ensure that all of them or none of them go through. This ensures data integrity
  • B+ tree indexes
  • Foreign key relationships and referential integrity
  • Joins and nested selects
  • Aggregations (sum, avg etc)
  • Advanced scripting using non-native languages (java etc)
  • Stored procedures (allow encapsulation of business logic in the database layer)
  • Triggers
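
To see why an ordered index matters for range based selects, compare it with a hash: a hash can only answer exact-match lookups, while a sorted structure can answer "age > 35" by locating one boundary and reading onward. The Python sketch below uses a sorted list with binary search as a simplified stand-in for a B+ tree; the class SortedIndex and the sample rows are assumptions for illustration.

```python
import bisect


class SortedIndex:
    """Keeps (key, row_id) pairs ordered by key so that range scans are cheap."""

    def __init__(self):
        self._keys = []   # sorted keys
        self._rows = []   # row ids, aligned with _keys

    def insert(self, key, row_id):
        i = bisect.bisect_right(self._keys, key)
        self._keys.insert(i, key)
        self._rows.insert(i, row_id)

    def greater_than(self, key):
        """Row ids with index key > key, in key order."""
        return self._rows[bisect.bisect_right(self._keys, key):]

    def between(self, lo, hi):
        """Row ids with lo <= index key <= hi, in key order."""
        start = bisect.bisect_left(self._keys, lo)
        end = bisect.bisect_right(self._keys, hi)
        return self._rows[start:end]


if __name__ == "__main__":
    ages = {1: 29, 2: 30, 3: 41, 4: 36, 5: 35}   # row id -> age (sample data)
    index = SortedIndex()
    for row_id, age in ages.items():
        index.insert(age, row_id)
    print(index.greater_than(35))   # rows matching "where age > 35"
```

A key-value store that only offers hash lookups would have to scan every key to answer the same query, which is the compromise referred to in the list above.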

Category : 0-cosmos | TechTalk

23 Jul, 2009

Cloud Platform Providers that I am investigating

Posted by Bhavin Turakhia | (6) Comments

With some free time on my hands, this week I am investigating various Cloud platform providers. The vendors I am reviewing are -

Category : 0-cosmos | TechTalk

22 Jul, 2009

Column Oriented DBMS

Posted by Bhavin Turakhia | (1) Comments

Conventionally we take a DBMS for granted as a structured data store that stores data in the form of rows. In fact, most application developers can begin visualizing their data as rows in an RDBMS quite naturally.

While RDBMSes serve the purposes of OLTP applications well, OLAP / data analytics type applications have conventionally not been able to obtain the type of performance needed from RDBMSes. This is where column oriented DBMSes can help.

In the simplest form the difference between a conventional RDBMS and a column oriented database is that the latter stores data in a column form rather than a row form when persisted to disk. Another way to look at this is that the storage in a column oriented DBMS transposes the rows and columns of the storage in a conventional RDBMS.

For eg

ID Name Age
1 Bhavin 29
2 Roger 30

This would be persisted in a conventional RDBMS as follows -

1,Bhavin,29|2,Roger,30

In a column oriented DBMS this would be persisted as -

1,2|Bhavin,Roger|29,30

It is common knowledge that the slowest piece of a DB query is its disk seek time. While the row-oriented RDBMS favors queries which require fetching all data of a given row, the column-oriented model favors queries which require aggregates. For instance – count of all users with age >20, or sum of ages of all users, and so on. These types of queries will run much faster on a column oriented DBMS due to the smaller amount of seeking and reading required to obtain the data.
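
The two layouts from the example above can be reproduced with a few lines of Python. This is only a toy serialization to illustrate the transposition; the function names are assumptions, and real column stores add compression, typed encodings and much more.

```python
ROWS = [(1, "Bhavin", 29), (2, "Roger", 30)]  # (ID, Name, Age) from the example table


def row_layout(rows):
    """Serialise row by row: all fields of a record sit next to each other."""
    return "|".join(",".join(str(v) for v in row) for row in rows)


def column_layout(rows):
    """Serialise column by column: all values of a field sit next to each other."""
    columns = zip(*rows)  # transpose rows into columns
    return "|".join(",".join(str(v) for v in col) for col in columns)


def sum_ages_from_columns(serialized, age_column=2):
    """An aggregate only needs to read the one column segment it cares about."""
    ages = serialized.split("|")[age_column]
    return sum(int(a) for a in ages.split(","))


if __name__ == "__main__":
    print(row_layout(ROWS))
    print(column_layout(ROWS))
    print(sum_ages_from_columns(column_layout(ROWS)))
```

In the column layout, the sum of ages touches a single contiguous segment, whereas in the row layout the same query has to step over every ID and name on the way.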

OLAP and BI applications mostly consist of data aggregation and would therefore run faster on column oriented databases.

For a list of column-oriented DBMSes refer to http://en.wikipedia.org/wiki/Column-oriented_DBMS

Category : 0-cosmos | TechTalk

20 Jul, 2009

GlusterFS – distributed redundant scalable filesystem

Posted by Bhavin Turakhia | (1) Comments

I have drafted a detailed article comparing various approaches to setting up a highly available, redundant file store – NFS, GFS and GlusterFS.

You can review the article on our Directi Wiki – Gluster – a distributed, scalable, redundant FS – An interesting read for anyone who is involved with scalable and highly available applications.

Category : TechTalk