17 May, 2010

To Trie or not to Trie – a comparison of efficient data structures

Posted by

Since my discussion thread on the efficiency of the in-memory data structure of ZeroMQ with Martin Sustrik, I have been reading up a bit by bit on efficient data structures, primarily from the perspective of memory utilization. Data structures that provide constant lookup time with minimal memory utilization can give a significant performance boost since access to CPU cache is considerably faster than access to RAM. This post is a compendium of a few data structures I came across and salient aspects about them

Judy arrays http://judy.sourceforge.net/doc/10minutes.htm
Excerpt: A Judy tree is generally faster than and uses less memory than contemporary forms of trees such as binary (AVL) trees, b-trees, and skip-lists. When used in the “Judy Scalable Hashing” configuration, Judy is generally faster then a hashing method at all populations. A (CPU) cache-line fill is additional time required to do a read reference from RAM when a word is not found in cache. In today’s computers the time for a cache-line fill is in the range of 50..2000 machine instructions. Therefore a cache-line fill should be avoided when fewer than 50 instructions can do the same job. Judy rarely compromises speed/space performance for simplicity (Judy will never be called simple except at the API). Judy is designed to avoid cache-line fills wherever possible. The Achilles heel of a simple digital tree is very poor memory utilization, especially when the N in N-ary (the degree or fanout of each branch) increases. The Judy tree design was able to solve this problem. In fact a Judy tree is more memory-efficient than almost any other competitive structure (including a simple linked list).

HAT-trie – a cache concious trie http://portal.acm.org/citation.cfm?id=1273761
Excerpt: Tries are the fastest tree-based data structures for managing strings in-memory, but are space-intensive. The burst-trie is almost as fast but reduces space by collapsing trie-chains into buckets. This is not however, a cache-conscious approach and can lead to poor performance on current processors. In this paper, we introduce the HAT-trie, a cache-conscious trie-based data structure that is formed by carefully combining existing components. We evaluate performance using several real-world datasets and against other high-performance data structures. We show strong improvements in both time and space; in most cases approaching that of the cache-conscious hash table. Our HAT-trie is shown to be the most efficient trie-based data structure for managing variable-length strings in-memory while maintaining sort order.

Burst Trie http://goanna.cs.rmit.edu.au/~jz/fulltext/acmtois02.pdf
Excerpt: Many applications depend on efficient management of large sets of distinct strings in memory. We propose a new data structure, the burst trie, that has significant advantages over existing options for such applications: it requires no more memory than a binary tree; it is as fast as a trie; and, while not as fast as a hash table, a burst trie maintains the strings in sorted or near-sorted order. These experiments show that the burst trie is particularly effective for the skewed frequency distributions common in text collections, and dramatically outperforms all other data structures for the task of managing strings while maintaining sort order.

Radix trie (aka Patricia trie) http://en.wikipedia.org/wiki/Radix_tree
Excerpt: The radix tree is easiest to understand as a space-optimized trie where each node with only one child is merged with its child. Unlike balanced trees, radix trees permit lookup, insertion, and deletion in O(k) time rather than O(log n)

Ternary Search Trees http://en.wikipedia.org/wiki/Ternary_search_tree
Excerpt: A trie is optimized for speed at the expense of size. The ternary search tree replaces each node of the trie with a modified binary search tree. For sparse tries, this binary tree will be smaller than a trie node. Each binary tree implements a single-character lookup. It has the typical left and right children which are checked if the lookup character is greater or less than the node’s character, respectively. A third child is used if the lookup character is found on that particular node. Unlike the other children, it links to the root of the binary search tree for the next character in the string

Next steps: to trie ;) and setup benchmarks for some of these on a practical application

Tags: , , , ,
comment_type == "trackback" || $comment->comment_type == "pingback" || ereg("", $comment->comment_content) || ereg("", $comment->comment_content)) { ?>

Trackbacks & Pingbacks
comment_type == "trackback" || $comment->comment_type == "pingback" || ereg("", $comment->comment_content) || ereg("", $comment->comment_content)) { ?>

Trackbacks & Pingbacks
  • Comment by Ajitesh on May 22, 2010 @ 12:03 pm
comment_type == "trackback" || $comment->comment_type == "pingback" || ereg("", $comment->comment_content) || ereg("", $comment->comment_content)) { ?>

Trackbacks & Pingbacks
comment_type == "trackback" || $comment->comment_type == "pingback" || ereg("", $comment->comment_content) || ereg("", $comment->comment_content)) { ?>

Trackbacks & Pingbacks
comment_type == "trackback" || $comment->comment_type == "pingback" || ereg("", $comment->comment_content) || ereg("", $comment->comment_content)) { ?>

Trackbacks & Pingbacks

Comments
comment_type != "trackback" && $comment->comment_type != "pingback" && !ereg("", $comment->comment_content) && !ereg("", $comment->comment_content)) { ?>
Shashikant
May 18, 2010

I had built a trie long time back (5 yrs to be exact.) The key was a string and value an integer. We also made it persistent so that building trie from scratch isn’t required. You could some optimizations if you are going to support only add/update and not delete.

What’s your key and value?

PS: How I can I follow discussion in comment, apart from visiting this page regularly?

comment_type != "trackback" && $comment->comment_type != "pingback" && !ereg("", $comment->comment_content) && !ereg("", $comment->comment_content)) { ?>
Ajitesh
May 22, 2010

Bhavin, I came across your blog post only a week back and honestly I can say that I have become your great fan, simply because your blogs are just do delicious food for thought! I am glad to find your blog!

comment_type != "trackback" && $comment->comment_type != "pingback" && !ereg("", $comment->comment_content) && !ereg("", $comment->comment_content)) { ?>
Reebok Easytone Shoes
July 13, 2010

The Reebok Easytone Shoes is probably the most athletic and traditional of the Calorie Burning or Reebok EasyTone category, emulating their traditional Reebok EasyTone Trainers. The Reebok ZigTech shoes,

comment_type != "trackback" && $comment->comment_type != "pingback" && !ereg("", $comment->comment_content) && !ereg("", $comment->comment_content)) { ?>
Shape Ups Shoes
July 13, 2010

Shape Ups Shoes have won quite a few awards and have really cleaned up financially in a tough economic market. But I’ve never been sold on what they are. I think that a Skechers Shape Ups should be either decorative or extremely functional in getting the Skechers Shape up trainer from Point A to Point B, like Skechers Shape up Shoes (not to say it can’t be both pretty and functional).

comment_type != "trackback" && $comment->comment_type != "pingback" && !ereg("", $comment->comment_content) && !ereg("", $comment->comment_content)) { ?>
Sambasiva
September 4, 2010

Hi,
Here is one more datastructure BLOOM FILTER
http://en.wikipedia.org/wiki/Bloom_filter

Leave a comment

(required)

(required)

Spam protection by WP Captcha-Free