31 Dec, 2009

Guide on GUIDs

Posted by Bhavin Turakhia

One of the seemingly trivial yet daunting challenges to scaling a datastore is the prevalence of auto-increment IDs to represent unique records in a database. Since any scaling involves horizontal partitioning of data, thus distributing inserts, how does one ensure uniqueness with respect to generation of IDs on these independent machines. One replacement method is using GUIDs (or UUIDs) which is nothing more than a randomly generated 128 bit number (there are various methods of generating one).

The uniqueness guarantee comes from the extremely low probability of two randomly generated 128 bit numbers ever colliding. Just to give you a sense of the size of the space - If a computer was to generate a new GUID every milli second, it would take 10790283070806014188970529154.99 years to generate all GUIDs. That is roughly 83 million billion times the estimated age of the universe.

Is a GUID truly unique

Mathematically speaking no. Since the GUID space consists of 2^128 possibilities, one cannot generate 2^128+1 unique GUIDs. Since the time taken to generate all combinations is so high the probability of a potential collision however within an application space is quite low. However this probability increases as the space of generated GUIDs increase, due to the birthday paradox. As per the birthday paradox in a sample set of n values from a total space of s, the probability that there is atleast one collision is given by the formula – P =  (s! / s^n * (s-n)!).

Applying this formula to a space of 2^128 values, the probability of atleast one collision becomes non-trivial when the number of values reach about 10^17 to 10^18 (about 0.001% to 0.14%).

Advantages of using GUIDs

* no centralization
* obfuscation of id
  • Globally unique without central generation. Allows easier partitioning of data without having to rely on a central auto-incrementer
  • GUIDs are obfuscated and cannot be guessed. Auto-increment IDs have a disadvantage in that one can guess subsequent IDs given a starting point. This allows attacks such as data-scraping and potentially even DOS attacks by simply querying a service for incrementing IDs from a starting point
  • Can be generated by the middle tier as opposed to the data layer

Disadvantages of using GUIDs

  • Take additional processing power to generate
  • Do not index as easily as smaller int values, thus increasing time taken for standard CRUD operations
  • Take up additional space (4 times as much – 16 bytes versus 4)
  • Can result in data and index fragmentation if a proper indexing mechanism is not chosen
  • Can be un-intutive

References


Comments
Ramki Gaddipati
December 31, 2009

The generation of UUIDs is not truly random. Most implementations are based on certain values in the problem/solution domain which are assumed/expected to be unique, to avoid even those small chances of collisions, like MAC, URIs, UserIds so on.

http://en.wikipedia.org/wiki/UUID

Ramki Gaddipati
December 31, 2009

An extract from MySQL reference manual, as an example

•UUID()

Returns a Universal Unique Identifier (UUID) generated according to “DCE 1.1: Remote Procedure Call” (Appendix A) CAE (Common Applications Environment) Specifications published by The Open Group in October 1997 (Document Number C706, http://www.opengroup.org/public/pubs/catalog/c706.htm).
A UUID is designed as a number that is globally unique in space and time. Two calls to UUID() are expected to generate two different values, even if these calls are performed on two separate computers that are not connected to each other.
A UUID is a 128-bit number represented by a utf8 string of five hexadecimal numbers in aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee format:
■The first three numbers are generated from a timestamp.
■The fourth number preserves temporal uniqueness in case the timestamp value loses monotonicity (for example, due to daylight saving time).
■The fifth number is an IEEE 802 node number that provides spatial uniqueness. A random number is substituted if the latter is not available (for example, because the host computer has no Ethernet card, or we do not know how to find the hardware address of an interface on your operating system). In this case, spatial uniqueness cannot be guaranteed. Nevertheless, a collision should have very low probability.
Currently, the MAC address of an interface is taken into account only on FreeBSD and Linux. On other operating systems, MySQL uses a randomly generated 48-bit number.

http://dev.mysql.com/doc/refman/5.0/en/miscellaneous-functions.html#function_uuid

Bhavin Turakhia
December 31, 2009

*ramki: except that if you are using them to substitute your regular rdbms auto increment on the server then using MAC / URIs etc will infact increase the collision chance as opposed to reducing it

fivefigers shoe
July 15, 2010

Although vibram flow walking over gravel in my Vibrams was a little vibram sprint uncomfortable, the rest of the hike was extremely fun and vibram fivefingers kso enjoyable. I am wanting to work out in my vibram 5fingers every day!Most of the vibram shoes sale, the fact that it is indeed more harm than good. The problem is that these shoes in order to “protect” in such a way that they should not be his leg. After the over-built five finger shoes muscles, tendons and ligaments of the lower limb atrophy conclusions. This is because your five fingers shoe do the work, your legs and feet should do.http://www.fivefingeronline.com/

Leave a comment

(required)

(required)

Spam protection by WP Captcha-Free