Something we’ve been ruminating on at work recently is uniqueness. We’d quite like everything in our system to have a unique identifier, usually referred to as a GUID or Globally Unique IDentifier. This way you can access any entity from any place, and the display method of said entity is determined by the place of access. By way of example, if you were to access an email, you’d see the email, but if you were looking at a folder full of emails, you’d see the email in the context of the folder that contains it.
Back to uniqueness though. We don’t really need our GUIDs to be truly globally unique, only unique within our system. Ideally we’d also like them to be short so if you email someone a url their client won’t wrap the line in the middle of our GUID.
The simplest way to ensure this would be with a number. This is easy to maintain as they can be sequential, generated by a database and don’t need any special rules. The only problem is the namespace is not very big. If you allow five characters for your GUID in the familiar base 10 number system, you have 10,000 potential values, or 105.
10,000 is a good start but what if you’ve got entities in different tables? Entity#1 in table A is not the same as entity#1 in table B. Perhaps we can do better.
The canonical GUID implementation is the Universally Unique Identifier or UUID. UUID has been standardised as RFC4122 by people a lot cleverer than me so let’s assume it’s pretty good. A UUID takes the following form:
550e8400-e29b-41d4-a716-446655440000
Right away we can see that it’s going to result in a much larger namespace than using numbers as it incorporates letters as well. This expands the value each character can be from 10 to (26 + 10), giving us a base 36 namespace. A five character GUID in base 36 has 60,466,176 possible values, or 365.
Version 5 UUIDs are calculated by taking an existing source of uniqueness (a URL is recommended) and hashing it using the SHA-1 algorithm. This is not fantastic for us as we’re trying to create a GUID which will be used in the creation of a URL, so we’re going to have to look elsewhere.
The simple thing to do would be to generate them randomly. This is not as bad as it sounds and appears to be a standard way of doing it. We’ve just introduced a subtle problem though. As we are now randomly generating the GUID we become vulnerable to the birthday paradox. From this rather nifty calculator, we can deduce that we’ll reach a 99.9% probability of a collision (eg. two items with the same GUID) with a five character GUID in base 36 after 29,903 items. Although this means we are fairly likely to have collisions with randomly generated GUIDs, it’s still much better than using integers which reach 99.9% after just 372 items.
Clearly the way to increase the namespace is to increase the base of the number system we are using. There’s easy thing we can do here – include capital and lower case letters. YouTube do this to great effect, even including punctuation:
http://www.youtube.com/watch?v=ZOU8GIRUd_g
So lower case (26), upper case (26), integers (10) and underscores (1) gives us base 56 – 550,731,776 potential values at five characters, or 565. No need to stop at underscores though – RFC2396 tells us that the only restricted characters for URLs are ; / ? : @ & = + $ ,. That leaves !@£%^*()[]{}”‘\|><. etc etc.
At 11 base 56 characters, YouTube can accommodate nearly 17,000,000,000,000,000,000 videos (or 1.7 x 10-8 hella videos) before running out of GUIDs. When they do, they can just add another character on to literally increase the namespace exponentially or simply allow additional punctuation marks. Taking the birthday paradox into account, they can have 15,325,262,787 videos before a 99.9% probability of collision occurs. If they currently have 120,000,000 they are still a long way off needing to extend their namespace.
So there you have it. Increase your number base to reduce the chance of GUID collision and make your life easy when trying to make everything in your database have a unique, email friendly identifier.
Popularity: 1% [?]