The curious world of UUIDs
I often see UUIDs (Universally Unique IDentifiers) in my data work, like this one:
b2ca08fb-5537-48e3-a5bc-ee8de24baa2a
but until recently I had no idea what was behind their design. All I knew was that they contained randomly generated information, and so much of it that it was extremely unlikely that two separately generated UUIDs would be the same. This makes UUIDs ideal for database table records, product IDs and other uses where identifiers need to be genuinely and permanently unique.
I now know that there are currently eight different versions in the UUID standard, and more may be coming.
The UUID format is a string of 32 ASCII characters (0-9 and a-f or A-F) in 5 hyphen-separated blocks. I suspect the block/hyphen structure is there so that a program given a UUID as a string will recognise the string as a UUID. Ignoring the hyphens, a UUID could be seen as a gigantic number expressed in hexadecimal format. In binary (base 2), the number is a string of 128 bits:
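As a rough sketch of that conversion in Python (not the method used in this post), the hyphens can be dropped and the remaining 32 characters read as one base-16 number. The variable names are only illustrative.

```python
# Treat the example UUID as one big hexadecimal number.
example = "b2ca08fb-5537-48e3-a5bc-ee8de24baa2a"

hex_digits = example.replace("-", "")   # the 32 hexadecimal characters
as_integer = int(hex_digits, 16)        # the "gigantic number"
as_bits = format(as_integer, "0128b")   # the same number, zero-padded to 128 bits

print(len(hex_digits))
print(as_integer)
print(as_bits)
```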
Ignore the number. It's that binary string that's important in UUIDs, but it isn't read as a number. Instead it's processed bit by bit. Below is the example ID and its 128 bits, numbered from 0 to 127. I've grouped the bits in 32 quartets so you can see how each of the 32 hexadecimal "digits" in the UUID is formed. I also show octets of bits, numbered from 0 to 15:
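The grouping into quartets and octets can be reproduced with a short Python sketch. This is my own illustration of the layout, with bit 0 taken as the leftmost bit, and the output format is arbitrary.

```python
example = "b2ca08fb-5537-48e3-a5bc-ee8de24baa2a"
hex_digits = example.replace("-", "")
bits = format(int(hex_digits, 16), "0128b")

# One quartet (4 bits) per hexadecimal character.
quartets = [bits[i:i + 4] for i in range(0, 128, 4)]
for position, (char, quartet) in enumerate(zip(hex_digits, quartets)):
    first_bit = position * 4
    print(f"hex char {position + 1:2d} '{char}' = {quartet} (bits {first_bit}-{first_bit + 3})")

# One octet (8 bits) per pair of hexadecimal characters.
octets = [bits[i:i + 8] for i in range(0, 128, 8)]
for number, octet in enumerate(octets):
    print(f"octet {number:2d}: {octet}")
```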
Version 4. How the bits are generated depends on the UUID version. A great tool for working with UUIDs is the uuid command, because in addition to generating any number of UUIDs in versions 1, 3, 4 and 5, its -d option decodes a UUID and reports on it:
OK, my example UUID is version 4. The "SIV" result is just the gigantic hexadecimal number in decimal format:
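If the uuid tool isn't available, Python's standard uuid module gives a similar (though less detailed) report. This is a stand-in for the command-line decode, not the same program.

```python
import uuid

u = uuid.UUID("b2ca08fb-5537-48e3-a5bc-ee8de24baa2a")

print(u.version)   # the value of the version field
print(u.variant)   # a text description of the variant field
print(u.int)       # the whole UUID as a single decimal integer
```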
The version 4 specification says that the first 48 bits (0-47 in the diagram above) can be filled with randomly or pseudo-randomly generated 1's and 0's. Bits 48 through 51 are the "version field", and in version 4 they always hold the bits 0100, which is "4" in hexadecimal. That means you can spot a version 4 UUID by looking for a "4" as hexadecimal character 13 out of 32, at the beginning of the third hyphenated block:
b2ca08fb-5537-48e3-a5bc-ee8de24baa2a
The next 12 bits (52-63) can again be filled with random 1's and 0's.
Next there's a "variant field" occupying bits 64 and 65. For an explanation, see section 4.1 in the UUID specification, but for most purposes the allowed bits are 1 followed by 0.
All of the remaining 62 bits can be filled with random 1's and 0's.
Notice that the information in a version 4 UUID says nothing about when and where the UUID was generated. It's as random as the specification can make it, and ignoring the "version" and "variant" bits, there are 2^122 possible version 4 UUIDs — an unimaginably huge number!
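To make the layout concrete, here is a minimal sketch that assembles a version 4 UUID by hand: 128 random bits, with the version and variant fields then overwritten. The helper name make_uuid4 is hypothetical; in practice Python's own uuid.uuid4() does this job.

```python
import secrets
import uuid

def make_uuid4() -> uuid.UUID:
    """Hypothetical helper: assemble a version 4 UUID field by field."""
    value = secrets.randbits(128)   # start with 128 random bits
    # Bits 48-51 (counting from the left) are the version field: force 0100.
    value &= ~(0xF << 76)
    value |= 0x4 << 76
    # Bits 64-65 are the variant field: force 1 followed by 0.
    value &= ~(0x3 << 62)
    value |= 0x2 << 62
    return uuid.UUID(int=value)

new_id = make_uuid4()
print(new_id, new_id.version)
```

The 48 + 12 + 62 = 122 bits left untouched are the random part, which is where the 2^122 figure comes from.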
Version 1. Unlike version 4, version 1 contains clues to the UUID's origin. Here's an example:
ab112476-587e-11ee-8c99-0242ac120002
To quote from the specification:
UUID version 1 is a time-based UUID featuring a 60 bit timestamp represented by Coordinated Universal Time (UTC) as a count of 100-nanosecond intervals since 00:00:00.00, 15 October 1582 (the date of Gregorian reform to the Christian calendar).
The layout of the "time" bits in the UUID is complicated, but they occupy bits 0-47 and 52-63. The time is meant to be the time at which the UUID was generated, and as shown above, my example was built on the 21st of September 2023, at 12:59:15.421503.0 UTC.
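The "complicated" layout can be untangled in a few lines. In this sketch (my own illustration, with made-up variable names) the third block, minus its leading version digit, supplies the highest 12 bits of the timestamp, the second block the middle 16 bits and the first block the lowest 32 bits.

```python
from datetime import datetime, timedelta, timezone

example = "ab112476-587e-11ee-8c99-0242ac120002"
time_low, time_mid, time_hi_and_version = example.split("-")[:3]

# Reassemble the 60-bit timestamp: high bits first, version digit dropped.
timestamp = int(time_hi_and_version[1:] + time_mid + time_low, 16)

# The count is in 100-nanosecond intervals since 15 October 1582 (UTC).
gregorian_start = datetime(1582, 10, 15, tzinfo=timezone.utc)
generated = gregorian_start + timedelta(microseconds=timestamp // 10)

print(generated.isoformat())
```

For the example UUID this prints 2023-09-21T12:59:15.421503+00:00, which agrees with the date and time quoted above.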
Bits 48-51 are again reserved as a "version field":
ab112476-587e-11ee-8c99-0242ac120002
and bits 64 and 65 are a "variant field". Now for the interesting parts. Bits 80-127 in a version 1 UUID make up a "node" section, which may contain the MAC address of the UUID-generating system. Bits 66-79 make up a "clock sequence" (which can be random). Its job is to help avoid duplicate UUIDs if the system clock that sets the "time" bits is changed, for example set backwards.
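Python's standard uuid module exposes these fields directly, so the node and clock sequence of the example can be pulled out as below. Formatting the node as a colon-separated MAC address is my own choice for readability.

```python
import uuid

u = uuid.UUID("ab112476-587e-11ee-8c99-0242ac120002")

# Node: the last 48 bits, written here as a colon-separated MAC address.
node_hex = format(u.node, "012x")
mac = ":".join(node_hex[i:i + 2] for i in range(0, 12, 2))

print(u.version)     # version field (bits 48-51)
print(u.clock_seq)   # the 14-bit clock sequence (bits 66-79)
print(mac)           # node section (bits 80-127)
```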
Let's see what happens when I build 3 version 1 UUIDs:
They have a "serial" look because all that's changed in their building is the time at which they were generated:
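The same "serial" effect can be sketched with Python's uuid.uuid1(). Here the node and clock sequence are fixed to hypothetical values (borrowed from the example UUID above, not from my computer), so the time bits are the only thing that can change between UUIDs.

```python
import uuid

# Hypothetical fixed values so that only the time bits can differ.
NODE = 0x0242AC120002
CLOCK_SEQ = 0x0C99

batch = [uuid.uuid1(node=NODE, clock_seq=CLOCK_SEQ) for _ in range(3)]
for u in batch:
    # u.time is the 60-bit count of 100-nanosecond intervals.
    print(u, u.time)
```

The last two blocks are identical in all three UUIDs, and the time values increase from one UUID to the next.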
And now you know when I built those three UUIDs, and you know from the MAC address in the node section that I built them on a Dell computer. If I use the -m option, the UUID node section instead contains a random multicast MAC address:
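A random node can be told apart from a real network card address by its multicast bit, which is the least significant bit of the first octet of the node. The UUID specification asks for that bit to be set in randomly generated node IDs. A minimal check, using a hypothetical helper name and the node from the earlier example UUID:

```python
def is_multicast(node_hex: str) -> bool:
    """Hypothetical helper: True if the multicast bit of a 48-bit node is set."""
    first_octet = int(node_hex[:2], 16)
    return bool(first_octet & 0x01)

print(is_multicast("0242ac120002"))
```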
There are alternatives to UUIDs that do the same job and aren't so large or complicated, but they don't seem to be as widely used. I'll look at one of the alternatives in a future post.
Last update: 2024-05-17
The blog posts on this website are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License