You probably don’t need a UUID

My troubles with record identifiers start with a web site I developed, Eksi Sozluk. It's been one of the most popular Turkish web sites in the world for the last quarter century. When I first wrote it in 1999, I had to run it on a remote hosting service with no way to install external tools. All I had access to was their FTP server. So, I improvised and decided to keep every record in a single text file. Yeah, bad idea, but it worked at the beginning.

There were no record identifiers, only topics and a free-form list of entries. Deleting a specific record involved me downloading a single big text file from FTP, deleting the relevant lines, and uploading the file back again.

Then, I had to develop a UI for it, and it was immediately apparent that I needed a unique ID for every record to make it less painful.

In mere months, I was convinced that I had to use an actual database instead of a plain text file as the number of users grew quickly. I decided to migrate to MS Access. Yes, I know, Access wasn't designed to be a server DB, but it sure beats the hell out of a plain text file.

When I was creating my DB schema, I was asked to create an identifier for tables as the default option. Something called an "autoincrement". What a convenient feature! I selected that, imported the user records. Then, I noticed that I botched the order of records, emptied the table, and imported the user records again. I probably did that a few more times. That's why my user ID on Eksi Sozluk is now 8097 instead of 1. I had no idea how to reset the autoincrement back then, and I didn't care anyway. It was just a number.

One of the greatest weaknesses of autoincrementing integers as record identifiers is that they might convey the number of records. You may not want people to see that. Also, when you see an autoincrementing integer, you can easily guess that 8098 and 8099 are probably valid records too. It lets people enumerate the records, and you may not want that either.

Actually, days after I started writing this article, 27 years after my adventures with autoincrement, someone discovered that another user on Eksi Sozluk had a smaller ID than mine, just by trying IDs below 8097 on a URL that resolved an ID to a nick. It turns out that I botched the record order anyway.

The final problem with autoincrementing integer IDs is that ID generation has to happen one at a time in order to avoid duplicate IDs. This means all other tasks on a system must wait for an ID generation to complete before they can get a new ID themselves. It sounds inefficient.
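To make the "one at a time" part concrete, here is a minimal sketch of a sequential ID generator guarded by a lock. The class and function names are made up for illustration; a real database does this internally with its own sequence machinery.

```python
import threading


class SequentialIdGenerator:
    """Hypothetical in-process stand-in for a DB autoincrement counter."""

    def __init__(self, start: int = 1) -> None:
        self._next = start
        self._lock = threading.Lock()

    def next_id(self) -> int:
        # Only one caller at a time may take a number; everyone else waits here.
        with self._lock:
            value = self._next
            self._next += 1
            return value


def generate_concurrently(n_threads: int, ids_per_thread: int) -> list[int]:
    """Ask several threads for IDs at once and collect everything they got."""
    generator = SequentialIdGenerator()
    collected: list[int] = []
    collected_lock = threading.Lock()

    def worker() -> None:
        local = [generator.next_id() for _ in range(ids_per_thread)]
        with collected_lock:
            collected.extend(local)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return collected


if __name__ == "__main__":
    ids = generate_concurrently(n_threads=8, ids_per_thread=1000)
    print(len(ids), "IDs,", len(set(ids)), "unique")
```

The lock is what guarantees uniqueness, and it is also the point where concurrent callers queue up.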

Here comes UUIDv4

That's when UUIDv4 was heralded as the savior from all those problems. As a 128-bit, mostly random identifier:

  • It doesn't convey count
  • It doesn't expose neighboring record IDs
  • It can be generated in parallel

Like, the best of all worlds. Here are some UUIDv4s, in upper-case, which I think is how they look best, and in Berkeley Mono:

(Image caption: Random UUIDs! Hex UUIDs! Decimal UUIDs! v4 UUIDs! v7 UUIDs! If you find a better UUID cheaper... use it!)
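For reference, a minimal sketch of producing a few upper-case UUIDv4s with Python's standard uuid module. The helper name is just for illustration, and the font is up to your terminal.

```python
import uuid


def sample_uuids(count: int = 3) -> list[str]:
    """Return `count` freshly generated UUIDv4s as upper-case strings."""
    return [str(uuid.uuid4()).upper() for _ in range(count)]


if __name__ == "__main__":
    for text in sample_uuids():
        print(text)
```

No coordination is needed between callers, which is where the "generated in parallel" benefit comes from.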

But, it wasn't long before people discovered the problems of UUIDv4:

UUIDs have worse UX

When you deal with UUIDs, you just can't memorize them, similar to IPv6 addresses. I memorized my user record identifier 8097 instantly; there's no way I could memorize my UUID.

It's harder to select UUIDs in GUIs because double-clicking cuts off the selection at the hyphen.

(Image caption: No, not my intent)

Windows supports "select all surrounding text" by clicking a third time on that selection, but it works inconsistently and is usually a source of frustration. It's definitely much harder to handle than a simple integer.

That problem can be alleviated by using a different display format for UUIDs. You can Base32 encode them and get a shorter and easier to handle identifier.

(Image caption: The same UUID in hex and Crockford Base32 encoding)
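A minimal sketch of that display format, assuming you treat the UUID as one 128-bit number and emit 26 Crockford Base32 digits (26 × 5 = 130 bits, so the top two bits are always zero). The function names are illustrative, and the decoder skips Crockford's lenient mapping of I, L and O to keep things short.

```python
import uuid

# Crockford's alphabet leaves out I, L, O and U to avoid look-alike characters.
CROCKFORD_ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def uuid_to_crockford(value: uuid.UUID) -> str:
    """Encode a UUID as a fixed-width, 26-character Crockford Base32 string."""
    number = value.int
    digits = []
    for _ in range(26):
        digits.append(CROCKFORD_ALPHABET[number & 0b11111])  # lowest 5 bits
        number >>= 5
    return "".join(reversed(digits))


def crockford_to_uuid(text: str) -> uuid.UUID:
    """Decode a 26-character string produced by uuid_to_crockford."""
    number = 0
    for ch in text.upper():
        number = (number << 5) | CROCKFORD_ALPHABET.index(ch)
    return uuid.UUID(int=number)


if __name__ == "__main__":
    original = uuid.uuid4()
    encoded = uuid_to_crockford(original)
    print(str(original).upper())
    print(encoded)
    print(crockford_to_uuid(encoded) == original)
```

The result is 26 characters instead of 36, with no hyphens to break a double-click selection.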

UUIDv4 isn't truly random

I've seen many instances where people assumed UUIDv4 is truly random and used it in security-sensitive contexts, such as using a UUID as an initialization vector for cryptography. But, the thing is, the UUIDv4 spec doesn't guarantee cryptographically secure identifiers.

That means, for a security researcher, a UUID's neighboring records might be as predictable as sequential integer identifiers.
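When you actually need unpredictable bytes, a safer route is a source that is documented as cryptographically secure. Below is a minimal sketch using Python's standard secrets module. The function names and sizes are assumptions for illustration, not a recommendation for any particular cipher.

```python
import secrets
import uuid


def new_iv(n_bytes: int = 16) -> bytes:
    """Random bytes from the OS CSPRNG, suitable where unpredictability matters."""
    return secrets.token_bytes(n_bytes)


def new_url_token(n_bytes: int = 32) -> str:
    """URL-safe random token, e.g. for a password reset link."""
    return secrets.token_urlsafe(n_bytes)


def uuid4_fixed_fields(value: uuid.UUID) -> tuple:
    """A UUIDv4 is not 128 random bits: version and variant bits are fixed."""
    return value.version, value.variant


if __name__ == "__main__":
    print(new_iv().hex())
    print(new_url_token())
    print(uuid4_fixed_fields(uuid.uuid4()))
```

Whether a given UUID library happens to use a secure generator is an implementation detail. The point is not to depend on it.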

DB indexing woes

That's likely the most discussed problem with UUIDv4. The story is simple: a DB organizes its index in a B-tree (balanced tree) structure sorted by key. Since UUIDv4 is random, every new record lands in a different node during inserts, and that pretty much forces the B-tree to split and rebalance nodes all the time, instead of once in a while. Rebalancing is a costly operation, and insert operations suffer because of that.

My guess is that this would stabilize after a certain number of rows because new IDs would increasingly fall into existing nodes. So, I don't care much about that, but people did, and they came up with a solution called UUIDv7.

UUIDv7 to the rescue?

Version 7 of the UUID spec designates a format that starts with a timestamp, followed by a sequential or random portion to avoid conflicts between IDs generated in the same millisecond or so. That way, IDs are way more B-tree index friendly. Sequentially created records fall into the same B-tree nodes, and the tree stays balanced longer.
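To make the layout concrete, here is a hand-rolled sketch of a UUIDv7-shaped value: a 48-bit Unix millisecond timestamp on top, then the version and variant bits, with random bits filling the rest. It is written out manually so it doesn't depend on any particular library version, it uses the random option rather than a counter, and the function name is an assumption.

```python
import os
import time
import uuid


def uuid7_sketch(unix_ms=None) -> uuid.UUID:
    """Build a UUIDv7-style value: timestamp first, random bits after."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    random_bits = int.from_bytes(os.urandom(10), "big")  # 80 bits, 74 used
    rand_a = (random_bits >> 62) & 0xFFF                 # 12 bits
    rand_b = random_bits & ((1 << 62) - 1)               # 62 bits

    value = (unix_ms & ((1 << 48) - 1)) << 80  # bits 127..80: timestamp
    value |= 0x7 << 76                         # bits 79..76: version 7
    value |= rand_a << 64                      # bits 75..64: random
    value |= 0b10 << 62                        # bits 63..62: variant
    value |= rand_b                            # bits 61..0: random
    return uuid.UUID(int=value)


if __name__ == "__main__":
    earlier = uuid7_sketch()
    time.sleep(0.002)
    later = uuid7_sketch()
    print(earlier)
    print(later)
    print(earlier < later)  # IDs from later milliseconds sort after earlier ones
```

Because the timestamp occupies the most significant bits, IDs created in later milliseconds compare greater than earlier ones, which is what keeps new rows clustered at the right-hand edge of the index.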

That said, UUIDv7 is still a bunch of tradeoffs, and like all tradeoffs, it comes with its own downsides.

First and foremost, UUIDv7 now exposes record creation timestamps, and that might be a greater security problem than sequential identifiers, depending on how sensitive a record's creation date is. The timestamps could be used for correlating events or determining usage patterns. It's a significant signal for intelligence. The structure of UUIDv7 also makes it much easier to enumerate than UUIDv4, although possibly less so than a sequential integer.
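Reading the creation time back out takes only a few lines, which is the whole problem. The sketch below assumes the standard UUIDv7 layout (the top 48 bits are Unix milliseconds) and uses the example value that, to my knowledge, appears as the UUIDv7 test vector in RFC 9562. The function name is illustrative.

```python
import uuid
from datetime import datetime, timezone


def uuid7_created_at(value: uuid.UUID) -> datetime:
    """Recover the embedded creation time from a UUIDv7."""
    if value.version != 7:
        raise ValueError("not a UUIDv7")
    unix_ms = value.int >> 80  # top 48 bits hold milliseconds since the epoch
    return datetime.fromtimestamp(unix_ms / 1000, tz=timezone.utc)


if __name__ == "__main__":
    leaked = uuid.UUID("017F22E2-79B0-7CC3-98C4-DC0C0C07398F")
    print(uuid7_created_at(leaked).isoformat())
```

Anyone who sees the ID in a URL or an API response can do the same.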

The other problem with UUIDv7 is that the information it contains needs to be accurate to be useful. That means all clients that generate UUIDv7 IDs must be time-synced if you want a coherent picture of all of your records. Otherwise, that timestamp becomes useless, or maybe even more harmful than using something else. But, because a UUIDv7 is represented as an obscure hex string instead of the date itself, problems with dates and times aren't apparent or easily noticeable. By the time you find out that you've been logging wrong UUIDv7s, it might be too late.

(Image caption: Debugging a problem with wrongly generated UUIDv7s for the last five years)

That's not as big of an issue with a centralized ID generation system because if there's a skew, it's universal. It's easier to spot and fix. When all your clients generate the timestamps, it can be chaos.

Does UUIDv7 make a difference?

I didn't want to just jump to a conclusion on a hunch, so I installed PostgreSQL and ran some crude benchmarks.

This is the time it takes on my desktop machine to insert 1 million rows into a table consisting of an id and a varchar column (both NOT NULL, with non-clustered indexes):

ID Type                             Time (seconds)
UUID (v4)                           4.7
UUID (v7)                           3.9
Autoincrement BIGINT (BIGSERIAL)    2.6

So, UUIDv4 seems only marginally (1.2x) slower than UUIDv7, but the situation changes when I insert 10 million rows instead of just 1 million.

ID Type                             Time (seconds)
UUID (v4)                           97
UUID (v7)                           38
Autoincrement BIGINT (BIGSERIAL)    26

Now, UUIDv4 is about 2.5 times slower than UUIDv7. My guess is still that UUIDv4 would eventually stabilize once enough B-tree nodes are allocated, but that's an exercise I'll leave to the reader due to time constraints.
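The ratios are easy to recompute from the two tables. The sketch below only does arithmetic on the measured numbers above. The dictionary and function names are made up for illustration.

```python
# Measured insert times in seconds, copied from the two tables above.
TIMINGS = {
    "UUID (v4)": {1_000_000: 4.7, 10_000_000: 97.0},
    "UUID (v7)": {1_000_000: 3.9, 10_000_000: 38.0},
    "BIGSERIAL": {1_000_000: 2.6, 10_000_000: 26.0},
}


def slowdown(slower: str, faster: str, rows: int) -> float:
    """How many times longer `slower` took than `faster` at a given row count."""
    return TIMINGS[slower][rows] / TIMINGS[faster][rows]


def growth(id_type: str) -> float:
    """How much the insert time grew when the row count grew 10x."""
    return TIMINGS[id_type][10_000_000] / TIMINGS[id_type][1_000_000]


if __name__ == "__main__":
    for rows in (1_000_000, 10_000_000):
        print(rows, "rows: v4 vs v7 =", round(slowdown("UUID (v4)", "UUID (v7)", rows), 2))
    for id_type in TIMINGS:
        print(id_type, "time growth for 10x rows =", round(growth(id_type), 1))
```

The growth figures are the interesting part: ten times the rows cost roughly ten times the time for UUIDv7 and BIGSERIAL, but about twenty times for UUIDv4.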

Obviously, a 64-bit integer beats both UUIDs easily, but these were sequential inserts, so it's hard to see the impact of locking for ID generation.
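If you want to poke at this yourself without setting up a server, here is a rough harness. To be clear about the swap: it uses Python's built-in sqlite3 with an in-memory database, not PostgreSQL, so it is not the benchmark from this post and its numbers won't be comparable. The table layout, names and row count are assumptions.

```python
import sqlite3
import time
import uuid


def time_inserts(ids, id_sql_type: str):
    """Insert one row per ID into a fresh table; return (seconds, row_count)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        f"CREATE TABLE t (id {id_sql_type} NOT NULL PRIMARY KEY, val TEXT NOT NULL)"
    )
    rows = [(key, "x") for key in ids]  # built up front so only inserts are timed
    start = time.perf_counter()
    with conn:  # one transaction for the whole batch
        conn.executemany("INSERT INTO t (id, val) VALUES (?, ?)", rows)
    elapsed = time.perf_counter() - start
    (count,) = conn.execute("SELECT COUNT(*) FROM t").fetchone()
    conn.close()
    return elapsed, count


def run(n_rows: int = 100_000):
    """Time random 16-byte keys against sequential integer keys."""
    candidates = {
        "UUID (v4)": ([uuid.uuid4().bytes for _ in range(n_rows)], "BLOB"),
        "Sequential integer": (list(range(1, n_rows + 1)), "INTEGER"),
    }
    return {
        name: time_inserts(ids, sql_type)
        for name, (ids, sql_type) in candidates.items()
    }


if __name__ == "__main__":
    for name, (seconds, count) in run().items():
        print(f"{name}: {count} rows in {seconds:.3f}s")
```

An in-memory database hides most of the I/O cost, so treat it as a starting point for your own measurements rather than as evidence either way.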

What do we do?

If you only want nicely balanced DB indexes, just go with an integer. Do you expect your table to hit two billion rows in the future? Then, maybe a 64-bit one. "64-bit ought to be enough for everybody," as has never been famously said. The tradeoffs are similar. Integers are way easier to remember, at least for the first million rows or so. They are easier to use, debug, and deal with.

But what about the parallel ID generation benefits of a UUID? That's never been an issue. We've used autoincrementing identifiers even in realtime chat features on a web site that receives tens of millions of visitors every month. Yes, with granular locks, DBs support concurrent row insertions, but ID generation is rarely the bottleneck.

A 64-bit integer occupies half the space of a UUID. A 32-bit integer, a quarter. That just means faster I/O and less storage overhead, which matters especially if you're concerned about row insertion performance bottlenecking on B-tree rebalancing.
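A quick sketch of the numbers behind those claims: the signed integer ceilings and the raw key bytes per row. It counts only the key itself, not index or row overhead, which vary by database. The names are illustrative.

```python
import struct
import uuid

INT32_MAX = 2**31 - 1  # the "two billion rows" ceiling of a signed 32-bit ID
INT64_MAX = 2**63 - 1


def key_sizes_bytes() -> dict:
    """Raw size of each identifier type in bytes."""
    return {
        "UUID": len(uuid.uuid4().bytes),
        "64-bit integer": struct.calcsize("<q"),
        "32-bit integer": struct.calcsize("<i"),
    }


def raw_key_megabytes(n_rows: int) -> dict:
    """Raw key storage for n_rows, in MB (10**6 bytes), ignoring all overhead."""
    return {name: size * n_rows / 1_000_000 for name, size in key_sizes_bytes().items()}


if __name__ == "__main__":
    print(f"{INT32_MAX:,}")
    print(f"{INT64_MAX:,}")
    print(key_sizes_bytes())
    print(raw_key_megabytes(10_000_000))
```

For 10 million rows that is 160 MB of raw key data for UUIDs versus 80 MB and 40 MB for the two integer sizes, and the key is typically repeated in every index and foreign key that references it.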

Remember, in our small experiment above, we have managed to insert 10 million records in a database in less than 30 seconds. How many years will it take your web site to reach that row count, say, for "posts", for "users", for "comments", or even "likes"? Let me tell you, it's been 27 years since I switched my web site to autoincrementing integers. Many of the DB fields are still 32-bit integers too, and no, we haven't hit any ceiling or a scalability issue directly related to them yet. Again, this is one of the most popular web sites in a country of 85 million people.

For your security-related concerns, always plan for your threat model. But, you'd already know that if you'd read my book. If you need to hide record counts, use UUIDv4 identifiers in a secondary index, and preferably show them Base32-encoded on the UI so they are nicer to handle. Or maybe you don't need a unique ID at all because the record itself already has enough uniquely discriminating fields; you can just use those then. If you need to avoid enumeration, put access controls on all of your resources. Don't just rely on "nobody would guess that". Assume that people can guess everything.
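A minimal sketch of that arrangement, using Python's built-in sqlite3 to stay self-contained: an autoincrementing integer stays the primary key, a UUIDv4 in a unique secondary index is what the outside world sees, and every read checks ownership instead of trusting that the ID is hard to guess. The schema, the function names and the owner-only rule are all assumptions for illustration.

```python
import sqlite3
import uuid


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE posts (
            id        INTEGER PRIMARY KEY AUTOINCREMENT,  -- internal, never exposed
            public_id TEXT NOT NULL UNIQUE,               -- UUIDv4, secondary index
            owner     TEXT NOT NULL,
            body      TEXT NOT NULL
        )
        """
    )
    return conn


def create_post(conn: sqlite3.Connection, owner: str, body: str) -> str:
    """Insert a post and return only its public identifier."""
    public_id = str(uuid.uuid4())
    with conn:
        conn.execute(
            "INSERT INTO posts (public_id, owner, body) VALUES (?, ?, ?)",
            (public_id, owner, body),
        )
    return public_id


def read_post(conn: sqlite3.Connection, public_id: str, requesting_user: str):
    """Return the post body only if the requester is allowed to see it."""
    row = conn.execute(
        "SELECT owner, body FROM posts WHERE public_id = ?", (public_id,)
    ).fetchone()
    if row is None or row[0] != requesting_user:
        return None  # same answer for "missing" and "forbidden"
    return row[1]


if __name__ == "__main__":
    db = open_db()
    post_id = create_post(db, "alice", "hello")
    print(read_post(db, post_id, "alice"))
    print(read_post(db, post_id, "bob"))
```

Joins and foreign keys keep using the compact integer, and the access check does the actual protecting even if someone does guess an ID.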

No, you probably don't need a UUID.