Early in my programming career most of the databases I worked on have had the luxury of having decent hardware and relatively small data (100s of thousands of rows per table). However, more and more I now find myself working with databases where 10s of millions of records per table is the norm. With smaller databases it is easy to just focus on “how can I store my data with the least amount of effort?”, but as things scale it becomes imperative that you carefully consider the performance implications of what you are asking the computer to do. One thing you can do to take a big step in the right direction is to learn how to select appropriate primary keys.
Primary key selection is important for a couple of reasons. First, almost every table will have one, and almost every mildly complex query will need to include it for joins, so knowing how to select one is a win that keeps on winning. Secondly, the primary key dictates access patterns, and how an application can go about creating new records. I think most people who have worked with databases are well aware of the first point, but I’m willing to bet even many DBAs are less familiar with the second one. Luckily, I will be discussing both.
Natural vs Surrogate Keys
The first choice you need to make when selecting a primary key is whether to use a natural or surrogate key.
A natural key is one in which one or more existing fields within a record are also used to identify the record. For instance, if I identify a person by social security number, or by first and last name, or a building by address, or an airport by it’s abbreviated name (IATA code), or an order line by order number and order line number. Natural keys were much more common in older databases because they do not require persisting additional data, but they come with a major drawback; almost everything you think is a natural key is not. What do I mean by this? Well, if you were paying attention to my examples above you might have realized that none of them is necessarily a natural key. Social security numbers and addresses can be reassigned. IATA codes can be reassigned, or in theory an airport could have the code assigned to it changed. First and last name combinations are even worse since they are often not even unique at a point in time. The only example I gave that might be a natural key is order and order line number, but it depends on your business logic. Can the business reuse an order number when it rolls over? Will it ever roll over? Is it possible to have two order lines with the same line number (an example would be soft deleting lines)? If you can’t say with certainty that two records will never exist with the same identifier, or that an identifier would never change, then that identifier can’t be used as a natural key. Things get a bit muddier if you do not keep a history of the data and thus you *technically* can reuse a primary key at two different points in time. However, except where you must legally purge the history you can generally assume this will not be the case. Plus, even when you delete the record from the database, have you deleted it anywhere else it may still exist (a cached web page for instance, or an excel export, or even in the minds of the system operators)?
OK, so if not a natural key, then what? This is where a surrogate key comes in. Instead of trying to identify a set of fields will represent a unique identifier now and until the end of time, we make one up. It is much easier to guarantee that a key will always be unique and will never change if it only exists as an identifier and nothing else. This is why you see a lot of primary keys that are integers, GUIDs, or a string value that is never displayed outside the database. For the cost of some extra storage and potentially needing to pull more fields into your queries you gain the advantage of having strong primary keys. In fact, some of the best natural keys I have seen are ones which include the surrogate key of a parent record. For instance, depending on your business rules you may be able to say with certainty that an order line number will always be unique within an order. Given that, if the order has a surrogate key, an order line can use a natural key of the order key and line number. Surrogate keys are almost a no fail way to identify records, and are quite common in business applications.
Sequential vs Non-Sequential Keys
The next important decision you will make is whether or not the keys will be sequential; does each value follow the previous one when sorted? This is an important decision because it has huge performance implications that I unfortunately won’t be able to get into the details here without turning this post into a novella. Just know that generally inserts are much more efficient when identifies are sequential.
The typical example of a sequential identifier is an integer that starts at 0 or 1 and increments by 1. This is great because not only are inserts efficient, but it is a nice small identifier which saves you space and even makes it possible to remember as you look through records. One hack you can use to get twice as many possible values is instead of starting near 0, start with the lowest or highest value possible and increment by one in the opposing direction (for example, in SQL Server INTs can go from -2^31 to 2^31-1). Also, you can save some additional space or be able to store even more records by selecting the correct integer value (in SQL Server your options are TINYINT, SMALLINT, INT, BIGINT).
Your typical non-sequential identifier is a GUID/UUID (globally unique identifier/universally unique identifier). GUIDs are (semi-) randomly generated 16 byte values (twice as large as INTs) that are assumed to be unique (I’m sure I will write more about this later). The great thing about GUIDs is that they are much more arbitrary; the value has no correlation to other fields on the record and has nothing to do with the order of inserts. This has security implications since given your entire database and all of your code a hacker still couldn’t predict what the next identifier will be. The downside to using GUIDs though is that inserts are slower and your tables will take up more space due to fragmentation (though these downsides can be mitigated with “sequential” GUIDS; so many topics for another day…). GUIDs have one other important feature that I will discuss next.
Decentralized Key Generation
An arbitrary surrogate key sounds good, it nicely separates the concerns of identifying a record from what that record actually stores and how it is actually persisted, but what does this actually buy us? I’m glad I asked 🙂
Since the primary key is arbitrary the database does not need to act as a central authority when assigning them. This means that our application(s) can take on that responsibility, which gives them the ability to do some interesting things. For instance, if you need to create and persist a complex graph of data (parent, child, and grand-child records) then you can do so without needing to insert each record of data and waiting to get an identifier back. Instead, you can create the whole graph, identifiers and all, and bulk insert all the records in dependency order. Also, if you have records that are related to each other but in separate databases (think micro-services) then your application can insert the records to both databases without needing to insert into one and wait for a response. Another interesting pattern that application generated identifiers allows is persisting data to an offline database to later be synchronized with the central authoritative database. Which technically this could be achieved with a complex process of reassigning identifiers on synchronization or by federating blocks of IDs (yet another topic), GUIDs are a much simpler and more elegant solution. A final pattern enabled by decentralized ID generation is pulling together data from multiple databases with similar schema into a single report. This is useful in environments where you have a database per tenant, but you want to be able to combine the data into reports for the business. While it is not always a requirement, decentralizing identifier assignment opens up some interesting design possibilities that are simply not possible otherwise.
So there it is, a “brief” primer on primary keys. I have attempted to keep this as short as possible, but I was not entirely successful. Nonetheless, I hope you found it interesting and useful, and I hope it will help you make more informed decisions when selecting primary keys in the future.