Ah … primary keys … such a topic! When discussing what columns to define as a primary key in your data models, two large points always tend to surface:
These can be very complicated and sometimes polarizing things to debate. As I often try to do, I will attempt to approach this topic from a slightly different perspective.
Let's start things off with what I feel is a good interview question:
I suspect that many people will answer (a), and quite a few will answer (b). If you answer (c), though, you are correct! Why? Because a primary key is not necessarily a single column; it is a set of one or more columns. Many people who have designed large, complicated systems are simply not aware of this.
I once worked with a consultant who kept claiming that importing data into his system was complicated, because his database used “primary keys”. It was very confusing (yet humorous) trying to discuss things with him because he kept confusing primary keys with identity columns. They are not the same! An identity column may be a type of primary key, but a primary key is not an identity column; it is a set of columns that you define that determine what makes the data in your table unique. It defines your data. It may be an identity column, it may be a varchar column or a datetime column or an integer column, or it may be a combination of multiple columns.
When you define more than one column as your primary key on a table, it is called a composite primary key. And many experienced and otherwise talented database programmers have never used them and may not even be aware of them. Yet, composite primary keys are very important when designing a good, solid data model with integrity.
This will be greatly oversimplifying things, but for this discussion let's categorize the tables in a database into these two types: tables that define entities, and tables that relate entities.
Tables that define entities are tables that define customers, or sales people, or even sales transactions. The primary key of these tables is not what I am here to discuss. You can use GUID columns, identity columns, long descriptive text columns, or whatever you feel comfortable using as primary keys on tables that define entities. It’s all fine by me, whatever floats your boat as they say. There are lots of discussions and ideas about the best way to choose the primary key of these tables, and pros and cons to all of the various approaches, but overall, that is not really what I am addressing.
Tables that relate entities, however, are a different story.
Suppose we have a system that tracks customers, and allows you to assign multiple products to multiple customers to indicate what they are eligible to order. This is called a many-to-many or N:N relation between Customers and Products. We already have a table of Products, and a table of Customers. The primary key of the Products table is ProductID, and that of the Customers table is CustomerID. Whether or not these “ID” columns are natural or surrogate, identity or GUID, numerical or text or codes, is irrelevant at this point.
What is relevant and important, and what I am here to discuss, is how we define our CustomerProducts table. This table relates customers to products, so the purpose of the table is to relate two entities that have already been defined in our database. Let’s also add a simple “OrderLimit” column which indicates how many of that product they are allowed to order. (This is just a simple example, any attribute will do). How should we define this table?
For some reason, a very common answer is that we simply create a table with 4 columns: One that stores the CustomerID, one that stores the ProductID we are relating it to, the Order Limit, and of course the primary key column which is an identity:
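A minimal sketch of the four-column design described above, written with Python's built-in `sqlite3` module and an in-memory database so it can be run anywhere. The article's vocabulary (identity columns) suggests SQL Server, so SQLite's `INTEGER PRIMARY KEY AUTOINCREMENT` is only a stand-in for an identity column here, and the integer column types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed column types; AUTOINCREMENT stands in for an identity column.
conn.execute("""
    CREATE TABLE CustomerProducts (
        CustomerProductID INTEGER PRIMARY KEY AUTOINCREMENT,
        CustomerID        INTEGER NOT NULL,
        ProductID         INTEGER NOT NULL,
        OrderLimit        INTEGER NOT NULL
    )
""")

# PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
info = conn.execute("PRAGMA table_info(CustomerProducts)").fetchall()
columns = [row[1] for row in info]
pk_columns = [row[1] for row in info if row[5] > 0]

print(columns)
print(pk_columns)
```

The only column that participates in the primary key is the generated one; CustomerID and ProductID are ordinary columns with no uniqueness rule attached.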
This is what I see in perhaps most of the databases that I’ve worked with over the years. The reason for designing a table in this manner? Honestly, I don’t know! I can only surmise that it comes from a lack of understanding of what a primary key of a table really is: that it can be something other than an identity, and that it can be composed of more than just a single column. As I mentioned, it seems that many database architects are simply not aware of this fact.
So then, what is the problem here? The primary issue is data integrity. This table allows me to enter the following data:
In the above data, what is the order limit for CustomerID #1, ProductID #100? Is it 25 or 30? There is no way to know for sure. Nothing in the database constrains this table so that we have exactly one row per CustomerID/ProductID combination. Remember, our primary key is just an identity, which guarantees only that the identity value itself is unique; it does not constrain the data we actually care about.
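The ambiguity can be reproduced with a small sketch, again using Python's `sqlite3` as a stand-in for the author's database (an assumption, as are the column types). The two rows use the values mentioned in the text: CustomerID 1, ProductID 100, with limits of 25 and 30.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE CustomerProducts (
        CustomerProductID INTEGER PRIMARY KEY AUTOINCREMENT,
        CustomerID        INTEGER NOT NULL,
        ProductID         INTEGER NOT NULL,
        OrderLimit        INTEGER NOT NULL
    )
""")

# Both inserts succeed: the generated key is unique, the pair is not.
conn.executemany(
    "INSERT INTO CustomerProducts (CustomerID, ProductID, OrderLimit) "
    "VALUES (?, ?, ?)",
    [(1, 100, 25), (1, 100, 30)],
)

limits = [
    row[0]
    for row in conn.execute(
        "SELECT OrderLimit FROM CustomerProducts "
        "WHERE CustomerID = 1 AND ProductID = 100 "
        "ORDER BY CustomerProductID"
    )
]
print(limits)  # two conflicting limits for the same customer/product pair
```

Any query that expects a single order limit per pair now has to pick one of the two rows arbitrarily or guard against duplicates itself.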
Most database designs like this just assume (hope?) that the data will always be OK and there will be no duplicates. The UI will handle this, of course! But even if you think that only one single form on one single application ever updates this table, you have to remember that data will always get in and out of your system in different ways. What happens if you upgrade your system and have to move the data over? What if you need certain transactions restored from a backup? What if you ever need to do a batch import to save valuable data entry time? Or to convert data from a new system that you are absorbing or integrating?
If you ever write a report or an application off of a system and simply assume that the data will be constrained a certain way, but the database itself does not guarantee that, you are either a) greatly over-engineering what should be a simple SQL statement to deal with the possibility of bad data or b) ignoring the possibility of bad data completely and setting yourself up for issues down the road. It's possible to constrain data properly, it's efficient, it's easy to do, and it simply must be done or you should not really be working with a database in the first place -- you are forgoing a very important advantage it provides.
So, to handle that issue with this table design, we need to create a unique constraint on our CustomerID/ProductID columns:
Now, we are guaranteed that there will only be exactly one row per combination of CustomerID and ProductID. That handles that problem, our data now has integrity, so we seem to be all set, right?
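As a hedged sketch of this step, the same table can be declared with an added unique constraint on the pair. This again uses Python's `sqlite3` rather than the author's database, and the constraint name `UQ_CustomerProducts` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE CustomerProducts (
        CustomerProductID INTEGER PRIMARY KEY AUTOINCREMENT,
        CustomerID        INTEGER NOT NULL,
        ProductID         INTEGER NOT NULL,
        OrderLimit        INTEGER NOT NULL,
        -- Hypothetical constraint name; enforces one row per pair.
        CONSTRAINT UQ_CustomerProducts UNIQUE (CustomerID, ProductID)
    )
""")

insert_sql = (
    "INSERT INTO CustomerProducts (CustomerID, ProductID, OrderLimit) "
    "VALUES (?, ?, ?)"
)
conn.execute(insert_sql, (1, 100, 25))

try:
    conn.execute(insert_sql, (1, 100, 30))  # same pair, different limit
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM CustomerProducts").fetchone()[0]
print(duplicate_rejected, row_count)
```

The database itself now refuses the second row, regardless of which application or import script tries to insert it.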
Well, let’s remember the definition of what a primary key really is. It is the set of columns in a table that uniquely identifies each row of data. Also, for a table to be normalized, all non-primary key columns in a table should be fully dependent on the primary key of that table.
Consider instead the following design:
Notice here that we have eliminated the identity column, and have instead defined a composite (multi-column) primary key as the combination of the CustomerID and ProductID columns. Therefore, we do not have to create an additional unique constraint. We also do not need an additional identity column that really serves no purpose. We have not only simplified our data model physically, but we’ve also made it more logically sound, and the primary key of this table accurately explains what it is this table is modeling – the relationship of a CustomerID to a ProductID.
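A minimal sketch of the composite-key design, under the same assumptions as before (Python's `sqlite3` as a stand-in engine, integer column types). It checks which columns make up the primary key and confirms that a duplicate pair is rejected without any extra unique constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE CustomerProducts (
        CustomerID INTEGER NOT NULL,
        ProductID  INTEGER NOT NULL,
        OrderLimit INTEGER NOT NULL,
        PRIMARY KEY (CustomerID, ProductID)  -- composite primary key
    )
""")

# The pk field (index 5) gives each column's position within the key.
info = conn.execute("PRAGMA table_info(CustomerProducts)").fetchall()
pk_columns = [row[1] for row in sorted(info, key=lambda r: r[5]) if row[5] > 0]

insert_sql = (
    "INSERT INTO CustomerProducts (CustomerID, ProductID, OrderLimit) "
    "VALUES (?, ?, ?)"
)
conn.execute(insert_sql, (1, 100, 25))
conn.execute(insert_sql, (1, 101, 30))  # different product: allowed

try:
    conn.execute(insert_sql, (1, 100, 30))  # same pair again: rejected
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM CustomerProducts").fetchone()[0]
print(pk_columns, duplicate_rejected, row_count)
```

The table has one fewer column and one fewer constraint than the identity-plus-unique version, yet enforces the same rule.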
Going back to normalization, we also know that our OrderLimit column should be dependent on our primary key columns. Logically, our OrderLimit is determined based on the combination of a CustomerID and a ProductID, so physically this table design makes sense and is fully normalized. If our primary key is just a meaningless auto-generated identity column, it doesn’t make logical sense since our OrderLimit is not dependent on that.
Some people argue that having more than one column in a primary key “complicates things” or “makes things less efficient” rather than always using identity columns. This is simply not the case. We’ve already established that you must add additional unique constraints to your data to have integrity, so instead of just:
we instead need:
So we are actually adding complexity and overhead to our design, not simplifying! And we are requiring more memory and resources to store and manipulate data in our table.
In addition, let's remember that a data model can be a complicated thing. We have all kinds of tables that have primary keys defined that let us identify what they are modeling, and we have relations and constraints and data types and the rest. Ideally, you should be able to look at a table's primary key and understand what it is all about, and how it relates to other tables, and not need to basically ignore the primary key of a table and instead investigate unique constraints on that table to really determine what is going on! It simply makes no sense and adds unnecessary confusion and complication to your schema that is so easily avoided.
Some people will claim that being able to quickly label and identify the relation of a Product to a Customer with a single integer value makes things easier, but again we are over-complicating things. If we only know we are editing CustomerProductID #452 in our user interface, what does that tell us? Nothing! We need to select from the CustomerProducts table every time just to get the CustomerID and the ProductID that we are dealing with in order to display labels or descriptions or to get any related data from those tables. If, instead, we know that we are editing CustomerID #1 and ProductID #6 because we are using a true, natural primary key of our table, we don’t need to select from that table at all to get those two very important attributes.
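The extra lookup can be illustrated with a short sketch. It uses Python's `sqlite3` as a stand-in engine; the IDs 452, 1 and 6 come from the paragraph above, while the OrderLimit value of 25 and the column types are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE CustomerProducts (
        CustomerProductID INTEGER PRIMARY KEY AUTOINCREMENT,
        CustomerID        INTEGER NOT NULL,
        ProductID         INTEGER NOT NULL,
        OrderLimit        INTEGER NOT NULL,
        UNIQUE (CustomerID, ProductID)
    )
""")
conn.execute(
    "INSERT INTO CustomerProducts "
    "(CustomerProductID, CustomerID, ProductID, OrderLimit) "
    "VALUES (452, 1, 6, 25)"  # OrderLimit 25 is an illustrative value
)

# With only the surrogate key in hand, a query is needed to learn
# which customer and product the row is about.
surrogate_key = 452
resolved = conn.execute(
    "SELECT CustomerID, ProductID FROM CustomerProducts "
    "WHERE CustomerProductID = ?",
    (surrogate_key,),
).fetchone()

# With a composite key, the key itself already carries both values.
composite_key = (1, 6)
print(resolved, composite_key)
```

Holding the composite key means the application can go straight to the Customers and Products tables without first visiting CustomerProducts.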
There are lots of complexities and many ways to model things, and there are many complicated situations that I did not discuss here. I am really only scratching the surface. But my overall point is to at least be aware of composite primary keys, and the fact that a primary key is not always a single auto-generated column. There are pros and cons to many different approaches, from both a logical design and physical performance perspective, but please consider carefully the idea of making your primary keys count for something and don’t automatically assume that just tacking on identity columns to all of your tables will give you the best possible database design.
And, remember -- when it comes to defining your entities, I understand that using an identity or GUID or whatever you like instead of real-world data has advantages. It is when we relate entities that we should consider using those existing primary key columns from our entity tables (however you have defined them) to construct an intelligent, logical and accurate primary key for our entity relation table, avoiding the need to create additional identity columns and unique constraints.
Note: This article was reposted from http://weblogs.sqlteam.com/jeffs/archive/2007/08/23/composite_primary_keys.aspx
Thanks very much to the author for this article.