Key Points About Surrogate KeysRoy Hann, Rational Commerce Limited Session #: NT020S Presented at CA-World '96 New Orleans, Louisiana, USA August 1996 Abstract This presentation introduces the notion of a surrogate key, with illustrations of how surrogate keys are useful. Techniques for generating surrogate keys in CA-Ingres applications are described, and are analyzed from the point of view of allocation performance and insertion problems. Techniques for efficiently generating keys are presented. Variations on the basic surrogate key are developed, including pseudo-random order keys and verifiable keys. Note: this paper was slightly updated in October 2002, and an addendum on primary keys was added in August 2003. This addendum overrides advice suggested under the subheading "The Solution", and should in fact cause us all to reconsider whether we even need surrogate keys any longer. A Typical ProblemFrequently when we design databases we discover that there is no natural key that we can rely on in our application. Consider the following (simplified) real-world example, from an application developed to study and manage the improvement of beef cattle. Beef cattle are identified by different people in many different ways. There is the farm name, by which the animal is known on the farm--and possibly to the world if it is a famous sire. Farm names are names such as "Thunder Beef IV" or "Haven Great Heart". Farm names are therefore much like human names in that there is no central coordination in assigning them, and there is a moderately high probability of ambiguous naming, i.e. two or more animals with the same name. The latter problem is very common as some farms like to re-use a good name over and over. Clearly we can't use the farm name as a totally reliable, unambiguous method of identifying an animal's records. Animals also have a tattoo number. The tattoo number is usually assigned by the breed association for the particular breed of animal. There is in fact an ISO standard for beef cattle tattoos, but the standard requires such a long number that it is not often used. Instead a truncated number is assigned which omits the international identifier. That would be fine, except that cattle are often traded internationally, with Australian, American, and Canadian cattle flowing back and forth. This raises the possibility, at least in principle, of an animal which originates in Australia being imported into Canada with a tattoo number that conflicts with another tattoo number on an animal already in Canada. To make matters worse, historically the breed associations did not include a breed identifier in their tattoo numbers, so that animals from the same country but of different breeds may have the same number too. There is also the producer's ear-tag number, which is the animal's number within its herd, but these numbers are not useful off the farm. There is also a purebred registration number, but only about 10% of animals will have one of these. There are even more numbers besides these, and some numbers can change from time to time anyway (as for instance when the animal moves to another herd). As you can see, none of these identifiers are suitable for propagating throughout the database as a key. Although this may be a somewhat extreme example, problems like this are familiar to everyone regardless of their application domain. For instance hospital databases have exactly the same problem identifying patients. Compounding the ProblemEven when a potentially reliable natural key exists, if we are relying on human operators to enter the key value we should be reluctant to propagate it throughout the database because of the difficulty of correcting it when a user inevitably enters an incorrect value. Another Common ProblemSometimes our problem is different, though a lot simpler. We just want to preserve the order in which rows were added to the database for some reason, and there is no way to recover that ordering by sorting on an existing key in the data. In this situation we sometimes add a row number when the row is stored. (I don't actually advocate doing this; I believe that if the ordering cannot be deduced from the data, then either the logical model is defective or the desire to have an ordering is wrong. But in the real world, right or wrong, users often demand this.) And Yet AnotherSometimes tables that have a key that consists of a long string can cause problems for the optimizer. The utility that collects statistics on key value distributions for the optimizer only examines the leftmost 8 bytes of the string. [Note: this was true at the time of writing, but is no longer an issue. --RH October 2002] If the string is longer than that, and if it is relatively unvarying on the left, then the statistics may suggest that the key is not very selective, even if it is completely unique. The optimizer would then ignore a very good key when it is evaluating query plans. To solve this problem we can replace the long key value with a more concise artificial value. (There are other potential solutions too, like adjusting the statistics manually, or turning off the use of statistics for queries involving this table.) The SolutionThe solution to the beef cattle identification problem is to generate a totally arbitrary "animal number" in the application and to cross-reference the "animal number" to all the natural identifiers in one place, and then to use the "animal number" as the key everywhere else in the database. If an incorrect tattoo number is entered by an operator, it needs to be corrected in only the place where it is cross-referenced to the "animal number". There is no need for the animal even to have a pre-existing tattoo number. (Employee numbers are another familiar example of the same solution to the same problem.) [Since I originally wrote this, advances in support for declaritive referential integrity constraints make this suggestion irrelevant. See the addendum for current advice. --RH August 2003] The arbitrary "animal number" and the artificial "row number" are examples of a surrogate key--a value used to uniquely identify an entity without using a real data value or "natural" key. An important property of these surrogate keys is that they are totally synthetic and hence they have no information content at all. This means they can never be wrong, and so we don't need to write code to update them. This simple point turns out to have fantastically important consequences. The rest of this presentation will deal with generating these content-free surrogate key values, and the surprising pitfalls that are hidden in this outwardly simple and easy-to-implement concept. We will also explore how to solve a variety of other problems using exotic surrogate key value generators. Possible ImplementationsThere are a couple of candidate implementations of this solution that need to be disposed of quickly before we discuss the technique that this paper recommends. Anyone who has had a little experience with CA-Ingres will be aware of the tuple identifier (TID) that addresses every row in the table. Some people have been seduced into using this TID as a convenient surrogate key value. For example in a master-detail relationship, after inserting the master row into the database they have looked up the TID of that row and used it in the foreign key in the detail table. The code might look something like this:
For a while this will not only work, but it will be very fast too, because the TID is the physical address of the row in the master table. Any time an application needs to locate the row in the master table corresponding to a detail row, it has its exact address in the detail table. Or at least it is there until the first time the master table is re-organized. (As it will be for instance when a MODIFY command is executed on it.) When that happens the row in the master table will probably get a new TID, and the relationship between the detail and the master will be irretrievably lost. To make it even worse a new row may be placed where the old row used to be, and that will be the row selected by the TID in the detail row, so the error may go undetected. For this reason the TID must never be used as a surrogate key value, under any circumstances. The second possible implementation that we will dismiss is somewhat more contentious. It is more contentious because it is supplied by the CA-Ingres product precisely to solve the surrogate key value generation problem. Unfortunately, there are very good reasons not to take advantage of it. But first let us understand how it works. CA-Ingres provides two logical key types: table_key and object_key. These are respectively an 8-byte and a 16-byte value. A column in a table can be declared to be of type table_key or type object_key, and (optionally) to be system_maintained. If it is declared to be system_maintained, then the value of the column is automatically initialized by CA-Ingres when a row is inserted. System maintained table_key values are guaranteed to remain unique within the table. System maintained object_key values are unique in the database. To outward appearance then, the system supplied logical key types are the perfect solution to some of the problems of generating surrogate key values, which should not be surprising since that is their intended purpose. Unfortunately this otherwise excellent solution has a couple of really serious drawbacks. The most significant problem is that the CA-Ingres COPY...FROM command cannot load values into a column that is a logical key type "with system_maintained". That is a problem because one must be able to use COPYDB and UNLOADDB in a variety of situations, such as doing an upgrade, or porting to a new computer system, or as a last-ditch salvage operation. But both of these utilities rely on COPY...FROM to reload the database so using system maintained keys makes them unusable. There is a work-around for this problem which will be shown later, but it is sufficiently complicated and wasteful of space that the alternate technique that is recommended in this paper may be preferable, especially considering some of the extra versatility offered. The technique that is advocated here is to construct a surrogate key generator from scratch. Such a generator can be almost as efficient as the system supplied logical key value generator, but with none of the problems, and it offers the opportunity to tailor the sequence of values generated to suit different applications. For instance, although a random sequence of values is useful in some situations, in other applications a sequential value is required. A "home brewed" sequence generator can be constructed that provides both types of numbers equally efficiently. Constructing A Surrogate Key GeneratorThe most important property of any usable surrogate key value generator is that it will never generate the same value twice. To be useful we must be able to rely on the surrogate key to identify only the one record (or group of related records) that we are interested in. The solution that first comes to mind is to generate sequential numbers. By noting the last value allocated, and adding one to get the next value, we are guaranteed to get unique numbers very simply. This kind of key generator is often implemented in the following way. Example 1First a table (called generator here) is created, with one row and one column:
This table is then accessed in the following way to obtain sequential numbers for a surrogate key value:
This fragment of code appears to do what is required, and in fact it might work most of the time on systems that have relatively few concurrent users. However it has a couple of flaws that must be corrected if it is to be reliable in all cases. The first flaw is quite elementary but frequently overlooked. In a multi-user system this code is capable of allocating duplicate surrogate key values to successive users. This can occur when a second (or third or fourth) user requests a surrogate key value at the same time as the first requestor. The problem is that by default the SELECT statement takes only a shared lock on the page. A shared lock only prevents another user from updating the row, it does not prevent another user from reading that row, and so another user is free to come and read the same data and so get the same surrogate key value. [Actually this might not really lead to duplicate numbers being used. But if not, it will cause a resource deadlock, which is only slightly better. --RH October 2002] One solution is to take advantage of the fact that an UPDATE statement takes an exclusive lock on the page. Once a user has an exclusive lock on a page, no other user can even read that page until the user who locked it releases their exclusive lock. Actually, it is not that simple. It is possible to read a page that has been locked exclusively by first issuing the SET LOCKMODE SESSION WHERE READLOCK=NOLOCK command (or some similar variant of the SET LOCKMODE command). This is highly inadvisable in most applications and would certainly be a serious problem for this key value generator. [...] Doing the UPDATE first and then doing the SELECT guarantees that a second user cannot read the same key value until after the first user commits. The more correct code looks like this: Example 2
Of course if this code is relying on the UPDATE to take and hold an exclusive lock, autocommit cannot be turned on while this code is executed. If there is a chance that this code will be used in an application that has issued the SET AUTOCOMMIT ON command, it must be protected with some boilerplate code to detect and handle that. Performance Problems--Lock ContentionWe have deliberately designed the key value generator so that no two sessions can access the generator table at the same time, so that no two sessions ever get the same key value. This solves the problem of accidentally creating ambiguous key values, but it does so at a very high cost. It induces the problem of "lock contention". The users end up queuing for access to the generator table and they experience (potentially) seriously degraded response time. In a small system with only a few users it may be tolerable, but in general, with a large number of users and locks that are held for significant periods of time, this sort of centralized key-generator can be a crippling bottleneck damaging overall system performance. To make the problem clear, suppose that the application accesses the generator table and obtains a new surrogate key value. Then suppose that the application displays a data entry frame and waits for the user to enter some data. After the user has entered the data, the application may subject it to a verification process that requires a number of database look-ups. Finally the application stores the data and then COMMITs the entire transaction, at which point the lock on the generator table is released. Now suppose that after the application displays the data-entry screen the user goes to a 3 hour meeting. As described here, the generator table will be locked for 3 hours unless the DBA intervenes. This is a far-fetched illustration, but it shows the problem, and even if the user enters their data promptly, there will still be a substantial period of "think time" during which the generator table will be locked. The busier the system is, the shorter the tolerable lock duration will be. In a really busy system a lock held for even a significant fraction of a second may be intolerable. Reducing the Impact of Lock Contention--Last-Minute LockingOne way to improve performance is to design the application so that the surrogate key value can be generated at the last possible moment in the transaction. That way the lock on the generator table will not be acquired until the last moment, and it will be released promptly. (In general we should always attempt to lock tables in the order least-busy first, so that the locks on the busiest tables are taken as late as possible and released as soon as possible; this is nothing new.) If we do nothing else, we should endeavor not to permit any user "think-time" after the time the lock on the generator table is acquired. If the application demands gap-free numbering, this may be the best we can do to eliminate lock-contention. That is because we not only have to guarantee that no two sessions ever get the same key value, we must also guarantee that no key value is ever wasted. Therefore the lock on the generator table must be held until the key value is actually used and the entire transaction is committed. But if the demand for gap-free numbering can be relaxed then there is plenty of room to improve performance still further. Regard Gapless Number Sequences as EvilThe next step towards a solution is to ask whether a gap-free number sequence is actually required. Frequently there is no reason to demand a gap-free number sequence. If the purpose of the surrogate key is only to provide a unique identification then there is no reason to insist on a gap-free numbering sequence (in fact, there may be excellent reasons to insist that it actually be non-sequential--more about that later). Similarly, if the surrogate key is only a sort key, there is no reason to insist that the sequence be gap-free, only that it be strictly increasing. We should regard gap-free number sequences as a great evil, to be avoided if possible because there just is no simple, low-cost way to generate one. Once this requirement is eliminated we can design the application to allocate a surrogate key value and then immediately COMMIT, before beginning the "real" transaction. If the "real" transaction is subsequently rolled back or never even begun, and the generated surrogate key value is wasted, we don't care. Because the COMMIT is done immediately the lock on the generator table is held only long enough to process the update and log the transaction. But we can still do even better than this. Pre-allocating Surrogate Key ValuesIf we are prepared to accept gaps in the number sequence at all, we should be prepared to accept gaps of any size. In which case we can reduced lock contention to almost nothing by pre-allocating blocks of numbers at one time and storing them, ready to use, in our application. For instance we may generate a hundred consecutive numbers to use later, and if we end up using only 30, we just won't care. We will look at this idea in detail. Pre-allocation in an application program written in a language like C or CA-OpenROAD is fairly easy. The code from Example 2 above is altered to increment keyval by a value greater than 1. In this example keyval will be incremented in units of 100, so the application will be able to obtain 100 surrogate key values with just one database update and one briefly held exclusive lock. The application will note the first and last values in the sequence it has reserved, and it will be responsible for sequencing through all the reserved values. The code would be similar to this: Example 3
In this example the increment is hard-coded as 100. It is actually better practice to store the size of the increment in the generator table . This allows the size of the increment to be tuned globally, without any program changes. If the increment column is called incsz then the code above would be written as:
Keep the State of Different Sequence Generators SeparateWe have said that surrogate key values have to be unique, but we ignored one detail that gives us still further options to improve performance. Surrogate key values need not be universally unique, they only need to be unique within their domain. (Domains are not a concept that CA-Ingres supports explicitly.) For simplicity here, the concept is illustrated by a few examples. The "domain" of employee numbers is distinct from the "domain" of animal numbers, and both are distinct from the domain of order numbers. There is no logically meaningful reason ever to join on an employee number and an animal number. They have absolutely nothing to do with each other. Therefore there is no logical necessity for the two numbers sequences to be distinct. It doesn't matter if the same value exists in both series because it is fundamentally incorrect to ever relate them. If we can have multiple independent key generator functions that do not need to coordinate the values they generate, then we have an opportunity to reduce contention still further. The key value generator for one domain does not need to share state information with the generator for another domain, so it doesn't need to be locked out by it and the two could operate completely concurrently. To make sure they do, we need to keep in mind that CA-Ingres and CA-OpenIngres (for the moment) use page-level locking. [This was true at the time of writing. Ingres II fully supports row-level locking today, however one should still not plan to exploit it, except as a last-resort. --RH October 2002] This means that if any row in a page is updated then an exclusive lock is taken on the whole page and hence also on all the other rows in that page. If the key generator state for several different surrogate key domains is kept in the same page then all the generator functions will be competing unnecessarily with each other for access to that page. To fix this problem we either need to use multiple generator tables, one for each surrogate key domain, or organize the generator table as a HASH table with a key to identify which generator uses which row. By structuring it as a HASH table and adjusting the value of MINPAGES we can ensure that each row has a page to itself and hence a page lock on that page will affect only the generator function for that surrogate key domain. This leads to the following new definition of the generator table:
After adjusting MINPAGES, we can check that each row has a page to itself with the following query:
The code to generate an employee number (identified with a key of "emp_nr") might look like this:
Patching Up the Native Logical Key TypesBefore leaving the subject of generating surrogate key values as efficiently as possible, let us revisit the native CA-Ingres logical key types. As noted above, using them can lead to catastrophic difficulties with the UNLOADDB and COPYDB utilities, and with the COPY...FROM command. However these system maintained logical keys are still very attractive. If you do not require some of the versatility that can be had by constructing a key value generator from scratch, it may be worth trying to work around the problem with the table_key and object_key types. The work-around is not entirely straightforward, and it relies upon the fact that the COPY...FROM command will not cause any rules defined on a table to be fired. When creating a table with a system maintained logical key type the trick is to create two columns of the same logical key type. One should be declared to be system maintained, and the other should not. The column which is not system maintained should be the one used in the primary and secondary keys on the table. (Ironically, the column which is system maintained should not be used for a key, ever.) A rule is then defined on the table to be fired after an INSERT, which should invoke a procedure to copy the value from the system maintained column to the other (key) column. If table foo is defined by:
We can define a procedure to copy the key value from auto_key to manual_key by:
The rule to execute this procedure is then:
With this work-around, the values in manual_key will be copied out and in with no difficulty so COPYDB and UNLOADDB will work fine. It doesn't matter that the system maintained column will get reinitialized during the processing because its value is only used for a moment, just after the row is inserted. The main drawback of this work-around is that it is potentially quite wasteful, due to the large size of the logical key types, and it doubles the processing and I/O required to insert a row. Generating Key Values With Special PropertiesHaving shown how to generate key values in efficient ways, so that the generator function does not seriously degrade overall system performance, let us look at some of the different kinds of key value sequences that we can generate for different purposes. Initially we assumed that the only property we wanted from the surrogate key value generator was uniqueness. However there may be other properties that would be useful. Two additional properties will be discussed here: random-order key values, and verifiable key values. Random-Order KeysIt is sometimes advantageous to generate key values in random order. Note that we want values that are supplied in random order and emphatically not random numbers. Random numbers may repeat, which would be disastrous for a surrogate key. Before showing one of several approaches to generating values in a random order, we should look at why we would want to do such a thing in the first place. The most powerful use of random-order keys is in B-tree tables, where they can be used to dramatically improve the performance of systems with many concurrent users performing INSERTs into the B-tree. B-trees are a very popular data structure for several reasons: they do not suffer from overflow as much as other structures, and they provide good look-up performance. Unfortunately they can suffer from a couple of problems when we use a sequential primary key value. One problem is that inserting large numbers of new rows can cause considerable waste of space. When a page is split, half the data is moved to the next page. The resulting free space is never re-used. This not only wastes space, it also reduces cache efficiency. B-trees are also prone to serious lock contention problems in applications that are INSERT-intensive where the primary key is a sequential value. A brief walk-through of a typical example will make this clear: Session 1 generates a key value k and uses that value for the primary key column of a row inserted into a B-tree. Session 2 generates a key value k+1 and uses that value for the primary key column of a row inserted into the same B-tree. The value k for the primary key means that session 1 inserts its row into page p of the B-tree, and because this is an INSERT, session 1 takes an exclusive lock on page p. In most cases, unless the page is now full, the value k+1 for the primary key means that session 2 will also attempt to insert its row into page p. But page p is locked, and will remain locked until session 1 does a COMMIT or a ROLLBACK, potentially quite some time later. Session 2 is therefore forced to wait. With more users and more sessions, the problem becomes worse. Even with a 100,000 page table, sequential key values will vector all the users to the just the one page within that large table and concurrency will collapse. There are two solutions to this lock contention problem. Convert the table to a HASH table, so that sequential values vector sessions to different pages throughout the table. (However the overflow and maintenance problems of HASH tables may make this option unacceptable.) Alternatively, use a B-tree still, but ensure that successively assigned key values are "far apart" and not monotonic (i.e. not strictly increasing or strictly decreasing). [Note, as a general rule we would should avoid designing systems that depend on a particular physical organization of the data in order to work properly. Random order keys will ensure that we do not depend on a particular physical organization. --RH October 2002] Just what "far apart" means is not easy to quantify, but for practical purposes we will be satisfied if it means that the number of times two sessions get vectored to the same page at the same time is close to zero. (Note that we are talking explicitly about multiple sessions here, and implicitly only about sessions inserting a very small number of rows into a table at one time. The situation with batch processes and bulk insertions is somewhat different. We will return to those in a moment.) There are any number of ways to generate successive values that are far apart, the real trick is to be sure to generate values that are unique too. There are a couple of ways to do that, but just one will be shown here. Constructing a Random-Order Value GeneratorThe generator shown here is an implementation of the so-called additive congruential method of generating values in pseudo-random order. It is based on a shift-register and an XOR-gate, and it has its origins in cryptography. It is attractive for our purpose because (i) it tends to generate successive values that are (usually) "far apart" , (ii) it does not repeat until it has generated every possible value, (iii) it produces uniformly distributed values, and (iv) it is easy to detect when it has cycled through all the values it can generate. To see how such a generator operates we will walk through all the iterations of the 4-bit generator illustrated in the diagram below: Initially the shift register contains the value 0001. The two low-order (rightmost) bits are XOR-ed together, giving 1, and the result is fed into the high-order (leftmost) bit position and the previous register contents shift one bit right. At the end of one iteration the register contains the value 1000. In the next iteration the two low order bits are 0, so a 0 is deposited in the high-order position and the register ends up with the value 0100. The entire series is shown in the table below:
It might not be obvious that successive values are all that far apart when we are looking at a tiny 4-bit register, but it is clear that the values are generated in no obvious order, all possible values except 0 are eventually produced, and the termination condition is clear--the generator cycles back to 1. With larger register sizes successive values are rarely "close". Simulating Bit-Operations in IngresThis looks like a great algorithm for generating guaranteed-unique keys in a pseudo-random order, but unfortunately it is expressed in terms of fundamental bit operations like XOR and SHIFT. CA-Ingres does not offer these functions; if we are forced to use them they would make this algorithm awkward to use in OpenROAD and ABF, and impossible in a database procedure. [This was true at the time of writing, but Ingres II now supplies bit-manipulation functions. Sadly they are not useful here. --RH October 2002] Fortunately it is possible to simulate them, and although the simulation is somewhat tortured compared with what would be possible if we could use the bit operations supported by the hardware, the computational overhead is still trivial (as will be shown later). We take advantage of the fact that shifting a binary number is equivalent to multiplying or dividing it by two, and it turns out that we can use ordinary arithmetic addition as a substitute for the XOR operation. The only other tool we need is the mod() function. Consider the 4-bit shift register shown above again. We will number each bit from right to left, starting at 0. Bit-0 is the low-order bit (i.e. the 2**0 bit), bit-3 is the high-order bit (i.e. the 2**3 bit). Suppose the content of this conceptual shift register is held in the ordinary CA-Ingres integer4 variable called keyval. The first step is to extract each bit so that we can operate on it. The low-order bit can be obtained by taking mod(keyval,2). This takes the value of keyval and divides it by 2, and returns the remainder. If keyval is odd (i.e. bit-0 is 1) we get 1. If keyval is even (i.e. bit-0 is 0) we get 0. If we now divide keyval by 2, that is, shift it right 1 bit, we can use the mod function again, to get the value of bit-1. If the value of keyval is initially as shown above, then mod(keyval,2) is 1, and mod(keyval/2,2) is 0. This process of shifting right (dividing by 2) and taking the result modulo 2 can be repeated until every bit in the word has been extracted. Now that we have extracted the value of bit-0 and bit-1, we add them together and reduce the result modulo 2. The result of this operation is identical to XOR. The result of this computation then has to be shifted 3 bits left (i.e. multiplied by 2**3, =8) and fed back into the shift register in the high-order position (bit-3). The previous content of the shift register has to be shifted right 1 bit (divided by 2) first. This is all carried out at once in the following CA-Ingres expression:
This can easily be incorporated into the surrogate key generator shown in Example 2 above:
Generalizing the AlgorithmThis algorithm looks like it should be easy to generalize to arbitrary word-sizes, and therefore to longer number sequences. But perhaps surprisingly, the extension of the algorithm to an arbitrary word length is very difficult because the "tap" positions where bits are extracted to be fed-back varies according to the word-size in an extremely non-obvious way. Choosing incorrect tap positions results in an incomplete (usually very short) cycle, which is unusable. Anyone who cares to attempt to penetrate the mysteries of how to determine analytically the number and location of the correct tap positions that generate all possible values will need a good college-level text in abstract algebra, and the E. J. Watson paper cited below [1]. Happily it is unnecessary for us to understand how the tap positions are discovered in order to be able to use them completely effectively; all we require is a table showing the locations to use. Tap Positions for Linear Feedback Shift Register Key GeneratorsTable 2 below shows the tap positions required to generate values for 8, 16, 31, 32 and 64-bit random order keys. (For a complete list of tap positions for all word lengths between 1 and 100 bits, refer to [1].) The 31-bit word is the one that is probably most likely to be of general use, but it is possible to imagine some applications which might want a shorter cycle for some special purpose. Note that most word-lengths shown here require more than 2 tap positions.
The 31-bit Random-Order Surrogate Key GeneratorUsing the table above we can see that we need to tap bits 0 and 3 to construct the 31-bit random-order surrogate key value generator (which is the one most people would want to use in practice):
(Incidentally, a 32-bit generator is not easy to implement on a 32-bit machine using the technique described above since it will usually generate an overflow error. A 31-bit generator should be good enough for most practical purposes though, with a cycle-length of over 2 billion states.) [By popular demand, here is the same algorithm implemented in C. --RH October 2002]
Performance of the 31-bit Random-Order Surrogate Key GeneratorAs noted above, this method of generating random-order key values appears quite intensive, and we might worry that it would be too slow to use in practice. In fact this is not the case. Referring to the response times reported in Table 3 below, the additional computational overhead of generating random-order key values is inconsequential--about 0.3 milliseconds even on the very slow hardware used to obtain these times.
(This timing information was obtained using CA-Ingres 6.4 on a 486DX2/66 system running Windows NT 3.5 in 20Mb RAM. Times are for the complete UPDATE-SELECT-COMMIT transaction.) Batch Processes and Bulk InsertionsRandom-order key values are not an objective. They are a tool, and they are only useful to the extent that they improve performance by reducing contention between multiple sessions. When contention is not an issue then random-order keys can be counterproductive. Knowing when not to use them is just as important as knowing how to create them. When a B-tree table will only ever be accessed by a single session then there will be no benefit from random-order keys because there will be no competition for access. In fact, sequential (or even identical) key values will actually improve performance by narrowing the "locality of reference" within the table, resulting in the best cache-hit rate. (The extent to which we can actually accomplish this is obviously dictated by the application.) We should not blindly insist on random order key values as a matter of policy just because a table is a B-tree. A similar but even stronger argument can be made when one of several concurrent sessions is inserting many rows into a B-tree at one time. While it is valuable to keep multiple sessions from accessing the same page at the same time, it can also be very important to keep one session from unnecessarily accessing too many pages. Scattered references would reduce the effectiveness of the DMF cache, and would lock extra pages in the table, potentially causing escalation to a table level lock and hence totally defeating the purpose of the random-order key values! If a session is going to insert (or otherwise manipulate) many rows in a B-tree in one transaction, and if the application logic permits it, it is good to keep all the associated rows of one transaction on the same page(s), while keeping other sessions out. Fortunately, unless you go out of your way to subvert it, this will tend to be the natural way of doing things anyway. When a series of rows is inserted into the detail table of a master-detail relationship, the leftmost part of the key would usually be taken from the master table and it would have been assigned a value in random order. The rest of the (composite) key in the detail table can be pretty well anything at all if the table is a B-tree because the page selected tends to be affected most strongly by the leftmost part of the composite key. The session will therefore insert all of its rows into the same page(s), while another concurrent session will insert its rows into another page entirely. We do not want to out-smart ourselves by interfering with this behavior. Verifiable KeysThe last variation on the topic of surrogate keys that we will discuss is the "verifiable" key value. In some applications in which the system generates a surrogate key value for an entity, that value may become visible to the user. (Strictly speaking, a true surrogate is always hidden from the user, otherwise becomes what is known as an artificial key, but we will ignore that detail.) It may even be necessary for operators to use that value to locate information in the database as part of their job function. Examples might be locating account balances by entering the account number, or pulling up performance records by entering the animal number. In these examples the user probably manually keys the identifier, copying it from a piece of paper. The identifier on the paper source may itself have been transcribed manually, perhaps several times. It may not even be entirely legible. Clearly there can be many opportunities for transcription errors. The application should have some way of determining if the identifier typed by the operator has mutated. Obviously if there is no row in the database with a key that matches the identifier that was entered by the user the error is easily detected and some sort of diagnostic message can be emitted. But the mutated identifier could correspond to another row in the database, not the intended row, so a more sophisticated method of detection is required. One possibility might be to display some information from the row and request verification from the user that it is the intended row. Unfortunately that procedure depends critically on the diligence of the operator, the availability of supplementary information for the operator to compare with the displayed information, and it depends on entities being easily distinguishable even given that supplementary information. None of that is very satisfying, and it is unlikely to stand up to an audit. (It would probably be found to be inadequate long before an audit anyway.) A greater level of assurance is provided by adding a check digit to the artificial key value when it is generated. A check digit is just a checksum. It is computed from the basic surrogate key value, and each value will have a unique checksum. This has two benefits. The checksum can be tested without accessing the database at all. The application can therefore tell "by inspection" that an identifier is incorrect if its checksum is incorrect. The second benefit is that it makes it extremely unlikely that a mutated identifier will accidentally correspond to a real row in the database. To understand why, we need to look at how the checksum is computed. The following code defines a view on the generator table that supplies values with a so-called 1212 modulo 10 check digit.
To use this view, the state of the number generator is advanced as usual to obtain the exclusive lock on the generator table, and then instead of reading the next value from the generator table itself, it is read from the cd_generator view:
Note that the check digit is not part of the state of the sequence generator. It is computed after the state is advanced. Therefore check digits can be used with both sequential and random-order generators equally easily. Note also that in this example, the key value is returned as a seven digit string, padded on the left with zeroes, and the check digit in the rightmost position. An integer value could just as easily be returned. To test an operator-entered identifier, the application strips off the supplied check digit and recomputes the checksum. The recomputed checksum is then compared with the supplied check digit that was stripped off. If they match then the identifier is correctly formed. If they don't, then there is an error somewhere. This check digit calculation adds just a single decimal digit to the key value, so there is a 1 in 10 chance of someone accidentally keying a wrong but correctly formed identifier. That is obviously inadequate for security purposes, but that is not the intention. The check digit is intended only to detect accidental mutations in identifiers that the operator is trying to key correctly. The way this checksum is computed guarantees that all errors involving a single transposition, or a single substitution, will result in a incorrect checksum. To enter an undetectably mutated identifier, the user would have to make at least two complimentary errors. If we assume that the user is making an effort to transcribe the identifier correctly then they will not often make an error at all. It will be much rarer still that they will make two errors in the same identifier. But even if they do, there is still only a 1 in 100 chance (at best) that the two errors will be complementary. The odds that such a mutated identifier also corresponds to a real record in the database are almost inconsequential. Together with the more straightforward ideas for detecting errors noted above, even this simple check digit calculation makes the identification of records highly reliable. ConclusionSurrogate keys are essential for developing reliable, high-performance database systems. But a thoughtlessly designed generator can easily compromise system performance instead of enhancing it. In spite of being a simple concept, efficient techniques for generating surrogate key values are not at all obvious. The fastest, most efficient techniques are extremely involved and require deep understanding of the CA-Ingres locking system. However, when they are correctly implemented they can help us to achieve very high throughput and reliability. Addendum[Since I wrote this paper in 1996 there have been numerous changes to the Ingres product. One of those changes has been the introduction of support for PRIMARY KEY and UNIQUE constraints in the database. I have become aware that very few people properly understand the purpose of these constraints, and frequently combine them with surrogate key attributes. In this small addendum I want to explain what these constraints are really intended to accomplish, and why they should never be applied to a surrogate key attribute. I also explain why surrogate keys are no longer necessary in most cases. --Roy Hann August 2003] ConstraintsConstraints are conditions that can be imposed on the database and enforced automatically by the database server. Constraints have many practical benefits and some theoretical consequences. Constraints are used to exclude defective data from the database. That is, they encode our understanding of what the data is allowed to look like, and in a very real sense they encode the "meaning" of the data. This is important because we make explicit and implicit assumptions about what we are allowed to do to the database based on our understanding of what it models. If there is data in the database that does not conform to our understanding (that is, to our reasonable assumptions), then we might very easily run a query that would be correct based on what we know about the real world, but which nevertheless returns invalid results. Worse still, if we know in advance that there is such incorrect data in the database, or even if we just know that such incorrect data could exist in principle--because we know that nothing is causing it to be excluded, then we will feel obliged to write vast amounts of code in our applications to verify the correctness of the data we are getting from the database. In short, we will have no faith in the database. Such boiler-plate code is a waste of effort, it is a huge cost, and it cannot to be successful even in principle. PRIMARY KEY and UNIQUEThe description of constraints above is probably the understanding of constraints that most people have, and certainly it makes good sense when we try to understand referential integrity constraints and check constraints. What is not perhaps so obvious is that PRIMARY KEYs or UNIQUE do anything to make the database more reliable, or more meaningful. To outward appearance they just prevent the rather trivial error of using the same key twice. (And indeed they do that.) But this misses the point of these constraints. In the real world we know there are many instances of each kind of entity. There are many employees. There are many orders. There are many steers. What is more, we know that each instance is totally distinct from all other instances. No two employees can be the same person. No two orders can be the same order. And so on. We therefore want to ensure that our database contains all the instances we are interested in, and only the instances we are interested in. And we want no duplicate records of each instance because a row in a table representing the entity is just a true statement about the entity, and repeating a true statement does not make it more true. Indeed repeating it may make it less true, because if we ever update one row (one true statement) and fail to update all the other rows that logically make the same statement about the same entity, we will have contradictory statements and no basis on which to prefer one over the other. In short, we will be confused. The only way to prevent the problem of having multiple true statements is to have a reliable way to tell two entities apart in the database, and then check each time a row is added to make sure that it is not a spurious repetition of an existing true statement. That is, we want a way to uniquely identify an entity so that we can be sure there is only one statement about it. KeysNotwithstanding anything in the original version of this paper (preceding the heading "Addendum"), entities are identified by looking at their distinguishing attributes. We consider the attributes describing the entity and choose one or more that necessarily distinguish it from all others. These attributes are candidate keys. They are candidates because if there is more than one possible distinguishing attribute we could equally well choose any of them. There is no logical advantage to one over another. Note that we are looking at an entity's distinguishing attributes. These are intrinsic features of the entity. When two entities are distinct there will always be some way of telling them apart. If there isn't, then logically they are the same entity, in which case there never were two entities and we were mistaken in ever thinking there was more than one. So if we cannot find a way to distinguish two entities (rows) in our business model then either we are mistaken in thinking there are two, or we simply have not thought about it hard enough, but either way we have got something wrong. This is the problem that PRIMARY KEY and UNIQUE constraints are really intended to prevent. Surrogate Keys and PRIMARY KEY (and UNIQUE)Now that we know what these constraints are really for, it should be self-evident that automatically introducing a surrogate key and then placing a PRIMARY KEY constraint on it not only trivially satisfies the constraint, it also completely subverts the purpose of these constraints. To put it more bluntly: automatically assigned surrogate keys as primary keys invalidate the consistency of the database. That is, there is a constraints that we should be able to apply in order to ensure the consistency and correctness of the database, and it is being defeated, and so we cannot be certain of consistency. One might well object that it is easy enough to apply additional UNIQUE constraints on the other natural candidate keys to guarantee consistency, and that is true. But no fundamental purpose is served by enforcing a PRIMARY KEY (or UNIQUE) constraint on an attribute that is trivially unique. Unfortunuately we sometimes think that we are forced into it because we want to be able to use a surrogate key as part of a referential integrity constraint. This happens when we have used a surrogate to prevent having to propagate a user-entered value to other tables. Interestingly, we are no longer forced to do this and there is now a much better alternative. Key PropagationAt the beginning of this paper I suggested using surrogate keys to avoid having to propagate user-entered key values throughout the database. At the time I wrote that, it was good advice. But support for declarative referential integrity within Ingres has come a long way since then, and the case for using surrogate keys is now almost impossible to defend. The argument for introducing a surrogate key was to avoid propagating a manually entered value to multiple tables in the database. The mapping from the natural key(s) to the surrogate key was done in one table, and the surrogate was then used in all other referencing tables. Thus if an error in a natural key were ever corrected it would not be necessary to have code to propagate the correction to all the referencing tables. Such code would be tedious to write and it would be impossible to give any assurance that it would be maintained when more referencing tables are added to the database. The use of a surrogate key was therefore more reliable than the alternative. Ingres II 2.5 now provides a built-in mechanism to automatically propagate changes to an attribute used as a foreign key. This is a far more robust solution. To use it, the PRIMARY KEY constraint is first applied in the usual way. For example:
The foreign key constraint is then added to the referencing table, specifying (using the ON UPDATE CASCADE clause) that the foreign key should be updated in the referencing table when the primary key is updated in the referenced table:
A Word About PerformanceOne is always properly concerned about performance when using new DBMS features. In this case the performance issue is secondary, because one should never be willing to trade correctness for speed. But as it turns out, the speed with which the server can propagate changes to the foreign key is impressive. There is certainly no reason to suppose the application could do it any quicker even if one were prepared to write and maintain all the necessary boiler-plate code. Indeed it is so quick that even several thousand references can be corrected in a couple of seconds. Using my original suggestion of a mapping between the natural keys and a surrogate would be quicker of course, since that would only ever be a single update to a single row, but the real difference is slight (in reasonable cases), and in any case one would only very rarely update a natural key because natural keys are supposed to be intrinsically ever-fixed distinguishing attributes. New ConclusionSurrogate keys still have a place, and all the arguments for using random-order key values still apply when they are used. But the justification for ever using a surrogate key has been largely eliminated thanks to much better support for declarative referential integrity constraints in Ingres. In my opinion this is one of the most important new features of Ingres yet it has passed almost completely unremarked. References
© Rational Commerce Limited, 2002, 2003 |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||