So wow, this was the week of nonstop GUID ID discussion (in many different forms). Given that, I thought it'd be appropriate to expand on my last post. Surprisingly, the discussions have been angst and rant-free, which is a good thing. I've often wondered why professional engineers stoop to zealous and rancorous name-calling in dogmatic flames rather than objectively looking at all sides of an issue like the scientists we're supposed to be, so it's nice to engage in something that doesn't sound like jihad run amok.

This brings us to a central point in the debate, which falls into that fulcrum point between KISS/YAGNI and purposeful programming. One of the arguments often leveled against RAD development or the so-called "Mort" programmers, is they thrash about until they find something, anything, that appears to work and go with it, never taking into account the harm the approach might entail; the idea being that developers ought to program with a sufficient understanding of the stakes, and code with intent and purpose, not out of incident. We then have to balance that notion against the concept of YAGNI, which means not to over-evaluate nor over-engineer a situation and go with the best yet simplest solution available. However, I'm finding that a lot of people hide behind YAGNI, and use it as a crude bludgeoning tool for bashing and ignoring what they can't comprehend – which is the wrong approach to take. With that in mind, I'd like to point out once more that the purpose of my posts here isn't to say "Thou shalt never use GUIDs as IDs", but rather that I'm presenting issues that are crucial in order to make informed decisions. More importantly, I'm presenting arguments for both sides and allowing you (the reader) to come to your own conclusions based on your particular scenario, which is far more objective and constructive than saying "always do this" or "never do that".

One of the points I raised about using nonsequential GUIDs for keys is the index fragmentation. I soon realized that a lot of people didn't fully comprehend what this really meant, so this post is really a primer on index fragmentation and why you should or shouldn't care.

First, if your tables are very small, this doesn't pose much of a problem, if any at all. In fact, if your tables are small enough (for example, 10 rows or less) an index can be completely ignored for queries, as the query optimizer can favor a table scan over an index scan, since with so few records, a table scan can reach a desirable record faster than (or on par with) going via index on average.

Second, the problems you encounter with index fragmentation grow proportionally with table size and frequency of access. In other words, the more data there is, and the more concurrent reads/writes taking place, the more you will get noticeably slammed by index fragmentation. What this means is that to a certain degree, you can live with index fragmentation if you feel you absolutely have to, depending largely on the nature (size and use) of the table. Don't relegate this decision for later consideration! It's not a trivial matter to change the key and index structure of a table post production deployment. This could impact your entire architecture.

Before showing you what's really at stake with index fragmentation, it's important to understand the indexes themselves. And I find that with most developers I talk to today, the understanding of databases is very limited because they see it as some abstract physical storage black box not unlike a relationally structured file system (I can hear the anguished screams of terror DBAs are giving off right now).

First, SQL Server has two kinds of indexes: clustered and non-clustered. They are both implemented as B-Trees, meaning that you have a tree structure of nodes (for fast searching). In both cases, the indexed values are stored in order (from min to max, or alphabetically). The main difference between clustered and non-clustered indexes is that clustered indexes incorporate the actual storage of the data, whereas non-clustered indexes are separate lookups pointing to the data. OK, for a better idea of this notion, think of clustered indexes like a directory, encyclopedia, or phone book – the contents of the book itself are stored in order by a name or key. A non-clustered index is more like an index in a chemistry textbook: the index entries are in ordered, but it is separate from the content (found way in the back in some appendix), and each index entry points back to a specific fragment of actual content, which is in turn written out completely independent of the order of the index. Otherwise, both are pretty similar in terms of physical storage in the DB file. There are other nuances (such as whether or not unique keys are present), but those are somewhat inconsequential to this particular matter.

SQL Server is designed around pages of data, each containing 8Kb. As long as data is contained within a single page, things are fast. But when SQL Server has to switch pages, you get a performance hit, particularly when said page isn't contained in the same block (extent) and/or isn't cached and physical IO has to occur. Just like everything else in the DB, your indexes are written on pages. Now let's say that (in a really simplified view) your index, containing simple integer IDs for this example, looks like this:

original index

Notice that each page is full and everything is sequential and contiguous. This is what a good index is supposed to look like. The index Value here is the ID itself. Now let's pretend I have to insert a row where the index value (the key in this case) is 6. This entry has to be written into a page that already has data, and worse, is already full. Now SQL Server has to do a page split and relocation – which is very expensive for a write. Your index would now look like this:

original index

Notice that for entry 6 to be inserted in order, the rest of that page had to be split off, and relocated to a new page – which also happens to be at the end, so the page itself is also out of order. This affects query performance in a number of ways because the pages aren't contiguous (the scan has to bounce around), and caching will be difficult depending on the spacing between the pages (which is why this problem grows worse as the table gets larger). For severely fragmented indexes, which is what happens with random/pseudo-random/hashed values (like non-sequential GUIDs), page switching on queries will be very noticeable.

Also, notice that you now have partially-filled pages. Page 1 and Page 3 have holes with no data. This sharply increases the size of the data file for large, fragmented indexes. So in addition to GUIDs being physically larger values than the INT IDs, you get substantial data bloat with page fragmentation.

You can avoid a lot of these problems simply by using sequential GUID keys (for example, by using the NEWSEQUENTIALID() function) if you don't care much for INT IDs. Though the size of the data will increase, obviously, storage is cheap and a lot of people simply won't care, although you still have to take performance into account. However, this isn't really possible when using client-side generated IDs, barring some complex infrastructure.

So what makes me curious is why people want to use client-side generated IDs to begin with. This is genuine curiosity because it's possible the solution is solving problems that no longer exist or might be better served with a different solution. One of the reasons for client-generated unique IDs used to be that you couldn't always depend on the server to hand you the correct the identity value for parent-child relationships inserted in batch. For example, @@IDENTITY could hand you the wrong ID value since it grabbed the last ID produced by the session, which could have been the result of an insert trigger write to a different table, rather than the identity insert of the present command statement (as an example). However, there are more elegant ways around this today, particularly with the SCOPE_IDENTITY() function. Of course, that's all SQL Server specific, so YMMV depending on your particular scenario.

posted @ Thursday, May 21, 2009 4:30 PM