I had the pleasure of once again talking with Dan Simmons from MS at a TechEd mix session. He probably thinks I'm a pain in the butt with all my EF requests, but I do enjoy our conversations. Anyway, after the session, we were discussing data architectural decisions with some of the attendees when primary keys came up. At some point, the issue took a turn towards the old INT ID vs. GUID discussion. Ironically, I had the same discussion with someone else this week, so I figured it would make a good post.

INT ID vs. GUID is one of those long-standing arguments that for whatever reason rival SProcs vs. No-SProcs in terms of passionate opposition. People either love GUIDs or hate them. But the fact of the matter is that a lot of people simply have a hard time understanding the details enough to make an informed decision.

My first reaction is to say that GUID keys are there for a reason. That reason, as far as I can tell from back in the day was replication. In fact, on some of the tooling, you'll see uniqueidentifier identities referred to as replication identities. At this point, Dan quickly pointed out that GUIDs aren't guaranteed to be unique (at least across machines in this scenario), and that's a very good point to stress. In the old documentaion, you would see scenarios where people dumped data from disperate systems into a central BI database for analysis. The problem then becomes one of overlapping IDs. Enter GUID ID. But really, that only mitigates the problem to a certain degree since you still have to code for the occassional identifier collision.

GUIDs as IDs have two main drawbacks:

1) The size. They are four times the size of a normal INT. That may not seem like a lot for today's storage hardware, but on a table of 4 million rows, that's 12MB of difference. Multiply that by several tables and/or more rows, and you get the idea. Now, many tables may not get that big and you don't necessarily have to worry about it much.

2) (another reminder that Dan brought up) GUIDs do not index well, particularly if you are talking about clustered indexes (the default for primary keys in SQL server tooling). In this scenario, because they aren't sequential, GUIDs tend to fragment the index horribly. This both wastes space and can make index updates and relocation computationally expensive. In 2005, SQL Server introduced a new function - NEWSEQUENTIALID() - to help alleviate this problem. However, most people don't use it, and you still have to contend with the size issue.

I've seen people also argue that the set of operators for GUIDs are more limited than the set for INTs, but I don't think this argument carries a lot of weight. After all, you shouldn't be using arithematic operators on primary keys anyway. Comparison operators are perfectly valid for both though. However, if you don't have sequential IDs, then comparisons for ranking and sampling (for example, for paging) become a lot trickier.

At the end of the day, there's nothing particularly evil about using GUIDs for identifiers. However, I think people need to fit the solution to a specific problem and carefully weigh the pros and cons. Personally, I'd go with INT identifiers natrually unless there was a compelling reason to use GUIDs. And expected data size isn't one of those reasons. If you need more than 4 billion records, you can always upgrade to BIGINT.

posted @ Tuesday, May 19, 2009 11:38 AM