IDENTITY vs. ROWGUID, Part II

So wow, this was the week of nonstop GUID ID discussion (in many different forms). Given that, I thought it'd be appropriate to expand on my last post. Surprisingly, the discussions have been angst- and rant-free, which is a good thing. I've often wondered why professional engineers stoop to zealous and rancorous name-calling in dogmatic flame wars rather than objectively looking at all sides of an issue like the scientists we're supposed to be, so it's nice to engage in something that doesn't sound like jihad run amok.

This brings us to a central point in the debate, which sits at the fulcrum between KISS/YAGNI and purposeful programming. One of the arguments often leveled against RAD development, or the so-called "Mort" programmers, is that they thrash about until they find something, anything, that appears to work and go with it, never taking into account the harm the approach might entail; the idea being that developers ought to program with a sufficient understanding of the stakes, and code with intent and purpose, not by accident. We then have to balance that notion against the concept of YAGNI, which tells us not to over-analyze or over-engineer a situation, and to go with the simplest solution that works. However, I'm finding that a lot of people hide behind YAGNI, and use it as a crude bludgeon for bashing and ignoring what they can't comprehend – which is the wrong approach to take. With that in mind, I'd like to point out once more that the purpose of my posts here isn't to say "Thou shalt never use GUIDs as IDs", but rather to present the issues that are crucial to making informed decisions. More importantly, I'm presenting arguments for both sides and letting you (the reader) come to your own conclusions based on your particular scenario, which is far more objective and constructive than saying "always do this" or "never do that".

One of the points I raised about using nonsequential GUIDs for keys is the index fragmentation. I soon realized that a lot of people didn't fully comprehend what this really meant, so this post is really a primer on index fragmentation and why you should or shouldn't care.

First, if your tables are very small, this doesn't pose much of a problem, if any at all. In fact, if your tables are small enough (for example, 10 rows or less), an index can be ignored entirely for queries: the query optimizer can favor a table scan over using the index, since with so few records a table scan reaches the desired record as fast as (or faster than) going through the index on average.

Second, the problems you encounter with index fragmentation grow proportionally with table size and frequency of access. In other words, the more data there is, and the more concurrent reads/writes taking place, the more noticeably you will get slammed by index fragmentation. What this means is that, to a certain degree, you can live with index fragmentation if you feel you absolutely have to, depending largely on the nature (size and use) of the table. But don't put this decision off for later! Changing the key and index structure of a table after production deployment is no trivial matter, and it could impact your entire architecture.

Before showing you what's really at stake with index fragmentation, it's important to understand the indexes themselves. Most developers I talk to today have a very limited understanding of databases, because they see the database as some abstract physical storage black box, not unlike a relationally structured file system (I can hear the anguished screams of terror DBAs are giving off right now).

First, SQL Server has two kinds of indexes: clustered and non-clustered. Both are implemented as B-Trees, meaning that you have a tree structure of nodes (for fast searching). In both cases, the indexed values are stored in order (from min to max, or alphabetically). The main difference is that a clustered index incorporates the actual storage of the data, whereas a non-clustered index is a separate lookup pointing to the data. For a better idea of this notion, think of a clustered index like a directory, encyclopedia, or phone book – the contents of the book itself are stored in order by a name or key. A non-clustered index is more like the index in a chemistry textbook: the index entries are in order, but the index is separate from the content (found way in the back in some appendix), and each entry points back to a specific fragment of actual content, which is in turn written out completely independently of the order of the index. Otherwise, both are pretty similar in terms of physical storage in the DB file. There are other nuances (such as whether or not unique keys are present), but those are somewhat inconsequential to this particular matter.

SQL Server is designed around pages of data, each 8 KB in size. As long as data is contained within a single page, things are fast. But when SQL Server has to switch pages, you take a performance hit, particularly when said page isn't contained in the same block (extent) and/or isn't cached, so physical IO has to occur. Just like everything else in the DB, your indexes are written on pages. Now let's say that (in a really simplified view) your index, containing simple integer IDs for this example, looks like this:

[image: the original index – full, sequential, contiguous pages]

Notice that each page is full and everything is sequential and contiguous. This is what a good index is supposed to look like. The indexed value here is the ID itself. Now let's pretend I have to insert a row where the index value (the key in this case) is 6. This entry has to be written into a page that already has data and, worse, is already full. Now SQL Server has to do a page split and relocation – which is a very expensive write. Your index would now look like this:

[image: the index after the page split]

Notice that for entry 6 to be inserted in order, the rest of that page had to be split off, and relocated to a new page – which also happens to be at the end, so the page itself is also out of order. This affects query performance in a number of ways because the pages aren't contiguous (the scan has to bounce around), and caching will be difficult depending on the spacing between the pages (which is why this problem grows worse as the table gets larger). For severely fragmented indexes, which is what happens with random/pseudo-random/hashed values (like non-sequential GUIDs), page switching on queries will be very noticeable.

Also, notice that you now have partially-filled pages. Page 1 and Page 3 have holes with no data. This sharply increases the size of the data file for large, fragmented indexes. So in addition to GUIDs being physically larger values than the INT IDs, you get substantial data bloat with page fragmentation.
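If you're wondering where your own indexes stand, SQL Server 2005 and up can tell you directly via the sys.dm_db_index_physical_stats DMV. Here's a minimal C# sketch of checking a table (the connection string and table name are placeholders, and you'd want real error handling in practice):

    using System;
    using System.Data.SqlClient;

    class FragmentationReport
    {
        // Report fragmentation and page fullness for every index on a table,
        // using the sys.dm_db_index_physical_stats DMV (SQL Server 2005+).
        static void Report(string connectionString, string table)
        {
            const string sql =
                "SELECT index_id, avg_fragmentation_in_percent, " +
                "       avg_page_space_used_in_percent " +
                "FROM sys.dm_db_index_physical_stats(" +
                "       DB_ID(), OBJECT_ID(@table), NULL, NULL, 'SAMPLED');";

            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(sql, conn))
            {
                cmd.Parameters.AddWithValue("@table", table);
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                    while (reader.Read())
                        Console.WriteLine("index {0}: {1:F1}% fragmented, pages {2:F1}% full",
                            reader.GetInt32(0), reader.GetDouble(1), reader.GetDouble(2));
            }
        }
    }

A high avg_fragmentation_in_percent means out-of-order pages (the scan bouncing described above), and a low avg_page_space_used_in_percent means half-empty pages left behind by splits.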

You can avoid a lot of these problems simply by using sequential GUID keys (for example, via the NEWSEQUENTIALID() function) if you don't care much for INT IDs. The data size will still increase, but storage is cheap and a lot of people simply won't care, though you still have to take the performance cost of the larger key into account. However, sequential GUIDs aren't really possible when using client-side generated IDs, barring some complex infrastructure.

So what makes me curious is why people want to use client-side generated IDs to begin with. This is genuine curiosity, because it's possible the solution is solving problems that no longer exist or might be better served with a different solution. One of the historical reasons for client-generated unique IDs was that you couldn't always depend on the server to hand you the correct identity value for parent-child relationships inserted in batch. For example, @@IDENTITY could hand you the wrong ID, since it grabs the last identity produced by the session – which could have been the result of an insert trigger writing to a different table rather than the identity insert of the present statement. However, there are more elegant ways around this today, particularly the SCOPE_IDENTITY() function. Of course, that's all SQL Server specific, so YMMV depending on your particular scenario.
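For example, here's the typical pattern in ADO.NET – return SCOPE_IDENTITY() in the same batch as the insert. This is just a sketch; the table and column names are made up for illustration:

    using System.Data.SqlClient;

    class OrderRepository
    {
        // SCOPE_IDENTITY() returns the identity generated in the current scope,
        // so an insert trigger writing to some other table can't pollute it
        // the way it can with @@IDENTITY.
        static int InsertOrder(SqlConnection conn, string customer)
        {
            const string sql =
                "INSERT INTO dbo.Orders (Customer) VALUES (@customer); " +
                "SELECT CAST(SCOPE_IDENTITY() AS int);";

            using (var cmd = new SqlCommand(sql, conn))
            {
                cmd.Parameters.AddWithValue("@customer", customer);
                return (int)cmd.ExecuteScalar();
            }
        }
    }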

IDENTITY vs. ROWGUID

I had the pleasure of once again talking with Dan Simmons from MS at a TechEd mix session. He probably thinks I'm a pain in the butt with all my EF requests, but I do enjoy our conversations. Anyway, after the session, we were discussing data architectural decisions with some of the attendees when primary keys came up. At some point, the issue took a turn towards the old INT ID vs. GUID discussion. Ironically, I had the same discussion with someone else this week, so I figured it would make a good post.

INT ID vs. GUID is one of those long-standing arguments that for whatever reason rival SProcs vs. No-SProcs in terms of passionate opposition. People either love GUIDs or hate them. But the fact of the matter is that a lot of people simply have a hard time understanding the details enough to make an informed decision.

My first reaction is to say that GUID keys are there for a reason. That reason, as far as I can tell from back in the day, was replication. In fact, in some of the tooling, you'll see uniqueidentifier identities referred to as replication identities. At this point, Dan quickly pointed out that GUIDs aren't guaranteed to be unique (at least across machines in this scenario), and that's a very good point to stress. In the old documentation, you would see scenarios where people dumped data from disparate systems into a central BI database for analysis. The problem then becomes one of overlapping IDs. Enter the GUID ID. But really, that only mitigates the problem to a certain degree, since you still have to code for the occasional identifier collision.

GUIDs as IDs have two main drawbacks:

1) The size. At 16 bytes, they are four times the size of a normal 4-byte INT. That may not seem like a lot for today's storage hardware, but on a table of 4 million rows, those extra 12 bytes per row amount to 48 MB of difference. Multiply that by several tables and/or more rows, and you get the idea. Now, many tables may not get that big, and in that case you don't necessarily have to worry about it much.

2) (another reminder that Dan brought up) GUIDs do not index well, particularly if you are talking about clustered indexes (the default for primary keys in SQL Server tooling). In this scenario, because they aren't sequential, GUIDs tend to fragment the index horribly. This both wastes space and makes index updates and relocation computationally expensive. SQL Server 2005 introduced a new function - NEWSEQUENTIALID() - to help alleviate this problem. However, most people don't use it, and you still have to contend with the size issue.

I've also seen people argue that the set of operators for GUIDs is more limited than the set for INTs, but I don't think this argument carries a lot of weight. After all, you shouldn't be using arithmetic operators on primary keys anyway. Comparison operators are perfectly valid for both, though. However, if you don't have sequential IDs, then comparisons for ranking and sampling (for example, for paging) become a lot trickier.
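To see why, consider the common keyset-paging trick, sketched below against a hypothetical Items table. It leans entirely on keys being ordered in a meaningful way; with random GUID keys, "greater than the last ID I saw" tells you nothing about insertion order:

    class KeysetPaging
    {
        // Keyset paging: fetch the next page by comparing against the last key
        // seen. Works with sequential keys (INT IDENTITY or NEWSEQUENTIALID()),
        // but not with random GUIDs. Table and column names are made up.
        const string PageSql =
            "SELECT TOP (@pageSize) Id, Name " +
            "FROM dbo.Items " +
            "WHERE Id > @lastSeenId " +
            "ORDER BY Id;";
    }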

At the end of the day, there's nothing particularly evil about using GUIDs for identifiers. However, I think people need to fit the solution to a specific problem and carefully weigh the pros and cons. Personally, I'd naturally go with INT identifiers unless there was a compelling reason to use GUIDs. And expected data size isn't one of those reasons: if you need more than 4 billion records, you can always upgrade to BIGINT.

Old Passwords Still Working?!

I've had to respond to this topic a number of times now, so I figured I would just write it up here for future reference.

If you are using Active Directory, and you administratively change a principal's password, sometimes you find that the old password still works (at least for a little while). Most often, you'll see this if you are using PrincipalContext or a Membership Provider that uses A.D. under the covers, because when you call their methods to change a password, they do an administrative password change using LDAP.

This is actually an old feature of NTLM authentication. The concept being (if I remember correctly) that after an administrative password change, you could still have cached credentials, shares, etc. using the old password. Therefore, NTLM still accepts the old password for some specified amount of time after the password change. The same "feature" is no longer available under Kerberos, from what I understand.

So first, you might be asking, "What's all this about NTLM and Kerberos? I'm using Active Directory and/or LDAP". True, but AD is just a giant database of network resource info (including users and groups), and LDAP is simply one protocol to query/change that data. The information by itself doesn't actually "do" anything. To perform authentication, Windows uses information from AD, but via one of two protocols: NTLM or Kerberos. NTLM is a hashed-credential, challenge-response protocol, and Kerberos is a (non-Microsoft) protocol that uses a trusted third party, symmetric credential encryption, and verification via tickets. If you didn't get all that, don't worry. The point is that they work differently. NTLM is the old Windows NT networking library (including authentication), and Kerberos is the shiny new authentication library. But Windows still maintains both, because not all environments, or even all versions of Windows, are capable of using strictly Kerberos. In fact, by default, Windows uses a pseudo-protocol called Negotiate, which tries to use Kerberos and, if that fails, falls back on NTLM.

So one way to prevent the old password from being valid after a password change is to stop using NTLM for authentication. And that's easier said than done. If you are using PrincipalContext, nothing you do will prevent the code from using Negotiate (which ends up using NTLM when Kerberos comes back with an error). Even changing the ContextOptions won't do it.

The PrincipalContext class authenticates by establishing a secure LDAP connection and then calling Bind() on that connection, providing the given credentials (user name and password) to the Bind operation. But no matter what you change on the PrincipalContext, it will almost always use Negotiate (barring some scenarios, like looking for local users, which simply queries the local SAM database). Therefore, you have to stop using PrincipalContext.ValidateCredentials. Instead, you can do what it does under the covers, and Bind to an LDAP connection manually. It's quick and only takes a few lines of code. Basically, you create a new LdapConnection and set its Credential (a NetworkCredential is just an object that holds a user name and password). Next (and this is the important part), set the AuthType to Kerberos. Finally, call the Bind() method. If the credentials are invalid, you'll get an exception to that effect. This is exactly what PrincipalContext.ValidateCredentials does (if you are validating against Active Directory), but this way, you get to control the protocol being used. Also, don't forget that LdapConnection is an IDisposable object, so make sure to call Dispose() when you're done, or instantiate it inside a using block.
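Here's roughly what that looks like in code. Consider it a sketch: the server name is a placeholder, and in production you'd probably want to inspect the specific LDAP error code rather than catch the exception wholesale:

    using System.DirectoryServices.Protocols; // LdapConnection, AuthType
    using System.Net;                         // NetworkCredential

    class KerberosValidator
    {
        static bool ValidateCredentials(string server, string userName, string password)
        {
            using (var connection = new LdapConnection(server))
            {
                connection.Credential = new NetworkCredential(userName, password);
                // The important part: force Kerberos so NTLM's old-password
                // grace period never comes into play.
                connection.AuthType = AuthType.Kerberos;
                try
                {
                    connection.Bind(); // throws if the credentials are rejected
                    return true;
                }
                catch (LdapException)
                {
                    return false;
                }
            }
        }
    }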

LINQ to SQL Takes a Back Seat to EF

Well, it's official. One of my predictions (http://theruntime.com/blogs/devprime/archive/2008/09/02/prognostication.net.aspx) has somewhat come true. Microsoft officially announced today on the ADO Team Blog (http://blogs.msdn.com/adonet/archive/2008/10/29/update-on-linq-to-sql-and-linq-to-entities-roadmap.aspx) that starting with .NET 4.0 (the next release), Entity Framework will be THE choice of relational LINQ technology, and LINQ to SQL is taking a back seat.

This all made sense to me, because LINQ to SQL's limitations were just lame. It looked and felt like a stopgap product, meant to hold the LINQ-to-DB line until something better came along. The problem was that the "something better" (EF) still needed a lot of development work and wasn't ready by the time VS2008 was out the door.

Of course, that leaves a lot of people who invested time in learning LINQ to SQL a little steamed (don't you wish you had listened to me? - j/k!!). But take heart - a lot of the concepts do in fact carry over from LINQ to SQL to EF. Even if the biggest thing you learned was just how to deal with LINQ in the first place, then you've already got a good leg up on EF.

But strangely enough, looking at the comments on the blog, it seems like people are really misinformed about EF (and writing angry comments that don't have much basis in fact). So here's the Rob version of FactCheck:

SPIN: Entity Framework is more limiting than LINQ to SQL
FACT: It's just the opposite. LINQ to SQL only works with SQL Server, while EF works with just about any DB. Also, the resulting SQL is more capable, and in some cases more optimized, in EF than in LINQ to SQL. That perf-and-feature gap will only continue to grow as EF matures and MS dumps more money into it.

SPIN: LINQ to SQL was better because it allows for domain-first design
FACT: Not true at all. Entity Framework is much better at allowing you to design in a domain-first manner. That is a primary goal of the technology. It allows you to decouple the conceptual "entities" from the physical store layout. With LINQ to SQL, all you get are datasets tied to a single physical entity. That 1-to-1 limitation stinks, because a conceptual entity may be split and normalized across multiple tables in the physical store. Also, the entities you get back with EF are true observable objects rather than datasets with smoke and mirrors (as is the case in LINQ to SQL). All this allows you to better represent your domain in EF than in L2S.

SPIN: LINQ to SQL was better at persistence ignorance
FACT: If you've already read the two previous facts, this one should be obvious. EF allows access to any DB, while LINQ to SQL allows access only to SQL Server. For persistence ignorance, that means EF 1, L2S 0. You can't be ignorant of the persistence mechanism if you are forced to use one and only one persistence mechanism. FAIL. And because you can create your conceptual entities separately from the physical layout of the store, EF 2, L2S 0. In other words, L2S ties you specifically to the underlying table/view/SP-return structure, and that is NOT persistence ignorance.

SPIN: OK, screw you and your facts, LINQ to SQL was just easier
FACT: That's partially true. The problem here is that EF is much (MUCH) more capable than L2S, and that adds more levers and knobs to tweak. The second problem is that if you've played with EF v1, then you'll note that the GUI IDE tooling for EF seems a bit lacking. That's because it is. However, once completed, EF's IDE support will rival (if not surpass in some areas) anything L2S has exposed so far. To be quite honest, a lot of it will work in a fairly similar manner - a designer to define entities, and a mapping UI. Some scenarios are very difficult to do in v1 because the IDE support is incomplete and you have to write extra code or muck with the underlying XML definitions. As EF matures and goes on to future releases, that will go away.

Trustworthy Languages

I was emailing with a VS languages team member yesterday, and asked him why he wrote a piece of code a certain way (just some sample snippet that he used to illustrate an idea in the email thread). He responded that he normally codes day-to-day in C++, and generally distrusts the language, so he tries to be as explicit as possible.

And that got me thinking about the concept of trustworthy languages.

It seems like sometimes we're so focused on adding features related to many other engineering aspects and goals that we never ask the question: "at the end of the day, do you trust this language?" Before we can really answer that, it's important to understand why you wouldn't trust a language. I think some factors are:

  1. The interpreter/compiler generates logic that is counter-intuitive. This is often the case when the language tries to do the thinking for you, and allows you to skip being explicit by providing some default behavior if you don't supply specifics. The problem occurs when the compiler (by default) does something you don't expect, thereby causing unforeseen runtime behavior or errors.
  2. The language implicitly resolves/casts/converts types, even when it's not "safe". For example, what will the code actually execute given var X = "100" + 4? Will the result be "1004" or 104 or NaN? This, of course, is an easy example, but in more complex logic, the type resolutions can really bite you in unexpected ways (see the short C# illustration after this list). This is also a problem with languages that play fast and loose with pointers. In this category, I'd place most scripting languages like Javascript, as well as VB with Option Strict OFF, and also C. In this respect, languages like Java, C#, and VB with Option Strict ON do pretty well.
  3. The language supports inheritance and OOP, but the rules of member resolution aren't clear. For example, let's say that you have a base class that defines a member, and you redefine it three subclasses later. Will the new member shadow the base class member, or replace it? "Good" languages force you to specify, but "bad" languages just let you define members however you want, and the compiler or runtime decides what to do later. Or even better, what happens if you have multiple inheritance and inherit from two base classes, each of which contains a member with the same name and arguments? Similarly, what if you can, on top of all this other stuff, create AOP-style mix-in kinds of members? If a mix-in member clashes with a base-class member, who wins?
  4. The code just crashes... a lot... for no apparent reason (*cough*ActionScript*cough*)
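For a taste of what consistent, predictable conversion rules look like, here are two C# expressions that appear almost identical but resolve very differently – the point being that C# at least resolves each one the same way every time, by well-defined rules:

    using System;

    class ConversionDemo
    {
        static void Main()
        {
            string a = "100" + 4;  // int converted to string: "1004"
            int b = '1' + 4;       // char promoted to int (49 + 4): 53
            Console.WriteLine("{0} {1}", a, b); // prints: 1004 53
        }
    }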

This doesn't mean that languages can't have default unspecified behavior, but it does mean that languages should be consistent and intuitive when doing default, unspecified things. And if people can't agree on the correct behavior, then it's not intuitive, and therefore untrustworthy. The more consistent and intuitive the results, the more the programmer will trust the compiler with decision making.

I Can See Your Email from My House in Alaska

I should have probably blogged about this earlier, but this has been a really busy week for me on my project. Shortly after my latest post on passwords, I had several talks with associates. Quite a few of them were ribbing me about being too security-conscious. As if there is such a thing as too much security.

Then, right on cue, as if to prove my point, Sarah Palin's email account got "hacked". I say that in the loosest, pop-culture sense of the word, because it really isn't a hack. A hack is someone disassembling a backbone router's operating code and discovering a buffer overflow flaw they can use to inject code that takes over the router. In this case, someone simply stole Palin's email credentials in a way that could have been accomplished by a savvy sixth grader. And that's the whole point - hacking into a computer to steal info is really, really hard, provided the system is well locked down. That's why today's ID and data thieves don't hack code - they hack people. Social engineering is far easier to do. So much easier that even non-techie ID and data thieves can do it. That's what makes social engineering attacks so dangerous. Instead of keeping an eye on a handful of alpha nerds with NPD, you have to watch out for thousands and thousands of con artists.

In case you missed the affair in question, someone got into Palin's Yahoo! email account. Did they use some kind of crypto-defeating stealth code, or a supercomputer bot network to crack the password? Nope. They simply guessed the answer to her "forgot my password" question. And, they did it so easily because the answer to the question was available to the public! If the alleged accounts are to be believed, it was the location where she met her husband. Anyone with an internet search engine and connectivity could have gotten it.

So the point is that your password is a means of authentication. And so is the answer to the magic reset question! But if you're going to provide a backup or replacement authentication mechanism, it needs to be at *least* as secure as the primary one! In this case, your password is only as strong as the answer to the question. You wouldn't publicly post your password, so why publicly post the answer to the question?! Or perhaps more appropriately, why select a question whose answer is available in public?

I guess it all boils down to my assertion that the question-answer password reset mechanism is inherently bad. Good password mechanisms force people to select good passwords, but question-answer mechanisms seem to cater to a person's natural tendency to be lazy and choose the path of least resistance... which is exactly what most ID and data thieves would love for you to do. They are really only in business because they can depend on so many people behaving that way.

In short, shame on Sarah Palin for using a public email system for government business. Shame on Yahoo! for using this absurd mechanism for password resetting - or, on the other hand, if they feel they *have* to use such a system, then shame on them for allowing this sort of question. I would say shame on the idiot who broke into her account, but let's face it, people who do this on a regular basis really don't care.

Now, as it turns out, there is a real possibility this person isn't a professional con or ID/data thief, but just someone who wanted to take a sneak peek at her emails to see if there was anything politically hot in them. That's not to excuse them in the slightest - they committed a *criminal* act. But the fact that an ordinary person who may not be a pro at breaking into systems can do this should be a frightening enough reminder to take what I said previously seriously. Don't add yourself or your app to my wall of shame.

Prognostication.NET

"C#, .NET, Visual Basic, FoxPro, COM."

"What are Microsoft technologies that should have died by now?"

It's really interesting how the development community, for a community of engineers, is so enthralled with and passionate about subjective analysis and fortune-telling. I recently ran into a few online conversations with some long-time friends that led me to think back to around 2001 to 2003 - a time rife with predictions about Microsoft technologies - and how most of those predictions never quite panned out.

After the J++ lawsuit debacle, Microsoft decided to take Sun head-on with a brand new (yet another) C-derived language: C#, and its accompanying runtime and base class library. Right from the get-go, Microsoft had a hard time making people take the language seriously, even having to spend an inordinate amount of time explaining how the correct pronunciation was "see-sharp". Well, after 8 years, C# is alive and kicking. And so is the .NET stack. Few Microsoft shops would rather develop solutions in anything else these days. Interestingly enough, while C# borrowed much from other environments (including Java - and not forgetting that Java itself is heavily derivative), Java has in the past few years adopted many things from the C# language, runtime, BCL and IDE.

VB has had a standing obituary for over a decade (which never quite proved to be reliable), but never a more pronounced one than during the first .NET release. People who always hated VB and "classic" VB aficionados alike derided the new "abomination" called VB.NET and perfunctorily declared Visual Basic dead. For some who were passionate about classic VB (and even classic BASIC, if ever there was such a standard), it was not merely that the VB flavor they knew and loved was dead and replaced by this thing so-called Visual Fred, VB.NOT, B#, or C-flat, but also that the new critter in VB clothing would be dead before long as well. Nothing could be farther from the truth. VB in its current incarnation is still the leading Microsoft development language (in terms of market share), beating both C++ and C#, and C# by a landslide at that.

And while Microsoft, in a series of (monumental) marketing and technical blunders and omissions, has shut VB out of certain niches, VB's market share still rivals that of Java according to Forrester, much more so than C#.

Of course, the opposite was also stated - that classic VB would die before long. Perhaps the feeding tube and respirator have been removed, but VB6 has shown a stubborn resilience and unwillingness to fade into the sunset. To me, that's hardly a surprise and only validates what I've said all along - that any development technology that is sufficiently adopted simply can't disappear in a short period of time. Chances are that one of your banks/credit-institutions/card companies/utility companies is still running a 15+ year old COBOL application.

FoxPro. OK, Fox is gasping its last breaths, even if you can't exactly call it dead. It's not really feasible to see it any other way. Unlike VB, FoxPro won't get a reincarnation. However, it does leave behind a legacy. It seeded many of the ideas behind some of the SQL Server tools, as well as being an inspirational source for LINQ.

For some who were absolutely giddy about .NET, there was the prediction that all other non-.NET development for the Microsoft platform would fade away - most specifically, with regards to COM. Look, COM is most certainly a pain in the backside, but I always said it was far too ingrained near the heart of Windows to ever be removed. Like Iron Man, the best you can do is isolate yourself from the shrapnel with a shiny piece of technology. And MS isn't abstaining from developing new things with COM either. For SQL Server 2005, the entire SSIS pipeline was built specifically for... you guessed it, COM.

Having said all that, there are some things that have indeed bought the farm. Most of them were first stabs that proved somewhat inadequate. The biggest example I can find is Remoting. Not that there was too much wrong with the technology, but it simply wasn't as encompassing or as flexible as its successor, WCF. Of course, Remoting is more like undead than dead (it's still fully supported in the framework, but won't get any fixes or additions), still limping around in (now) legacy applications until someone gets the time to migrate it over to WCF. And WCF is a bit overwhelming for the average crowd, with all its buttons and levers. Of course, if you can complain about WCF's complexity, Remoting is probably not your thing either.

A number of binding technologies seem to be destined for the same fate as well. I don't know if it's just me, but binding is always in a state of flux and nobody ever seems to *finally* get it right once and for all - a state of affairs that I'm sure leaves the anti-binding group in a standing ovation.

And moving forward, I will throw my own prediction into the hat. It's hard to see LINQ to SQL surviving Entity Framework. It would be a ridiculous situation to have LINQ technology and yet have no way to query into a database with it. LINQ to SQL seemed to be a ready candidate to fill that hole for the initial release. But it's an extremely poor OR/M, if I'm even allowed to call it that, and is rife with limitations that make it impractical for a lot of scenarios. EF, on the other hand, just isn't finished. It's an ambitious vision to be sure, but once it's done, you will have a much more flexible system that connects to multiple data sources (not just SQL Server), plugs easily into custom entities, and removes the lame 1-to-1 mapping between tables and entities, all while keeping a simple set of similar designers. In other words, it will effectively do everything LINQ to SQL does and much more. It's also being worked into the guts of ADO.NET Data Services as well.

Handling Passwords

There are two reasons I’m writing this post. First, I’ve noticed a slew of articles and blog entries lately about the topic. That’s good, from the perspective that it indicates people are taking the topic seriously, and it also helps get the word out. Second, I’ve noticed that the authors often have incomplete and/or somewhat inaccurate information, which I’m sure they got from reading someone else’s incomplete and/or inaccurate material. That doesn’t necessarily make it bad, but they write the material with an authoritative tone – because, you know, blog authors are all leading experts in their fields, including fields they decide to wander into on any given day for the sake of a post.

For those of you who missed the sarcasm: don’t believe everything you read on the internet, no matter who it comes from. Sounds like common sense, but that includes your favorite bloggers too, no matter how sharp you think they are. And on that note, I’ll give my standard security warning – although I’m pretty handy with math, I do not consider myself a true cryptanalyst. I am a hobbyist, enthusiast, and avid studier of cryptography. While some consider that to make me something of an expert, I merely study the work of great cryptanalysts, and I would prefer to be seen as just a guy who’s about to show you some of the things you need to know. So I encourage you to use this blog post the way you should use any blog post – as a stepping stone for further research, rather than the end-all and final word on the subject. The key being that when it comes to security, most developers just don’t know what they don’t know. Finally, the world of cryptology is steeped in detailed jargon and math formulas. This level of detail is absolutely necessary, because in security, the devil is in the tiniest detail (and there are many). However, that level of detail also puts off about 90% of software developers. But since they are the ones who are entrusted with securing software, I will try to keep things at a fairly understandable level. Also, there are a few recent news items related to this topic that just begged to be blogged about. So let’s dive right in.

The Importance of Passwords

Authentication is a critical operation for a vast amount of software. Systems need to know who any given user is. All of the user’s access to any given system, what they can and can’t do or see, is determined ultimately by who the system thinks they are. Password authentication works on the principle that a user knows a password that nobody else knows. The software has a record that lets it determine if a given user enters the right password. If they do, the software can be reasonably sure the user is who they claim to be. But in order for that to be effective, you need two things:
1) The password MUST be something that only the user knows. That means that a user must never divulge the password to ANYONE, including operators of the software.
2) It must be practically infeasible for someone else to get hold of, or guess, the password.
In order to get #2 right, the user is responsible for using a password that others can’t feasibly guess. This relates to password strength, which I’ll get to in a minute. It also means that the software must take precautions to forbid anyone but the user from seeing or otherwise getting a hold of the password. This means you should never, EVER, store a password in plain text. Ever. It doesn’t matter who you think has access to the data or not, stuff happens, and before you know it, your entire customer base’s data is in the hands of a malicious data thief. This has happened to MANY large corporations (for example, Citibank) despite their security measures. Besides being bad for business (most states are starting to require companies to publicly divulge when this sort of data compromise happens), some states now have laws that hold you (the data keeper) accountable and liable for data loss and compromise. That means users may be able to sue you if you didn’t do enough to protect their sensitive data.

So why not use something else for authentication? For example, we know DNA, digital fingerprinting, possibly retinal scanning, and other biometrics are unique to a person, and not easily compromised or duplicated. Well, if you lose your finger or eye, it’s gone. If your DNA is ever compromised, you can’t really change it the way you would change your password. It’s compromised forever and you’d be stuck with the option of never doing computerized banking or shopping ever again (which I’m sure would be the least of your worries at that point). So for now, passwords are the “in” thing.

Password Strength

Because this system relies on the premise that only one person knows their password, you, as a user, must be responsible for creating passwords others can’t guess. With just 48 8-bit characters, the possible combinations of letters, numbers, and symbols (before you even touch Unicode) are unfathomable. But there are only a few hundred thousand actual words in any given language, so don’t use real words as passwords. Computers can run through an entire dictionary in less than a second, so breaking such passwords with what’s called “brute force” (checking all possibilities) is extremely easy for a computer. Furthermore, if you use real words, a computer doesn’t have to check all possibilities in some situations (more on that later). But on a computer that can check 1 billion combinations per second, it would still take trillions upon trillions of years to go through all the possible combinations of 48 characters (assuming you could use up all 8 bits in each character). Also remember that as far as brute force attacks are concerned, security improves exponentially (not linearly) with password length. For example, doubling the length of a password doesn’t double the number of combinations, it squares them.
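A quick way to see that exponential growth, using the 62 mixed-case letters and digits as an example alphabet:

    using System;
    using System.Numerics;

    class SearchSpace
    {
        static void Main()
        {
            // An L-character password over an alphabet of N symbols has N^L
            // combinations. Doubling L squares the count, it doesn't double it.
            BigInteger eightChars   = BigInteger.Pow(62, 8);  // ~2.2 x 10^14
            BigInteger sixteenChars = BigInteger.Pow(62, 16); // ~4.8 x 10^28, i.e. (62^8)^2
            Console.WriteLine("{0}\n{1}", eightChars, sixteenChars);
        }
    }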

So as far as security goes, you MUST have sufficiently long passwords. I’ve seen systems that limit passwords to a small number of characters (around 6 or 8 or so), and this is just not good enough. Anything less than 16 is probably asking for trouble – because on top of the number of bit combinations, people typically only use a limited set of values for each 8 bits (i.e., each character, assuming non-Unicode text). If you only use real words, then a potential data thief has an extremely narrow set of combinations to work with, so it’s important to really mix it up with as many combinations of numbers, letters, and symbols as possible. Also important, for both users and software developers, is that passwords and validation algorithms must be case-sensitive. In English, with a 26-letter alphabet, you have 52 possible alphabetic characters (upper and lower case), but if your algorithm is case-insensitive, you’ve thrown out 26 possibilities for each 8-bit chunk... which drastically reduces the overall effectiveness of the string when considering combinations. Since chances are you are already working with a set of characters that doesn’t use up all 8 bits to begin with, you really cripple the system by removing uniqueness based on character case.

This of course, must be balanced with memory (human, not RAM). A password is absolutely no good if the user can’t remember it. So it must be easy enough to remember, and difficult enough to foil guessing. Arbitrary strings of characters usually end up being one or the other, but not both. A lot of passwords are stored in a database, and that becomes one of the most popular reasons for limiting the size. However, there are some techniques which do not limit the size of the password – effectively making them as long as you want (even if they are stored in a database). For such systems, a good alternative is to have a pass phrase. Just like passwords, it’s important to mix things up with letters, numbers, and symbols, but you can use an entire sentence of text (although, you’d probably want to avoid using too many “real” words). It’s easier to remember a phrase of large text than it is to remember an arbitrary albeit smaller string of characters. Unfortunately, few systems allow you to enter as much password text as you want, even if they are capable of storing much larger phrases. However, it is definitely something to consider if you are about to implement a new system. As far as resilience against brute force attacks go, the longer the better.

Storing Passwords

If you are thinking of storing passwords in plain text, just hang up your hat and go home. Leave the system unfinished. It’s simply not worth the future trouble and liability you are subjecting yourself to. That leaves you the option of somehow scrambling the password so that even if the data is stolen, people won’t be able to discover anyone’s password. At this point, I really need to re-iterate rule #1 of encryption: do not EVER invent your own data protection algorithm. I know you think you are smart, have an advanced degree, and have been programming since [insert your favorite archaic system here]. But unless you are an expert cryptanalyst, you will regret it. Data thieves don’t care about your pedigree, and as smart as you think you are, the odds are definitely in their favor, not yours. In fact, a crypto-system’s strength isn’t measured by the intelligence of the person who came up with the algorithm. Many extremely smart people (I’ll wager smarter than both you and me put together) have created systems that ultimately failed at the hands of some malicious hacker. Secure systems are the ones that have withstood many years of attacks and analysis. That’s what you should go with – a standard that has a proven track record with no vulnerabilities found.

Now is a good time to point out there are fundamentally two types of data protection primitives in use: encryption and cryptographic hash codes. The main difference between the two (although many details differ) is that encryption entails an algorithm that can both encrypt and decrypt text – that is to say, it can encrypt a chunk of plain text, and subsequently, it can decrypt the scrambled data back into plain text. On the other hand, cryptographic hash codes are one-way. You can’t “decode” a hash result back into the original plain text.

There are many encryption schemes and algorithms out there, but for the sake of this conversation, let’s just say that all the useful ones require keys. These keys must be securely stored. If an attacker gets hold of the keys, they can decode your entire database of passwords, and at that point, the passwords might as well have been in plain text to begin with. Effective “key management” is very difficult, and I’ll postulate that you shouldn’t have to return scrambled passwords back to plain text anyway (more on that later).

Cryptographic Hashes

Cryptographic hash algorithms have a few important properties:

  1. They take an arbitrary amount of text and turn it into a (usually smaller) fixed-length code. Among other things, this makes cryptographic hash algorithms ideal for pass phrases. No matter how long the phrase is, the resulting hash code is a relatively small and fixed size (say, 256 to 512 bits, or 32 to 64 bytes).
  2. Running the same piece of plain text through a cryptographic hash algorithm will always result in the same hash code.
  3. They should be as collision resistant as possible. That means that two different strings of plain text shouldn’t result in the same hash code. This is very important and I’ll discuss it later as well.
  4. The algorithm shouldn’t be reversible. If all you have is the resulting hash code, it should be mathematically infeasible to run the algorithm backwards (or run some variation of the algorithm or any algorithm at all for that matter) that changes the resulting hash code back into the original plain text.
  5. Changing the plain text input, even a tiny little bit, will cause significant changes in the resulting hash code. You shouldn’t be able to approximate the original text by using similar text, as they should result in wildly different codes. There should also be no resulting patterns where the occurrence of a sequence in plain text can be deduced from a sequence in the hash code.
From these properties, we can concoct a hash-based password scheme. When a user creates a password, the password is run through a hash algorithm, and the system stores the resulting hash code. When the user re-enters their password for authentication, the system runs the newly-entered password through the same hash algorithm, and then checks the result against the stored hash code. If both match, then we know we have the original password (due to properties 2 and 3 above). Note that we never stored the actual password, and because of property 4, nobody should be able to algorithmically derive the original password from the stored code.
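In code, the bare-bones version of that scheme looks something like this sketch (SHA-256 here, and deliberately missing the safeguards the rest of this post is about, so don’t stop reading here):

    using System;
    using System.Security.Cryptography;
    using System.Text;

    class NaiveHashScheme
    {
        // Store only the hash code, never the password itself.
        static string HashPassword(string password)
        {
            using (SHA256 sha = SHA256.Create())
                return Convert.ToBase64String(
                    sha.ComputeHash(Encoding.UTF8.GetBytes(password)));
        }

        // Re-hash what the user typed and compare against the stored code.
        static bool Verify(string enteredPassword, string storedCode)
        {
            return HashPassword(enteredPassword) == storedCode;
        }
    }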

Wow! That was easy. Well, not so fast, cowboy. We still have a number of vulnerabilities to account for. In order to secure a system, you have to know how people will attack it. You can buy and install the thickest titanium-steel door with the most complicated tamper-proof lock, and it will do absolutely no good if someone can just go around to the back and open a window to get in. That is what cyber attackers live for, and they are much better at this than most of us. If you only took the preceding section into account, you would be in for a world of hurt, so it would be a good idea to see just how attackers try to get past the system.

When Dictionaries aren’t Your Friend

The number one reason to avoid real words is that a brute force attack normally has to consider every possible combination. But if you use normal words (and most people apparently do), the attacker can just look at a much smaller set – about 500,000 or so possible words in the English language, for example.

If you just relied on the scheme I detailed above, then the attacker doesn’t even have to resort to brute force. They can use the so-called dictionary or rainbow table attacks. Basically, someone sets up a dictionary of hash codes for every possible word (more complex dictionaries have combinations of text, numbers, etc.). That way, if an attacker gets the stored hash code, their software just looks up the hash code in the dictionary or rainbow table, and voilà, they have the corresponding plain text (and it took considerably less time than 13 trillion years).

“So what,” I hear some of you say, “nobody has access to the password storage.” Think again. This is exactly the sort of attack that set entire UNIX networks on fire decades ago. The older Microsoft LanMan scheme suffers from the same problem. Many banks, online shopping companies, and even government agencies have had their data stolen in the past couple of years alone. It can happen to you. In fact, I’ve worked for two companies that hired me after having compromised data.

Salt and Speed Kills Rainbows

In order to foil these attacks, you have to produce hash codes such that pre-generation is infeasible. The first step is what’s called salt (or a nonce). This means you append a piece of random data to the password before hashing it. This helps stop even common words from being susceptible to dictionary/rainbow table attacks, though it’s still a good idea to avoid common words, since attackers can just use them in brute force guesses. The salt doesn’t have to be very long. Due to hash algorithm property #5 above, even a small change to the input drastically changes the resulting code. However, the salt should be of a significant enough size. Let me put it this way – if your salt were 1 bit (0 or 1), then the attacker would only have to double the size of his dictionary or rainbow table (to account for all the possibilities with either a 0 or 1 appended). That’s not bad in terms of the size of the data the attacker needs. However, each additional bit of salt exponentially increases the combinations (just like increasing the size of the password itself – same concept). At a certain point, it becomes totally infeasible to pre-generate a table of possible values.

In fact, most commonly available (albeit older) rainbow tables have up to 8 characters, so generally-speaking, your passwords should always be more than 8 characters (although I seriously recommend at least 16), and you’ll weed out most of the wannabe attackers out there.

More importantly, each hash code should have its own UNIQUE and COMPLETELY RANDOM salt value. Do not compute the salt based on the input or use the same salt for every password. This makes the salt value predictable and helps the attacker narrow the amount of data they need for a dictionary or rainbow table. The introduction of unique and random salt or nonce also means that two people using the same passwords will have wildly differing hash codes stored in the database, which is definitely a good thing. Ideally, the salt or nonce should be created using a cryptographic random number generator, which is vastly more unpredictable than a standard time/seed-based generator.

Of course, in order to use the password + salt, you need to validate using the same salt value later (when the user enters their password for authentication). Fortunately, the salt doesn’t have to be a secret. You can just store it with the password hash code.

If you think salt isn’t enough, there are other strategies you can use in addition to salt or nonce. One of them, which I’m pretty fond of, is iterations. This means repeatedly hashing the password + salt. Basically, you take the initial password plain text, add the salt, and hash it. Then, you take the password again, add the salt and the resulting hash code from the previous round, and hash that. Then repeat a few thousand times (or more). If salting helps to deter lookup attacks, iterations make them almost impossible with current technology. Some schemes like MD5-Crypt use 1000 iterations, but I prefer 10,000 or more. Even that number is relatively fast to compute (less than a second), and you only pay that cost when someone creates a password or attempts a login. This also makes brute force attacks on the input that much slower.
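Putting salt and iterations together, here’s a sketch using the framework’s PBKDF2 implementation (Rfc2898DeriveBytes). The salt size, iteration count, and code length are illustrative choices, not gospel:

    using System.Security.Cryptography;

    class SaltedIteratedScheme
    {
        const int Iterations = 10000;

        // Create: generate a unique random salt, then derive the code to store.
        static void CreatePassword(string password, out byte[] salt, out byte[] code)
        {
            salt = new byte[16];
            using (var rng = RandomNumberGenerator.Create())
                rng.GetBytes(salt); // cryptographically random, unique per password

            using (var kdf = new Rfc2898DeriveBytes(password, salt, Iterations))
                code = kdf.GetBytes(32); // store salt + code; the salt isn't secret
        }

        // Verify: re-derive with the stored salt and compare in constant time.
        static bool Verify(string entered, byte[] salt, byte[] storedCode)
        {
            using (var kdf = new Rfc2898DeriveBytes(entered, salt, Iterations))
            {
                byte[] code = kdf.GetBytes(32);
                int diff = code.Length ^ storedCode.Length;
                for (int i = 0; i < code.Length && i < storedCode.Length; i++)
                    diff |= code[i] ^ storedCode[i];
                return diff == 0; // no early exit, so no timing leak
            }
        }
    }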

Birthdays Ruin Security

Here’s a neat little bar game that people have been playing for a long time: walk into any bar (or office, or anywhere that has more than 20 people) and bet that at least two people there have a birthday on the same day. Some of you are probably thinking this is a really fast way to part with your hard-earned cash, and wondering what kind of dork would attempt that. But in a group of 23 people, there is a slightly better than 50% chance of that happening! If the group is 57 or more, the chance is better than 99%. For the math behind this little scam, look up “Birthday Problem” on Wikipedia (it’s a fun little exercise, but this post is getting long as it is).
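If you don’t believe those numbers, here’s a ten-line check you can run yourself:

    using System;

    class BirthdayProblem
    {
        // Chance that at least two of n people share a birthday: one minus
        // the chance that all n birthdays are distinct.
        static double Collision(int n)
        {
            double allDistinct = 1.0;
            for (int i = 0; i < n; i++)
                allDistinct *= (365.0 - i) / 365.0;
            return 1.0 - allDistinct;
        }

        static void Main()
        {
            Console.WriteLine(Collision(23)); // ~0.507
            Console.WriteLine(Collision(57)); // ~0.990
        }
    }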

The principle behind this is called collision counting. In this case, it’s how many birthdays collide (fall on the same day) given the number of combinations. For cryptographic hashes, the same principle applies, but on a much, much larger scale (after all, there are only 365 days in a year for birthdays). Hash algorithm property #3 from above states that they must be as collision resistant as possible, but mathematically speaking, collisions must exist by definition: the algorithm maps arbitrarily long input onto a fixed-length code, so there are vastly more possible inputs than outputs (the pigeonhole principle). A collision here means that two different pieces of plain text input result in the same hash code. A good hash algorithm makes this virtually impossible for most input you would care about, but physically it might still happen, and the math says it eventually will. Clever attackers can try to use the math behind the birthday collision problem to narrow down a brute force (the so-called birthday attack). In that case, they don’t have to find the one piece of input that results in the code – they just have to find any one of the two or more possible inputs that will also succeed.

There is very little you can do here except avoid hash algorithms that have a mathematical flaw which allows for the possibility of a feasible collision. Even collisions at the block level reveal weaknesses that can be exploited. Practically speaking, this means you probably want to avoid MD5 and SHA-1 and anything that came before them. In addition to the potential collision problems, their resulting codes are also relatively small. A larger hash code usually helps in many ways, and I would recommend something that’s at least 256 if not 512 bits long (the SHA-2 suite has variations that produce those sizes).

More the Merrier?

So if one hash algorithm is good, two should be awesome, right? Not necessarily. Combining two hash algorithms doesn’t make the result twice as strong. If you hash a password with MD5 and then again with SHA-1, and concatenate the results, there are still inherent weaknesses. Antoine Joux postulated the concept of multicollision, and how concatenating hash codes doesn’t really make them stronger than their component parts. Google “Joux Multicollision” if you want to see all the gory mathematical details.

What about hashing the password with one algorithm and then feeding that into the other algorithm? Still no dice. You’d be at the mercy of the strength of the first algorithm. If the first algorithm was susceptible to and experienced a collision, then it would simply feed the same output to the second one – which almost makes the second one only as secure as the first one.

Bottom line is that it’s best to stick with salt or nonce + iterations, and pick a nice, secure, and proven hash algorithm. And if you are using a premade authentication system, then use these tips as criteria to evaluate the package.

Keyed Hashes

There’s a class of hash algorithms that depend on a key in addition to a hash algorithm (HMAC). Many block ciphers (a type of encryption algorithm) can also be used in a way that effectively makes them a hash algorithm, and they would also use a key as well. But again, the problem is that you have to deal with effective key management. There simply isn’t enough space on this post to deal with that topic. But definitely research it if you are curious.
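For the curious, here’s what a keyed hash looks like in .NET – HMAC-SHA256 in this sketch. Note the key is a placeholder; producing, storing, and rotating it is exactly the key management problem just mentioned:

    using System.Security.Cryptography;
    using System.Text;

    class KeyedHashDemo
    {
        // The resulting code depends on both the input and the secret key,
        // so an attacker without the key can't precompute lookup tables.
        static byte[] KeyedHash(byte[] key, string input)
        {
            using (var hmac = new HMACSHA256(key))
                return hmac.ComputeHash(Encoding.UTF8.GetBytes(input));
        }
    }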

Forgive and Forget is Divine, but not Secure

So what happens when some user (as is inevitable) forgets their password? A lot of web systems have an “I forgot my password” link, which does a number of different things depending on the site. The worst offenders email your password back to you. This is the most horrible idea ever. You might as well spam the user’s password all over the net. First of all, that email rides around the net in plain text, and anyone sniffing traffic can intercept it. Secondly, almost every mail server along the path may now have a logged copy of that email.

Unfortunately, forgetting a password is something that, when handled properly, is going to be an inconvenience to the user. But you know what? That’s OK. A small inconvenience is a small price to pay for forgetting your password in comparison to losing your bank account or an account that has a stored credit card attached to it (which would be a whole different security post...).

Part of the reason systems emailed passwords back to the user was that they figured if you are someone malicious attempting to snag the account, the password would only go back to the email of the person who originally registered the account. But as you can see, that really isn’t true. The password just gets smeared across the internet. In a recent breaking story, it turns out that it’s much easier than previously thought to redirect internet traffic due to a protocol security flaw, and you wouldn’t even know it. So don’t think that it’s unlikely someone would be able to see that email.

Who’s your daddy?

Another option is to reset, or to allow the user to change, the password when they forget. Of course, since the person attempting this operation is anonymous (after all, they’d be logged in if they could remember the password), you have to be careful not to let just anyone reset or change anyone else’s password. You still have to be reasonably sure the person making the request is in fact who they claim to be.

A lot of sites these days force a user to record an answer, which supposedly only they know, to some kind of “security” question. Make no mistake; this is just another method of authentication, and one that is loaded with weaknesses. Since answering the question grants you access to the system – either immediately, or by allowing you to change the password – it is just another password in and of itself. However, this “password” (the answer) is usually short, a normal word, and directly linked to a single question. It is infinitely easier to brute-force! Remember that titanium-steel door? Well, you just opened the window and hung a “welcome” sign on it.

You don’t have to be a hacker to exploit this system weakness. A recent divorcee discovered that she kept failing to log into her email. She kept typing in what appeared to be the wrong passwords, and kept having to reset them. It turns out that her soon-to-be-legal-ex kept changing the password on her by using the “security” question, and had stolen all her private communications with her divorce lawyer. The ex did know quite a bit about her life - after all, they were married for many years, so chances were definitely in his favor that he’d know many of the common security question answers she would use. Savvy users will type in answers that make little sense to anyone else, but the questions really lead the average person to type in the worst and weakest kind of entry keyword possible. It’s a horrible system to get people accustomed to.

You are the Weakest Link. Goodbye.

And the sad part is that you don’t even have to be foiled by a close relative or spouse these days. Just about anyone can get a hold of everything they need to get into your account via the security questions. Why? Because YOU are GIVING it away to everyone!

In almost any system, it’s often not the data or the encryption or any code that forms the weakest security link. It’s the people. Data thieves often resort to social engineering attacks rather than attacking the software itself. You are a lot easier to crack than most software if you aren’t careful. That is especially true today in the horribly insecure Web2.0 world.

See, social networking sites are like crack. People check MySpace and Facebook more often than they check on their children. People collect friends like kids collect Pokemon cards. Unfortunately, in the quest to acquire more friends than Jesus and be the most popular kid on the net, you happily publish every last little thought in your skull and every last detail about your life: how you’re feeling, and what you’re drinking or listening to at the moment. You know that everyone on your friends list is lapping up every inane scrap of brain vomit throughout the day and thinking about just how cool you are and how lucky they are to know you. Of course, they aren’t, but it makes you feel good to think so. But those data thieves are absolutely riveted! Hey, you have fans after all!

That Yorkie fanatic next door, the one with three of the yappiest four-legged mop-heads on the planet – what’s her security question? “What is your pet’s name?” And what’s her blog loaded with? Yep, you got it. Wonder how much money she has in her account today. And the funny part is that the thieves don’t even have to pretend to be Nigerians with a few extra million coincidentally sitting in a dusty ownerless vault, waiting for you to act now and help them transfer it to your account.

Sadly, it’s not just the average Joe who gets suckered. The crack is so potent that it happens to the experts. At Defcon (a hacker convention) in Vegas this year, some of the attendees were friended by (people pretending to be) a number of big-shot prestigious hackers – the kind who once got arrested for speaking about some government software vulnerability, who lurk in the depths of secret security mailing lists and forums, known to the world only by their net nicknames, who write crypto-defeating math equations that make your brain bleed and the NSA wet their pants. The attendees were more than giddy to accept. And then the fun began. You can read all about these social engineering attacks on social networking sites on the official Defcon site.

As far as the secret info you use for authentication goes, it’s best to heed rule #1 of information security – Trust No One.

Wolf in eBay Clothing

The other major social engineering attack on passwords is phishing. This is where some data thief sends you an email that looks like it came from eBay (or a bank, or anywhere else with an account you really don’t want to lose), usually telling you that your account is suspended, and directs you to a link where you are supposed to log in and verify that you haven’t done something horribly wrong. Instead, the link takes you to some Malaysian website (which is probably also compromised), where a fake login page that may or may not even contain the word eBay sends your password to a script that feeds it back to the data thief. Of course, if your account really were closed or frozen, how would you be able to log in at all? Oh well, minor details. But phishing is effective, especially among people who have internet access but aren’t very net savvy (like a presidential candidate who shall remain nameless, or granny who just uses it to email the family pictures around). It’s effective because they trust that eBay wouldn’t lie to them, and that the page they are going to is in fact owned by eBay.

Server Mug Shots

The problem is that, up to now, apps (especially web apps) have mostly handled one-way authentication. In more secure environments, you have mutual (two-way) authentication: not only does the software have to know beyond doubt who you are, it also has to identify itself beyond doubt to you.

Up to now, people simply trusted the URL, or the browser’s interpretation of the server’s SSL cert validity. But a lot of people don’t even look at that; it simply isn’t obvious to non-savvy users. And even if you think you can tell that the URL really goes to eBay.com rather than www.istealidentities.com/ebay.com, that doesn’t mean someone isn’t re-routing your traffic. It might seem like an elaborate trick to pull off, but it isn’t that difficult for people who know how to exploit routing protocol flaws, or when there are compromised servers/routers/switches at an ISP.

One of the schemes becoming popular now is image authentication. When a user creates an account, they select a picture from a large gallery of simple everyday items. Only the software knows which image they selected. When they come back to log in, they type their username, but not their password. The server then identifies itself by showing the user the image they previously selected. Now the user can type in their password, knowing (reasonably so) that the software isn’t some phishing site. Phishing sites typically aren’t complicated or dynamic enough to even show a different image per user, and more importantly, they shouldn’t know which picture the user selected. Of course, the number of pictures directly relates to the strength of this authentication, and perhaps it would be better to use phrases, although I suppose the argument is that pictures are easier to remember. The problem is that most systems have no really visible server/software authentication to speak of, and people simply aren’t used to it (yet).
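
Just to illustrate the flow (not any particular vendor’s implementation), here’s a hypothetical two-phase login sketch; every class, method, and file name in it is made up. The important property is that the server reveals the user’s chosen image before any password is typed, and shows a consistent decoy for unknown usernames so attackers can’t probe for valid accounts.

C#
using System;
using System.Collections.Generic;

// Hypothetical sketch of image-based mutual authentication.
class ImageAuthService
{
    // username -> id of the image the user picked at registration
    private readonly Dictionary<string, string> chosenImages =
        new Dictionary<string, string>();

    public void Register(string username, string imageId)
    {
        chosenImages[username] = imageId;
    }

    // Phase 1: the user submits only a username, and the server proves
    // itself by returning the chosen image. If the wrong image comes
    // back, the user should refuse to type their password.
    public string GetServerProofImage(string username)
    {
        string imageId;
        if (chosenImages.TryGetValue(username, out imageId))
            return imageId;
        return "decoy.png"; // consistent decoy for unknown usernames
    }

    // Phase 2: only after recognizing the image does the user send the
    // password (verification itself is out of scope for this sketch).
}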

Don’t Come a-Knockin’

The basic point of all these bits of trivia and attack modes is to ensure that you build a system that is really up to par on password security. The system should strive to eliminate every attack mode except brute-force guessing of the password itself, because under the right circumstances, attempting all possible combinations could take longer than the earth has been in existence.

But when it comes to brute-forcing a password, you can also set up another barrier: lockouts and notifications. It’s common for people to get a password wrong on any given attempt; they might forget whether a character is “3” or “4”, or maybe Caps Lock is on. But if someone guesses incorrectly 500 times in a few seconds, it’s probably an attack. To really stall brute-force attacks, you need a lockout. After some small number of failures (say 4 or 5), the system stops authenticating that user, period. It doesn’t have to be a permanent lockout; you can re-enable authentication for the user after a few minutes (5 to 15, for example). This foils brute force by making it completely infeasible to run through large numbers of guesses: if it takes trillions of years to try all possible combinations when a server can run through a billion combinations a second, imagine being limited to 4 or 5 guesses every 15 minutes.

It also doesn’t hurt to notify the user that something might be up with the account; maybe a disgruntled ex-employee is trying to get into a system using a co-worker’s or admin’s account. And it pays to keep a log of such attempts, with notification so your I.T. team can investigate, and perhaps law enforcement as well, if it comes to that.
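
Here’s a minimal sketch of that lockout idea; the class name, thresholds, and window are arbitrary placeholders. A real system would persist this state and hook the lockout event into logging and notification.

C#
using System;
using System.Collections.Generic;

// Hypothetical lockout tracker: after MaxFailures bad attempts, refuse
// to authenticate the account until the lockout window expires.
class LockoutTracker
{
    private const int MaxFailures = 5;
    private static readonly TimeSpan LockoutWindow = TimeSpan.FromMinutes(15);

    private class Entry { public int Failures; public DateTime LockedUntil; }
    private readonly Dictionary<string, Entry> entries = new Dictionary<string, Entry>();

    // Check this before even looking at the password.
    public bool IsLockedOut(string username)
    {
        Entry e;
        return entries.TryGetValue(username, out e) && DateTime.UtcNow < e.LockedUntil;
    }

    public void RecordFailure(string username)
    {
        Entry e;
        if (!entries.TryGetValue(username, out e))
            entries[username] = e = new Entry();

        if (++e.Failures >= MaxFailures)
        {
            e.LockedUntil = DateTime.UtcNow + LockoutWindow;
            e.Failures = 0;
            // Also the right place to log the event and notify the user/I.T.
        }
    }

    public void RecordSuccess(string username)
    {
        entries.Remove(username); // a good login clears the counter
    }
}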

Acknowledgements-
1 UNIX is a trademark of The Open Group
2 LanMan and Microsoft are registered trademarks of Microsoft Corporation
3 MySpace is a property of NewsCorp
4 Facebook hasn’t sold out (yet)
5 Pokemon is a property of Nintendo, 4Kids Entertainment, and Pokemon USA
6 The Weakest Link is a property of The BBC
7 eBay still owns itself
8 Nigeria still boggles me. They have billions in funds that nobody wants, yet people are starving and they want to scam you for the lousy 2 grand in your checking account. And apparently they have no idea who William Shatner is either. Same goes for their Hong Kong affiliates.
9 Malaysia doesn’t boggle me as much. I have three friends from there. I just think the .MY top level domain is cool.

LINQ with DataReaders

Over the past couple of days, one of the MVP lists has had some activity about LINQ (which is pretty common these days), and one of the threads meandered into using LINQ on DataReaders. The problem is that LINQ works over IEnumerable<T> sources, and DataReaders don't implement that. Fortunately, it's easy to add, and several people out there have come up with various solutions, including Jim Wooley in this article of his. The basic gist is that you create an extension method on IDataReader that serves as an enumerator.

However, Jim encountered two problems. He is returning the data one row at a time (which is the point), but as an array of object. So first, he's worried about performance. This is a valid concern, considering the boxing/unboxing implications for all value type columns, as well as the fact that a LINQ statement might read those values several times (for sorting, as an example). Secondly, because all you have for a row is an array of object, you reference each column of the source as "row[0]", "row[1]", etc. - basically, you don't get to use nice column names in the LINQ statement.

The way to fix both of these problems is to have an object that represents the row, complete with specific, type-safe fields (properties, if you want to get technical and PC). Then you need a way to convert each DataReader row into an instance of that row class. Expanding on Jim's example, here's what I came up with:

C#
using System;
using System.Collections.Generic;
using System.Data;

// Knows how to turn the reader's current row into a typed row object.
public interface IReaderRow
{
    IDataReader Reader { get; set; }
    object GetRowData();
}

// Materializes the current row as a TimeZoneRow, reading columns by ordinal.
public class TimeZoneRowReader : IReaderRow
{
    public IDataReader Reader { get; set; }

    public object GetRowData()
    {
        return new TimeZoneRow() 
        {
            ID = Reader.GetByte(0),
            Name = Reader.GetString(1),
            Description = Reader.GetString(2)
        };
    }
}

public class TimeZoneRow
{
    public int ID { get; set; }
    public string Name { get; set; }
    public string Description { get; set; }
}

public static class DataReaderExtension
{
    // Enumerates an IDataReader as a sequence of typed rows; TReader
    // materializes each row. The reader is consumed forward-only, so
    // the sequence can only be iterated once.
    public static IEnumerable<T> DataReaderEnumerator<T, TReader>(this IDataReader source) where TReader : IReaderRow, new()
    {
        if (source == null)
            throw new ArgumentNullException("source");

        IReaderRow rowReader = new TReader() { Reader = source };

        while (source.Read())
        {
            yield return (T)rowReader.GetRowData();
        }
    }
}

As you can see, the LINQ statement is now typed and has intuitive, descriptive names for the columns. The downside, of course, is that you need a class representing each possible result shape, and a way to translate the DataReader data to that row instance. You could combine the row class with its specific row reader class, but I simply chose to keep them separate. Strangely enough, this starts to feel like LINQ to SQL or LINQ to Entities at this point, but it solves the two problems I mentioned earlier. In the end, the LINQ will now look something like this:

C#
listBox1.Items.Clear();

using (SqlConnection cn = new SqlConnection(connectionString))
{
    using (SqlCommand cmd = new SqlCommand("Select * from TimeZone"))
    {
        cmd.CommandType = CommandType.Text;
        cmd.Connection = cn;
        cn.Open();
        using (SqlDataReader dr = cmd.ExecuteReader())
        {
            var queryResults = from timeZone 
                in dr.DataReaderEnumerator<TimeZoneRow, TimeZoneRowReader>()
                where timeZone.ID > 0
                orderby timeZone.Name descending
                select timeZone;

            listBox1.DisplayMember = "Description";
            listBox1.DataSource = queryResults.ToList<TimeZoneRow>();
        }
    }
}

If you search around, you'll find quite a few other solutions (some of them in community projects) for using DataReaders as LINQ sources, although they all seem to share some of these traits. The really interesting stuff starts when you think along the lines of using combinations of IL Emit and anonymous classes to create all these row-related classes automatically (the schema is available on the data reader, after all). However, without getting too tricky, it's tough to find a way for C# or VB to understand the result shape at design time, so that you keep IntelliSense and column names without resorting to something like code generation.
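
As a taste of that idea, here's a sketch of the auto-mapping approach using plain reflection instead of IL Emit (slower per row, but much simpler); AutoEnumerate and AutoMapExtension are made-up names, not part of any library. It matches reader columns to same-named properties on T, so no per-result mapper class is needed, though you still lose compile-time checking of the column list.

C#
using System;
using System.Collections.Generic;
using System.Data;
using System.Reflection;

public static class AutoMapExtension
{
    // Maps each DataReader row to a new T by matching column names
    // to property names via reflection.
    public static IEnumerable<T> AutoEnumerate<T>(this IDataReader source) where T : class, new()
    {
        if (source == null)
            throw new ArgumentNullException("source");

        while (source.Read())
        {
            T row = new T();
            for (int i = 0; i < source.FieldCount; i++)
            {
                PropertyInfo prop = typeof(T).GetProperty(source.GetName(i));
                if (prop != null && !source.IsDBNull(i))
                {
                    object value = Convert.ChangeType(source.GetValue(i), prop.PropertyType);
                    prop.SetValue(row, value, null);
                }
            }
            yield return row;
        }
    }
}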

More P/Invoke Help

One of the first things I really hammered on with .NET 1.0 back in late 2000 was P/Invoke. I was used to Java, but IMO, Java always had a fatal flaw - language design hubris. Java is perfect and C is garbage, therefore everything should be converted to Java, or so it seemed anyway. That philosophy trickled down to how Java used existing C code - JNI (at the time) wasn't so much a way for Java to call C code as it was a way for *YOU* to write a "clean" and extensive wrapper worthy of being called by Java, so that Java wouldn't have to dirty its hands with the "bad" C code. At the time, I was extremely skeptical about .NET, but one thing I did like is that it readily admitted that the entire world was not in .NET, and that there was plenty of "legacy" code written in C that was... *gasp*... usable and useful! Furthermore, it cooperated with said C code and even had a very simple (relatively speaking) way of calling it. But a lot of people weren't used to P/Invoke, although it was similar in concept to using Declare statements in VB (just more powerful). I was really taken by how comprehensive P/Invoke was, but people who weren't used to looking at C code had a hard time translating calls. I remember writing a little tool called P/I-Spy that would take C function prototypes and convert them to P/Invoke, or do the same work via a step-by-step wizard that let you describe the call. The problem was that I wanted to build a comprehensive database of calls for the Win API, and I just didn't have the time, so I was happy when someone else took up the mantle and created http://pinvoke.net. At that point, I let the tool slide completely and have never touched it since.
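
As a quick refresher on what that translation work looks like, here's the classic example: taking a real C prototype from the Windows headers and declaring it for P/Invoke (the demo class name is arbitrary).

C#
using System;
using System.Runtime.InteropServices;

class PInvokeDemo
{
    // C prototype (WinUser.h):
    // int MessageBoxW(HWND hWnd, LPCWSTR lpText, LPCWSTR lpCaption, UINT uType);
    // HWND maps to IntPtr, LPCWSTR to string, UINT to uint.
    [DllImport("user32.dll", CharSet = CharSet.Unicode)]
    static extern int MessageBox(IntPtr hWnd, string text, string caption, uint type);

    static void Main()
    {
        MessageBox(IntPtr.Zero, "Hello from P/Invoke", "P/I demo", 0);
    }
}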

However, Microsoft just launched a tool of their own. You can find it here: http://blogs.msdn.com/vbteam/archive/2008/03/14/making-pinvoke-easy.aspx, and it has some really cool features, among them the ability to convert code in volume.
