Prognostication.NET

"C#, .NET, Visual Basic, FoxPro, COM."

"What are Microsoft technologies that should have died by now?"

It's really interesting how the development community, in light of being a community of engineers, is so enthralled and passionate about subjective analysis and fortune-telling. I recently ran into a few online conversations with some long-time friends that lead me to think back to around 2001 to 2003 - a time rife with predictions about Microsoft technologies - and how most of those predictions never quite panned out.

After the J++ lawsuit debacle, Microsoft decided to take Sun head-on with a brand new (yet another) C-derived language: C#, and its accompanying runtime and base class library. Right from the get-go, Microsoft had a hard time making people take the language seriously, even having to spend an inordinate amount of time explaining how the correct pronunciation was "see-sharp". Well, after 8 years, C# is alive and kicking. And so is the .NET stack. Few Microsoft shops would rather develop solutions in anything else these days. Interestingly enough, while C# borrowed much from other environments (including Java - and not forgetting that Java itself is heavily derivative), Java has in the past few years adopted many things from the C# language, runtime, BCL and IDE.

VB has had a standing obituary for over a decade (which never quite proved to be reliable), but never a more pronounced one than during the first .NET release. People who always hated VB and "classic" VB aficionados alike derided the new "abomination" called VB.NET and perfunctorily declared Visual Basic dead. For some who were passionate about classic VB (and even classic BASIC, if ever there was such a standard), it was not merely that the VB flavor they knew and loved was dead and replaced by this thing so-called Visual Fred, VB.NOT, B#, or C-flat, but also that the new critter in VB clothing would be dead before long as well. Nothing could be farther from the truth. VB in its current incarnation is still the leading Microsoft development language (in terms of market share), beating both C++ and C#, and C# by a landslide at that.

And while Microsoft, in a series of (monumental) marketing and technical blunders and omissions, has shut VB out of certain niches, VB's market share still rivals that of Java according to Forrester, much more so than C#.

Of course, the opposite was also stated - that classic VB would die before long. Perhaps the feeding tube and respirator have been removed, but VB6 has shown a stubborn resilience and unwillingness to fade into the sunset. To me, that's hardly a surprise and only validates what I've said all along - that any development technology that is sufficiently adopted simply can't disappear in a short period of time. Chances are that one of your banks/credit-institutions/card companies/utility companies is still running a 15+ year old COBOL application.

FoxPro. OK, Fox is gasping its last breath if you can't exactly call it dead. It's not really feasible to see that any other way. Unlike VB, FoxPro won't get a reincarnation. However, it does leave behind a legacy. It seeded many of the ideas behind some of the SQL Server tools as well as being an inspirational source for LINQ.

For some who were absolutely giddy about .NET, there was the prediction that all other non-.NET development for the Microsoft platform would fade away - and most specifically, with regards to COM. Look, COM is most certainly a pain in the backside, but I always said it was far too ingrained near the heart of Windows to ever be removed. Like Iron Man, the best you can do is isolate yourself from the shrapnel with a shiny piece of technology. And MS isn't abstaining from developing new things with COM either. For 2005, the entire SSIS pipeline was built specifically for... you guessed it, COM.

Having said all that, there are some things that have indeed bought the farm. Most of them were first stabs that proved somewhat inadequate. The biggest example I can find is Remoting. Not that there was too much wrong with the technology, but it simply wasn't as encompassing or as flexible as its successor, WCF. Of course, Remoting is more like undead than dead (it's still fully supported in the framework, but won't get any fixes or additions), still limping around in (now) legacy applications until someone gets the time to migrate it over to WCF. And WCF is a bit overwhelming for the average crowd, with all its buttons and levers. Of course, if you can complain about WCF's complexity, Remoting is probably not your thing either.

A number of binding technologies seem to be destined for the same fate as well. I don't know if it's just me, but binding is always in a state of flux and nobody ever seems to *finally* get it right once and for all - a state of affairs that I'm sure leaves the anti-binding group in a standing ovation.

And moving forward, I will throw my own prediction into the hat. It's hard to see that LINQ to SQL will survive Entity Framework. It would be a ridiculous situation to have LINQ technology, and yet have no way to query into a database using it. LINQ to SQL seemed to be a ready candidate to fill that hole for the initial release. But it's an extremely poor OR/M, if I'm even allowed to call it that, and is rife with limitations that make it impractical for a lot of scenarios. EF, on the other hand, just isn't finished. It's an ambitious vision to be sure, but once it's done, you will have a much more flexible system that can connect to multiple data sources (not just SQL Server), easily plug into custom entities, removes the lame 1-to-1 mapping between tables and entities, and all while keeping with a simple set of similar designers. In other words, it would effectively do everything LINQ to SQL does and much more. It's also being worked into the guts of ADO.NET Data Services as well.

Handling Passwords

There are two reasons I’m writing this post. First, I’ve noticed a slew of articles and blog entries lately about the topic. Now, that’s good from the perspective that it’s an indication of people taking the topic seriously, and also helps to get the word out. Second, I’ve noticed that the authors often have incomplete and/or somewhat inaccurate information, which I’m sure they got from reading someone else’s incomplete and/or inaccurate material. That doesn’t necessarily make it bad, but they write the material with an authoritative tone – because, you know, blog authors are all leading experts in their fields, including fields they decide to wander into on any given day for the sake of a post

For those of you who missed the sarcasm: don’t believe everything you read on the internet, no matter who it comes from. Sounds like common sense, but that includes your favorite bloggers too, no matter how sharp you think they are. And on that note, I’ll give my standard security warning – although I’m pretty handy with math, I do not consider myself a true cryptanalyst. I am a hobbyist, enthusiast, and avid studier of cryptography. While some have considered that would make me something of an expert, I merely study the work of great cryptanalysts and I would prefer to be seen as just a guy who’s about to show you some of the things you need to know. So I encourage you to use this blog post the way you should use any blog post – as a stepping stone to research further, rather than the end-all and final word on the subject. The key being that when it comes to security, most developers just don’t know what they don’t know. Finally, the world of cryptology is steeped in detailed jargon and math formulas. This level of detail is absolutely necessary because in security, the devil is in the tiniest detail (and there are many). However, that level of detail also puts off about 90% of software developers. But since they are the ones who are entrusted with securing software, I will try to keep things at a fairly understandable level. Also, there are a few recent news items related to this topic that just begged to blogged about. So let’s dive right in.

The Importance of Passwords

Authentication is a critical operation for a vast amount of software. Systems need to know who any given user is. All of the user’s access to any given system, what they can and can’t do or see, is determined ultimately by who the system thinks they are. Password authentication works on the principle that a user knows a password that nobody else knows. The software has a record that lets it determine if a given user enters the right password. If they do, the software can be reasonably sure the user is who they claim to be. But in order for that to be effective, you need two things:
1) The password MUST be something that only the user knows. That means that a user must never divulge the password to ANYONE, including operators of the software.
2) It must be physically infeasible for someone else to get a hold of, or guess the password.
In order to get #2 right, the user is responsible for using a password that others can’t feasibly guess. This relates to password strength, which I’ll get to in a minute. It also means that the software must take precautions to forbid anyone but the user from seeing or otherwise getting a hold of the password. This means you should never, EVER, store a password in plain text. Ever. It doesn’t matter who you think has access to the data or not, stuff happens, and before you know it, your entire customer base’s data is in the hands of a malicious data thief. This has happened to MANY large corporations (for example, Citibank) despite their security measures. Besides being bad for business (most states are starting to require companies to publicly divulge when this sort of data compromise happens), some states now have laws that hold you (the data keeper) accountable and liable for data loss and compromise. That means users may be able to sue you if you didn’t do enough to protect their sensitive data.

So why not use something else for authentication? For example, we know DNA, digital fingerprinting, possibly retinal scanning, and other biometrics are unique to a person, and not easily compromised or duplicated. Well, if you lose your finger or eye, it’s gone. If your DNA is ever compromised, you can’t really change it the way you would change your password. It’s compromised forever and you’d be stuck with the option of never doing computerized banking or shopping ever again (which I’m sure would be the least of your worries at that point). So for now, passwords are the “in” thing.

Password Strength

Because this system relies on the basis that only one person knows their password, you, as a user, must be responsible for creating passwords others can’t guess. With only 48 8-bit characters, the possible combinations of letters, numbers, and symbols (before you even touch Unicode) are unfathomable. But there are only a few hundred thousand actual words in any given language, so don’t use real words as passwords. Computers can run through an entire dictionary in less than a second, so breaking such passwords with what’s called “brute force” (checking all possibilities) is extremely easy for a computer. Furthermore, if you use real words, a computer doesn’t have to check all possibilities under some situations (more on that later). But on a computer that can check 1 Billion combinations per second, it would still take tens if not hundreds of trillions of years to go through 48 characters of possible combinations (assuming you could use up all 8 bits in each character). Also remember that as far as brute force attacks are concerned, security improves exponentially (not linearly) with password length. For example, doubling the size of a password doesn’t double the number of combinations, it squares them.

So as far as security goes, you MUST have sufficiently long passwords. I’ve seen systems that require a limited number of characters (around 6 or 8 or so) and this is just not good enough. Anything less than 16 is probably asking for trouble – because on top of the number of bit combinations, people typically only use a limited set of values for each 8-bits (character, assuming non-Unicode). If you only use real words, then a potential data thief has an extremely narrow set of combinations to work with, so it’s important to really mix it up with as many combinations of numbers, letters, and symbols as possible. Also important for both users and software developers is the concept that passwords and validation algorithms must be case-sensitive. In English, with a 26 letter alphabet, you have 52 possible alphabetic characters (upper and lower case), but if your algorithm is case insensitive, you’ve thrown out 26 possibilities for each 8-bit chunk... which drastically reduces the overall effectiveness of the string when considering combinations. Since chances are that you are already working with a set of characters that doesn’t use up all 8 bits to begin with, then you really cripple the system by removing uniqueness based on character case.

This of course, must be balanced with memory (human, not RAM). A password is absolutely no good if the user can’t remember it. So it must be easy enough to remember, and difficult enough to foil guessing. Arbitrary strings of characters usually end up being one or the other, but not both. A lot of passwords are stored in a database, and that becomes one of the most popular reasons for limiting the size. However, there are some techniques which do not limit the size of the password – effectively making them as long as you want (even if they are stored in a database). For such systems, a good alternative is to have a pass phrase. Just like passwords, it’s important to mix things up with letters, numbers, and symbols, but you can use an entire sentence of text (although, you’d probably want to avoid using too many “real” words). It’s easier to remember a phrase of large text than it is to remember an arbitrary albeit smaller string of characters. Unfortunately, few systems allow you to enter as much password text as you want, even if they are capable of storing much larger phrases. However, it is definitely something to consider if you are about to implement a new system. As far as resilience against brute force attacks go, the longer the better.

Storing Passwords

If you are thinking of storing passwords in plain text, just hang up your hat and go home. Leave the system unfinished. It’s simply not worth the future trouble and liability you are subjecting yourself to.  That leaves you the option of somehow scrambling the password so that even if the data is stolen, people won’t be able to discover anyone’s password. At this point, I really need to re-iterate rule #1 of encryption: Do not EVER invent your own data protection algorithm. I know you think you are smart, have an advanced degree, and have been programming since [insert your favorite archaic system here]. But unless you are a cryptanalyst [expert], you will regret it. Data thieves don’t care about your pedigree, and as smart as you think you are, the odds are definitely in their favor, not yours. In fact, crypto-systems’ strength isn’t necessarily measured by the intelligence of the person who came up with the algorithm. Many extremely smart people (I’ll wager smarter than both you and I put together) have created systems that ultimately failed at the hands of some malicious hacker.  Secure systems have withstood many years of attacks and analysis. That’s what you should go with – a standard that has a proven track record with no vulnerabilities found.

Now is a good time to point out there are fundamentally two types of data protection primitives in use: encryption and cryptographic hash codes. The main difference between the two (although many details differ) is that encryption entails an algorithm that can both encrypt and decrypt text – that is to say, it can encrypt a chunk of plain text, and subsequently, it can decrypt the scrambled data back into plain text. On the other hand, cryptographic hash codes are one-way. You can’t “decode” a hash result back into the original plain text.

There are many encryption schemes and algorithms out there, but for the sake of this conversation, let’s just say that all the useful ones require keys. These keys must be securely stored. If an attacker gets hold of the keys, they can decode your entire database of passwords, and at that point, the passwords might as well have been in plain text to begin with. Effective “key management” is very difficult, and I’ll postulate that you shouldn’t have to return scrambled passwords back to plain text anyway (more on that later).

Cryptographic Hashes

Cryptographic hash algorithms have a few important properties:

  1. They take an arbitrary amount of text and turn it into a (usually smaller) fixed-length code. Among other things, this makes cryptographic hash algorithms ideal for pass phrases. No matter how long the phrase is, the resulting hash code is a relatively small and fixed size (let’s say 128 or 512 bits or 32 to 64 bytes).
  2. Running the same piece of plain text through a cryptographic hash algorithm will always result in the same hash code.
  3. They should be as collision resistant as possible. That means that two different strings of plain text shouldn’t result in the same hash code. This is very important and I’ll discuss it later as well.
  4. The algorithm shouldn’t be reversible. If all you have is the resulting hash code, it should be mathematically infeasible to run the algorithm backwards (or run some variation of the algorithm or any algorithm at all for that matter) that changes the resulting hash code back into the original plain text.
  5. Changing the plain text input, even a tiny little bit, will cause significant changes in the resulting hash code. You shouldn’t be able to approximate the original text by using similar text, as they should result in wildly different codes. There should also be no resulting patterns where the occurrence of a sequence in plain text can be deduced from a sequence in the hash code.
From these properties, we can concoct a Hash-based password scheme. When a user creates a password, the password is run through a hash algorithm. The system stores the resulting hash code. When the user re-enters their password for authentication, the system runs the newly-entered password through the same hash algorithm, and then checks the results with the stored hash code. If both match, then we know we have the original password (due to properties 2 and 3 above). Note that we never stored the actual password, and because of property 4, we shouldn’t be able to algorithmically derive the original password from the stored code.

 

Wow! That was easy. Well, not so fast, cowboy.  We still have a number of vulnerabilities to account for. In order to secure a system, you have to know how people will attack it. You can buy and install the thickest titanium steel door with the most complicated tamper-proof lock and it will do absolutely no good if someone can just go around to the back and open a window to get in. That is what cyber attackers live for, and they are much better at this than most of us. If you just took the preceding section into account, you would be in for a world of hurt, so it would be a good idea to see just how attackers try to get passed the system.

When Dictionaries aren’t Your Friend

The number one reason to avoid real words is that a brute force attack normally has to consider every possible combination. But if you use normal words (and most people apparently do), the attacker can just look at a much smaller set – about 500,000 or so possible words in the English language, for example.

If you just relied on the scheme I detailed above, then the attacker doesn’t even have to resort to brute force. They can use the so-called Dictionary or Rainbow Table attacks. Basically, someone sets up a dictionary of hash codes for every possible word (more complex dictionaries have combinations of text, numbers, etc.). That way, if an attacker gets the stored hash code, their software just looks up the hash code in the dictionary or rainbow table, and viola, they have the corresponding plain text (and it took considerably less time than 13 trillion years).

“So what,” I hear some of you say, “nobody has access to the password storage.” Think again. This is exactly the sort of attack that set entire UNIX networks on fire decades ago. The older Microsoft LanMan scheme suffers from the same problem. Many banks, online shopping companies, and even government agencies have had their data stolen in the past couple of years alone. It can happen to you. In fact, I’ve worked for two companies that hired me after having compromised data.

Salt and Speed Kills Rainbows

In order to foil these attacks, you have to produce hash codes such that pre-generation is infeasible. The first step is what’s called Salt or Nonce. This means that you append a piece of random data to the password before hashing it. This helps to stop even common words from being susceptible to dictionary/rainbow table attacks, but it’s still a good idea to avoid common words since attackers can just use them in brute force guesses. The salt or nonce doesn’t have to be very long. Due to hash algorithm property #5 above, even a small change to the input drastically changes the resulting code. However, the salt or nonce should be of a significant enough size. Let me put it this way – if your salt was 1 bit (0 or 1), then attacker would have to double the size of his dictionary or rainbow table (to account for all the possibilities with either a 0 or 1 added to them). But that’s not bad in terms of the size of the data the attacker needs. However, each additional bit of salt or nonce exponentially increases the combinations (just like increasing the size of the password itself – same concept). At a certain point, it becomes totally infeasible to pre-generate a table of possible values.

In fact, most commonly available (albeit older) rainbow tables have up to 8 characters, so generally-speaking, your passwords should always be more than 8 characters (although I seriously recommend at least 16), and you’ll weed out most of the wannabe attackers out there.

More importantly, each hash code should have its own UNIQUE and COMPLETELY RANDOM salt value. Do not compute the salt based on the input or use the same salt for every password. This makes the salt value predictable and helps the attacker narrow the amount of data they need for a dictionary or rainbow table. The introduction of unique and random salt or nonce also means that two people using the same passwords will have wildly differing hash codes stored in the database, which is definitely a good thing. Ideally, the salt or nonce should be created using a cryptographic random number generator, which is vastly more unpredictable than a standard time/seed-based generator.

Of course, in order to use the password + salt, you need to validate using the same salt value later (when the user enters their password for authentication). Fortunately, the salt doesn’t have to be a secret. You can just store it with the password hash code.

If you think salt isn’t enough, there are other strategies you can use in addition of salt or nonce. One of them, which I’m pretty fond of, is iterations. This means repeatedly hashing the password + salt. Basically, you take the initial password plain text, add the salt, and hash it. Then, you take the password again, add the salt and resulting hash code from the previous round, and hash that. Then repeat a few thousand times (or more). If salting helps to deter lookup attacks, iterations make it almost impossible with current technology. Some schemes like MD5-Crypt use 1000 iterations, but I prefer 10,000 or more. Even that number is relatively fast (less than a second), and you only use it when someone creates a password or attempts a login. This also makes input brute force attacks that much slower.

Birthdays Ruin Security

Here’s a neat little bar game that people have been playing for a long time: Walk into any bar (or office, or anywhere that has more than 20 people) and bet that at least two people there have a birthday on the same day. Some of you are probably thinking this is a really fast way to part with your hard-earned cash, and wondering what kind of dork would attempt that. But in a group of 23 people, there is a slightly more than 50% chance of that happening! If the group is 57 or more, then the chance is more than 99%. For the math behind this little scam, Wikipedia “Birthday Problem” (it’s a fun little exercise, but this post is getting long as it is).

The principle behind this is called collision counting. In this case, it’s how many birthdays collide (on the same day) given the number of combinations. For cryptographic hashes, the same principle applies, but on a much, much larger scale (after all, there are only 365 days in a year for birthdays). Hash algorithm property #3 from above states that they must be as collision resistant as possible, but mathematically-speaking, such an algorithm would almost be susceptible to collisions by definition. A collision here means that two different pieces of plain text input result in the same hash code. A good hash algorithm makes this virtually impossible for most input you would care about, but physically it might actually still happen, and the math behind it all says that it probably will. Clever attackers can try to use the math behind the birthday collision problem to narrow down a brute force (a so-called birthday attack). In that case, they don’t have to find the one piece of input that results in the code – they can attempt to find one of two or more possible choices that will also succeed.

There is very little you can do here except to avoid hash algorithms that have a mathematical flaw which allows for the possibility of a feasible collision. Even collisions at the block level reveal weaknesses that can be exploited. Practically speaking, this means that you probably want to avoid MD5 and SHA1 and anything that came before them. In addition to the potential collision problems, the resulting code is also relatively small. A larger hash code usually helps in many ways, and I would recommend something that’s at least 256 if not 512 bits long (the SHA suite has a few variations that result in that size).

More the Merrier?

So if one hash algorithm is good, two should be awesome, right? Not necessarily. Combing two hash algorithms don’t make the result twice as strong. If you hash a password with MD5 and then again with SHA1, and concatenated the results, there are still inherent weaknesses. Antoine Joux postulated the concept of multicollision, and how concatenating hash codes doesn’t really make them stronger than their component parts. Google “Joux Multicollision” if you want to see all the gory mathematical details.

What about hashing the password with one algorithm and then feeding that into the other algorithm? Still no dice. You’d be at the mercy of the strength of the first algorithm. If the first algorithm was susceptible to and experienced a collision, then it would simply feed the same output to the second one – which almost makes the second one only as secure as the first one.

Bottom line is that it’s best to stick with salt or nonce + iterations, and pick a nice, secure, and proven hash algorithm. And if you are using a premade authentication system, then use these tips as criteria to evaluate the package.

Keyed Hashes

There’s a class of hash algorithms that depend on a key in addition to a hash algorithm (HMAC). Many block ciphers (a type of encryption algorithm) can also be used in a way that effectively makes them a hash algorithm, and they would also use a key as well. But again, the problem is that you have to deal with effective key management. There simply isn’t enough space on this post to deal with that topic. But definitely research it if you are curious.

Forgive and Forget is Divine, but not Secure

So what happens when some user (as is inevitable) forgets their password? A lot of web systems have an “I forgot my password” link, which does a number of different things depending on the site. The worst offenders email your password back to you. This is the most horrible idea ever. You might as well spam the user’s password all over the net. First of all, that email is riding around the net in plain text. Anyone sniffing traffic can intercept it. Secondly, almost every router/switch and server along the path now has a logged transcript of that email.

Unfortunately, forgetting a password is something that, when handled properly, is going to be an inconvenience to the user. But you know what? That’s OK. A small inconvenience is a small price to pay for forgetting your password in comparison to losing your bank account or an account that has a stored credit card attached to it (which would be a whole different security post...).

Part of the reason systems emailed passwords back to the user was that they figured if you are someone malicious attempting to snag the account, the password would only go back to the email of the person who originally registered the account. But as you can see, that really isn’t true. The password just gets smeared across the internet. In a recent breaking story, it turns out that it’s much easier than previously thought to redirect internet traffic due to a protocol security flaw, and you wouldn’t even know it. So don’t think that it’s unlikely someone would be able to see that email.

Who’s your daddy?

Another option is to reset or to allow the user to change the password when they forget. Of course, since the person attempting this operation is anonymous (after all, they’d be logged in if they could remember the password), you have to be careful so you don’t just let anyone reset or change anyone else’s password. You still have to be reasonably sure the person making the request is in fact who they claim to be. A lot of sites these days force a user to record an answer only they know, to some kind of a “security” question. Make no mistake; this is just another method of authentication, and one that is loaded with weaknesses. Since answering the question grants you access to the system – either immediately, or by allowing you to change the password – it is just another password in and of itself. However, this “password” (the answer) is usually small, a normal word, and directly linked to a single question. It is infinitely easier to brute-force! Remember that titanium-steel door? Well, you just opened the window and hung a “welcome” sign on it. You don’t have to be a hacker to exploit this system weakness. A recent divorcee discovered that she kept failing to log into her email. She kept typing in what appeared to be the wrong passwords, and kept having to reset them. It turns out that her soon-to-be-legal-ex kept changing the password on her by using the “security” question, and had stolen all her private communications with her divorce lawyer. The ex did know quite a bit about her life - after all, they were married for many years, so chances were definitely in his favor that he’d know many of the common security question answers she would use. Savvy users will type in answers that make little sense to anyone else, but the questions really lead the average person to type in the worst and weakest kind of entry keyword possible. It’s a horrible system to get people accustomed to.

You are the Weakest Link. Goodbye.

And the sad part is that you don’t even have to be foiled by a close relative or spouse these days. Just about anyone can get a hold of everything they need to get into your account via the security questions. Why? Because YOU are GIVING it away to everyone!

In almost any system, it’s often not the data or the encryption or any code that forms the weakest security link. It’s the people. Data thieves often resort to social engineering attacks rather than attacking the software itself. You are a lot easier to crack than most software if you aren’t careful. That is especially true today in the horribly insecure Web2.0 world.
See, social networking sites are like crack. People check MySpace and Facebook more often than they check on their children. People collect friends like kids collect Pokemon cards. Unfortunately, in the quest to acquire more friends than Jesus and be the most popular kid on the net, you happily publish every last little thought in your skull and every last little detail about your life, how you’re feeling, and what you are drinking or listening to at the moment, knowing that everyone in your friends’ list is just lapping up every inane scrap of brain vomit throughout the day and thinking about just how cool you are and how lucky they are to know you. Of course, they aren’t, but it makes you feel good to think so. But those data thieves are absolutely riveted! Hey, you have fans after all!

That Yorkie fanatic next door, the one with three of the yappiest four-legged mop-heads on the planet – what’s her security question? “What is your pet’s name?” And what’s her blog loaded with? Yep, you got it. Wonder how much money she has in her account today. And the funny part is that the thieves don’t even have to pretend to be Nigerians with a few extra million coincidently sitting in a dusty ownerless vault waiting for you to act now and help them transfer it to your account.

Sadly, it’s not just the average Joe who gets suckered. The crack is so potent that it happens to the experts. In Defcon (a hacker’s convention) at Vegas this year, some of the attendees were friended by (people pretending to be) a number of big-shot prestigious hackers – described as the kind who once got arresting for speaking about some government software vulnerability, who lurk in the depths of secret security mail lists and forums, known to the world only by their net nicknames, who write crypto-defeating math equations that make your brain bleed and the NSA wet their pants. The attendees were more than giddy to accept. And then the fun began. You can read all about these social engineering attacks on social networking site shenanigans on the Defcon official site.
As far as secret info you use for authentication, it’s best to heed rule #1 of information security – Trust No One.

Wolf in eBay Clothing

The other major social engineering attack on passwords is phishing. This is where some data thief sends you an email that looks like it came from eBay (or a bank or anywhere that has an account you really don’t want to lose), usually telling you that your account is suspended, and directs you to a link where you are supposed to log in and verify that you haven’t done something horribly wrong. But instead, the link takes you to some Malaysian website (which is probably also compromised) where a fake login page, that might or might not actually contain the word eBay, sends your password to a script that feeds it back to the data thief. Of course, if your account was closed or frozen, how are you supposed to log in? Oh well, minor details. But phishing is effective, especially among people who have internet access but aren’t very net savvy (like a presidential candidate who shall remain nameless or granny who just uses it to email the family pictures around). It’s effective because they trust that eBay wouldn’t lie to them, and that the page they are going to is in fact owned by eBay.

Server Mug Shots

The problem is that up to now, apps (especially web apps) mostly handle a one-way authentication. Under more secure environments, you have mutual (two-way) authentication. Not only does the software have to know who you are beyond doubt, the software has to identify itself beyond doubt to you.

Up to now, people simply trusted in the URL or the browser’s interpretation of the server’s SSL cert validity. But a lot of people don’t even look at that. It’s simply not obvious to non-savvy users. And even if you think you can tell that the URL is really going to eBay.com rather than www.istealidentities.com/ebay.com, that doesn’t really mean someone isn’t re-routing your traffic. This would seem to be an elaborate trick to pull off, but not that difficult to people who know how to exploit routing protocols flaws, or if there are compromised servers/routers/switches at an ISP.

One of the schemes becoming popular now is image authentication. When a user creates an account, they select a picture from a large gallery of simple every-day items. Only the software knows which image they selected. When they go back for a login, they type their username, but not password. The server then identifies itself by showing the user the image they previously selected. Now the user can type in their password, knowing (reasonably so) that the software isn’t some phishing site. Phishing sites typically aren’t very complicated or dynamic enough to even show a different image, but they also shouldn’t know what picture the user selected. Of course, the number of pictures directly relates to the strength of that authentication, and perhaps it would be better to use phrases, although I suppose the argument is that pictures are easier to remember. The problem is that most systems have no really visible server/software authentication to speak of, and people simply aren’t used to it (yet).

Don’t Come a-Knockin’

The basic point of all these bits of trivia and attack modes is to ensure that you get a system that is really up to par on password security. The system needs to strive to eliminate the feasibility of any attack mode, forcing only brute force guessing of the password, because under the right circumstances, attempting all possible combinations could take longer than the earth has been in existence. But when it comes to brute forcing a password, you can also set up another barrier – lockouts and notifications. It’s common for people to get a password wrong on any given attempt. They might forget whether a character is “3” or “4”, or maybe the Caps Lock is on. But if someone tries to guess incorrectly 500 times in a few seconds, it’s probably an attack. To really stall brute force attacks, you need a lockout. This means that after some small number of failures (let’s say 4 or 5), the system stops authenticating that user period. You can re-enable authentication for the user after a few minutes (5 to 15, for example). It doesn’t necessarily have to be a permanent lockout. This foils brute force by making it completely infeasible to run through large numbers of guesses. If it takes trillions of years to brute force all possible combinations for a given length of password when a server can run through a billion combinations a second, imagine if it is now forced to deal with 4 or 5 every 15 minutes. Also, it doesn’t hurt to notify the user that something might be up with the account. Maybe they have a disgruntled ex-employee trying to get into a system using a co-worker’s or admin’s account. It also pays to keep a log of such attempts and have notification so your I.T. team can investigate, and perhaps law enforcement as well, if it comes to that.

Acknowledgements-
1 UNIX is a trade mark of The Open Group
2 LanMan and Microsoft are registered to the Microsoft Corporation
3 MySpace is a property of NewsCorp
4 Facebook hasn’t sold out (yet)
5 Pokemon is a property of Nintendo, 4Kids Entertainment, and Pokemon USA
6 The Weakest Link is a property of The BBC
7 eBay still owns itself
8 Nigeria still boggles me. They have billions in funds that nobody wants, yet people are starving and they want to scam you for the lousy 2 grand in your checking account. And apparently they have no idea who William Shatner is either. Same goes for their Hong Kong affiliates.
9 Malaysia doesn’t boggle me as much. I have three friends from there. I just think the .MY top level domain is cool.

LINQ with DataReaders

Over the past couple of days, one of the MVP lists has had some activity about LINQ (which is pretty common these days), and one of the threads meandered into using LINQ on DataReaders. The problem is that LINQ works on IEnumerable sources, but DataReaders don't implement that. Fortunately, it's easy to do, and several people out there have come up with various solutions, including Jim Wooley in this article of his. The basic gyst is that you can you create an extension method on IDataReader that serves as an Enumerator.

However, Jim encountered two problems. He is returning the data one row at a time (which is the point) but as an array of object. So first, he's worried about performance. This is a valid concern considering the boxing/unboxing implications for all value type columns as well as the fact that a LINQ statement might try to grab those values several times (for sorting, as an example). Secondly, because all you have is an array of object that represents a row, you reference each column of the source as "row[0]" and "row[1]", etc. - basically, you don't get to use nice column names in the SQL.

The way to fix both these problems is to have an object that represents the row, complete with specific and type-safe fields (properties if you want to get technical and PC). Then, you need a way to convert each DataReader row to an instance of the row class. Expanding on Jim's example, here's what I came up with:

C#
public interface IReaderRow
{
    IDataReader Reader { get; set; }
    object GetRowData();
}

public class TimeZoneRowReader : IReaderRow
{
    public IDataReader Reader { get; set; }

    public object GetRowData()
    {
        return new TimeZoneRow() 
        {
            ID = Reader.GetByte(0),
            Name = Reader.GetString(1),
            Description = Reader.GetString(2)
        };
    }
}

public class TimeZoneRow
{
    public int ID { get; set; }
    public string Name { get; set; }
    public string Description { get; set; }
}

public static class DataReaderExtension
{
    public static IEnumerable<T> DataReaderEnumerator<T, TReader>(this IDataReader source) where TReader : IReaderRow, new()
    {
        if (source == null)
            throw new ArgumentNullException("source");

        IReaderRow rowReader = new TReader() { Reader = source };

        while (source.Read())
        {
            yield return (T)rowReader.GetRowData();
        }
    }
}

As you can see, the LINQ statement is now typed and has intuitive and descriptive names for the columns. The downside, of course, is that you need a class representing each type of possible result, and a way to translate the DataReader data to that row instance. You can combine the row class with the specific row reader class, but I simply chose to keep them separate. Strangely enough, this starts to feel like LINQ to SQL or LINQ to Entities at this point, but it's a way to solve the two other problems I mentioned earlier. In the end, the LINQ will now look something like this:

C#
listBox1.Items.Clear();

using (SqlConnection cn = new SqlConnection(connectionString))
{
    using (SqlCommand cmd = new SqlCommand("Select * from TimeZone"))
    {
        cmd.CommandType = CommandType.Text;
        cmd.Connection = cn;
        cn.Open();
        using (SqlDataReader dr = cmd.ExecuteReader())
        {
            var queryResults = from timeZone 
                in dr.DataReaderEnumerator<TimeZoneRow, TimeZoneRowReader>)
                where timeZone.ID > 0
                orderby timeZone.Name descending
                select timeZone;

            listBox1.DisplayMember = "Description";
            listBox1.DataSource = queryResults.ToList<TimeZoneRow>();
        }
    }
}

If you search around, you'll see quite a few other solutions (some of them in community projects) for using DataReaders as LINQ sources, although they all seem to have some of these aspects in common. The really interesting stuff is when you start to think along the lines of using combinations of IL Emit and anonymous classes to create all these row-related classes automatically (the schema is available on the data reader, after all). However, without getting too tricky, it's tough to find a way for C# or VB to interpret the results in design time so you can still maintain the intellisense and column names without resorting to something like code generation.

More P/Invoke Help

One of the first things I really hammered with .NET 1.0 back in late 2000 was P/Invoke. I was used to Java, but IMO, Java always had a fatal flaw - language design hubris. Java is perfect, and C is garbage, therefore, everything should be converted to Java, or so it seemed anyway. That philosophy trickled down to how Java used existing C code - JNI (at the time) wasn't a way for Java to call C code as much as it was a way for *YOU* to write a "clean" and extensive wrapper worthy of being called by Java so that Java wouldn't have to dirty its hands with the "bad" C code. At the time, I was extremely skeptical about .NET, but one thing I did like is that it readily admitted that the enitre world was not in .NET and there was plenty of "legacy" code written in C that was.. *gasp* ... usable and useful! Furthermore, it cooperated with said C code and even had a very simple (relatively speaking) way of calling it. But a lot of people weren't used to P/Invoke, although it was similar in concept to using Declare statements in VB (just more powerful). I was really taken by how comprehensive P/Invoke was, but people who weren't used to looking at C code had a hard time translating calls. I remember writing a little tool called P/I-Spy that would take C function prototypes and convert them to P/Invoke, or do the same work via a step-by-step wizard that allowed you to describe the call. The problem is that I wanted to create a comprehensive database of calls for the Win API, and I just didn't have the time, so I was happy when someone else took up the mantle and created http://pinvoke.net. At that point, I let the tool slide completely and have never touched it since.

However, Microsoft just launched a tool of their own. You can find it here: http://blogs.msdn.com/vbteam/archive/2008/03/14/making-pinvoke-easy.aspx, and it has some really cool features, among them, the ability to convert volume code.

Saying YAGNI to YAGNI?

I was reading a very interesting post this morning: http://codebetter.com/blogs/james.kovacs/archive/2008/03/13/using-yagni-responsibly.aspx

and I want to start by saying that I whole-heartedly agree with James at the heart of the matter. YAGNI, when intelligently applied, is not just an excuse to do stupid simplistic things. But that got me thinking (again) that YAGNI itself is probably an inadequte piece-of-jargon-turned-tool that encourages some people to use it "unintelligently". My motto is that if something isn't providing consistently good results, then there's probably something wrong with it, so here's the gist of my thoughts:

I think the problem with YAGNI is that as a concept, it works more like a binary switch. It encourages people to think in Yes or No options and concentrate on a single vector of a multi-dimentional problem. By "intelligently" applying YAGNI, the focus should really be on providing the most effective and efficient (i.e. simplist) solution that can solve the problem given all the goals that must be met. This is not a "yes I need it", or "no I don't" question - it's a question that begs us to consider a range of alternatives and evaluate them based on (not just complexity) but R.O.I, maintainability, and likelyhood of change and stability, among other factors. At this point, some agilists might argue that we're trying to do too much design upfront by taking all this into account, but most of these factors can be very quickly assessed in a matter of minutes on a "good enough" basis, which is really in keeping with the spirit of Agile, IMO. YAGNI is just a simple(istic) way of terming it, but perhaps it's too simplistic and encourages us to evaluate only one vector (complexity) of the problem that really has other dimensions that need attention. I honestly think it hurts the process to give jargon undo weight. A person who uses YAGNI intelligently is actually taking those other factors into account, so perhaps YAGNI as a concept is just putting too much focus in one area for those who are more prone to simply use the jargon at face value as opposed to "intelligently" applying a tool, which is starting to make me wonder if it's actually a good tool to begin with.

How many factories does it take to...

So a friend and I were talking about specific CABs, and containers, and DI, and MVP, and AOP/Cross-cutting injection, when I stopped and had to laugh. It reminded me of this post: http://discuss.joelonsoftware.com/default.asp?joel.3.219431.12

At this point, some of you may be saying, "But Rob... OK, so a lot of these 'frameworks' are getting out of hand, but haven't you posted previously about how most modern software is completely unmaintainable without frameworks?" Yep, guilty as charged. However, to quote Einstein: "Make everything as simple as possible, but no simpler." And this gets to the crux of this year's push for me - how to convey that there are two equally horrible extremes we need to avoid: being so simple that it's too simplistic to work correctly, consistantly, predictably and be maintainable, and being so complicated that it borders ridiculousness, just because someone slapped a "Best Practice" sticker on it. Unfortunately, the tool needed to make those distinctions is not found in a book, blog, or manual - it's found between your ears, where it for some reason appears to be more difficult to locate. And before anyone gets overly defensive, the point of the post is not to say that the things I mentioned up top are bad - in fact, they are far from it. Just saying that before you download and start figuring out how to use someone's ShineyNewFramework(tm) v2.5, make sure you know why you are using it, and don't just do it because someone's blog said you ought to.

Exposure - Expenses to Assets

Are your developers expenses or assets? Technically, the accounting department and the IRS have an answer for that, but here's what I'm talking about: An expense is something that you throw money at and it depreciates in value or is consumed until it's done. A good example would be papers and staples. You shell out money for paper and staples, and they eventually run out, but what are you going to do? The company needs paper and staples so you have to pony up the expense. An asset, on the other hand, is something for which you pay now, and expect future benefits or returns. In other words, think investment. It's not just tossing money out the window, it's (eventually) getting something in return for the initial cost.

So with our simplistic analogy, investments are only good if they grow and provide a bigger future return. You have the choice of viewing your developers as a necessary expense, or viewing them as something that with a little investment, will provide greater future benefits for the organization. But in order for that to happen, developers need to grow in skills. When hiring, the best developers I've gotten are the really passionate ones who eat, drink, and dream code. They spend a ton of their own time researching and collaborating and growing their skills. But realistically, not all developers are like that. It works out mostly if you have single guys and gals who have that level of interest in coding and architecture and probably little else to get in the way barring the prospect of a nightly raid in Warcraft. But for the rest of society, you have parents and people with other adult responsibilities that demand inordinate amounts of time and energy outside working hours.

And that gets us to the crux of today's post - exposure. You don't necessarily have to send developers to really expensive training in order for them to grow. The key factor is exposure. Exposure to techniques, technologies, practices, bugs, workarounds, etc. The more exposure the better. For the devs who do little else in their free time, they are getting that exposure through books, articles, online forums, collaboration projects, and many other avenues. But for the devs who can't really afford to do all that (or as much), the only exposure they really get is what's in front of them 9 to 5 - which is to say, little or nothing at all that will help them grow beyond what they already do. And here's where I encourage managers to allow their developers to surf the web, be active in tech forums, contribute to a collaborative project, blog, and otherwise get the type of outside exposure they can only get online *on company time*. Sure, you have to limit it and make sure it's appropriate, but even a half hour a day can help to vastly improve their exposure to new tips and tricks, technological features, practices, and feedback they would have otherwise never gotten. Try it and see if this doesn't provide better benefits than forcing them to go heads-down in the project code 8+ hours a day.

Real Technology Heroes

OK, I'm probably *really* late to the ballgame on this one, but this site just made me laugh in that geek-humor way that has all the "normal" people around raising an eyebrow because they have no idea what's going on.

http://realtechnologyheroes.com/

So here's to you, Real Technology Hero. You know who you are!

Security and Permissions

Nothing really new in this post, but I just ran into some specs defining what needed to be locked down for a project, and it reminded me of a recurring theme that I though would be a good topic to blog about. That usually happens when you find yourself running into the same conversation multiple times.

The basics of an authorization system are pretty simple - you have users and you define what actions users can take on certain "objects". In Windows, for example, those objects happen to be files, folders, registry keys, computers, etc. In any given system, there might be many objects and actions, so for the sake of convenience, you group everything into Roles or Groups. Instead of creating an ACL (access control list - a list of actions a user can take on specified objects) for an individual user, you create ACLs for Roles. It's easier to add users to three roles (with preassigned permissions) than it is to assign 150 permissions to each user.

In some systems, the ACLs are somewhat implicit, and the code simply looks at roles. Each chunk of code looks for certain roles assigned to the user and decides if certain actions are appropriate. While this is certainly easier to implement, it lacks some flexibility and maintainability since it's a bit more hard-coded than actually having a data-driven and explicit ACL. Instead of being able to move permissions around, the most you can do at runtime is move users between roles.

Regardless of whether the ACLs are more implicit or more explicit, the meat of this post is more about "what" to lock down. I'm not sure if this is something taught in a some extremely popular systems design book or something, but it seems like BAs and users naturally drift towards wanting to lock down UI elements. The specs usually read "X users can't see/interact with Y button and Z fields on Screen A". I'm not exactly sure why that irritates me so much, but what you really want in most cases is to restrict actions to more abstract entities, rather than restrict actions to specific UI elements. If you say that X users can't Edit the contact information for Customers, or that Y users can't Deactivate Customers, or that Z users can't Create Customers, then no matter where those interactions occur on a physical UI, the same rules will apply. That's much better than defining permissions on a field-by-field, button-by-button basis. What if there are multiple ways to deactivate a user? For example, a button on Screen A, a maintenance panel on Screen B, and a Menu entry?

Of course, users will still find legitimate reasons for locking down certain UI elements - screens, reports, buttons, etc., but in my opinion, that should always be the exception rather than the rule.

OOP and Interviews

Just got back from vacation and read a blog post by Eric with some concerns about developer knowledge on OOP. It reminded me of a similar conversation I've been having on and off with friends for years, so it seemed like an appropriate topic to write a quick post. After that, I'll also toss in a quick synopsis about interviews.

Eric claims that far too many devs don't really know OOP. To some degree he's absolutely right. I see the same thing when I conduct interviews. There's an interesting dynamic when it comes to OOP, and it depends on how you approach the topic. If you ask a lot of people about OOP, you'll most likely get one of two answers - either they feel it's great (whether they really know much about it or not, they feel the industry [read: you the interviewer] thinks it's important, ergo, it's a Good Thing (tm)) or they feel it's a complete waste of time and far too complicated. Since apparently a good chunk of developers can't articulate the pillars of OOP, some people have even written entire papers asking the question "Was OOP a Failure?"

I'm afraid the RIP engraving on the OOP tombstone is entirely premature, but here's what I've discovered: Just about everyone LOVES consuming OO code, but a good chunk of developers don't like writing OO code (or simply don't know how). OO libraries are easy to consume. It's simple to understand MyMainForm.Text, AcceptButton.Enabled, and MyData.Fields["RowID"].Value. Additionally, the data and behaviors are easily discoverable. That is to say, you start with some object, and in most IDEs with most OO-languages, you click "." and you get all the object's fields, properties, and methods. There's a good reason for all this goodness - one of the primary goals of OOP is code organization. This is much easier than having tons of variables strewn in dozens of files, with names like gdwmaxdaysinqueue, CUST1_TEMP_DBID, or even X$. While some devs might even rebel against writing OO code themselves, you'd be hard pressed to find (for example) a .NET dev who would give up the .NET framework classes for winforms in favor of something flat like the Win32 User API.

The problem is that getting all the organization isn't free. Just like living in a house is infinitely easier than building a house, consuming OO code is much simpler than actually writing it. At the same time, most newer languages are making this much easier. Having said that, I also strongly believe that a lot of the "complexity" is in some respects purely perceptual. I often say that when faced with a problem, most programmers simply bang away and try samples until they get something to work. After that point, whatever happened to work becomes "the way to do it". Problem solved, why go back and reinvent the wheel? However, writing good OO code requires a fair bit of upfront thinking rather than the "discover a working solution as I go" approach. A lot of developers, particularly self-taught devs, get passed this experience and don't like to go back and discover a different way to write code to solve the problems they're already solving now, albeit somewhat disorganized. And before I get beat up for saying that, it isn't to say that only OOP code is well organized and everything else is not. In reality, I find that the most resistance to OOP comes from people who have programmed for years without it. Some of the newer bunch simply don't know much about it, but aren't really against it.

To prove the case to myself, I taught several semesters of OOP at USF. The students (most of whom were taking the class as part of a business or accounting curriculum rather than CS, some with no prior programming experience) had absolutely no difficulty at all getting the concepts with a minimal amount of proper guidance. At the end of the 7 week course, they simply couldn't fathom writing code any other way. But unlike the trial-and-error method of most self-learning, not everyone has the patience to get the guidance necessary to get into OOP, even if it comes from a book and not a teacher.

And without further ado or even a good segue, I'll dive into the interview stuff.

Interviews really serve one purpose - to see if the person in front of you can do the job. This has as much to do with technical knowledge as it does with a "fit" in the team. What an interview shouldn't be is a game of trivial pursuit. If your candidate will be writing a lot of OO code, then of course it makes sense for them to have a very good understand of OOP and be able to intelligently articulate the pillars of OOP. In today's environments, OOP is often an extremely relevant topic, which candidates really should be (at the very least) brushing up on.

However, should a candidate be able to explain how to correctly write a custom WCF transport off the top of his or her head? In 99.5% of the cases, I'd wager the day-to-day tasks the dev will be performing has nothing to do with that. Some interviewers use the interview as a vehicle for conveying Alpha developer dominance (i.e. showing off who is more intelligent or technically savvy and making sure the candidate knows he is at the bottom of the totem pole should he get the job). The questions are about things the candidate won't be experiencing on the job and any answer wouldn't necessarily qualify or disqualify the candidate from the real duties they would have to perform. That's just bad form, and if I caught anyone on my team doing that, they would really be in for it.

The same goes for that incredibly stupid bucket of trivia puzzle questions concerning light bulbs, coins, minimum distances, and trains. A lot of people claim they use these to gauge "problem solving skills". I claim BS. In fact, most of the people asking these ridiculous questions didn't know the answers themselves until they were told or found them in a trivia puzzle interview questions book. If you are asking these questions, chances are that you (a) don't know relevant questions to ask and are using these as fillers, or (b) are just trying to show off. Knock it off; you're not impressing anyone (and if you're impressing yourself, seek professional help as soon as possible). A lot of candidates are nervous enough as it is. Ask questions relevant to what they will actually be doing (and no, it's not trying to figure out how to find defective pills in the minimum number of weighings). If the open position was about Mensa Puzzle Solving Punditry, it would be called "Mensa Puzzle Solving Pundit" instead of "Sr. .NET Developer".

Then what should you be asking? Well, I'm not going to be presumptuous enough to tell you what is the best way for you to interview. In the end, that's really for you to decide. My approach is simply this: talk to the candidate. Anyone who can do the job can hold a very engaging conversation about the relevant topics. I avoid the trivial pursuit approach at all costs. Typically what I do is start vague and steer the conversation into more and more specifics as it goes. For example, find a previous project on the resume that claims to have incorporated some of the relevant techniques or technologies. I start by simply asking the candidate to talk about that project.

I look for a few tell-tale warning signs. For example, I don't want to hear "we did X" and "we implemented Y". I'm not hiring your entire team. I'm trying to hire you. I want to hear what you did specifically, not what your team did. I want to know what you know, not what your team knows. Sometimes people who (involuntarily) get off a project hide behind the accomplishments of others and don't understand the relevant topics at all. If I hear too many "we"s, I immediately start asking specifically about what the candidate himself did and understands about the project.

Also, if (as a candidate or interviewer) you don't understand a topic, don't talk about it. Nothing bothers me more than BS during an interview. I'm not looking for perfect knowledge. I'm looking to see if someone is qualified enough to do the job. Don't make things up. Don't throw around acronyms or buzz words you don't understand. If you get caught (and particularly pray I'm not the one who catches you ;-)) you will look 10 times dumber than you probably are. It's OK to say you don't know, or don't know off the top of your head. I will then try to assess how you would solve such a problem involving that topic, and that's more important to me than knowing every answer from memory. If you can figure out something after reading a help doc for 15 minutes, that says a lot about your ability to perform and produce given something you didn't already know from the start. If you are an interviewer and you do this, and the candidate catches you, you will also look 10 times dumber than you probably are. That's not going to help you fill the position.

So as the candidate begins to discuss the project (and particularly with me homing in on certain technologies or techniques), I begin to ask more and more specific, and deeper questions. A proficient person will be able to keep up and go along with the conversation. When the candidate can no longer keep up, I've discovered the limit of their knowledge on the given topic. If they can keep up until I no longer have questions, then welcome aboard! I know I can do the job, so if you know as much or more than I do, I can't ask for more. This more "natural" style of communication (as opposed to simple question and answer sessions) also relaxes the candidates a bit more. That helps me filter out any noise caused by nervousness.

Do you have any good interview techniques or pet peeves? Let's hear about it! :-)