Safeguards and Minimizing Damage
I really don’t get one kind of coding mentality. Which one? I’ll tell you which one. The one where people insist that code should be written correctly from the start so that there’s no need for safeguards.
“What?”
Okay, I’ll elaborate a bit further with a theoretical example. Imagine that you have an embedded system of some sort, and a task needs to be kept running. The first thing to do is, of course, to ensure that the application is written properly, doesn’t crash, and handles errors in a proper way. Of course. I’m not against that in any way. But some people are of the opinion that this ought to be enough – that with enough unit testing and good programmers, this will result in an application that will not crash and will always run. In a perfect world I agree that this should suffice…but the fact is that people are stupid now and then and make mistakes.
Personally, I’m of the opinion that if a task is critical you need to write it robustly and have a monitoring system that tries to jump-start it again in case something happens. One of the most common questions I get when I propose things like that is: “What do you need that for? What could cause the application to crash?” I grind my teeth every time I’m asked that – if I knew what errors people might have made I would look for those potential errors instead! The whole point of the monitoring system is to take care of the unknown factors. Of course, a monitoring system can fail as well. But it improves the odds that things won’t crash and burn horribly – it just might work in some cases.
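To make that a bit more concrete, here is a minimal sketch of the kind of monitor I have in mind, written in Python for brevity rather than in whatever a real embedded target would use. Everything in it is made up for illustration: the supervise function, the max_restarts and backoff_seconds parameters, and the idea that the task is a plain callable that raises an exception when it dies. Treat it as one possible shape, not as the way to do it.

```python
import time
from typing import Callable, List


def supervise(task: Callable[[], None],
              max_restarts: int = 5,
              backoff_seconds: float = 1.0,
              sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Run a task and start it again if it dies with an exception.

    Hypothetical helper for illustration only. Returns a description of
    every failure seen, so the underlying bugs can still be tracked down.
    """
    failures: List[str] = []
    while True:
        try:
            task()
            return failures  # the task finished normally
        except Exception as exc:
            # Record the failure instead of swallowing it silently.
            failures.append(repr(exc))
            if len(failures) > max_restarts:
                # The monitor itself has to give up at some point.
                raise RuntimeError(
                    f"giving up after {max_restarts} restarts") from exc
            # Wait a little longer after each failure before retrying.
            sleep(backoff_seconds * len(failures))
```

Note that the sketch keeps a record of every failure it recovers from. The safety net is there to keep the product alive while the error gets fixed, not to hide the error.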
I know, I know: this is technically a bad approach to writing good software. If you have a safety net you might get sloppy in your implementation. Also, the point of not having a safety net might be that you want to find the errors, and the best way of finding them is to have the system crash. You’ll definitely know that something’s wrong then, and you can implement a proper fix. But let me tell you a little story: a story about a couple of device drivers.
First there was device driver A. He was a strange beast, written by a team of Americans and then moved to India for further development. Not that that has anything to do with the issue – the Indian people are good developers and the same thing would have happened if another American team had taken over the driver development. The point is that I suspect that there was a distinct lack of understanding with regard to the driver’s internal logic, due to the new team’s relative inexperience with the code. This became very apparent when the driver received new functionality. The new functions were built on top of the old ones, and everything seemed hunky-dory until one day when a customer experienced serious problems.
“The device driver crashes,” said the customer to the friendly people in the middle who used device driver A in their products. “You suck! We will sue your asses and knit little sweaters out of your pubic hair.” This was a very strange threat, but it didn’t sound like fun at all. So the friendly people spent weeks trying to reproduce the problem. After many weeks of hard labour they finally reproduced it, and gave enough information to the device driver developers to fix the problem. Apparently it was caused by some internal states that were introduced; the driver could get locked up in one of these states, since there was no age timer that cleaned up those new states. But all was well, for a solution was found! Yayness!
But then there was device driver B, developed by a completely different team for a different set of products. Some time after the adventure with driver A, driver B started showing strange behaviours as well… Lo and behold – it turned out that driver B had a problem that was not completely dissimilar from driver A’s! The details differed, but the main concept of the problem (new functionality built on top of the old, with no safeguards added) seemed to be the same. If only there had been a general safeguard inside the driver that examined the internal states and cleaned up the strange ones; maybe then both driver beasts could have lived happily ever after! Sure, there would still have been an error, and this error would have had to be corrected. But the whole product would not have become useless in the meantime, putting potential deals at risk with the customers who experienced these errors.
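For the curious, a general safeguard of that kind could be as simple as an age timer over the internal states. The sketch below is my own illustration in Python and has nothing to do with the actual code of driver A or driver B; the StateTable class, its method names and the max_age_seconds limit are all hypothetical. It assumes that every state is entered with a timestamp and that something calls sweep() periodically.

```python
import time
from typing import Callable, Dict, List, Tuple


class StateTable:
    """Tracks internal states and evicts those that linger too long.

    Hypothetical example: names and behaviour are assumptions made for
    illustration, not taken from any real driver.
    """

    def __init__(self, max_age_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_age = max_age_seconds
        self._clock = clock
        # key -> (state name, time the state was entered)
        self._entries: Dict[str, Tuple[str, float]] = {}

    def enter(self, key: str, state: str) -> None:
        """Record that `key` has entered `state` right now."""
        self._entries[key] = (state, self._clock())

    def leave(self, key: str) -> None:
        """Normal path: the state was left the way the code intended."""
        self._entries.pop(key, None)

    def sweep(self) -> List[Tuple[str, str]]:
        """Safeguard path: drop every state older than the limit.

        Returns what was dropped so that it can be logged and investigated.
        """
        now = self._clock()
        stale = [(key, state)
                 for key, (state, since) in self._entries.items()
                 if now - since > self._max_age]
        for key, _ in stale:
            del self._entries[key]
        return stale

    def __contains__(self, key: str) -> bool:
        return key in self._entries
```

The sweep doesn’t need to know why a state got stuck. It only needs to know that nothing should stay in it for that long, which is exactly the kind of unknown factor a safeguard is meant for.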
Disclaimer: a safeguard mechanism might not have improved anything in these cases. It might have been completely useless, and provided nothing at all. But I’m firmly convinced that the damage could have been lessened immensely by a functioning safeguard, had it been in place…and that the mere chance of minimizing damage is worth the extra effort.
