When does panic happen? In my experience (and we'll focus on in-house software development here), two of the most stressful scenarios are software releases and system-down failures.
It never fails. After weeks (sometimes months) of careful testing and analysis, a myriad of problems always seems to materialize at the end of a development cycle. These problems take many shapes:
- Undiscovered Requirements: I love this one. After you've come up with a really solid architecture, design, and implementation, a stakeholder realizes they were completely wrong about a fundamental assumption. This is particularly awesome if you've verified that assumption numerous times and built an entire solution on top of it.
- Procrastination: I find it amusing when somebody has known for months that they had to do something, then suddenly starts making noise about it a week before the product goes live. What started out as a "minor feature, it'll take me an afternoon" turns into a last-minute fiasco due to lack of planning. This is when written requirements go out the window and the band-aid solutions are applied.
- Late Testing: You can get user feedback early in the process, but take it all with a grain of salt; when you get a week out from release, everything changes. This is when the users who have been sitting on their hands for months suddenly panic and realize they haven't tried the new piece of software they need to do their job. All of a sudden, any small deviation from what they expect becomes an uber-critical bug. These same users wait until this window to ask for "can't-live-without-it" features, even though they agreed to (or ignored) the original requirements.
So how do you reduce or eliminate panic in the development and release process? It goes back to the basics of the software development life cycle:
- Requirements: Nothing new here. Every feature added to an application needs a written spec. The spec needs to cover what the feature is, why it exists, and how it behaves (and ultimately how to verify it). Everything needs to be written down, including who asked for the feature.
- Design: Yes, software still needs to be designed. We have some great tools to help with this process, but the work still has to be done.
- Implement: This needs to be done correctly. There are many right ways to implement software, and many wrong ways. While there are plenty of excellent design patterns to draw on, the universal advice of DRY (Don't Repeat Yourself) should remain at the top of the list: one of the best ways to produce unmaintainable code is to copy and paste the same thing over and over (see the sketch after this list).
- Involve as many users in testing as possible: The users who jump on board early enough in the testing cycle can actually have some say in the direction of development. Users who don't touch the product until the day before release aren't going to get features added anytime soon.
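To make the DRY point concrete, here's a minimal sketch (the function names and the validation rule are hypothetical, purely for illustration). In the copy-and-paste version, a change to the rule has to be made, and remembered, in two places; in the DRY version it lives in one.

```python
# Copy-and-paste version: the same validation rule duplicated in two
# functions, so a rule change must be found and fixed twice.
def create_user(name, email):
    if not email or "@" not in email:
        raise ValueError(f"invalid email: {email}")
    print(f"creating {name}")

def update_user(name, email):
    if not email or "@" not in email:
        raise ValueError(f"invalid email: {email}")
    print(f"updating {name}")

# DRY version: the rule lives in one place, and every caller stays in
# sync when the definition of "valid" inevitably changes.
def validate_email(email):
    if not email or "@" not in email:
        raise ValueError(f"invalid email: {email}")

def create_user_dry(name, email):
    validate_email(email)
    print(f"creating {name}")

def update_user_dry(name, email):
    validate_email(email)
    print(f"updating {name}")
```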
Watching people react to system-down situations is particularly interesting to me. The panic in this case can come from both users and IT.
I've seen three main approaches from the IT side:
- Deer in Headlights: This is fun to watch. Grown adults freeze in their tracks, looking around the room for someone else to come up with an answer. This person has completely succumbed to panic.
- Jump In and Start Hacking: Normally introverted technical staff suddenly turn into ninjas and start throwing fixes at the problem (reboot now! kill the service!).
- Detective Mode: This is when somebody starts looking at logs, checking performance counters, asking questions, and so on.
Jumping in and hacking away can go either way. Even when prompt action is required, this is a dangerous time: normal procedures are often skipped and safeguards are removed "just to make it work." Sometimes this backfires, with either immediate consequences (making the problem worse) or long-term ones ("you know how you gave that folder write permissions for Everyone? Well, somebody wrote to it").
Detective mode is good, as long as it leads to action. Analysis paralysis is just as bad as deer in headlights, because nothing gets done.
I'd like to think I'm a healthy mix of the last two. I'm not afraid to jump in and try something if I have a good gut feeling about where the problem is. But if I'm flying blind, I go into detective mode and start eliminating potential sources of error, component by component.
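That component-by-component elimination can even be scripted. Here's a minimal sketch (the probes and their ordering are hypothetical stand-ins for real health checks): walk the stack in dependency order and report the first layer that fails, instead of poking at everything at once.

```python
def diagnose(checks):
    """Walk the stack in order and report the first failing component.

    `checks` is an ordered list of (name, probe) pairs; each probe
    returns True when that layer looks healthy.
    """
    for name, probe in checks:
        if not probe():
            return f"first failing component: {name}"
    return "all probes passed; widen the search"

# Hypothetical probes standing in for real network/database/app checks.
checks = [
    ("network",    lambda: True),
    ("database",   lambda: False),  # simulate a down database
    ("app server", lambda: True),
]
print(diagnose(checks))  # -> first failing component: database
```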
Every system and process should be designed with backups and procedures for when it fails. Even super-big-crazy-expensive clustered solutions can fail spectacularly. It takes a mindset of assuming that failure will happen (not might happen) to be prepared for this.
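In code, that mindset looks like building the failure path in up front rather than improvising it during an outage. A minimal sketch, assuming a hypothetical primary/replica pair of data sources:

```python
import logging

logging.basicConfig(level=logging.WARNING)

def fetch_report(primary, fallback):
    """Try the primary source; fall back to the replica when it fails.

    The point is the shape: the failure path is designed in advance,
    not invented at 2 a.m. while the primary is down.
    """
    try:
        return primary()
    except ConnectionError as exc:
        logging.warning("primary failed (%s); using fallback", exc)
        return fallback()

# Hypothetical data sources standing in for a real database and replica.
def primary_db():
    raise ConnectionError("primary is down")

def replica_db():
    return {"rows": 42, "source": "replica"}

print(fetch_report(primary_db, replica_db))
```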
Practice, practice, practice! A disaster recovery procedure is useless unless it has been regularly tested. The canonical example of this is backing up data without testing a restore. How do you know you can recover unless you've done it?
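As a sketch of what "testing a restore" can look like in practice (the file paths are hypothetical, and `shutil.copy` stands in for a real backup and restore tool): restore the latest backup into a scratch location and prove it matches the original, ideally on a schedule, with an alert when the comparison fails.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def checksum(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def verify_restore(original: Path, backup: Path) -> bool:
    """Restore the backup into a scratch directory and compare checksums.

    A backup you have never restored is a hope, not a procedure.
    """
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / original.name
        shutil.copy(backup, restored)  # stand-in for the real restore step
        return checksum(restored) == checksum(original)

# Hypothetical drill: back up a file, then prove the backup restores.
data = Path("data.db")
data.write_text("important records")
backup = Path("data.db.bak")
shutil.copy(data, backup)              # stand-in for the real backup step
assert verify_restore(data, backup), "restore drill failed!"
print("restore verified")
```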
At the end of the day, there are many ways to avoid unnecessary problems and prepare for unavoidable issues. But the best advice?