Someone comes to you with blood steadily dripping from a cut while an angry mob is chasing him. Do you…
a) take him to safety and immediately stop the bleeding or
b) tell him to wait while you jump on the internet and find out what to do on webmd.com?
If your answer is (b), you’re probably new to the sys admin world…like me.
This week, we rolled out a new Citrix farm to 2000+ users. It was overall a success…except for scattered connection problems. Not good for an EMR system.
My immediate REACTION was to dive in and research what the problem was. (Reaction is in caps to draw attention to the fact that I was reacting to a problem rather than having taken proactive action to prevent it.)
Was it the load evaluator? Health monitoring? Paged/nonpaged pool size? A spooler service misconfiguration?…
I want to remember that, in this case, I failed the test. But, more than that, I want to remember the lesson I learned.
I jumped headfirst into the problem without an escape plan. Imagine a firefighter going into a burning house.
You can be sure he has an escape plan, a Plan B.
In this emergency, I didn’t have a plan articulated. I arrogantly thought that by attacking this problem with everything I had, I would be able to resolve it in a day or two. It has now been 7 days since the problem started, and we still haven’t found a resolution.
So, step 1 is to stop the bleeding: bring back stability.
Step 2, immediately after, is to divide and conquer: devise your plan for attacking the root cause of the problem.
Devise tests to isolate and eliminate hypotheses.
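To make "isolate and eliminate" a bit more concrete, here is a minimal sketch of how I might track it. Everything in it is an assumption made for illustration: the `Hypothesis` class, the `eliminate` function, and especially the pass/fail values, which are placeholders and not results from our actual farm. In a real incident, each placeholder would be replaced by a test that isolates exactly one suspect.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hypothesis:
    """One suspected cause plus a test that isolates it (hypothetical structure)."""
    name: str
    # Returns True if the hypothesis survives its test (still plausible),
    # False if the test rules it out.
    still_plausible: Callable[[], bool]


def eliminate(hypotheses: List[Hypothesis]) -> List[str]:
    """Run every isolation test and return the names of the surviving hypotheses."""
    survivors = []
    for h in hypotheses:
        if h.still_plausible():
            survivors.append(h.name)
    return survivors


# Placeholder outcomes only: these lambdas stand in for real isolation tests
# and do NOT reflect what was actually found.
hypotheses = [
    Hypothesis("load evaluator", lambda: False),
    Hypothesis("health monitoring", lambda: False),
    Hypothesis("paged/nonpaged pool size", lambda: True),
    Hypothesis("spooler service misconfig", lambda: False),
]

remaining = eliminate(hypotheses)
print("Still on the suspect list:", remaining)
```

The point is less the code than the discipline: write the suspects down, attach one test to each, and only keep digging into the ones that survive.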