Operational Risk Management Part I

Reading this week:

  • Neptune’s Inferno by James D. Hornfischer

Recently (as I’m writing this), this USNI News article popped up in my news feed: “Investigation: Reckless Flying Caused Fatal T-45C Crash That Killed Two Naval Aviators.” In the incident, a student pilot was flying with an instructor. The instructor was performing and telling the student to perform advanced maneuvers that were not part of the plan. Both pilots misjudged the height and speed of the aircraft and crashed, killing both of them.

I’m not a pilot and I don’t have the skill to judge the technical aspects of the incident. But I was a nuke, and a whole lot of things went wrong on my watch, many of them entirely or partially my fault. In the nuclear Navy, when things go wrong we hold a “critique.” A critique is an analysis of what exactly happened and the root causes of what happened (the other communities have similar systems but I haven’t experienced those firsthand). Since I caused a lot of things to go wrong, I went to a lot of critiques.

When things go wrong there are only really, fundamentally, a few root causes. Something could have broken. Sometimes things just break. You can probably dig down and analyze the cause of that failure, but sometimes mechanical and electrical devices just fail in ways that the watch team could not have predicted. Sometimes, people do go rogue. I had one mechanic who was getting fed up with the maintenance approval process (it was a busy day and his maintenance was low-priority for us so it kept getting pushed aside, but it was high priority for him because it was between him and going home) and so he just opened up the panel on the piece of equipment he wanted to work on and got to work. This is a big no-no and landed us in a critique.

But the vast majority of the time, when things go wrong, it is human error. Under the category of “human error” there are of course sub-categories. Sometimes, people just weren’t trained properly. Through no particular fault of their own they wind up in a situation they are not trained for and make a bad decision because of it. Sometimes there are hard decisions to make, and you can’t know everything, so you make a decision that turns out to be wrong. But in the vast majority of cases that fall under “human error,” and therefore in the vast majority of cases where things go wrong, it is my belief that the fundamental root cause is poor operational risk management, or ORM.

ORM is the practice of balancing the risk associated with an action against the potential reward. Good practitioners of ORM will actively seek ways to reduce risk, but it is also true that risk cannot be eliminated. Risk should be taken on, however, only in proportion to the reward. Flying planes, and especially warplanes, is an inherently dangerous thing to do. Aviation has a fantastic safety record and is one of the safest methods of transportation out there, but you’re still hurtling your body through space at high speed and high altitude. That is risky. The safest thing to do is to not fly. But not flying is not really an option; modern war requires aircraft and people to fly those aircraft, and the only way to get really good at flying airplanes is to practice flying airplanes. So you seek to reduce risk: you install safety equipment in the aircraft, you use simulators to practice where possible, you train the pilots to be familiar with the limitations of the aircraft and avoid exceeding them, and you only perform maneuvers that are as risky as necessary to successfully complete the training. In that way, the risk is reduced and becomes commensurate with the training value achieved from actually flying the airplane.

The most frequent way ORM falls apart is a bad evaluation of risk. If risk is evaluated poorly, then it will be impossible to tell when the risk being taken on has exceeded any potential value from the evolution. You can google a whole set of articles on why people are bad at evaluating risk, but in the Navy, I think the root cause is that most people are phenomenal at their jobs.

In the T-45C crash, neither pilot was a bad pilot. The Navy put a whole lot of time and money into training both of them and it showed. Again, I’m no pilot, so maybe they were actually terrible, but according to the article both pilots were conducting very advanced, unplanned maneuvers, passing the controls back and forth pretty continually, and everything was going fine right until the end. The instructor was an experienced pilot, but the student was still managing these maneuvers despite being a student. These two men were very good at flying airplanes, as far as I can tell. The upshot of being a very good pilot is that you can do a lot of dangerous things for a long time and have everything turn out just fine. Given how nonchalant he was, this could not have been the first time the instructor was conducting these sorts of maneuvers, and since he lived through every other time, every other time must have turned out just fine. So, in my analysis, the instructor was unable to evaluate risk because every other time he did something stupid he managed to survive based on skill alone, which meant these activities no longer felt risky.

Given the Navy’s (and presumably the whole military’s) skill at training their personnel, this inability to properly assess risk of course extends to all communities. Last year, both the USS Fitzgerald and the USS John S. McCain collided with merchant ships, leading to the deaths of 17 sailors. In the Navy’s investigation of the incidents, they concluded that “the crew and leadership on board failed to plan for safety, to adhere to sound navigational practices, [and] to carry out basic watch practices.” What that translates to is that the crew did not implement risk reduction procedures commensurate with the amount of risk they were taking on. In both the surface and submarine fleets, the way you mitigate risk in high-risk situations, such as operating near navigational hazards or near a large number of other ships, is to station more watchstanders, who are able to better evaluate information as a team than one person is able to do alone. On both those ships, the situations that led to the incidents, with inadequate watchstanders and inadequate safety precautions, could not have been one-time events.

Having been on a ship where things went wrong, and having been responsible for those things, I can tell you that suddenly a whole lot of people unfamiliar with the facts of the case have a lot of strong opinions on what you must have done wrong. I’m not here to do that for these incidents, and for all of them the Navy’s official reports compiled by the experts in these events are available to read. But I am comfortable saying what I am saying because I’m not saying these people were bad watchstanders; I am saying I think they were probably very good. Both of those ships must have operated with inadequate watchstanders numerous times, and were able to do that because the watchstanders they did have on watch were highly trained and very good at driving ships. That means they, like the pilots, were able to do dangerous things and have nothing bad happen, which means they were ill-equipped to adequately assess the risk they were taking on and mitigate it.

I think some of the structural aspects of the Navy can contribute to this inability to assess risk. In 2016 two Riverine Command Boats attempted to transit from Kuwait to Bahrain. On the way they were seized by Iranian forces and held captive. In the Executive Summary of the incident report, the very first cause of the incident listed is that the command “demonstrated poor leadership by ordering the transit on short notice without due regard to mission planning and risk assessment. He severely underestimated the complexity and hazards associated with the transit.” In this case, what I suspect happened is that every other time the CO ordered this crew to perform a task, they pulled it off. Short notice, no notice, difficult conditions, you name it. So the CO lost his ability to adequately assess risk when it came to ordering this unit to perform a transit.

A similar thing happened to us. The first time we did a berth shift (when we moved the submarine from one pier to another), we took a week to prepare. It is, fundamentally, a pretty risky task, because you don’t have the reactor running and just have the diesel, which means the submarine has very little ability to get itself out of a bad situation and limited backup systems in case something goes wrong. But we did a lot of berth shifts and did them all perfectly and so before you knew it squadron was comfortable giving us two hours of warning to do a berth shift where previously we spent a week getting ready. We even did a berth shift with only one tug where normally you have two, vastly increasing the risk of the operation. We may have gotten a bit better at berth shifts, but at no point did the risk actually reduce, just our perception of the risk did.

In the case of the Riverine Squadron, this meant the CO was comfortable ordering the boat crew to make a high-risk transit with little warning time. This is especially dangerous because the CO is supposed to be the final arbiter of risk. If he is the one ordering the mission to go ahead, he has to have made a decision that the mission has an acceptable level of risk compared to the potential reward. I firmly believe that the officer in charge of the actual mission has a parallel responsibility to assess risk and must refuse to carry out the mission if in his opinion the risk vs. reward calculation is not worth it. But the fact that the order comes from the CO fundamentally alters the officer’s ability to assess risk: the CO has many more years of experience and often has access to more information, so his judgment should be more reliable. But I think the CO here had seen this unit succeed many times, changing his assessment of the risk, and the feedback loop to the officer in charge meant that no one was able to assess the risk anymore.

I’m going to keep harping on things that make it harder to assess risk, but I am trying to drive home the point that assessing risk is a fundamentally hard thing to do in the Navy precisely because we work so hard to reduce it. A major aspect of Navy maintenance procedures is to require several people to verify that something is safe. Before working on a piece of equipment, we’ll tag it out. What this means is that we’ll turn off and isolate every power source to a piece of equipment (including electrical, mechanical, and hydraulic power) and hang tags on those switches, valves, or isolations to make sure that no one turns them back on (similar to but not the same as civilian lockout-tagout). To tag out a piece of equipment one person has to determine what isolations are needed, another person has to verify them, and, before the tagout is hung, the supervisor also verifies it. For more complicated tagouts even more people can be involved but at the absolute minimum three people look at a tagout to make sure the worker won’t be harmed when he goes into a piece of equipment.

For something to go wrong then, all three of those people have to make the same mistake. Things, therefore, rarely go wrong. Two people can miss an isolation and have the third person catch it. In that case, nothing bad happens; the equipment doesn’t get worked on until the problem is fixed and the worker doesn’t get shocked or hurt by dangerous equipment (for nukes chuckling about the inevitable critique, yeah, but my point is no one was injured). Further, as I have been harping on, all three people in that process are usually pretty darn good at their jobs! That means mistakes are rarely made anyways, but it further drives down the possibility that all three people fail in the exact same way. But that, in turn, increases complacency. One guy, swamped with work, can maybe not do as thorough a job, knowing he’s got two other pretty smart people who will check the work and make sure nothing bad happens. The problem, of course, crops up when the other two people make the same assumption, or maybe one person makes an honest mistake and the other two don’t check it. No matter how good you are, everyone makes a mistake every once in a while.
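To put rough numbers on that, here is a small back-of-the-envelope sketch in Python. The miss rates are made up purely for illustration (they are not Navy data), and the function name is my own. It also assumes the three checks are independent, which is exactly the assumption that complacency breaks, so treat the "complacent" figure as an optimistic floor.

```python
import math

def all_checks_miss(miss_rates):
    """Chance that every check misses the same error, assuming the checks are independent."""
    return math.prod(miss_rates)

# Hypothetical miss rates, chosen only to illustrate the argument.
careful = all_checks_miss([0.01, 0.01, 0.01])     # each person misses 1 error in 100
complacent = all_checks_miss([0.10, 0.10, 0.10])  # each person leans on the other two

print(f"careful:    {careful:.6f}")
print(f"complacent: {complacent:.6f}")
print(f"complacent is about {complacent / careful:.0f} times more likely to fail")
```

With these made-up numbers, letting each person get ten times sloppier makes the whole three-person check about a thousand times more likely to fail, and nobody involved feels any different about it.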

Please stick around for part two, next week, where I solve the problem.