Drift into Failure: From Hunting Broken Components to Understanding Complex Systems

Sidney Dekker

Terms

complexity, drift, failure, broken part, Newton-Descartes, diversity, systems theory, unruly technology, theory, rational, rational choice theory, decrementalism, relationships, high-reliability organizations, normalization of deviance, efficiency, uncertainty, resource scarcity, competition, initial conditions, normal, resilience, Newtonian, complex adaptive systems, cybernetics, local work context, local rationality principle, cost pressures, multiple competing goals, adaptation, feedback imbalance, fine tuning

Message of the book: the complexity of what society and commerce can give rise to today is not matched by the theories we have to explain why such things go wrong.

We can't ever really understand why things go wrong.

local rationality principle: people are doing what makes sense given the situational indications, operational pressures, and organizational norms existing at the time.

In systems, people make "good enough" decisions based on their local view of the world. Because of the interconnectedness of systems, these decisions can have unforeseen consequences.

Systems thinking is about relationships, not parts

Five concepts that characterize drift

  1. Scarcity and competition
  2. Decrementalism, or small steps
  3. Sensitive dependence on initial conditions
  4. Unruly technology
  5. Contribution of the protective structure

Systems thinking perspective on the five concepts

  • Resource scarcity and competition, which leads to a chronic need to balance cost pressures with safety. In a complex system, this means that the thousands of smaller and larger decisions and trade-offs that get made throughout the system each day can generate a joint preference without central coordination, and without apparent local consequences: production and efficiency get served in people’s local goal pursuits while safety gets sacrificed – but not visibly so;
  • Decrementalism, where constant organizational and operational adaptation around goal conflicts and uncertainty produces small, stepwise normalization where each next decrement is only a small deviation from the previously accepted norm, and continued operational success is relied upon as a guarantee of future safety;
  • Sensitive dependence on initial conditions. Because of the lack of a central designer or any part that knows the entire complex system, conditions can be changed in one of its corners for a very good reason and without any apparent implications: it’s simply no big deal. This may, however, generate reverberations through the interconnected webs of relationships; it can get amplified or suppressed as it modulates through the system;
  • Unruly technology, which introduces and sustains uncertainties about how and when things may fail. Complexity can be a property of the technology-in-context. Even though parts or sub-systems can be modeled exhaustively in isolation (and therefore remain merely complicated), their operation with each other in a dynamic environment generates the unforeseeabilities and uncertainties of complexity;
  • Contribution of the entire protective structure (the organization itself, but also the regulator, legislation, and other forms of oversight) that is set up and maintained to ensure safety (at least in principle: some regulators would stress that all they do is ensure regulatory compliance). Protective structures themselves can consist of complex webs of players and interactions, and are exposed to an environment that influences it with societal expectations, resource constraints, and goal interactions. This affects how it condones, regulates and helps rationalize or even legalizes definitions of “acceptable” system performance.

Banality of accidents thesis: incidents do not precede accidents. Normal work does.

Features of complex systems

  1. Complex systems can exhibit tendencies to drift into failure because of uncertainty and competition in their environment. Adaptation to these environmental features is driven by a chronic need to balance resource scarcity and cost pressures with safety.
  2. Drift occurs in small steps.
  3. Complex systems are sensitively dependent on initial conditions.
  4. Complex systems that can drift into failure are characterized by unruly technology.

Four ingredients of high reliability organizations:

  1. Leadership safety objectives
  2. The need for redundancy
  3. Decentralization, culture and continuity.
  4. Organizational learning (incremental learning through trial and error).

Highlights

  • drift occurs in small steps
  • In a complex system, however, doing the same thing twice will not predictably or necessarily lead to the same results. Past success cannot be taken as a guarantee of future success or safety.
  • complex systems are sensitively dependent on initial conditions
  • complex systems that can drift into failure are characterized by unruly technology
  • System thinking is about relationships, not parts
  • Safety certification is about bridging the gap between a piece of gleaming new technology in the hand now, and its adapted, coevolved, grimy, greased-down wear and use further down the line.
  • Local decisions that made sense at the time given the goals, knowledge and mindset of decision-makers, can cumulatively become a set of socially organized circumstances that make the system more likely to produce a harmful outcome.
  • Incidents do not precede accidents. Normal work does. The so-called common-cause hypothesis (which holds that accidents and incidents have common causes and that incidents are qualitatively identical to accidents except for being just one step short of a true or complete failure) is probably wrong for complex systems.
  • Drift, in other words, is not just a decrease of a system’s adaptive capacity. It is as much an emergent property of a system’s adaptive capacity.
  • In high-reliability organizations, active searching and exploration for ways to do things more safely is preferred over passively adapting to regulation or top-down control
  • Direct, unmediated (or objective) knowledge of how a whole complex system works is impossible.
  • Continuous operations and training, non-stop on-the-job education, a regular throughput of new students or other learners, and challenging operational workloads contribute greatly to reduced error rates and enhanced reliability.
  • The frame of reference for understanding people’s ideas and people’s decisions, then, should be their own local work context, the context in which they were embedded, and from whose (limited) point of view assessments and decisions were made.
  • Smaller dangers are courted in order to understand and forestall larger ones.
  • Instead [a belief about the possibility to continue operating safely] should be a belief that is open to intervention so as to keep it curious, open-minded, complexly sensitized, inviting of doubt, and ambivalent toward the past.
  • [Reaching and staying at a high-reliability end-state] involves a preoccupation with failure, a reluctance to simplify, a sensitivity to operations, deference to expertise and a commitment to resilience.
  • It also is an active consideration of all the places and moments where you don't want to fail.
  • High-reliability theory suggests that it is this complexity of possible interpretations of events that allows organizations to better anticipate and detect what might go wrong in the future.
  • These decisions are sound when set against local judgment criteria; given the time and budget pressures and short-term incentives that shape behavior. Given the knowledge, goals, and focus of attention by the decision-makers, as well as the nature of the data available to them at the time, it made sense.
  • One of the ingredients in almost all stories of drift is a focus on production and efficiency.
  • Recall Weick's and Perrow's warning: what cannot be believed cannot be seen.
  • It is this insidious delegation, this handover, where the internalization of external pressure takes place.
  • Instead, the processes by which such decisions come about, and by which decision-makers create their local rationality, are one key to understanding how safety can erode on the inside a complex socio-technical system.
  • "normalization of deviance"
  • The solution to risk, if any, is to ensure that the organization continually reflects critically on and challenges its own definition of "normal" operations, and finds ways to prioritize chronic safety concerns over acute production pressures.
  • The idea that organizations are capable of inculcating a safety orientation among its members through recruitment, socializing and indoctrination is met with great skepticism.
  • Structural secrecy, with participants not knowing about what goes on in other parts of the organization, is a normal by-product of the bureaucratic organization and social nature of complex work.
  • People do what locally makes sense to them, given their goals, knowledge and focus of attention in that setting.
  • As with Vaughan's normalization of deviance, operational success with such adapted procedures is one of the strongest motivators for doing it again, and again.
  • This, instead, can be achieved by making the boundaries of system performance explicit and known, and to help people develop skills at coping with the edges of those boundaries.
  • Also, a reminder to try harder and watch out better, particularly during times of high workload, is a poor substitute for actually developing skills to cope at the boundary.
  • Safety is an emergent property, and its erosion is not about the breakage or lack of quality of single components.
  • Drifting into failure is not so much about breakdowns or malfunctioning of components, as it is about an organization not adapting effectively to cope with the complexity of its own structure and environment.
  • Organizational resilience is about finding the political, practical and operational means to invest in safety even under pressures of scarcity and competition, because that may be when such investments are needed most.
  • How can an organization become aware, and remain aware, of its models of risk and danger?
  • Rather than being the result of a few or a number of component failures, accidents involve the unanticipated interaction of a multitude of events in a complex system - events and interactions, often very normal, whose combinatorial explosion can quickly outwit people's best efforts at predicting and mitigating trouble.
  • The idea behind system accidents is that our ability to intellectually manage interactively complex systems has now been overtaken by our ability to build them and let them grow.
  • All components can meet their specified design requirements, and still a failure can occur.
  • The answer lies in understanding relationships.
  • We never enumerated certainties, only possibilities. We never pinpointed, only sketched. That is what complexity and system thinking does.
  • the behavior of the system cannot be reduced to the behavior of the constituent components, but only characterized on the basis of the multitude of ever-changing relationships between them.
  • the performance of complex systems is typically optimized at the edge of chaos, just before system behavior will become unrecognizably turbulent.
  • descriptions of complexity have to take history into account.
  • Open systems mean that it can be quite difficult to define the border of a system.
  • Complex systems operate under conditions far from equilibrium.
  • in complex systems, history matters. Complex systems themselves have a history.
  • Only constant training and indoctrination, as well as positive reinforcement, can help build confidence and acceptance for the use of unmitigated language.
  • Decrementalism, where organizational and operational adaptation around goal conflicts and uncertainty produces small, stepwise normalization where each next decrement is only a small deviation from the previously accepted norm, and continued operational success is relied upon as a guarantee of future safety
  • Resource scarcity and goal oppositions form one such pervasive influence.
  • In fact, what is remarkable about this accident is that everybody was pretty much following the rules.
  • Behavior that is locally rational, that responds to local conditions and makes sense given the various rules that govern it locally, can add up to profoundly irrational behavior at the system level.
  • Just like a lot of other technical problems - NASA engineers were, and always had been, working in an environment where technical problems proliferated. Flying with flaws was the norm.
  • ... safety problem into a maintenance problem. But what we really need to understand is how these conversions of language made sense to decision-makers at the time.
  • the important question to ask ourselves is how organizations can be made aware early on that such shifts in language can have far-reaching consequences, even if those are hard to foresee
  • Rather, we should be wary of renaming things that negotiate their perceived risk down from what it was before.
  • Drift into failure, in these terms, is about optimizing the system until it is perched on that edge of chaos. There, in that critical state, big, devastating responses to small perturbations become possible.
  • Remember the basic message of this book. The growth of complexity in society has got ahead of our understanding of how complex systems work and fail. Our technologies have got ahead of our theories. Our theories are still fundamentally reductionist, componential and linear. Our technologies, however, are increasingly complex, emergent and non-linear.
  • Drift into failure involves the interaction between diverse, interacting and adaptive entities whose micro-level behaviors produce macro-level patterns, to which they in turn adapt, creating new patterns.
  • Investigations of past failures thus do not contain much predictive value for a complex system.
  • In a complex system, an action controls almost nothing. But it influences almost everything.
  • The commitment that is called for here is to see safety-critical organizations as complex adaptive systems.
  • It is important to realize that, in complex systems, the effects of local decisions seldom stay local.
  • The five features of drift - scarcity and competition, small steps, sensitive dependency on initial conditions, unruly technology, and a contributing regulator.
  • recommendations made by high-reliability theory include a validation of minority opinion and an encouragement of dissent
  • A local optimization may become a global disaster
  • Outsiders might think about this very differently, and they may represent a resource that managers could capitalize on
  • But for a regulator it means learning an entirely new repertoire of languages and countermeasures: from componential, determinist compliance to co/counter-evolving complexity
  • in times of crisis, these people's responsibilities may well suddenly be affected by a part failure way outside their functional area.
  • a Newtonian narrative of failure achieves its end only by erasing its true subject: human agency and the way it is configured in a hugely complex network of relationships and interdependencies.
  • Complexity and systems thinking denies us the comfort of one objectively accessible reality that can, as long as we have accurate methods, arbitrate between what is true and what is false. Or between what is right and what is wrong.
  • accuracy cannot be achieved in complex systems since they defy exhaustive description
  • Authentic stories
  • A story of a complex system is authentic if it succeeds in communicating some of the vitality of everyday life in that system
  • In a complex system, there is no objective way to determine whose view is right and whose view is wrong, since the agents effectively live in different environments.
  • Diversity of narrative can be seen as an enormous source of resilience in complex systems.
  • In a complex system, we should gather as much information on the issue as possible
  • In a post-Newtonian ethic, there is no longer an obvious relationship between the behavior of parts in the system (or their malfunctioning ,for example, "human errors") and system-level outcomes. Instead, system-level behaviors emerge from the multitude of relationships, interdependencies and interconnections inside the system, but cannot be reduced to those relationships or interconnections. In a post-Newtonian ethic, we resist looking for the "causes" of failure or success. System-level outcomes have no clearly traceable causes as their relationships to effects are neither simple nor linear.
  • There is no objective way of constructing a story of what happened
  • We can at best hope and aim to produce authentic stories, stories that are true to experience, true to the phenomenology of being there, of being suspended inside complexity.
  • Decisions set a precedent, particularly if they generate successful outcomes with less resource expenditure.
  • People come to expect and accept deviance from what may have been a previous norm. They have come to believe that things will “probably be fine” and that it will do “a good job”

Outline

  1. Failure is always an option
    • Who messed up here?
    • Rational choice theory
    • Technology has developed more quickly than theory
    • The Gaussian copula
    • Complexity, locality and rationality
    • Complexity and drift into failure
    • A great title, a lousy metaphor
    • Why we must not turn drift into the next folk model
  2. Features of drift
    • The broken part
    • Unanswered questions
    • The outlines of drift
    • Scarcity and competition
    • Decrementalism, or small steps
    • Sensitive dependency on initial conditions
    • Unruly technology
    • Contribution of the protective structure
    • A story of drift
  3. The legacy of Newton and Descartes
    • Why did Newton and Descartes have such an impact?
    • So why should we care?
    • What Newton can and cannot do
    • Columbia: parts that broke in sequence
    • Broken part, broken system?
    • Accidents come from relationships, not parts
    • We have Newton on a retainer
  4. The search for the broken component
    • Broken components after a hailstorm
    • Broken components to explain a broken system
    • Newton and the simplicity of failure
    • Reductionism and the Eureka part
    • Causes for effects can be found
    • The foreseeability of harm
    • Time-reversibility
    • Completeness of knowledge
    • A Newtonian ethic of failure
  5. Theorizing drift
    • Man-made disasters
    • The incubation and surprise of failure
    • Risk as energy to be contained: barrier analysis
    • High reliability organizations
    • Challenging the belief in continued safe operations
    • Goal interactions and production pressure
    • Normalizing deviance, structural secrecy and practical drift
    • The normalization of deviance
    • How to prevent the normalization of deviance
    • Structural secrecy and practical drift
    • Managing the information environment
    • Control theory and drift
    • Resilience engineering
  6. What is complexity and systems thinking?
    • More redundancy and barriers, more complexity
    • Up and out, not down and in
    • Systems thinking
    • Complex systems theory
    • The butterfly effect
    • Adaptation and history
    • Complex versus complicated
    • Complexity and drift
    • What is emergence?
    • Organized complexity
    • Accidents as emergent properties
    • Phase shifts and tipping points
    • Optimized at the edge of chaos
  7. Managing the complexity of drift
    • Complexity, control and influence
    • Diversity as a safety value
    • Turning the five features of drift into levers of influence
    • Resource scarcity and competition
    • Sensitive dependency and small steps
    • Unruly technology
    • Contribution of the protective structure
    • Drifting into success
    • From a paperclip to a house
    • Complexity, drift, and accountability
    • Enron's drift into failure
    • A post-Newtonian ethic for failure in complex systems

Ch. 1: Failure is always an option

Who messed up here?

Rational choice theory

Rational choice theory: assumes that agents take actions that optimize their utilities conditional on information and actions of others.

Under this theory, when accident happens, it's because somebody knew the risks and made an amoral calculation of consciously putting money before safety.

("BP should have known all of this...").

This worldview isn't wrong, but it excludes other readings and other results. It leaves us less diverse, less able to respond in novel or more useful ways.

The growth of complexity in society has got ahead of our understanding of how complex systems work and fail.

When an accident happens, we think we can figure out what went wrong, what the broken part is. But we're applying Newtonian folk-science, we really can't.

Technology has developed more quickly than theory

We can model and understand things in isolation. But not when they are part of a system that interacts with many other components embedded within a competitive, nominally regulated society.

We can't rely on theories that only work for simpler systems. Instead, must turn to: complexity theory (systems theory, theory of complex adaptive systems).

The Gaussian copula

Example: Gaussian copula and financial crisis
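To make the example concrete, here is a minimal sketch (mine, not Dekker's) of how a one-factor Gaussian copula ties individual default probabilities into a joint picture of portfolio risk. All names and parameters are invented for illustration; the point is only that a single correlation number comes to stand in for a much messier web of relationships.

```python
# Toy one-factor Gaussian copula for correlated loan defaults.
# Illustrative only; all parameters are arbitrary.
import numpy as np
from scipy.stats import norm

def simulate_defaults(n_loans=100, p_default=0.02, rho=0.3,
                      n_scenarios=10_000, seed=0):
    rng = np.random.default_rng(seed)
    # Each loan's latent variable mixes a shared market factor with an
    # idiosyncratic factor, giving pairwise correlation rho.
    common = rng.standard_normal((n_scenarios, 1))
    idio = rng.standard_normal((n_scenarios, n_loans))
    latent = np.sqrt(rho) * common + np.sqrt(1 - rho) * idio
    # A loan defaults when its latent variable falls below the threshold
    # implied by its marginal default probability.
    defaults = latent < norm.ppf(p_default)
    return defaults.sum(axis=1)  # number of defaults per scenario

losses = simulate_defaults()
print("mean defaults per scenario:", losses.mean())
print("99.9th percentile:", np.percentile(losses, 99.9))
```

The model is tractable and yields a clean risk number, which is exactly why (as the book argues) it could become the category of thought everyone else reasoned with, while the uncertainty about rho and the tails stayed invisible.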

Complexity, locality and rationality

It's impossible to objectively know how an entire complex system works. Because of this, to understand why people involved in a system behave the way that they do, you need to understand their view of the world at the time the action took place. Use their frame of reference, understand the local work context.

local rationality principle: people are doing what makes sense given the situational indications, operational pressures, and organizational norms at the time. Contrast with: rational choice theory.

Reasoning and decision-making are based on people's local understanding, attention, goals, and knowledge, not on a global view.

What matters for people is that the decision (mostly) works in their situation, that the decisions they make are achieving their goals as far as they understand.

Complexity and drift into failure

Decision makers are locally rational, but local actions can have global effects.

Drift into failure: local decisions that make sense in the context in which they were made can accumulate in such a way that they make accidents more likely.

Harmful outcome becomes a routine by-product of the characteristics of the complex system itself.

Decisions that local actors make are reasonable at the time that they occur, given the context. But these decisions end up (unintentionally) constraining and shaping how other people make decisions within the system. These earlier decisions become standard operating procedure, the status quo, the way that things are done. They provide the context for later decisions that get made by other actors in other parts of the system.

Decisions make good local sense given the limited knowledge available to people in that part of the complex system. But invisible and unacknowledged reverberations of those decisions penetrate the complex system. They become categories of structure, thought and action used by others (for example, a Gaussian copula becomes the risk assessment tool), and they limit or amplify the knowledge available to others (for example, a single risk number). This helps point people’s attention in some directions rather than others, it helps direct and constrain what other people in the complex system will see as sensible, rational or even possible. Wherever we turn in a complex system, there is limited access to information and solutions. Embedded as they are in a complex system, then, individual choices can be rendered unviable, precluding all but a few courses of action and constraining the consideration of other options. They can even become the means by which people discover their or their organization’s very preferences.

Drift occurs when agents within a system are all making local decisions that have a cumulative effect, that collectively push the system in a given direction. Local decisions lead to a systemwide response because they are made in the context of earlier decisions, and because of global properties (uncertainty, competition) that nudge decisions in a certain direction.

Adaptive responses to local knowledge and information throughout the complex system can become an adaptive, cumulative response of the entire system – a set of responses that can be seen as a slow but steady drift into failure.

Features of complex systems

  1. Complex systems can exhibit tendencies to drift into failure because of uncertainty and competition in their environment. Adaptation to these environmental features is driven by a chronic need to balance resource scarcity and cost pressures with safety.
  2. Drift occurs in small steps.
  3. Complex systems are sensitively dependent on initial conditions.
  4. Complex systems that can drift into failure are characterized by unruly technology.

Uncertainty and competition

Agents within a system must deal with uncertainty and competition. When making decisions, agents must satisfy multiple goals under resource scarcity and cost pressures while keeping the system safe. They therefore must make trade-offs, and these trade-offs occur in the many different decisions made every day by many agents within the system.

Drift in small steps

In the face of resource scarcity, agents will tend to make small decisions that favor saving resources over safety. Dekker calls these "adaptations". They'll save time by being a little less vigilant. And, lo and behold, nothing bad happens, and so skipping that little extra safety check becomes the norm. These decisions become precedents.

People come to expect and accept deviance from what may have been a previous norm. They have come to believe that things will “probably be fine” and that it will do “a good job”.

In a complex system, just because something worked before doesn't mean it will work again.

Sensitive dependence on initial conditions

A small decision early on in a complex system can have far-reaching consequences in the future.

the potential for drift into failure can be baked into a very small decision or event.

BP example: classifying the drill site as onshore instead of offshore. This had consequences for the regulatory and environmental approval process.
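The underlying mathematical idea can be shown with a standard toy example (not from the book): two trajectories of the logistic map that start a millionth apart and soon have nothing to do with each other. The analogy is loose, but it is the sense in which a tiny early difference, like a classification decision, can end up mattering enormously.

```python
# Classic illustration of sensitive dependence on initial conditions:
# two nearly identical starting points diverge completely.
def logistic_trajectory(x0, r=3.9, steps=50):
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs

a = logistic_trajectory(0.400000)
b = logistic_trajectory(0.400001)  # differs by one part in a million

for step in (0, 10, 20, 30, 40, 50):
    print(f"step {step:2d}: {a[step]:.6f} vs {b[step]:.6f}")
```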

Unruly technology

Unruly technologies create uncertainty. We can't accurately model/predict the behavior of these kinds of technologies.

Examples:

  • Gaussian copula
  • Deepwater drilling rigs

Protective structures interact with the system

[C]omplex systems, because of the constant transaction with their environment (which is essential for their adaptation and survival) draw in the protective structure that is supposed to prevent them from failing.

The protective structures (even those inside an organization itself) that are set up and maintained to ensure safety are subject to interactions and interdependencies with the operations they are supposed to control and protect.

Regulators that are supposed to ensure that the system functions safely are subject to the same kind of local decision making as the agents that act within the system.

Because the protective structure interacts with the system, it is effectively part of the system, which means that it is vulnerable to the same issues that cause the system to drift into failure in the first place.

Examples:

  • Rating agencies during the financial crisis
  • Aviation regulators (described in next chapter)
  • Internal risk management department of Enron
  • Minerals Management Service and "alternative compliance" process for deepwater drilling

Protective structure can actively contribute to the complexity of the system, as it is often made of multiple agents that have conflicting goals.

A great title, a lousy metaphor

"Drift" evokes imagery of reduced adaptation, but that's misleading, it happens because of constant adaptation that appears to be successful.

Drift, in other words, is not just a decrease of a system’s adaptive capacity. It is as much an emergent property of a system’s adaptive capacity.

  • Example: Congress diverting funds from ISS led to a changing environment for Space Shuttle Operations, which had to adapt.

Failure in complex systems is not just a matter of a few bad decisions or bad people that push a system over the edge. It takes more, and longer.

Complex systems are successful because of their complexity, not in spite of it. This web of relationships is what enables the system to be creative and to adapt to challenges in the environment.

Don't think of drift in linear terms, in the sense of a regressive sequence of events that occurred in the past. Instead, think of drift occurring because of constant adaptation to what is happening now. Drift is a consequence of adaptation.

Why we must not turn drift into the next folk model

Don't take "drift" too seriously as a metaphor. Adaptation doesn't always go in one direction.

Observing that drift happened is hindsight, retrospective.

  • outside perspectives
  • new perspectives
  • diversity of perspectives
  • taking the long view

Don't conclude that society is regressing because of the "drift" metaphor.

Encourages a post-structural view. Don't assume that drift is either in the story or not. There is no fact of the matter, no inherent truth in the story. There are no final conclusions; we simply can't know what really happened in an accident.

Read drift into the story instead of out of it.

By reading drift into a particular failure, we will probably learn some interesting things. But we will surely miss or misconstrue other things.

By not reading drift into the story, we might also miss/misconstrue things.

It's an explicit choice on our part whether we read drift into a story or not. Depends on what we bring to the story, how much we are willing to dig.

Ch. 2: Features of drift

Example of Alaska Airlines flight 261 in 2000, where the pilots lost control and the aircraft crashed into the ocean.

The broken part

the jackscrew-nut assembly that holds the horizontal stabilizer had failed, rendering the aircraft uncontrollable.

describes the role of the horizontal stabilizer

jackscrew and nut assembly need adequate lubrication, otherwise the constant grinding wears out the thread on either the nut or the screw

On the surface, the accident seemed to fit a simple category: mechanical failure as a result of poor maintenance. A single component failed because people did not maintain it well. It had not been lubricated sufficiently. This led to the catastrophic failure of a single component. The break instantly rendered the aircraft uncontrollable and sent it plummeting into the Pacific.

Because the system has safeguards against individual failures, other things had to have gone wrong.

  • no suggestion in any checklist that the flight crew should divert to the nearest possible airport when experiencing horizontal stabilizer trouble
  • the crew did not have guidance on how to fly an airplane with a jammed stabilizer, and using the autopilot was "inappropriate". Without guidance, improvising could make the problem worse.
  • access panel in the tail was too small to adequately perform lubrication task
  • widespread deficiencies in Alaska Airlines' maintenance program
    • lack of adequate technical data to demonstrate that extensions of the lubrication interval would not present a hazard
    • lack of task-by-task engineering analysis and justification in the process by which manufacturers revise recommended maintenance task intervals and by which airlines establish and revise these intervals
    • process for measuring how much slack there was in the screw/nut assembly did not meet aircraft manufacturer specifications
    • on-wing end-play check procedure was never validated and was known to have low reliability
  • shortcomings in regulatory oversight by the FAA
  • aircraft design that did not account for the loss of the acme nut threads as a catastrophic single-point failure mode

Because of our Western thinking, we are inclined to see the failure in terms of a broken component (the jackscrew) that can be explained in terms of other broken components (list above)

Unanswered questions

Explaining failure in terms of failures of other components leaves important questions unanswered.

Without answering those questions, we miss an opportunity to further our understanding of safety in complex systems.

Questions:

  • Why are there so many parts that appear broken in retrospect but did not appear broken at the time?
  • How did a maintenance program suddenly become deficient?
  • Why did none of the deficiencies strike anybody as deficiencies at the time?
  • If somebody did note them, why wasn't that voice persuasive?

Our traditional (Newtonian-Cartesian) metaphors show us how things ended up, but not how they got there.

Field is starting to move from looking only at the sharp-end to blunt-end (organizational, administrative, regulatory) as well.

Systems thinking is about relationships, not parts

System thinking is about the complexity of the whole, not the simplicity of carved-out bits. Systems thinking is about non-linearity and dynamics, not about linear cause-effect-cause sequences. Systems thinking is about accidents that are more than the sum of the broken parts. It is about understanding how accidents can happen when no parts are broken, or no parts are seen as broken.

In the above example:

  • there were changes in regulations and oversight regimes (as always)
  • underspecified technical procedures and continuous quality developments in aircraft upkeep (procedures are always underspecified, changes in how to conduct maintenance are always ongoing)
  • this all interacted with what one company and its regulator saw as sensible & safe

A story of drift is just a story. It's not more true than any other account. It's a perspective that is more open to certain properties of socio-technical systems (complexity, dynamics, erosion, adaptation).

Outlines of drift

In hindsight, trajectory looks certain, inevitable. That's not the case for the people who are involved at the time.

Drift may be very difficult to recognize without hindsight.

When there are signs of trouble, we tend to focus on fixing problems with components. We don't think in terms of drift.

Five concepts that characterize drift:

  • scarcity and competition
  • Decrementalism, or small steps
  • Sensitive dependence on initial conditions
  • Unruly technology
  • Contribution of the protective structure

Scarcity and competition

Complex systems do not function in isolation. They constantly interact with their external environment.

An environment applies constraints on a system:

  • amount of capital available
  • number of customers reachable
  • qualifications of available employees
  • how fast things can be built, developed, driven
  • the existence of competing organizations

Many industries are very competitive. Airline deregulation in the 1970s is an example.

Jens Rasmussen: work in complex systems is bounded by three types of constraints (boundaries):

  1. economic: system can't sustain itself if boundary exceeded
  2. workload: people/tech can't perform their tasks if boundary exceeded
  3. safety boundary: system fails if boundary exceeded

Example of a noncommercial organization facing these constraints: the Air Transportation Oversight System (ATOS), under time pressure to complete inspections.
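A toy rendering of Rasmussen's picture (my own sketch, not from the book; all numbers are arbitrary) is to watch an operating point that is pushed away from the economic and workload boundaries by visible pressures, and so migrates toward the safety boundary, whose position nobody can observe directly.

```python
# Toy model of migration toward the safety boundary.
# Cost and workload pressures are visible and push in one direction;
# the remaining safety margin is not directly observable.
import random

random.seed(1)
safety_margin = 8.0       # arbitrary units; unknown to the agents themselves
cost_pressure = 0.3       # each cycle, efficiency pressure eats some margin
workload_pressure = 0.2   # so does pressure to keep workload manageable

for cycle in range(1, 21):
    # Local adaptations respond to the visible gradients (cost, workload)...
    safety_margin -= cost_pressure + workload_pressure
    # ...plus day-to-day variation that masks the trend.
    safety_margin += random.uniform(-0.2, 0.2)
    status = "ok" if safety_margin > 0 else "safety boundary crossed"
    print(f"cycle {cycle:2d}: margin {safety_margin:5.2f} ({status})")
```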

Resource scarcity/pressure is simply a fact of life.

Consequences of resource pressure:

  • people fight over resources
  • we make tradeoffs
  • management chooses to invest or prioritize in certain areas over others

"The pressure expresses itself... in almost all engineering and operational trade-offs between strength and cost, between efficiency and diligence."

"working successfully under pressures and resource constraints is a source of professional pride."

The main driver of drift is born of this conflict between maximizing safety and just being able to do the thing at all.

Within this conflict, we deviate more and more from safer but less efficient practices.

As we use a system more and more, we learn how the system behaves, and we adapt our own behaviors accordingly. We make trade-offs that favor efficiency that, in hindsight, appear as poor judgment. However, from our perspective, these trade-offs were perfectly reasonable.

Feedback imbalance: It's easy to quantify the increase in efficiency when making trade-offs, but it's not easy to quantify the corresponding reduction in safety. It may appear as if there's no reduction in safety at all. Example: jackscrew assembly where there were no previously reported fatigue problems.

Because of this feedback imbalance, there will be a tendency towards making trade-offs towards efficiency in an environment where resources are scarce and there are goals to be met. From the outside, this looks like the folks at the sharp-end are pushing experiments outside of the system envelope. But from the inside, this is just adaptation in the face of constraints.

These adaptations, which start out as deviations from normal behavior, then become the normal behavior. This is how people within the system are able to achieve their goals ("maximizing capacity utilization but doing so safely; meeting technical or clinical requirements, but also deadlines").

Those inside of the system may believe that these trade-offs they have made haven't sacrificed safety because nothing bad has happened so far. However, they may simply have been lucky rather than operating safely. We never know when our luck will run out.

This moved Langewiesche to say that Murphy’s law is wrong: everything that can go wrong usually goes right, and then we draw the wrong conclusion.

Decrementalism, or small steps

Example: jackscrew assembly

Initial lubrication schedule was every 300 to 350 flight hours, with the airplane grounded every few weeks for maintenance. Drift happened: the lubrication interval was eventually extended to 2,550 hours by a series of small steps.
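A minimal sketch of the decrement logic (mine; only the interval figures echo the notes above, the 30% step is invented): each extension looks small because it is judged against the currently accepted norm rather than the original one, and each quiet period after an extension is read as evidence that the new interval is safe.

```python
# Decrementalism as a loop: small steps, each judged against the latest norm.
current_interval = 325          # roughly the original 300-350 flight hours
original_interval = current_interval

while current_interval < 2550:  # the eventually accepted interval
    proposed = min(round(current_interval * 1.3), 2550)
    step = proposed - current_interval
    # Each step is a modest deviation from the previously accepted norm...
    print(f"extend {current_interval} -> {proposed} flight hours (+{step})")
    # ...and operational success after each step becomes the argument for the next.
    current_interval = proposed

print("total drift:", current_interval - original_interval, "flight hours")
```

No single step looks alarming; only the comparison with the original norm does, and that is rarely the comparison anyone makes at the time.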

Sensitive dependency on initial conditions

[here]

Safety certification is about bridging the gap between a piece of gleaming new technology in the hand now, and its adapted, coevolved, grimy, greased-down wear and use further down the line.

A story of drift

This is the banality of accidents thesis. These are not incidents. Incidents do not precede accidents. Normal work does. In these systems:

 accidents are different in nature from those occurring in safe systems: in this case accidents usually occur in the absence of any serious breakdown or even of any serious error. They result from a combination of factors, none of which can alone cause an accident, or even a serious incident; therefore these combinations remain difficult to detect and to recover using traditional safety analysis logic. For the same reason, reporting becomes less relevant in predicting major disasters.

...

Managing the complexity of drift

Resource scarcity and competition

Don't try to fully optimize a system. Lack of slack leaves you vulnerable to small perturbations that can cause large events. If you are highly optimized, that's the time to ask whether there's any margin left.

Diversity of opinion can be one way to make the system stop and think when it thinks it can least afford to do so.