WHY A FAULT TREE DIAGRAM IS NECESSARY
The Fault Tree Approach
FTA can be simply described as an analytical technique, whereby an undesired state of the system is specified (usually a state that is critical from a safety or reliability standpoint), and the system is then analyzed in the context of its environment and operation to find all realistic ways in which the undesired event (top event) can occur. The fault tree itself is a graphic model of the various parallel and sequential combinations of faults that will result in the occurrence of the predefined undesired event. The faults can be events that are associated with component hardware failures, human errors, software errors, or any other pertinent events which can lead to the undesired event. A fault tree thus depicts the logical interrelationships of basic events that lead to the undesired event, the top event of the fault tree.
It is important to understand that a fault tree is not a model of all possible system failures or all possible causes for system failure. A fault tree is tailored to its top event that corresponds to some particular system failure mode, and the fault tree thus includes only those faults that contribute to this top event. Moreover, these faults are not exhaustive—they cover only the faults that are assessed to be realistic by the analyst.
It is also important to point out that a fault tree is not in itself a quantitative model. It is a qualitative model that can be evaluated quantitatively and often is. This qualitative aspect, of course, is true of virtually all varieties of system models. The fact that a fault tree is a particularly convenient model to quantify does not change the qualitative nature of the model itself.
Intrinsic to a fault tree is the concept that an outcome is a binary event, i.e., either success or failure. A fault tree is composed of a complex of entities known as “gates” that serve to permit or inhibit the passage of fault logic up the tree. The gates show the relationships of events needed for the occurrence of a “higher” event. The “higher” event is the output of the gate; the “lower” events are the “inputs” to the gate. The gate symbol denotes the type of relationship of the input events required for the output event.
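The gate logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the nested-tuple encoding of the tree, the `evaluate` function, and the pump and power event names are hypothetical and are not taken from any FTA tool.

```python
# Minimal sketch: a fault tree as nested tuples, evaluated on binary events.
# A node is either a basic-event name (str) or a pair (gate, [input nodes]).
# The encoding and the event names are hypothetical, for illustration only.

def evaluate(node, state):
    """Return True if the event at `node` occurs for the given basic-event states."""
    if isinstance(node, str):
        return state[node]             # basic event: True means "has occurred"
    gate, inputs = node
    results = [evaluate(child, state) for child in inputs]
    if gate == "AND":                  # output occurs only if all inputs occur
        return all(results)
    if gate == "OR":                   # output occurs if any input occurs
        return any(results)
    raise ValueError(f"unknown gate type: {gate}")

# Top event: power is lost, or both redundant pumps fail.
top = ("OR", ["power_lost", ("AND", ["pump_a_fails", "pump_b_fails"])])

one_pump_down = {"power_lost": False, "pump_a_fails": True, "pump_b_fails": False}
both_pumps_down = {"power_lost": False, "pump_a_fails": True, "pump_b_fails": True}

print(evaluate(top, one_pump_down))    # False
print(evaluate(top, both_pumps_down))  # True
```

The AND gate passes the fault upward only when both pumps have failed, which is how redundancy appears in the tree logic.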
Qualitative and Quantitative Evaluations of a Fault Tree
Both qualitative and quantitative evaluations can be performed on an FT. The FT itself is a qualitative assessment of the events and relationships that lead to the top event. In constructing the FT, significant insights and understanding are gained concerning the causes of the top event. Additional evaluations serve to further refine the information that the FT provides. The qualitative evaluations basically transform the FT logic into logically equivalent forms that provide more focused information. The principal qualitative results that are obtained are the minimal cut sets (MCSs) of the top event. A cut set is a combination of basic events that can cause the top event. An MCS is a smallest combination of basic events that results in the top event: if any event is removed from it, the remaining events no longer suffice to cause the top event. The basic events are the bottom events of the fault tree. Hence, the minimal cut sets relate the top event directly to the basic event causes. The set of MCSs for the top event represents all the ways that the basic events can cause the top event. A more descriptive name for a minimal cut set may be “minimal failure set.” MCSs can be obtained not only for the top event but also for any of the intermediate events (e.g., gate events) in the FT.
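As a rough illustration of how minimal cut sets can be derived from the tree logic, the sketch below expands a small tree of AND and OR gates from the top down and then discards every cut set that contains a smaller one. The tuple encoding, the function names, and the events A, B, and C are hypothetical, and production tools use more efficient algorithms than this brute-force expansion.

```python
from itertools import product

# Minimal sketch, assuming a coherent tree with only AND and OR gates.
# A node is a basic-event name (str) or a pair (gate, [input nodes]).

def cut_sets(node):
    """Return all cut sets of `node` as a list of frozensets of basic events."""
    if isinstance(node, str):
        return [frozenset([node])]
    gate, inputs = node
    child_sets = [cut_sets(child) for child in inputs]
    if gate == "OR":    # any cut set of any input is a cut set of the output
        return [cs for sets in child_sets for cs in sets]
    if gate == "AND":   # take one cut set from every input and merge them
        return [frozenset().union(*combo) for combo in product(*child_sets)]
    raise ValueError(f"unknown gate type: {gate}")

def minimal_cut_sets(node):
    """Keep only the cut sets that do not strictly contain another cut set."""
    sets = set(cut_sets(node))
    return {cs for cs in sets if not any(other < cs for other in sets)}

# Hypothetical tree: TOP = A OR (B AND C) OR (A AND B)
top = ("OR", ["A", ("AND", ["B", "C"]), ("AND", ["A", "B"])])

print(sorted(sorted(cs) for cs in minimal_cut_sets(top)))  # [['A'], ['B', 'C']]
```

The cut set {A, B} is dropped because A alone already causes the top event, leaving one single-event MCS and one two-event MCS.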
A significant amount of information can be obtained from the structure of MCSs. Any MCS with one basic event identifies a single failure or single event that alone can cause the top event to occur. These single failures are often weak links and are the focus of upgrade and prevention actions. Examples of such single failures are a single human error or single component failure that can cause a system failure. An MCS having events with identical characteristics indicates a susceptibility to implicit dependent failure, or common cause that can negate a redundancy. An example is an MCS of failures of identical valves. A single manufacturing defect or single environmental sensitivity can cause all the valves to simultaneously fail.
The quantitative evaluations of a FT consist of the determination of top event probabilities and basic event importance. Uncertainties in any quantified result can also be determined. Fault trees are typically quantified by calculating the probability of each minimal cut set and by summing all the cut set probabilities. The cut sets are then sorted by probability. The cut sets that contribute significantly to the top event probability are called the dominant cut sets. While the probability of the top event is a primary focus in the analysis, the probability of any intermediate event in the fault tree can also be determined. Different types of probabilities can be calculated for different applications. In addition to a constant probability value that is typically calculated, time-related probabilities can be calculated providing the probability distribution of the time of first occurrence of the top event. Top event frequencies, failure or occurrence rates, and availabilities can also be calculated. These characteristics are particularly applicable if the top event is a system failure.
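The cut-set summation described above can be sketched as follows. The basic event probabilities and the two minimal cut sets are made-up values, and the calculation assumes independent basic events with small probabilities, in which case the sum slightly overestimates the exact top event probability.

```python
from math import prod

# Hypothetical basic-event probabilities and minimal cut sets.
basic_prob = {"A": 1e-3, "B": 1e-2, "C": 2e-2}
mcs_list = [{"A"}, {"B", "C"}]

def cut_set_probability(cut_set, p):
    """Probability that every event in the cut set occurs (independence assumed)."""
    return prod(p[event] for event in cut_set)

# Rank the cut sets by probability, largest first.
ranked = sorted(
    ((cut_set_probability(cs, basic_prob), sorted(cs)) for cs in mcs_list),
    reverse=True,
)

# Top-event probability as the sum of the cut-set probabilities.
top_probability = sum(prob for prob, _ in ranked)

for prob, events in ranked:
    print(f"{events}: {prob:.2e} ({prob / top_probability:.0%} of top event)")
print(f"top event: {top_probability:.2e}")  # top event: 1.20e-03
```

In this made-up case the single-event cut set {A} is the dominant cut set.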
In addition to the identification of dominant cut sets, the importances of the events in the FT are among the most useful information that can be obtained from FT quantification. Quantified importances allow actions and resources to be prioritized according to the importance of the events causing the top event. The importance of the basic events, the intermediate events, and the minimal cut sets can be determined. Different importance measures can be calculated for different applications. One measure is the contribution of each event to the top event probability. Another is the decrease in the top event probability if the event were prevented from occurring. A third measure is the increase in the top event probability if the event were assured to occur. These importance measures are used in prioritization, prevention activities, upgrade activities, and in maintenance and repair activities. Later sections describe in further detail the rich amount of qualitative and quantitative information that can be obtained from a FT.
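The three importance measures described above can be illustrated with a short calculation. This is a sketch under assumptions: the probabilities and cut sets are hypothetical, the top event probability is approximated by the sum of the cut set probabilities, and the function names are invented for the example.

```python
from math import prod

# Hypothetical data: basic-event probabilities and minimal cut sets.
basic_prob = {"A": 1e-3, "B": 1e-2, "C": 2e-2}
mcs_list = [{"A"}, {"B", "C"}]

def top_probability(cut_sets, p):
    """Sum of cut-set probabilities (independent events, small probabilities)."""
    return sum(prod(p[e] for e in cs) for cs in cut_sets)

def importance(event, cut_sets, p):
    """Return (contribution, decrease if prevented, increase if assured)."""
    base = top_probability(cut_sets, p)
    # Fraction of the top-event probability from cut sets containing the event.
    containing = [cs for cs in cut_sets if event in cs]
    contribution = top_probability(containing, p) / base
    # Drop in top-event probability if the event can never occur.
    decrease = base - top_probability(cut_sets, {**p, event: 0.0})
    # Rise in top-event probability if the event is certain to occur.
    # (The summation is only an approximation here and can exceed 1.)
    increase = top_probability(cut_sets, {**p, event: 1.0}) - base
    return contribution, decrease, increase

for name in sorted(basic_prob):
    c, d, i = importance(name, mcs_list, basic_prob)
    print(f"{name}: contribution={c:.3f} decrease={d:.2e} increase={i:.2e}")
```

An event with a large "increase" but a small "contribution" is currently reliable but would matter greatly if it degraded, which is why the measures are read together.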
The Success Tree as a Logical Complement of the Fault Tree
Since success and failure are related, the FT can be transformed into its equivalent ST. In the FT context, success in a success tree is specifically defined as the top event not occurring. The method by which the ST can be obtained from the FT will be described in a later section. The ST is a logical complement of the FT, with the top event of the ST being the complement of the top event of the FT. For example, if the top event of the FT is “Occurrence of LOV,” LOV implying Loss of Vehicle, then the ST will have as a top event “Nonoccurrence of LOV.” The ST therefore defines the logic for the failure top event not occurring. Moreover, the ST identifies the minimal sets of basic events that need to be prevented in order to assure that the failure top event will not occur. These minimal sets of events that prevent the failure top event are termed the minimal path sets. A more descriptive name may be “minimal prevention sets” since they indicate how to prevent the occurrence of the failure top event and achieve success in terms of its nonoccurrence. The minimal path sets provide valuable information on the means by which the failure top event can be prevented even without quantification. Moreover, the ST can be quantified to provide the probability of success, i.e., nonoccurrence of the failure top event. Additionally, each of the minimal path sets can be quantified to prioritize the most effective methods for prevention (often in terms of cost to ensure prevention). The ability to analyze the top event from both a failure (occurrence) and success (nonoccurrence) standpoint increases the scope of information that can be obtained from these logic trees.
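One way to see the complement relationship is through De Morgan's laws: swapping every AND gate for an OR gate (and vice versa) and reading each basic event as "does not occur" gives a success tree, and the minimal sets of that tree are the minimal path sets. The sketch below assumes a tree with only AND and OR gates; the encoding, the function names, and the events A, B, and C are hypothetical.

```python
from itertools import product

# Minimal sketch: success tree as the dual of a fault tree (AND/OR gates only).
# A node is a basic-event name (str) or a pair (gate, [input nodes]).

def dual(node):
    """Swap AND and OR gates; each basic event is then read as 'is prevented'."""
    if isinstance(node, str):
        return node
    gate, inputs = node
    if gate not in ("AND", "OR"):
        raise ValueError(f"unknown gate type: {gate}")
    return ("AND" if gate == "OR" else "OR", [dual(child) for child in inputs])

def minimal_sets(node):
    """Minimal sets of basic events that make the top of `node` true."""
    if isinstance(node, str):
        return {frozenset([node])}
    gate, inputs = node
    child_sets = [minimal_sets(child) for child in inputs]
    if gate == "OR":
        sets = {cs for group in child_sets for cs in group}
    else:  # AND: merge one set from every input
        sets = {frozenset().union(*combo) for combo in product(*child_sets)}
    return {cs for cs in sets if not any(other < cs for other in sets)}

# Hypothetical fault tree: TOP = A OR (B AND C)
fault_tree = ("OR", ["A", ("AND", ["B", "C"])])

path_sets = minimal_sets(dual(fault_tree))
print(sorted(sorted(ps) for ps in path_sets))  # [['A', 'B'], ['A', 'C']]
```

Preventing A together with either B or C is enough to guarantee that the top event of this small tree cannot occur.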
Role of FTA in Decision Making
A variety of information is provided by FTA to assist decision-making. An overview of some of the major uses of FTA is presented here to give the reader an appreciation of the breadth of applications of FTA in decision-making. Note that this section repeats some information already provided in previous sections for the benefit of readers who want to focus on the role of FTA in decision-making.
1. Use of FTA to understand the logic leading to the top event. FTA provides a visual logic model of the basic causes and intermediate events leading to the top event. Typically, fault trees are not limited to a single system, but cross system boundaries. Because of this, they have shown great benefit in identifying system interactions that impact redundancy. The combinations of failures and events that propagate through a system are clearly shown. The minimal cut sets can be organized and prioritized according to the number of events involved and their nature. For example, if there are minimal cut sets that contain only one component failure, then this shows that single component failures can cause failure of the system. A failure path of only human errors shows that human errors alone can cause system failure. After reading this handbook, the reader should be convinced that the qualitative information obtained from an FTA is of equal importance to the quantitative information provided.
2. Use of FTA to prioritize the contributors leading to the top event. One of the most important types of information from FTA is the prioritization of the contributors to the top event. If a FT is quantified, the failures and basic events that are the causes of the top events can be prioritized according to their importance. In addition, the intermediate faults and events leading to the top event can also be prioritized. Different prioritizations and different importance measures are produced for different applications. One of the valuable conclusions from FTAs is that generally only a few contributors are important to the top event. Often only 10% to 20% of the basic events contribute significantly to the top event probability. Moreover, the contributors often cluster in distinct groupings whose importances differ by orders of magnitude. The prioritizations obtained from FTA can provide an important basis for prioritizing resources and costs. Significant reductions in resource expenditures can be achieved with no impact to the system failure probability. For a given resource expenditure, the system failure probability can be minimized by allocating resources to be consistent with contributor importances. The importance measures obtained from a FTA are as important as the top event probability or the ranked cut set lists obtained from the analysis.
3. Use of FTA as a proactive tool to prevent the top event. FTA is often used to identify vulnerable areas in a system. These vulnerable areas can be corrected or improved before the top event occurs. Upgrades to the system can be objectively evaluated for their benefits in reducing the probability of the top event. The evaluation of upgrades is an important use of the FTA. Advocates of different corrective measures and upgrades will often claim that what they are proposing provides the most benefit and they may be correct from their local perspective. However, FTA is a unique tool that provides a global perspective through a systematic and objective measure of the impact of a benefit on the top event. The probability of the top event can be used to determine the criticality of carrying out the upgrades. The probability of the top event can be compared to acceptability criteria or can be used in cost benefit evaluations. Advances in cost benefit methodology allow uncertainties and risk aversion to be incorporated as well as the probabilities. Furthermore, success paths provided from FTA can be used to identify specific measures that will prevent the top event. The proactive use of FTA has been shown to be one of its most beneficial uses.
4. Use of FTA to monitor the performance of the system. The use of the FT as a monitoring tool is a specific proactive use that has been identified because of its special features. When monitoring performance with regard to the top event, FTA can account for updates in the basic event data as well as for trending and time dependent behaviors, including aging effects. Using systematic updating techniques, the fault tree can be re-evaluated with new information that can include information on defects and near failures. Actions can then be identified to maintain or replace necessary equipment to control the failure probability and risk. This use of FTA as a monitoring tool is common in the nuclear industry.
5. Use of FTA to minimize and optimize resources. This particular use of FTA is sometimes overlooked but it is one of the most important uses. Through its various importance measures, a FTA identifies not only what is important but also what is unimportant. For those contributors that are unimportant and have negligible impact on the top event, resources can be relaxed with negligible impact on the top event probability. In fact, using formal allocation approaches, resources can be re-allocated to result in the same system failure probability while reducing overall resource expenditures by significant amounts. In various applications, FTA has been used to reduce resource burdens by as much as 40% without impacting the occurrence probability of the top event. Software has been developed to help carry out these resource re-allocations for large systems.
6. Use of FTA to assist in designing a system. When designing a system, FTA can be used to evaluate design alternatives and to establish performance-based design requirements. In using FTA to establish design requirements, performance requirements are defined and the FTA is used to determine the design alternatives that satisfy the performance requirements. Even though system specific data are not available, generic or heritage data can be used to bracket performance. This use of FTA is often overlooked, but is important enough to be discussed further in a subsequent section.
7. Use of FTA as a diagnostic tool to identify and correct causes of the top event. This use of FTA as a diagnostic tool is different from the proactive and preventative uses described above. FTA can be used as a diagnostic tool when the top event or an intermediate event in the fault tree has occurred. When not obvious, the likely cause or causes of the top event can be determined more efficiently using the FTA power to prioritize contributors. The chain of events leading to the top event is identified in the fault tree, providing valuable information on what may have failed and the areas in which improved mitigation could be incorporated. When alternative corrective measures are identified, FTA can be used to objectively evaluate their impacts on the top event re-occurrence. FTA can also be an important aid to contingency analysis by identifying the most effective actions to be taken to reduce the impact of a fault or failure. In this case, components are set to a failed condition in the fault tree and actions are identified to minimize the impact of the failures. This contingency analysis application is often used to identify how to reconfigure a system to minimize the impact of the component failures. Allowed downtimes and repair times can also be determined to control the risk incurred from a component failure.
As can be seen from the above, FTA has a wide variety of uses and roles it can play in decision-making. FTA can be used throughout the life cycle of the system from design through system implementation and improvement. As the system proceeds to the end of life, its performance can be monitored to identify trends before failure occurs. When consciously used to assist decision-making, the payoffs from FTA generally far outweigh the resources expended performing the analysis.
Role of Fault Trees in a PRA
A Probabilistic Risk Assessment, or PRA, models sequences of events that need to occur in order for undesired end states to occur. A sequence of events (event sequence) is usually called an accident sequence. An example of an accident sequence is a fire that leads to catastrophic consequences because mitigation systems fail to operate. A model of a simple event sequence in a PRA is shown below.
Notice that in the above event sequence model, success of a system as well as failure of another system appears. Which particular systems fail and which succeed determine the type of end state and its associated consequences. To quantify the accident sequence, a probability for each event in the event sequence, other than the end state, needs to be determined. The probability of each event is conditional on the previous events in the sequence (e.g., the probability of system A failing is the probability of A failing given the initiating event occurs, the probability of system B succeeding is the probability of B succeeding given A fails and the initiating event occurs). If an event is independent of others in the sequence and failure data exist, the probability can be directly estimated from the data. For more complex events in the sequence, that do not have directly applicable data or that may have dependencies on other events in the sequence, such as for a system failure, a fault tree is usually constructed. The fault tree is developed to a level that encompasses the dependencies between systems or to a level where failure data exist for the basic events, whichever is lower (more detailed). The fault tree is then evaluated to determine the probability of the system failure.
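The quantification of a single accident sequence can be written out as a product of conditional probabilities. The numbers below are invented for illustration, and the function name is hypothetical; in practice the system failure probabilities would come from evaluating the corresponding fault trees.

```python
from math import prod

def sequence_probability(conditional_probs):
    """Multiply event probabilities, each conditional on the events before it."""
    return prod(conditional_probs)

# Hypothetical accident sequence (all values are made up):
p_initiating_event = 1e-2            # P(initiating event occurs)
p_a_fails = 5e-2                     # P(system A fails | initiating event)
p_b_succeeds = 0.9                   # P(system B succeeds | A fails, initiating event)

p_sequence = sequence_probability([p_initiating_event, p_a_fails, p_b_succeeds])
print(f"{p_sequence:.2e}")  # 4.50e-04
```

Each factor is conditioned on everything earlier in the sequence, so the order of the list mirrors the order of events in the sequence.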
Each event sequence is a logical intersection (an AND gate) of the initiating event and the subsequent events other than the end state. Available PRA software automatically carries out the operations involving this intersection using all the fault trees that are input to an event sequence. Depending upon the level of resolution, a complex PRA such as for the Space Shuttle can have tens of thousands of accident sequences involving hundreds of different fault trees. In a large analysis, the fault trees (AND gates) of each sequence are combined into a single OR gate to generate accident sequence cut sets for the entire PRA in a single analysis run. When several different end states are defined, the fault trees for each individual end state are combined. Fault trees are generally the workhorses of a PRA, providing causes and probabilities for all the system failures involved, as well as a framework for quantification of the sequences.
Software for Fault Tree Analysis
A number of software applications exist for FTA and new applications are continually being developed. Some applications provide the capability to draw and quantify FT models, while others provide an integrated set of PRA tools that include the capability to draw and solve FTs. It is not the purpose of this document to serve as reference for FT-related software; it is not possible to describe all FT software that is currently available and it is clearly not possible to describe software that may be available in the future.
FTA in Aerospace PRA Applications
The following sections discuss FTA topics that are relevant to aerospace applications. A diagram extracted from the NASA PRA Procedures Guide shows the role of FT modeling in a typical PRA; its block labeled “Logic Modeling” corresponds to event tree and fault tree modeling of accident sequences (accident scenarios). Because FTs are the workhorses of a PRA and can be used as stand-alone models, this handbook focuses on fault tree modeling techniques.
Over the past two decades, SPA inc. has conducted research of causal factors in hundreds of accidents utilizing human factors analysis, preliminary hazard analysis, health/environmental hazard analysis, and product safety analysis to determine most probable cause(s) for Forensic Technologies International, Cadcom Inc. (a division of ManTech International), Failure Analysis Associates Inc., the insurance community, and private counsel. In the analysis of accidents, the Fault Tree Analysis approach is utilized.
Fault Tree Analysis
SPA inc. typically utilizes a Fault Tree Analysis to determine causal relationships between the accident event and contributing factors, if any. A fault tree analysis (FTA) is a deductive, top-down method of analyzing system design and performance. It involves specifying a top event to analyze (such as a FALL), followed by identifying all of the associated elements in the system that could cause that top event to occur. Fault trees provide a convenient symbolic representation of the combination of events resulting in the occurrence of the top event. Events and gates in fault tree analysis are represented by symbols. The entire system, as well as human interactions, would be analyzed when performing a fault tree analysis. While building the fault tree and performing the analysis, a review of information regarding codes, standards, and research documents associated with processes, procedures, and equipment, along with the human interface and factual evidence relevant to the accident event, is required.
The following is an example of basic fault tree logic:
[Figure: Fault Tree]
Separating Qualitative and Quantitative Considerations in FTA as Exemplified in a Phased Mission Analysis
For certain aerospace applications, the goal is to model the different phases of a mission. An example is the Space Shuttle, which can be modeled as having three phases in its mission: Ascent, Orbit, and Entry. If a system goes through different phases in a mission, then the failure of the system in each phase should be modeled. The success criteria for the system as well as the system configuration and system boundary may change from phase to phase. For each phase, a fault tree can be constructed for system failure, accounting for the success criteria, configuration, and system boundary for that phase. In constructing the fault tree for a given phase, the other phases can be ignored. However, when the fault trees are quantified for the different phases, the interactions among the phases need to be taken into account. For example, if there is an event in the fault tree that describes a component being failed in a given phase, then the component may be failed because it failed in that phase or because it failed in a previous phase. When the probability of the component being failed in a phase is quantified, the probabilities of the component failing in the past phases need to be evaluated as well as in the present phase. In this quantification, different component failure rates in the different phases may be used as well as different repair criteria. A number of computer codes handle these inter-phase evaluations to simplify the analyst's tasks. This separation of the qualitative and quantitative considerations in phased missions is an example of the general separation of qualitative and quantitative considerations in FTA.
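The inter-phase dependence can be illustrated for a single component. The sketch below assumes a non-repairable component with a constant failure rate within each phase, so that the probability of being failed in a phase accumulates the exposure from all earlier phases. The phase durations, failure rates, and function name are hypothetical.

```python
from math import exp

# Hypothetical phases: (name, failure rate per hour, duration in hours).
phases = [
    ("Ascent", 1e-3, 0.15),
    ("Orbit", 1e-5, 240.0),
    ("Entry", 5e-4, 1.0),
]

def failed_by_end_of_phase(phase_list):
    """Probability the component is failed by the end of each phase.

    Assumes no repair and a constant failure rate within each phase, so the
    survival probabilities of successive phases multiply.
    """
    survival = 1.0
    result = {}
    for name, rate, hours in phase_list:
        survival *= exp(-rate * hours)   # must also survive this phase
        result[name] = 1.0 - survival    # failed now or in an earlier phase
    return result

for name, prob in failed_by_end_of_phase(phases).items():
    print(f"{name}: {prob:.3e}")
```

The Entry value is larger than what the Entry phase alone would give, because a component that failed during Ascent or Orbit is still failed at Entry.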
Fault Trees for System Design
A fault tree can also be constructed for a system that is being designed as well as for a system that is implemented and operating. Even though the general principles used in constructing these two different types of fault trees are the same, there are differences in the strategies used, in the scope of the fault trees, and the level of resolution of the fault trees. The basic principles applicable to the construction of fault trees for design are discussed here. Fault trees constructed for already operating systems are discussed in the next section.
In constructing a fault tree for a system that is being designed, the detailed specifications or detailed schematics for the system are generally not available. Often only the top-level logic for the system is available, consisting of its basic functions and general interfaces.
Even with this limited information, a fault tree can be an important tool in assisting in the design of the system. Furthermore, a fault tree can serve as a primary tool for providing a performance-based design for the system. For evaluating a system design, the fault tree to be developed is a top-level fault tree showing the general logic and relationships for the design. To quantify the fault tree where specific data are not available, generic data from published data sources are used. When using generic databases, data on heritage systems, suitably modified to take into account risk-significant design changes, are used, and the data that bracket the component or subsystem being investigated are determined by comparing the general characteristics of the design with the characteristics associated with the generic data. The bracketed results from the design fault tree give useful information on the range of failure probability or reliability achievable with the design. One of the example fault trees that will be described was used for a system design. When the design fault tree is quantified, the importances and sensitivities of the different parts of the design are obtained. This is useful information and shows what parts of the system drive the failure probability and reliability. The designer can then focus on the important and sensitive parts. One of the greatest benefits resulting from carrying out any FTA is the establishment of design priorities for all elements of the fault tree and thereby for the design effort. Often, only a few of the elements, or contributors, will drive the failure probability and reliability. The FTAs that have been performed in the past generally show that fewer than 20% of the contributors dominate the failure probability and the reliability. Often, in fact, 90% or more of the result is driven by 10% or fewer of the contributors.
The application of the design fault tree can be carried one step further. In this case it can be used as a tool for performance-based design. The example of the design fault tree that will be described is an application of FTA for performance-based design. In carrying out a performance-based fault tree evaluation, a failure probability goal is assigned to the top event. This goal is then allocated down through the fault tree to the modules and subsystems in the design. The allocated values that are obtained for the modules and subsystems indicate whether the design has the capability of meeting the top goal. In other words, these values indicate what is sometimes called the “achievability” of the design goal. Various allocations can be selected to determine their feasibilities. Furthermore, by incorporating common cause failures (CCFs) into the fault tree evaluations, not only can the number of redundant elements required be determined but also whether diversity is required as opposed to simple identical unit redundancy. For diversity, the redundant capabilities must be provided without relying on identical units to guard against common dependencies. It can be further required that proven technologies be used to provide the functions. In this case, the allocated values provide performance requirements to the suppliers of the system. Design fault trees can therefore be important tools to assist in focusing the design effort and providing performance-based requirements for the design.
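To illustrate how common cause failures can limit what identical redundancy achieves, the sketch below uses the beta-factor model, in which a fraction beta of each unit's failure probability is assumed to fail all units at once. The choice of this model, the numerical values, and the function names are assumptions made for the example and are not taken from the handbook.

```python
def system_failure_probability(p_unit, n_units, beta):
    """Failure probability of n identical redundant units (beta-factor model).

    The system fails if all units fail independently, or if a common cause
    fails every unit at once.
    """
    independent = ((1.0 - beta) * p_unit) ** n_units
    common_cause = beta * p_unit
    return independent + common_cause

def units_needed(p_unit, beta, goal, max_units=6):
    """Smallest number of identical units meeting the goal, or None."""
    for n in range(1, max_units + 1):
        if system_failure_probability(p_unit, n, beta) <= goal:
            return n
    return None  # identical redundancy alone cannot meet the goal

# Hypothetical values: unit failure probability 1e-2, beta = 5%.
print(units_needed(1e-2, 0.05, goal=1e-3))  # 2
print(units_needed(1e-2, 0.05, goal=1e-4))  # None
```

In the second case the common cause term (5e-4) alone exceeds the goal, so adding more identical units does not help; this is the situation in which diversity would be considered.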
Fault Trees for an Implemented System
When a fault tree is constructed for an implemented and operational system, detailed design and operational information is generally available. In this case, the goal in carrying out a FTA is often to improve the system or to diagnose problems within the system. The fault tree may also be constructed to monitor system safety or reliability performance. When a fault tree is constructed for an implemented system, the tree is developed down to a level that contains the contributors of interest and for which data are available. This often means constructing the fault tree down to the major component level, e.g., to a valve, pump, and control module level. Because of their low failure probabilities, piping and wiring are not generally modeled unless the objective is specifically to go to this level of detail or there is suspicion that global effects, such as aging or wearout, have increased the failure probabilities. Also, fault trees are generally not developed to a detailed contact or part level for a component, such as a valve stem, because of the lack of data to support quantification of such detailed models.
The capability of a FTA to establish priorities among fault tree elements is very useful. Different importance measures may be calculated in FTA for different applications. By including resource expenditures, burden-to-importance ratios can be calculated to show the imbalances between resource expenditures and the importance to risk. Using these importance measures and using cost-benefit principles, resource expenditures on operational systems can be optimally reallocated to maximize their effectiveness. In many past applications, resources have been reallocated resulting in significant reductions in total resource expenditures with no impacts on the failure probability or risk. In particular cases, resource expenditures have been reduced as much as 40% or more. If the total resource expenditure is held fixed then the resources can be reallocated to significantly reduce the current failure probability and risk. In many cases the failure probabilities have been significantly reduced, sometimes by a factor of 10 or more, using the same total resources.
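A burden-to-importance ratio can be computed in several ways; the sketch below takes one simple reading, dividing each contributor's share of total resource expenditure by its share of the top event probability. The activity names, the numbers, and this particular definition of the ratio are assumptions for illustration.

```python
# Hypothetical contributors: share of total resources spent on each, and
# share of the top-event probability attributed to each.
contributors = {
    "valve_inspection":   {"cost_share": 0.40, "importance": 0.05},
    "pump_overhaul":      {"cost_share": 0.35, "importance": 0.60},
    "controller_testing": {"cost_share": 0.25, "importance": 0.35},
}

def burden_to_importance(data):
    """Return (name, ratio) pairs sorted from most to least over-resourced."""
    ratios = {
        name: values["cost_share"] / values["importance"]
        for name, values in data.items()
    }
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)

for name, ratio in burden_to_importance(contributors):
    print(f"{name}: {ratio:.2f}")
```

A ratio well above 1 flags a contributor that consumes more resources than its importance justifies, and a ratio well below 1 flags one that may deserve more attention.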
In addition to prioritizing the contributors, FTA can be used on an operational system to predict and correct failures before they occur. Failure trending of the components or other lower level elements in a system can be established and input to the fault tree to determine the system implications. Performance criteria on the system can then be used to determine the appropriate actions to take. Monitoring of system performance can also be conducted by periodically updating the fault tree quantification with current data. Formal approaches exist for incorporating new data into the baseline fault tree quantification. Defect data and soft failure (partial failures) data can also be incorporated, in addition to hard failure data. The use of FTA in this way can be referred to as a “proactive” use.
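The handbook text does not name a specific updating method here, so the following is only one possible illustration: a Bayesian update of a component failure rate using a gamma prior with Poisson-distributed failure counts. The prior parameters, the observed data, and the function name are hypothetical.

```python
def update_failure_rate(prior_alpha, prior_beta, failures, exposure_hours):
    """Gamma-Poisson conjugate update of a constant failure rate.

    Prior: rate ~ Gamma(alpha, beta), with mean alpha / beta (per hour).
    Data: `failures` observed over `exposure_hours` of operation.
    Returns the posterior (alpha, beta).
    """
    return prior_alpha + failures, prior_beta + exposure_hours

# Hypothetical baseline: prior mean rate of 1e-4 per hour.
alpha0, beta0 = 0.5, 5000.0

# Hypothetical new operating experience: 2 failures in 10,000 hours.
alpha1, beta1 = update_failure_rate(alpha0, beta0, failures=2, exposure_hours=10_000.0)

print(f"prior mean rate:     {alpha0 / beta0:.2e} per hour")
print(f"posterior mean rate: {alpha1 / beta1:.2e} per hour")
```

The updated rate would then replace the baseline value of the corresponding basic event when the fault tree is re-quantified.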
The FTA can also be used reactively. In this case, when a system failure occurs the fault tree can be used to diagnose the potential causes and to identify the most effective corrective measures. If a component failure occurs, the fault tree can be used to identify the significance of the failure with respect to the overall failure of the systems and identify those remaining components in the system that are most important in preventing the top event (precursor or “near miss” analysis). This evaluation can sometimes be performed using the importance measures produced by the FTA.