Article

Risk Management (2008) 10, 1–31. doi:10.1057/palgrave.rm.8250040

A Methodology for Learning from System Failures and its Application to PC Server Maintenance

Takafumi Nakamura (a) and Kyoichi Kijima (b)

  1. (a) Support technology Group, Fujitsu Fsas Inc, Minato-ku, Tokyo, Japan
  2. (b) Department of Value and Decision Science, Tokyo Institute of Technology, Meguro-ku, Tokyo, Japan

Correspondence: Kyoichi Kijima, Department of Value and Decision Science, Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro-ku, Tokyo 152-8552, Japan. E-mail: kijima@valdes.titech.ac.jp


Abstract

The need for more effective ways to manage risk of technological systems and ensure safety continues to grow in industry. Risk management tends to be considered part of the process of planning technological systems, though, not as part of day-to-day operations. However, risk must also be considered with respect to emergent events that require taking action in ways which differ from planned operational procedures. This paper proposes a new methodology for learning from system failures to prevent further occurrences. The proposed methodology is unique and significant in how it can change industrial maintenance systems through a shift from reactive to proactive ways. The clarification of mechanisms that are prone to failure and corresponding preventative measures will be invaluable to society as a whole.

Keywords:

risk management, SSM, structural modeling, ISM, system failure


Introduction

Today's IT infrastructure is highly dependent on computer systems, which have become ever more indispensable in our daily lives and corporate operations. This implies that extended IT infrastructure failure could severely affect our daily lives or corporate activities. Indeed, recent years have seen a series of serious accidents that threaten the safety and security of society. These include a major nuclear power plant accident in Japan, the outbreak of BSE, the cover-up of serious defects by a car manufacturer, and the disastrous explosion of the Challenger space shuttle.

If serious accidents are inevitable, how can we respond to their impact and manage complexity in a timely manner, so as to assure an acceptable level of safety and security? That is the question we try to answer in this paper.

According to Wang and Roush (2000, Chapter 2, p 44): The engineer must design out failures that could result in loss of property, damage to the environment of the user of that technology, and possibly injury or loss of life. Through analysis and study of engineering failures and their mechanism, modern engineering designers can learn what to avoid and how to create designs with less chance of failure.

The significance of the present paper is that it tries to enhance system safety by learning from past failures collectively rather than individually, and by extracting effective preventative measures that would not be found otherwise.

We first propose a new methodology that re-examines the current cognitive framework for maintenance techniques through the structuring and visualizing of failure factors. We call this the failure factor structured method (FFSM).

We then demonstrate that FFSM enables quantitative structuralization and visualization of failure factors in such a way that allows new observations to help prevent system failures. FFSM is unique and significant in the sense that it drastically shifts industrial maintenance systems from reactive to proactive ways by promoting double-loop learning through managing emergent events.

Furthermore, to validate the methodology, we apply it to a PC server maintenance activity and confirm that it really enables observations that can be used to prevent further occurrences of system failures.

Finally, we conclude that FFSM is an effective way to manage emergent properties not anticipated when systems are planned, and that it can potentially complement "soft approaches" including the soft systems methodology (SSM) (Checkland and Scholes, 1990; Kijima, 1997; Jackson, 2003) and overcome its shortcomings.

This paper proceeds as follows. The next section provides an overview of maintenance systems. The subsequent sections introduce our new methodology (FFSM) for preventing system failures, describe a case of applying FFSM to a PC server maintenance activity, and discuss the effectiveness of FFSM and conclude the paper, respectively.


Overview of maintenance systems

A maintenance system can be thought of as three parts: maintenance worldview, the maintenance objective system, and the maintenance system itself (IEC 60300-3-10, 60706-2; Bignell and Fortune, 1984; Reason and Hobbs, 2004). Figure 1 illustrates the relationship among the three.

Figure 1. Maintenance system overview.

Maintenance worldview

A worldview of maintenance provides a cognitive framework for identifying a maintenance system (IEC 60300-3-10, 60706-2). The cognitive framework prescribes a maintenance system and maintenance objective system. The maintenance worldview is influenced by several inter-related factors, including the maintenance organizational culture, customs, the safety and security performance provided by a maintenance objective system, and public opinion. For instance, a system downtime allowance and the maintenance organization (hardware and software) are a tacit proposition for the maintenance worldview for the design, configuration, operation, and rectification of a maintenance system.

Maintenance system

A maintenance system used to maintain an objective system generally includes the following four subsystems (IEC 60300-1, 60300-2). A maintenance system facilitates achievement of a designed target through the stable, safe, and secure operation of the maintenance objective system.

Design subsystem
 

The purpose of the design subsystem is to clarify the operational needs of a maintenance objective system and establish a maintenance plan. A maintenance plan involves the design of scheduled (periodic) maintenance and the application of conditional maintenance (in the event of a failure).

Configuration subsystem
 

A configuration subsystem is needed to establish a maintenance organization, develop maintenance manuals, and construct a maintenance object system based on a maintenance plan defined by the design subsystem.

Operation subsystem
 

An operation subsystem is used to maintain an objective subsystem through a maintenance organization by using maintenance manuals established and developed by the configuration subsystem.

Evaluation subsystem
 

An evaluation subsystem is used to evaluate both the maintenance system itself and the maintenance objective system, based on the operational records and system failure events. This subsystem facilitates the learning loops shown in Figure 1. Single-loop learning rests on the ability to detect and correct errors relative to a given set of operating norms, while double-loop learning depends on being able to take a second look at a situation by questioning the relevance of the operating norm (Morgan, 1997, pp 86–89).

Maintenance objective system

In accordance with IEC 60300-3-10 and 60706-2, this paper assumes the maintenance objective system to have the following characteristics.

Devices
 

The maintenance objective system consists of logical (or physical) and single (or multiple) device(s).

Corrosion
 

The maintenance objective system is made obsolete by software (e.g., security guards) or hardware (e.g., corrosion) with the passing of time.

Operation and maintenance
 

The maintenance objective system is operated and maintained by operator(s) and maintenance engineer(s) according to a prescribed maintenance manual. We call a failure of the maintenance objective system a system failure.

Overview of managing system failures

System failures are outcomes not expected given the design of either a maintenance system or a maintenance objective system. The operation subsystem is responsible for managing system failures. It restores system operation after a system failure through failure detection, recovery method selection, and recovery, although the root causes of a system failure vary.

An evaluation subsystem feeds back every countermeasure into an appropriate subsystem within the maintenance system or performs preventative measures for appropriate maintenance objective system groups. However, the evaluation subsystem usually works reactively only after system failures. This is one reason a series of serious accidents can occur. An effective preventive methodology will need to overcome this problem.

This will require various types of feedback to rectify the current worldview of a maintenance system as shown in Figure 1.


FFSM: A new methodology for learning from failures

Why do system failures happen, seemingly without end? One reason is that lessons learned from system failures are simply not thoroughly understood by the organization. James Reason (Reason, 1997, Chapter 7, p 126) states that "errors are consequences, not causes." If so, we need a methodology to manage the consequences as well as to isolate the root cause.

In the following, we first list the features that such a methodology should have and then compare the list against several typical existing structuring methodologies; this comparison leads to the conclusion that we need an innovative methodology to realize the required features. We then propose a methodology that sufficiently offers the required features.

Required features of the methodology

We believe that such a methodology should have the following features, because viewing a system holistically and reflecting learning upon a current cognitive worldview is indispensable for managing risk proactively (Checkland and Scholes, 1990).

  1. It will enable us to maintain a system while also providing a means of creating countermeasures derived from an analysis of each specific system failure. In other words, it should support implementation of double-loop learning.
  2. It will structure, visualize, and provide a bird's eye view of the factors that seem to cause system failures in such a way that it enables us to obtain observations needed to rectify or abandon the current worldview of maintenance. That is, it should help us view the system in a holistic way from a conceptual worldview as well as from a real worldview.
  3. As a result, it will aid decision-making through a structural understanding of the essence of problems inherent in failure factors and how they are related.

The above implies that a structuring methodology should internalize a double-loop learning mechanism and review procedure from the viewpoint of the conceptual world as well as that of the real world, in a holistic way.

Limitations of existing structuring methodologies and risk analysis techniques

Unfortunately, existing structuring methodologies and risk analysis techniques do not sufficiently satisfy the requirements. Table 1 summarizes the existing methodologies for clarifying problem structures, among which SSM (Checkland and Scholes, 1990) has the capability to manage emergent properties in order to devise preventative measures.


There are two widely used failure analysis techniques: failure mode effect analysis (FMEA: IEC 60812) and fault-tree analysis (FTA: IEC 61025). FMEA deals with single-point failures by taking a bottom-up approach, and is presented as a rule in the form of tables. In contrast, FTA analyzes combinations of failures in a top-down way, and is visually presented as a logic diagram.

Both methodologies are mainly employed in the design phase. However, they are heavily dependent on personal experience and knowledge, and FTA in particular has a tendency to miss some failure modes in failure mode combinations, especially emergent failures.

The major risk analysis techniques (including FMEA and FTA) are explained in Bell (1989, pp 24–27), Wang and Roush (2000, Chapter 4), and Beroggi and Wallace (1994). Most failure analyses and studies are based on either FMEA or FTA. FMEA and FTA are rarely both performed, though, and when both are done they tend to be separate activities executed one after the other without significant intertwining.

Current methodologies tend to lose the holistic view of root causes of system failures. In addition, most of them may be able to clarify problem structures, but do not effectively promote double-loop learning as a preventative measure. Therefore, a system will often repeat similar failures. In other words, current methodologies do not combine all of the required features explained in the previous sub-section within one methodological strand.

FFSM: A methodology for maintenance systems

We now propose a methodology, FFSM, that we believe satisfies the requirements stated in the section Required features of the methodology, taking into account the shortcomings discussed above of the existing techniques.

This new methodology should promote double-loop learning through viewing a system in a holistic way. Complex system failures generally arise from various factors acting in combination. Each factor often has a qualitative nature that makes it necessary for a maintenance system to include human intervention. The proposed methodology therefore has both a qualitative and a quantitative capability.

It is important to reveal quantitative relations between qualitative factors and extract hidden factors through grouping problems and known factors as a whole. The methodology also provides the observations needed to rectify the maintenance worldview (i.e., double-loop learning). Figure 2 illustrates a general overview of FFSM, while Table 2 shows the objectives of each phase. Figures 3, 4, and 5 show detailed flows for phases 1, 2, and 3, respectively.

Figure 2. A system able to provide its own maintenance system (FFSM).

Figure 3. Detailed flow of Phase 1.

Figure 4. Detailed flow of Phase 2.

Figure 5. Detailed flow of Phase 3.


In the following, we explain each phase of FFSM.

Phase 1. Structuring a system as a whole (factor dependence and relationships).

This phase enables structuring of the causes of a system failure. Steps 1 and 2 in Figure 3 define a group of problems (failures) and corresponding factors. It is important to analyze problems as a whole rather than each as a specific event. To do this, quantitative factor relationships must be revealed from qualitative factors. For this purpose, we apply ISM (Warfield, 1976, 1980; Sage, 1977) in this phase.

Phase 2. Visualizing failure factors and grouping similar factors.

This phase reveals hidden factors not extracted by analyzing each specific failure event. Our idea is to apply type III quantification theory (Hayashi, 1952; Greenacre, 1984, 1993; Gifi, 1990; Van de Geer, 1993) to find such hidden factors. This method is a type of correspondence analysis (Greenacre, 1984, 1993) and is useful to quantify and visualize the entire set of failure factors that have a qualitative nature (Figure 4).

Phase 3. System exploration (obtaining observations): factor and example mapping into a maintenance framework.

This phase enables preventative measures to be found by promoting double-loop learning to manage emergent problems. Figure 6 shows a learning loop for recognizing maintenance failures. The four subsystems are classified into four quadrants defined by two dimensions; one dimension is time (pre-operation vs post-operation) and the other is the boundary between the conceptual and real worlds.

Figure 6. Learning loop for recognizing maintenance system failures.

The closed loops in each quadrant show closed learning processes improving within that quadrant (single-loop learning). An evaluation subsystem provides feedback to maintenance objective systems to enable horizontal deployment of the improvement countermeasures by analyzing the operational log and failures of the maintenance objective systems.

The arrows in Figure 6 denote feedback into the maintenance system itself, including alteration of the maintenance worldview (double-loop learning). By mapping the factors clarified in phases 1 and 2 into the maintenance frame shown in Figure 6, it is possible to obtain observations that can be used to alter the conventional maintenance cognitive frame.

Table 3 summarizes the three phases of FFSM, and Table 4 shows a new configuration for a maintenance system.



Applicable boundary conditions for FFSM

System maintenance is not the only feature to which FFSM can be applied. The following are other system features with which it can be used.

  • System components that include human processes
  • Human knowledge dependent on the behavior of the system
  • Cause and effect relationships whose effect on outcomes is complex and uncertain.


Application to PC servers (Extended Downtime Analysis)

This section describes the application of FFSM to a PC server maintenance system that manages extended downtime incidents, and explains the result of the application. In such a case, it is necessary to clarify the structures and determine an appropriate quantitative weight for each factor leading to an occurrence of extended downtime by analyzing the PC server incidents that have occurred during a given period.

Sample data for the analysis

  • Period: April to July, 2004
  • Number of samples: 58 PC server extended downtime incidents occurred within the period (more than three hours from detection to the resumption of normal operation)

The following data classification was applied to each incident to produce a 58 × 8 incident-factor matrix (Appendix A). Each incident was related to one or more of eight extended downtime factors. These eight factors, extracted from the experience-based knowledge of engineers who had dealt with extended downtime incidents, were:

  • S1: Product
  • S2: Isolation (diagnosis of faulty parts)
  • S3: Maintenance organization (skills, scale, and deployment)
  • S4: Spare parts availability (deployment and logistics)
  • S5: Faulty spare parts
  • S6: Fix has not been applied (EC has not been applied)
  • S7: Recovery process
  • S8: Software bugs
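One way to picture the incident-factor matrix is as one 0/1 row per incident and one column per factor. The following Python sketch illustrates that layout only; the three sample incidents and the function name `build_incident_factor_matrix` are hypothetical and are not taken from Appendix A.

```python
# Factor labels S1..S8 as listed in the text.
FACTORS = ["S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8"]


def build_incident_factor_matrix(incidents):
    """Return one 0/1 row per incident and one column per factor."""
    matrix = []
    for factors in incidents:
        unknown = set(factors) - set(FACTORS)
        if unknown:
            raise ValueError(f"unknown factor(s): {sorted(unknown)}")
        matrix.append([1 if f in factors else 0 for f in FACTORS])
    return matrix


# Hypothetical incidents for illustration (NOT the data in Appendix A).
sample_incidents = [
    {"S2", "S7"},        # slow isolation followed by slow recovery
    {"S4"},              # spare part not available
    {"S1", "S6", "S8"},  # product issue, fix not applied, software bug
]

matrix = build_incident_factor_matrix(sample_incidents)
for row in matrix:
    print(row)
```

With the real data, the same layout would give 58 rows and 8 columns.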

Phase 1 (structured model analysis: ISM)

Figure 7 shows the direct influential matrix X* obtained by analyzing the causal relationship between the eight factors introduced in the previous sub-section (S1–S8). The direct influential matrix is the causal relationship matrix where the columns and rows contain factors S1–S8.

Figure 7. Direct influential matrix X*.

X*=(xjk): xjk=3, 2, 1 (if there is a direct causal relationship from column j to row k) (3: strong relationship, 2: moderate relationship, 1: weak relationship). xjk=space (if there is no direct causal relationship from column j to row k).

Figure 8 shows the adjacent matrix A, in which each element is either 1 or a space, indicating, respectively, that there is a direct relationship or no direct relationship. Figure 9 shows the vector graph created from Figures 7 and 8, where an arrow indicates a direct causal relationship between two factors.
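The step from the weighted matrix X* to the adjacent matrix A amounts to discarding the weights and keeping only the presence of a relationship. A minimal Python sketch follows, assuming blank cells are represented as `None`; the 3 × 3 values are hypothetical and do not reproduce Figure 7.

```python
def to_adjacency(direct):
    """Return a 0/1 matrix: 1 where a weight (1, 2, or 3) is present, else 0."""
    return [[1 if weight else 0 for weight in row] for row in direct]


# Hypothetical 3 x 3 direct influence matrix (NOT the values in Figure 7).
# None stands for a blank cell, i.e., no direct causal relationship.
x_star = [
    [None, 3, None],
    [None, None, 1],
    [None, None, None],
]

adjacency = to_adjacency(x_star)
relation_count = sum(map(sum, adjacency))  # number of direct relationships
print(adjacency)
```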



Figure 10 shows the reachable matrix T obtained by adding unit matrix I to adjacent matrix A and repeating the following Boolean algebra:


$$(A+I)^{r-1} \neq (A+I)^{r} = (A+I)^{r+1} = T,$$

where $T = (t_{ij})$ is the reachable matrix.

The following condition is then checked:

$$R_i \cap A_i = R_i$$

Here, $S = \{S_1, S_2, \ldots, S_8\}$ is the group of nodes, $R_i = \{S_j \in S \mid t_{ij} = 1\}$ is the group of nodes (lower nodes) reachable from node $i$, $A_i = \{S_j \in S \mid t_{ji} = 1\}$ is the group of nodes (upper nodes) from which node $i$ is reachable, and $R_i \cap A_i = R_i$ means that $A_i$ includes $R_i$, so $S_i$ is the lowest node.
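The Boolean iteration can be sketched in Python as repeated Boolean multiplication by (A + I) until the result stops changing. The sketch assumes that `adjacency[i][j] = 1` means a direct relation from node i to node j; the 4-node example is hypothetical and is not the matrix of Figure 8.

```python
def bool_matmul(p, q):
    """Boolean matrix product: OR over k of (p[i][k] AND q[k][j])."""
    n = len(p)
    return [[1 if any(p[i][k] and q[k][j] for k in range(n)) else 0
             for j in range(n)] for i in range(n)]


def reachability(adjacency):
    """Multiply by (A + I) until the Boolean power stops changing."""
    n = len(adjacency)
    base = [[1 if (i == j or adjacency[i][j]) else 0 for j in range(n)]
            for i in range(n)]
    current = base
    while True:
        nxt = bool_matmul(current, base)
        if nxt == current:
            return current
        current = nxt


# Hypothetical graph: 0 -> 1 -> 2 and 3 -> 2.
A = [
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
]
T = reachability(A)

# Sets used in the lowest-node condition, here for node 2.
R2 = {j for j in range(4) if T[2][j]}   # nodes reachable from node 2
A2 = {j for j in range(4) if T[j][2]}   # nodes from which node 2 is reachable
print(T, R2 & A2 == R2)
```

In this example only node 2 satisfies the condition, because nothing other than itself is reachable from it.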

First, we find the lowest of the eight nodes. As Table 5 shows, S7 is the lowest node on level 1. We then eliminate S7, the lowest node on level 1, from Table 5 and find the next lowest node. As Table 6 shows, S2 is the lowest node on level 2. Repeating the above process reveals that S4, S5, and S8 are the lowest nodes on level 3 (Table 7). S6 is the lowest node on level 4 (Table 8), and S1 and S3 are the lowest nodes on level 5 (Table 9). All the nodes are thus classified into five levels. The significance of the levels is that factors in upper levels are the root causes of factors in lower levels.
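The repeated "find the lowest nodes, remove them, repeat" procedure can be sketched as follows. This is an illustrative Python sketch under the assumption that `T[i][j] = 1` means node j is reachable from node i; the reachable matrix used here is a hypothetical 4-node example, not the one in Figure 10.

```python
def partition_levels(T):
    """Peel off the lowest nodes level by level (level 1 first)."""
    remaining = set(range(len(T)))
    levels = []
    while remaining:
        level = []
        for i in remaining:
            reach = {j for j in remaining if T[i][j]}       # R_i
            antecedent = {j for j in remaining if T[j][i]}  # A_i
            if reach & antecedent == reach:                 # R_i within A_i
                level.append(i)
        levels.append(sorted(level))
        remaining -= set(level)
    return levels


# Hypothetical reachable matrix for the graph 0 -> 1 -> 2 and 3 -> 2.
T = [
    [1, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 1, 1],
]
levels = partition_levels(T)
print(levels)
```

Here node 2 forms level 1, nodes 1 and 3 form level 2, and node 0, the root cause in this toy graph, forms level 3.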






To consider indirect causes in the causal analysis, it is necessary to introduce the normalized direct influential matrix X (Figure 11). This is obtained by dividing X* by the maximum load factor (11), which is the largest of the column sums of X*. The total influential matrix Z (Figure 12), which includes the indirect causes, can then be obtained by the following operation on X:

[Equation shown as an image in the original; not reproduced here.]

Figure 11. Normalized direct influential matrix X.

Figure 12. Total influential matrix Z.

Each element of Z represents the relative weight of the corresponding causal relation.
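The normalization and the accumulation of indirect influence can be sketched in Python. The text does not spell out the operation that yields Z, so the sketch assumes the series Z = X + X^2 + X^3 + ..., a form commonly used for total influence matrices in structural modeling; this is an assumption, not a statement of the authors' exact formula. The 3 × 3 weights are hypothetical, with 0 standing for a blank cell.

```python
def normalize(direct):
    """Divide every element by the largest column sum (the 'load factor')."""
    n = len(direct)
    column_sums = [sum(direct[r][c] for r in range(n)) for c in range(n)]
    load = max(column_sums)
    return [[direct[r][c] / load for c in range(n)] for r in range(n)]


def matmul(p, q):
    """Ordinary matrix product for square lists of lists."""
    n = len(p)
    return [[sum(p[i][k] * q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def total_influence(x, tol=1e-12, max_terms=1000):
    """Sum X + X^2 + X^3 + ... until the next power is negligible (assumed form)."""
    total = [row[:] for row in x]
    power = [row[:] for row in x]
    for _ in range(max_terms):
        power = matmul(power, x)
        if max(abs(v) for row in power for v in row) < tol:
            return total
        total = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(total, power)]
    raise ValueError("series did not converge")


# Hypothetical weights (NOT Figure 7): element [row][column] holds the
# influence from the column factor to the row factor.
x_star = [
    [0, 0, 0],
    [2, 0, 0],
    [1, 3, 0],
]
x = normalize(x_star)   # largest column sum is 3
z = total_influence(x)
print(z)
```

In this toy case the weight from factor 0 to factor 2 grows from 1/3 (direct only) to 1, because the indirect path through factor 1 contributes (2/3) × 1.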

Figure 13 shows the overall structure of the eight factors in five levels. The number attached to each arrow is the corresponding element of Z.

Figure 13. Overall structure of eight factors in five levels.

Discussion of Phase 1 analysis
 

The uppermost factors (i.e., root causes) in Figure 13, at level 5, are S1 (Product) and S3 (Maintenance organization). S6 (EC not applied) is in level 4; S4 (Spare parts), S5 (Faulty spare parts), and S8 (Software bug) are in level 3; S2 (Isolation) is in level 2; and S7 (Recovery process) is in level 1. In this structured model, S2 (Isolation) and S7 (Recovery process) are positioned at the lower levels, which is counter-intuitive because those factors tend to be treated as direct causes of extended downtime incidents. This suggests that merely rectifying the isolation (S2) or recovery process (S7) is not an adequate countermeasure for reducing extended downtimes. In addition, the position of S6 (EC not applied) in level 4 shows that promoting the application of ECs to PC servers is a fairly good countermeasure for reducing the number of extended downtime incidents. This justifies the field engineers' wisdom of applying ECs as a preventative measure. The major upper factors of S7 (Recovery process) are S1 (Product) (0.45), S8 (Software bug) (0.36), and S3 (Maintenance organization) (0.35), where the number in parentheses is the relative weight of each causal relation. This indicates that the product and software-related maintenance organizations are the root causes of an extended period being needed for recovery. The major upper factors of S2 (Isolation) are S1 (Product) (0.37) followed by S8 (Software bug) (0.29). These weights also indicate that the product and software bugs are the root causes of an extended period being needed for recovery.

Phase 2 (quantification theory type III)

All 58 incidents were analyzed based on the eight factors using the factor matrix (Appendix A). Quantification theory type III was then applied to the factor matrix. Quantification theory type III is a well-known analytical method for multi-variable quantification analysis. A PC program named "excel toukei 2002" (Kabushiki-kaishiya Shiyakai-Jiyouhou-Service, 2002) was used for this analysis. Table 10 shows the factor axes for up to six axes with the eigenvalues and contribution ratios.
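For readers without access to the statistical package, the first axis of such an analysis can be approximated by reciprocal averaging, a classical iteration for correspondence analysis of a 0/1 matrix. The Python sketch below is an illustration under that assumption and is not the procedure implemented in the program used by the authors; the 6 × 4 matrix is hypothetical (not Appendix A), and the function name is illustrative.

```python
def reciprocal_averaging(matrix, iterations=500):
    """Approximate the first non-trivial axis: (eigenvalue, row scores, column scores)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_tot = [sum(row) for row in matrix]
    col_tot = [sum(matrix[i][j] for i in range(n_rows)) for j in range(n_cols)]
    if min(row_tot) == 0 or min(col_tot) == 0:
        raise ValueError("every incident and every factor must occur at least once")
    grand = sum(row_tot)

    def standardize(y):
        # Remove the trivial constant solution, then scale to unit weighted variance.
        mean = sum(c * v for c, v in zip(col_tot, y)) / grand
        y = [v - mean for v in y]
        scale = (sum(c * v * v for c, v in zip(col_tot, y)) / grand) ** 0.5
        return [v / scale for v in y]

    def step(y):
        # Row score = mean of the scores of its factors; column score likewise.
        x = [sum(matrix[i][j] * y[j] for j in range(n_cols)) / row_tot[i]
             for i in range(n_rows)]
        y_new = [sum(matrix[i][j] * x[i] for i in range(n_rows)) / col_tot[j]
                 for j in range(n_cols)]
        return x, y_new

    y = standardize([float(j) for j in range(n_cols)])
    for _ in range(iterations):
        _, y_new = step(y)
        y = standardize(y_new)
    x, y_new = step(y)
    eigenvalue = sum(c * a * b for c, a, b in zip(col_tot, y, y_new)) / grand
    return eigenvalue, x, y


# Hypothetical incident-factor matrix (6 incidents x 4 factors).
incidents = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]
eigenvalue, incident_scores, factor_scores = reciprocal_averaging(incidents)
print(round(eigenvalue, 3), [round(v, 3) for v in factor_scores])
```

Factors that tend to occur in the same incidents receive similar scores, which is what allows an axis to be named after the factors at its extremes. Further axes and contribution ratios require a full eigen-decomposition, which this sketch does not attempt.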


Table 10 shows that the accumulated factor contribution ratio up to the third axis is about 53%. This indicates that about half of the extended downtime incidents are related to these three axes. The names of the first three axes are given below, where each name also indicates which hidden factor is extracted by phase 2 analysis.

  • 1st axis, isolation of faulty parts
  • 2nd axis, software recovery
  • 3rd axis, hardware maintenance organization

Figure 14 shows the factor scores for the first axis. Positive values of the factor scores (S2 (isolation) and S3 (maintenance organization) in this case) relate to maintenance processes, while negative values (S8 (software bug) and S6 (EC not applied)) relate to products. Therefore, the first axis relates to a maintenance process that is specifically for isolating faulty parts.

Figure 14. Factor scores for the first axis (isolation of faulty parts).

Figure 15 shows the factor scores for the second axis. The positive values of the factor scores (S8 (software bug) and S3 (maintenance organization)) relate to the software recovery process, while the negative values of the factor scores (S4 (spare parts) and S6 (EC not applied)) relate to products. Therefore, the second axis relates to the software recovery process.

Figure 15. Factor scores for the second axis (software recovery).

Figure 16 shows the factor scores for the third axis. The positive values of the factor scores (S3 (maintenance organization) and S1 (product)) relate to the hardware maintenance organization, while the negative values of the factor scores (S8 (software bug)) relate to software bugs. Therefore, the third axis relates to the hardware maintenance organization.

Figure 16. Factor scores for the third axis (hardware maintenance organization).

Figure 17 shows the space for the first axis (isolation) and second axis (software recovery), as well as the factor mapping into the space. Figure 18 shows the space for the first axis (isolation) and third axis (hardware maintenance organization), and the factor mapping into this space. The maintenance organization factor (S3) is located in the first quadrant in both Figures 17 and 18, which indicates that it is closely related to all three axes (i.e., isolation, software recovery, and hardware maintenance organization). The isolation factor (S2) is close to 0 on the second and third axes, although it has a high positive value on the first axis. This means that the isolation factor is independent of the second and third axes and forms a single group that can be isolated. The software bug factor (S8) has a high positive value on the second axis, a negative value on the first axis, and a value close to 0 on the third axis; therefore, the software bug factor is heavily related to software products as a cause of extended downtimes. The other factors are located around 0 on every axis and therefore show no specific feature.

Figure 17. Factor distribution in the first–second axes space.

Figure 18. Factor distribution in the first–third axes space.

Discussion of the Phase 2 analysis
 

Each sample has three scores relating to the three axes (Appendix B), and every sample is mapped according to its scores in the first axis–second axis space (Figure 19). The 58 samples are classified into 12 groups based on their adjacency in the space coordinates, as shown in Appendix B. Each group in Figure 19 is labeled Gm(n), where m is the group number (1 ≤ m ≤ 12) and n is the number of sample incidents in the group. The number next to each data point in Figure 19 represents the number of sample incidents having the same axis values.

Figure 19. Sample mapping in the first axis–second axis space.

Figure 20 shows the factor mapping (Figure 17) overlaid on the sample mapping (Figure 19). G1, G7, and G12 are located near factors S2 (isolation), S4 and S5 (spare parts), and S8 (Software bug), respectively. Therefore, G1, G7, and G12 are the groups whose dominant factors leading to extended downtimes are RAS, spare parts, and software (i.e., software recovery or bug), respectively. G8 and G10, though, are located near the recovery (S7), spare parts (S4 and S5), and product (S1) factors. Therefore, G8 and G10 are groups that cannot be clearly related to a single dominant factor of extended downtimes. However, all incidents in G8 have the same symptom, a multi-dead hardware disk, whereas the symptoms in G10 differ. Therefore, G8 and G10 are called the multi-dead group and the miscellaneous group, respectively.

Figure 20. Sample group distribution in the first–second axes space.

Phase 3 (system exploration: obtaining observations)

This section explores further steps, especially for G8 and G10 which have many factors and for which no feasible countermeasures can be found through Phases 1 and 2.

The quadrant number shown in Appendix B corresponds to the quadrant number in Figure 6 and indicates which quadrant is responsible for managing the cause. The decision criteria used to determine which quadrant should manage the cause, as well as the countermeasures for each phase, are shown in Table 11. It should be possible to prevent further occurrences of system failures whose root causes relate to the RAS, software recovery, and human error criteria if new observations can be obtained and used to alter the existing maintenance cognitive frame (i.e., double-loop learning).


Discussion of Phase 3 analysis
 

Of the total 58 incidents, 48 (83%) were classified into quadrant III when the decision criteria shown in Table 11 were used. In addition, 20 incidents (34%) and 14 incidents (24%) were, respectively, classified into quadrants I and II (Figure 21). From a preventative point of view, this means that there are more reactive measures (quadrant III) than proactive measures (quadrants I and II), indicating that the ability to learn is indispensable; that is, the ability to analyze a failure to determine its cause and understand why any proactive measures have failed is essential. According to the analysis of an operation subsystem (quadrant III of Figure 21 and Appendix B), there are three main factors affecting extended down time. The first is the RAS function (43 incidents), the second is the engineer's skill (30 incidents), and the last is human error (1 incident). Although only one incident relates to human error, since human error is inevitable, the system should be able to appropriately manage such errors.
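The percentages quoted above can be checked with a few lines of arithmetic. The counts are taken from the text; the reading that an incident may fall into more than one quadrant is an inference from the fact that the counts sum to more than 58, not something stated explicitly.

```python
TOTAL_INCIDENTS = 58

# Incident counts per quadrant, as quoted in the text.
quadrant_counts = {"III": 48, "I": 20, "II": 14}

def share(count, total=TOTAL_INCIDENTS):
    """Percentage of all incidents, rounded to the nearest integer."""
    return round(100 * count / total)

shares = {q: share(n) for q, n in quadrant_counts.items()}
print(shares)  # {'III': 83, 'I': 34, 'II': 24}

# The quadrant counts add up to 82, which exceeds the 58 incidents,
# so some incidents are presumably assigned to more than one quadrant.
overlap = sum(quadrant_counts.values()) - TOTAL_INCIDENTS
print(overlap)  # 24
```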

Figure 21. Factors contributing to PC server extended downtimes and a maintenance cognitive frame.

As a result, the Phase 3 analysis led to the following three new worldviews for PC server maintenance (Table 12):

  1. Development of RAS features (around boot-up functions)
  2. Fostering of hybrid engineers (skilled in both hardware and software) to manage multi-dead situations
  3. Development of fail–safe features to prevent catastrophic events caused by human error


Measure 1 is aimed at managing emergent problems related to the interfaces between parts. The initial design of an interface between parts cannot handle exceptional cases (i.e., around boot-up) in which the combination of parts creates emergent problems. Measure 2 is intended to manage interface problems between hardware and software engineers. Measure 3 is intended to manage human errors (the man–machine interface) that could lead to catastrophes such as extended downtimes.

Learning from the application

The following points were clarified in each FFSM phase. First, the structural modeling analysis in Phase 1 quantitatively clarified the overall structure of the factors contributing to extended downtime. The quantification theory type III analysis in Phase 2 then extracted the three axes of extended downtime and mapped each sample incident into the factor space. The 58 sample incidents were then classified into 12 groups (Appendix B) based on their adjacency in the coordinate space, in order to determine countermeasures for each group. However, G8 and G10 had multiple factors and were located centrally in the space map (Figure 20), making it difficult to determine appropriate countermeasures.

The Phase 3 analysis clarified three new worldviews of PC server maintenance, all of which relate to factors S1 (product) and S3 (maintenance organization), the uppermost factors (i.e., root causes) identified in the Phase 1 analysis: Figure 13 extracts S1 and S3 as the uppermost root causes, and Table 12 shows that each of the three new worldviews relates to either S1 or S3. It is also advisable to confirm the results of these countermeasures by applying FFSM periodically.

Finally, we also confirmed that FFSM can manage emergent properties by incorporating the features described in the section Required features of the methodology, which are the same as those of SSM (Checkland and Scholes, 1990). As Jackson explains, however, one criticism of SSM is that "To hard systems thinkers, SSM offers a limited perspective on why the problem situation occurred" (Jackson, 2003, pp 202–207). In this paper, we have shown that FFSM can potentially complement SSM and overcome this shortcoming by providing a quantitative rationale for why problems happen and for the appropriate countermeasures.

Table 13 summarizes the results revealed by FFSM: Phase 1 revealed quantitative causal relations among the failure factors, Phase 2 extracted three hidden factors for extended downtime, and Phase 3 discovered three new worldviews for preventative measures. Table 14 explains the expected performance improvement of countermeasures under the current (reactive) and new (proactive) worldviews. The proactive measures under the new worldview are expected to yield a far greater improvement (83%) than the reactive measures under the current worldview (58%).



Conclusion

In this paper, FFSM is proposed as a means of preventing further occurrences of system failures. By applying FFSM to actual PC server maintenance, we found quantitative causal relations, hidden factors, and three new worldviews for managing emergent failures that would not have been obtained otherwise. This confirms that FFSM has significant potential to change industrial maintenance systems through a shift from reactive to proactive approaches. The application also suggests that this holistic approach is effective for promoting double-loop learning even in a well-established engineering domain such as PC server maintenance. Although we have mentioned the applicable boundary of FFSM, further study is needed to find decision criteria for identifying the domains to which FFSM can suitably be applied.

References

  1. Bell, T.E. (ed.) (1989). Special Report: Managing Murphy's Law: Engineering A Minimum-Risk System. IEEE Spectrum, June, pp 24–57.
  2. Beroggi, G.E.G. and Wallace, W.A. (1994). Operational Risk Management: A New Paradigm for Decision Making. IEEE Transactions on Systems, Man and Cybernetics. Vol.24, No.10 (October), pp 1450–1457.
  3. Bignell, V. and Fortune, J. (1984). Understanding System Failures. Manchester: Manchester University Press.
  4. Checkland, P. and Scholes, J. (1990). Soft Systems Methodology in Action. New York: John Wiley & Sons Inc.
  5. Gifi, A. (1990). Nonlinear Multivariate Analysis. New York: Wiley.
  6. Greenacre, M.J. (1984). Theory and Applications of Correspondence Analysis. New York: Academic Press.
  7. Greenacre, M.J. (1993). Correspondence Analysis in Practice. New York: Academic Press.
  8. Hayashi, C. (1952). Prediction of Phenomena from Qualitative Data and Quantification of Qualitative Data. Annals of the Institute of Statistical Mathematics. Vol.III, pp 69–98.
  9. Huff, A.S. (1990). Mapping Strategic Thought. New York: Wiley.
  10. IEC 60300-1 (2003). Dependability management – Part 1: Dependability management systems, http://webstore.iec.ch/webstore/webstore.nsf/artnum/030734.
  11. IEC 60300-2 (2004). Dependability management – Part 2: Guidelines for dependability management, http://webstore.iec.ch/webstore/webstore.nsf/artnum/031854.
  12. IEC 60300-3-10 (2001). Dependability management – Part 3-10: Application guide – Maintainability, http://webstore.iec.ch/webstore/webstore.nsf/artnum/026649.
  13. IEC 60706-2 (2006). Maintainability of equipment – Part 2: Maintainability requirements and studies during the design and development phase, http://webstore.iec.ch/webstore/webstore.nsf/artnum/035857.
  14. IEC 60812 (2006). Procedure for failure mode and effect analysis (FMEA), http://webstore.iec.ch/webstore/webstore.nsf/artnum/035494.
  15. IEC 61025 (2006). Fault tree analysis (FTA), http://webstore.iec.ch/webstore/webstore.nsf/artnum/037347.
  16. Jackson, M.C. (2003). Systems Thinking: Creative Holism for Managers. New York: John Wiley & Sons Inc.
  17. Kabushiki-kaishiya Shiyakai-Jiyouhou-Service (2002). Excel Toukei 2002 for Windows.
  18. Kijima, K. (1997). Exploring System Knowledge I. Japan: Nikka Girenn (in Japanese).
  19. Morgan, G. (1997). Images of Organization, New edition. Beverly Hills, CA: Sage.
  20. Reason, J. (1997). Managing the Risk of Organizational Accidents. Aldershot: Ashgate Pub. Ltd.
  21. Reason, J. and Hobbs, A. (2004). Managing Maintenance Errors: A Practical Guide. Aldershot: Ashgate Pub. Ltd.
  22. Sage, A.P. (1977). System Engineering – Methodology and Applications. New York: IEEE Press.
  23. Van de Geer, J.P. (1993). Multivariate Analysis of Categorical Data: Application. Beverly Hills, CA: Sage.
  24. Wang, J.X. and Roush, M.L. (2000). What Every Engineer Should Know about Risk Engineering and Management. New York: Marcel Dekker, Inc.
  25. Warfield, J.N. (1976). Societal Systems Planning, Policy and Complexity. New York: John Wiley & Sons.
  26. Warfield, J.N. (1980, June). ISM and related work – annotated bibliography. Department of E.E., University of Virginia.