Fault Management And Fault Recovery In WSN Information Technology Essay

Published: November 30, 2015

As Wireless Sensor Network (WSN) systems grow in size and complexity, their ability to manage and recover from faults becomes more important. A self-managing system is one that can recover from faults during execution without human intervention. Since WSNs are inherently fault-prone and on-site maintenance is usually impracticable, scalable self-healing is crucial for deploying large-scale sensor network applications.

Existing research has already recognized fault management and fault recovery (system management without human intervention) as an efficient paradigm for designing robust self-managed sensor networks. The general idea is to equip sensor nodes with more self-healing capabilities instead of relying heavily on centralized management.

The goal is to develop a self-healing WSN whose self-organization and self-maintenance make sensor nodes more flexible (more adaptable to application requirements and to the dynamic properties of the network), more efficient (lower energy and traffic cost) and more fault tolerant (able to cope with every anticipated type of failure). The aim of our project is to employ fault management and fault recovery in WSNs, allowing them to discover, examine, diagnose and react to dysfunctions. We propose distributing the fault management task among sensor nodes by giving them additional "fault management" functions. We also propose a failure detection and recovery algorithm that can be compared with existing related work and that aims to improve on it in energy efficiency.

Introduction

Fault management is widely considered a key part of network management today. The recent rapid growth of interest in wireless sensor networks (WSNs) has further strengthened the importance of fault management and fault recovery, which play a crucial role in the effective use of these networks. To understand fault management and fault recovery processes, we first want to answer the question: why is fault management important for WSNs? The answer, in essence, lies in the unique features and characteristics of WSNs and in the need to make a WSN reliable for its users wherever it is deployed.

One way to deal with faults is to design a system that can tolerate any fault that occurs (a fault-tolerant system). However, this requires the network designer to know, at design time, all the different types of faults that can occur once the network is deployed. Limited energy is the most significant constraint. Sensor nodes usually carry only limited battery power, yet they are expected to operate autonomously for periods ranging from months to years. In addition, sensor nodes cannot easily be reached to replace their batteries because of their physical deployment locations. For these reasons, faults are expected to occur more frequently in WSNs than in traditional wired or wireless networks. Performance management in a WSN concerns the quality of the information acquisition and distribution service, and it involves a trade-off: the higher the number of nodes to manage, the higher the energy consumption and the shorter the network lifetime. Moreover, sensor nodes often suffer faults caused by hostile environments (e.g. floods, fire, rain). Such faults may make a node temporarily inactive or cause misbehavior ranging from simple crash faults to faults where the node behaves maliciously. For all of these reasons, fault management and fault recovery in WSNs must be handled differently and with extra care [1][2].

In this paper, we extend an existing cellular architecture for fault detection and recovery [3] and describe a new fault management and fault recovery scheme to detect failing nodes and recover connectivity in WSNs. This paper attempts to examine the efficiency of the new cellular architecture for fault detection and recovery. In our proposed cellular architecture, the whole network is divided into a virtual grid of cells. A cell header (cell manager) is chosen in each cell to perform head (manager) tasks. These cells combine to form different groups, and each group selects one of its cell headers (managers) to be the group head (group manager). We propose a hierarchical management structure to ensure that self-management ability is distributed across the levels. The hierarchical management framework and the management roles of nodes are also expected to change to suit changes in the network, for example by replacing a failed cell head or by moving some workload away from sensor nodes whose remaining resources are at a dangerous level. Faulty sensor nodes are detected and recovered within their respective cells without affecting the overall structure of the network.
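The virtual grid can be illustrated with a minimal Python sketch. It assumes that every node knows its (x, y) position and that cells are squares of a fixed side length. The function names, the cell size and the rule of picking the member with the most residual energy as cell manager are illustrative assumptions, not part of the architecture described in this paper.

```python
from collections import defaultdict

def cell_of(x, y, cell_size):
    """Map a node position to the (column, row) index of its virtual grid cell."""
    return (int(x // cell_size), int(y // cell_size))

def build_cells(positions, cell_size):
    """Group node IDs by the grid cell that contains them."""
    cells = defaultdict(list)
    for node_id, (x, y) in positions.items():
        cells[cell_of(x, y, cell_size)].append(node_id)
    return dict(cells)

def pick_cell_manager(members, energy):
    """Hypothetical selection rule: the member with the most residual energy."""
    return max(members, key=lambda node_id: energy[node_id])

# Illustrative data: four nodes in a field split into 10 x 10 cells.
positions = {1: (2.0, 3.0), 2: (8.0, 1.0), 3: (12.0, 4.0), 4: (14.0, 9.0)}
energy = {1: 0.9, 2: 0.7, 3: 0.6, 4: 0.8}

cells = build_cells(positions, cell_size=10.0)
managers = {cell: pick_cell_manager(members, energy)
            for cell, members in cells.items()}
```

With this data, nodes 1 and 2 fall into cell (0, 0) and nodes 3 and 4 into cell (1, 0), so each cell can manage its own members independently of the other.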

Related Work

The intention of fault management in WSNs is different from that in traditional networks. For example, we might be interested in the results of fault management within a given region rather than in simply fixing an individual wireless connection between two nodes. Existing fault management approaches for WSNs vary in architecture, protocol, detection algorithm, detection decision fusion algorithm and so on [fault management in WSN]. A survey on fault tolerance in wireless sensor networks can be found in [1]. This section starts by reviewing fault detection and fault recovery approaches; we then present our fault management and fault recovery task.

Fault detection

Fault detection is the first phase of fault management, in which an unexpected failure should be properly identified by the network system. Overall, fault detection in WSNs is of two types: explicit detection and implicit detection [fault management in WSN]. Explicit detection is performed directly by the sensing devices and their sensing applications (in terms of detection frequency, operation objective, etc.). Implicit detection refers to anomalous phenomena (e.g. harsh environmental conditions) that might prevent a sensor node from communicating or behaving properly and that have to be identified by the network itself. Implicit detection is normally achieved in two ways: the active model and the passive model. The active detection model is carried out by the central controller of the sensor network. Sensor nodes continuously send keep-alive messages to the central controller to confirm their existence. If the central controller does not receive an update message from a sensor node within a pre-specified period of time, it may conclude that the sensor is dead. The passive detection model, by contrast, triggers an alarm only when a failure has been detected. However, this model will not work properly if a sensor is prevented from communicating by intrusion, tampering or being out of range. Fault detection mainly depends on the type of application and the type of failure. Some existing fault detection schemes are discussed below. We classify the existing failure detection approaches into two primary types: centralized and distributed.
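The active detection model described here can be sketched as a simple timeout check at the central controller. The class name, the method names and the 30-second timeout are assumptions made for illustration; a real controller would also have to handle clock drift and message loss.

```python
class KeepAliveMonitor:
    """Central-controller view of keep-alive messages (illustrative sketch)."""

    def __init__(self, timeout):
        self.timeout = timeout   # pre-specified period, in seconds
        self.last_seen = {}      # node_id -> time of the last keep-alive

    def record_keep_alive(self, node_id, now):
        """Remember when a node last confirmed its existence."""
        self.last_seen[node_id] = now

    def suspected_dead(self, now):
        """Return nodes whose last keep-alive is older than the timeout."""
        return sorted(node_id for node_id, seen in self.last_seen.items()
                      if now - seen > self.timeout)

monitor = KeepAliveMonitor(timeout=30.0)
monitor.record_keep_alive("a", now=0.0)
monitor.record_keep_alive("b", now=0.0)
monitor.record_keep_alive("a", now=25.0)   # node "a" keeps reporting
dead = monitor.suspected_dead(now=40.0)    # node "b" has been silent for 40 s
```

The sketch also makes the weakness of the model visible: a node that is alive but out of range looks exactly like a dead node to the controller.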

Centralized Approaches

In a centralized fault management system, a geographically or logically centralized sensor node usually takes responsibility for fault management in the whole network. The central node can be a base station, a central controller or a manager. It is often assumed to have unlimited resources and performs a wide range of fault management tasks [2]. Some common centralized fault management approaches are as follows:

Sympathy [4] is a debugging system used to identify and localize the causes of failures in sensor network applications. The Sympathy algorithm does not provide automatic bug detection. It depends on historical data and metrics analysis to isolate the cause of a failure. Sympathy may require nodes to exchange neighborhood lists, which is expensive in terms of energy. Also, Sympathy's flooding approach means imprecise knowledge of global network states and may cause incorrect analysis.

Jessica Staddon et al. [5] enable the base station to construct an overview of the network by integrating the pieces of network topology information (i.e. node neighbour lists) embedded in the nodes' usual routing messages. The approach assumes that the base station can transmit messages directly to any node in the network, while relying on other nodes to route measurements back to the base station. It also assumes that each node has a unique identification number. In the first step, the base station learns the network topology by executing route-discovery protocols. Once the base station knows the topology, it detects faulty nodes using a simple divide-and-conquer strategy based on adaptive route update messages.

The centralized approach is suitable for certain applications, but it has several limitations. It is not scalable and cannot be used for large networks. Because of the centralized mechanism, all traffic is directed to and from the central point, which creates communication overhead and quick energy depletion. Moreover, the central point is a single point of traffic concentration and of potential failure. Lastly, if the network is partitioned, nodes that are unable to reach the central server are left without any management functionality.

Distributed Approaches

This is an efficient way of deploying fault management. Each manager controls a sub-network and may communicate directly with other managers to perform management functions. Distributed management provides better reliability and energy efficiency and has a lower communication cost than centralized management systems [Network Management in Wireless Sensor Networks10]. The algorithm proposed for faulty sensor identification in [6] is purely localized. Nodes in the network coordinate with their neighbouring nodes to detect faulty nodes before contacting the central point. In this scheme, the reading of a sensor is compared with the median reading of its neighbours; if the resulting difference is large in magnitude (large and positive, or large and negative), the sensor is very likely to be faulty. This algorithm scales easily to large networks, but it works only if the probability of sensor faults is small. Also, if half of a sensor's neighbours are faulty and the number of neighbours is even, the algorithm cannot detect the fault as expected.

The algorithm developed in [7] tries to overcome these limitations by identifying good sensor nodes in the network and using their results to diagnose the faulty nodes. These results are then propagated through the network to diagnose all other sensor nodes. This approach performs well with an even number of sensor nodes and does not require the physical locations of the sensors. However, it is not fully dynamic and has to be preconfigured. Also, each node must have a unique ID, and the centre node must know the existence and ID of each node. In another scheme, proposed in [8], sensor nodes police each other in order to detect faults and misbehaviour. A node listens in on the neighbour it is currently routing to and can determine whether the message it sent was forwarded. If the message was not forwarded, the node concludes that its neighbour is faulty and chooses a new neighbour to route to.
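The neighbour-median comparison can be illustrated with a short Python sketch. It is a simplification that uses a fixed threshold on the absolute difference, which is an assumption for illustration and not the exact statistical test of [6]; the function name and the readings are hypothetical.

```python
from statistics import median

def is_likely_faulty(own_reading, neighbour_readings, threshold):
    """Flag a sensor whose reading differs strongly from its neighbours' median.

    The absolute value covers both cases named in the text: a large positive
    difference and a large negative one.
    """
    if not neighbour_readings:
        raise ValueError("at least one neighbour reading is required")
    return abs(own_reading - median(neighbour_readings)) > threshold

neighbours = [20.5, 21.2, 20.9]                      # illustrative readings
healthy = is_likely_faulty(21.0, neighbours, 2.0)    # close to the median
too_high = is_likely_faulty(35.0, neighbours, 2.0)   # large positive difference
too_low = is_likely_faulty(5.0, neighbours, 2.0)     # large negative difference

# Limitation noted in the text: with an even number of neighbours of which
# half are faulty, the median is pulled away and a healthy node is misjudged.
misjudged = is_likely_faulty(20.5, [20.0, 21.0, 50.0, 51.0], 2.0)
```

Because each node needs only its neighbours' readings, the check runs locally and no message has to reach the central point unless a fault is suspected.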

As we have seen, the distributed approach is becoming the design trend for fault management in WSNs. Sensor nodes gradually take on more management responsibility and decision-making in order to achieve the vision of self-managed WSNs. The node self-detection scheme [10] and neighbour coordination [11] are good examples of distributed management, but they focus on a small region (a group of nodes) or on an individual node. Other work, such as MANNA [3] and WinMS [12], proposes management architectures that look after the overall network through a central manager.

MANNA [3] is a policy-based approach that uses external managers to detect faults in the network. MANNA assigns different management roles to sensor nodes depending on the network characteristics (homogeneous vs. heterogeneous). These distinguished nodes exchange request and response messages with each other for management purposes. To detect node failures, agents execute the failure management service by sending GET operations to retrieve node states. If the manager does not hear from a node, it declares the node faulty. A drawback of MANNA is that it can produce false diagnoses, because a node can be disconnected from the network for several reasons: it can be cut off from its cluster and unable to receive any GET message, a GET message can be lost because of environmental noise, and random distribution and limited transmission range can also cause disconnection. In addition, the scheme performs centralized diagnosis and requires an external manager.

WinMS [12] provides a centralized fault management approach. It uses a central manager with a global view of the network to continually analyse network states and execute corrective and preventive management actions according to management policies predefined by human managers. The central manager detects and localizes faults by analysing anomalies in sensor network models: it analyses the collected topology map and energy map information to detect faults and assess link quality. WinMS can reconfigure itself in case of failure without prior knowledge of the network topology. It also analyses the network state to detect and predict potential failures and acts accordingly.

Fault diagnosis

In this stage, detected faults are properly identified by the network system and distinguished from irrelevant or spurious alarms. The accuracy and correctness of fault detection have already been partly achieved by a number of fault detection methods. However, there is still no comprehensive description or model that distinguishes the various faults in WSNs and that can support the network system in accurate fault diagnosis or fault recovery [fault management in WSN]. Existing approaches address fault models from the point of view of the individual node (including malfunctions of node hardware components). In particular, both [13, 14] assume that the system software is already fault tolerant. Farinaz [9] describes two fault models. The first corresponds to sensors that produce binary outputs. The second is based on sensors with continuous (analog) or multilevel digital outputs. The work proposed in [15] considers only nodes that become faulty because of a harsh environment. Thus, there is a need for a generic fault model that is not based only on the individual node level but also considers network and management aspects.

Fault recovery

In this stage, the sensor network is reconfigured so that failed or faulty nodes have no further impact on network performance. Most existing approaches isolate faulty (or misbehaving) nodes directly at the network communication layer. For example, in [8], after the failure of a neighboring node, a new neighboring node is selected for routing.

WinMS [12] uses a proactive fault management approach: the central manager detects areas of weak network health by comparing the current node or network state with historical network information models (e.g. the energy map and the topology map). It then acts proactively by instructing nodes in such an area to send data less frequently in order to conserve node energy. In [17], when a gateway node dies, its cluster is dissolved and all its nodes are reallocated to other healthy gateways. This takes more time, as all the cluster members are involved in the recovery process. Farinaz [9] suggests a heterogeneous backup scheme for healing hardware malfunctions of a sensor node, on the basis that a single type of hardware resource can back up different types of resources. However, this solution is not directly relevant to fault recovery at the level of network system management [2]. Considering the complexity of fault management design and the constraints of a sensor node, we seek a localized hierarchical solution for updating and reconfiguring the management functionality of a sensor node.

In this section, we have highlighted various issues and problems in previously proposed fault management approaches for WSNs. The literature survey shows that these approaches suffer from the following problems:

Most existing fault management solutions mainly focus on failure detection, and there is still no comprehensive solution available for fault management in WSNs from the management architecture perspective.

The mechanisms proposed for fault recovery [9] are not directly relevant to fault recovery at the level of network system management, i.e. network connectivity, network coverage area, etc.

Failure recovery approaches are mainly application specific and focus on a small region or on individual sensor nodes, and are therefore not fully scalable.

Some management frameworks require the external human manager to monitor the network management functionalities.

Another important factor that needs to be considered is vulnerability to message loss. For example, in MANNA [3], if a cluster head does not hear from a cluster member, it declares that member faulty. However, a message can be lost in transmission for various reasons, which causes a correctly working node to be declared faulty.

We therefore contend that there is still a need for a new fault management scheme that addresses the problems of existing fault management approaches for wireless sensor networks. It must take into account a wide variety of sensor applications with diverse needs, different sources of faults and various network configurations. It is also important to consider other factors such as mobility, scalability and timeliness.

Fault Management and Fault recovery task in Wireless Sensor Network

The proposed fault management and fault recovery task is divided into two phases:

Fault detection and fault diagnosis

Fault recovery

Fault detection and fault diagnosis task

Fault detection for a sensor node can be achieved in two ways: self-detection and active detection. In self-detection, sensor nodes are required to monitor their own residual energy carefully and to recognize when a failure is about to happen. In our problem, we consider battery depletion to be the main cause of node death. A sensor node is termed failing when its residual energy drops to or below the threshold value (20% of battery life). When a node is failing because of energy depletion, it sends a message to its cluster head announcing that it is going to sleep because its energy is below the threshold. Self-detection is a local process of the sensor node and requires little network communication, which conserves node energy. This local process also reduces the delay with which management reacts to the prospective failure of a sensor node.

To detect node death efficiently, fault management also uses an active detection mode, in which the message that reports a node's remaining battery level is also used to confirm that the node still exists. In active detection, the cluster head regularly asks its cluster nodes to send their updates: the cluster head sends "get" messages to the connected cell members, and in return the sensor nodes send their updates. This is called the in-cluster update cycle. An update message consists of the node ID, energy and location. Update messages are thus exchanged between the cluster head and its cluster nodes. If the cluster head does not receive an update from a sensor node, it immediately sends a message to that node inquiring about its status. If the cluster head does not receive an acknowledgment within a certain time, it declares the node faulty and passes this information to the rest of the nodes in the cluster. A cluster head pays attention only to its own cluster members and informs the group manager for further assistance only if the network quality of its small area has become critical.
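One in-cluster update cycle can be sketched as follows. The `Update` record mirrors the three fields named in the text (node ID, energy, location); the `poll` and `probe` callbacks are hypothetical stand-ins for the radio layer, where `poll` returns a node's reply to the "get" message (or `None`) and `probe` reports whether the follow-up status inquiry was acknowledged in time.

```python
from dataclasses import dataclass

@dataclass
class Update:
    node_id: int
    energy: float      # residual energy as a fraction of battery life
    location: tuple    # (x, y) position of the node

def run_update_cycle(members, poll, probe):
    """Run one in-cluster update cycle from the cluster head's point of view.

    Returns the updates received and the list of nodes declared faulty.
    """
    updates, faulty = {}, []
    for node_id in members:
        update = poll(node_id)
        if update is not None:
            updates[node_id] = update
        elif not probe(node_id):
            # No update and no acknowledgment of the status inquiry.
            faulty.append(node_id)
    return updates, faulty

# Illustrative cycle: node 1 replies, node 2 only answers the inquiry,
# node 3 stays silent and is declared faulty.
replies = {1: Update(1, 0.8, (0, 0)), 2: None, 3: None}
acks = {2: True, 3: False}
updates, faulty = run_update_cycle(
    [1, 2, 3], replies.get, lambda node_id: acks.get(node_id, False))
```

The second chance given by the status inquiry is what separates this cycle from a plain timeout: a single lost update does not immediately mark a working node as faulty.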

Initially all sensor nodes have the same energy. A cluster head uses the self-detection approach and regularly monitors its residual energy. After many transmissions inside the cluster, node energy decreases. If a node's energy becomes less than or equal to 20% of battery life, the node is classed as a low-energy node and is likely to be put to sleep. If a node's energy is greater than or equal to 50% of battery life, it is classed as high and is a candidate for cluster head. Thus, if the residual energy of the cluster head becomes less than or equal to 20% of battery life, the cluster head raises an alarm, notifies its members and the group manager of its low energy status, and a new cluster head is formally chosen to replace it.
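The two energy thresholds translate directly into a small classification routine. The 20% and 50% values come from the text; the function names and the label "normal" for the band between the two thresholds are assumptions added for illustration.

```python
LOW_THRESHOLD = 0.20    # at or below this fraction a node is a low-energy node
HIGH_THRESHOLD = 0.50   # at or above this fraction a node may become cluster head

def classify_energy(fraction):
    """Classify residual energy given as a fraction of battery life."""
    if fraction <= LOW_THRESHOLD:
        return "low"
    if fraction >= HIGH_THRESHOLD:
        return "high"
    return "normal"     # assumed label; the text does not name this band

def cluster_head_candidates(energy):
    """Return the IDs of nodes with enough energy to serve as cluster head."""
    return sorted(node_id for node_id, fraction in energy.items()
                  if classify_energy(fraction) == "high")

energy = {1: 0.60, 2: 0.20, 3: 0.45, 4: 0.50}
candidates = cluster_head_candidates(energy)
```

Here nodes 1 and 4 qualify as candidates, node 2 is a low-energy node that is likely to be put to sleep, and node 3 stays an ordinary member.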

Every cluster head also sends its health status information to its group manager. If the group manager does not hear from a specific cluster head during an update cycle, it sends a message to that cluster head requesting an acknowledgment and inquiring about its status. If the group manager again does not hear from the same cluster head during the next update cycle, it declares the cluster manager faulty and informs its cluster members. This technique is used to detect the death of a cluster manager.

The group manager also monitors its own status continuously and responds when its energy falls below the threshold. It notifies its cluster members and the neighboring group managers of its energy status and suggests appointing a new group manager. The sudden death of a group manager can be detected by the base station. If the base station does not receive any transmission from a particular group manager, it queries that group manager and requests its current status. If the base station does not get any status, it declares the group manager faulty (dead) and sends this information to the group's cluster managers. The primary task of the base station is thus to watch the group managers and detect their sudden death. Most of the self-detection and active detection work in the network is done by the cluster nodes and cluster heads.
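The same two-step rule is used at both upper levels of the hierarchy: a first silent cycle triggers a status inquiry, and a second one leads to a faulty declaration. A minimal sketch, with a hypothetical class name and node labels, might look like this:

```python
class MissedCycleSupervisor:
    """Tracks silent update cycles for the managers one level below.

    It can play the role of a group manager watching cluster heads or of
    the base station watching group managers.
    """

    def __init__(self, supervised):
        self.misses = {node: 0 for node in supervised}

    def end_of_cycle(self, heard_from):
        """Return (nodes to query, nodes declared faulty) for this cycle."""
        to_query, faulty = [], []
        for node in list(self.misses):
            if node in heard_from:
                self.misses[node] = 0       # node is alive, reset the count
            elif self.misses[node] == 0:
                self.misses[node] = 1       # first silence: inquire
                to_query.append(node)
            else:
                faulty.append(node)         # second silence: declare faulty
                del self.misses[node]
        return to_query, faulty

group_manager = MissedCycleSupervisor(["ch1", "ch2"])
first = group_manager.end_of_cycle({"ch1"})    # ch2 silent once
second = group_manager.end_of_cycle({"ch1"})   # ch2 silent again
```

After the first cycle the supervisor asks `ch2` for its status; after the second it declares `ch2` faulty, at which point the recovery procedure can start.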

Fault recovery

Fault recovery takes place once fault detection (self-detection or active detection) has produced a result. Sleeping nodes can be woken up to maintain the node density of the cluster. After many transmissions have taken place, when most cluster nodes are at a similar energy level and most of them are close to 20%, the cluster manager appoints a secondary cluster head within its cluster to act as a backup cluster manager. The cluster head and the secondary head are both known to their cluster nodes. If the energy of the cluster manager drops below the threshold value, it sends a message to the cluster members, including the secondary cluster head. It also informs the group manager of its energy status and of the coming takeover by the secondary cluster head. The secondary cluster head then becomes the cluster head, and the former cluster head becomes an ordinary cluster node and enters a low computational mode. The common nodes then treat the secondary cluster head as their new cluster manager, and the new cluster manager sends messages to the cluster members to receive their updates. Failure recovery is thus performed locally by each cluster.

In Figure 1, assume that the cluster manager is about to fail because of energy depletion and that node 4 has been selected as secondary cluster head. The cluster head sends a message to sensor nodes 1, 2, 3 and 4, which starts the recovery by invoking node 4 to take over as the new cluster manager.

Consider a scenario in which the battery energy of a particular cluster manager is not enough to support its management role and the secondary cluster manager does not have sufficient energy to replace it. In that case, the cluster nodes exchange energy messages within the cluster. If no candidate node has enough energy to replace the cluster manager, the cluster manager sends a request message to the group manager to merge the remaining nodes with neighboring clusters.
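The recovery decision for a failing cluster manager can be summarized in a short sketch. It assumes that "sufficient energy" means at least 50% of battery life, matching the cluster head candidacy rule given earlier, and that the candidate with the most residual energy is preferred; both rules, as well as the function and action names, are illustrative assumptions.

```python
def plan_recovery(head, secondary, energy, min_energy=0.50):
    """Choose a recovery action for a cluster whose manager is failing.

    energy maps node IDs to residual energy as a fraction of battery life.
    """
    # Preferred path: the appointed secondary cluster head takes over.
    if secondary is not None and energy.get(secondary, 0.0) >= min_energy:
        return ("promote", secondary)
    # Otherwise look for another member with enough energy.
    candidates = [node for node, fraction in energy.items()
                  if node != head and fraction >= min_energy]
    if candidates:
        return ("promote", max(candidates, key=lambda node: energy[node]))
    # No suitable node: ask the group manager to merge the cluster.
    return ("merge_with_neighbours", None)

# Cluster manager 0 is failing; node 4 is the secondary cluster head.
takeover = plan_recovery(0, 4, {0: 0.18, 1: 0.40, 2: 0.35, 3: 0.30, 4: 0.60})
fallback = plan_recovery(0, 4, {0: 0.18, 1: 0.55, 2: 0.35, 3: 0.30, 4: 0.30})
merge = plan_recovery(0, 4, {0: 0.18, 1: 0.40, 2: 0.35, 3: 0.30, 4: 0.30})
```

The first call promotes node 4, the second falls back to node 1 because the secondary is too weak, and the third finds no candidate and requests a merge with neighboring clusters.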

Performance Evaluation (expected result analytical solution)

As we do not yet have simulation results, we compare our algorithm analytically with some existing solutions. Our solution can be considered a special case of clustering, but one that is more systematic, robust and scalable. Clustering has been used to address various issues, e.g. routing, energy efficiency, management and large-scale control, and clusters can be formed in several ways. In general, nodes form a cluster in two steps: (a) a header is selected among the nodes through an election algorithm, randomized election, degree of connectivity or pre-definition, and (b) the headers and the nodes interact to form a group or cluster [18]. Cluster heads are responsible for coordinating the nodes in their clusters and are generally more resourceful than their cluster members. Cluster heads are the points through which data passes to the group manager, so their failure may cause several problems: if a cluster head fails, no messages from its cluster are forwarded to the base station, and selecting a new cluster head consumes energy.

Our proposed design also divides the network into small clusters, each consisting of a group of nodes. It is a homogeneous network in which all nodes have equal resources.

We also choose a cluster manager for each cluster to carry out management tasks in coordination with a gateway node, but a cluster manager can easily be replaced by any other common node if it stops operating. The gateway node is responsible for routing information to other clusters. If a cluster manager becomes faulty, the gateway node (secondary cluster head) acts as both cluster manager and gateway node until it appoints a new cluster manager. Likewise, if a gateway node stops operating, the cluster manager takes its place. The network therefore operates normally even in the presence of a fault. Our proposed solution is a distributed approach and scales easily as the network grows. Unlike centralized solutions, it has no central manager, and all management tasks are performed locally and in a distributed way.

Energy efficiency is an important research challenge in realizing the vision of a self-organized wireless sensor network. Our goal is to address this challenge by employing a load-balancing strategy so that all nodes operate together for as long as possible. We assume that all nodes in the network have equal resources and that no node is more resourceful than any other.

The main idea is that a node should not consume much energy or memory. The design is based on coordination through which clusters are built automatically, with no need to exchange many messages to build them. Cluster members remain in the same cluster regardless of which node is the cluster manager. Our architecture divides the whole network into a grid and enables the network to perform local detection and to distribute the management tasks across the network. This approach helps sensor nodes take on more management responsibility and decision-making in order to realize the vision of self-managed and self-recovering WSNs. It also increases network lifetime.

Conclusion

Wireless sensor networks are highly dynamic, prone to faults and typically deployed in remote and harsh environments. As a result, faulty behaviour and node or network failures are not infrequent in sensor networks. In this paper we propose a fault management and fault recovery scheme for wireless sensor networks that diagnoses failures and recovers the network from faults. The proposed fault management system is designed to be energy efficient and to respect the network topology. Depending on their role assignment, sensor nodes execute the appropriate functions to complete their fault management tasks. Most existing solutions use some kind of central entity to perform fault management, whereas our goal is to perform fault detection locally and in a distributed way. We expect the proposed algorithm to achieve failure detection and recovery faster than existing algorithms and to consume significantly less energy, although this remains to be confirmed by simulation.