The grid is set to become the tool for intensive computing and storage, just as the internet became the tool of mass communication. As more and more grid-related projects emerge, the need grows for automating the management of system components. Self-healing and self-configuring are among the capabilities that constitute this automation, and they require complex fault identification and management systems. A fault not only affects the concerned component; it propagates its effect throughout the system.
Fault management in networks can be done through monitoring and setting up traps in the network. These steps must strike a balance between root cause identification and the overhead induced by detection. Without automation, fault management is complex in distributed systems such as the grid, where intensive monitoring is required: framing rules manually from the enormous number of events that occur is not feasible. Therefore, event correlation and filtering should be applied to uncover the source of a fault. One such event in networks is node failure, which is the root cause of many faults. Failures in networks are unavoidable, but early detection and classification of their source is essential, as this ensures robustness in the system. Various faults may occur in the grid owing to its inherent heterogeneity and distribution. The taxonomy of grid faults [2] covers all the resources that directly or indirectly affect its working. This taxonomy, based on location, originator, time, duration and caused behavior, provides a good view of fault-related events. The work in this paper mainly deals with node failure detection among the network faults in the worker node, cluster head and network.
Agents are the monitoring components in the proposed architecture for node failure detection. Once the network changes its state, the node that identifies the change disseminates the new state through agents, and the fault management system consumes the state change for its event correlation function. The agents are implemented at the application level and can therefore cover all activities needed to make the fault tolerance process efficient and manageable. This also gives the freedom to select any node in the network to handle fault tolerance at any stage of the procedure, making the system flexible to manage, and it lets schedulers and meta-schedulers monitor and study the network separately before scheduling a job. The Grid Information Service (GIS) can also use the agent system to query the resources of the clusters.
When a fault causes an error, the management must either restart the job at a new location or select a suitable redundant system for job completion, and each of these two techniques incurs a switch delay. Conventional approaches need complex interactions to make the transition happen, increasing the switch delay, whereas an agent-based system handling the fault transfers the necessary information to the new node through data agents, reducing the switch delay involved. Defining separate strategies for each combination of faults is a tedious and difficult job; instead, faults should be classified, handled and corrected. Hence node failure, which is one of the root causes of many faults, should be handled in an extensible and flexible manner.
II. Related Work:
The membership management technique and heartbeat messaging discussed in [1] form the base for the detection process. [1] presents strategies for partitioning the grid and designating backup nodes, and proposes a membership management algorithm suited to the dynamic nature of the grid. The grid is viewed as a group of networks, each headed by a node (whose position may be taken over by a backup) together with its member nodes. This view helps to identify all types of failures, such as a single node failure or a group failure. Research on fault-tolerant grid and high performance computing has traditionally focused on recovery strategies implemented principally through checkpointing and job migration applied to local networks of computing hosts under centralized management [11]. While checkpointing and job migration techniques should be part of any fault-tolerant grid management framework, more emphasis needs to be given to monitoring, detection, and prediction of faults so as to include preventive approaches in addressing the challenges of the grid computing environment. This is particularly important since preventive maintenance measures would diminish the need for frequent checkpointing and complex recovery procedures, which may involve rescheduling jobs on different execution environments [7]. A fault-tolerance framework that provides the necessary models to manage the local faulty behavior associated with the operation of hosted services is proposed in [4]. The framework includes a mechanism for quantifying the fault vulnerability of grid nodes and their hosted services; the resulting measures are globally disseminated to enable the synthesis of decentralized fault-tolerant decision making strategies. Efforts need to concentrate on improving fault tolerance by enhancing the ability of fault detection and fault prediction. Good fault tolerance mechanisms can ensure the reliable execution of grid workflow applications, which is a critical issue in grid computing. Fault detection is the starting point of any fault tolerance: early fault prediction and detection enhances reliability and eliminates a considerable amount of the overhead of belated fault handling [8, 9]. Artificial intelligence should be brought into fault management systems, as the rules need to be made dynamic; human-assisted computer discovery techniques and computer-assisted human discovery techniques are discussed in [26]. In this paper we propose a model that brings transparency into the fault management system. The working nodes are managed through their proxy agents running in the cluster head. This makes the system a failure detector with minimal false findings and an earlier consensus time compared with the traditional approach of managing the working nodes directly. The vital modification from the architecture discussed in [1] is the introduction of proxy agents for the working members at the cluster head site. The waiting time defined inside the proxy reflects network parameters such as the round trip time and delay of the link between the cluster head and the respective working member.
III. Architecture and Design:
The architecture of the participating nodes and the agent environment is shown in Fig 1, including the agent system surrounding the component nodes of each sub-cluster. The head agents of the sub-clusters interact to form the virtual cluster. Since the nodes in the grid are distinguished as head, backup and member, the functionality of an agent depends on the type of node on which it runs. The proxy agent binds a working member to the head. The component agents are:
3.1 Head Agent:
It is a stationary agent working on the head node and functions as the controlling agent for the network. It communicates with the member nodes through the member representative agents present within the head node, using application-level broadcasting that tells the representative agents to carry out the job specified in the message. The reply returning from a member node is handed over by its representative agent to the head agent. The head agent also messages the back-up node to ensure its presence. Head agents from different networks form a group and represent their own sub-cluster by messaging this group. The head agent works with two timer agents (timer1 and timer3), discussed below; their interplay with the head agent is the significant factor in managing the delay in node failure detection. The timers should be based on the individual paths or routes in the network so that the timings are updated dynamically.
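Since the head agent's heartbeat cycle is driven by the timer1 agent, a minimal plain-Java sketch may help fix the idea. All names below (HeadAgent, ProxyAgent, pollMember) are illustrative assumptions, not the paper's Aglets implementation:

```java
import java.util.List;
import java.util.Timer;
import java.util.TimerTask;

// Stand-in for the member-representative agents co-located on the head node
// (illustrative interface, not the Aglets API).
interface ProxyAgent {
    void pollMember();  // forward the heartbeat to the remote member agent
}

// Minimal sketch of the head agent's heartbeat cycle: the timer1 agent fires
// every T1 milliseconds and triggers an application-level broadcast to the
// local proxy agents.
public class HeadAgent {

    private final List<ProxyAgent> proxies;
    private final long t1Ms;  // timer1 agent value (T1)

    public HeadAgent(List<ProxyAgent> proxies, long t1Ms) {
        this.proxies = proxies;
        this.t1Ms = t1Ms;
    }

    public void startHeartbeat() {
        new Timer(true).scheduleAtFixedRate(new TimerTask() {
            @Override
            public void run() {
                // "Broadcast": every proxy polls its own member in turn.
                for (ProxyAgent p : proxies) {
                    p.pollMember();
                }
            }
        }, 0, t1Ms);
    }
}
```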
3.2 Back-up Agent:
It is a stationary agent working on the back-up node, used mainly to substitute for the head node when the head fails. All members are made to contact the back-up agent when the current head fails, which allows failure detection to continue smoothly. After a failover, the new head agent creates a new back-up agent from among the member agents, either randomly or through some selection algorithm. Communication between the head and the back-up agent is most efficient when the back-up is selected based on the nearest neighbor or the strongest link. The back-up node relies on a timer agent (timer2), based on the link between the two agents, to conclude that the head node has failed.
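The back-up side can be pictured as a watchdog on timer2. The sketch below is again an assumption-laden illustration (BackupWatchdog and promoteToHead are invented names), not the actual agent code:

```java
import java.util.Timer;
import java.util.TimerTask;

// Hypothetical sketch of the back-up node's watchdog (timer2). If no message
// from the head arrives within T2 (sized from the head<->back-up link), the
// back-up concludes the head has failed and promotes itself.
public class BackupWatchdog {

    private volatile long lastHeadContactMs = System.currentTimeMillis();
    private final long t2Ms;               // timer2 value for the head<->back-up link
    private final Runnable promoteToHead;  // failover action supplied by the caller

    public BackupWatchdog(long t2Ms, Runnable promoteToHead) {
        this.t2Ms = t2Ms;
        this.promoteToHead = promoteToHead;
    }

    // Called whenever any message from the head agent is received.
    public void onHeadMessage() {
        lastHeadContactMs = System.currentTimeMillis();
    }

    public void start() {
        new Timer(true).scheduleAtFixedRate(new TimerTask() {
            @Override
            public void run() {
                if (System.currentTimeMillis() - lastHeadContactMs > t2Ms) {
                    cancel();             // stop the watchdog before failing over
                    promoteToHead.run();  // back-up agent takes over as head
                }
            }
        }, t2Ms, t2Ms);
    }
}
```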
3.3 Member Agent:
It is a stationary agent working at each member node. Its functionality is kept low: it merely replies with its presence whenever the head asks for it through the member representative present in the head node. The duties are kept minimal so that more resources can be allocated to the job rather than being spent on fault tolerance activity.
3.4 Proxy or Member Representative Agent:
It is a mobile agent dispatched by the member agent from the respective member node to the head node, where it acts as a proxy for the member agent. When a new head node is instated, the old representative agents become inactive, and new representative agents are created and sent to the new head node by the respective member agents. Each member representative agent waits for the reply from its member agent for a time based on the link between the head and that member. Thus the head node communicates with the member nodes according to their individual link properties, which would be expensive without proxy agents at the application level, and this per-link waiting helps to reduce false positive and false negative findings about each member node.
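The per-link waiting behaviour of the proxy can be sketched as follows; MemberProxy, awaitReply and the multiplier L applied to the measured round trip time are illustrative assumptions, chosen to be consistent with the timer study in Section V:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative sketch of a member-representative (proxy) agent residing on the
// head node. Its wait time is derived from the measured round trip time of the
// link to its own member, so every member is judged against its own link
// rather than a single global timeout.
public class MemberProxy {

    private final String memberId;
    private final double linkRttMs;    // measured head<->member round trip time
    private final double multiplierL;  // constant L, cf. eq. (6), e.g. 1.5 to 2

    public MemberProxy(String memberId, double linkRttMs, double multiplierL) {
        this.memberId = memberId;
        this.linkRttMs = linkRttMs;
        this.multiplierL = multiplierL;
    }

    // Wait time for this member's heartbeat reply: L times the link RTT.
    public long waitTimeMs() {
        return (long) Math.ceil(multiplierL * linkRttMs);
    }

    // Wait for the member's reply; on timeout, report the suspect to the head.
    public boolean awaitReply(Future<Boolean> reply) {
        try {
            return reply.get(waitTimeMs(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            System.out.println(memberId + ": no reply within " + waitTimeMs() + " ms");
            return false;  // the head agent will now dispatch a ping agent
        }
    }
}
```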
3.5 Timer1 Agent:
It is a stationary control agent present in the head node, created by the head agent, which keeps the time between the regular heartbeat messages exchanged between the head and the members. The timer alerts the head agent to broadcast the heartbeat message to the proxy agents present locally. It controls the intra-cluster communication.
3.6 Timer2 Agent:
It is a stationary control agent in the back-up node, created by the back-up agent and based on the link between the head and back-up nodes. Efficient back-up selection helps to lower this timer value, which in turn decreases the failure detection time. It is used to detect head failure.
3.7 Timer3 Agent:
It is a stationary control agent present in the head node, created once for the entire system and cloned for each sub-cluster network. The head nodes of the sub-clusters communicate with each other based on this timer. It controls inter-cluster communication.
3.8 Ping Agent:
This mobile agent knows the reliable path to each working member and is updated whenever a new node joins. Whenever a waiting member representative agent receives no reply from its member agent, it reports the abnormality to the head agent. The head then creates the ping agent to check on the member, and the ping agent travels along the original reliable path. If the node is unreachable, it is declared failed.
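A hedged sketch of the ping agent's confirmation step follows; PathTable, probe and the other names are invented for illustration:

```java
import java.util.List;
import java.util.function.Predicate;

// Sketch of the ping agent's confirmation step (all names illustrative).
// When a proxy reports a missing reply, the head creates a ping agent that
// follows the stored reliable path; only if that check also fails is the
// member declared failed, filtering out transient false alarms.
public class PingAgent {

    interface PathTable {
        List<String> reliablePathTo(String memberId);  // updated on node joins
    }

    private final PathTable paths;
    private final Predicate<String> probe;  // liveness check for one hop

    public PingAgent(PathTable paths, Predicate<String> probe) {
        this.paths = paths;
        this.probe = probe;
    }

    public boolean isFailed(String memberId) {
        for (String hop : paths.reliablePathTo(memberId)) {
            if (!probe.test(hop)) {
                return true;   // unreachable along the reliable path: node failed
            }
        }
        return false;          // reachable: the missed heartbeat was a false positive
    }
}
```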
3.9 Agent Repository:
This represents the database for the agents. It can be made available to all the nodes by distributing the agent code to every node, on the premise that each node may take on any role. This reduces the delay in switching a node to a different role, since the code is available locally. Alternatively, a data agent can be used to transfer the code to the node that is about to take a different role, which avoids exhaustive changes to every copy of the repository when a particular functionality changes.
IV. Implementation:
The agent environment is created with IBM Aglets, and the message passing interface they provide is used for inter-agent communication. The sequence diagram in Fig 2 depicts the working of the agent system in detecting a node failure. Initially, when a working node joins the cluster, a proxy agent representing the member is sent to the cluster head. The timer1 agent induces the head agent to broadcast a message, which reaches each member through its proxy. The proxy waits for the time defined in it; if there is no reply from the member agent, the head agent creates the ping agent, which routes to the working node through the reliable path and gets its status.
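Tying the pieces together, the standalone sketch below simulates the Fig 2 control flow with assumed values (node names, RTT, L and the path are all hypothetical); it is plain Java mimicking the sequence, not the Aglets implementation:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Standalone simulation of the detection sequence: the proxy waits L * RTT
// for the member's heartbeat reply; on timeout the head sends a ping along
// the reliable path before declaring the node failed.
public class DetectionDemo {

    public static void main(String[] args) throws Exception {
        double rttMs = 40.0, multiplierL = 1.5;            // assumed link RTT and L
        long waitMs = (long) Math.ceil(multiplierL * rttMs);

        // Heartbeat reply that never completes: the member has silently failed.
        CompletableFuture<Boolean> reply = new CompletableFuture<>();

        boolean alive;
        try {
            alive = reply.get(waitMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            alive = false;                                 // proxy reports the suspect
        }

        if (!alive) {
            // Ping agent step: probe each hop of the stored reliable path.
            List<String> reliablePath = List.of("head", "switch-3", "node-17");
            boolean failed = reliablePath.stream().anyMatch(hop -> !probe(hop));
            System.out.println("node-17 declared failed: " + failed);
        }
    }

    // Placeholder liveness probe: in this simulation every hop except the
    // failed member responds.
    private static boolean probe(String hop) {
        return !hop.equals("node-17");
    }
}
```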
V. Results and Discussion:
Let a node $i$ fail just after successfully replying to the head. Let the total nodes be divided into $m$ groups with an average of $n$ working nodes each, and let $T_1$, $T_2$ and $T_3$ be the timer1, timer2 and timer3 agent values respectively. Assume that message processing at each node takes constant time, and that the ping agent takes constant time for routing since it searches only within the sub-cluster. The consensus time $C$ required for the failure detection can then be derived as follows.

The head agent comes to know of the failure at

$$C_H = T_1 + R_i + T_p \qquad (1)$$

where $R_i$ is the round trip time between the head and the failed node and $T_p$ is the time taken by the ping agent to test the status of the node. The other members come to know at

$$C_M = C_H + \bar{R}_j \qquad (2)$$

where $\bar{R}_j$ is the average round trip time between the head and the $j$th member node. The back-up agent comes to know at

$$C_B = C_H + \bar{R}_b \qquad (3)$$

where $\bar{R}_b$ is the average round trip time between the head and back-up nodes. The other group heads come to know at

$$C_G = C_H + T_3 + (m-1)\,\bar{R}_h \qquad (4)$$

where $\bar{R}_h$ is the average round trip time between head nodes. The overall consensus time is

$$C = \max(C_M,\ C_B,\ C_G) \qquad (5)$$

Equation (5) is linear in $m$, i.e. of order $O(m)$. Thus the consensus time is linear, which is verified from the graph shown in Fig 3. The graph also signifies the importance of the proxy agent, which reduces the consensus time.
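To make the linearity concrete, consider a purely illustrative set of values (assumed here, not measured in this work): $T_1 = 60$ ms, $R_i = 40$ ms, $T_p = 20$ ms, $T_3 = 100$ ms, $\bar{R}_h = 50$ ms and $m = 4$. Then

$$C_H = 60 + 40 + 20 = 120 \text{ ms}, \qquad C = C_G = 120 + 100 + 3 \times 50 = 370 \text{ ms},$$

since $C_G$ dominates the maximum in (5). Doubling $m$ to 8 raises $C$ only to 570 ms, reflecting the $O(m)$ growth.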
[Fig 3. Comparison of consensus time with and without the proxy agent, for m (number of groups) × n (average members per group) = 2×2, 2×3, 2×4, 3×2, 3×3, 3×4.]
A study for suitable timer values is shown in Fig 4. Each timer is set according to

$$T = L \times \bar{R} \qquad (6)$$

where $\bar{R}$ is the round trip time of the corresponding link and $L$ is a constant multiplier. The study shows that an $L$ value between 1.5 and 2 reduces false findings, which ensures reliability.
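For instance, with an assumed link round trip time of $\bar{R} = 40$ ms, $L = 1.5$ yields a timer of $1.5 \times 40 = 60$ ms; a smaller $L$ risks declaring a member failed on normal jitter (a false positive), while a larger $L$ simply delays detection.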
[Fig 4. False findings vs. timer values for different groups.]
VI. Conclusion:

Thus, the proxy agents bring about earlier detection of node failure, which spares the event correlation engine an extensive analysis of events, since the node failure is made explicit. The scalability and the network bandwidth usage remain unaffected by the use of the multi-agents. The minimal false findings of the proxy agent system give a clear picture of the reliability of the agent system that provides the detection of failures and faults. Amid all these advantages, a location transparency exists which abstracts the individual clusters as the grid.

VII. Future Work:

Artificial intelligence can be brought into the system, making the agents more intelligent. A replication-based fault resistant system can be built through the agents: replicas are easily made by cloning the agents and can be made to carry out the functions in parallel. The mobile grid coming up in the future will benefit greatly from the concept of proxy agents. The location transparency of the system keeps the application layer unaware of a working member's location, which will help in building grid portals for submitting jobs over the internet. Head agents, as the controllers of their sub-clusters, can monitor them under security policies. Rules for fault management regarding inter-cluster interactions can be implemented at the intelligent head agents, with intra-cluster interactions controlled by the member agents.
References
[1] Amit Jain and R.K. Shyamasundar, Failure Detection and Membership Management in Grid Environments, Proceedings of the Fifth IEEE/ACM International Workshop on Grid Computing (GRID'04).
[2] Jurgen Hofer and Thomas Fahringer, A Multi-Perspective Taxonomy for Systematic Classification of Grid Faults, Proceedings of the 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing, 2008.
[3] Youcef Derbal, A new fault-tolerance framework for grid computing, Multiagent and Grid Systems - An International Journal 2 (2006), pp. 115-133.
[4] Abdul Waheed, Warren Smith, Jude George and Jerry Yan, An Infrastructure for Monitoring and Management in Computational Grids, Computer Sciences Corp., NASA Ames Research Center, Moffett Field, CA 94035-1000.
[5] Chi-Shih Chao, Don-Ling Yang and An-Chi Liu, A LAN Fault Diagnosis System, Computer Communications 24 (2001), pp. 1439-1451.
[6] Edidiong Uyai Ekaette and Behrouz Homayoun Far, A Framework for Distributed Fault Management Using Intelligent Software Agents, CCECE 2003 - CCGEI 2003, Montréal, May 2003.
[7] G. Wrzesinska, R.V. Van Nieuwpoort, J. Maassen and H.E. Bal, Fault-tolerance, malleability and migration for divide-and-conquer applications on the grid, in: Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium, IEEE Computer Society, Piscataway, NJ, 2005, 13a.
[8] Hwa Min Lee, Sung Ho Chin, Jong Hyuk Lee, Dae Won Lee, Kwang Sik Chung, Soon Young Jung and Heon Chang Yu, A Resource Manager for Optimal Resource Selection and Fault Tolerance Service in Grids, Proceedings of the IEEE International Symposium on Cluster Computing and the Grid, 2004.
[9] Nitin B. Gorde and Sanjeev K. Aggarwal, A Fault Tolerance Scheme for Hierarchical Dynamic Schedulers in Grids, Proceedings of the IEEE International Conference on Parallel Processing, 2008.
[10] Raissa Medeiros, Walfredo Cirne, Francisco Brasileiro and Jacques Sauvé, Faults in Grids: Why are they so bad and What can be done about it?, Proceedings of the IEEE Fourth International Workshop on Grid Computing (GRID'03).
[11] L. Wang, K. Pattabiraman, Z. Kalbarczyk, R.K. Iyer, L. Votta, C. Vick and A. Wood, Modeling coordinated checkpointing for large-scale supercomputers, in: Proceedings of the International Conference on Dependable Systems and Networks, IEEE Computer Society, Piscataway, NJ, 2005, pp. 812-821.
[12] Christopher Dabrowski, Reliability in Grid Computing Systems, http://www.ogf.org/OGF_Special_Issue/GridReliabilityDabrowski.pdf.
[13] Rubing Duan, Radu Prodan and Thomas Fahringer, Data Mining-based Fault Prediction and Detection on the Grid, IEEE, 2007.
[14] R.C. Scalzo and H. Roth, A Meta-Model for Fault Management, Proceedings of the Conference on Object-Oriented Real-Time Dependable Systems, 1994, pp. 135-144.
[15] J.S. Baras, M. Ball, S. Gupta, P. Viswanathan and P. Shah, Automated Fault Management, IEEE, 1997, pp. 1244-1250.
[16] J. Burgess and Guillermo, Raising Network Fault Management Intelligence, Proceedings of the Network Operations and Management Symposium, IEEE/IFIP, 2000, pp. 861-874.
[17] Sterritt, Marshall, Shapcott and C.M. McClean, Exploring dynamic Bayesian belief networks for intelligent fault management systems, Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 2000, pp. 3646-3652.
[18] M.H. Guiagoussou, R. Boutaba and M. Kadoch, A Java API for Advanced Fault Management, Proceedings of the IEEE/IFIP International Symposium on Integrated Network Management, 2001, pp. 483-498.
[19] Duarte and A.L. dos Santos, Network Fault Management Based on SNMP Agent Groups, Proceedings of the International Conference on Distributed Computing Systems Workshop, 2001, pp. 51-56.
[20] Li-Der Chou and Chi-Chia Kao, Network Fault Management Systems Using Multiple Mobile Agents for Multihomed Networks, Proceedings of the IEEE international conference on Grid systems, 2003, pp. 620-624.
[21] Yanxiang He, Weidong Wen, Hui Jin and Haowen Liu, Agent-based Mobile Service Discovery in Grid Computing, Proceedings of the Fifth International Conference on Computer and Information Technology, 2005, pp. 351-355.
[22] R. Hao, D. Lee, J. Ma and J. Yang, Fault management for networks with link state routing protocols, Proceedings of the Network Operations and Management Symposium, IEEE/IFIP, 2004, pp. 103-116.
[23] Jiann-Liang Chen and Pei-Hwa Huang, A Fuzzy Expert System for Network Fault Management, Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, 1996, pp. 328-331.
[24] Manoj Kumar Kona and Cheng-Zhong Xu, A Framework for Network Management using Mobile Agents, Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS 2004).
[25] A. De Paola et al., Rule based Reasoning for Network Management, Proceedings of the Seventh International Workshop on Grid systems, IEEE, 2005.
[26] Roy Sterritt and David W. Bustard, Practical Intelligent Support for Rule Discovery in Fault Management Systems, Cybernetics and Systems 33(6), pp. 579-601, 2002.