Abstract - A fault-tolerant dynamic scheduler for grid applications that uses job replication and combines the advantages of static scheduling, namely no overhead in the fault-free case and low overhead in the case of a fault. The dependency structure of a parallel program can be described by a directed acyclic graph (DAG). Scheduling a job graph consists of ordering the jobs and mapping them onto processing units. Faults can happen at any point in time. A fault is detected when a processing unit does not send data at a prearranged time. Fault tolerance is provided as follows: when it is detected that a processing unit has stopped during the execution of a job, the state of this job is recovered from a checkpoint, the job is re-executed on another processing unit, and all jobs that were (potentially) to be executed on the failed processing unit are reassigned to other processors. Co-allocation involves assigning jobs to favorable resources. This is achieved by obtaining information from the various nodes of the grid, updating it whenever a change occurs, and assigning resources to jobs based on their requirements and performance.
Index Terms - Grid computing, dynamic scheduling, checkpointing, job migration, fault tolerance.
INTRODUCTION
Grid environments are dynamic in nature. The resources in a grid are heterogeneous and are not under a central control. Grid computing organizes geographically distributed computing resources into a single system in order to provide performance beyond that typically available at an individual site. Any job submitted for execution in such an environment must cope with failing components. As in [1, 2], the sites involved are not under a central authority but belong to different organizations. However, achieving fault tolerance in grids consisting of several layers of software and hardware is a grand challenge. Most user-level applications and even middleware services assume that lower-level services are reliable. To ensure good grid performance, fault tolerance should be taken into account. A single node failure can halt the execution of the entire computation. Using dedicated checkpoint servers saturates the network links of a few machines, resulting in degraded performance. In this article, we focus exclusively on reducing the fault overhead by checkpointing on shared storage.
We consider several existing checkpoint storage solutions and demonstrate the low overhead of our proposed replication-based approach. As in [3, 4], one well-researched method is known as incremental checkpointing: it reduces the data stored during checkpointing to only those blocks of memory modified since the last checkpoint. In [5], the performance of the grid depends strongly on the effectiveness of the schedule. A schedule is an assignment of the jobs of a task to a set of resources. Every task consists of a set of jobs, and each job has to be executed by one of the grid resources for a certain time. The latter limits the re-execution to only the failed processor and allows distributing the failed work among the other processors. In agreement with the crash fault model, failures of processors are assumed to be independent of each other [6].
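As a minimal illustration of incremental checkpointing (not the storage solution used in our experiments), the following Python sketch writes only those memory blocks whose content hash has changed since the previous checkpoint; the block size, hashing scheme and file layout are assumptions made purely for clarity.

    import hashlib, os

    BLOCK_SIZE = 4096  # assumed block granularity

    def incremental_checkpoint(memory: bytes, prev_hashes: dict, ckpt_dir: str) -> dict:
        """Write only the blocks that changed since the last checkpoint."""
        os.makedirs(ckpt_dir, exist_ok=True)
        new_hashes = {}
        for i in range(0, len(memory), BLOCK_SIZE):
            block = memory[i:i + BLOCK_SIZE]
            digest = hashlib.sha1(block).hexdigest()
            new_hashes[i] = digest
            if prev_hashes.get(i) != digest:  # block is dirty, persist it
                with open(os.path.join(ckpt_dir, f"block_{i}.ckpt"), "wb") as f:
                    f.write(block)
        return new_hashes  # becomes prev_hashes for the next checkpoint

Only the dirty blocks are written on each invocation, which is what keeps the per-checkpoint cost proportional to the amount of modified state rather than to the full memory image.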
While dependent failures of multiple homogeneous processors may occur as a result of an attack, we want to cover failures caused by faults, not by attacks. A grid application or job consists of a set of tasks, each task being a sequential program. If one task requires data that have to be computed by another task first, there exists a data dependency between these tasks, preventing their parallel execution. The dependency structure of a parallel program can be described by a directed acyclic graph (DAG), also called a job graph. Scheduling a job graph consists of ordering the jobs and mapping them to processing units. Scheduling can happen prior to execution, assuming a stable state of the execution environment (static scheduling), or can be done during execution (dynamic scheduling). Fault tolerance can be provided as follows: when it is detected that a processing unit has stopped during the execution of a job, the state of this job is recovered from a checkpoint. The advantage of the Grid is that in case of a failure an application may be migrated and restarted from a checkpoint file on a different site.
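To make the job-graph notion concrete, the following sketch represents a DAG as a list of data dependencies and derives one valid execution order by topological sorting; the edge set used in the example mirrors the small job graph discussed later in Fig. 2.1, but is otherwise illustrative.

    from collections import defaultdict, deque

    def topological_order(edges):
        """Return one valid execution order for a job graph given as (pred, succ) pairs."""
        succs = defaultdict(list)
        indegree = defaultdict(int)
        nodes = set()
        for pred, succ in edges:
            succs[pred].append(succ)
            indegree[succ] += 1
            nodes.update((pred, succ))
        ready = deque(n for n in nodes if indegree[n] == 0)
        order = []
        while ready:
            n = ready.popleft()
            order.append(n)
            for m in succs[n]:
                indegree[m] -= 1
                if indegree[m] == 0:
                    ready.append(m)
        return order

    # Example: tasks 3 and 4 both depend on task 1; 0 and 5 are artificial source/sink nodes.
    print(topological_order([(0, 1), (1, 3), (1, 4), (3, 5), (4, 5)]))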
The remainder of this article is organized as follows. In Section 2, we review related work on task scheduling and fault tolerance of grid applications. In Section 3, we present our checkpoint mechanism in grid systems and the preliminary results of our experimental evaluation. In Section 4, we illustrate variants of our approach. In Section 5, we summarize and conclude. Finally, in Section 6, we present our future work.
RELATED WORK
A great number of research efforts have already been dedicated to fault tolerance in the scope of distributed environments. Aspects that have been explored include the design and implementation of fault detection services [4], [5], as well as the development of failure prediction [3], [6], [7] and recovery strategies [8], [9]. While both methods aim to improve system performance in the presence of crashes, their effectiveness largely depends on tuning runtime parameters such as the checkpointing interval and the number of replicas [10], [11].
FAULT-TOLERANCE AND DYNAMIC SCHEDULING
When a processing unit fails, the work it would have performed in the future must be distributed onto the remaining processing units. Here we have to distinguish whether spare processing units are available or not. If there is an unused processor in the active set, then we still have p processing units available, and after the re-execution of the interrupted task, the grid job can continue as if nothing had happened.
If there is no spare processor in the active set, then the outstanding work must be distributed onto the remaining p - 1 processing units, and the remaining runtime increases by a factor of p/(p - 1) under the assumption of perfect speedup. In practice, however, processor efficiency increases with decreasing processor count, so this bound can be sharpened: the actual increase of the remaining runtime is at most p/(p - 1).
Note that occasionally spare processors may be available for free: a grid of a given size is allocated for a grid job, but the dependencies prevent the scheduler from using every processing unit in the grid. A main reason for this counter-intuitive behavior is communication cost. We illustrate this with the help of Fig. 2.1. Part (a) of this figure shows a simple job graph. Nodes 0 and 5 are artificial nodes that serve to provide unique source and sink nodes in the job graph, so their execution time (rt) is 0, as are the communication times on their outgoing/incoming edges. Part (b) shows a schedule for this job graph on two processing units. Tasks 1 and 3 are scheduled on processing unit P0, and task 3 can begin immediately after task 1 has completed, despite an edge (1, 3) with weight 20. This model is motivated by the fact that data transfer within a single processing unit happens via memory or the hard disk and is essentially free compared to communication over a network. In contrast, task 4 can only be started on P1 eight time units after the completion of task 1. This sort of behavior also motivates job duplication in static scheduling. Part (c) shows a schedule where two copies of task 1 are executed, one on P0 and one on P1. As the runtime of task 1 is smaller than the communication cost on edge (1, 4), task 4 can now be started earlier, and the schedule length, i.e. the time to execute the schedule completely, decreases from 20 to 16 time units.
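A minimal sketch of the duplication criterion follows: re-executing task 1 locally on P1 is worthwhile because its runtime is smaller than the communication cost on edge (1, 4). The runtime of task 1 is not stated in the text, so the value rt(1) = 4 used below is an assumption chosen only for illustration.

    def start_time_with_remote_parent(finish_parent, comm):
        """Task 4 can start once the data of task 1 arrive over the network."""
        return finish_parent + comm

    def start_time_with_local_duplicate(p1_ready, rt_parent):
        """Task 4 can start once a local copy of task 1 has finished on the same unit."""
        return p1_ready + rt_parent

    # Assumed numbers: task 1 starts at time 0 on both units, edge (1, 4) has weight 8
    # as in Fig. 2.1, and rt(1) is assumed to be 4.
    rt1, comm14 = 4, 8
    print(start_time_with_remote_parent(rt1, comm14))   # 12: wait for the transfer
    print(start_time_with_local_duplicate(0, rt1))      # 4: duplication pays off since rt1 < comm14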
If every job is duplicated, then the failure of a single processor can be tolerated in a static schedule, since at least one replica of each job will survive. One would need p extra processors to guarantee continuation of a static schedule after a fault without performance loss: each of the extra processors executes exactly the same tasks as one of the p original processors, so that the job graph is executed twice on different hardware. No matter which processor fails, one of the copies of the job graph will be executed without disturbance. More generally, if p + k processing units are available, for 0 ≤ k < p, then the schedule length increases by a factor of 2p/(p + k), assuming the duplicated work can be evenly distributed.
While in the dynamic case only the remaining runtime was increased, here the total schedule length is increased, since from the beginning of the grid job on, the jobs have to be scheduled in two copies each on the p + k processing units. For k = 0, i.e. p processing units, the schedule length increases by a factor of 2.
While this looks like a serious loss of performance, our experiments indicate that in practice the loss of performance is only around 10% even for p = 4, because the efficiency is often quite low due to data dependencies, so that both copies of the task graph can be interleaved.
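For reference, the following lines simply evaluate the two factors discussed above, the remaining-runtime increase p/(p - 1) in the dynamic case and the schedule-length increase 2p/(p + k) under full duplication; the printed values are consequences of the formulas, not measurements.

    def dynamic_rework_factor(p: int) -> float:
        """Increase of the remaining runtime when p processors shrink to p - 1."""
        return p / (p - 1)

    def duplication_factor(p: int, k: int) -> float:
        """Schedule-length increase when the job graph is executed twice on p + k units."""
        assert 0 <= k < p
        return 2 * p / (p + k)

    for p in (4, 8, 16):
        print(p, round(dynamic_rework_factor(p), 2), round(duplication_factor(p, 0), 2))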
CHECKPOINT METHOD IN GRID SYSTEMS
A periodic snapshot of the state of an application is called a checkpoint. The state of the application reflects the effects of the different jobs performed by the application up to that particular point in time. The state is saved so that it can be used for the recovery of any job after a crash: starting from the saved state, the unfinished failed work is driven to completion.
Our goal is to design a scheduling approach such that applications are executed efficiently and reliably. For this, the Grid scheduler needs up-to-date knowledge of the available resources and their status.
As the figure shows, a scheduling procedure in the Grid can be simplified into three stages: resource discovery and filtering, resource selection and scheduling according to certain objectives, and job submission [10]. Typically, a Grid scheduler (GS) receives applications from Grid users, selects feasible resources for these applications according to data acquired from the Grid Information Service module, and finally generates application-to-resource mappings based on certain objective functions and predicted resource performance. Information about the status of available resources is essential for a Grid scheduler to build an appropriate schedule, particularly when the heterogeneous and dynamic nature of the Grid is taken into account. The role of the Grid Information Service (GIS) is to provide such data to Grid schedulers. The GIS is responsible for collecting and predicting resource status information, such as CPU capacity, memory size, network bandwidth, software availability and the load of a site in a particular period.
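A minimal sketch of the three stages, assuming a hypothetical in-memory GIS table and a simple objective (smallest predicted completion time); the field names and the submit() callback are placeholders for illustration, not part of any real middleware API.

    def schedule_job(job, gis_table, submit):
        """Discover -> filter -> select -> submit, following the three-stage model above."""
        # Stage 1: resource discovery and filtering.
        candidates = [r for r in gis_table
                      if r["free_memory_mb"] >= job["memory_mb"]
                      and job["software"] in r["software"]]
        if not candidates:
            raise RuntimeError("no suitable resource found")
        # Stage 2: resource selection according to an objective (predicted completion time).
        best = min(candidates, key=lambda r: job["work_units"] / r["cpu_speed"] + r["load"])
        # Stage 3: job submission.
        return submit(job, best)

    gis_table = [
        {"name": "nodeA", "cpu_speed": 2.0, "free_memory_mb": 4096, "load": 0.2, "software": {"octave"}},
        {"name": "nodeB", "cpu_speed": 3.0, "free_memory_mb": 2048, "load": 0.8, "software": {"octave"}},
    ]
    job = {"work_units": 100, "memory_mb": 1024, "software": "octave"}
    print(schedule_job(job, gis_table, lambda j, r: r["name"]))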
NEED FOR CHECKPOINT METHOD
In distributed systems, recovery mechanisms and checkpointing already exist. A Grid requires a middleware for the cooperation of geographically spread, heterogeneous resources that form its overall infrastructure, in which grid applications using these resources are supported by checkpointing methods for recovering from failures.
This article addresses recovery methods in grid environments using stored checkpoints. Recovery becomes more difficult when sites are distributed, autonomous and heterogeneous. The proposed checkpointing and recovery mechanism adds a grid wrapper service, built with the WSRF framework, that coordinates and integrates with existing grid services such as the Reliable File Transfer service (RFT), the Monitoring and Discovery Service (MDS), the Grid File Transfer Protocol (GridFTP), the Data Location Service (DLS) and the Open Grid Services Architecture Data Access and Integration service (OGSA-DAI), with the intent of providing fault tolerance to applications running on the grid. All grid applications follow the Open Grid Services Architecture (OGSA). The proposed checkpointing method is implemented in the form of platform-independent and interoperable grid services. A task on a heavily loaded node can be restarted on an idle node using checkpoints. The use of SOAP (Simple Object Access Protocol) message passing and uncoordinated checkpoints is proposed. The features provided are reliable checkpoints and recovery from failures across distributed heterogeneous grid resources using the standard grid infrastructure. Any application running on a grid node can register for and use the checkpoint and recovery services.
METHOD FOR CHECKPOINTING COMPUTATIONAL GRID APPLICATIONS
In a computational grid, complex computations on enormous amounts of experimental data are performed using a number of open standard services of the grid infrastructure. In this section, we propose a framework for a checkpoint and recovery service; it should integrate with the grid middleware services.
The architecture shown in Figure 2 describes how a job submitted by the client is processed in a computational grid environment using the WSRF framework. The Grid Resource Allocation and Management service (GRAM), together with local schedulers such as Condor, decides on which grid node the job should be submitted, depending on the availability of nodes and matching resources. The checkpoint service runs as a background service which periodically obtains the state of an application-specific service (which has registered with it) and stores it on remote storage using the Reliable File Transfer service (RFT), which in turn uses the Grid File Transfer Protocol (GridFTP).
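A minimal sketch of such a background checkpoint service, assuming a hypothetical get_state() callback supplied by each registered application service and a transfer_to_remote() stub standing in for the RFT/GridFTP transfer; neither name comes from the actual middleware.

    import pickle, time

    class CheckpointService:
        """Periodically serialize the state of registered services and ship it to remote storage."""

        def __init__(self, transfer_to_remote, interval_s=480):  # e.g. an 8-minute interval
            self._transfer = transfer_to_remote   # stub for the RFT/GridFTP transfer
            self._interval = interval_s
            self._registered = {}                 # service name -> get_state callback

        def register(self, name, get_state):
            self._registered[name] = get_state

        def run_forever(self):
            while True:
                for name, get_state in self._registered.items():
                    snapshot = pickle.dumps(get_state())      # application-defined state
                    self._transfer(f"{name}.ckpt", snapshot)  # store on remote storage
                time.sleep(self._interval)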
The Monitoring and Discovery Service (MDS) performs event management and monitors the grid services currently running in the grid. In case of any failure, the Failure Discovery Service (FDS) shown in Figure 3, which continuously checks MDS for the status of the grid services (running or failed), calls the Recovery Service (RS), which in turn obtains the data stored by the Checkpoint Service, restores the previous state and restarts the remaining portion of the job after resubmitting it.
It has one additional GRAM job manager. If the crash of a grid node on which some job was being processed by the Processing Service is critical, then the Condor scheduler can also resubmit the remaining portion of the job to some other idle and authenticated grid node, which can process the job using its Processing Service, restarting it from the state stored by the Checkpoint Service.
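The recovery path can be sketched as follows, with mds_status(), fetch_checkpoint() and resubmit() as hypothetical stand-ins for the MDS query, the RFT retrieval and the GRAM/Condor resubmission described above.

    import pickle, time

    def failure_discovery_loop(services, mds_status, fetch_checkpoint, resubmit, poll_s=30):
        """Poll MDS; when a service is reported failed, recover its state and resubmit the job."""
        while True:
            for svc in services:
                if mds_status(svc) == "failed":
                    state = pickle.loads(fetch_checkpoint(f"{svc}.ckpt"))  # last saved snapshot
                    resubmit(svc, state)   # restart the remaining portion on another node
            time.sleep(poll_s)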
GENETIC ALGORITHM (GA) IS AN EVOLUTIONARY TECHNIQUE.
The general procedure of GA search is as follows [4]:
The stopping criteria can be, for example: 1) no improvement in recent evaluations; 2) all chromosomes converge to the same mapping; 3) a cost bound is met. Owing to its simplicity, GA is the most popular nature-inspired heuristic used in algorithms for optimization problems.
Genetic_algorithm()
{
    Initialize the population;
    Evaluate the initial population;
    while (stopping condition not reached)
    {
        Select solutions for the next population;
        Perform crossover and mutation;
        Evaluate the population;
    }
}
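A minimal runnable sketch of this procedure for the job-to-resource mapping problem (a chromosome is a vector assigning each job to a resource) is given below; the fitness function (makespan), population size and rates are illustrative assumptions, not tuned values from any experiment.

    import random

    def ga_schedule(job_costs, n_resources, pop_size=20, generations=100,
                    crossover_rate=0.8, mutation_rate=0.1):
        """Evolve job-to-resource mappings that minimize the makespan (max resource load)."""
        n_jobs = len(job_costs)

        def makespan(chrom):
            load = [0.0] * n_resources
            for job, res in enumerate(chrom):
                load[res] += job_costs[job]
            return max(load)

        population = [[random.randrange(n_resources) for _ in range(n_jobs)]
                      for _ in range(pop_size)]
        for _ in range(generations):                       # stopping condition: generation bound
            population.sort(key=makespan)                  # evaluate the population
            parents = population[:pop_size // 2]           # select solutions for the next population
            children = []
            while len(children) < pop_size - len(parents):
                a, b = random.sample(parents, 2)
                cut = random.randrange(1, n_jobs) if random.random() < crossover_rate else 0
                child = a[:cut] + b[cut:]                  # single-point crossover
                if random.random() < mutation_rate:        # mutation: reassign one job
                    child[random.randrange(n_jobs)] = random.randrange(n_resources)
                children.append(child)
            population = parents + children
        return min(population, key=makespan)

    print(ga_schedule(job_costs=[4, 2, 7, 3, 5, 1], n_resources=3))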
CHECKPOINT TECHNIQUE
Checkpointing and message logging are among the accepted, general-purpose methods for providing fault tolerance in distributed systems. Checkpointing allows recovery to a previous correct state. Taking a checkpoint is often a costly operation and may carry a severe performance penalty. In a system where only checkpointing is used, a process is restored to its last checkpointed state.
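A minimal process-level illustration of rollback to the last checkpoint, assuming the computation's state fits in a dictionary and is persisted with pickle; the file name and checkpoint period are arbitrary choices for the sketch.

    import os, pickle

    CKPT_FILE = "worker_state.ckpt"    # arbitrary file name
    CKPT_EVERY = 100                   # checkpoint period in iterations (arbitrary)

    def run(total_iters):
        # Roll back to the last checkpointed state if one exists, otherwise start fresh.
        if os.path.exists(CKPT_FILE):
            with open(CKPT_FILE, "rb") as f:
                state = pickle.load(f)
        else:
            state = {"i": 0, "partial_sum": 0}

        while state["i"] < total_iters:
            state["partial_sum"] += state["i"]      # the actual work of one step
            state["i"] += 1
            if state["i"] % CKPT_EVERY == 0:        # periodic checkpoint
                with open(CKPT_FILE, "wb") as f:
                    pickle.dump(state, f)
        return state["partial_sum"]

    print(run(1000))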
The algorithm is based on the available resources, memory, failure rate, and speed.
Algorithm 2
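Algorithm 2 itself is not reproduced here; the following sketch only illustrates how a node score could combine the four factors named above, with hypothetical weights and value ranges chosen for illustration.

    def node_score(resources_avail, free_memory_mb, failure_rate, cpu_speed,
                   w=(0.3, 0.2, 0.3, 0.2)):
        """Higher is better: favor available, fast, well-provisioned nodes with low failure rates."""
        return (w[0] * resources_avail          # fraction of requested resources available (0..1)
                + w[1] * free_memory_mb / 1024.0
                + w[2] * (1.0 - failure_rate)   # observed failure rate in (0..1)
                + w[3] * cpu_speed)

    candidates = {"nodeA": (1.0, 2048, 0.05, 2.0), "nodeB": (0.8, 4096, 0.20, 3.0)}
    best = max(candidates, key=lambda n: node_score(*candidates[n]))
    print(best)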
As the figure shows, the use of an additional network to offload checkpoint replication can lead to substantial reductions in overhead. Our results show an average improvement of 40 percent at 8-minute checkpointing intervals and 50 percent at 16-minute checkpointing intervals. We note that while utilizing a secondary network improves the performance of our replication scheme, it cannot fully eliminate all networking overhead. Ultimately, each node must individually transmit and receive all checkpoint data (as well as write the replicas to disk). Processing this data leads to a small amount of overhead that grows with the amount of data.
CONCLUSION
In Grid environments, job execution failures can occur for a variety of reasons, such as network crashes, overloaded resources or the non-availability of required software components. In this article, we presented a new dynamic scheduling algorithm for grid applications that uses job duplication to permit a grid job to survive processor failures. The scheduling algorithm has the benefit of dynamic scheduling in the case of an error, i.e. a low increase of the schedule length. We have presented some very preliminary experimental results that support our theoretical analysis. Recovery in fault-tolerant grid systems is invariably achieved by checkpointing the state of the system on a regular basis.
FUTURE WORK
Our future work will consist of a complete implementation of this scheduler and further exploration of its extensions. Instead of insisting on no overhead in the fault-free case, it is also possible to allow a small amount of overhead in the fault-free case while improving the situation in the case of a fault. As part of our future work we also plan to extend our focus to a performance study of the suitability rate. Encoding the checkpoint data has the potential to decrease the amount of data sent over the network, as complete replicas need not be sent. However, the encoding overhead is a concern that must be balanced against the reduction in replication overhead.
Checkpoint records can be compressed to decrease the space overhead due to checkpoints. Logs and checkpoints should store the minimal information needed to serve recovery after a breakdown. The proposed checkpoint and recovery mechanism can be improved to lessen the time overhead on the normal execution of tasks or transactions. This can be modeled for different classes of jobs or transactions varying in degree of complexity.
REFERENCES