Abstract - In this report, the testing and deployment of a grid simulator are discussed and the features required for a successful simulator are described. The grid simulator OptorSim is then presented with a description of its design, features and implementation, with an emphasis on the deployment environment selected to make it as realistic as possible. Finally, a comparison is made with other grid simulators.
Keywords: OptorSim, grid simulators, grid topology, Storage Element, Computing Element, GridSim.
Introduction to Grid Simulation
Grid technology is emerging as the solution to the data handling and storage problems posed by the next generation of experiments. It is important to make the best use of a grid's resources, whether computational, storage or network and it has been shown that data replication is an important mechanism for reducing data access times and hence improving the overall resource usage. Simulation is a useful way of exploring possible replication algorithms and this has led to the development of the grid simulator OptorSim.
Data grids are highly complex environments. The many components, users and sites, with their interactions, mean that it is impossible to predict behaviours from first principles. Before time and effort is spent on developing a new and experimental component, however, it is desirable to know its effects. The only way this can be done without implementing a prototype and deploying it on a testbed is by simulation.
Even if the required components were developed and tested on the real grid, then, it would still not be possible to gain a true idea of how they would perform under the conditions of a grid in its production phase. Simulation is again necessary, therefore, to indicate how the components would perform under future conditions.
In our project assignment, we deployed OptorSim in order to simulate grid computing.
Components of a Grid Simulator
In any simulation, it is important to model the system components at the right level of abstraction. For a grid simulator, it is therefore necessary to examine a real grid system and decide which components are necessary for the simulation and at what level of detail. For a data management simulator such as OptorSim, the important components are as follows (a minimal code sketch of these abstractions is given after the list):
jobs or applications which run on the grid
computing resources to which jobs can be sent
storage resources where data can be kept
network infrastructure to connect the resources
scheduler to decide where jobs should be sent
transfer system to move data around the grid
component to perform replica management
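The components listed above might be captured, for illustration only, as the minimal Java abstractions sketched below. The interface and class names are hypothetical and do not correspond to OptorSim's actual API; the network and transfer components are illustrated later, in the sections on topology and bandwidth.

// Hypothetical abstractions for the main components of a data-grid simulator.
// These names are illustrative only and are not OptorSim's real classes.
interface ComputingElement {                 // computing resource to which jobs are sent
    void submitJob(GridJob job);
    int queueLength();
}

interface StorageElement {                   // storage resource where data are kept
    long capacityMB();                       // total capacity in MB
    boolean store(DataFile file);
    boolean contains(String logicalFileName);
}

interface ResourceBroker {                   // scheduler deciding where jobs should be sent
    ComputingElement schedule(GridJob job);
}

interface ReplicaOptimiser {                 // component performing replica management
    void maybeReplicate(String logicalFileName);
}

class GridJob {                              // a job and the logical files it processes
    final String name;
    final java.util.List<String> requiredFiles;
    GridJob(String name, java.util.List<String> requiredFiles) {
        this.name = name;
        this.requiredFiles = requiredFiles;
    }
}

class DataFile {                             // a file with a logical name and a size in MB
    final String logicalName;
    final long sizeMB;
    DataFile(String logicalName, long sizeMB) {
        this.logicalName = logicalName;
        this.sizeMB = sizeMB;
    }
}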
In any simulation, it is necessary to make sure that the model and its results are sufficiently correct. This is achieved through the twin processes of validation and verification.
In [1], which is a standard reference for simulation modelling at an introductory level, a variety of techniques for validation and verification are outlined, some of which are relevant to the case of grid simulation and some of which are not. Those used in the development of OptorSim are as follows:
Degenerate Tests: Appropriate values of the simulation's input and internal parameters are selected and the output is examined to see whether the simulation's behaviour is what would be expected given these parameters. If the rate of job submission, for example, is much higher than the rate of job processing at a site, one would expect the queue size at a site to increase with respect to time until all jobs had been submitted.
Fixed Values: Subset of Degenerate Tests. Fixed values of the simulation parameters are chosen such that the expected results can be calculated analytically. A very simple grid with a few jobs, for example, could be used as input and the expected job times calculated, then compared with the actual results.
Internal Validity: A number of runs of the simulation are performed with the same parameters, and the variability of the output is observed. If there is a lack of consistency in the results, i.e. high variability, the results may not be valid (a sketch of such a consistency check is given after this list).
Face Validity: System experts check whether the simulated model and its behaviour are reasonable. This is useful at the start of the modelling process for determining whether the logic in the model is correct, and also for determining whether the output is reasonable for a given input.
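As an illustration of the Internal Validity technique referred to above, the hypothetical harness below runs the same configuration several times and reports the spread of one output metric (the mean job time). The runSimulation method is a stand-in for whatever entry point the simulator exposes and is an assumption, not OptorSim's actual API.

import java.util.ArrayList;
import java.util.List;

// Hypothetical internal-validity harness: repeat identical runs and check
// that the output metric does not vary excessively between them.
public class InternalValidityCheck {

    // Placeholder: would invoke the simulator with the given configuration
    // and random seed and return, e.g., the mean job time in seconds.
    static double runSimulation(String parametersFile, long randomSeed) {
        return 0.0;
    }

    public static void main(String[] args) {
        List<Double> meanJobTimes = new ArrayList<>();
        for (int run = 0; run < 10; run++) {
            meanJobTimes.add(runSimulation("parameters.conf", run));
        }

        double mean = meanJobTimes.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        double variance = meanJobTimes.stream()
                .mapToDouble(t -> (t - mean) * (t - mean))
                .average().orElse(0.0);
        double coefficientOfVariation = mean == 0.0 ? 0.0 : Math.sqrt(variance) / mean;

        // High variability across identical runs suggests the results may not be valid.
        System.out.printf("mean = %.2f s, coefficient of variation = %.3f%n",
                mean, coefficientOfVariation);
    }
}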
Also of great importance to simulation accuracy is the validity of the inputs. In the case of a grid simulation, these will be the characteristics of the jobs and files being simulated, storage and computing capacity at grid sites, network capacity and so on.
OptorSim Architecture and Design
The conceptual model of the architecture used for OptorSim is shown in Figure 1.
In this model, the grid consists of a number of grid sites, connected by network links in a certain topology. A grid site may have a Computing Element (CE), a Storage Element (SE) or both. Each site also has a Replica Optimiser (RO) which makes decisions on replications to that site.
A Resource Broker (RB) handles the scheduling of jobs to sites, where they run on the CEs. Jobs process files, which are stored in the SEs and can be replicated between sites according to the decisions made by the RO.
A Replica Catalogue holds mappings of logical filenames to physical filenames, and a Replica Manager handles replications and registers them in the Catalogue. Note that the central Replica Manager and Catalogue in this model would constitute a single point of failure in a real grid, and thus a distributed solution would be preferred.
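The relationships between these components might be expressed, very roughly, as in the Java sketch below, which reuses the hypothetical ComputingElement, StorageElement and ReplicaOptimiser abstractions from the earlier sketch. The field and method names are illustrative assumptions, not OptorSim's implementation.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Rough structural sketch of the architecture described above; names are
// illustrative only.
class GridSite {
    final String name;
    ComputingElement computingElement;   // may be null: not every site has a CE
    StorageElement storageElement;       // may be null: not every site has an SE
    ReplicaOptimiser replicaOptimiser;   // decides on replications to this site
    GridSite(String name) { this.name = name; }
}

class ReplicaCatalogue {
    // Maps a logical filename to the physical copies registered on the grid.
    private final Map<String, List<String>> mappings = new HashMap<>();

    void register(String logicalFileName, String physicalFileName) {
        mappings.computeIfAbsent(logicalFileName, k -> new ArrayList<>()).add(physicalFileName);
    }

    List<String> lookup(String logicalFileName) {
        return mappings.getOrDefault(logicalFileName, List.of());
    }
}

class ReplicaManager {
    private final ReplicaCatalogue catalogue;
    ReplicaManager(ReplicaCatalogue catalogue) { this.catalogue = catalogue; }

    // Copies a replica from one site to another (transfer omitted here) and
    // registers the new physical filename in the catalogue.
    void replicate(String logicalFileName, GridSite from, GridSite to) {
        catalogue.register(logicalFileName, "pfn://" + to.name + "/" + logicalFileName);
    }
}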
Simulation Inputs
As well as the basic architecture, the detailed conceptual model of the grid applications and fabric components is crucial in achieving a valid simulation. These are components which must be easily configurable for each simulation and so are classified together here as simulation inputs.
Figure 1: OptorSim Architecture
To achieve an accurate idea of how these should be modelled, a close examination was made of both current systems and planned future systems. In particular, the Computing Model documents from the LHC experiments [2] [3] [4] [5] were used, as these detail the experiments' assumptions on dataset and file sizes, data distribution, processing times and computational and storage requirements as well as the kind of grid structure they envisage. These simulation inputs are now described in turn.
Grid Topology
The topology of the grid - the number of sites and the way they are connected - has a strong impact on grid behaviour. It is therefore important that OptorSim can take any grid topology as input, i.e. that the user can specify the storage capacity and computing power at each site, and the capacity and layout of the network links between them. Each SE is defined to have a fixed capacity.
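As a toy illustration of the kind of information such a topology input contains, the sketch below describes a small grid as a set of sites with storage and computing capacities plus a symmetric matrix of link bandwidths. The site names and numbers are invented, and the representation is an in-memory approximation, not OptorSim's actual configuration-file syntax.

// Toy in-memory grid topology: per-site capacities plus a symmetric matrix of
// link bandwidths. Invented values; not OptorSim's configuration format.
public class ToyTopology {

    public static void main(String[] args) {
        String[] sites = { "SiteA", "SiteB", "SiteC" };        // hypothetical site names
        long[] storageMB = { 1_000_000, 100_000, 100_000 };    // SE capacities
        int[] workerNodes = { 200, 50, 50 };                   // CE sizes

        // bandwidth[i][j] = capacity of the link between site i and site j in
        // Mbit/s; 0 means no direct link.
        double[][] bandwidth = {
            {    0, 1000,  622 },
            { 1000,    0,    0 },
            {  622,    0,    0 },
        };

        for (int i = 0; i < sites.length; i++) {
            System.out.printf("%s: %d MB storage, %d worker nodes%n",
                    sites[i], storageMB[i], workerNodes[i]);
        }
        System.out.println("SiteA-SiteB link: " + bandwidth[0][1] + " Mbit/s");
    }
}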
Jobs and Files
A job usually processes a certain number of files. This is simulated in OptorSim by defining a list of jobs and the files that they need, with their sizes. A job, when sent to a CE, will process some or all of the files in its dataset, according to the access pattern which has been chosen. The time a file takes to process depends on its size and on the number and processing power of worker nodes at the CE.
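One plausible reading of that dependence, assumed here purely for illustration, is that the processing time scales linearly with file size and inversely with the aggregate power of the worker nodes:

// Illustrative estimate of the time to process one file at a CE, assuming
// (as a simplification) time = size / (number of worker nodes * power per node).
public class ProcessingTimeEstimate {

    static double processingTimeSeconds(double fileSizeMB,
                                        int workerNodes,
                                        double nodePowerMBPerSecond) {
        return fileSizeMB / (workerNodes * nodePowerMBPerSecond);
    }

    public static void main(String[] args) {
        // e.g. a 1000 MB file at a CE with 10 worker nodes, each processing 5 MB/s
        System.out.println(processingTimeSeconds(1000, 10, 5) + " s"); // 20.0 s
    }
}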
Site Policies
Different grid sites are likely to prioritise different kinds of job.
Network Bandwidth Variation
Apart from the layout of sites on the grid, it is important to model the underlying behaviour of the network over which files are transferred. Networking is a highly complex field and it would be possible to simulate the network to a high level of detail; indeed, there exist many specialised network simulators which do this. For OptorSim, however, it is not necessary to go down to the level of individual packets or different protocols. It is necessary only to know how the available bandwidth on a link varies over time and adjust file transfers accordingly.
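For example, a file transfer can be stepped through time using whatever bandwidth the link offers in each interval. The sketch below assumes a piecewise-constant, one-second bandwidth profile; both the stepping scheme and the profile are invented for the example and are not taken from OptorSim's code.

// Illustrative file-transfer model: step the transfer through one-second
// intervals, using the bandwidth available on the link in each interval.
public class TransferTimeEstimate {

    // availableMbitPerSecond[i] is the free bandwidth during the i-th one-second step.
    static double transferTimeSeconds(double fileSizeMbit, double[] availableMbitPerSecond) {
        double remaining = fileSizeMbit;
        double elapsed = 0.0;
        for (double bw : availableMbitPerSecond) {
            if (remaining <= 0) break;
            double transferred = Math.min(remaining, bw);   // Mbit moved in this step
            remaining -= transferred;
            // A full second passes unless the transfer finishes part-way through it.
            elapsed += (bw > 0 && transferred < bw) ? transferred / bw : 1.0;
        }
        return remaining > 0 ? Double.POSITIVE_INFINITY : elapsed;
    }

    public static void main(String[] args) {
        // An 800 Mbit file over a link whose free bandwidth drops from 600 to 100 Mbit/s.
        double[] profile = { 600, 300, 100, 100, 100 };
        System.out.println(transferTimeSeconds(800, profile) + " s");   // about 1.67 s
    }
}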
Simulation Parameters
There are a number of parameters, other than the simulation inputs, which are of interest when performing simulations of a data grid. The most important are outlined here; a full description of all the parameters is given in [6], and a sketch of how such parameters might be read from a configuration file is given after the list. These are:
Initial File Distribution: The distribution of files around the grid at the start of the simulation can affect the behaviour of the replica optimisation strategies. A set of the master files is therefore placed at designated sites, and the option is also given to fill up the sites with replicas of these files before the simulation gets under way.
Access Patterns: Different kinds of job may access the files in the dataset in a different way. Some jobs may process each file in sequence; others may miss some files or read them in a different order.
Users: The pattern and rate of job submission by users may affect the behaviour of job queues at sites and hence the behaviour of the Resource Broker and Replica Optimisers.
Number of Jobs: Optimisation strategies may perform differently under conditions when the grid is under-utilised (few jobs) or congested (many jobs) so it is important to vary the number of jobs and measure this effect.
Job Scheduler: The algorithm used by the Resource Broker for its scheduling decisions will clearly have an important effect on the running of the jobs and performance of the grid.
Optimiser: The algorithm used by the Replica Optimisers for the file replication strategy, which is the original research focus of OptorSim.
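To make these parameters more concrete, the sketch below loads a hypothetical parameters file using java.util.Properties. The key names and default values are invented for illustration and do not match the names used in OptorSim's own parameters file.

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

// Hypothetical parameters-file handling; the keys below are invented to
// illustrate the kinds of parameter discussed above.
public class SimulationParameters {

    public static void main(String[] args) throws IOException {
        Properties p = new Properties();
        try (FileInputStream in = new FileInputStream("parameters.conf")) {
            p.load(in);
        }

        int numberOfJobs      = Integer.parseInt(p.getProperty("number.of.jobs", "100"));
        String accessPattern  = p.getProperty("access.pattern", "sequential");
        String scheduler      = p.getProperty("scheduler", "queue.length");
        String optimiser      = p.getProperty("optimiser", "lru");
        boolean fillAllSites  = Boolean.parseBoolean(p.getProperty("fill.all.sites", "false"));

        System.out.printf("jobs=%d pattern=%s scheduler=%s optimiser=%s fill=%b%n",
                numberOfJobs, accessPattern, scheduler, optimiser, fillAllSites);
    }
}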
OptorSim Deployment and Testing
There are several configuration files which can be used to control the various inputs to OptorSim. Sample configuration files can be found in the example/ directory. When using the supplied examples, the grid configuration file and the job configuration file need to match in order to run the program.
The program code is structured into several packages, each of which deals with a different part of the simulation. The packages and their main remits are shown in Table 1. Details of most of the classes within these packages are not given here, but can be found in the Javadoc API which comes with the OptorSim release [7].
Table 1. List of packages in OptorSim, with their main functionality, ordered from lowest to highest level.
There are two ways in which OptorSim can output useful information to the user: the terminal and the Graphical User Interface (GUI). In each case, a number of statistics are gathered at the levels of CE and SE, sites and whole grid. When the simulation finishes, these are output to the terminal according to the level of detail stipulated by the user in the parameters file. If the GUI is being used, more detailed monitoring of the state of the grid is performed and output to the GUI in real time as the simulation progresses. This includes continuous calculation of job times, SE usage, network usage and so on. Figure 2 shows a screenshot of the GUI during a simulation run.
Degenerate tests were applied by inspection during the development process. The simulation would be run with a high volume of output information, and if anomalous or unreasonable behaviour was observed, the problem would be searched for and solved. This was sometimes due to implementation bugs and sometimes due to shortcomings in some aspect of the model, such as the way future values of files were estimated.
The Fixed Value tests were administered via a suite of functional tests, which also verified the correctness of the implementation during development. The functional tests covered basic network behaviour, schedulers, access patterns, replication times and optimisation algorithms. These used very simple grid configurations with only a few sites, where the correct results could be calculated and used to check the simulation results, to show that the implementation was indeed correct. Some simple configurations were also tested in [8], which also showed that OptorSim behaves as expected.
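To illustrate what such a fixed-value functional test might look like, the JUnit-style sketch below checks a transfer time that can be computed analytically for a trivial two-site configuration. The helper method is a stand-in for a call into the simulated transfer system; neither it nor the test class corresponds to OptorSim's actual test suite.

import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

// Illustrative fixed-value test: for a trivial two-site grid the expected
// transfer time can be calculated by hand and compared with the simulation.
public class FixedValueTest {

    @Test
    public void singleFileTransferMatchesAnalyticResult() {
        double fileSizeMbit = 1000;       // 1000 Mbit file
        double linkMbitPerSecond = 100;   // dedicated 100 Mbit/s link, no contention

        // Expected value, calculated analytically: 1000 / 100 = 10 seconds.
        double expectedSeconds = fileSizeMbit / linkMbitPerSecond;

        // Hypothetical call into the simulated transfer system.
        double simulatedSeconds = simulateTransfer(fileSizeMbit, linkMbitPerSecond);

        assertEquals(expectedSeconds, simulatedSeconds, 1e-6);
    }

    // Stand-in for the simulator; here it simply reproduces the analytic model.
    private double simulateTransfer(double fileSizeMbit, double linkMbitPerSecond) {
        return fileSizeMbit / linkMbitPerSecond;
    }
}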
Figure 2. Screenshot of OptorSim GUI.
Experimental Results
The following snapshots are based on the default parameters of OptorSim.
Comparison with other Simulators
There exist several other grid simulators, each with a different focus and design. This section briefly discusses each of them and how they differ from OptorSim.
GridSim
GridSim [9] is a grid simulation toolkit developed to investigate resource allocation techniques and, in particular, a computational economy. Like OptorSim, it is a Java-based discrete-event simulator. It supports modelling of heterogeneous computing resources from individual PCs to clusters, and various application domains from biomedical science to high energy physics. The focus is very much on scheduling and resource brokering, with a very detailed implementation of application resource requirements, budgets and local policies and influences at grid sites, down to the effects of public holidays and weekends. It does not, however, provide the data management capability that would be required to investigate replica optimisation strategies.
GridNet
GridNet [10] is written in C++ as a layer on top of the NS [10] network simulator. It aims to investigate dynamic data replication strategies, with a focus on network behaviour. Its architecture is simple, containing Storage Elements, Replica Managers and the network components.
Bricks Grid
Bricks Grid [11] is another Java-based discrete-event simulator, based on the Grid Datafarm (Gfarm) architecture. In this architecture, each grid site is designated as a Gfarm filesystem node, acting as both a storage and a computing node with a single virtual filesystem. Its replica management components, however, are centralised, in contrast to the distributed Replica Optimiser architecture of OptorSim.
Others
Other simulators also exist, such as EDGSim [12], which models job and data flows around the EDG framework.
Conclusion
OptorSim has given valuable results and is easily adaptable to new scenarios and new algorithms. Expansion will include continued experimentation with different site policies, job submission patterns and file sizes in the context of a complex grid.