The Replica Location Service (RLS) offers a mechanism to maintain and provide information about the physical locations of replicas. It is also used to improve overall system robustness, scalability, performance, and security. Secure replication is achieved using the Triple DES cryptographic algorithm, which encrypts and decrypts files with the help of a key. The RLS comprises two components: a Local Replica Catalog (LRC), which maintains mappings between logical names and target names, and a Replica Location Index (RLI), which stores mappings from logical names to the LRCs that, in turn, contain mappings from those logical names to target names. Information in the distributed RLIs is maintained using soft-state update protocols, which populate the distributed index, with Bloom filter compression used to reduce the overhead of distributing index information. We therefore propose an RLS framework for designing and implementing the Globus Replica Location Service together with cryptographic techniques.
DATA management in Grid environments is a challenging problem. Data-intensive applications may produce data sets on the order of terabytes or petabytes that consist of millions of individual files. These data sets may be replicated at multiple locations to increase storage reliability, data availability, and/or access performance. For example, scientists performing simulation or analysis often prefer to have local copies of key data sets to guarantee that they will be able to complete their work without depending on remote sites or suffering wide-area access latencies. The Replica Location Service (RLS) is one component of
an overall Grid data management architecture. The RLS provides a mechanism for registering the existence of replicas and for
discovering them. We have designed and implemented a distributed RLS that consists of two components: a Local Replica
Catalog (LRC) that stores mappings from logical names of data items to their addresses on a storage system and a Replica Location Index (RLI) that contains information about the mappings stored in
one or more LRCs and answers queries about their contents. Our RLS server implementation performs well in Grid environments, scaling to register millions of entries and supporting
up to 100 simultaneous clients.
A.L. Chervenak, R. Schuler, M.A. Amer, S. Bharathi, and C. Kesselman are with the Information Sciences Institute, Viterbi School of Engineering, University of Southern California, 4676 Admiralty Way, Suite 1001, Marina del Rey, CA 90292. E-mail: {annc, schuler, carl}@isi.edu, {mamer, shishir}@usc.edu.
M. Ripeanu is with the Department of Electrical and
Computer Engineering, University of British Columbia, 2332
Main Mall, Vancouver, BC V6T 1Z4, Canada. E-mail:
The existing system provides neither a framework nor a toolkit to find the location of a specified file in a Grid environment. As a result, data transfer is not achieved effectively, access performance suffers, and files cannot be located quickly.
The Replica Location Service (RLS) is a tool that provides the ability to keep track of one or more copies, or replicas, of files in a Grid environment. This tool, which is included in the Globus Toolkit, is especially helpful for users or applications that need to find where existing files are located in the Grid.
RLS consists of two components: the Local Replica Catalog (LRC) and the Replica Location Index (RLI). RLS can be integrated with Grid data management services to access and maintain replicated data. The information in an RLI is maintained by a soft-state protocol. Bloom filter compression is used to summarize the registered files and reduce the load on the network, and peer-to-peer techniques are considered for organizing servers. This paper presents the design of the RLS, its
implementation, and performance.
II. THE RLS FRAMEWORK
The RLS design and implementation are based on a framework jointly developed by the Globus and European Data Grid projects. This section presents the five elements of the RLS framework: LRCs, RLIs, soft-state update mechanisms, optional compression of updates, and membership services.
A. The Local Replica Catalog
LRCs maintain mappings between logical names and target names. Logical names are unique identifiers for data items that may have one or more physical replicas. Logical name spaces are application-specific, and tools are available for generating unique identifiers.
Target names are typically the physical locations of data, but they may also be logical names. An LRC implementation may associate attributes with logical or target names.
Clients query LRC mappings to discover target names associated with a logical name and the reverse. Clients also query for logical or target names with specified attribute values,
where attributes are arbitrary name-value pairs defined by an RLS
configuration.
An LRC may periodically send updates that summarize its state to one or more RLIs. In addition, an LRC maintains security over its contents, performing authentication and authorization of users via standard Grid mechanisms.
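To make these mappings concrete, the following minimal sketch (Python; purely illustrative, since the actual server is written in C over a relational back end, as described in Section V) models an LRC supporting forward, reverse, and attribute queries. All names are hypothetical.

from collections import defaultdict

class LocalReplicaCatalog:
    """Illustrative in-memory LRC: logical name <-> target name mappings."""

    def __init__(self):
        self.forward = defaultdict(set)   # logical name -> target names
        self.reverse = defaultdict(set)   # target name -> logical names
        self.attrs = defaultdict(dict)    # name -> {attribute: value}

    def add_mapping(self, logical, target):
        self.forward[logical].add(target)
        self.reverse[target].add(logical)

    def targets_of(self, logical):
        return self.forward.get(logical, set())

    def logicals_of(self, target):
        return self.reverse.get(target, set())

    def query_by_attr(self, attr, value):
        # Attributes are arbitrary name-value pairs attached to names.
        return [n for n, a in self.attrs.items() if a.get(attr) == value]

lrc = LocalReplicaCatalog()
lrc.add_mapping("lfn://experiment/run42.dat", "gsiftp://site-a.example.org/data/run42.dat")
print(lrc.targets_of("lfn://experiment/run42.dat"))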
B. The Replica Location Index
Each RLS deployment may contain one or more RLIs that store and answer queries about mappings held in one or more LRCs. An RLI stores mappings from logical names to the LRCs that, in turn, contain mappings from those logical names to target names. Each RLI responds to queries about logical names, providing a list of one or more LRCs believed to contain target mappings associated with that logical name.
RLIs accept periodic soft-state updates (described below)
from one or more LRCs that summarize LRC state. An RLI
provides security, performing authentication and authorization operations using standard Grid mechanisms.
The RLS framework also supports partitioning of updates
from an LRC server among RLI servers to reduce the amount of information sent to an RLI.
Options include partitioning based on the logical name-
space, with updates for portions of the namespace sent to different RLIs, and partitioning based on the domain namespace of target names.
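Building on the LocalReplicaCatalog sketch above, the RLI side of the lookup can be illustrated as follows. The two-level query (RLI first, then the candidate LRCs) is hypothetical code, not the RLS client interface.

from collections import defaultdict

class ReplicaLocationIndex:
    """Illustrative RLI: logical name -> identifiers of LRCs holding mappings."""

    def __init__(self):
        self.index = defaultdict(set)

    def update_from(self, lrc_id, logical_names):
        # Consume an (uncompressed) soft-state summary of one LRC's contents.
        for name in logical_names:
            self.index[name].add(lrc_id)

    def lrcs_for(self, logical_name):
        return self.index.get(logical_name, set())

def locate_replicas(rli, lrcs_by_id, logical_name):
    # Two-level lookup: ask the RLI for candidate LRCs, then query each LRC.
    targets = set()
    for lrc_id in rli.lrcs_for(logical_name):
        targets |= lrcs_by_id[lrc_id].targets_of(logical_name)
    return targets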
C. Soft-State Updates
Information in the distributed RLIs is maintained using soft-state update protocols. Soft-state information times out and must be
periodically refreshed. Soft-state update protocols provide two advantages. First, they decouple the state producer and consumer; the state consumer removes stale data implicitly via timeouts rather
than using an explicit protocol that requires communication with the producer. Second, state consumers need not maintain persistent state; if a consumer fails and later recovers, its state can be reconstructed using soft-state updates. In our design, each LRC
sends soft-state information about its mappings to one or more
RLIs.
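A minimal sketch of the consumer-side timeout behavior, assuming an invented lifetime value:

import time

SOFT_STATE_TIMEOUT = 3600.0   # assumed soft-state lifetime, in seconds

class SoftStateIndex:
    """Consumer-side view: associations expire unless refreshed."""

    def __init__(self):
        self.last_update = {}                    # (logical name, LRC id) -> timestamp

    def refresh(self, logical_name, lrc_id):
        # Called whenever a soft-state update mentions this association.
        self.last_update[(logical_name, lrc_id)] = time.time()

    def expire(self):
        # Stale entries vanish with no communication back to the producer.
        cutoff = time.time() - SOFT_STATE_TIMEOUT
        self.last_update = {k: t for k, t in self.last_update.items() if t >= cutoff}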
D. Compression of Soft-State Updates
Soft-state updates may optionally be compressed to reduce the amount of data sent from LRCs to RLIs and to reduce
storage and I/O requirements on RLIs. The RLS framework proposes two compression options. The first option reduces the size of soft-state updates by taking advantage of structural or
semantic information present in logical names, for example, by sending information about collections of related logical names rather than individual logical names. A second option is to use a
hash digest technique such as Bloom filter compression. While compression schemes based on hash digests greatly reduce network traffic and provide better scalability, they do not support
wildcard queries. This limitation, however, is not a major concern in practice for the RLS, since many data grid deployments include an additional service, a metadata catalog that supports queries
based on metadata attributes. These queries identify a list of logical files with specified attributes. The RLS is then used to locate physical replicas of these logical files.
E. Membership Management for RLS Services
The final component of the RLS framework is a membership service that tracks the set of participating LRCs and RLIs and
responds to changes in membership, for example, when a server joins or leaves the RLS. Existing RLS production deployments use
a simple, manually maintained, static configuration for RLS
servers that must be modified by an administrator to change update patterns when servers enter or leave the system. Ideally,
membership management would be automated so that LRCs and
RLIs can discover one another and updates are redistributed to balance load when servers join or leave the system. More
automated membership management alternatives include peer-to-peer (P2P) self-organization of servers and a service registry that
maintains a list of active servers and facilitates reconfiguration.
III. CRYPTOGRAPHY
The Triple DES cryptographic algorithm is used to encrypt and
decrypt files. The LRC uses Triple DES encryption to encrypt a file with a key, and the RLI uses Triple DES decryption with the same key to retrieve the original file.
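The following sketch illustrates this scheme with the PyCryptodome library in CBC mode; the mode of operation, key size, and key-distribution details are assumptions, since the text above specifies only that encryption and decryption share a key.

from Crypto.Cipher import DES3
from Crypto.Random import get_random_bytes
from Crypto.Util.Padding import pad, unpad

# Generate a parity-adjusted, non-degenerate 24-byte (three-key) 3DES key.
# How the LRC and RLI share this key is out of scope here.
while True:
    try:
        key = DES3.adjust_key_parity(get_random_bytes(24))
        break
    except ValueError:
        continue

def encrypt_file_bytes(plaintext):
    cipher = DES3.new(key, DES3.MODE_CBC)
    # Prepend the random IV so the receiving side can initialize its cipher.
    return cipher.iv + cipher.encrypt(pad(plaintext, DES3.block_size))

def decrypt_file_bytes(blob):
    iv, ciphertext = blob[:8], blob[8:]          # DES block size is 8 bytes
    cipher = DES3.new(key, DES3.MODE_CBC, iv=iv)
    return unpad(cipher.decrypt(ciphertext), DES3.block_size)

assert decrypt_file_bytes(encrypt_file_bytes(b"replica contents")) == b"replica contents"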
The Data Encryption Standard (DES) is a block cipher (a form of shared secret encryption) which is based on a symmetric-
key algorithm that uses a 56-bit key. DES is now considered to be insecure for many applications. This is chiefly due to the 56-bit key size being too small. The algorithm is believed to be
practically secure in the form of Triple DES, although there are theoretical attacks. In recent years, the cipher has been superseded by the Advanced Encryption Standard (AES). Many former DES users now use Triple DES (TDES) which was described and
analysed by one of DES's patentees.
Triple DES was the answer to many of the shortcomings of
DES. Since it is based on the DES algorithm, it is very easy to modify existing software to use Triple DES. It also has the
advantage of proven reliability and a longer key length that eliminates many of the shortcut attacks that can be used to reduce the amount of time it takes to break DES.
IV. COMPRESSION
The compression scheme provided by our implementation uses
Bloom filters, which are arrays of bits. A Bloom filter that summarizes the state of an LRC is constructed by performing multiple hash functions on each logical name registered in the LRC and setting the corresponding bits in the Bloom filter. The resulting bit map is sent to an RLI, which stores one Bloom filter per LRC. For RLS version 2.0.9, no relational database back end is deployed for RLIs that receive Bloom filter updates. Rather, all Bloom filters are stored in memory, which provides fast soft-state update and query performance. When an RLI receives a query for a logical name, it performs the same hash functions against the logical name and checks whether the corresponding bits in each LRC Bloom filter are set. If the bits are not set, then the logical name is not registered in the corresponding LRC. However, if the bits are set, there is a small possibility that a false positive has occurred, i.e., a false indication that the LRC contains a mapping for that logical name.
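This construction can be sketched as follows; the salted-SHA-1 derivation of bit positions is one illustrative choice, not the hash functions actually used by RLS.

import hashlib

class BloomFilter:
    """Illustrative Bloom filter; RLS's actual hash functions differ."""

    def __init__(self, m_bits=10_000_000, k_hashes=3):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, logical_name):
        # Derive k bit positions from salted SHA-1 digests of the name.
        for salt in range(self.k):
            digest = hashlib.sha1(f"{salt}:{logical_name}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, logical_name):                  # performed by the LRC
        for pos in self._positions(logical_name):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def may_contain(self, logical_name):          # performed by the RLI on a query
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(logical_name))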
The probability of false positives is determined by the size of the Bloom filter bit map as well as the number of hash functions calculated on each logical name. Our implementation sets the Bloom filter size based on the number of mappings in an LRC (e.g., 10 million bits for approximately 1 million entries). We calculate three hash values for every logical name. These parameters give a false positive rate of approximately 1%. Different parameters can produce an arbitrarily small rate of false positives, at the cost of larger bit maps or more overhead for calculating hash functions.
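These figures agree with the standard Bloom filter false-positive approximation. For a filter of m bits holding n names tested with k hash functions,

p \approx \left(1 - e^{-kn/m}\right)^{k}

so for m = 10^7, n = 10^6, and k = 3, p \approx (1 - e^{-0.3})^3 \approx 0.017, on the order of the rate quoted above.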
V. THE GLOBUS RLS IMPLEMENTATION
Based on the RLS framework above, we have implemented an RLS that has been included in the Globus Toolkit starting with version 3.0. This section describes the features of this implementation. The current implementation does not include a membership service, but instead uses a simple static configuration of LRCs and RLIs. We are investigating mechanisms for automatic configuration of RLS servers, including P2P techniques.
A. The Common LRC and RLI Server
Although the RLS framework treats the LRC and RLI servers as separate components, our implementation consists of a common server that can be configured as an LRC, an RLI, or both. The RLS
server is multithreaded and is written in C. The server supports Grid Security Infrastructure (GSI) authentication. One level of authorization is provided using an optional Globus gridmap file
that contains mappings from Distinguished Names (DNs) in users’ X.509 certificates to local usernames; if a DN is not included in the gridmap file, then access is denied. The RLS server can also be
configured with an access control list that specifies the RLS operations each user is authorized to perform. Access control list entries are regular expressions that grant privileges such as
lrc_read and lrc_write access to users based on either the DN in the user’s certificate or based on the local username specified by the gridmap file. The RLS server can also be run without
authentication or authorization controls.
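For illustration, a gridmap file contains one line per user, pairing a quoted certificate DN with a local username; the entries below are invented:

"/DC=org/DC=example/OU=People/CN=Jane Scientist" jscientist
"/DC=org/DC=example/OU=People/CN=Rls Operator" rlsadmin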
The RLS server back end is a relational database accessed through an Open Database Connectivity (ODBC) layer. RLS interoperates with MySQL, PostgreSQL, SQLite, and Oracle
databases. The LRC database structure includes a table for logical names, a table for target names, and a mapping table that provides associations between logical and target names. There is also a
general attribute table that associates user-defined attributes with either logical names or target names, as well as individual tables for each attribute type (string, int, float, and date). Attributes are often
used to associate such values as size with a physical name for a file or data object. Finally, one table lists all RLIs updated by the LRC, and another table stores regular expressions for LRC namespace
partitioning.
The RLI server uses a relational database back end when it receives full, uncompressed updates from LRCs. This relational
database contains three simple tables: one for logical names, one for LRCs, and a mapping table that stores {LN, LRC} associations. Soft-state updates with Bloom filter compression, when used, are
stored in memory.
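The table layout described above can be sketched with SQLite, one of the supported back ends; the table and column names are invented, not the actual RLS schema.

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE lrc_logical (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE lrc_target  (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE lrc_map     (logical_id INTEGER REFERENCES lrc_logical(id),
                          target_id  INTEGER REFERENCES lrc_target(id));
-- One table per attribute type; only the string case is shown.
CREATE TABLE lrc_attr_str (obj_id INTEGER, attr TEXT, value TEXT);
-- RLI back end for uncompressed updates: {LN, LRC} associations.
CREATE TABLE rli_logical (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE rli_lrc     (id INTEGER PRIMARY KEY, url TEXT UNIQUE);
CREATE TABLE rli_map     (logical_id INTEGER, lrc_id INTEGER, updated REAL);
""")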
B. Soft-State Updates
LRCs send periodic summaries of their state to RLI servers. In our implementation, soft-state updates may be of four types: uncompressed updates, incremental updates, and updates using
Bloom filter compression or name space partitioning. Soft-state information eventually expires and must be either refreshed or deleted. An expire thread runs periodically and examines timestamps in the RLI mapping table, discarding entries older than
the allowed timeout interval. With periodic updates, there is a delay between when LRC mappings change and when those changes are propagated to RLIs. Thus, an RLI query may return
stale information, and a client may not find a mapping for a desired logical name when it queries an LRC. Applications must be sufficiently robust to recover from this situation and query for
another replica.
Uncompressed updates. An uncompressed soft-state update contains a list of all logical names for which mappings are stored
in an LRC. The RLI creates associations between these logical names and the LRC. To discover one or more target replicas for a logical name, a client queries an RLI, which returns pointers to
zero or more LRCs that contain mappings for that name. The client then queries LRCs to obtain target name mappings.
Incremental updates. To reduce the frequency of full soft-state updates and the
staleness of RLI information, our implementation supports a combination of infrequent full updates and more frequent incremental updates that reflect recent changes to an LRC.
Incremental updates are sent after a short, configurable interval has elapsed or after a specified number of LRC updates have occurred. Periodic full updates are required to refresh RLI information that
eventually expires.
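A sketch of the resulting update policy on the LRC side; the interval and threshold values are invented stand-ins for the configurable parameters:

FULL_INTERVAL = 6 * 3600      # assumed: infrequent full updates (seconds)
INCR_INTERVAL = 30            # assumed: short incremental interval (seconds)
INCR_THRESHOLD = 100          # assumed: pending-change count that triggers an update

def choose_update(now, last_full, pending_changes, last_incr):
    """Return which soft-state update an LRC should send, if any."""
    if now - last_full >= FULL_INTERVAL:
        return "full"          # refresh everything before RLI entries expire
    if pending_changes and (now - last_incr >= INCR_INTERVAL
                            or len(pending_changes) >= INCR_THRESHOLD):
        return "incremental"   # send only recent adds and deletes
    return None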
Compressed updates. Our implementation includes a Bloom filter
compression scheme. An LRC constructs a Bloom filter (a bit array) that summarizes LRC state by computing multiple hash
functions on each logical name registered in the LRC and setting the corresponding bits in the Bloom filter. The resulting bitmap is
sent to an RLI, which stores one Bloom filter in its memory per LRC. (The RLI may also store Bloom filters on disk for fast recovery after reboots.) When an RLI receives a query for a logical
name, it computes the same hash functions against the logical name and checks whether the corresponding bits in each LRC Bloom filter are set. If any of the bits is not set, then the logical
name was not registered in the corresponding LRC. If all the bits are set, then that logical name has likely been registered in the corresponding LRC. There is a small possibility, however, of a
false positive, i.e., a false indication that the LRC contains the mapping. The probability of false positives is controlled by the size of the Bloom filter and the number of hash functions used. Our
implementation sets the size of the Bloom filter based on the number of mappings in an LRC (e.g., 10 million bits for one million entries) and uses three hash functions by default. These parameters give a false positive rate of approximately 1 percent.
Partitioning. Finally, our implementation supports partitioning of updates based on the logical name space. When partitioning is enabled, logical names are matched against regular expressions,
and updates relating to different subsets of the logical namespace are sent to different RLIs. The goal of partitioning is to reduce the size of soft-state updates between LRCs and RLIs. We have
observed that, in practice, partitioning is rarely used, because users consider that Bloom filter compression offers sufficient performance and efficiency.
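When partitioning is enabled, the destination RLI for an update can be selected roughly as follows; the patterns and server URLs are invented:

import re

# Hypothetical partition table: namespace pattern -> RLI endpoint.
PARTITIONS = [
    (re.compile(r"^lfn://experimentA/"), "rls://rli-a.example.org"),
    (re.compile(r"^lfn://experimentB/"), "rls://rli-b.example.org"),
]

def rli_for(logical_name):
    for pattern, rli in PARTITIONS:
        if pattern.match(logical_name):
            return rli
    return "rls://rli-default.example.org"   # fallback for unmatched names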
C. RLS Bulk Operations
For user convenience and improved performance, the RLS implementation includes bulk operations that bundle together a large number of add, query, and delete operations on mappings and on attributes. Bulk operations are particularly convenient for scientific applications that perform many RLS query or update operations at the same time, for example, when publishing or processing large data sets. Bulk operations are essential for high performance in RLS, since many operations can be submitted to the catalog while incurring the overhead of a single client request. Thus, bulk operations avoid the overhead of issuing each request separately.
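The performance argument can be seen in a sketch contrasting the two paths; the client interface shown is hypothetical, not the actual RLS client API.

def bulk_add(client, mappings):
    # One request carries many {logical, target} pairs, paying connection
    # and authentication overhead once for the whole batch.
    return client.request("bulk_add", list(mappings))

def naive_add(client, mappings):
    # The slow path pays per-request overhead for every single mapping.
    return [client.request("add", m) for m in mappings]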
VI. THE REPLICA LOCATION SERVICE IN PRODUCTION USE
This section presents several Grid projects that use the RLS in
production deployments, often in conjunction with higher-level data management services.
A. Laser Interferometer Gravitational Wave Observatory
The Laser Interferometer Gravitational Wave Observatory (LIGO) collaboration conducts research to detect gravitational waves. LIGO uses data replication extensively to make terabytes of data
available at 10 LIGO sites. The LIGO deployment registers RLS mappings from more than 25 million logical file names to 120 million physical files. The LIGO deployment is a fully connected
RLS: each RLI collects state updates from LRCs at all 10 sites. Thus, a query to any RLI identifies all LRCs that contain mappings for a logical file name. To meet LIGO data publication, replication,
and access requirements, researchers developed the Lightweight Data Replicator (LDR) system, which integrates RLS, the GridFTP data transfer service, and a distributed metadata service. LDR provides
rich functionality, including pull-based file replication, efficient data transfer among sites, and a validation component that verifies
that files on storage systems are correctly registered in each LRC.
Replication of published data sets occurs when a scheduling daemon at each site performs metadata queries to identify
collections of files with specified metadata attributes and checks the LRC to determine whether the files in a collection exist on the
local storage system; if not, it adds the files to a priority-based transfer queue. A transfer daemon at each site periodically checks this queue, queries the RLI server to find locations where desired
files exist, and initiates GridFTP transfer operations to the local site. After the file transfers complete, the transfer daemon registers new files in the local LRC.
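The two daemons described above can be summarized in the following pseudocode-style sketch, reusing the LRC/RLI sketches from Section II; the metadata and GridFTP interfaces are invented placeholders for LDR's actual components.

def scheduling_daemon(metadata_svc, local_lrc, transfer_queue, attr_query):
    # Identify files matching the metadata attributes that are missing locally.
    for lfn in metadata_svc.files_matching(attr_query):
        if not local_lrc.targets_of(lfn):        # no copy on local storage yet
            transfer_queue.append(lfn)

def transfer_daemon(transfer_queue, rli, lrcs_by_id, local_lrc, gridftp):
    while transfer_queue:
        lfn = transfer_queue.pop(0)
        # Ask the RLI which LRCs hold mappings, then pick a source replica.
        candidates = [t for lrc_id in rli.lrcs_for(lfn)
                      for t in lrcs_by_id[lrc_id].targets_of(lfn)]
        if candidates:
            local_copy = gridftp.fetch(candidates[0])   # hypothetical transfer call
            local_lrc.add_mapping(lfn, local_copy)      # register the new replica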
B. Earth System Grid
The ESG [5] provides infrastructure for climate researchers that integrates computing and storage resources at five institutions.
This infrastructure includes RLS servers at five sites in a fully connected configuration that contain mappings for over one
million files. ESG, like many scientific applications, coordinates data access through a web-based portal. This portal provides an interface that allows users to request specific files or query ESG data holdings with specified metadata attributes. After a user
submits a query, the portal coordinates Grid middleware to deliver the requested data. This middleware includes RLS, a centralized metadata catalog, and services for data transfer and subsetting. In
addition to using RLS to locate replicas, the ESG web portal uses size attributes stored in RLS to estimate transfer times.
VII. FUTURE ENHANCEMENT
As future work, we are investigating peer-to-peer organization of RLS servers and the integration of RLS with higher-level data services. The RLS framework
will be integrated with a peer-to-peer organization to support self-organization, greater fault tolerance, and improved scalability. The higher-level Grid data management Data Replication
Service (DRS) uses the RLS framework to move desired data files to the local storage system efficiently.
VIII. CONCLUSION
We have implemented the Globus Replica Location Service with cryptographic techniques, and have presented the design, implementation, evaluation, and deployment of the Globus RLS. Our framework is
based on a distributed RLS that can be configured for the required scalability, reliability, and security. A performance study of the RLS implementation found that individual
servers perform well, scaling to millions of entries and 100 clients, and that soft-state updates of the distributed index scale well using Bloom filter compression. RLS has been successfully
integrated into a variety of customized, higher-level data management systems.