Abstract- Distributed systems and shared memory in distributed systems have been studied extensively in the field of computer science. In this paper we give an overview of distributed systems, shared memory, and the various classifications of shared memory in distributed systems. The paper covers the different concepts and terminologies in the field of shared memory and distributed systems, and finally the different classifications of shared memory in distributed systems.
Keywords- Distributed Systems, Shared memory, Cache Coherence
Introduction
Despite advances in processor design and technology, there is always user demand for higher system performance. Eventually, single-processor systems have given way to multiprocessor systems. It is usually more cost-effective to run multiple (n) inexpensive processors cooperatively than to run a single system that is equally (n times) faster. The need for additional computational power has thus led to specialization and advancement in CPU technology.
In parallel systems, there are two fundamental models: shared memory and message passing. From a programmer's perspective, shared memory computers, while easy to program, are difficult to build and do not scale beyond a few processors. Message passing computers, while easy to build and scale, are difficult to program. In some sense, the shared memory and message passing models are equivalent.
One solution for parallel systems is Distributed Shared Memory (DSM), in which memory is physically distributed but logically shared. Distributed Shared Memory appears as shared memory to the application programmer, but relies on message passing between independent CPUs to access the global virtual address space. The advantages of the Distributed Shared Memory programming model are well known. Firstly, shared memory programs are usually shorter and easier to understand than equivalent message passing programs. Secondly, shared memory gives transparent process-to-process communication.
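The contrast between the two models can be made concrete with a small sketch (ours, not from the literature surveyed; it assumes a POSIX system with threads and pipes). The same one-word exchange is written once against shared state and once as an explicit message:

/* Sketch: the same one-word exchange in the two models.
   Assumes a POSIX system; compile with: cc models.c -lpthread */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static int shared_value;                  /* shared-memory model: a plain variable */
static int ready;
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;

static void *consumer(void *arg) {
    pthread_mutex_lock(&m);
    while (!ready)                        /* wait until the producer publishes */
        pthread_cond_wait(&c, &m);
    printf("shared memory: read %d\n", shared_value);
    pthread_mutex_unlock(&m);
    return arg;
}

int main(void) {
    /* 1. Shared memory: write a variable, signal the reader. */
    pthread_t t;
    pthread_create(&t, NULL, consumer, NULL);
    pthread_mutex_lock(&m);
    shared_value = 42;
    ready = 1;
    pthread_cond_signal(&c);
    pthread_mutex_unlock(&m);
    pthread_join(t, NULL);

    /* 2. Message passing: the datum travels as an explicit message. */
    int fd[2], msg = 42, got;
    pipe(fd);
    write(fd[1], &msg, sizeof msg);       /* send */
    read(fd[0], &got, sizeof got);        /* receive */
    printf("message passing: received %d\n", got);
    return 0;
}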
Distributed Systems & Shared Memory
Distributed Concepts
A distributed system consists of multiple autonomous computers that communicate and connect through a computer network in order to achieve a common goal.
A computer program that is designed or written to run in a distributed environment is known as a distributed program. The process of writing such programs is known as distributed programming.
Distributed computing is the field that studies distributed systems. The word "distributed" in terms such as "distributed system", "distributed program", and "distributed computing" originally referred to computer networks where individual computers were distributed within some geographical area. Nowadays the term is used in a much wider sense, extending to autonomous processes that run on the same physical computer and interact with each other by message passing.
In short, there is no single black-and-white definition of a distributed system, but the following properties are commonly used:
There are several autonomous computational entities, each of which has its own local memory.
The entities communicate with each other by message passing.
Fig.1 CPU connections in a distributed system
Shared Memory Concepts
In computing, shared memory is memory that may be simultaneously accessed by multiple programs. Shared memory can be defined in terms of both hardware and software concepts.
In hardware terms, shared memory refers to a random access memory (RAM) that is accessible by different central processing units (CPUs) in a multiprocessor computer system. The issue with shared memory systems is that many CPUs need fast access to memory, and scalability is also an issue. The alternatives to shared memory are distributed memory and distributed shared memory, each of which has its own limitations.
In software terms, it is a method of exchanging data between programs that are running at the same time, also called inter-process communication (IPC), for which the programs make use of RAM.
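A minimal sketch of this software sense of the term, assuming a POSIX system (the segment name /demo_shm is arbitrary): two related processes exchange data through the same mapped region of RAM.

/* Sketch: inter-process communication through shared RAM.
   Assumes POSIX shared memory; compile with: cc shm_demo.c -lrt */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    /* Create a named segment that both processes can map. */
    int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, 4096);
    char *region = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);

    if (fork() == 0) {                    /* child: writes into shared RAM */
        strcpy(region, "hello from the child");
        _exit(0);
    }
    wait(NULL);                           /* parent: reads what the child wrote */
    printf("parent read: %s\n", region);

    munmap(region, 4096);
    shm_unlink("/demo_shm");              /* remove the named segment */
    return 0;
}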
Fig.2 Single processor system with No shared memory
Fig.3 A hypothetical shared-memory system
Computer Systems
Flynn proposed a simple model for categorizing all computers, using the stream concept to describe a machine's structure. A stream simply means a sequence of items (data or instructions).
Four main types of computer organization can be found:
SISD: (Single-Instruction stream, Single-Data stream)
SIMD: (Single-Instruction stream, Multiple-Data streams)
MISD: (Multiple-Instruction streams, Single-Data stream)
MIMD: (Multiple-Instruction streams, Multiple-Data streams)
SISD: (Single-Instruction stream, Single-Data stream): SISD corresponds to the traditional mono-processor (von Neumann computer). A single data stream is processed by one instruction stream.
SIMD: (Single-Instruction stream, Multiple-Data streams): In this organization, multiple processing units of the same type operate on multiple data streams. This group is dedicated to array processing machines. Sometimes, vector processors can also be seen as part of this group.
MISD: (Multiple-Instruction streams, Single-Data stream): In MISD computers, multiple processing units operate on one single data stream. In practice, this kind of organization has never been used.
MIMD: (Multiple-Instruction streams, Multiple-Data streams): This last machine type builds the group for the traditional multi-processors. Several processing units operate on multiple-data streams.
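The SISD/SIMD distinction can be illustrated in code. The following sketch is ours and assumes an x86 CPU with SSE intrinsics: the scalar loop is the SISD view, one instruction per data item, while the intrinsic version issues a single instruction over four data items at once.

/* Sketch: SISD vs SIMD on the same addition; assumes an x86 CPU with SSE. */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, out[4];

    /* SISD view: one instruction stream, one data item at a time. */
    for (int i = 0; i < 4; i++)
        out[i] = a[i] + b[i];

    /* SIMD view: a single ADDPS instruction processes four floats. */
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));

    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
    return 0;
}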
Fig.4 Flynn Classification
Distributed Shared Memory
As noted earlier, the need for more computing power constantly increases, and this has eventually led to advances in multiprocessor and distributed system technologies. Numerous ongoing research efforts are therefore focused on this increasingly attractive class of parallel computing: distributed shared memory systems. The gap between processor and memory speed is apparently widening, which is why memory system organization has become one of the most critical design decisions in computer architecture. Depending on the memory system organization, systems with multiple processors can be classified into two main groups: shared memory systems and distributed memory systems.
In a shared memory system (often called a tightly coupled multiprocessor), a single global physical memory is equally accessible to all processors. The ease of programming is the main advantage of such systems. The limitation is the increased contention in accessing memory, which in turn limits scalability; design complexity adds to this as well.
Fig.5 Centralized shared memory Architecture
A distributed memory system (often called a multicomputer) consists of a collection of autonomous processing nodes, each having an independent flow of control and local memory modules. Communication between processes residing on different nodes is achieved through a message passing module, via an interconnection network. Such a structure places a burden on the programmer and introduces software overhead as well. On the other hand, these systems are claimed to have better scalability and cost-effectiveness.
Fig.6 Distributed Memory Architecture
A relatively new concept, Distributed Shared Memory (DSM), tries to combine the best of these two approaches. A DSM system logically implements a shared memory model on a physically distributed memory system. The ease of programming and portability typical of shared memory systems, as well as the scalability and cost-effectiveness of distributed memory systems, can thus be achieved with less effort.
Fig.7 Distributed Shared Memory Architecture
The physically separate memories can be addressed as one logically shared address space, meaning that a memory reference can be made by any processor to any memory location. Such systems are called DSM architectures. Shared memory here means that the address space is shared: the same physical address on two processors refers to the same location in memory. Since the access time then depends on which memory module holds the referenced location, this organization is also called NUMA (Non-Uniform Memory Access).
Classifications of Distributed Shared Memory
We have seen what Distributed Shared Memory (DSM) is; now let us look at the classification of DSM, which also reflects its implementation aspects. Broadly, distributed shared memory can be classified into three types:
Hardware based distributed shared memory
Software based distributed shared memory
Hybrid distributed shared memory
Hardware Based Distributed Shared Memory
As the name suggests, in these systems shared memory is implemented at the hardware level. Hardware based DSM systems have several advantages over software based systems, chiefly that if things can be done transparently in hardware they are likely to run faster. However, custom hardware is expensive, and history has shown that custom hardware has a hard time keeping up with technology advances. Hardware implementations of the DSM concept usually extend the principles found in traditional cache coherence schemes of scalable shared-memory architectures; therefore, the unit of sharing is smaller, typically a cache line. Communication latency is considerably reduced thanks to fine-grained sharing, which also minimizes the effects of false sharing and thrashing. Searching and directory functions are much faster compared to software-level implementations as well. The hardware-level approach can be classified into four main types:
Non Uniform Memory Access (NUMA)
Cache Coherent Non Uniform Memory Architecture (CC-NUMA)
Cache Only Memory Architecture (COMA)
Reflective Memory System architecture (RMS)
Non Uniform Memory Access (NUMA)
The Non Uniform Memory Access (NUMA) architecture is the simplest approach to hardware based distributed shared memory. Rather than maintaining uniform access to memory, access time and memory behavior differ depending on the address that is accessed. NUMA machines are usually made up of several nodes with one or more CPUs and a block of memory each. Memory on the same node can then be accessed quickly, as the CPU's memory bus is connected directly to the local memory, while access to memory on a different node is routed via a scalable interconnect and is often not cached. The NUMA architecture has been implemented in the Cray T3E machines.
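The local/remote distinction is visible to software on commodity NUMA machines. A minimal sketch, assuming a Linux system with libnuma installed: memory is placed explicitly on one node, so CPUs on that node reach it over their local bus while CPUs on other nodes go through the interconnect.

/* Sketch: placing memory on a chosen node; assumes Linux with libnuma.
   Compile with: cc numa_demo.c -lnuma */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {           /* kernel without NUMA support */
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }
    printf("nodes 0..%d\n", numa_max_node());

    /* Place a buffer on node 0: CPUs on node 0 reach it over their
       local memory bus; CPUs on other nodes go through the interconnect. */
    size_t len = 1 << 20;
    char *buf = numa_alloc_onnode(len, 0);
    if (!buf)
        return 1;
    buf[0] = 1;                           /* first touch: the page now lives on node 0 */

    numa_free(buf, len);
    return 0;
}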
Fig.8 Two-node layout of the Cray T3E NUMA machine
Cache Coherent Non Uniform Memory Architecture (CC-NUMA)
While NUMA architectures have several advantages, primarily simplicity and scalability, the lack of memory coherence makes programming more difficult, and solving certain problems thus requires more memory accesses, which in turn slows execution down. To achieve memory coherence, one needs to add a cache coherence mechanism to the non-uniform architecture; the result is the Cache Coherent Non Uniform Memory Access (CC-NUMA) architecture. CC-NUMA architectures are essentially an approach to achieving the scalability of NUMA architectures while maintaining the programmability of UMA architectures.
CC-NUMA machines are built much like NUMA machines, i.e., one or more processors are placed on the same node as a memory block and the nodes are then interconnected. But while the NUMA architecture keeps no coherence for memory placed on remote nodes, and thus usually does not cache it, CC-NUMA architectures do cache remote memory locations and keep them coherent, e.g., by invalidating all cached copies on a write if no looser consistency model is defined. This invalidation requires the system to keep track of all cached copies of a cache line and, to maintain sequential consistency, to invalidate all copies before a write operation returns. Maintaining a picture of which addresses are mapped where requires a significant amount of extra memory. This overhead favors larger cache lines, because larger cache lines result in fewer entries to keep track of; however, larger cache lines also increase the possibility of false sharing, a phenomenon that occurs when two addresses are not actually shared but the sharing granularity is so large that they appear to be shared anyway. False sharing degrades performance, so identifying the optimal coherence granularity is an important issue when designing CC-NUMA architectures. The CC-NUMA architecture has been implemented in the SGI Origin 2000 machines.
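False sharing is easy to reproduce on any cache-coherent machine. The following sketch is ours and assumes a 64-byte cache line: two threads increment two logically independent counters, and throughput depends heavily on whether the counters happen to share a line.

/* Sketch: false sharing; assumes 64-byte cache lines.
   Compile with: cc false_sharing.c -lpthread */
#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000L

/* Two counters in one cache line: each write by one thread invalidates
   the other thread's cached copy of the whole line. */
struct { long a, b; } same_line;

/* Padding pushes the second counter onto its own 64-byte line. */
struct { long a; char pad[56]; long b; } own_lines;

static void *bump(void *p) {
    long *x = p;
    for (long i = 0; i < ITERS; i++)
        (*x)++;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    /* Point the threads at &own_lines.a / &own_lines.b instead and time
       both runs: the padded version is typically several times faster. */
    pthread_create(&t1, NULL, bump, &same_line.a);
    pthread_create(&t2, NULL, bump, &same_line.b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%ld %ld\n", same_line.a, same_line.b);
    return 0;
}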
Fig.9 Layout of an Origin 2000 node
Cache Only Memory Architecture (COMA)
A drastically different approach to maintaining cache coherence in distributed memory systems is to eliminate the concept of main memory entirely. Instead of working with the concept of physical main memory, the system provides only cache memory. This cache memory is associated with a virtual memory space, rather than a physical memory, and thus the concept of main memory disappears. Instead, memory blocks are always tagged, and cached data can be replicated and migrated freely among the nodes in the system. When a processor needs a memory block, the required block is attracted to the node and placed in what would otherwise be thought of as main memory, which in COMA architectures is called attraction memory. As with ordinary cache entries, cached attraction blocks can be replaced by other attraction blocks when the capacity of the cache is exhausted. However, in COMA machines the system cannot allow the last copy of a block to be flushed, since there is no actual main memory where a coherent version can be stored. The COMA architecture has been implemented in the KSR1 machines.
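The last-copy rule can be pictured as a guard in the replacement logic. The sketch below is purely illustrative (the directory layout and names are our assumptions, not the KSR1 design): an attraction-memory block may be evicted only if the directory records another copy elsewhere.

/* Sketch: the last-copy rule in a COMA-style attraction memory.
   The directory layout is illustrative only, not the KSR1 design. */
#include <stdbool.h>
#include <stdio.h>

struct dir_entry {
    unsigned block;                       /* block identifier */
    int      copies;                      /* nodes currently holding the block */
};

/* A block may be dropped from attraction memory only if it is not the
   last copy in the system; otherwise it must first migrate elsewhere. */
static bool can_evict(const struct dir_entry *e) {
    return e->copies > 1;
}

int main(void) {
    struct dir_entry e = { 7, 1 };
    if (!can_evict(&e))
        printf("block %u is the last copy: migrate it before evicting\n", e.block);
    return 0;
}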
Fig.10 COMA memory architecture
Reflective Memory System architecture (RMS)
Reflective memory systems are characterized by a hardware-implemented update mechanism at a low level of data granularity. Some parts of local memory in each cluster can be declared as shared and appropriately mapped into the common virtual space. Coherence maintenance of shared regions in these systems is based on a full-replication algorithm. Following the assumption that all data written will soon be read by other sharing processors, these systems immediately propagate every change to all sharing sites, using a broadcast or multicast mechanism. Because of this property of "reflection," this kind of memory is also called "mirror memory." The approach results in a high cost for write operations, especially when multiple writes to the same location occur; consequently, this architecture is most convenient for applications with a lower write frequency. Typical reflective memory systems are Encore's RMS and Merlin.
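The update mechanism amounts to write-through with broadcast. A minimal software sketch of the idea (real reflective memory does this in hardware; all names here are ours): every store to a reflected word is applied locally and immediately pushed to every sharing node.

/* Sketch: write-update propagation in the reflective-memory style,
   done in software for illustration (real RMS does this in hardware). */
#include <stdio.h>

#define NODES 4
#define WORDS 1024

/* Each node's replica of the reflected ("mirrored") region. */
static int replica[NODES][WORDS];

/* A store is applied locally and then pushed to every sharing node,
   mimicking the broadcast/multicast update mechanism. */
static void reflected_store(int self, int offset, int value) {
    replica[self][offset] = value;        /* local write */
    for (int n = 0; n < NODES; n++)
        if (n != self)
            replica[n][offset] = value;   /* stand-in for a network update */
}

int main(void) {
    reflected_store(0, 10, 99);           /* node 0 writes once... */
    printf("node 3 sees %d\n", replica[3][10]);  /* ...all nodes reflect it */
    return 0;
}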
Fig.11 Reflective memory DSM architecture
Software Based Distributed Shared Memory
While hardware based distributed shared memory systems provide high-performance, scalable solutions, they all carry the inherent penalties of hardware based systems: custom hardware is expensive, and upgrading to new processors with new features is complex and often requires modifications to the custom hardware. To address these problems, some or all of the DSM functionality may be migrated to software. Software solutions are cheaper to implement and easier to modify, and will often run on new processors without modification. Thus software based distributed shared memory reduces cost significantly compared to hardware based systems, but at the expense of runtime efficiency.
In the literature, software based DSM systems are often divided into only two groups: page-based and object-based DSM systems. The philosophy behind this coarse division is that page-based systems use partial hardware support and have a fixed coherence granularity, namely a page, while object-based systems are usually completely software based and work on software-defined coherence granularities, e.g., objects. In the following, the object-based systems have been divided into a set of more finely defined groups; while most systems include elements from other models, they share the overall group characteristics. The software-level approach can be classified into four main types:
Variable based DSM
Shared Virtual Memory based DSM
Distributed Objects based DSM
Structured DSM
Variable Based Distributed Shared Memory
Variable based DSM systems model CC-NUMA systems in software, keeping cache coherence at a very small granularity, even smaller than cache line size if this is convenient. The major challenge in these systems is to ensure cache coherence, since the hardware provides no support for keeping coherence. To this end, all memory operations on shared regions are instrumented, which means that an instruction block that checks whether the data is available will precede, for example, a load from an address. This instrumentation of all accesses to shared memory regions is very costly, and several techniques have been developed to reduce this cost. However, this approach has never become very popular.
Variable based DSM systems can be divided into two major groups: actual variable based DSM and region based DSM. The default variable based DSM system has a close integration with the compiler and often the language. Variables may be defined as local or shared, and only accesses to shared data are instrumented; the instrumentation is added when the compiler does code generation. This technique can utilize a large amount of semantic information derived from the source code, and may keep coherence at the variable level, as it knows the type of the variable when the code is generated. Alternatively, the instrumentation can take place after the code has been generated, as is done in region based DSM systems. This approach allows applications to be written in any language and compiled with any compiler. The DSM system then analyzes the executable and attempts to differentiate between local and shared data, e.g., information on the stack is local and everything else is shared. The post-compiler then inserts instrumentation code around the operations that access shared data. Variable based DSM is implemented in Midway, based on compile-time identification of shared variables and synchronization variables, and in Shasta, where a region based distributed shared memory approach is used.
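The per-access check can be pictured as a wrapper that the compiler (or post-compiler) emits around every shared load and store. A minimal sketch, with fetch_region standing in for a hypothetical DSM runtime (the names are ours, not Midway's or Shasta's):

/* Sketch: instrumented access to shared variables, roughly as a variable
   based DSM compiler might emit it. The runtime names are hypothetical. */
#include <stdbool.h>
#include <stdio.h>

static bool locally_valid;                /* stand-in for per-region state */

/* Stand-in for the DSM runtime: fetch the region holding addr from its
   current owner if the local copy is missing or stale. */
static void fetch_region(void *addr) {
    if (!locally_valid) {
        printf("fetching region for %p from a remote owner\n", addr);
        locally_valid = true;
    }
}

/* Every compiled load/store of a shared variable is preceded by a
   validity check; this is the per-access cost described above. */
#define SHARED_LOAD(var)       (fetch_region(&(var)), (var))
#define SHARED_STORE(var, val) (fetch_region(&(var)), (var) = (val))

static int shared_counter;                /* marked shared to the compiler */

int main(void) {
    SHARED_STORE(shared_counter, 5);
    printf("read %d\n", SHARED_LOAD(shared_counter));
    return 0;
}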
Shared Virtual Memory
While most other models for software based distributed shared memory require the programmer or the compiler to adapt to a DSM model, one model need not make such requirements: shared virtual memory (SVM). The idea is to use the system's memory management unit (MMU) to trap accesses to distributed shared memory, very much like demand paging. As the MMU already monitors the presence of pages in order to support demand paging, shared virtual memory may easily be added to the paging handler, so that it supports paging against remote memory in addition to paging to disk. Examples of SVM systems are IVY, TreadMarks and SHRIMP SVM.
The concept of SVM is very simple. Processes or threads share the same virtual address space on all nodes in the system, but the physical pages are not available at all nodes. When a process accesses a page that is not at the node where it is executing, a page-not-present trap occurs and the SVM system can step in, get the page, and return to the faulting process to try again. As the MMU control precedes the cache lookup in a CPU, SVM has no immediate problem with maintaining cache coherence. One feature of SVM systems that is not available in any of the other DSM memory models is that the text section of a process can also be handled by the DSM system.
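On Unix-like systems the trap can even be taken in user space. A minimal sketch, assuming a POSIX system (fetch_remote_page is our hypothetical stand-in for the DSM transport): pages start out inaccessible, the fault handler makes the page present, and the faulting access is transparently retried.

/* Sketch: SVM-style page trapping with mprotect and SIGSEGV; assumes a
   POSIX system. fetch_remote_page stands in for the DSM transport. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static long pagesz;

/* Stand-in for the DSM transport: obtain the page's contents from the
   node that currently owns it. */
static void fetch_remote_page(char *page) {
    memset(page, 'A', (size_t)pagesz);    /* pretend this came off the network */
}

static void on_fault(int sig, siginfo_t *si, void *ctx) {
    /* Round the faulting address down to its page, make the page
       accessible, fill it, and let the faulting instruction retry. */
    char *page = (char *)((uintptr_t)si->si_addr & ~(uintptr_t)(pagesz - 1));
    mprotect(page, (size_t)pagesz, PROT_READ | PROT_WRITE);
    fetch_remote_page(page);
    (void)sig; (void)ctx;
}

int main(void) {
    pagesz = sysconf(_SC_PAGESIZE);
    char *region = mmap(NULL, 4 * pagesz, PROT_NONE,   /* "not present" */
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa = { .sa_sigaction = on_fault, .sa_flags = SA_SIGINFO };
    sigaction(SIGSEGV, &sa, NULL);

    printf("%c\n", region[100]);          /* faults, fetches, then retries */
    return 0;
}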
Distributed Objects
Perhaps the simplest way to implement DSM is the distributed object method. This method lends itself directly to the object-oriented analysis, design, and programming philosophy that grew enormously in the 1980s. The constraint enforced by this method is that shared data must be encapsulated in an object and that these data can only be accessed through methods of this object. Object based DSM systems have been a research topic for quite some time and are still an active research area. Examples of object based DSM systems are Orca and Emerald.
The distribution aspects of object based DSM can be divided into three main areas: first the non-distributed method, second the migration method, and last the replication method. These three methods are also ordered by efficiency of execution and complexity of implementation. The simplest implementation is not to distribute at all. When an object is created it stays at the node that created it; when other nodes wish to access the object, they transfer control to the remote node and continue there. The advantage of this approach is that it can be implemented very easily and in general will not require a lot of network bandwidth. On the other hand, objects need to be distributed in a way that favors concurrency, i.e., if the same object is used intensively by all nodes, no real parallelism is achieved and the rationale behind DSM is lost.
Migrating objects has the advantage that once an object is located at a node, that node can access the object at native speed. Migration takes a little more work than the static approach, but is still fairly simple, since an object can only be accessed by the node it is placed on. The problem, of course, is that as objects grow larger, the bandwidth required to sustain object migration rises dramatically.
Replication of objects requires a lot more of the run-time system than the previous two models. By replicating objects, all nodes can hold their own copy of the object, and only access control and changes to the object need to be sent over the network. Another advantage of replication is that it allows the system to use some of the more relaxed consistency models. In fact, by using a replicated object DSM system, one can hide the additional complexity that comes with the most relaxed consistency models.
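The replication method can be sketched as acquire/release brackets around an object's methods, in the spirit of entry consistency. The sketch below is ours (in C for consistency with the other examples, although object based DSM systems such as Orca use object-oriented languages; the runtime names are hypothetical):

/* Sketch: a replicated shared object with acquire/release brackets,
   in the spirit of entry consistency. The runtime names are ours. */
#include <stdio.h>

struct counter {
    int value;                            /* the encapsulated shared state */
    int dirty;                            /* modified since the last release? */
};

/* Stand-ins for the replication runtime. */
static void acquire(struct counter *c) {
    /* would pull the latest released version of the object's state
       from whichever replica wrote it last */
    (void)c;
}
static void release(struct counter *c) {
    if (c->dirty)
        printf("propagating value=%d to the other replicas\n", c->value);
    c->dirty = 0;
}

/* Methods are the only way in; each brackets itself with acquire and
   release, so the replicas stay coherent without sharing every write. */
static void counter_inc(struct counter *c) {
    acquire(c);
    c->value++;
    c->dirty = 1;
    release(c);
}

int main(void) {
    struct counter c = {0, 0};
    counter_inc(&c);
    return 0;
}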
It is clear that distributed objects are an easy way to implement DSM, and with the use of entry consistent replicated objects they can be both easy to use and efficient. Problems occur with applications that do not lend themselves to the object oriented approach.
Structured Distributed Shared Memory
Rather than choosing between a distributed object approach, which requires data access to be performed via the associated methods, or a distributed variable approach, which imposes no control over the data, intermediate models have been proposed. These models are known as structured memory, though they often have little in common with each other. The one thing structured shared memory models usually do have in common is that, rather than requiring a new programming language or living within the constraints of an existing one, they take an existing language as a basis and extend it with a few functions. Examples of such structured distributed shared memory models are Linda, Global Arrays and JavaSpaces.
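Linda, for instance, extends a host language with a handful of tuple space operations such as out() (deposit a tuple) and in() (withdraw a matching tuple). The sketch below gives the flavor under heavy simplification (our own single-process stand-in: real Linda tuples are typed, matched by pattern, and the space spans many nodes):

/* Sketch: the flavor of Linda's tuple space operations, reduced to
   (key, value) pairs inside one process. Real Linda tuples are typed,
   matched by pattern, and the space spans many nodes. */
#include <stdio.h>
#include <string.h>

#define CAP 16
static struct { char key[32]; int value; int used; } space[CAP];

/* out(): deposit a tuple into the shared space. */
static void ts_out(const char *key, int value) {
    for (int i = 0; i < CAP; i++)
        if (!space[i].used) {
            snprintf(space[i].key, sizeof space[i].key, "%s", key);
            space[i].value = value;
            space[i].used = 1;
            return;
        }
}

/* in(): withdraw a matching tuple (a destructive read; rd() would leave
   the tuple in place). A real in() blocks until some tuple matches. */
static int ts_in(const char *key, int *value) {
    for (int i = 0; i < CAP; i++)
        if (space[i].used && strcmp(space[i].key, key) == 0) {
            *value = space[i].value;
            space[i].used = 0;
            return 1;
        }
    return 0;
}

int main(void) {
    int v;
    ts_out("task", 42);
    if (ts_in("task", &v))
        printf("withdrew task=%d\n", v);
    return 0;
}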
Hybrid Distributed Shared Memory
As the name suggests, hybrid DSM is the approach in which the implementation aspects of hardware DSM and software DSM are combined. Some of the systems mentioned under the hardware and software classifications make use of such a hybrid mechanism.
CONCLUSIONS
In this article we have covered the different concepts of distributed systems and shared memory, and we have described the different classifications of shared memory in distributed systems.
Acknowledgment
This work was partially supported by the staff of the Computer Science Department of Karunya University.