The peer-to-peer (P2P) architectural pattern sets the foundation for software systems and applications that implement a decentralized network of distributed resources to perform a certain function. The resources can be processing power, data (storage and content), or bandwidth, used for purposes such as distributed computing, content sharing, and communication.
Figure 1: P2P Network
Architectural Goals of P2P systems
Decentralization
Decentralization of resources and control is the basic theme of a P2P system. The client-server architecture provides a high level of security and control over the resources being accessed, but it can also produce bottlenecks and waste resources. In a P2P system, therefore, every peer acts both as a client and as a server for other peers. Decentralization also poses problems for the authentication and identification of clients, as well as for searching for resources, which is why most P2P systems implement a hybrid architecture that keeps some information on centralized nodes while most of the work is done by the peers.
Ad-Hoc Connectivity
Because P2P systems try to use general-purpose computers connected to the network during their idle time, these machines are not always connected and their connections are not reliable; this is why the connectivity is called ad hoc. Any peer can join the network at any time and leave whenever it wishes. The downside of ad-hoc connectivity is that a peer has to send its query blindly to the connected peers, which in turn do the same, until the document is found or the maximum number of broadcast levels is exceeded.
Cost of ownership
One of the principal goals of a P2P system is to reduce the cost of ownership of a resource, such as a system and its contents, and/or the cost of maintaining such resources. This characteristic is most obvious when P2P systems are used for grid computing, where thousands of PCs are used by an individual to perform his or her computational tasks.
Scalability
The more a system relies on decentralization, the more scalable it becomes. Scalability is hindered by the amount of centralized data operations needed for synchronization and coordination, by the communication overhead between the clients and the central server, and by the degree of parallelism a centralized application can provide. Decentralization eliminates these issues by letting peers communicate with each other directly, without any centralized entity having to maintain state for their connections. P2P systems can grow very large this way; Napster, at its peak, reached about 6 million peers.
Anonymity
One goal of P2P systems is to hide the identity of the peers using the system, in order to avoid legal issues. In a P2P system there are three types of anonymity:
Hiding the identity of the sender
Hiding the identity of the receiver
Hiding the identities of both sender and receiver from each other and the rest of the peers.
Various techniques have been used by different P2P systems to implement anonymity, such as covert paths, IP spoofing, and ID spoofing. Peers are by default anonymous to some extent as long as they are not wholly controlled by a single authority (system or user).
Performance
One of the most significant aims of P2P systems is performance. P2P systems try to enhance performance by aggregating widely distributed resources, such as storage (Napster, Gnutella) or processing cycles (SETI@home). In such systems performance is affected by the amount of communication taking place among the peers and by the bandwidth required to consume resources located at distant points on the Internet. Bandwidth becomes the dominant factor when there is heavy communication among peers and large files are transferred across the network, which in turn can also limit the scalability of the system.
Robustness
One of the main goals of P2P systems is to avoid a central point of failure. While most pure P2P systems already have this characteristic, issues remain such as node failures, unreachability, partitions, and disconnections. The major issue a P2P system has to handle is continuing an operation using the peers that are still connected, even when some peers disconnect from the network. For example, when a Gnutella servent is downloading a file from other peers and all of those peers disconnect from the network, it still offers the option to resume the download once any of them rejoins.
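Supporting such a resume only requires the downloader to remember how far it got in each file. Below is a minimal sketch; the names are illustrative assumptions rather than the actual Gnutella mechanism (real clients typically resume transfers via HTTP range requests).

```python
# Minimal sketch of resume-after-disconnect: the downloader records the
# byte offset reached for each file, and when a source peer reappears it
# requests the remainder from that offset. Illustrative only.

class Download:
    def __init__(self, filename, size):
        self.filename = filename
        self.size = size
        self.offset = 0  # bytes received so far; survives a disconnection

    def receive(self, chunk_len):
        # Called for every chunk that arrives before the source drops out.
        self.offset = min(self.size, self.offset + chunk_len)

    def resume_request(self):
        # Sent once any of the original source peers rejoins the network.
        return f"GET {self.filename} starting at byte {self.offset}"


d = Download("movie.avi", size=700_000)
d.receive(250_000)           # the source peer disconnects after 250 KB
print(d.resume_request())    # later: ask a returning peer for the rest
```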
Self-organisation
P2P systems can grow unpredictably, and with high scalability come maintenance and reconfiguration demands that cannot be met manually in such widely distributed networks. P2P systems must therefore be able to reorganize themselves in the face of the increased failure rates that come with scale, as well as the ad-hoc nature of their connections.
In P2P systems, self-organization is mostly performed through the location and routing infrastructure, because of intermittent connectivity (OceanStore), or through protocols for the arrival and departure of nodes (Pastry).
Security
P2P systems have the same security needs as any other distributed system. A great deal of research has been done in this area, and different P2P systems have used techniques such as multi-key encryption, sandboxing, and reputation systems to ensure the security of the peers connected to each other.
Peer-to-peer algorithms
P2P systems basically differ from each other in the architectural goals defined above and in the way peers respond to queries from other peers or super-nodes. The following are the three main algorithms used by most of today's P2P systems to issue and answer queries.
Centralized Directory Model
This model represents a hybrid P2P architecture in which the peers of the community connect to a central directory to publish the contents they want to share with other peers. A peer that needs those contents first queries the central directory to find the peers that have published them. Once the search is complete, the peer communicates directly with the other peer to download the specific contents. Napster used the centralized directory model to share music with its community.
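To make the interaction concrete, below is a minimal sketch of the centralized directory model in Python. The class and method names (DirectoryServer, publish, search) are illustrative assumptions, not Napster's actual protocol: peers publish the names of the files they share, search the directory to locate a source, and then transfer the file directly from peer to peer.

```python
# Minimal sketch of the Centralized Directory Model (illustrative only).

class DirectoryServer:
    """Central index mapping file names to the peers that share them."""

    def __init__(self):
        self.index = {}  # file name -> set of peer addresses

    def publish(self, peer, filenames):
        # A peer registers the contents it is willing to share.
        for name in filenames:
            self.index.setdefault(name, set()).add(peer)

    def search(self, filename):
        # The server only answers "who has it"; it never carries the file.
        return self.index.get(filename, set())

    def unpublish(self, peer):
        # Called when a peer leaves the community.
        for peers in self.index.values():
            peers.discard(peer)


def download(directory, requester, filename):
    """Locate a source via the directory, then transfer peer-to-peer."""
    sources = directory.search(filename)
    if not sources:
        return None
    source = next(iter(sources))
    # In a real system this would be a direct connection between requester
    # and source; the central directory is no longer involved.
    return f"{requester} downloads '{filename}' directly from {source}"


if __name__ == "__main__":
    directory = DirectoryServer()
    directory.publish("peer-A", ["song.mp3", "talk.ogg"])
    directory.publish("peer-B", ["song.mp3"])
    print(download(directory, "peer-C", "song.mp3"))
```

The key property the sketch shows is that the central node answers only location queries; the content itself never passes through it.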
Flooded Request Model
This is a pure P2P model in which no publishing of resources is needed. Instead, a peer's request for a resource is flooded (broadcast) to its immediate peers, which in turn broadcast the same request to their peers, until the request is answered or a maximum number of flooding hops (typically 7 to 9) is reached.
The Gnutella network follows this request model, which is said to consume a great deal of bandwidth, but alternative techniques have been developed to mitigate the scalability problems raised by the bandwidth consumption, such as super-peers and caching of the most recent searches.
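As a rough illustration of the flooded request model, here is a small Python sketch over an in-memory peer graph. The names and the recursive calls are assumptions made for brevity; a real network floods asynchronous messages over TCP connections. The seen set mimics the duplicate suppression real clients perform, so that cycles in the graph do not cause endless re-broadcasts.

```python
# Sketch of the Flooded Request Model: a query is broadcast to neighbours
# and re-broadcast until it is answered or its TTL is exhausted.

class Peer:
    def __init__(self, name, files):
        self.name = name
        self.files = set(files)
        self.neighbours = []
        self.seen = set()  # query IDs already handled, to stop loops

    def query(self, query_id, filename, ttl):
        if query_id in self.seen:
            return []          # already flooded this query; drop it
        self.seen.add(query_id)
        hits = [self.name] if filename in self.files else []
        if ttl > 0:            # forward with a decremented TTL
            for peer in self.neighbours:
                hits += peer.query(query_id, filename, ttl - 1)
        return hits


if __name__ == "__main__":
    a, b, c, d = (Peer(n, f) for n, f in
                  [("A", []), ("B", []), ("C", ["doc.pdf"]), ("D", ["doc.pdf"])])
    a.neighbours = [b]
    b.neighbours = [a, c]
    c.neighbours = [b, d]
    print(a.query("q1", "doc.pdf", ttl=7))  # ['C', 'D']
```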
Document Routing Model
The document routing model is also a pure P2P model in which no centralized publishing system is needed. Instead, each peer is assigned a random ID and knows a predefined number of other peers. When a document is published on such a system, it is given a unique ID based on a hash of its name and contents. This ID is then routed by the peers towards the peer with the most similar ID, and a copy of the document is stored at each routing peer along the way. When a peer requests the document, the request travels towards the peer whose ID is most similar to the document ID, and this process is repeated until a copy of the document is found.
The document ID must be known to a peer in order to search for the document in the network. This is the technique used by most torrent clients these days.
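A heavily simplified sketch of document routing follows. It assumes peers with numeric IDs and models the whole network as one sorted list, so that "routing" collapses to a single closest-ID lookup; real systems such as Chord or Pastry reach the closest peer through per-node routing tables in O(log n) hops.

```python
# Simplified sketch of the Document Routing Model. Peers hold random IDs;
# a document is stored at (and requested from) the peer whose ID is
# closest to the hash of the document. Purely for illustration.

import hashlib

ID_SPACE = 2 ** 16  # small ID space so the printed numbers stay readable

def doc_id(name, contents):
    """Derive a document ID from a hash of its name and contents."""
    digest = hashlib.sha1((name + contents).encode()).hexdigest()
    return int(digest, 16) % ID_SPACE

def closest_peer(peers, target_id):
    """The peer whose ID is numerically most similar to the target ID."""
    return min(peers, key=lambda p: abs(p - target_id))

if __name__ == "__main__":
    peers = [1042, 9317, 22011, 40500, 61234]   # random peer IDs
    store = {}                                  # peer ID -> {doc ID: doc}

    # Publish: route the document towards the most similar peer ID.
    did = doc_id("report.pdf", "...contents...")
    home = closest_peer(peers, did)
    store.setdefault(home, {})[did] = "...contents..."

    # Lookup: a requester who knows the document ID reaches the same peer.
    print(store[closest_peer(peers, did)][did])
```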
Reference Architecture for general P2P systems
Given below is an informal reference architecture for P2P systems, drawn from the analysis of the case studies. Applications do not follow the ordering of the components or layers in any strict fashion, and any layer or module can be omitted by an implementation of the P2P architecture depending on its architectural goals.
Application layer, for data presentation
Class layer, for decentralization, parallelism, state maintenance, etc.
Fault-tolerance layer
Group-management layer
Communication layer
Existing Internet or intranet infrastructure
P2P Case Studies
Following are three case studies of P2P systems. Each system is briefly introduced below, and the three systems are then compared, in tabular form, in terms of the architectural goals and qualities.
Avaki
Avaki started as the research project "Legion" in 1993 at the University of Virginia and was eventually commercialized under the name "Avaki" around 2001. Avaki presents a single virtual-computer view of a heterogeneous network of computing resources, where the resources can be computational cycles, storage, contents, applications, or devices.
Avaki has an object-based design and can be thought of as one of the first object-oriented parallel-processing systems. Its middleware is structured as a layered virtual machine which maps the system onto a network of computing resources. Handling heterogeneity and automatic error detection and recovery are the hallmarks of the system's success.
Avaki objects interact with each other in a P2P fashion and individually decide which objects to connect to; no intermediary objects or nodes are involved. During execution, the set of objects contacting each other may change rapidly.
Given below is the architectural diagram of the Avaki system.
Figure 2: Avaki Architecture [taken from "Peer-to-Peer Computing", HP Labs]
SETI@home
SETI (Search for Extra-Terrestrial Intelligence) is a collection of research projects, started in the early nineties, with the aim of discovering whether any alien civilizations are present in the universe.
SETI@home is the part of the SETI project that analyses radio emissions received from space and collected by a giant telescope, using the processing power of millions of idle PCs connected to the Internet. SETI@home produced dozens of teraflops of raw processing power by connecting about 3 million Internet-connected PCs.
SETI@home's initial design was kept centralized for the sake of simplicity: a distribution server would send data to client applications that ran on PCs in the form of a screensaver. This centralized architecture created scalability problems, so proxy servers had to be introduced to scale the system up and optimize the use of bandwidth. A proxy server would connect to the data server in the middle of the night to fetch tasks, and during the day it would distribute those tasks to the clients configured to connect to it. Bandwidth was a limiting factor because the system distributes files of 350 KB or more to its clients. Computation then takes place in a P2P manner, and the results from each client are aggregated and sent to the database. [Anderson 2002]
Given below is an architectural overview of the whole system.
Figure 3: SETI@home Architecture [from Sullivan et al. 1997]
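The client side of this master-worker arrangement can be pictured with the sketch below. It is illustrative only; the function names and the proxy address are assumptions, not the real SETI@home client. A client repeatedly fetches a self-contained work unit, analyses it locally during the PC's idle time, and uploads only the compact result, which keeps the bandwidth requirement low.

```python
# Illustrative sketch of a SETI@home-style work-unit loop (not the real
# client). A server or proxy hands out self-contained work units, each
# client computes independently, and only the small result travels back.

import random
import time

def fetch_work_unit(server):
    # Stand-in for downloading a ~350 KB chunk of radio telescope data.
    return {"id": random.randrange(10**6),
            "samples": [random.random() for _ in range(1000)]}

def analyze(work_unit):
    # Stand-in for the signal analysis done during the PC's idle time.
    return {"id": work_unit["id"], "peak": max(work_unit["samples"])}

def report_result(server, result):
    # Only the compact result is uploaded, never the raw data.
    print(f"reporting {result} to {server}")

if __name__ == "__main__":
    server = "proxy.example.org"   # hypothetical proxy address
    for _ in range(3):             # the real client loops indefinitely
        unit = fetch_work_unit(server)
        report_result(server, analyze(unit))
        time.sleep(0.1)            # placeholder for hours of computation
```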
Gnutella
Gnutella is a file-sharing technology, basically a protocol that enables applications implementing it to let users search for and download files from other users connected to the Internet. Every peer on the Gnutella network must run a Gnutella client application in order to perform P2P search and download.
Gnutella was introduced by two AOL employees as an open-source program and is a pure P2P system for sharing data resources and contents over the Internet. It thus provides users with a level of anonymity, both as publishers and as receivers of data.
In order to connect to a Gnutella node, a user must first know the IP address of that node, which can be discovered by visiting a well-known website where the addresses of a number of Gnutella nodes are posted. A peer's search for a file is then forwarded to its immediately connected nodes (flooding), which again flood their neighbours until the query is answered by some peer or its TTL (time to live) has been exhausted.
Given below is the structure of a Gnutella network, showing how a search query travels through the network until it is answered by one of the peers.
Figure 4: Gnutella, Searching [taken from howstuffworks.com]
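The forwarding rule that bounds this flooding is simple enough to show directly. In the Gnutella protocol each descriptor header carries a TTL and a hop count; a servent decrements the TTL and increments the hop count before forwarding, and drops the descriptor once the TTL reaches zero. The sketch below captures only that rule; the wire format and the rest of the protocol are omitted.

```python
# Sketch of Gnutella's forwarding rule for a query descriptor. Each
# descriptor carries a TTL and a hop count; everything else is omitted.

def should_forward(descriptor):
    """Decrement TTL / increment hops; forward only while TTL remains."""
    descriptor["ttl"] -= 1
    descriptor["hops"] += 1
    return descriptor["ttl"] > 0

query = {"ttl": 7, "hops": 0, "search": "some file"}
while should_forward(query):
    print(f"forwarded: ttl={query['ttl']}, hops={query['hops']}")
# After 7 hops the query dies out instead of flooding forever.
```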
Comparison of the case studies in terms of the architectural goals
| Architectural goal | Avaki | SETI@home | Gnutella |
| --- | --- | --- | --- |
| Decentralization | Distributed; no central manager required | Master-slave architecture | Pure P2P, fully decentralized architecture |
| Scalability | Scalable up to thousands of peers; 2.5-3K peers tested | Highly scalable (millions of peers); 3 million on average | Scalable up to thousands of peers |
| Anonymity | Not permitted | Medium | Low |
| Self-organization | Restructures around failures | Low | High, because of pure P2P connectivity |
| Cost of ownership | Low | Very low | Low |
| Ad-hoc-ism | Computing resources may join/leave anytime during the process | Computing resources may join/leave anytime during the process | Peers may join/leave anytime |
| Performance | Initially slow, but speeds up with an increased number of peers | High performance; huge speedups in processing time | Low |
| Security | Encryption; authentication of users; administrative domains defined; virtual organizations in the grid to limit access to resources | Proprietary client software needed to join the network; it only makes outgoing connections and does not listen for incoming ones | Not addressed |
| Fault tolerance | Checkpoint/restart; reliable messaging | Timed checkpoints to resume in case of failures | Not addressed in the protocol; applications built on Gnutella may implement retry/resume logic, as LimeWire does |