It is one of the technical foundations of Voice over IP and in this context is often used in conjunction with a signaling protocol which assists in setting up connections across the network. RTP was developed by the Audio-Video Transport Working Group of the Internet Engineering Task Force (IETF) and first published in 1996 as RFC 1889, superseded by RFC 3550 in 2003.
The data transfer protocol, RTP, which deals with the transfer of real-time data. Information provided by this protocol include timestamps (for synchronization), sequence numbers (for packet loss and reordering detection) and the payload format which indicates the encoded format of the data
A protocol, RTCP, used to specify quality of service (QoS) feedback and synchronization between the media streams. The bandwidth of RTCP traffic compared to RTP is small, typically around 5%.[8][9]
An optional signaling protocol such as H.323, MGCP, Megaco, SCCP, or Session Initiation Protocol (SIP) signaling protocols
An optional media description protocol such as session description protocol
An RTP Session is established for each multimedia stream. A session consists of an IP address with a pair of ports for RTP and RTCP. For example, audio and video streams will have separate RTP sessions, enabling a receiver to deselect a particular stream.[10] The ports which form a session are negotiated using other protocols such as RTSP (using SDP in the setup method)[11] and SIP. According to the specification, an RTP port should be even and the RTCP port is the next higher odd port number. RTP and RTCP typically use unprivileged UDP ports (1024 to 65535),[12] but may use other transport protocols (most notably, SCTP and DCCP) as well, as the protocol design is transport independent.
RTP header, version 2:
00
01
02
03
04
05
06
07
08
09
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
Ver
P
X
CC
M
PT
Sequence Number
Timestamp
SSRC
CSRC [0..15] :::
Ver, Version. 2 bits.
RTP version number. Always set to 2.
P, Padding. 1 bit.
If set, this packet contains one or more additional padding bytes at the end which are not part of the payload. The last byte of the padding contains a count of how many padding bytes should be ignored. Padding may be needed by some encryption algorithms with fixed block sizes or for carrying several RTP packets in a lower-layer protocol data unit.
X, Extension. 1 bit.
If set, the fixed header is followed by exactly one header extension.
CC, CSRC count. 4 bits.
The number of CSRC identifiers that follow the fixed header.
M, Marker. 1 bit.
The interpretation of the marker is defined by a profile. It is intended to allow significant events such as frame boundaries to be marked in the packet stream. A profile may define additional marker bits or specify that there is no marker bit by changing the number of bits in the payload type field.
PT, Payload Type. 7 bits.
Identifies the format of the RTP payload and determines its interpretation by the application. A profile specifies a default static mapping of payload type codes to payload formats. Additional payload type codes may be defined dynamically through non-RTP means. An RTP sender emits a single RTP payload type at any given time; this field is not intended for multiplexing separate media streams.
Sequence Number. 16 bits.
The sequence number increments by one for each RTP data packet sent, and may be used by the receiver to detect packet loss and to restore packet sequence. The initial value of the sequence number is random (unpredictable) to make known-plaintext attacks on encryption more difficult, even if the source itself does not encrypt, because the packets may flow through a translator that does.
Timestamp. 32 bits.
The timestamp reflects the sampling instant of the first octet in the RTP data packet. The sampling instant must be derived from a clock that increments monotonically and linearly in time to allow synchronization and jitter calculations. The resolution of the clock must be sufficient for the desired synchronization accuracy and for measuring packet arrival jitter (one tick per video frame is typically not sufficient). The clock frequency is dependent on the format of data carried as payload and is specified statically in the profile or payload format specification that defines the format, or may be specified dynamically for payload formats defined through non-RTP means. If RTP packets are generated periodically, the nominal sampling instant as determined from the sampling clock is to be used, not a reading of the system clock. As an example, for fixed-rate audio the timestamp clock would likely increment by one for each sampling period. If an audio application reads blocks covering 160 sampling periods from the input device, the timestamp would be increased by 160 for each such block, regardless of whether the block is transmitted in a packet or dropped as silent.
SSRC, Synchronization source. 32 bits.
Identifies the synchronization source. The value is chosen randomly, with the intent that no two synchronization sources within the same RTP session will have the same SSRC. Although the probability of multiple sources choosing the same identifier is low, all RTP implementations must be prepared to detect and resolve collisions. If a source changes its source transport address, it must also choose a newSSRC to avoid being interpreted as a looped source.
CSRC, Contributing source. 32 bits.
An array of 0 to 15 CSRC elements identifying the contributing sources for the payload contained in this packet. The number of identifiers is given by the CC field. If there are more than 15 contributing sources, only 15 may be identified. CSRC identifiers are inserted by mixers, using the SSRC identifiers of contributing sources. For example, for audio packets the SSRC identifiers of all sources that were mixed together to create a packet are listed, allowing correct talker indication at the receiver.
RTP TERMINOLOGY
RELAY: A realy is an intermediate system that acts as a sender as well as a receiver.It is of two types:
MIXER RELAY
An intermediate system that receives RTP packets from one or more sources, possibly changes the data format, combines the packets in some manner and then forwards a new RTP packet. Since the timing among multiple input sources will not generally be synchronized, the mixer will make timing adjustments among the streams and generate its own timing for the combined stream. Thus, all data packets originating from a mixer will be identified as having the mixer as their synchronization source.
TRANSLATOR RELAY
An intermediate system that forwards RTP packets with their synchronization source identifier intact. Examples of translators include devices that convert encodings without mixing, replicators from multicast to unicast, and application-level filters in firewalls.
RTCP
The Real Time Transport Control Protocol (RTP control protocol or RTCP) is based on the periodic transmission of control packets to all participants in the session, using the same distribution mechanism as the data packets. RTCP provides out-of-band statistics and control information for an RTP flow. It partners RTP in the delivery and packaging of multimedia data, but does not transport any media streams itself. Typically RTP will be sent on an even-numbered UDP port, with RTCP messages being sent over the next highest odd-numbered port[2]. The primary function of RTCP is to provide feedback on the quality of service (QoS) in media distribution by periodically sending statistics information to participants in a streaming multimedia session.
The underlying protocol must provide multiplexing of the data and control packets, for example using separate port numbers with UDP. RTCP performs four functions:
RTCP provides feedback on the quality of the data distribution. This is an integral part of the RTP's role as a transport protocol and is related to the flow and congestion control functions of other transport protocols.
RTCP carries a persistent transport-level identifier for an RTP source called the canonical name or CNAME. Since the SSRC identifier may change if a conflict is discovered or a program is restarted, receivers require the CNAME to keep track of each participant. Receivers may also require the CNAME to associate multiple data streams from a given participant in a set of related RTP sessions, for example to synchronize audio and video.
The first two functions require that all participants send RTCP packets, therefore the rate must be controlled in order for RTP to scale up to a large number of participants. By having each participant send its control packets to all the others, each can independently observe the number of participants. This number is used to calculate the rate at which the packets are sent.
An OPTIONAL function is to convey minimal session control information, for example participant identification to be displayed in the user interface. This is most likely to be useful in "loosely controlled" sessions where participants enter and leave without membership control or parameter negotiation.
Protocol Structure - RTCP (RTP Control Protocol)
2
3
8
16 bit
Version
P
RC
Packet type
Length
Version - Identifies the RTP version which is the same in RTCP packets as in RTP data packets. The version defined by this specification is two (2).
P - Padding. When set, this RTCP packet contains some additional padding octets at the end which are not part of the control information. The last octet of the padding is a count of how many padding octets should be ignored. Padding may be needed by some encryption algorithms with fixed block sizes. In a compound RTCP packet, padding should only be required on the last individual packet because the compound packet is encrypted as a whole.
RC- Reception report count, the number of reception report blocks contained in this packet. A value of zero is valid. Packet type Contains the constant 200 to identify this as an RTCP SR packet.
Length - The length of this RTCP packet in 32-bit words minus one, including the header and any padding. (The offset of one makes zero a valid length and avoids a possible infinite loop in scanning a compound RTCP packet, while counting 32-bit words avoids a validity check for a multiple of 4.)
MESSAGE TYPES
RTCP distinguishes several types of packets: sender report, receiver report, source description, and bye. In addition, the protocol is extensible and allows application-specific RTCP packets. A standards-based extension of RTCP is the Extended Report packet type introduced by RFC 3611.[3]
Sender report (SR)The sender report is sent periodically by the active senders in a conference to report transmission and reception statistics for all RTP packets sent during the interval. The sender report includes an absolute timestamp, which is the number of seconds elapsed since midnight on January 1, 1900. The absolute timestamp allows the receiver to synchronize RTP messages. It is particularly important when both audio and video are transmitted simultaneously, because audio and video streams use independent relative timestamps.
Receiver report (RR) The receiver report is for passive participants, those that do not send RTP packets. The report informs the sender and other receivers about the quality of service.
Source description (SDES)The Source Description message is used to send the CNAME item to session participants. It may also be used to provide additional information such as the name, e-mail address, telephone number, and address of the owner or controller of the source.
End of participation (BYE)A source sends a BYE message to shut down a stream. It allows an end-point to announce that it is leaving the conference. Although other sources can detect the absence of a source, this message is a direct announcement. It is also useful to a media mixer.
Application-specific message (APP)The application-specific message provides a mechanism to design application-specific extensions to the RTCP protocol.
In large-scale applications, such as in Internet Protocol Television (IPTV), very long delays (minutes to hours) between RTCP reports may occur, because of the RTCP bandwidth control mechanism required to control congestion.Acceptable frequencies are usually less than one minute. This affords the potential of inappropriate reporting of the relevant statistics by the receiver or cause evaluation by the media sender to be inaccurate relative to the current state of the session. Methods have been introduced to alleviate the problems:[4] RTCP filtering, RTCP biasing and hierarchical aggregation
SIP
The Session Initiation Protocol (SIP) is an IETF-defined signaling protocol, widely used for controlling multimedia communication sessionssuch as voice and video calls over Internet Protocol (IP). The protocol can be used for creating, modifying and terminating two-party (unicast) or multiparty (multicast) sessions consisting of one or several media streams. The modification can involve changing addresses or ports, inviting more participants, and adding or deleting media streams. Other feasible application examples include video conferencing, streaming multimedia distribution, instant messaging, presence information, file transfer and online games.
SIP was originally designed by Henning Schulzrinne and Mark Handley starting in 1996. The latest version of the specification is RFC 3261from the IETF Network Working Group. In November 2000, SIP was accepted as a 3GPP signaling protocol and permanent element of the IP Multimedia Subsystem (IMS) architecture for IP-based streaming multimedia services in cellular systems.
The SIP protocol is an Application Layer protocol designed to be independent of the underlying transport layer; it can run on Transmission Control Protocol (TCP), User Datagram Protocol (UDP), or Stream Control Transmission Protocol (SCTP).[2] It is a text-based protocol, incorporating many elements of the Hypertext Transfer Protocol (HTTP) and the Simple Mail Transfer Protocol (SMTP).
SIP messages
SIP is a text-based protocol with syntax similar to that of HTTP. There are two different types of SIP messages: requests and responses. The first line of a request has a method, defining the nature of the request, and a Request-URI, indicating where the request should be sent.[11] The first line of a response has a response code.
For SIP requests, RFC 3261 defines the following method[
REGISTER: Used by a UA to indicate its current IP address and the URLs for which it would like to receive calls.
INVITE: Used to establish a media session between user agents.
ACK: Confirms reliable message exchanges.
CANCEL: Terminates a pending request.
BYE: Terminates a session between two users in a conference.
OPTIONS: Requests information about the capabilities of a caller, without setting up a call.
The SIP response types defined in RFC 3261 fall in one of the following categories
Provisional (1xx): Request received and being processed.
Success (2xx): The action was successfully received, understood, and accepted.
Redirection (3xx): Further action needs to be taken (typically by sender) to complete the request.
Client Error (4xx): The request contains bad syntax or cannot be fulfilled at the server.
Server Error (5xx): The server failed to fulfill an apparently valid request.
Global Failure (6xx): The request cannot be fulfilled at any server.
SIP Network Elements
A SIP user agent (UA) is a logical network end-point used to create or receive SIP messages and thereby manage a SIP session. A SIP UA can perform the role of a User Agent Client(UAC), which sends SIP requests, and the User Agent Server (UAS), which receives the requests and returns a SIP response. These roles of UAC and UAS only last for the duration of a SIP transaction.[5]
A SIP phone is a SIP user agent that provides the traditional call functions of a telephone, such as dial, answer, reject, hold/unhold, and call transfer.[6][7] SIP phones may be implemented by dedicated hardware controlled by the phone application directly or through an embedded operating system (hardware SIP phone) or as a softphone, a software application that is installed on a personal computer or a mobile device, e.g., a personal digital assistant (PDA) or cell phone with IP connectivity. As vendors increasingly implement SIP as a standard telephony platform, often driven by 4G efforts, the distinction between hardware-based and software-based SIP phones is being blurred and SIP elements are implemented in the basic firmware functions of many IP-capable devices. Examples are devices from Nokia and Research in Motion
Each resource of a SIP network, such as a User Agent or a voicemail box, is identified by a Uniform Resource Identifier (URI), based on the general standard syntax[9] also used in Web services and e-mail. A typical SIP URI is of the form: sip:username:password@host:port. The URI scheme used for SIP is sip:. If secure transmission is required, the scheme sips: is used and SIP messages must be transported over Transport Layer Security (TLS).
In SIP, as in HTTP, the user agent may identify itself using a message header field 'User-Agent', containing a text description of the software/hardware/product involved. The User-Agent field is sent in request messages, which means that the receiving SIP server can see this information. SIP network elements sometimes store this information,[10] and it can be useful in diagnosing SIP compatibility problems.
SIP also defines server network elements. Although two SIP endpoints can communicate without any intervening SIP infrastructure, which is why the protocol is described as peer-to-peer, this approach is often impractical for a public service.
RFC 3261 defines these server elements:
A proxy server "is an intermediary entity that acts as both a server and a client for the purpose of making requests on behalf of other clients. A proxy server primarily plays the role of routing, which means its job is to ensure that a request is sent to another entity "closer" to the targeted user. Proxies are also useful for enforcing policy (for example, making sure a user is allowed to make a call). A proxy interprets, and, if necessary, rewrites specific parts of a request message before forwarding it."
"A registrar is a server that accepts REGISTER requests and places the information it receives in those requests into the location service for the domain it handles."
"A redirect server is a user agent server that generates 3xx responses to requests it receives, directing the client to contact an alternate set of URIs.The redirect server allows SIP Proxy Servers to direct SIP session invitations to external domains."
The RFC specifies: "It is an important concept that the distinction between types of SIP servers is logical, not physical."
Other SIP related network elements are:
Session border controllers (SBC), they serve as middle boxes between UA and SIP server for various types of functions, including network topology hiding, and assistance in NAT traversal.
Various types of gateways at the edge between a SIP network and other networks (as a phone network)
Applications
Many VoIP phone companies allow customers to use their own SIP devices, as SIP-capable telephone sets, or softphones. The market for consumer SIP devices continues to expand, there are many devices such as SIP Terminal Adapters, SIP Gateways, and SIP Trunking services providing replacements for ISDN telephone lines.
The free software community started to provide more and more of the SIP technology required to build both end points as well as proxy and registrar servers leading to a commodification of the technology, which accelerates global adoption. As an example, the open source community at SIPfoundry actively develops a variety of SIP stacks, client applications and SDKs, in addition to entire private branch exchange (IP PBX) solutions that compete in the market against mostly proprietary IP PBX implementations from established vendors.
The National Institute of Standards and Technology (NIST), Advanced Networking Technologies Division provides a public domain implementation of the JAVA Standard for SIP JAIN-SIP which serves as a reference implementation for the standard. The stack can work in proxy server or user agent scenarios and has been used in numerous commercial and research projects. It supports RFC 3261 in full and a number of extension RFCs including RFC 3265 (Subscribe / Notify) and RFC 3262 (Provisional Reliable Responses) etc.
SIP-enabled video surveillance cameras can make calls to alert the owner or operator that an event has occurred, for example to notify that motion has been detected out-of-hours in a protected area.
H.323
H.323 is a recommendation from the ITU Telecommunication Standardization Sector (ITU-T) that defines the protocols to provide audio-visual communication sessions on any packet network. The H.323 standard addresses call signaling and control, multimedia transport and control, and bandwidth control for point-to-point and multi-point conferences.[1]
It is widely implemented by voice and videoconferencing equipment manufacturers, is used within various Internet real-time applications such as GnuGK and NetMeeting and is widely deployed worldwide by service providers and enterprises for both voice and video services over IP networks.
It is a part of the ITU-T H.32x series of protocols, which also address multimedia communications over ISDN, the PSTN or SS7, and 3G mobile networks.
H.323 call signaling is based on the ITU-T Recommendation Q.931 protocol and is suited for transmitting calls across networks using a mixture of IP, PSTN, ISDN, and QSIG over ISDN. A call model, similar to the ISDN call model, eases the introduction of IP telephony into existing networks of ISDN-based PBX systems, including transitions to IP-based PBXs.
Within the context of H.323, an IP-based PBX might be a gatekeeper or other call control element which provides service to telephones or videophones. Such a device may provide or facilitate both basic services and supplementary services, such as call transfer, park, pick-up, and hold.
While H.323 excels at providing basic telephony functionality and interoperability, H.323's strength lies in multimedia communication functionality designed specifically for IP networks.
PROTOCOLS
H.323 is a system specification that describes the use of several ITU-T and IETF protocols. The protocols that comprise the core of almost any H.323 system are:
H.225.0 Registration, Admission and Status (RAS), which is used between an H.323 endpoint and a Gatekeeper to provide address resolution and admission control services.
H.225.0 Call Signaling, which is used between any two H.323 entities in order to establish communication.
H.245 control protocol for multimedia communication, which describes the messages and procedures used for capability exchange, opening and closing logical channels for audio, video and data, control and indications.
Real-time Transport Protocol (RTP), which is used for sending or receiving multimedia information (voice, video, or text) between any two entities.
Many H.323 systems also implement other protocols that are defined in various ITU-T Recommendations to provide supplementary services support or deliver other functionality to the user. Some of those Recommendations are:
H.235 series describes security within H.323, including security for both signaling and media.
H.239 describes dual stream use in videoconferencing, usually one for live video, the other for still images.
H.450 series describes various supplementary services.
H.460 series defines optional extensions that might be implemented by an endpoint or a Gatekeeper, including ITU-T Recommendations H.460.17, H.460.18, and H.460.19 forNetwork address translation (NAT) / Firewall (FW) traversal.
In addition to those ITU-T Recommendations, H.323 utilizes various IETF Request for Comments (RFCs) for media transport and media packetization, including the Real-time Transport Protocol (RTP).
H.323 Network Elements
Terminals
Figure 1 - A complete, sophisticated protocol stack
Terminals in an H.323 network are the most fundamental elements in any H.323 system, as those are the devices that users would normally encounter. They might exist in the form of a simple IP phone or a powerful high-definition videoconferencing system.
Inside an H.323 terminal is something referred to as a Protocol stack, which implements the functionality defined by the H.323 system. The protocol stack would include an implementation of the basic protocol defined in ITU-T Recommendation H.225.0 and H.245, as well as RTP or other protocols described above.
The diagram, figure 1, depicts a complete, sophisticated stack that provides support for voice, video, and various forms of data communication. In reality, most H.323 systems do not implement such a wide array of capabilities, but the logical arrangement is useful in understanding the relationships.
Multipoint Control Units
A Multipoint Control Unit (MCU) is responsible for managing multipoint conferences and is composed of two logical entities referred to as the Multipoint Controller (MC) and theMultipoint Processor (MP). In more practical terms, an MCU is a conference bridge not unlike the conference bridges used in the PSTN today. The most significant difference, however, is that H.323 MCUs might be capable of mixing or switching video, in addition to the normal audio mixing done by a traditional conference bridge. Some MCUs also provide multipoint data collaboration capabilities. What this means to the end user is that, by placing a video call into an H.323 MCU, the user might be able to see all of the other participants in the conference, not only hear their voices.
Gateways
Gateways are devices that enable communication between H.323 networks and other networks, such as PSTN or ISDN networks. If one party in a conversation is utilizing a terminal that is not an H.323 terminal, then the call must pass through a gateway in order to enable both parties to communicate.
Gateways are widely used today in order to enable the legacy PSTN phones to interconnect with the large, international H.323 networks that are presently deployed by services providers. Gateways are also used within the enterprise in order to enable enterprise IP phones to communicate through the service provider to users on the PSTN.
Gateways are also used in order to enable videoconferencing devices based on H.320 and H.324 to communicate with H.323 systems. Most of the third generation (3G) mobile networks deployed today utilize the H.324 protocol and are able to communicate with H.323-based terminals in corporate networks through such gateway devices.
Gatekeepers
A Gatekeeper is an optional component in the H.323 network that provides a number of services to terminals, gateways, and MCU devices. Those services include endpoint registration, address resolution, admission control, user authentication, and so forth. Of the various functions performed by the gatekeeper, address resolution is the most important as it enables two endpoints to contact each other without either endpoint having to know the IP address of the other endpoint.
Gatekeepers may be designed to operate in one of two signaling modes, namely "direct routed" and "gatekeeper routed" mode. Direct routed mode is the most efficient and most widely deployed mode. In this mode, endpoints utilize the RAS protocol in order to learn the IP address of the remote endpoint and a call is established directly with the remote device. In the gatekeeper routed mode, call signaling always passes through the gatekeeper. While the latter requires the gatekeeper to have more processing power, it also gives the gatekeeper complete control over the call and the ability to provide supplementary services on behalf of the endpoints.
H.323 endpoints use the RAS protocol to communicate with a gatekeeper. Likewise, gatekeepers use RAS to communicate with other gatekeepers.
A collection of endpoints that are registered to a single Gatekeeper in H.323 is referred to as a "zone". This collection of devices does not necessarily have to have an associated physical topology. Rather, a zone may be entirely logical and is arbitrarily defined by the network administrator.
Gatekeepers have the ability to neighbor together so that call resolution can happen between zones. Neighboring facilitates the use of dial plans such as the Global Dialing Scheme. Dial plans facilitate "inter-zone" dialing so that two endpoints in separate zones can still communicate with each other.
Border Elements and Peer Elements
Figure 2 - An illustration of an administrative domain with border elements, peer elements, and gatekeepers
Border Elements and Peer Elements are optional entities similar to a Gatekeeper, but that do not manage endpoints directly and provide some services that are not described in the RAS protocol. The role of a border or peer element is understood via the definition of an "administrative domain".
An administrative domain is the collection of all zones that are under the control of a single person or organization, such as a service provider. Within a service provider network there may be hundreds or thousands of gateway devices, telephones, video terminals, or other H.323 network elements. The service provider might arrange devices into "zones" that enable the service provider to best manage all of the devices under its control, such as logical arrangement by city. Taken together, all of the zones within the service provider network would appear to another service provider as an "administrative domain".
The border element is a signaling entity that generally sits at the edge of the administrative domain and communicates with another administrative domain. This communication might include such things as access authorization information; call pricing information; or other important data necessary to enable communication between the two administrative domains.
Peer elements are entities within the administrative domain that, more or less, help to propagate information learned from the border elements throughout the administrative domain. Such architecture is intended to enable large-scale deployments within carrier networks and to enable services such as clearinghouses.
Establishment of an H.323 call
In the simplest form, an H.323 call may be established as follows (figure 3):
In this example, the endpoint (EP) on the left initiated communication with the gateway on the right and the gateway connected the call with the called party. In reality, call flows are often more complex than the one shown, but most calls that utilize the Fast Connect procedures defined within H.323 can be established with as few as 2 or 3 messages. Endpoints must notify their gatekeeper (if gatekeepers are used) that they are in a call. Once a call has concluded, a device will send a Release Complete message. Endpoints are then required to notify their gatekeeper (if gatekeepers are used) that the call has ended.