Internet Engineering Task Force (IETF)                           R. Mahy
Request for Comments: 5850                                  Unaffiliated
Category: Informational                                        R. Sparks
ISSN: 2070-1721                                                  Tekelec
                                                            J. Rosenberg
                                                             jdrosen.net
                                                               D. Petrie
                                                                   SIPez
                                                        A. Johnston, Ed.
                                                                   Avaya
                                                                May 2010
A Call Control and Multi-Party Usage Framework for
the Session Initiation Protocol (SIP)
Abstract
This document defines a framework and the requirements for call
control and multi-party usage of the Session Initiation Protocol
(SIP). To enable discussion of multi-party features and
applications, we define an abstract call model for describing the
media relationships required by many of these. The model and actions
described here are specifically chosen to be independent of the SIP
signaling and/or mixing approach chosen to actually set up the media
relationships. In addition to its dialog manipulation aspect, this
framework includes requirements for communicating related information
and events such as conference and session state and session history.
This framework also describes other goals that embody the spirit of
SIP applications as used on the Internet such as the definition of
primitives (not services), invoker and participant oriented
primitives, signaling and mixing model independence, and others.
Status of This Memo
This document is not an Internet Standards Track specification; it is
published for informational purposes.
This document is a product of the Internet Engineering Task Force
(IETF). It represents the consensus of the IETF community. It has
received public review and has been approved for publication by the
Internet Engineering Steering Group (IESG). Not all documents
approved by the IESG are a candidate for any level of Internet
Standard; see Section 2 of RFC 5741.
Information about the current status of this document, any errata,
and how to provide feedback on it may be obtained at
http://www.rfc-editor.org/info/rfc5850.
Mahy, et al. Informational [Page 1]
RFC 5850 SIP Call Control Framework May 2010
Copyright Notice
Copyright (c) 2010 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
This document may contain material from IETF Documents or IETF
Contributions published or made publicly available before November
10, 2008. The person(s) controlling the copyright in some of this
material may not have granted the IETF Trust the right to allow
modifications of such material outside the IETF Standards Process.
Without obtaining an adequate license from the person(s) controlling
the copyright in such materials, this document may not be modified
outside the IETF Standards Process, and derivative works of it may
not be created outside the IETF Standards Process, except to format
it for publication as an RFC or to translate it into languages other
than English.
Table of Contents
1. Motivation and Background . . . . . . . . . . . . . . . . . . 4
2. Key Concepts . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1. Conversation Space Model . . . . . . . . . . . . . . . . . 7
2.2. Relationship between Conversation Space, SIP Dialogs,
and SIP Sessions . . . . . . . . . . . . . . . . . . . . . 8
2.3. Signaling Models . . . . . . . . . . . . . . . . . . . . . 9
2.4. Mixing Models . . . . . . . . . . . . . . . . . . . . . . 10
2.4.1. Tightly Coupled . . . . . . . . . . . . . . . . . . . 11
2.4.2. Loosely Coupled . . . . . . . . . . . . . . . . . . . 12
2.5. Conveying Information and Events . . . . . . . . . . . . . 13
2.6. Componentization and Decomposition . . . . . . . . . . . . 15
2.6.1. Media Intermediaries . . . . . . . . . . . . . . . . . 15
2.6.2. Text-to-Speech and Automatic Speech Recognition . . . 17
2.6.3. VoiceXML . . . . . . . . . . . . . . . . . . . . . . . 17
2.7. Use of URIs . . . . . . . . . . . . . . . . . . . . . . . 18
2.7.1. Naming Users in SIP . . . . . . . . . . . . . . . . . 19
2.7.2. Naming Services with SIP URIs . . . . . . . . . . . . 20
2.8. Invoker Independence . . . . . . . . . . . . . . . . . . . 22
2.9. Billing Issues . . . . . . . . . . . . . . . . . . . . . . 23
3. Catalog of Call Control Actions and Sample Features . . . . . 23
3.1. Remote Call Control Actions on Early Dialogs . . . . . . . 24
3.1.1. Remote Answer . . . . . . . . . . . . . . . . . . . . 24
3.1.2. Remote Forward or Put . . . . . . . . . . . . . . . . 24
3.1.3. Remote Busy or Error Out . . . . . . . . . . . . . . . 24
3.2. Remote Call Control Actions on Single Dialogs . . . . . . 24
3.2.1. Remote Dial . . . . . . . . . . . . . . . . . . . . . 24
3.2.2. Remote On and Off Hold . . . . . . . . . . . . . . . . 25
3.2.3. Remote Hangup . . . . . . . . . . . . . . . . . . . . 25
3.3. Call Control Actions on Multiple Dialogs . . . . . . . . . 25
3.3.1. Transfer . . . . . . . . . . . . . . . . . . . . . . . 25
3.3.2. Take . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.3. Add . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3.4. Local Join . . . . . . . . . . . . . . . . . . . . . . 28
3.3.5. Insert . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.6. Split . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3.7. Near-Fork . . . . . . . . . . . . . . . . . . . . . . 29
3.3.8. Far-Fork . . . . . . . . . . . . . . . . . . . . . . . 29
4. Security Considerations . . . . . . . . . . . . . . . . . . . 30
Appendix A. Example Features . . . . . . . . . . . . . . . . . 32
Appendix A.1. Attended Transfer . . . . . . . . . . . . . . . . . 32
Appendix A.2. Auto Answer . . . . . . . . . . . . . . . . . . . . 32
Appendix A.3. Automatic Callback . . . . . . . . . . . . . . . . 32
Appendix A.4. Barge-In . . . . . . . . . . . . . . . . . . . . . 32
Appendix A.5. Blind Transfer . . . . . . . . . . . . . . . . . . 32
Appendix A.6. Call Forwarding . . . . . . . . . . . . . . . . . . 33
Appendix A.7. Call Monitoring . . . . . . . . . . . . . . . . . . 33
Appendix A.8. Call Park . . . . . . . . . . . . . . . . . . . . . 33
Appendix A.9. Call Pickup . . . . . . . . . . . . . . . . . . . . 33
Appendix A.10. Call Return . . . . . . . . . . . . . . . . . . . . 34
Appendix A.11. Call Waiting . . . . . . . . . . . . . . . . . . . 34
Appendix A.12. Click-to-Dial . . . . . . . . . . . . . . . . . . . 34
Appendix A.13. Conference Call . . . . . . . . . . . . . . . . . . 34
Appendix A.14. Consultative Transfer . . . . . . . . . . . . . . . 34
Appendix A.15. Distinctive Ring . . . . . . . . . . . . . . . . . 35
Appendix A.16. Do Not Disturb . . . . . . . . . . . . . . . . . . 35
Appendix A.17. Find-Me . . . . . . . . . . . . . . . . . . . . . . 35
Appendix A.18. Hotline . . . . . . . . . . . . . . . . . . . . . . 35
Appendix A.19. IM Conference Alerts . . . . . . . . . . . . . . . 35
Appendix A.20. Inbound Call Screening . . . . . . . . . . . . . . 35
Appendix A.21. Intercom . . . . . . . . . . . . . . . . . . . . . 36
Appendix A.22. Message Waiting . . . . . . . . . . . . . . . . . . 36
Appendix A.23. Music on Hold . . . . . . . . . . . . . . . . . . . 36
Appendix A.24. Outbound Call Screening . . . . . . . . . . . . . . 36
Appendix A.25. Pre-Paid Calling . . . . . . . . . . . . . . . . . 37
Appendix A.26. Presence-Enabled Conferencing . . . . . . . . . . . 37
Appendix A.27. Single Line Extension/Multiple Line Appearance . . 37
Appendix A.28. Speakerphone Paging . . . . . . . . . . . . . . . . 38
Appendix A.29. Speed Dial . . . . . . . . . . . . . . . . . . . . 38
Appendix A.30. Voice Message Screening . . . . . . . . . . . . . . 38
Appendix A.31. Voice Portal . . . . . . . . . . . . . . . . . . . 39
Appendix A.32. Voicemail . . . . . . . . . . . . . . . . . . . . . 40
Appendix A.33. Whispered Call Waiting . . . . . . . . . . . . . . 40
Appendix B. Acknowledgments . . . . . . . . . . . . . . . . . . 40
5. Informative References . . . . . . . . . . . . . . . . . . . . 40
1. Motivation and Background
The Session Initiation Protocol (SIP) [RFC3261] was defined for the
initiation, maintenance, and termination of sessions or calls between
one or more users. However, despite its origins as a large-scale
multi-party conferencing protocol, SIP is used today primarily for
point-to-point calls. This two-party configuration is the focus of
the SIP specification and most of its extensions.
This document defines a framework and the requirements for call
control and multi-party usage of SIP. Most multi-party operations
manipulate SIP dialogs (also known as call legs) or SIP conference
media policy to cause participants in a conversation to perceive
specific media relationships. In other protocols that deal with the
concept of calls, this manipulation is known as call control. In
addition to its dialog or policy manipulation aspect, call control
also includes communicating information and events related to
manipulating calls, including information and events dealing with
session state and history, conference state, user state, and even
message state.
Based on input from the SIP community, the authors compiled the
following set of goals for SIP call control and multi-party
applications:
o Define primitives, not services. Allow for a handful of robust
yet simple mechanisms that can be combined to deliver features and
services. Throughout this document, we refer to these simple
mechanisms as "primitives". Primitives should be sufficiently
robust so that when they are combined with each other, they can be
used to build lots of services. However, the goal is not to
define a provably complete set of primitives. Note that while the
IETF will NOT standardize behavior or services, it may define
example services for informational purposes, as in service
examples [RFC5359].
o Be participant oriented. The primitives should be designed to
provide services that are oriented around the experience of the
participants. The authors observe that end users of features and
services usually don't care how a media relationship is set up.
Their ultimate experience is only based on the resulting media and
other externally visible characteristics.
o Be signaling model independent. Support both a central-control
and a peer-to-peer feature invocation model (and combinations of
the two). Baseline SIP already supports a centralized control
model described in 3pcc (third party call control) [RFC3725], and
the SIP community has expressed a great deal of interest in peer-
to-peer or distributed call control using primitives such as those
defined in REFER [RFC3515], Replaces [RFC3891], and Join
[RFC3911].
o Be mixing model independent. The bulk of interesting multi-party
applications involve mixing or combining media from multiple
participants. This mixing can be performed by one or more of the
participants or by a centralized mixing resource. The experience
of the participants should not depend on the mixing model used.
While most examples in this document refer to audio mixing, the
framework applies to any media type. In this context, a "mixer"
refers to combining media of the same type in an appropriate,
media-specific way. This is consistent with the model described
in the SIP conferencing framework.
o Be invoker oriented. Only the user who invokes a feature or a
service needs to know exactly which service is invoked or why.
This is good because it allows new services to be created without
requiring new primitives from all of the participants; and it
allows for much simpler feature authorization policies, for
example, when participation spans organizational boundaries. As
discussed in Section 2.8, this also avoids exponential state
explosion when combining features. The invoker only has to manage
a user interface or application programming interface (API) to
prevent local feature interactions. All the other participants
simply need to manage the feature interactions of a much smaller
number of primitives.
o Primitives make full use of URIs (uniform resource identifiers).
URIs are a very powerful mechanism for describing users and
services. They represent a plentiful resource that can be
extremely expressive and easily routed, translated, and
manipulated -- even across organizational boundaries. URIs can
contain special parameters and informational header fields that
need only be relevant to the owner of the namespace (domain) of
the URI. Just as a user who selects an http: URL need not
understand the significance and organization of the web site it
references, a user may encounter a SIP URI that translates into an
email-style group alias, which plays a pre-recorded message or
runs some complex call-handling logic. Note that while this may
seem paradoxical to the previous goal, both goals can be satisfied
by the same model.
o Make use of SIP header fields and SIP event packages to provide
SIP entities with information about their environment. These
should include information about the status/handling of dialogs on
other user agents (UAs), information about the history of other
contacts attempted prior to the current contact, the status of
participants, the status of conferences, user presence
information, and the status of messages.
o Encourage service decomposition, and design to make use of
standard components using well-defined, simple interfaces. Sample
components include a SIP mixer, recording service, announcement
server, and voice-dialog server. (This is not an exhaustive
list).
o Include authentication, authorization, policy, logging, and
accounting mechanisms to allow these primitives to be used safely
among mutually untrusted participants. Some of these mechanisms
may be used to assist in billing, but no specific billing system
will be endorsed.
o Permit graceful fallback to baseline SIP. Definitions for new SIP
call control extensions/primitives must describe a graceful way to
fall back to baseline SIP behavior. Support for one primitive must
not imply support for another primitive.
o Don't reinvent traditional models, such as the model used for the
H.450 family of protocols, JTAPI (Java Telephony Application
Programming Interface), or the CSTA (Computer-supported
telecommunications applications) call model, as these other models
do not share the design goals presented in this document.
Note that the flexibility in this model does have some disadvantages
in terms of interoperability. It is possible to build a call control
feature in SIP using different combinations of primitives. For a
discussion of the issues associated with this, see [BLISS-PROBLEM].
2. Key Concepts
This section introduces a number of key concepts that will be used to
describe and explain various call control operations and services in
the remainder of this document. This includes the conversation space
model, signaling and mixing models, common components, and the use of
URIs.
2.1. Conversation Space Model
This document introduces the concept of an abstract "conversation
space" as a set of participants who believe they are all
communicating among one another. Each conversation space contains
one or more participants.
Participants are SIP UAs that send original media to or terminate and
receive media from other members of the conversation space.
Logically, every participant in the conversation space has access to
all the media generated in that space (this is strictly true if all
participants share a common media type). A SIP UA that does not
contribute or consume any media is NOT a participant, nor is a UA
that merely forwards, transcodes, mixes, or selects media originating
elsewhere in the conversation space.
Note that a conversation space consists of zero or more SIP calls
or SIP conferences. A conversation space is similar to the
definition of a "call" in some other call models.
Participants may represent human users or non-human users (referred
to as robots or automatons in this document). Some participants may
be hidden within a conversation space. Some examples of hidden
participants include: robots that generate tones, images, or
announcements during a conference to announce users arriving and
departing, a human call center supervisor monitoring a conversation
between a trainee and a customer, and robots that record media for
training or archival purposes.
Participants may also be active or passive. Active participants are
expected to be intelligent enough to leave a conversation space when
they no longer desire to participate. (An attentive human
participant is obviously active.) Some robotic participants (such as
a voice-messaging system, an instant-messaging agent, or a voice-
dialog system) may be active participants if they can leave the
conversation space when there is no human interaction. Other robots
(for example, our tone-generating robot from the previous example)
are passive participants. A human participant "on hold" is passive.
An example diagram of a conversation space can be shown as a "bubble"
or ovals, or as a "set" in curly or square bracket notation. Each
set, oval, or bubble represents a conversation space. Hidden
participants are shown in lowercase letters. Examples are given in
Figure 1.
Note that while the term "conversation" usually applies to oral
exchange of information, we apply the conversation space model to any
media exchange between participants.
      { A , B }         [ A , b, C, D ]
         .-.                .---.
        /   \              /     \
       /  A  \            /  A  b \
      (       )          (         )
       \  B  /            \  C  D /
        \   /              \     /
         '-'                '---'
Figure 1. Conversation Spaces
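As an informal illustration of the model, the following minimal Python sketch represents a conversation space as a list of participants and renders it in set notation, with hidden participants in lowercase. The class names, field names, and the notation() helper are hypothetical and exist only for this example; nothing here is defined by SIP or by this framework.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Participant:
    """A UA that originates or terminates media in a conversation space."""
    name: str
    hidden: bool = False   # e.g., a recorder or a monitoring supervisor
    active: bool = True    # able to leave the space on its own initiative


@dataclass
class ConversationSpace:
    """Hypothetical container for the participants of one conversation space."""
    participants: list = field(default_factory=list)

    def notation(self) -> str:
        """Render the space as a set; hidden participants appear in lowercase."""
        labels = [p.name.lower() if p.hidden else p.name.upper()
                  for p in self.participants]
        return "{ " + " , ".join(labels) + " }"


two_party = ConversationSpace([Participant("A"), Participant("B")])
monitored = ConversationSpace([
    Participant("A"),
    Participant("B", hidden=True, active=False),  # e.g., a recording robot
    Participant("C"),
    Participant("D"),
])
print(two_party.notation())   # { A , B }
print(monitored.notation())   # { A , b , C , D }
```

Note that UAs that only forward, transcode, or mix media are deliberately not modeled, since they are not participants.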
2.2. Relationship between Conversation Space, SIP Dialogs, and SIP
Sessions
In [RFC3261], a call is "an informal term that refers to some
communication between peers, generally set up for the purposes of a
multimedia conversation". The concept of a conversation space is
needed because the SIP definition of call is not sufficiently precise
for the purpose of describing the user experience of multi-party
features.
Do any other definitions convey the correct meaning? SIP and SDP
(Session Description Protocol) [RFC4566] both define a conference as
"a multimedia session identified by a common session description". A
session is defined as "a set of multimedia senders and receivers and
the data streams flowing from senders to receivers". The definition
of "call" in some call models is more similar to our definition of a
conversation space.
Some examples of the relationship between conversation spaces, SIP
dialogs, and SIP sessions are listed below. In each example, a human
user will perceive that there is a single call.
o A simple two-party call is a single conversation space, a single
session, and a single dialog.
o A locally mixed three-way call is two sessions and two dialogs.
It is also a single conversation space.
o A simple dial-in audio conference is a single conversation space,
but is represented by as many dialogs and sessions as there are
human participants.
o A multicast conference is a single conversation space, a single
session, and as many dialogs as participants.
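The four examples above can be restated as a small lookup, sketched below in Python. The scenario labels and the function name are assumptions made for this illustration; the counts simply mirror the list above, and for the dial-in case the count is per human participant.

```python
def signaling_footprint(scenario: str, participants: int = 2) -> dict:
    """Return how many conversation spaces, sessions, and dialogs a scenario uses.

    The scenario labels are hypothetical names for the four examples in the text.
    """
    if scenario == "two-party":
        return {"spaces": 1, "sessions": 1, "dialogs": 1}
    if scenario == "local-three-way":
        # The mixing participant holds one session and one dialog per peer.
        return {"spaces": 1, "sessions": 2, "dialogs": 2}
    if scenario == "dial-in":
        # One dialog and one session per participant, all toward the conference.
        return {"spaces": 1, "sessions": participants, "dialogs": participants}
    if scenario == "multicast":
        # A single shared session, but still one dialog per participant.
        return {"spaces": 1, "sessions": 1, "dialogs": participants}
    raise ValueError(f"unknown scenario: {scenario}")


for name, n in [("two-party", 2), ("local-three-way", 3),
                ("dial-in", 6), ("multicast", 6)]:
    print(name, signaling_footprint(name, n))
```

In every case the number of conversation spaces is one, which is the point of the model: the user perceives a single call regardless of how many dialogs or sessions carry it.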
2.3. Signaling Models
Obviously, to make changes to a conversation space, you must be able
to use SIP signaling to cause these changes. Specifically, there
must be a way to manipulate SIP dialogs (call legs) to move
participants into and out of conversation spaces. Although this is
not as obvious, there also must be a way to manipulate SIP dialogs to
include non-participant UAs that are otherwise involved in a
conversation space (e.g., back-to-back user agents or B2BUAs, third
party call control (3pcc) controllers, mixers, transcoders,
translators, or relays).
Implementations may set up the media relationships described in the
conversation space model using a centralized control model. One
common way to implement this using SIP is known as third party call
control (3pcc) and is described in 3pcc [RFC3725]. The 3pcc approach
relies on only the following three primitive operations:
o Create a new dialog (INVITE)
o Modify a dialog (reINVITE)
o Destroy a dialog (BYE)
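To make the three primitives above concrete, here is a toy Python model of a controller that tracks its dialogs and records which SIP request each primitive corresponds to. It sends nothing on the network; the class, its method names, the example URIs, and the placeholder SDP strings are all hypothetical, and the call sequence shown is only one plausible ordering, not a flow taken from [RFC3725].

```python
import itertools


class ThirdPartyController:
    """Toy 3pcc controller: tracks dialogs and logs the request for each primitive."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.dialogs = {}   # dialog id -> {"peer": URI, "sdp": session description}
        self.log = []       # (request label, dialog id), in the order issued

    def create(self, peer, sdp=None):
        """Create a new dialog (INVITE)."""
        dialog_id = next(self._ids)
        self.dialogs[dialog_id] = {"peer": peer, "sdp": sdp}
        self.log.append(("INVITE", dialog_id))
        return dialog_id

    def modify(self, dialog_id, sdp):
        """Modify a dialog (reINVITE, i.e., an INVITE sent within the dialog)."""
        self.dialogs[dialog_id]["sdp"] = sdp
        self.log.append(("reINVITE", dialog_id))

    def destroy(self, dialog_id):
        """Destroy a dialog (BYE)."""
        del self.dialogs[dialog_id]
        self.log.append(("BYE", dialog_id))


ctl = ThirdPartyController()
leg_a = ctl.create("sip:a@example.com")                    # first call leg
leg_b = ctl.create("sip:b@example.com", sdp="sdp-from-a")  # second call leg
ctl.modify(leg_a, sdp="sdp-from-b")                        # connect the media
ctl.destroy(leg_a)
ctl.destroy(leg_b)
print(ctl.log)
```

The end systems in this sketch need nothing beyond the ability to accept INVITE and BYE, which is the property the next paragraph describes.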
The main advantage of the 3pcc approach is that it only requires very
basic SIP support from end systems to support call control features.
As such, third party call control is a natural way to handle protocol
conversion and mid-call features. It also has the advantage and
disadvantage that new features can/must be implemented in one place
only (the controller), and it neither requires enhanced client
functionality nor takes advantage of it.
In addition, a peer-to-peer approach is discussed at length in this
document. The primary drawback of the peer-to-peer model is
additional complexity in the end system and authentication and
management models. The benefits of the peer-to-peer model include:
o state remains at the edges,
o call signaling need only go through participants involved (there
are no additional points of failure), and
o peers may take advantage of end-to-end message integrity or
encryption.
The peer-to-peer approach relies on additional "primitive"
operations, some of which are identified here.
o Replace an existing dialog
o Join a new dialog with an existing dialog
o Locally perform media forking (multi-unicast)
o Ask another user agent (UA) to send a request on your behalf
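Several of these primitives are commonly combined. For instance, asking another UA to send a request that replaces an existing dialog can be expressed by embedding a Replaces header field in a Refer-To URI. The Python sketch below builds such a header field value under our understanding of the syntax in [RFC3515] and [RFC3891]; the function names, the example URI, and the dialog identifiers are made up for illustration, and the output should be checked against those specifications before use.

```python
from urllib.parse import quote


def dialog_identifier(call_id: str, to_tag: str, from_tag: str) -> str:
    """Value identifying a dialog, in the form used by the Replaces and Join header fields."""
    return f"{call_id};to-tag={to_tag};from-tag={from_tag}"


def refer_to_with_replaces(target_uri: str, call_id: str,
                           to_tag: str, from_tag: str) -> str:
    """Build a Refer-To header field whose URI carries an escaped Replaces header."""
    # Characters such as '@', ';', and '=' must be percent-escaped inside the URI.
    escaped = quote(dialog_identifier(call_id, to_tag, from_tag), safe="")
    return f"Refer-To: <{target_uri}?Replaces={escaped}>"


header = refer_to_with_replaces(
    "sip:carol@example.com",          # hypothetical transfer target
    call_id="12345@a.example.com",    # made-up identifiers of the dialog to replace
    to_tag="7743",
    from_tag="6472",
)
print(header)
```

A UA that receives a REFER with this Refer-To value would send an INVITE to the target with the unescaped Replaces header field; substituting "Join" for "Replaces" in the URI would request the join primitive instead.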
The peer-to-peer approach also results in only a single SIP dialog,
directly between the two UAs. The 3pcc approach results in two SIP
dialogs, one between each UA and the controller. As a result, the SIP
features and extensions that can be used during the dialog are
limited to those understood by the controller. Consequently, where
both UAs support an advanced SIP feature but the controller does not,
the feature cannot be used.
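The limitation can be viewed as a set intersection over the extensions each element understands. The short Python sketch below is a simplification that treats support as a flat set of option tags and ignores how support is actually negotiated; the function name and the particular sets are assumptions for this example.

```python
def usable_extensions(*supported):
    """Extensions usable end to end: those understood by every element on the path."""
    return set.intersection(*(set(s) for s in supported))


ua_a = {"replaces", "join", "100rel"}
ua_b = {"replaces", "join", "100rel"}
controller = {"100rel"}               # a controller with only basic support

peer_to_peer = usable_extensions(ua_a, ua_b)
via_controller = usable_extensions(ua_a, controller, ua_b)
print(sorted(peer_to_peer))    # ['100rel', 'join', 'replaces']
print(sorted(via_controller))  # ['100rel']
```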
Many of the features, primitives, and actions described in this
document also require some type of media mixing, combining, or
selection as described in the next section.
2.4. Mixing Models
SIP permits a variety of mixing models, which are discussed here
briefly. This topic is discussed more thoroughly in the SIP
conferencing framework [RFC4353] and [RFC4579]. SIP supports both
tightly coupled and loosely coupled conferencing, although more
sophisticated behavior is available in tightly coupled conferences.
In a tightly coupled conference, a single SIP user agent (called the
focus) has a direct dialog relationship with each participant (and
may control non-participant user agents as well). The focus can
authoritatively publish information about the character and
participants in a conference. In a loosely coupled conference, there
are no coordinated signaling relationships among the participants.
For brevity, only the two most popular conferencing models are
significantly discussed in this document (local and centralized
mixing). Applications of the conversation spaces model to loosely
coupled multicast and distributed full unicast mesh conferences are
left as an exercise for the reader. Note that a distributed full
mesh conference can be used for basic conferences, but does not
easily allow for more complex conferencing actions like splitting,
merging, and sidebars.
Call control features should be designed to allow a mixer (local or
centralized) to decide when to reduce a conference back to a two-
party call, or drop all the participants (for example, if only two
automatons are communicating). The actual heuristics used to release
calls are beyond the scope of this document, but may depend on
properties in the conversation space, such as the number of active,
passive, or hidden participants and the send-only, receive-only, or
send-and-receive orientation of various participants.
2.4.1. Tightly Coupled
Tightly coupled conferences utilize a central point for signaling and
authentication known as a focus [RFC4353]. The actual media can be
centrally mixed or distributed.
2.4.1.1. (Single) End System Mixing
The first model we call "end system mixing". In this model, user A
calls user B, and they have a conversation. At some point later, A
decides to conference in user C. To do this, A calls C, using a
completely separate SIP call. This call uses a different Call-ID,
different tags, etc. There is no call set up directly between B and
C. No SIP extension or external signaling is needed. A merely
decides to locally join two dialogs.
                 B     C
                  \   /
                   \ /
                    A
Figure 2. End System Mixing Example
In Figure 2, A receives media streams from both B and C, and mixes
them. A sends a stream containing A's and C's streams to B, and a
stream containing A's and B's streams to C. Basically, user A
handles both signaling and media mixing.
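The stream composition described above follows a simple rule: each peer receives the combination of every source except its own. A minimal Python sketch of that rule follows; the function names are hypothetical, and "mixing" is reduced to listing which sources contribute to each outgoing stream.

```python
def mix_for(listener, sources):
    """Sources combined into the stream sent to one listener (its own media excluded)."""
    return sorted(s for s in sources if s != listener)


def end_system_mixing(mixer, peers):
    """For a mixing endpoint, map each peer to the sources in the stream it is sent."""
    everyone = [mixer, *peers]
    return {peer: mix_for(peer, everyone) for peer in peers}


streams = end_system_mixing("A", ["B", "C"])
print(streams)   # {'B': ['A', 'C'], 'C': ['A', 'B']}
```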
2.4.1.2. Centralized Mixing
In a centralized mixing model, all participants have a pairwise SIP
and media relationship with the mixer. Common applications of
centralized mixing include ad hoc conferences and scheduled dial-in
or dial-out conferences. In Figure 3 below, the mixer M receives and
sends media to participants A, B, C, D, and E.
              B     C
               \   /
                \ /
                 M --- A
                / \
               /   \
              D     E
Figure 3. Centralized Mixing Example
2.4.1.3. Centralized Signaling, Distributed Media
In this conferencing model, there is a centralized controller, as in
the dial-in and dial-out cases. However, the centralized server
handles signaling only. The media is still sent directly between
participants, using either multicast or multi-unicast. Participants
perform their own mixing. Multi-unicast is when a user sends
multiple packets (one for each recipient, addressed to that
recipient). This is referred to as a "Decentralized Multipoint
Conference" in [H.323]. Full mesh media with centralized mixing is
another approach.
2.4.2. Loosely Coupled
In these models, there is no point of central control of SIP
signaling. As in the "Centralized Signaling, Distributed Media" case
above, all endpoints send media to all other endpoints.
Consequently, every endpoint mixes the media it receives from all the
other sources and sends its own media to every other participant.
2.4.2.1. Large-Scale Multicast Conferences
Large-scale multicast conferences were the original motivation for
both the Session Description Protocol (SDP) [RFC4566] and SIP. In a
large-scale multicast conference, one or more multicast addresses are
allocated to the conference. Each participant joins those multicast
groups and sends their media to those groups. Signaling is not sent
to the multicast groups. The sole purpose of the signaling is to
inform participants of which multicast groups to join. Large-scale
multicast conferences are usually pre-arranged, with specific start
and stop times. However, multicast conferences do not need to be
pre-arranged, so long as a mechanism exists to dynamically obtain a
multicast address.
2.4.2.2. Full Distributed Unicast Conferencing
In this conferencing model, each participant has both a pairwise
media relationship and a pairwise signaling relationship with every
other participant (a full mesh). This model requires a mechanism to
maintain a consistent view of distributed state across the group.
This is a classic, hard problem in computer science. Also, this
model does not scale well for large numbers of participants. For n
participants, the number of media and signaling relationships is
approximately n-squared. As a result, this model is not generally
available in commercial implementations; to the contrary, it is
primarily the topic of research or experimental implementations.
Note that this model assumes peer-to-peer signaling.
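To give a sense of the scaling, the Python sketch below counts pairwise relationships in a full mesh, n(n-1)/2 if each pair is counted once, and compares the result with the n relationships needed when every participant connects only to a central mixer. The function names are ours, and counting each pair once (rather than once per direction) is an assumption of this example.

```python
def full_mesh_relationships(n: int) -> int:
    """Pairwise relationships when every participant connects to every other."""
    return n * (n - 1) // 2


def centralized_relationships(n: int) -> int:
    """Relationships when each participant connects only to a central mixer."""
    return n


for n in (3, 5, 10, 50):
    print(n, full_mesh_relationships(n), centralized_relationships(n))
# 3 3 3
# 5 10 5
# 10 45 10
# 50 1225 50
```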
2.5. Conveying Information and Events
Participants should have access to information about the other
participants in a conversation space so that this information can be
rendered to a human user or processed by an automaton. Although some
of this information may be available from the Request-URI or To,
From, Contact, or other SIP header fields, another mechanism of
reporting this information is necessary.
Many applications are driven by knowledge about the progress of calls
and conferences. In general, these types of events allow for the
construction of distributed applications, where the application
requires information on dialog and conference state, but is not
necessarily co-resident with an endpoint user agent or conference
server. For example, a focus involved in a conversation space may
wish to provide URIs for conference status and/or conference/floor
control.
The SIP Events architecture [RFC3265] defines general mechanisms for
subscription to and notification of events within SIP networks. It
introduces the notion of a package that is a specific "instantiation"
of the events mechanism for a well-defined set of events.
Event packages are needed to provide the status of a user's dialogs,
the status of conferences and their participants, user-presence
information, the status of registrations, and the status of a user's
messages. While this is not an exhaustive list, these are sufficient
to enable the sample features described in this document.
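As a rough guide, the kinds of status listed above correspond to event packages that a subscriber names in the Event header field of a SUBSCRIBE request. The Python sketch below maps each kind of information to the package name and body type as we understand them from the respective package specifications; the dictionary keys and the helper function are hypothetical, and the values should be verified against those specifications.

```python
# Kind of information -> (Event package name, typical notification body type)
EVENT_PACKAGES = {
    "dialog state": ("dialog", "application/dialog-info+xml"),
    "conference state": ("conference", "application/conference-info+xml"),
    "user presence": ("presence", "application/pidf+xml"),
    "registration state": ("reg", "application/reginfo+xml"),
    "message status": ("message-summary", "application/simple-message-summary"),
}


def subscribe_headers(kind: str) -> list:
    """Header field lines a SUBSCRIBE request would carry for one kind of information."""
    event, body_type = EVENT_PACKAGES[kind]
    return [f"Event: {event}", f"Accept: {body_type}"]


for kind in EVENT_PACKAGES:
    print(kind, "->", subscribe_headers(kind))
```

Only the two header fields that select the package are shown; a complete SUBSCRIBE request carries the usual SIP header fields as well.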
The conference event package [RFC4575] allows users to subscribe to
information about an entire tightly coupled SIP conference.
Notifications convey information about the participants such as the
SIP URI identifying each user, their status in the space (active,
declined, departed), URIs to invoke other features (such as sidebar
conversations), links to other relevant information (such as floor-
control policies), and if floor-control policies are in place, the
user's floor-control status. For conversation spaces created from
cascaded conferences, conversation state can be gathered from
relevant foci and merged into a cohesive set of state.
The dialog package [RFC4235] provides information about all the
dialogs the target user is maintaining, in which conversations the
user is participating, and how these are correlated. Likewise, the
registration package [RFC3680] provides notifications when contacts
have changed for a specific address-of-record (AOR). The combination
of these allows a user agent to learn about all conversations
occurring for the entire registered contact set for an address-of-
record.
Note that user presence in SIP [RFC3856] has a close relationship
with these latter two event packages. It is fundamental to the
presence model that the information used to obtain user presence is
constructed from any number of different input sources. Examples of
other such sources include calendaring information and uploads of
presence documents. These two packages can be considered another
mechanism that allows a presence agent to determine the presence
state of the user. Specifically, a user presence server can act as a
subscriber for the dialog and registration packages to obtain
additional information that can be used to construct a presence
document.
The multi-party architecture may also need to provide a mechanism to
get information about the status/handling of a dialog (for example,
information about the history of other contacts attempted prior to
the current contact). Finally, the architecture should provide ample
opportunities to present informational URIs that relate to calls,
conversations, or dialogs in some way. For example, consider the SIP
Call-Info header or Contact header fields returned in a 300-class
response. Frequently, additional information about a call or dialog
can be fetched via non-SIP URIs. For example, consider a web page
for package tracking when calling a delivery company or a web page
with related documentation when joining a dial-in conference. The
use of URIs in the multi-party framework is discussed in more detail
in Section 2.7.
Finally, the interaction of SIP with stimulus-signaling-based
applications, which allow a user agent to interact with an
application without knowledge of the semantics of that application,
is discussed in the SIP application interaction framework [RFC5629].
Stimulus signaling can occur with a user interface running locally
with the client, or with a remote user interface, through media
streams. Stimulus signaling encompasses a wide range of mechanisms,
from clicking on hyperlinks, to pressing buttons, to traditional
Dual-Tone Multi-Frequency (DTMF) input. In all cases, stimulus
signaling is supported through the use of markup languages, which
play a key role in that framework.
2.6. Componentization and Decomposition
This framework proposes a decomposed component architecture with a
very loose coupling of services and components. This means that a
service (such as a conferencing server or an auto-attendant) need not
be implemented as an actual server. Rather, these services can be
built by combining a few basic components in straightforward or
arbitrarily complex ways.
Since the components are easily deployed on separate boxes, by
separate vendors, or even with separate providers, we achieve a
separation of function that allows each piece to be developed in
complete isolation. We can also reuse existing components for new
applications. This allows rapid service creation, and the ability
for services to be distributed across organizational domains anywhere
in the Internet.
For many of these components, it is also desirable to discover their
capabilities, for example, querying the ability of a mixer to host a
10-dialog conference or to reserve resources for a specific time.
These actions could be provided in the form of URIs, provided there
is an a priori means of understanding their semantics. For example,
if there is a published dictionary of operations, a way to query the
service for the available operations and the associated URIs, the URI
can be the interface for providing these service operations. This
concept is described in more detail in the context of dialog
operations in Section 3.
2.6.1. Media Intermediaries
Media intermediaries are not participants in any conversation space,
although an entity that is also a media translator may also have a
co-located participant component (for example, a mixer that also
announces the arrival of a new participant; the announcement portion
is a participant, but the mixer itself is not). Media intermediaries
should be as transparent as possible to the end users -- offering a
useful, fundamental service without getting in the way of new
features implemented by participants. Some common media
intermediaries are described below.
2.6.1.1. Mixer
A SIP mixer is a component that combines media from all dialogs in
the same conversation in a media-specific way. For example, the
default combining for an audio conference might be an N-1
configuration, while a text mixer might interleave text messages on a
per-line basis. More details about how to manipulate the media
policy used by mixers are discussed in [XCON-CCMP].
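As a rough illustration of the N-1 configuration mentioned above, the
following Python sketch gives each participant the sum of every other
participant's audio frame. The function name mix_n_minus_1 and the
use of plain integer sample lists are assumptions made for this
example; a real mixer would also have to deal with clipping, codecs,
and timing.

```python
from typing import Dict, List


def mix_n_minus_1(frames: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """Return, for each participant, the sum of all *other* participants' samples.

    `frames` maps a participant label to one frame of equal-length samples.
    """
    if not frames:
        return {}
    length = len(next(iter(frames.values())))
    if any(len(f) != length for f in frames.values()):
        raise ValueError("all frames must have the same length")
    # Sum every input once, then subtract each participant's own contribution.
    total = [sum(f[i] for f in frames.values()) for i in range(length)]
    return {
        who: [total[i] - own[i] for i in range(length)]
        for who, own in frames.items()
    }
```

With three participants, each output is the mix of the other two, so
no participant hears its own audio echoed back.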
2.6.1.2. Transcoder
A transcoder translates media from one encoding or format to another
(for example, GSM (Global System for Mobile communications) voice to
G.711, MPEG2 to H.261, or text/html to text/plain), or from one media
type to another (for example, text to speech). A more thorough
discussion of transcoding is provided in the framework for
transcoding with SIP [RFC5369].
2.6.1.3. Media Relay
A media relay terminates media and simply forwards it to a new
destination without changing the content in any way. Sometimes,
media relays are used to provide source IP address anonymity, to
facilitate middlebox traversal, or to provide a trusted entity where
media can be forcefully disconnected.
2.6.1.4. Queue Server
A queue server is a location where calls can be entered into one of
several FIFO (first-in, first-out) queues. A queue server would
subscribe to the presence of groups or individuals who are interested
in its queues. When detecting that a user is available to service a
queue, the server redirects or transfers the call at the head of the
relevant queue (the one that has waited longest) to the available
user. On a queue-by-queue basis,
authorized users could also subscribe to the call state (dialog
information) of calls within a queue. Authorized users could use
this information to effectively pluck (take) a call out of the queue
(for example, by sending an INVITE with a Replaces header to one of
the user agents in the queue).
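The queue behavior described above can be sketched in Python as
follows. This is a toy model under stated assumptions: the class name
QueueServer, its method names, and the use of plain strings as call
identifiers are all hypothetical, and FIFO order is taken to mean that
the call that has waited longest is served first. No SIP signaling
(presence subscription, REFER, or INVITE with Replaces) is performed.

```python
from collections import deque
from typing import Deque, Dict, Optional


class QueueServer:
    """Toy model of a queue server holding several named FIFO queues."""

    def __init__(self) -> None:
        self._queues: Dict[str, Deque[str]] = {}

    def enqueue(self, queue: str, call_id: str) -> None:
        """Place an incoming call at the tail of the named queue."""
        self._queues.setdefault(queue, deque()).append(call_id)

    def dispatch(self, queue: str) -> Optional[str]:
        """A user became available: hand over the longest-waiting call."""
        calls = self._queues.get(queue)
        return calls.popleft() if calls else None

    def pluck(self, queue: str, call_id: str) -> bool:
        """An authorized user takes a specific call out of the queue."""
        calls = self._queues.get(queue)
        if calls and call_id in calls:
            calls.remove(call_id)
            return True
        return False
```

In a deployment, dispatch would correspond to the redirect or
transfer performed by the server, and pluck to an authorized user
acting on dialog information obtained for that queue.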
2.6.1.5. Parking Place
A parking place is a location where calls can be terminated
temporarily and then retrieved later. While a call is "parked", it
can receive media "on hold" such as music, announcements, or
advertisements. Such a service could be further decomposed such that
announcements or music are handled by a separate component.
2.6.1.6. Announcements and Voice Dialogs
An announcement server is a server that can play digitized media
(frequently audio), such as music or recorded speech. These servers
are typically accessible via SIP, HTTP (Hypertext Transfer
Protocol), or RTSP (Real-Time Streaming Protocol). An analogous
service is a recording service that stores digitized media, and the
same server could easily provide both. A convention for specifying
announcements in SIP URIs is described in [RFC4240].
A "voice dialog" is a model of spoken interactive behavior between a
human and an automaton that can include synthesized speech, digitized
audio, recognition of spoken and DTMF key input, a recording of
spoken input, and interaction with call control. Voice dialogs
frequently consist of forms or menus. Forms present information and
gather input; menus offer choices of what to do next.
Spoken dialogs are a basic building block of applications that use
voice. Consider, for example, that a voicemail system, the
conference-id and passcode collection system for a conferencing
system, and complicated voice-portal applications all require a
voice-dialog component.
2.6.2. Text-to-Speech and Automatic Speech Recognition
Text-to-speech (TTS) is a service that converts text into digitized
audio. TTS is frequently integrated into other applications, but
when separated as a component, it provides greater opportunity for
broad reuse. Automatic Speech Recognition (ASR) is a service that
attempts to decipher digitized speech based on a proposed grammar.
Like TTS, ASR services can be embedded, or exposed so that many
applications can take advantage of such services. A standardized
(decomposed) interface to access standalone TTS and ASR services is
currently being developed as described in [RFC4313].
2.6.3. VoiceXML
VoiceXML is a W3C (World Wide Web Consortium) recommendation that was
designed to give authors control over the spoken dialog between users
and applications. The application and user take turns speaking: the
application prompts the user, and the user in turn responds. Its
major goal is to bring the advantages of web-based development and
content delivery to interactive voice-response applications. We
believe that VoiceXML represents the ideal partner for SIP in the
development of distributed IVR (interactive voice response) servers.
VoiceXML is an XML-based scripting language for describing IVR
services at an abstract level. VoiceXML supports DTMF recognition,
speech recognition, text-to-speech, and the playing out of recorded
media files. The results of the data collected from the user are
passed to a controlling entity through an HTTP POST operation. The
controller can then return another script, or terminate the
interaction with the IVR server.
A VoiceXML server also need not be implemented as a monolithic
server. Figure 4 shows a diagram of a VoiceXML browser that is split
into media and non-media handling parts. The VoiceXML interpreter
handles SIP dialog state and state within a VoiceXML document, and
sends requests to the media component over another protocol.
              +-------------+
              |             |
              |  VoiceXML   |
              | Interpreter |
              | (signaling) |
              +-------------+
                ^          ^
                |          |
            SIP |          | RTSP
                |          |
                |          |
                v          v
   +-------------+        +-------------+
   |             |        |             |
   |   SIP UA    |  RTP   | RTSP Server |
   |             |<------>|   (media)   |
   |             |        |             |
   +-------------+        +-------------+

           Figure 4. Decomposed VoiceXML Server
2.7. Use of URIs
All naming in SIP uses URIs. URIs in SIP are used in a plethora of
contexts: the Request-URI; Contact, To, From, and *-Info header
fields; application/uri bodies; and embedded in email, web pages,
instant messages, and ENUM records. The Request-URI identifies the
user or service for which the call is destined.
SIP URIs embedded in informational SIP header fields, SIP bodies, and
non-SIP content can also specify methods, special parameters, header
fields, and even bodies. For example:
sip:bob@b.example.com;method=REFER?Refer-To=http://example.com/~alice
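A URI of this shape can be assembled programmatically. The Python
sketch below is illustrative only: the helper name
sip_uri_with_headers is hypothetical, and it percent-encodes header
values conservatively. Which characters actually require escaping is
governed by the SIP URI grammar in RFC 3261, which this sketch does
not attempt to reproduce.

```python
from urllib.parse import quote


def sip_uri_with_headers(user: str, host: str, method: str, headers: dict) -> str:
    """Build a SIP URI carrying a method parameter and embedded header fields."""
    # Percent-encode header values conservatively so that characters such as
    # ':' and '/' cannot be mistaken for URI delimiters by a lenient parser.
    query = "&".join(
        f"{name}={quote(value, safe='')}" for name, value in headers.items()
    )
    uri = f"sip:{user}@{host};method={method}"
    return f"{uri}?{query}" if query else uri
```

Calling the helper with the user "bob", the host "b.example.com", the
method "REFER", and a Refer-To header value yields a URI with the same
structure as the example above, differing only in the escaping of the
embedded value.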
Throughout this document, we discuss call control primitive
operations. One of the biggest problems is defining how these
operations may be invoked. There are a number of ways to do this.
One way is to define the primitives in the protocol itself such that
SIP methods (for example, REFER) or SIP header fields (for example,
Replaces) indicate a specific call control action. Another way to
invoke call control primitives is to define a specific Request-URI
naming convention. Either these conventions must be shared between
the client (the invoker) and the server, or published by or on behalf
of the server. The former involves defining URI construction
techniques (e.g., URI parameters and/or token conventions) as
proposed in [RFC4240]. The latter technique usually involves
discovering the URI via a SIP event package, a web page, a business
card, or an instant message. Yet another means to acquire the URIs
is to define a dictionary of primitives with well-defined semantics
and provide a means to query the named primitives and corresponding
URIs that may be invoked on the service or dialogs.
2.7.1. Naming Users in SIP
An address-of-record, or public SIP address, is a SIP (or Secure SIP
(SIPS)) URI that points to a domain with a location service that can
map the URI to a set of Contact URIs where the user might be available.
Typically, the Contact URIs are populated via registration.
   Address-of-Record                Contacts
   sip:bob@biloxi.example.com  ->   sip:bob@babylon.biloxi.example.com:5060
                                    sip:bbrown@mailbox.provider.example.net
                                    sip:+1.408.555.6789@mobile.example.net
Callee Capabilities [RFC3840] define a set of additional parameters
to the Contact header field that define the characteristics of the
user agent at the specified URI. For example, there is a mobility
parameter that indicates whether the UA is fixed or mobile. When a
user agent registers, it places these parameters in the Contact
header fields to characterize the URIs it is registering. This
allows a proxy for that domain to have information about the contact
addresses for that user.
When a caller sends a request, it can optionally request Caller
Preferences [RFC3841] by including the Accept-Contact, Request-
Disposition, and Reject-Contact header fields that request certain
handling by the proxy in the target domain. These header fields
contain preferences that describe the set of desired URIs to which
the caller would like their request routed. The proxy in the target
domain matches these preferences with the Contact characteristics
originally registered by the target user. The target user can also
choose to run arbitrarily complex "Find-me" feature logic on a proxy
in the target domain.
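The interplay between registered contact characteristics and caller
preferences can be pictured with the following Python sketch. It is a
deliberate simplification and does not implement the matching
algorithm of [RFC3841]: the LOCATION table, the select_contacts
function, and the plain dictionaries standing in for Accept-Contact
and Reject-Contact are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical registration data: AOR -> list of contacts with feature sets.
LOCATION: Dict[str, List[dict]] = {
    "sip:bob@biloxi.example.com": [
        {"uri": "sip:bob@babylon.biloxi.example.com:5060",
         "features": {"mobility": "fixed"}},
        {"uri": "sip:+1.408.555.6789@mobile.example.net",
         "features": {"mobility": "mobile"}},
    ],
}


def select_contacts(aor: str, accept: dict, reject: dict) -> List[str]:
    """Drop contacts matching `reject`; list contacts matching `accept` first."""
    def matches(features: dict, wanted: dict) -> bool:
        # An empty preference set matches nothing in this simplified model.
        return bool(wanted) and all(features.get(k) == v for k, v in wanted.items())

    kept = [c for c in LOCATION.get(aor, []) if not matches(c["features"], reject)]
    # list.sort() is stable, so contacts keep registration order within a group.
    kept.sort(key=lambda c: not matches(c["features"], accept))
    return [c["uri"] for c in kept]
```

A caller that rejects mobile contacts is routed only to the fixed
contact, while a caller that merely prefers mobile contacts has the
mobile contact tried first.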
There is a strong asymmetry in how preferences for callers and
callees can be presented to the network. While a caller takes an
active role by initiating the request, the callee takes a passive
role in waiting for requests. This motivates the use of callee-
supplied scripts and caller preferences included in the call request.
This asymmetry is also reflected in the appropriate relationship
between caller and callee preferences. A server for a callee should
respect the wishes of the caller to avoid certain locations, while
the preference among locations has to be the callee's choice, as it
determines where, for example, the phone rings and whether the callee
incurs mobile telephone charges for incoming calls.
SIP User Agent implementations are encouraged to make intelligent
decisions based on the type of participants (active/passive, hidden,
human/robot) in a conversation space. This information is conveyed
via the dialog package or in a SIP header field parameter
communicated using an appropriate SIP header field. For example, a
music on hold service may take the sensible approach that if there
are two or more unhidden participants, it should not provide hold
music; or that it will not send hold music to robots.
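The music on hold example can be expressed as a small decision
function. The Python sketch below is one possible reading that
applies both rules together; the Participant class and the
hold_music_targets function are hypothetical names, and how the
participant types are actually learned (dialog package or header
field parameter) is outside its scope.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Participant:
    name: str
    hidden: bool = False
    robot: bool = False


def hold_music_targets(participants: List[Participant]) -> List[str]:
    """Return the names of participants that should receive hold music."""
    unhidden = [p for p in participants if not p.hidden]
    # With two or more unhidden participants a conversation is still going
    # on, so no hold music is provided at all.
    if len(unhidden) >= 2:
        return []
    # Otherwise provide music, but never to robots.
    return [p.name for p in participants if not p.robot]
```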
Multiple participants in the same conversation space may represent
the same human user. For example, the user may use one participant
device for video, chat, and whiteboard media on a PC and another for
audio media on a SIP phone. In this case, the address-of-record is
the same for both user agents, but the Contacts are different. In
this case, there is really only one human participant. In addition,
human users may add robot participants that act on their behalf (for
example, a call recording service or a calendar announcement
reminder). Call control features in SIP should continue to function
as expected in such an environment.
2.7.2. Naming Services with SIP URIs
A critical piece of defining a session-level service that can be
accessed by SIP is defining the naming of the resources within that
service. This point cannot be overstated.
In the context of SIP control of application components, we take
advantage of the fact that the left-hand side of a standard SIP URI
is a user part. Most services may be thought of as user automatons
that participate in SIP sessions. It naturally follows that the user
part should be utilized as a service indicator.
For example, media servers commonly offer multiple services at a
single host address. Use of the user part as a service indicator
enables service consumers to direct their requests without ambiguity.
It has the added benefit of enabling media services to register their
availability with SIP Registrars just as any "real" SIP user would.
This maintains consistency and provides enhanced flexibility in the
deployment of media services in the network.
There has been much discussion about the potential for confusion if
media-service URIs are not readily distinguishable from other types
of SIP UAs. The use of a service namespace provides a mechanism to
unambiguously identify standard interfaces while not constraining the
development of private or experimental services.
In SIP, the Request-URI identifies the user or service for which the
call is destined. The great advantage of using URIs (specifically,
the SIP Request-URI) as a service identifier comes because of the
combination of two facts. First, unlike in the PSTN (Public Switched
Telephone Network), where the namespace (dialable telephone numbers)
is limited, URIs come from an infinite space. They are plentiful,
and they are free. Secondly, the primary function of SIP is call
routing through manipulations of the Request-URI. In the traditional
SIP application, this URI represents a person. However, the URI can
also represent a service, as we propose here. This means we can
apply the routing services SIP provides to the routing of calls to
services. The result -- the problem of service invocation and
service location becomes a routing problem, for which SIP provides a
scalable and flexible solution. Since there is such a vast namespace
of services, we can explicitly name each service in a finely granular
way. This allows the distribution of services across the network.
For further discussion about services and SIP URIs, see RFC 3087
[RFC3087].
Consider a conferencing service in which we have separated the names
of ad hoc conferences from those of scheduled conferences. We can
program proxies to route calls for ad hoc conferences to one set of
servers and calls for scheduled ones to another, possibly even with a
different provider.
In fact, since each conference itself is given a URI, we can
distribute conferences across servers, and easily guarantee that
calls for the same conference always get routed to the same server.
This is in stark contrast to conferences in the telephone network,
where the equivalent of the URI -- the phone number -- is scarce. An
entire conferencing provider generally has one or two numbers.
Conference IDs must be obtained through IVR interactions with the
caller or through a human attendant. This makes it difficult to
distribute conferences across servers all over the network, since the
PSTN routing only knows about the dialed number.
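To make the routing argument concrete, the following Python sketch
shows how a proxy operated by the conferencing provider might
distribute conference URIs across servers. The "adhoc-" and "sched-"
user-part prefixes, the server names, and the route_conference
function are assumptions invented for this example; the only property
it demonstrates is that calls for the same conference URI always
reach the same server.

```python
import hashlib
from typing import Dict, List

# Hypothetical naming convention and server pools, chosen by the operator
# that assigns the conference URIs.
POOLS: Dict[str, List[str]] = {
    "adhoc-": ["adhoc1.example.com", "adhoc2.example.com"],
    "sched-": ["sched1.example.net"],
}


def route_conference(request_uri: str) -> str:
    """Pick a server for a conference URI such as sip:adhoc-1234@example.com."""
    # Extract the user part: the text between the scheme and the '@'.
    user = request_uri.split(":", 1)[1].split("@", 1)[0]
    for prefix, servers in POOLS.items():
        if user.startswith(prefix):
            # A stable hash sends every call for the same URI to the same server.
            digest = hashlib.sha256(request_uri.encode("utf-8")).digest()
            return servers[int.from_bytes(digest[:4], "big") % len(servers)]
    raise LookupError(f"no conference pool for {request_uri}")
```

Because the server is derived from the Request-URI alone, no IVR
interaction is needed before the call can be routed.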
For more examples, consider the URI conventions of RFC 4240 [RFC4240]
for media servers and RFC 4458 [RFC4458] for voicemail and IVR
systems.
In practical applications, it is important that an invoker does not
necessarily apply semantic rules to various URIs it did not create.
Instead, it should allow any arbitrary string to be provisioned, and
map the string to the desired behavior. The administrator of a
service may choose to provision specific conventions or mnemonic
strings, but the application should not require it. In any large
installation, the system owner is likely to have preexisting rules
for mnemonic URIs, and any attempt by an application to define its
own rules may create a conflict. Implementations should allow an
arbitrary mix of URIs from these schemes, or any other scheme that
renders valid SIP URIs, rather than enforce only one particular
scheme.
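A minimal Python sketch of this recommendation follows. The invoker
keeps a provisioning table and treats each URI as an opaque string,
looking it up by exact match rather than parsing it for meaning. The
table contents and the behavior_for function are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical provisioning table: the administrator picks the strings freely,
# whether mnemonic ("park") or not ("q7x-9").
PROVISIONED: Dict[str, str] = {
    "sip:park@pbx.example.com": "park",
    "sip:q7x-9@media.example.net": "music-on-hold",
}


def behavior_for(uri: str) -> Optional[str]:
    """Return the provisioned behavior for a URI, treating the URI as opaque."""
    # Exact-match lookup only: no semantics are inferred from the user part.
    return PROVISIONED.get(uri)
```

A URI that merely looks mnemonic but was never provisioned maps to no
behavior, which avoids conflicts with the system owner's own naming
rules.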
As we have shown, SIP URIs represent an ideal, flexible mechanism for
describing and naming service resources, regardless of whether the
resources are queues, conferences, voice dialogs, announcements,
voicemail treatments, or phone features.
2.8. Invoker Independence
With functional signaling, only the invoker of features in SIP needs
to know exactly which feature they are invoking. One of the primary
benefits of this approach is that combinations of functional features
work in SIP call control without requiring complex feature-
interaction matrices. For example, let us examine the combination of
a "transfer" of a call that is "conferenced".
Alice calls Bob. Alice silently "conferences in" her robotic
assistant Albert as a hidden party. Bob transfers Alice to Carol.
If Bob asks Alice to Replace her leg with a new one to Carol, then
both Alice and Albert should be communicating with Carol
(transparently).
Using the peer-to-peer model, this combination of features works fine
if A is doing local mixing (Alice replaces Bob's dialog with
Carol's), or if A is using a central mixer (the mixer replaces Bob's
dialog with Carol's). A clever implementation using the 3pcc model
can generate similar results.
New extensions to the SIP Call Control Framework should attempt to
preserve this property.
2.9. Billing Issues
Billing in the PSTN is typically based on who initiated a call. At
the moment, billing in a SIP network is neither consistent with
itself nor with the PSTN. (A billing model for SIP should allow for
both PSTN-style billing and non-PSTN billing.) The example below
demonstrates one such inconsistency.
Alice places a call to Bob. Alice then blind transfers Bob to Carol
through a PSTN gateway. In current usage of REFER, Bob may be billed
for a call he did not initiate (his UA originated the outgoing
dialog, however). This is not necessarily a terrible thing, but it
demonstrates a security concern (Bob must have appropriate local
policy to prevent fraud). Also, Alice may wish to pay for Bob's
session with Carol. There should be a way to signal this in SIP.
Likewise, a Replacement call may maintain the same billing
relationship as a Replaced call, so if Alice first calls Carol, then
asks Bob to Replace this call, Alice may continue to receive a bill.
Further work in SIP billing should define a way to set or discover
the direction of billing.
3. Catalog of Call Control Actions and Sample Features
Call control actions can be categorized by the dialogs upon which
they operate. The actions may involve a single or multiple dialogs.
These dialogs can be early or established. Multiple dialogs may be
related in a conversation space to form a conference or other
interesting media topologies.
It should be noted that it is desirable to provide a means by which a
party can discover the actions that may be performed on a dialog.
The interested party may be independent or related to the dialogs.
One means of accomplishing this is through the ability to define and
obtain URIs for these actions, as described in Section 2.7.2.
Below are listed several call control "actions" that establish or
modify dialogs and relate the participants in a conversation space.
The names of the actions listed are for descriptive purposes only
(they are not normative). This list of actions is not meant to be
exhaustive.
In the examples, all actions are initiated by the user "Alice"
represented by UA "A".
3.1. Remote Call Control Actions on Early Dialogs
The following are a set of actions that may be performed on a single
early dialog. These actions can be thought of as a set of remote
control operations. For example, an automaton might perform the
operation on behalf of a user. Alternatively, a user might use the
remote control in the form of an application to perform the action on
the early dialog of a UA that may be out of reach. All of these
actions correspond to telling the UA how to respond to a request to
establish an early dialog. These actions provide useful
functionality for PDA-, PC-, and server-based applications that
desire the ability to control a UA. A proposed mechanism for this
type of functionality is described in remote call control
[FEATURE-REF].
3.1.1. Remote Answer
A dialog is in some early dialog state such as 180 Ringing. It may
be desirable to tell the UA to answer the dialog. That is, tell it
to send a 200 OK response to establish the dialog.
3.1.2. Remote Forward or Put
It may be desirable to tell the UA to respond with a 3xx class
response to forward an early dialog to another UA.
3.1.3. Remote Busy or Error Out
It may be desirable to instruct the UA to send an error response such
as 486 Busy Here.
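The three early-dialog actions above each amount to choosing the
response that the controlled UA sends. The Python sketch below
summarizes that mapping. The function name, the action labels, and
the choice of 302 as a representative 3xx response are assumptions;
the remote-control mechanism that would carry the instruction to the
UA is not modeled.

```python
from typing import Optional, Tuple


def early_dialog_response(action: str,
                          target: Optional[str] = None) -> Tuple[int, str, dict]:
    """Map a remote-control action to the response the controlled UA would send."""
    if action == "answer":
        return 200, "OK", {}
    if action == "forward":
        if not target:
            raise ValueError("forward needs a target URI")
        # 302 is used here as one representative of the 3xx class.
        return 302, "Moved Temporarily", {"Contact": f"<{target}>"}
    if action == "busy":
        return 486, "Busy Here", {}
    raise ValueError(f"unknown action: {action}")
```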
3.2. Remote Call Control Actions on Single Dialogs
There is another useful set of actions that operate on a single
established dialog. These operations are useful in building
productivity applications for aiding users in controlling their
phones. For example, a Customer Relationship Management (CRM)
application could set up calls for a user, eliminating the need for
the user to actually enter an address. These operations can also be
thought of as remote control actions. A proposed mechanism for this
type of functionality is described in remote call control
[FEATURE-REF].
3.2.1. Remote Dial
This action instructs the UA to initiate a dialog. This action can
be performed using the REFER method.
3.2.2. Remote On and Off Hold
This action instructs the UA to put an established dialog on hold or
to take it off hold.
Though this operation can conceptually be performed with the REFER
method, there are no semantics defined as to what the referred party
should do with the SDP. There is no way to distinguish between the
desire to go on or off hold on a per-media stream basis.
3.2.3. Remote Hangup
This action instructs the UA to terminate an early or established
dialog. A REFER request with the following Refer-To URI and Target-
Dialog header field [RFC4538] performs this action. Note: this
example does not show the full set of header fields.
REFER sip:carol@client.chicago.net SIP/2.0
Refer-To: sip:bob@babylon.biloxi.example.com;method=BYE
Target-Dialog: 13413098;local-tag=879738;remote-tag=023214
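For illustration, the Python sketch below assembles the request line
and the two header fields shown in this example. The function name
remote_hangup_refer and its parameters are hypothetical. It mirrors
the abbreviated layout of the example verbatim, including the
unbracketed Refer-To value, and omits the remaining header fields
that a complete REFER request needs.

```python
def remote_hangup_refer(ua_uri: str, peer_uri: str, call_id: str,
                        local_tag: str, remote_tag: str) -> str:
    """Return the request line and the two header fields of the example."""
    lines = [
        # The REFER is addressed to the UA that should hang up.
        f"REFER {ua_uri} SIP/2.0",
        # The referred-to URI asks that UA to send a BYE to its peer.
        f"Refer-To: {peer_uri};method=BYE",
        # Target-Dialog identifies which dialog the BYE applies to.
        f"Target-Dialog: {call_id};local-tag={local_tag};remote-tag={remote_tag}",
    ]
    return "\r\n".join(lines) + "\r\n"
```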
3.3. Call Control Actions on Multiple Dialogs
These actions apply to a set of related dialogs.
3.3.1. Transfer
This section describes how call transfer can be achieved using
centralized (3pcc) and peer-to-peer (REFER) approaches.
The conversation space changes as follows:
   before          after
   { A , B }  -->  { C , B }
A replaces itself with C.
To make this happen using the peer-to-peer approach, "A" would send
two SIP requests. A shorthand for those requests is shown below:
REFER B Refer-To:C
BYE B
To make this happen using the 3pcc approach instead, the controller
sends requests represented by the shorthand below:
INVITE C (w/SDP of B)
reINVITE B (w/SDP of C)
BYE A
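The two shorthand sequences and their common effect on the
conversation space can be modeled with the Python sketch below. It
only manipulates strings and sets, and the function names
transfer_requests and after_transfer are hypothetical. No SIP
messages are sent.

```python
from typing import FrozenSet, List


def transfer_requests(a: str, b: str, c: str, approach: str) -> List[str]:
    """Return the shorthand requests for A replacing itself with C."""
    if approach == "peer-to-peer":
        # Sent by A itself.
        return [f"REFER {b} Refer-To:{c}", f"BYE {b}"]
    if approach == "3pcc":
        # Sent by the controller.
        return [f"INVITE {c} (w/SDP of {b})",
                f"reINVITE {b} (w/SDP of {c})",
                f"BYE {a}"]
    raise ValueError(f"unknown approach: {approach}")


def after_transfer(space: FrozenSet[str], a: str, c: str) -> FrozenSet[str]:
    """Conversation space once A has been replaced by C."""
    if a not in space:
        raise ValueError(f"{a} is not in the conversation space")
    return (space - {a}) | {c}
```

Whichever approach is used, the resulting conversation space is the
same: B remains, A has left, and C has joined.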
Features enabled by this action:
- blind transfer
- transfer to a central mixer (some type of conference or forking)
- transfer to park server (park)
- transfer to music on hold or announcement server
- transfer to a "queue"
- transfer to a service (such as voice-dialog service)
- transition from local mixer to central mixer
This action is frequently referred to as "completing an attended
transfer". It is described in more detail in [RFC5589].
Note that if a transfer requires URI hiding or privacy, then the 3pcc
approach can more easily implement this. For example, if the URI of
C needs to be hidden from B, then the use of 3pcc helps accomplish
this.
3.3.2. Take
The conversation space changes as follows:
{ B , C } --> { B , A }
A forcibly replaces C with itself. In most uses of this primitive, A
is just "un-replacing" itself.
Using the peer-to-peer approach, "A" sends:
INVITE B Replaces: