Network Working Group                                    J. van der Meer
Request for Comments: 3640                           Philips Electronics
Category: Standards Track                                      D. Mackie
                                                         Apple Computer
                                                         V. Swaminathan
                                                  Sun Microsystems Inc.
                                                              D. Singer
                                                         Apple Computer
                                                             P. Gentric
                                                    Philips Electronics
                                                          November 2003


    RTP Payload Format for Transport of MPEG-4 Elementary Streams

Status of this Memo

  This document specifies an Internet standards track protocol for the
  Internet community, and requests discussion and suggestions for
  improvements.  Please refer to the current edition of the "Internet
  Official Protocol Standards" (STD 1) for the standardization state
  and status of this protocol.  Distribution of this memo is unlimited.

Copyright Notice

  Copyright (C) The Internet Society (2003).  All Rights Reserved.

Abstract

  The Moving Picture Experts Group (MPEG) Committee (ISO/IEC JTC1/SC29
  WG11) is a working group in ISO that produced the MPEG-4 standard.
  MPEG defines tools to compress content such as audio-visual
  information into elementary streams.  This specification defines a
  simple but generic RTP payload format for transport of any
  non-multiplexed MPEG-4 elementary stream.

Table of Contents

  1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
  2.  Carriage of MPEG-4 Elementary Streams Over RTP . . . . . . . .  4
      2.1.  Signaling by MIME Format Parameters  . . . . . . . . . .  4
      2.2.  MPEG Access Units  . . . . . . . . . . . . . . . . . . .  5
      2.3.  Concatenation of Access Units  . . . . . . . . . . . . .  5
      2.4.  Fragmentation of Access Units  . . . . . . . . . . . . .  6
      2.5.  Interleaving . . . . . . . . . . . . . . . . . . . . . .  6
      2.6.  Time Stamp Information . . . . . . . . . . . . . . . . .  7
      2.7.  State Indication of MPEG-4 System Streams  . . . . . . .  8
      2.8.  Random Access Indication . . . . . . . . . . . . . . . .  8



van der Meer, et al.        Standards Track                     [Page 1]

RFC 3640         Transport of MPEG-4 Elementary Streams    November 2003


      2.9.  Carriage of Auxiliary Information  . . . . . . . . . . .  8
      2.10. MIME Format Parameters and Configuring Conditional Field  8
      2.11. Global Structure of Payload Format . . . . . . . . . . .  9
      2.12. Modes to Transport MPEG-4 Streams  . . . . . . . . . . .  9
      2.13. Alignment with RFC 3016  . . . . . . . . . . . . . . . . 10
  3.  Payload Format . . . . . . . . . . . . . . . . . . . . . . . . 10
      3.1.  Usage of RTP Header Fields and RTCP  . . . . . . . . . . 10
      3.2.  RTP Payload Structure  . . . . . . . . . . . . . . . . . 11
            3.2.1.  The AU Header Section  . . . . . . . . . . . . . 11
                    3.2.1.1.  The AU-header  . . . . . . . . . . . . 12
            3.2.2.  The Auxiliary Section  . . . . . . . . . . . . . 14
            3.2.3.  The Access Unit Data Section . . . . . . . . . . 15
                    3.2.3.1.  Fragmentation. . . . . . . . . . . . . 16
                    3.2.3.2.  Interleaving . . . . . . . . . . . . . 16
                    3.2.3.3.  Constraints for Interleaving . . . . . 17
                    3.2.3.4.  Crucial and Non-Crucial AUs with
                              MPEG-4 System Data . . . . . . . . . . 20
      3.3.  Usage of this Specification. . . . . . . . . . . . . . . 21
            3.3.1.  General. . . . . . . . . . . . . . . . . . . . . 21
            3.3.2.  The Generic Mode . . . . . . . . . . . . . . . . 22
            3.3.3.  Constant Bit Rate CELP . . . . . . . . . . . . . 22
            3.3.4.  Variable Bit Rate CELP . . . . . . . . . . . . . 23
            3.3.5.  Low Bit Rate AAC . . . . . . . . . . . . . . . . 24
            3.3.6.  High Bit Rate AAC. . . . . . . . . . . . . . . . 25
            3.3.7.  Additional Modes . . . . . . . . . . . . . . . . 26
  4.  IANA Considerations. . . . . . . . . . . . . . . . . . . . . . 27
      4.1.  MIME Type Registration . . . . . . . . . . . . . . . . . 27
      4.2.  Registration of Mode Definitions with IANA . . . . . . . 33
      4.3.  Concatenation of Parameters. . . . . . . . . . . . . . . 33
      4.4.  Usage of SDP . . . . . . . . . . . . . . . . . . . . . . 34
            4.4.1.  The a=fmtp Keyword . . . . . . . . . . . . . . . 34
  5.  Security Considerations. . . . . . . . . . . . . . . . . . . . 34
  6.  Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 35
  APPENDIX: Usage of this Payload Format. . .  . . . . . . . . . . . 36
  Appendix A.  Interleave Analysis . . . . . . . . . . . . . . . . . 36
  A.  Examples of Delay Analysis with Interleave. . .  . . . . . . . 36
      A.1.  Introduction . . . . . . . . . . . . . . . . . . . . . . 36
      A.2.  De-interleaving and Error Concealment  . . . . . . . . . 36
      A.3.  Simple Group Interleave  . . . . . . . . . . . . . . . . 36
            A.3.1.  Introduction . . . . . . . . . . . . . . . . . . 36
            A.3.2.  Determining the De-interleave Buffer Size  . . . 37
            A.3.3.  Determining the Maximum Displacement . . . . . . 37
      A.4.  More Subtle Group Interleave . . . . . . . . . . . . . . 38
            A.4.1.  Introduction . . . . . . . . . . . . . . . . . . 38
            A.4.2.  Determining the De-interleave Buffer Size. . . . 38
            A.4.3.  Determining the Maximum Displacement . . . . . . 39
      A.5.  Continuous Interleave  . . . . . . . . . . . . . . . . . 39
            A.5.1.  Introduction . . . . . . . . . . . . . . . . . . 39
            A.5.2.  Determining the De-interleave Buffer Size  . . . 40
            A.5.3.  Determining the Maximum Displacement . . . . . . 40
  References . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
  Normative References . . . . . . . . . . . . . . . . . . . . . . . 41
  Informative References . . . . . . . . . . . . . . . . . . . . . . 41
  Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 42
  Full Copyright Statement . . . . . . . . . . . . . . . . . . . . . 43

1.  Introduction

  The MPEG Committee is Working Group 11 (WG11) of ISO/IEC JTC1 SC29,
  which specified the MPEG-1, MPEG-2 and, more recently, the MPEG-4
  standards [1].  The MPEG-4 standard specifies compression of
  audio-visual data into, for example, an audio or video elementary
  stream.  In the MPEG-4 standard, these streams take the form of
  audio-visual objects that may be arranged into an audio-visual scene
  by means of a scene description.  Each MPEG-4 elementary stream
  consists of a sequence of Access Units; examples of an Access Unit
  (AU) are an audio frame and a video picture.

  This specification defines a general and configurable payload
  structure to transport MPEG-4 elementary streams, in particular
  MPEG-4 audio (including speech) streams, MPEG-4 video streams and
  also MPEG-4 systems streams, such as BIFS (BInary Format for Scenes),
  OCI (Object Content Information), OD (Object Descriptor) and IPMP
  (Intellectual Property Management and Protection) streams.  The RTP
  payload defined in this document is simple to implement and
  reasonably efficient.  It allows for optional interleaving of Access
  Units (such as audio frames) to increase resilience to packet loss.

  Some types of MPEG-4 elementary streams include "crucial" information
  whose loss cannot be tolerated.  However, RTP does not provide
  reliable transmission, so receipt of that crucial information is not
  assured.  Section 3.2.3.4 specifies how stream state is conveyed so
  that the receiver can detect the loss of crucial information and
  cease decoding until the next random access point has been received.
  Applications transmitting streams that include crucial information,
  such as OD commands, BIFS commands, or programmatic content such as
  MPEG-J (Java) and ECMAScript, should include random access points, at
  a suitable periodicity depending upon the probability of loss, in
  order to reduce stream corruption to an acceptable level.  An example
  is the carousel mechanism as defined by MPEG in ISO/IEC 14496-1 [1].

  Such applications may also employ additional protocols or services to
  reduce the probability of loss.  At the RTP layer, these measures
  include payload formats and profiles for retransmission or forward
  error correction (such as in RFC 2733 [10]), that must be employed
  with due consideration to congestion control.  Another solution that
  may be appropriate for some applications is to carry RTP over TCP
  (such as in RFC 2326 [8], section 10.12).  At the network layer,
  resource allocation or preferential service may be available to
  reduce the probability of loss.  For a general description of methods
  to repair streaming media, see RFC 2354 [9].

  Though the RTP payload format defined in this document is capable of
  transporting any MPEG-4 stream, other, more specific, formats may
  exist, such as RFC 3016 [12] for transport of MPEG-4 video (ISO/IEC
  14496 [1] part 2).

  Configuration of the payload is provided to accommodate the
  transportation of any MPEG-4 stream at any possible bit rate.
  However, for a specific MPEG-4 elementary stream typically only very
  few configurations are needed.  So as to allow for the design of
  simplified, but dedicated receivers, this specification requires that
  specific modes be defined for transport of MPEG-4 streams.  This
  document defines modes for MPEG-4 CELP and AAC streams, as well as a
  generic mode that can be used to transport any MPEG-4 stream.  In the
  future, new RFCs are expected to specify additional modes for the
  transportation of MPEG-4 streams.

  The RTP payload format defined in this document specifies carriage of
  system-related information that is often equivalent to the
  information that may be contained in the MPEG-4 Sync Layer (SL) as
  defined in MPEG-4 Systems [1].  This document does not prescribe how
  to transcode or map information from the SL to fields defined in the
  RTP payload format.  Such processing, if any, is left to the
  discretion of the application.  However, to anticipate the need for
  the transportation of any additional system-related information in
  the future, an auxiliary field can be configured that may carry any
  such data.

  The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
  "SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in this
  document are to be interpreted as described in BCP 14, RFC 2119 [4].

2.  Carriage of MPEG-4 Elementary Streams over RTP

2.1.  Signaling by MIME Format Parameters

  With this payload format, a single MPEG-4 elementary stream can be
  transported.  Information on the type of MPEG-4 stream carried in the
  payload is conveyed by MIME format parameters, as in an SDP [5]
  message or by other means (see section 4).  These MIME format
  parameters specify the configuration of the payload.  To allow for
  simplified and dedicated receivers, a MIME format parameter is
  available to signal a specific mode of using this payload.  A mode
  definition MAY include the type of MPEG-4 elementary stream, as well
  as the applied configuration, so as to avoid the need for receivers
  to parse all MIME format parameters.  The applied mode MUST be
  signaled.

2.2.  MPEG Access Units

  For carriage of compressed audio-visual data, MPEG defines Access
  Units.  An MPEG Access Unit (AU) is the smallest data entity to which
  timing information is attributed.  In the case of audio, an Access
  Unit may represent an audio frame and in the case of video, a
  picture.  MPEG Access Units are octet-aligned by definition.  If, for
  example, an audio frame is not octet-aligned, up to 7 zero-padding
  bits MUST be inserted at the end of the frame to achieve
  octet-aligned Access Units, as required by the MPEG-4 specification.
  MPEG-4 decoders MUST be able to decode AUs in which such padding is
  applied.

  Consistent with the MPEG-4 specification, this document requires that
  each MPEG-4 part 2 video Access Unit include all the coded data of a
  picture, any video stream headers that may precede the coded picture
  data, and any video stream stuffing that may follow it, up to but not
  including the startcode indicating the start of a new video stream or
  the next Access Unit.

2.3.  Concatenation of Access Units

  Frequently it is possible to carry multiple Access Units in one RTP
  packet.  This is particularly useful for audio; for example, when AAC
  is used for encoding a stereo signal at 64 kbit/s, AAC frames
  contain, on average, approximately 200 octets.  On a LAN with a 1500
  octet MTU, this would allow an average of 7 complete AAC frames to be
  carried per RTP packet.
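
  The figure of 7 frames per packet can be checked with a rough octet
  budget.  The sketch below is illustrative only: it assumes IPv4
  without options (20 octets), UDP (8 octets), a 12-octet RTP header
  without CSRC entries, a 2-octet AU-headers-length field and a
  2-octet AU-header per frame.  None of these sizes is stated in this
  section, and all names are hypothetical.

```python
# Rough packing budget for average-size AAC frames on a 1500-octet MTU.
# Every overhead value below is an assumption made for illustration.
MTU = 1500
IPV4_HEADER = 20              # assumed: IPv4 without options
UDP_HEADER = 8
RTP_HEADER = 12               # assumed: no CSRC list, no header extension
AU_HEADERS_LENGTH_FIELD = 2   # two-octet AU-headers-length field
AU_HEADER = 2                 # assumed: 16-bit AU-header per frame
AVERAGE_AAC_FRAME = 200       # octets, the average quoted in the text


def frames_per_packet(mtu: int = MTU, frame_size: int = AVERAGE_AAC_FRAME) -> int:
    """Return how many complete frames of frame_size fit in one packet."""
    budget = mtu - IPV4_HEADER - UDP_HEADER - RTP_HEADER - AU_HEADERS_LENGTH_FIELD
    return budget // (frame_size + AU_HEADER)


print(frames_per_packet())
```

  Under these assumptions the budget is 1458 octets and each frame
  costs 202 octets, which gives 7 complete frames.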

  Access Units may have a fixed size in octets, but a variable size is
  also possible.  To facilitate parsing in the case of multiple
  concatenated AUs in one RTP packet, the size of each AU is made known
  to the receiver.  When concatenating in the case of a constant AU
  size, this size is communicated "out of band" through a MIME format
  parameter.  When concatenating in case of variable size AUs, the RTP
  payload carries "in band" an AU size field for each contained AU.

  In combination with the RTP payload length, the size information
  allows the RTP payload to be split by the receiver back into the
  individual AUs.
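
  As a minimal sketch of this splitting step, the functions below
  recover the individual AUs from the AU data once the sizes are
  known.  The function names are hypothetical, and the sketch assumes
  the packet carries complete AUs only (no fragment).

```python
def split_access_units(au_data: bytes, sizes: list[int]) -> list[bytes]:
    """Split AU data using one size per AU, as signaled in band."""
    if any(size <= 0 for size in sizes) or sum(sizes) != len(au_data):
        raise ValueError("sizes do not match the amount of AU data")
    units, offset = [], 0
    for size in sizes:
        units.append(au_data[offset:offset + size])
        offset += size
    return units


def split_constant_size(au_data: bytes, constant_size: int) -> list[bytes]:
    """Split AU data when every AU has the same out-of-band size."""
    if constant_size <= 0 or len(au_data) % constant_size != 0:
        raise ValueError("AU data is not an integral number of AUs")
    count = len(au_data) // constant_size
    return split_access_units(au_data, [constant_size] * count)
```

  For constant-size AUs the AU count follows from the payload length
  alone; for variable-size AUs the in-band sizes must add up to it.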





  To simplify the implementation of RTP receivers, it is required that
  when multiple AUs are carried in an RTP packet, each AU MUST be
  complete, i.e., the number of AUs in an RTP packet MUST be integral.

  In addition, an AU MUST NOT be repeated in other RTP packets; hence
  repetition of an AU is only possible when using a duplicate RTP
  packet.

2.4.  Fragmentation of Access Units

  MPEG allows for very large Access Units.  Since most IP networks have
  significantly smaller MTU sizes, this payload format allows for the
  fragmentation of an Access Unit over multiple RTP packets.  A lost
  packet then costs one AU fragment rather than the entire AU, as it
  would after IP-level fragmentation.  To simplify the
  implementation of RTP receivers, an RTP packet SHALL either carry one
  or more complete Access Units or a single fragment of one AU, i.e.,
  packets MUST NOT contain fragments of multiple Access Units.
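
  The packing rule can be sketched as a simple sender-side loop.  This
  is one possible strategy, not a prescribed algorithm: the name
  packetize and the max_data limit (octets of AU data per packet,
  ignoring header overhead) are assumptions for illustration.

```python
def packetize(access_units: list[bytes], max_data: int) -> list[tuple[str, list[bytes]]]:
    """Group AUs into packets holding complete AUs or a single fragment.

    Each result is ("complete", [au, ...]) or ("fragment", [piece]).
    """
    if max_data <= 0:
        raise ValueError("max_data must be positive")
    packets: list[tuple[str, list[bytes]]] = []
    pending: list[bytes] = []
    pending_size = 0
    for au in access_units:
        # Close the current packet if this AU cannot be added whole.
        if pending and (len(au) > max_data or pending_size + len(au) > max_data):
            packets.append(("complete", pending))
            pending, pending_size = [], 0
        if len(au) > max_data:
            # Oversized AU: one fragment per packet, never mixed with other AUs.
            for offset in range(0, len(au), max_data):
                packets.append(("fragment", [au[offset:offset + max_data]]))
        else:
            pending.append(au)
            pending_size += len(au)
    if pending:
        packets.append(("complete", pending))
    return packets
```

  No packet produced this way mixes a fragment with other data, so a
  receiver only has to handle the two cases named above.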

2.5.  Interleaving

  When an RTP packet carries a contiguous sequence of Access Units, the
  loss of such a packet can result in a "decoding gap" for the user.
  One method of alleviating this problem is to allow for the Access
  Units to be interleaved in the RTP packets.  For a modest cost in
  latency and implementation complexity, significant resilience to
  packet loss can be achieved.

  To support optional interleaving of Access Units, this payload format
  allows for index information to be sent for each Access Unit.  After
  informing receivers about buffer resources to allocate for de-
  interleaving, the RTP sender is free to choose the interleaving
  pattern without propagating this information a priori to the
  receiver(s).  Indeed, the sender could dynamically adjust the
  interleaving pattern based on the Access Unit size, error rates, etc.
  The RTP receiver does not need to know the interleaving pattern used;
  it only needs to extract the index information of the Access Unit and
  insert the Access Unit into the appropriate sequence in the decoding
  or rendering queue.  An example of interleaving is given below.

  For example, if we assume that an RTP packet contains 3 AUs, and that
  the AUs are numbered 0, 1, 2, 3, 4, and so forth, and if an
  interleaving group length of 9 is chosen, then RTP packet(i) contains
  the following AU(n):

     RTP packet(0):  AU(0),  AU(3),  AU(6)
     RTP packet(1):  AU(1),  AU(4),  AU(7)
     RTP packet(2):  AU(2),  AU(5),  AU(8)
     RTP packet(3):  AU(9),  AU(12), AU(15)
     RTP packet(4):  AU(10), AU(13), AU(16)  Etc.
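
  The pattern above can be reproduced with a short sketch.  The
  function names are hypothetical, and the sketch assumes that the
  group length is a multiple of the number of AUs per packet and that
  the total number of AUs is a multiple of the group length.

```python
def group_interleave(num_aus: int, aus_per_packet: int = 3,
                     group_length: int = 9) -> list[list[int]]:
    """Return, per RTP packet, the AU numbers it carries."""
    if group_length % aus_per_packet or num_aus % group_length:
        raise ValueError("sizes must divide evenly in this sketch")
    stride = group_length // aus_per_packet  # packets per interleave group
    packets = []
    for group_start in range(0, num_aus, group_length):
        for p in range(stride):
            packets.append([group_start + p + k * stride
                            for k in range(aus_per_packet)])
    return packets


def deinterleave(packets: list[list[int]]) -> list[int]:
    """Receiver side: restore decoding order from the AU numbers alone."""
    return sorted(au for packet in packets for au in packet)


for i, packet in enumerate(group_interleave(18)):
    print(f"RTP packet({i}):", packet)
```

  The receiver never needs the pattern itself; sorting by AU number is
  enough.  If packet(1) is lost, AUs 1, 4 and 7 are missing, so no two
  adjacent AUs are lost together.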

2.6.  Time Stamp Information

  The RTP time stamp MUST carry the sampling instant of the first AU
  (fragment) in the RTP packet.  When multiple AUs are carried within
  an RTP packet, the time stamps of subsequent AUs can be calculated if
  the frame period of each AU is known.  For audio and video, this is
  possible if the frame rate is constant.  However, in some cases it is
  not possible to make such a calculation (for example, for variable
  frame rate video, or for MPEG-4 BIFS streams carrying composition
  information).  To support such cases, this payload format can be
  configured to carry a time stamp in the RTP payload for each
  contained Access Unit.  A time stamp MAY be conveyed in the RTP
  payload only for non-first AUs in the RTP packet, and SHALL NOT be
  conveyed for the first AU (fragment), as the time stamp for the first
  AU in the RTP packet is carried by the RTP time stamp.
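
  For the constant frame period case, the calculation can be sketched
  as follows.  The sketch assumes that the AUs in the packet are
  consecutive (no interleaving), that the frame period is expressed in
  RTP clock ticks, and that time stamps wrap at 32 bits; the function
  name is hypothetical.

```python
def au_timestamps(rtp_timestamp: int, au_count: int, frame_period: int) -> list[int]:
    """Time stamp of each AU in a packet of consecutive AUs.

    frame_period is in RTP clock ticks, e.g. 1024 for an audio frame of
    1024 samples when the RTP clock runs at the sampling rate.
    """
    return [(rtp_timestamp + i * frame_period) & 0xFFFFFFFF
            for i in range(au_count)]


print(au_timestamps(1000, 3, 1024))
```

  The first entry always equals the RTP time stamp itself, which is why
  no in-band time stamp is needed for the first AU.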

  MPEG-4 defines two types of time stamps: the composition time stamp
  (CTS) and the decoding time stamp (DTS).  The CTS represents the
  sampling instant of an AU, and hence the CTS is equivalent to the RTP
  time stamp.  The DTS may be used in MPEG-4 video streams that use
  bi-directional coding, i.e., when pictures are predicted in both
  forward and backward direction by using either a reference picture in
  the past, or a reference picture in the future.  The DTS cannot be
  carried in the RTP header.  In some cases, the DTS can be derived
  from the RTP time stamp using frame rate information; this requires
  deep parsing in the video stream, which may be considered
  objectionable.  If the video frame rate is variable, the required
  information may not even be present in the video stream.  For both
  reasons, the capability has been defined to optionally carry the DTS
  in the RTP payload for each contained Access Unit.

  To keep the coding of time stamps efficient, each time stamp
  contained in the RTP payload is coded as a difference.  For the CTS,
  the offset from the RTP time stamp is provided, and for the DTS, the
  offset from the CTS.
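
  A minimal sketch of how a receiver might resolve such differences is
  given below.  It assumes that each offset is a two's complement value
  of a configured bit width that is added to its reference; the sign
  convention and field widths are assumptions here and should be
  checked against the field definitions in section 3.2.1.1.  All names
  are hypothetical.

```python
def sign_extend(raw: int, bits: int) -> int:
    """Interpret an unsigned field of the given width as two's complement."""
    sign_bit = 1 << (bits - 1)
    return (raw ^ sign_bit) - sign_bit


def apply_offset(reference: int, raw_offset: int, bits: int) -> int:
    """Add a signed offset to a 32-bit time stamp, with wrap-around."""
    return (reference + sign_extend(raw_offset, bits)) & 0xFFFFFFFF


rtp_timestamp = 90000
cts = apply_offset(rtp_timestamp, 3000, 16)   # CTS relative to the RTP time stamp
dts = apply_offset(cts, 0xF448, 16)           # DTS relative to the CTS (-3000 here)
print(cts, dts)
```

  Coding offsets rather than full 32-bit values keeps the fields short
  when the time stamps of the AUs in one packet lie close together.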





2.7.  State Indication of MPEG-4 System Streams

  ISO/IEC 14496-1 defines states for MPEG-4 system streams.  So as to
  convey state information when transporting MPEG-4 system streams,
  this payload format allows for the optional carriage in the RTP
  payload of the stream state for each contained Access Unit.  Stream
  states are used to signal "crucial" AUs that carry information whose
  loss cannot be tolerated and are also useful when repeating AUs
  according to the carousel mechanism defined in ISO/IEC 14496-1.

2.8.  Random Access Indication

  Random access to the content of MPEG-4 elementary streams may be
  possible at some but not all Access Units.  To signal Access Units
  where random access is possible, a random access point flag can
  optionally be carried in the RTP payload for each contained Access
  Unit.  Carriage of random access points is particularly useful for
  MPEG-4 system streams in combination with the stream state.

2.9.  Carriage of Auxiliary Information

  This payload format defines a specific field to carry auxiliary data.
  The auxiliary data field is preceded by a field that specifies the
  length of the auxiliary data, so as to facilitate the skipping of
  data without parsing it.  The coding of the auxiliary data is not
  defined in this document; instead, the format, meaning and signaling
  of auxiliary information is expected to be specified in one or more
  future RFCs.  Auxiliary information MUST NOT be transmitted until its
  format, meaning and signaling have been specified and its use has
  been signaled.  Receivers that have knowledge of the auxiliary data
  MAY decode the auxiliary data, but receivers without knowledge of
  such data MUST skip the auxiliary data field.
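
  Skipping can be sketched as below.  The sketch assumes details that
  this section does not state: that the length field has a configured
  width in bits, that it gives the auxiliary data length in bits, and
  that the section is padded to a whole number of octets.  The function
  name is hypothetical.

```python
def auxiliary_section_octets(data: bytes, size_field_bits: int) -> int:
    """Return how many octets to skip to pass over the auxiliary section.

    data starts at the first octet of the section; size_field_bits is the
    configured width of the length field (0 means the section is absent).
    """
    if size_field_bits == 0:
        return 0
    needed = (size_field_bits + 7) // 8
    if len(data) < needed:
        raise ValueError("truncated auxiliary length field")
    prefix = int.from_bytes(data[:needed], "big")
    aux_bits = prefix >> (needed * 8 - size_field_bits)
    total = (size_field_bits + aux_bits + 7) // 8  # round up to whole octets
    if total > len(data):
        raise ValueError("auxiliary data exceeds the payload")
    return total
```

  A receiver without knowledge of the auxiliary data would simply
  resume parsing at the returned offset.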

2.10.  MIME Format Parameters and Configuring Conditional Fields

  To support the features described in the previous sections, several
  fields are defined for carriage in the RTP payload.  However, their
  use strongly depends on the type of MPEG-4 elementary stream that is
  carried.  Sometimes a specific field is needed with a certain length,
  while in other cases such a field is not needed.  To be efficient in
  either case, the fields to support these features are configurable by
  means of MIME format parameters.  In general, a MIME format parameter
  defines the presence and length of the associated field.  A length of
  zero indicates absence of the field.  As a consequence, parsing of
  the payload requires knowledge of MIME format parameters.  The MIME
  format parameters are conveyed to the receiver via SDP [5] messages,
  as specified in section 4.4.1, or through other means.
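
  As an illustration, the sketch below extracts field lengths from an
  SDP "a=fmtp" line and treats a missing parameter as length zero.  The
  example line is illustrative only, the helper names are hypothetical,
  and the parameter names are those registered in section 4.1.  Names
  are lower-cased for lookup on the assumption that they compare
  case-insensitively.

```python
def parse_fmtp(line: str) -> dict[str, str]:
    """Parse 'a=fmtp:<pt> name=value; name=value' into a dictionary."""
    _, _, params = line.partition(" ")
    result = {}
    for item in params.split(";"):
        name, sep, value = item.strip().partition("=")
        if sep:
            result[name.lower()] = value
    return result


def field_length(params: dict[str, str], name: str) -> int:
    """Length in bits of a configurable field; 0 means the field is absent."""
    return int(params.get(name.lower(), "0"))


line = "a=fmtp:96 mode=AAC-hbr; sizeLength=13; indexLength=3; indexDeltaLength=3"
params = parse_fmtp(line)
print(field_length(params, "sizeLength"), field_length(params, "CTSDeltaLength"))
```

  With this configuration each AU-header is 16 bits long, and the
  CTS-delta field, whose length is not signaled, is absent.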




2.11.  Global Structure of Payload Format

  The RTP payload following the RTP header contains three
  octet-aligned data sections, of which the first two MAY be empty
  (see Figure 1).

        +---------+-----------+-----------+---------------+
        | RTP     | AU Header | Auxiliary | Access Unit   |
        | Header  | Section   | Section   | Data Section  |
        +---------+-----------+-----------+---------------+

                  <----------RTP Packet Payload----------->

           Figure 1: Data sections within an RTP packet

  The first data section is the AU (Access Unit) Header Section, which
  contains one or more AU-headers; however, each AU-header MAY be
  empty, in which case the entire AU Header Section is empty.  The
  second section is the Auxiliary Section, containing auxiliary data;
  this section MAY also be configured empty.  The third section is the
  Access Unit Data Section, containing either a single fragment of one
  Access Unit or one or more complete Access Units.  The Access Unit
  Data Section MUST NOT be empty.

2.12.  Modes to Transport MPEG-4 Streams

  While it is possible to build fully configurable receivers capable of
  receiving any MPEG-4 stream, this specification also allows for the
  design of simplified but dedicated receivers that are, for example,
  capable of receiving only one type of MPEG-4 stream.  This is
  achieved by requiring that specific modes be defined in order to use
  this specification.  Each mode may define constraints for transport
  of one or more types of MPEG-4 streams, for instance on the payload
  configuration.

  The applied mode MUST be signaled.  Signaling the mode is
  particularly important for receivers that are only capable of
  decoding one or more specific modes.  Such receivers need to
  determine whether the applied mode is supported, so as to avoid
  problems with processing of payloads that are beyond the capabilities
  of the receiver.

  In this document several modes are defined for the transportation of
  MPEG-4 CELP and AAC streams, as well as a generic mode that can be
  used for any MPEG-4 stream.  In the future, new RFCs may specify
  other modes of using this specification.  However, each mode MUST be
  in full compliance with this specification (see section 3.3.7).




2.13.  Alignment with RFC 3016

  This payload can be configured as nearly identical to the payload
  format defined in RFC 3016 [12] for the MPEG-4 video configurations
  recommended in RFC 3016.  Hence, receivers that comply with RFC 3016
  can decode such RTP payloads, provided that additional packets
  containing video decoder configuration (VO, VOL, VOSH) are inserted
  in the stream, as required by RFC 3016 [12].  Conversely, receivers
  that comply with the specification in this document SHOULD be able to
  decode payloads, names and parameters defined for MPEG-4 video in RFC
  3016 [12].  In this respect, it is strongly RECOMMENDED that the
  implementation provide the ability to ignore "in band" video decoder
  configuration packets that may be found in streams conforming to the
  RFC 3016 video payload.

  Note that the "out of band" availability of the video decoder
  configuration is optional in RFC 3016 [12].  To achieve maximum
  interoperability with the RTP payload format defined in this
  document, applications that use RFC 3016 to transport MPEG-4 video
  (part 2) are recommended to make the video decoder configuration
  available as a MIME parameter.

3.  Payload Format

3.1.  Usage of RTP Header Fields and RTCP

  Payload Type (PT): The assignment of an RTP payload type for this
     packet format is outside the scope of this document; it is
     specified by the RTP profile under which this payload format is
     used, or signaled dynamically out-of-band (e.g., using SDP).

  Marker (M) bit: The M bit is set to 1 to indicate that the RTP packet
     payload contains either the final fragment of a fragmented Access
     Unit or one or more complete Access Units.

  Extension (X) bit: Defined by the RTP profile used.

  Sequence Number: The RTP sequence number SHOULD be generated by the
     sender in the usual manner with a constant random offset.

  Timestamp: Indicates the sampling instant of the first AU contained
     in the RTP payload.  This sampling instant is equivalent to the
     CTS in the MPEG-4 time domain.  When using SDP, the clock rate of
     the RTP time stamp MUST be expressed using the "rtpmap" attribute.
     If an MPEG-4 audio stream is transported, the rate SHOULD be set
     to the same value as the sampling rate of the audio stream.  If an
     MPEG-4 video stream is transported, it is RECOMMENDED that the
     rate be set to 90 kHz.



  In all cases, the sender SHALL make sure that RTP time stamps are
  identical only if the RTP time stamp refers to fragments of the same
  Access Unit.

  According to RFC 3550 [2] (section 5.1), it is RECOMMENDED that RTP
  time stamps start at a random value for security reasons.  This is
  not an issue for synchronization of multiple RTP streams.  However,
  when streams from multiple sources are to be synchronized (for
  example one stream from local storage, another from an RTP streaming
  server), synchronization may become impossible if the receiver only
  knows the original time stamp relationships.  In such cases the time
  stamp relationship required for obtaining synchronization may be
  provided by out of band means.  The format of such information, as
  well as methods to convey such information, are beyond the scope of
  this specification.

  SSRC: set as described in RFC 3550 [2].

  CC and CSRC fields are used as described in RFC 3550 [2].

  RTCP SHOULD be used as defined in RFC 3550 [2].  Note that time
  stamps in RTCP Sender Reports may be used to synchronize multiple
  MPEG-4 elementary streams and also to synchronize MPEG-4 streams with
  non-MPEG-4 streams, in case the delivery of these streams uses RTP.
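
  The Marker bit rule from this section can be sketched together with
  the fixed RTP header layout of RFC 3550 [2].  The sketch assumes no
  padding, no header extension and no CSRC entries; the function names
  and the payload type value 96 are illustrative.

```python
import struct


def marker_bit(is_fragment: bool, is_final_fragment: bool = False) -> int:
    """M is 1 for complete AUs and for the final fragment of an AU."""
    return 0 if is_fragment and not is_final_fragment else 1


def rtp_header(payload_type: int, sequence_number: int, timestamp: int,
               ssrc: int, marker: int) -> bytes:
    """Pack the 12-octet fixed RTP header (V=2, P=0, X=0, CC=0)."""
    first = 2 << 6
    second = (marker & 1) << 7 | (payload_type & 0x7F)
    return struct.pack("!BBHII", first, second,
                       sequence_number & 0xFFFF,
                       timestamp & 0xFFFFFFFF,
                       ssrc & 0xFFFFFFFF)


header = rtp_header(96, 1, 2, 3, marker_bit(is_fragment=False))
print(header.hex())
```

  All fragments of one AU would be sent with the same time stamp and
  with M set to 0 on every fragment except the last.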

3.2.  RTP Payload Structure

3.2.1.  The AU Header Section

  When present, the AU Header Section consists of the AU-headers-length
  field, followed by a number of AU-headers (see Figure 2).

     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- .. -+-+-+-+-+-+-+-+-+-+
     |AU-headers-length|AU-header|AU-header|      |AU-header|padding|
     |                 |   (1)   |   (2)   |      |   (n)   | bits  |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- .. -+-+-+-+-+-+-+-+-+-+

                  Figure 2: The AU Header Section

  The AU-headers are configured using MIME format parameters and MAY be
  empty.  If the AU-header is configured empty, the AU-headers-length
  field SHALL NOT be present and consequently the AU Header Section is
  empty.  If the AU-header is not configured empty, then the AU-
  headers-length is a two octet field that specifies the length in bits
  of the immediately following AU-headers, excluding the padding bits.

  Each AU-header is associated with a single Access Unit (fragment)
  contained in the Access Unit Data Section in the same RTP packet.
  For each contained Access Unit (fragment), there is exactly one AU-
  header.  Within the AU Header Section, the AU-headers are bit-wise
  concatenated in the order in which the Access Units are contained in
  the Access Unit Data Section.  Hence, the n-th AU-header refers to
  the n-th AU (fragment).  If the concatenated AU-headers consume a
  non-integer number of octets, up to 7 zero-padding bits MUST be
  inserted at the end in order to achieve octet-alignment of the AU
  Header Section.
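
  The size of the AU Header Section follows from the AU-headers-length
  field and the padding rule.  The sketch below assumes that the
  AU-header is not configured empty and that the two-octet field is in
  network byte order; the function name is hypothetical.

```python
def au_header_section_size(payload: bytes) -> tuple[int, int]:
    """Return (AU-headers length in bits, section size in octets)."""
    if len(payload) < 2:
        raise ValueError("payload too short for AU-headers-length")
    header_bits = int.from_bytes(payload[:2], "big")
    # Two octets for the length field, plus the headers rounded up to a
    # whole octet (up to 7 padding bits).
    section_octets = 2 + (header_bits + 7) // 8
    if section_octets > len(payload):
        raise ValueError("AU-headers-length exceeds the payload")
    return header_bits, section_octets
```

  For example, 13 bits of AU-headers occupy 2 octets after padding, so
  the section is 4 octets long in total.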

3.2.1.1.  The AU-header

  Each AU-header may contain the fields given in Figure 3.  The length
  in bits of the fields, with the exception of the CTS-flag, the
  DTS-flag and the RAP-flag fields, is defined by MIME format
  parameters; see section 4.1.  If a MIME format parameter has the
  default value of zero, then the associated field is not present.  The
  number of bits for fields that are present and that represent the
  value of a parameter MUST be chosen large enough to correctly encode
  the largest value of that parameter during the session.

  If present, the fields MUST occur in the mutual order given in Figure
  3.  In the general case, a receiver can only discover the size of an
  AU-header by parsing it since the presence of the CTS-delta and DTS-
  delta fields is signaled by the value of the CTS-flag and DTS-flag,
  respectively.

     +---------------------------------------+
     |     AU-size                           |
     +---------------------------------------+
     |     AU-Index / AU-Index-delta         |
     +---------------------------------------+
     |     CTS-flag                          |
     +---------------------------------------+
     |     CTS-delta                         |
     +---------------------------------------+
     |     DTS-flag                          |
     +---------------------------------------+
     |     DTS-delta                         |
     +---------------------------------------+
     |     RAP-flag                          |
     +---------------------------------------+
     |     Stream-state                      |
     +---------------------------------------+

  Figure 3: The fields in the AU-header.  If used, the AU-Index field
            only occurs in the first AU-header within an AU Header
            Section; in any other AU-header, the AU-Index-delta field
            occurs instead.

  AU-size: Indicates the size in octets of the associated Access Unit
     in the Access Unit Data Section in the same RTP packet.  When the
     AU-size is associated with an AU fragment, the AU size indicates
     the size of the entire AU and not the size of the fragment.  In
     this case, the size of the fragment is known from the size of the
     AU data section.  This can be exploited to determine whether a
     packet contains an entire AU or a fragment, which is particularly
     useful after losing a packet carrying the last fragment of an AU.

  AU-Index: Indicates the serial number of the associated Access Unit
     (fragment).  For each (in decoding order) consecutive AU or AU
     fragment, the serial number is incremented by 1.  When present,
     the AU-Index field occurs in the first AU-header in the AU Header
     Section, but MUST NOT occur in any subsequent (non-first) AU-
     header in that Section.  To encode the serial number in any such
     non-first AU-header, the AU-Index-delta field is used.

  AU-Index-delta: The AU-Index-delta field is an unsigned integer that
     specifies the serial number of the associated AU as the difference
     with respect to the serial number of the previous Access Unit.
     Hence, for the n-th (n>1) AU, the serial number is found from:

     AU-Index(n) = AU-Index(n-1) + AU-Index-delta(n) + 1

     If the AU-Index field is present in the first AU-header in the AU
     Header Section, then the AU-Index-delta field MUST be present in
     any subsequent (non-first) AU-header.  When the AU-Index-delta is
     coded with the value 0, it indicates that the Access Units are
     consecutive in decoding order.  An AU-Index-delta value larger
     than 0 signals that interleaving is applied.
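
     As an illustration only, the recurrence above can be sketched in
     Python.  The function name and argument layout are assumptions
     made for this example; they are not defined by this
     specification.

```python
def resolve_au_indices(au_index, index_deltas):
    """Return the serial numbers of all AUs in one AU Header Section.

    au_index:     AU-Index value from the first AU-header.
    index_deltas: AU-Index-delta values from the non-first AU-headers,
                  in the order in which the headers occur.
    """
    indices = [au_index]
    for delta in index_deltas:
        # AU-Index(n) = AU-Index(n-1) + AU-Index-delta(n) + 1
        indices.append(indices[-1] + delta + 1)
    return indices
```

     With an AU-Index of 0 and AU-Index-delta values of 2 and 2, this
     sketch yields the serial numbers 0, 3 and 6; with deltas of 0 it
     yields consecutive serial numbers.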

  CTS-flag: Indicates whether the CTS-delta field is present.  A value
     of 1 indicates that the field is present, a value of 0 indicates
     that it is not present.

     The CTS-flag field MUST be present in each AU-header if the length
     of the CTS-delta field is signaled to be larger than zero.  In
     that case, the CTS-flag field MUST have the value 0 in the first
     AU-header and MAY have the value 1 in all non-first AU-headers.
     The CTS-flag field SHOULD be 0 for any non-first fragment of an
     Access Unit.

  CTS-delta: Encodes the CTS by specifying the value of CTS as a 2's
     complement offset (delta) from the time stamp in the RTP header of
     this RTP packet.  The CTS MUST use the same clock rate as the time
     stamp in the RTP header.
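
     A minimal sketch of how a receiver might recover the CTS from a
     CTS-delta field is given below.  The helper names are
     hypothetical, and wrapping the result to 32 bits (the width of
     the RTP time stamp) is an assumption of this example.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret an unsigned field of `bits` bits as a 2's complement integer."""
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


def cts_from_delta(rtp_timestamp: int, raw_delta: int, cts_delta_bits: int) -> int:
    """Add the signed CTS-delta to the RTP time stamp, modulo 2**32 (assumed)."""
    return (rtp_timestamp + sign_extend(raw_delta, cts_delta_bits)) & 0xFFFFFFFF
```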

  DTS-flag: Indicates whether the DTS-delta field is present.  A value
     of 1 indicates that DTS-delta is present, a value of 0 indicates
     that it is not present.

     The DTS-flag field MUST be present in each AU-header if the length
     of the DTS-delta field is signaled to be larger than zero.  The
     DTS-flag field MUST have the same value for all fragments of an
     Access Unit.

  DTS-delta: Specifies the value of the DTS as a 2's complement offset
     (delta) from the CTS.  The DTS MUST use the same clock rate as the
     time stamp in the RTP header.  The DTS-delta field MUST have the
     same value for all fragments of an Access Unit.

  RAP-flag: When set to 1, indicates that the associated Access Unit
     provides a random access point to the content of the stream.  If
     an Access Unit is fragmented, the RAP flag, if present, MUST be
     set to 0 for each non-first fragment of the AU.

  Stream-state:  Specifies the state of the stream for an AU of an
     MPEG-4 system stream; each state is identified by a value of a
     modulo counter.  In ISO/IEC 14496-1, MPEG-4 system streams use the
     AU_SequenceNumber to signal stream states.  When the stream state
     changes, the value of the stream-state MUST be incremented by one.

     Note: no relation is required between stream-states of different
     streams.
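
  The field layout described above can be sketched as a small parser.
  This is an illustration under stated assumptions, not a reference
  implementation: the configuration attribute names are hypothetical
  Python names that mirror the field lengths signaled by the MIME
  format parameters, bits are assumed to be read most significant bit
  first, and the input is assumed to start with the 16-bit AU-headers-
  length field (in bits) specified in section 3.2.1.

```python
from dataclasses import dataclass


@dataclass
class AuHeaderConfig:
    # A length of zero means that the associated field is not present.
    size_length: int = 0
    index_length: int = 0
    index_delta_length: int = 0
    cts_delta_length: int = 0
    dts_delta_length: int = 0
    random_access_indication: bool = False
    stream_state_length: int = 0


class BitReader:
    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # current position, in bits

    def read(self, nbits: int) -> int:
        value = 0
        for _ in range(nbits):
            byte = self.data[self.pos // 8]
            value = (value << 1) | ((byte >> (7 - self.pos % 8)) & 1)
            self.pos += 1
        return value


def parse_au_header_section(section: bytes, cfg: AuHeaderConfig):
    """Return (list of AU-header dicts, octets consumed incl. padding)."""
    reader = BitReader(section)
    end = 16 + reader.read(16)  # AU-headers-length counts bits
    headers = []
    while reader.pos < end:
        start = reader.pos
        first = not headers
        header = {}
        if cfg.size_length:
            header["size"] = reader.read(cfg.size_length)
        if first and cfg.index_length:
            header["index"] = reader.read(cfg.index_length)
        elif not first and cfg.index_delta_length:
            header["index_delta"] = reader.read(cfg.index_delta_length)
        if cfg.cts_delta_length and reader.read(1):      # CTS-flag
            header["cts_delta"] = reader.read(cfg.cts_delta_length)
        if cfg.dts_delta_length and reader.read(1):      # DTS-flag
            header["dts_delta"] = reader.read(cfg.dts_delta_length)
        if cfg.random_access_indication:
            header["rap"] = bool(reader.read(1))
        if cfg.stream_state_length:
            header["stream_state"] = reader.read(cfg.stream_state_length)
        if reader.pos == start:
            raise ValueError("no AU-header fields are configured")
        headers.append(header)
    return headers, (end + 7) // 8
```

  The CTS-delta and DTS-delta values are returned as raw unsigned
  fields; interpreting them as 2's complement offsets is left to the
  caller.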

3.2.2.  The Auxiliary Section

  The Auxiliary Section consists of the auxiliary-data-size field
  followed by the auxiliary-data field.  Receivers MAY (but are not
  required to) parse the auxiliary-data field; to facilitate skipping
  of the auxiliary-data field by receivers, the auxiliary-data-size
  field indicates the length in bits of the auxiliary-data.  If the
  concatenation of the auxiliary-data-size and the auxiliary-data
  fields consumes a non-integer number of octets, up to 7 zero-padding
  bits MUST be inserted immediately after the auxiliary data in order
  to achieve octet-alignment.  See Figure 4.

     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- .. -+-+-+-+-+-+-+-+-+
     | auxiliary-data-size   | auxiliary-data       |padding bits |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- .. -+-+-+-+-+-+-+-+-+

          Figure 4: The fields in the Auxiliary Section

  The length in bits of the auxiliary-data-size field is configurable
  by a MIME format parameter; see section 4.1.  The default length of
  zero indicates that the entire Auxiliary Section is absent.

  auxiliary-data-size: specifies the length in bits of the immediately
     following auxiliary-data field;

  auxiliary-data: the auxiliary-data field contains data of a format
     not defined by this specification.
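
  A receiver that does not parse the auxiliary data only needs the
  number of octets to skip.  The sketch below shows one way to compute
  it; the function and argument names are hypothetical, and the
  auxiliary-data-size field is assumed to be coded most significant
  bit first at the start of the section.

```python
def auxiliary_section_octets(section: bytes, aux_size_length: int) -> int:
    """Return the number of octets occupied by the Auxiliary Section.

    aux_size_length: signaled length in bits of auxiliary-data-size.
    """
    if aux_size_length == 0:
        return 0  # default length of zero: the section is absent
    needed = (aux_size_length + 7) // 8
    if len(section) < needed:
        raise ValueError("section too short for auxiliary-data-size")
    # Keep only the leading aux_size_length bits of the first octets.
    aux_bits = int.from_bytes(section[:needed], "big") >> (needed * 8 - aux_size_length)
    # Size field plus auxiliary data, rounded up to a whole octet.
    return (aux_size_length + aux_bits + 7) // 8
```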

3.2.3.  The Access Unit Data Section

  The Access Unit Data Section contains an integer number of complete
  Access Units or a single fragment of one AU.  The Access Unit Data
  Section is never empty.  If data of more than one Access Unit is
  present, then the AUs are concatenated into a contiguous string of
  octets.  See Figure 5.  The AUs inside the Access Unit Data Section
  MUST be in decoding order, though not necessarily contiguous in the
  case of interleaving.

  The size and number of Access Units SHOULD be adjusted such that the
  resulting RTP packet is not larger than the path MTU.  To handle
  larger packets, this payload format relies on lower layers for
  fragmentation, which may result in reduced performance.

     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |AU(1)                                                          |
     +                                                               |
     |                                                               |
     |               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |               |AU(2)                                          |
     +-+-+-+-+-+-+-+-+                                               |
     |                                                               |
     |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                               | AU(n)                         |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |AU(n) continued|
     +-+-+-+-+-+-+-+-+

       Figure 5: Access Unit Data Section; each AU is octet-aligned.

  When multiple Access Units are carried, the size of each AU MUST be
  made available to the receiver.  If the AU size is variable, then the
  size of each AU MUST be indicated in the AU-size field of the
  corresponding AU-header.  However, if the AU size is constant for a
  stream, this mechanism SHOULD NOT be used; instead, the fixed size
  SHOULD be signaled by the MIME format parameter "constantSize"; see
  section 4.1.

  The absence of both AU-size in the AU-header and the constantSize
  MIME format parameter indicates the carriage of a single AU
  (fragment), i.e., that a single Access Unit (fragment) is transported
  in each RTP packet for that stream.
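
  The three cases above (AU-size per header, constantSize, or neither)
  can be sketched as follows.  The function name and arguments are
  assumptions made for this example; treating a single AU-size that
  exceeds the available data as a fragment follows the AU-size
  semantics in section 3.2.1.1.

```python
def split_access_units(data: bytes, au_sizes=None, constant_size=None):
    """Split the Access Unit Data Section into AUs (or one fragment).

    au_sizes:      AU-size values from the AU-headers, if present.
    constant_size: value of the constantSize parameter, if signaled.
    """
    if au_sizes is None and constant_size is None:
        return [data]  # a single AU or a single fragment
    if au_sizes is None:
        if len(data) % constant_size:
            raise ValueError("data is not a multiple of constantSize")
        au_sizes = [constant_size] * (len(data) // constant_size)
    if len(au_sizes) == 1 and au_sizes[0] > len(data):
        return [data]  # fragment: AU-size gives the size of the entire AU
    if sum(au_sizes) != len(data):
        raise ValueError("AU sizes do not match the data section length")
    units, pos = [], 0
    for size in au_sizes:
        units.append(data[pos:pos + size])
        pos += size
    return units
```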

3.2.3.1.  Fragmentation

  A packet SHALL carry either one or more complete Access Units, or a
  single fragment of an Access Unit.  Fragments of the same Access Unit
  have the same time stamp but different RTP sequence numbers.  The
  marker bit in the RTP header is 1 on the last fragment of an Access
  Unit, and 0 on all other fragments.
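
  As a non-normative sketch, a receiver could reassemble a fragmented
  Access Unit as shown below.  The tuple layout and function name are
  hypothetical; the check against the size of the entire AU assumes
  that the AU-size field is present.

```python
def reassemble_au(fragments, au_size):
    """Return the reassembled AU, or None if it is incomplete.

    fragments: (sequence_number, marker, data) tuples that share one
               RTP time stamp, in any order.
    au_size:   size in octets of the entire AU.
    """
    if not fragments:
        return None
    base = fragments[0][0]

    def distance(fragment):
        # Signed distance from the first received fragment, modulo 2**16.
        return ((fragment[0] - base + 0x8000) & 0xFFFF) - 0x8000

    ordered = sorted(fragments, key=distance)
    for previous, current in zip(ordered, ordered[1:]):
        if distance(current) - distance(previous) != 1:
            return None  # gap or duplicate in the RTP sequence numbers
    if not ordered[-1][1] or any(marker for _, marker, _ in ordered[:-1]):
        return None  # marker bit must be 1 on the last fragment only
    data = b"".join(chunk for _, _, chunk in ordered)
    return data if len(data) == au_size else None
```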

3.2.3.2.  Interleaving

  Unless prohibited by the signaled mode, a sender MAY interleave
  Access Units.  Receivers that are capable of receiving modes that
  support interleaving MUST be able to decode interleaved Access Units.

  When a sender interleaves Access Units, it needs to provide
  sufficient information to enable a receiver to unambiguously
  reconstruct the original order, even in the case of out-of-order
  packets, packet loss or duplication.  The information that senders
  need to provide depends on whether or not the Access Units have a
  constant time duration.  Access Units have a constant time duration,
  if:

  TS(i+1) - TS(i) = constant

      for any i, where:
         i indicates the index of the AU in the original order, and
         TS(i) denotes the time stamp of AU(i)

  The MIME parameter "constantDuration" SHOULD be used to signal that
  Access Units have a constant time duration; see section 4.1.

  If the "constantDuration" parameter is present, the receiver can
  reconstruct the original Access Unit timing based solely on the RTP
  timestamp and AU-Index-delta.  Accordingly, when transmitting Access
  Units of constant duration, the AU-Index, if present, MUST be set to
  the value 0.  Receivers of constant duration Access Units MUST use
  the RTP timestamp to determine the index of the first AU in the RTP
  packet.  The AU-Index-delta header and the signaled
  "constantDuration" are used to reconstruct AU timing.

  If the "constantDuration" parameter is not present, then senders MAY
  signal AUs of constant duration by coding the AU-Index with zero in
  each RTP packet.  In the absence of the constantDuration parameter
  receivers MUST conclude that the AUs have constant duration if the
  AU-index is zero in two consecutive RTP packets.

  When transmitting Access Units of variable duration, the
  "constantDuration" parameter MUST NOT be present, and the transmitter
  MUST use the AU-Index to encode the index information required for
  re-ordering, and the receiver MUST use that value to determine the
  index of each AU in the RTP packet.  The number of bits of the AU-
  Index field MUST be chosen so that valid index information is
  provided for the applied interleaving scheme, without causing problems
  due to roll-over of the AU-Index field.  In addition, the CTS-delta
  MUST be coded in the AU header for each non-first AU in the RTP
  packet, so that receivers can place the AUs correctly in time.

  When interleaving is applied, a de-interleave buffer is needed in
  receivers to put the Access Units in their correct logical
  consecutive decoding order.  This requires the computation of the
  time stamp for each Access Unit.  In case of a constant time duration
  per Access Unit, the time stamp of the i-th access unit in an RTP
  packet with RTP time stamp T is calculated as follows:

  Timestamp[0] = T
  Timestamp[i, i > 0] = T +(Sum(for k=1 to i of (AU-Index-delta[k]
                        + 1))) * access-unit-duration

  When AU-Index-delta is always 0, this reduces to T + i * (access-
  unit-duration).  This is the non-interleaved case, where the frames
  are consecutive in decoding order.  Note that the AU-Index field
  (present for the first Access Unit) is indeed not needed in this
  calculation.
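
  The calculation above can be sketched in Python as follows.  The
  function name is an assumption made for this example, and roll-over
  of the 32-bit RTP time stamp is deliberately not handled.

```python
def au_timestamps(rtp_timestamp, index_deltas, au_duration):
    """Time stamps of all AUs in a packet with constant AU duration.

    index_deltas: AU-Index-delta values of the non-first AU-headers.
    au_duration:  access-unit-duration in RTP clock ticks.
    """
    timestamps = [rtp_timestamp]  # Timestamp[0] = T
    offset = 0
    for delta in index_deltas:
        offset += delta + 1  # running sum of (AU-Index-delta[k] + 1)
        timestamps.append(rtp_timestamp + offset * au_duration)
    return timestamps
```

  For example, with T = 0, an access-unit-duration of 1024 and AU-
  Index-delta values of 2 and 2, the three AUs in the packet have the
  time stamps 0, 3072 and 6144.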

3.2.3.3.  Constraints for Interleaving

  The size of the packets should be suitably chosen to be appropriate
  to both the path MTU and the capacity of the receiver's de-interleave
  buffer.  The maximum packet size for a session SHOULD be chosen to
  not exceed the path MTU.

  To allow receivers to allocate sufficient resources for de-
  interleaving, senders MUST provide the information to receivers as
  specified in this section.

  AUs enter the decoder in decoding order.  The de-interleave buffer is
  used to re-order a stream of interleaved AUs back into decoding
  order.  When interleaving is applied, the decoding of "early" AUs has
  to be postponed until all AUs that precede it in decoding order are
  present.  Therefore, these "early" AUs are stored in the de-
  interleave buffer.  As an example in Figure 6, the interleaving
  pattern from section 2.5 is considered.

                            +--+--+--+--+--+--+--+--+--+--+--+-
  Interleaved AUs           | 0| 3| 6| 1| 4| 7| 2| 5| 8| 9|12|..
                            +--+--+--+--+--+--+--+--+--+--+--+-
  Storage of "early" AUs         3  3  3  3  3  3
                                    6  6  6  6  6  6
                                          4  4  4
                                             7  7  7
                                                           12 12

  Figure 6: Storage of "early" AUs in the de-interleave buffer per
            interleaved AU.

  AU(3) is to be delivered to the decoder after AU(0), AU(1) and AU(2);
  of these AUs, AU(2) arrives from the network last and hence AU(3)
  needs to be stored until AU(2) is present in the pattern.  Similarly,
  AU(6) is to be stored until AU(5) is present, while AU(4) and AU(7)
  are to be stored until AU(2) and AU(5) are present, respectively.
  Note that the fullness of the de-interleave buffer varies in time.
  In Figure 6, the de-interleave buffer contains at most 4 AUs, but
  often fewer.

  So as to give a rough indication of the resources needed in the
  receiver for de-interleaving, the maximum displacement in time of an
  AU is defined.  For any AU(j) in the pattern, each AU(i) with i<j
  that is not yet present can be determined.  The maximum displacement
  in time of an AU is the maximum difference between the time stamp of
  an AU in the pattern and the time stamp of the earliest AU that is
  not yet present.  In other words, when considering a sequence of
  interleaved AUs, then:

  Maximum displacement = max{TS(i) - TS(j)}

      for any i and any j>i, where:
         i and j indicate the index of the AU in the interleaving
               pattern, and
         TS denotes the time stamp of the AU.

  As an example in Figure 7, the interleaving pattern from section 2.5
  is considered.  For each AU in the pattern, the index is given of the
  earliest of any earlier AUs not yet present.  Hence for each AU(n) in
  the interleaving pattern the smallest index k (with k<n) of not yet
  delivered AUs is indicated.  A "-" indicates that all previous AUs
  are present.  If the AU period is constant, the maximum displacement
  equals 5 AU periods, as found for AU(6) and AU(7).

                                +--+--+--+--+--+--+--+--+--+--+--+-
  Interleaved AUs               | 0| 3| 6| 1| 4| 7| 2| 5| 8| 9|12|..
                                +--+--+--+--+--+--+--+--+--+--+--+-

  Earliest not yet present AU     -  1  1  -  2  2  -  -  -  - 10

  Figure 7: For each AU in the interleaving pattern, the earliest of
            any earlier AUs not yet present
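
  The analysis shown in Figure 7 can be reproduced with a short
  sketch.  The function names are hypothetical, and the sketch assumes
  that AU indices start at 0 and that the AU period is constant.

```python
def earliest_missing(pattern):
    """For each AU in transmission order, return the smallest index of
    an earlier AU that is not yet present, or None if none is missing."""
    received = set()
    result = []
    for au in pattern:
        missing = [k for k in range(au) if k not in received]
        result.append(missing[0] if missing else None)
        received.add(au)
    return result


def max_displacement(pattern, au_period=1):
    """Maximum displacement, in the units of au_period."""
    pairs = zip(pattern, earliest_missing(pattern))
    return max(((au - m) * au_period for au, m in pairs if m is not None),
               default=0)
```

  Applied to the pattern 0, 3, 6, 1, 4, 7, 2, 5, 8, 9, 12 the sketch
  gives the row of Figure 7 and a maximum displacement of 5 AU
  periods.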

  When interleaving, senders MUST signal the maximum displacement in
  time during the session via the MIME format parameter
  "maxDisplacement"; see section 4.1.

  An estimate of the size of the de-interleave buffer is found by
  multiplying the maximum displacement by the maximum bit rate:

  size(de-interleave buffer) = {(maxDisplacement) * Rate(max)} / (RTP
                               clock frequency),

      where:
         Rate(max) is the maximum bit-rate of the transported stream.

  Note that receivers can derive Rate(max) from the MIME format
  parameters streamType, profile-level-id, and config.
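
  A worked instance of this estimate is sketched below.  The function
  name and the sample values are assumptions made for illustration:
  the maximum displacement is taken in RTP clock ticks, Rate(max) in
  bits per second, and the result is rounded up and converted to
  octets for convenience.

```python
def estimate_deinterleave_buffer_octets(max_displacement: int,
                                        max_bit_rate: int,
                                        rtp_clock_hz: int) -> int:
    """Estimate the de-interleave buffer size in octets."""
    # Ceiling of maxDisplacement * Rate(max) / RTP clock frequency, in bits.
    bits = -(-max_displacement * max_bit_rate // rtp_clock_hz)
    return (bits + 7) // 8
```

  For instance, a displacement of 4800 ticks at a 48 kHz clock (0.1
  second) and a maximum bit-rate of 64000 bits per second gives an
  estimate of 6400 bits, i.e., 800 octets.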

  However, this calculation estimates the size of the de-interleave
  buffer and the required size may differ from the calculated value.
  If this calculation under-estimates the size of the
  de-interleave buffer, then senders, when interleaving, MUST signal a
  size of the de-interleave buffer via the MIME format parameter
  "de-interleaveBufferSize"; see section 4.1.  If the calculation
  over-estimates the size of the de-interleave buffer, then senders,
  when interleaving, MAY signal a size of the de-interleave buffer via
  the MIME format parameter "de-interleaveBufferSize".

  The signaled size of the de-interleave buffer MUST be large enough to
  contain all "early" AUs at any point in time during the session.
  That is:

  minimum de-interleave buffer size = max [sum {if TS(i) > TS(j) then
                                      AU-size(i) else 0}]

      for any j and any i<j, where:
         i and j indicate the index of an AU in the interleaving
               pattern,
         TS(i) denotes the time stamp of AU(i), and
         AU-size(i) denotes the size of AU(i) in number of octets.
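
  The formula above can be evaluated directly for a known interleaving
  pattern.  The following sketch is illustrative only; the function
  name is hypothetical and time stamp roll-over is not considered.

```python
def min_deinterleave_buffer_size(timestamps, sizes):
    """Minimum de-interleave buffer size in octets.

    timestamps: TS of each AU, in transmission (interleaved) order.
    sizes:      AU-size of each AU in octets, in the same order.
    """
    best = 0
    for j in range(len(timestamps)):
        # Sum the sizes of all "early" AUs sent before AU j.
        early = sum(sizes[i] for i in range(j)
                    if timestamps[i] > timestamps[j])
        best = max(best, early)
    return best
```

  For the pattern of Figure 6 with AUs of one octet each, the result
  is 4, in line with the maximum number of AUs stored in the de-
  interleave buffer in that figure.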

  If the "de-interleaveBufferSize" parameter is present, then the
  applied buffer for de-interleaving in a receiver MUST have a size
  that is at least equal to the signaled size of the de-interleave
  buffer, else a size that is at least equal to the calculated size of
  the de-interleave buffer.

  No matter what interleaving scheme is used, the scheme must be
  analyzed to calculate the applicable maxDisplacement value, as well
  as the required size of the de-interleave buffer.  Senders SHOULD
  signal values that are not larger than the strictly required values;
  if larger values are signaled, the receiver will buffer excessively.

  Note that for low bit-rate material, the applied interleaving may
  make packets shorter than the MTU size.

3.2.3.4.  Crucial and Non-Crucial AUs with MPEG-4 System Data

  Some Access Units with MPEG-4 system data, called "crucial" AUs,
  carry information whose loss cannot be tolerated, either in the
  presentation or in the decoder.  At each crucial AU in an MPEG-4
  system stream, the stream state changes.  The stream-state MAY remain
  constant at non-crucial AUs.  In ISO/IEC 14496-1, MPEG-4 system
  streams use the AU_SequenceNumber to signal stream states.

  Example: Given three AUs, AU1 = "Insertion of node X", AU2 = "Set
  position of node X", AU3 = "Set position of node X".  AU1 is crucial,
  since if it is lost, AU2 cannot be executed.  However, AU2 is not
  crucial, since AU3 can be executed even if AU2 is lost.

  When a crucial AU is (possibly) lost, the stream is corrupted.  For
  example, when an AU is lost and the stream state has changed at the
  next received AU, then it is possible that the lost AU was crucial.
  Once corrupted, the stream remains corrupted until the next random
  access point.  Note that loss of non-crucial AUs does not corrupt the
  stream.  When a decoder starts receiving a stream, the decoder MUST
  consider the stream corrupted until an AU is received that provides a
  random access point.

  An AU that provides a random access point, as signaled by the RAP-
  flag, may or may not be crucial.  Non-crucial RAP AUs provide a
  "repeated" random access point for use by decoders that recently
  joined the stream or that need to re-start decoding after a stream
  corruption.  Non-crucial RAP AUs MUST include all updates since the
  last crucial RAP AU.

  Upon receiving AUs, decoders are to react as follows:

  a) if the RAP-flag is set to 1 and the stream-state changes, then the
     AU is a crucial RAP AU, and the AU MUST be decoded.

  b) if the RAP-flag is set to 1 and the stream state does not change,
     then the AU is a non-crucial RAP AU, and the receiver SHOULD
     decode it if the stream is corrupted.  Otherwise, the decoder MUST
     ignore the AU.

  c) if the RAP-flag is set to 0, then the AU MUST be decoded, unless
     the stream is corrupted, in which case the AU MUST be ignored.
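
  Rules a) through c) can be sketched as a small state tracker.  The
  class and method names are hypothetical.  The sketch assumes that a
  change of stream-state is detected by comparison with the previously
  received AU, that the caller reports whether a loss was detected
  immediately before the AU, and that decoding any RAP AU ends the
  corrupted state.

```python
class StreamStateTracker:
    def __init__(self):
        self.corrupted = True   # corrupted until a random access point
        self.last_state = None  # stream-state of the previous AU

    def on_au(self, rap: bool, stream_state: int,
              loss_before: bool = False) -> bool:
        """Return True if the AU is to be decoded, False if ignored."""
        changed = stream_state != self.last_state
        if loss_before and changed:
            self.corrupted = True  # a crucial AU was possibly lost
        self.last_state = stream_state
        if rap:
            if changed or self.corrupted:
                # a) crucial RAP AU, or b) non-crucial RAP AU received
                # while the stream is corrupted.
                self.corrupted = False
                return True
            return False  # b) non-crucial RAP AU, stream not corrupted
        return not self.corrupted  # c)
```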

3.3.  Usage of this Specification

3.3.1.  General

  Usage of this specification requires definition of a mode.  A mode
  defines how to use this specification, as deemed appropriate.
  Senders MUST signal the applied mode via the MIME format parameter
  "mode", as specified in section 4.1.  This specification defines a
  generic mode that can be used for any MPEG-4 stream, as well as
  specific modes for the transportation of MPEG-4 CELP and MPEG-4 AAC
  streams, defined in ISO/IEC 14496-3 [1].

  When use of this payload format is signaled using SDP [5], an
  "rtpmap" attribute is part of that signaling.  The same requirements
  apply for the rtpmap attribute in any mode compliant to this
  specification.  The general form of an rtpmap attribute is:

  a=rtpmap:<payload type> <encoding name>/<clock rate>[/<encoding
            parameters>]

  For audio streams, <encoding parameters> specifies the number of
  audio channels: 2 for stereo material (see RFC 2327 [5]) and 1 for
  mono.  Provided no additional parameters are needed, this parameter
  may be omitted for mono material, hence its default value is 1.

3.3.2.  The Generic Mode

  The generic mode can be used for any MPEG-4 stream.  In this mode, no
  mode-specific constraints are applied; hence, in the generic mode,
  the full flexibility of this specification can be exploited.  The
  generic mode is signaled by mode=generic.

  An example is given below for the transportation of a BIFS-Anim
  stream.  In this example carriage of multiple BIFS-Anim Access Units
  is allowed in one RTP packet.  The AU-header contains the AU-size
  field, the CTS-flag and, if the CTS flag is set to 1, the CTS-delta
  field.  The number of bits of the AU-size and the CTS-delta fields
  are 10 and 16, respectively.  The AU-header also contains the RAP-
  flag and the Stream-state of 4 bits.  This results in an AU-header
  with a total size of two or four octets per BIFS-Anim AU.  The RTP
  time stamp uses a 1 kHz clock.  Note that the media type name is
  video, because the BIFS-Anim stream is part of an audio-visual
  presentation.  For conventions on media type names, see section 4.1.

  In detail:

  m=video 49230 RTP/AVP 96
  a=rtpmap:96 mpeg4-generic/1000
  a=fmtp:96 streamtype=3; profile-level-id=1807; mode=generic;
  objectType=2; config=0842237F24001FB400094002C0; sizeLength=10;
  CTSDeltaLength=16; randomAccessIndication=1;
  streamStateIndication=4

  Note: The a=fmtp line has been wrapped to fit the page; it comprises
  a single line in the SDP file.

  The hexadecimal value of the "config" parameter is the
  BIFSConfiguration() as defined in ISO/IEC 14496-1.  The
  BIFSConfiguration() specifies that the BIFS stream is a BIFS-Anim
  stream.  For the description of MIME parameters, see section 4.1.
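
  The AU-header size stated for this example can be checked with a
  short sketch.  The helper names are hypothetical, lower-casing the
  parameter names is a convenience chosen for this example, and only
  the fields used in this example are counted.

```python
def parse_fmtp(line: str) -> dict:
    """Split an a=fmtp line into {lower-cased name: value string}."""
    _, _, params = line.partition(" ")
    result = {}
    for item in params.split(";"):
        name, sep, value = item.strip().partition("=")
        if sep:
            result[name.lower()] = value
    return result


def au_header_bits(params: dict) -> tuple:
    """Return (bits without CTS-delta, bits with CTS-delta)."""
    size = int(params.get("sizelength", 0))
    cts = int(params.get("ctsdeltalength", 0))
    rap = int(params.get("randomaccessindication", 0))
    state = int(params.get("streamstateindication", 0))
    # AU-size + CTS-flag + RAP-flag + Stream-state
    fixed = size + (1 if cts else 0) + (1 if rap else 0) + state
    return fixed, fixed + cts
```

  For the fmtp line above, the sketch gives 16 and 32 bits, that is,
  the two or four octets per AU-header mentioned in the text.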

3.3.3.  Constant Bit-rate CELP

  This mode is signaled by mode=CELP-cbr.  In this mode, one or more
  complete CELP frames of fixed size can be transported in one RTP
  packet; interleaving MUST NOT be used with this mode.  The RTP
  payload consists of one or more concatenated CELP frames, each of
  equal size.  CELP frames MUST NOT be fragmented when using this mode.
  Both the AU Header Section and the Auxiliary Section MUST be empty.

  The MIME format parameter constantSize MUST be provided to specify
  the length of each CELP frame.

  For example:

  m=audio 49230 RTP/AVP 96
  a=rtpmap:96 mpeg4-generic/16000/1
  a=fmtp:96 streamtype=5; profile-level-id=14; mode=CELP-cbr; config=
  440E00; constantSize=27; constantDuration=240

  Note: The a=fmtp line has been wrapped to fit the page; it comprises
  a single line in the SDP file.

  The hexadecimal value of the "config" parameter is the
  AudioSpecificConfig() as defined in ISO/IEC 14496-3.
  AudioSpecificConfig() specifies a mono CELP stream with a sampling
  rate of 16 kHz at a fixed bitrate of 14.4 kb/s and 6 sub-frames per
  CELP frame.  For the description of MIME parameters, see section 4.1.
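
  As a hedged illustration of this mode, the sketch below splits a
  payload into frames of constantSize octets and derives the bit-rate
  implied by the example parameters.  The function names are
  assumptions made for this example.

```python
def split_celp_cbr(payload: bytes, constant_size: int):
    """Split a CELP-cbr payload into frames of constant_size octets."""
    if not payload or len(payload) % constant_size:
        raise ValueError("payload is not a whole number of CELP frames")
    return [payload[i:i + constant_size]
            for i in range(0, len(payload), constant_size)]


def implied_bit_rate(constant_size: int, constant_duration: int,
                     clock_rate: int) -> float:
    """Bits per second implied by constantSize and constantDuration."""
    return constant_size * 8 * clock_rate / constant_duration
```

  With constantSize=27, constantDuration=240 and a 16000 Hz clock, the
  implied bit-rate is 14400 bits per second, which agrees with the
  14.4 kb/s stated above.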

3.3.4.  Variable Bit-rate CELP

  This mode is signaled by mode=CELP-vbr.  With this mode, one or more
  complete CELP frames of variable size can be transported in one RTP
  packet with OPTIONAL interleaving.  In this mode, the largest
  possible value for AU-size is greater than the maximum CELP frame
  size.  Because CELP frames are very small, there is no support for
  fragmentation of CELP frames.  Hence, CELP frames MUST NOT be
  fragmented when using this mode.

  In this mode, the RTP payload consists of the AU Header Section,
  followed by one or more concatenated CELP frames.  The Auxiliary
  Section MUST be empty.  For each CELP frame contained in the payload,
  there MUST be a one octet AU-header in the AU Header Section to
  provide:

  a) the size of each CELP frame in the payload and

  b) index information for computing the sequence (and hence timing) of
     each CELP frame.

  Transport of CELP frames requires that the AU-size field be coded
  with 6 bits.  Therefore, in this mode 6 bits are allocated to the
  AU-size field, and 2 bits to the AU-Index(-delta) field.  Each AU-
  Index field MUST be coded with the value 0.  In the AU Header
  Section, the concatenated AU-headers are preceded by the 16-bit AU-
  headers-length field, as specified in section 3.2.1.

  In addition to the required MIME format parameters, the following
  parameters MUST be present: sizeLength, indexLength, and
  indexDeltaLength.  CELP frames always have a fixed duration per
  Access Unit; when interleaving in this mode, this specific duration
  MUST be signaled by the MIME format parameter constantDuration.  In
  addition, the parameter maxDisplacement MUST be present when
  interleaving.

  For example:

  m=audio 49230 RTP/AVP 96
  a=rtpmap:96 mpeg4-generic/16000/1
  a=fmtp:96 streamtype=5; profile-level-id=14; mode=CELP-vbr; config=
  440F20; sizeLength=6; indexLength=2; indexDeltaLength=2;
  constantDuration=160; maxDisplacement=5

  Note: The a=fmtp line has been wrapped to fit the page; it comprises
  a single line in the SDP file.

  The hexadecimal value of the "config" parameter is the
  AudioSpecificConfig() as defined in ISO/IEC 14496-3.
  AudioSpecificConfig() specifies a mono CELP stream with a sampling
  rate of 16 kHz, at a bitrate that varies between 13.9 and 16.2 kb/s
  and with 4 sub-frames per CELP frame.  For the description of MIME
  parameters, see section 4.1.
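
  The one-octet AU-header layout of this mode (6 bits AU-size, 2 bits
  AU-Index or AU-Index-delta) can be parsed as sketched below.  The
  function name is hypothetical, and the AU-headers-length field is
  assumed to be in network byte order.

```python
def parse_one_octet_au_headers(payload: bytes):
    """Return ([(au_size, index_or_delta), ...], offset of the AU data)."""
    if len(payload) < 2:
        raise ValueError("payload too short")
    headers_bits = int.from_bytes(payload[:2], "big")  # AU-headers-length
    if headers_bits == 0 or headers_bits % 8:
        raise ValueError("expected a whole number of one-octet AU-headers")
    count = headers_bits // 8
    if len(payload) < 2 + count:
        raise ValueError("AU Header Section is truncated")
    # Upper 6 bits: AU-size; lower 2 bits: AU-Index(-delta).
    headers = [(octet >> 2, octet & 0x3) for octet in payload[2:2 + count]]
    return headers, 2 + count
```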

3.3.5.  Low Bit-rate AAC

  This mode is signaled by mode=AAC-lbr.  This mode supports the
  transportation of one or more complete AAC frames of variable size.
  In this mode, the AAC frames are allowed to be interleaved and hence
  receivers MUST support de-interleaving.  The maximum size of an AAC
  frame in this mode is 63 octets.  AAC frames MUST NOT be fragmented
  when using this mode.  Hence, when using this mode, encoders MUST
  ensure that the size of each AAC frame is at most 63 octets.

  The payload configuration in this mode is the same as in the variable
  bit-rate CELP mode as defined in 3.3.4.  The RTP payload consists of
  the AU Header Section, followed by concatenated AAC frames.  The
  Auxiliary Section MUST be empty.  For each AAC frame contained in the
  payload, the one octet AU-header MUST provide:

  a) the size of each AAC frame in the payload and

  b) index information for computing the sequence (and hence timing) of
     each AAC frame.

  In the AU-header Section, the concatenated AU-headers MUST be
  preceded by the 16-bit AU-headers-length field, as specified in
  section 3.2.1.

  In addition to the required MIME format parameters, the following
  parameters MUST be present: sizeLength, indexLength, and
  indexDeltaLength.  AAC frames always have a fixed duration per Access
  Unit; when interleaving in this mode, this specific duration MUST be
  signaled by the MIME format parameter constantDuration.  In addition,
  the parameter maxDisplacement MUST be present when interleaving.

  For example:

  m=audio 49230 RTP/AVP 96
  a=rtpmap:96 mpeg4-generic/22050/1
  a=fmtp:96 streamtype=5; profile-level-id=14; mode=AAC-lbr; config=
  1388; sizeLength=6; indexLength=2; indexDeltaLength=2;
  constantDuration=1024; maxDisplacement=5

  Note: The a=fmtp line has been wrapped to fit the page; it comprises
  a single line in the SDP file.

  The hexadecimal value of the "config" parameter is the
  AudioSpecificConfig(), as defined in ISO/IEC 14496-3.
  AudioSpecificConfig() specifies a mono AAC stream with a sampling
  rate of 22.05 kHz.  For the description of MIME parameters, see
  section 4.1.
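
  The claims made about the "config" values in these examples can be
  checked by decoding the leading bits of the AudioSpecificConfig().
  The sketch below is not part of this specification: it assumes the
  layout of ISO/IEC 14496-3 (5 bits audio object type, 4 bits sampling
  frequency index, 4 bits channel configuration) and does not handle
  the escape codings.  The function name is hypothetical.

```python
# Sampling frequency table assumed from ISO/IEC 14496-3.
SAMPLING_FREQUENCIES = [96000, 88200, 64000, 48000, 44100, 32000, 24000,
                        22050, 16000, 12000, 11025, 8000, 7350]


def decode_audio_config_prefix(config_hex: str):
    """Return (object type, sampling rate in Hz, channel configuration)."""
    data = bytes.fromhex(config_hex)
    if len(data) < 2:
        raise ValueError("config too short")
    bits = int.from_bytes(data[:2], "big")
    object_type = bits >> 11          # first 5 bits
    freq_index = (bits >> 7) & 0xF    # next 4 bits
    channel_config = (bits >> 3) & 0xF  # next 4 bits
    if object_type == 31 or freq_index >= len(SAMPLING_FREQUENCIES):
        raise ValueError("escape coding is not handled by this sketch")
    return object_type, SAMPLING_FREQUENCIES[freq_index], channel_config
```

  For config=1388 this yields a sampling rate of 22050 Hz and channel
  configuration 1 (mono), in line with the description above.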

3.3.6.  High Bit-rate AAC

  This mode is signaled by mode=AAC-hbr.  This mode supports the
  transportation of variable size AAC frames.  In one RTP packet,
  either one or more complete AAC frames are carried, or a single
  fragment of an AAC frame is carried.  In this mode, the AAC frames
  are allowed to be interleaved and hence receivers MUST support de-
  interleaving.  The maximum size of an AAC frame in this mode is 8191
  octets.

  In this mode, the RTP payload consists of the AU Header Section,
  followed by either one AAC frame, several concatenated AAC frames or
  one fragmented AAC frame.  The Auxiliary Section MUST be empty.  For
  each AAC frame contained in the payload, there MUST be an AU-header
  in the AU Header Section to provide:

  a) the size of each AAC frame in the payload and

  b) index information for computing the sequence (and hence timing) of
     each AAC frame.

  To code the maximum size of an AAC frame requires 13 bits.
  Therefore, in this configuration 13 bits are allocated to the AU-
  size, and 3 bits to the AU-Index(-delta) field.  Thus, each AU-header
  has a size of 2 octets.  Each AU-Index field MUST be coded with the
  value 0.  In the AU Header Section, the concatenated AU-headers MUST
  be preceded by the 16-bit AU-headers-length field, as specified in
  section 3.2.1.

  In addition to the required MIME format parameters, the following
  parameters MUST be present: sizeLength, indexLength, and
  indexDeltaLength.  AAC frames always have a fixed duration per Access
  Unit; when interleaving in this mode, this specific duration MUST be
  signaled by the MIME format parameter constantDuration.  In addition,
  the parameter maxDisplacement MUST be present when interleaving.

  For example:

  m=audio 49230 RTP/AVP 96
  a=rtpmap:96 mpeg4-generic/48000/6
  a=fmtp:96 streamtype=5; profile-level-id=16; mode=AAC-hbr;
  config=11B0; sizeLength=13; indexLength=3;
  indexDeltaLength=3; constantDuration=1024

  Note: The a=fmtp line has been wrapped to fit the page; it comprises
  a single line in the SDP file.
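
  The parameter requirements for this mode can be checked
  mechanically.  The sketch below is illustrative only; the function
  name and the dictionary representation of the parameters are
  assumptions, and the fixed values (13, 3, 3) are taken from the
  AU-header layout given above for AAC-hbr.

```python
def check_aac_hbr_params(params: dict, interleaving: bool) -> list:
    """Return a list of problems found in AAC-hbr parameters (empty if none)."""
    # MIME parameter names are not case dependent, so compare in lower case.
    p = {name.lower(): value for name, value in params.items()}
    problems = []
    required = (("sizelength", "13"), ("indexlength", "3"),
                ("indexdeltalength", "3"))
    for name, expected in required:
        if name not in p:
            problems.append("missing " + name)
        elif p[name] != expected:
            problems.append(name + " should be " + expected)
    if interleaving:
        # Both parameters MUST be present when interleaving in this mode.
        for name in ("constantduration", "maxdisplacement"):
            if name not in p:
                problems.append("missing " + name)
    return problems
```

  Applied to the example above with interleaving enabled, this check
  would report that maxDisplacement is missing.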

  The hexadecimal value of the "config" parameter is the
  AudioSpecificConfig(), as defined in ISO/IEC 14496-3.
  AudioSpecificConfig() specifies a 5.1 channel AAC stream with a
  sampling rate of 48 kHz.  For the description of MIME parameters, see
  section 4.1.
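
  The reading of the "config" value given above can be reproduced
  with a short sketch.  It assumes the leading AudioSpecificConfig()
  layout of ISO/IEC 14496-3 (5-bit audio object type, 4-bit sampling
  frequency index, 4-bit channel configuration) and the standard
  sampling frequency table; escape codes and any fields beyond the
  first 13 bits are not handled.  The function name is hypothetical.

```python
# Sampling frequency table indexed by samplingFrequencyIndex (assumed from
# ISO/IEC 14496-3; escape value 15 is not handled in this sketch).
SAMPLING_FREQUENCIES = [96000, 88200, 64000, 48000, 44100, 32000,
                        24000, 22050, 16000, 12000, 11025, 8000, 7350]


def decode_audio_config(config_hex: str):
    """Return (audio_object_type, sampling_rate_hz, channel_configuration)."""
    value = int(config_hex, 16)
    total_bits = 4 * len(config_hex)
    if total_bits < 13:
        raise ValueError("config too short")

    def field(offset, width):
        # Extract `width` bits starting `offset` bits from the MSB.
        return (value >> (total_bits - offset - width)) & ((1 << width) - 1)

    audio_object_type = field(0, 5)
    frequency_index = field(5, 4)
    channel_configuration = field(9, 4)
    if audio_object_type == 31 or frequency_index >= len(SAMPLING_FREQUENCIES):
        raise ValueError("escape codes are not handled in this sketch")
    return (audio_object_type, SAMPLING_FREQUENCIES[frequency_index],
            channel_configuration)
```

  Under these assumptions, config=11B0 yields channel configuration 6
  (5.1 channels) at 48000 Hz, and the value 1388 yields one channel at
  22050 Hz.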

3.3.7.  Additional Modes

  This specification only defines the modes specified in sections 3.3.2
  through 3.3.6.  Additional modes are expected to be defined in future
  RFCs.  Each additional mode MUST be in full compliance with this
  specification.

  Any new mode MUST be defined such that an implementation including
  all the features of this specification can decode the payload format
  corresponding to this new mode.  For this reason, a mode MUST NOT
  specify new default values for MIME parameters.  In particular, MIME
  parameters that configure the RTP payload MUST be present (unless
  they have the default value), even if their presence is redundant
  because the mode assigns a fixed value to a parameter.  A mode may
  additionally define that some MIME parameters are required instead of
  optional, that some MIME parameters have fixed values (or ranges),
  and that there are rules restricting their usage.





4.  IANA Considerations

  This section describes the MIME types and names associated with this
  payload format.  Section 4.1 registers the MIME types, as per RFC
  2048 [3].

  This format may require additional information about the mapping to
  be made available to the receiver.  This is done using parameters
  described in the next section.

4.1.  MIME Type Registration

  MIME media type name: "video" or "audio" or "application"

  "video" MUST be used for MPEG-4 Visual streams (ISO/IEC 14496-2) or
  MPEG-4 Systems streams (ISO/IEC 14496-1) that convey information
  needed for an audio/visual presentation.

  "audio" MUST be used for MPEG-4 Audio streams (ISO/IEC 14496-3) or
  MPEG-4 Systems streams that convey information needed for an audio
  only presentation.

  "application" MUST be used for MPEG-4 Systems streams (ISO/IEC
  14496-1) that serve purposes other than audio/visual presentation,
  e.g., in some cases when MPEG-J (Java) streams are transmitted.

  Depending on the required payload configuration, MIME format
  parameters may need to be available to the receiver.  This is done
  using the parameters described in the next section.  There are
  required and optional parameters.

  Optional parameters are of two types: general parameters and
  configuration parameters.  The configuration parameters are used to
  configure the fields in the AU Header section and in the auxiliary
  section.  The absence of any configuration parameter is equivalent to
  the associated field set to its default value, which is always zero.
  The absence of all configuration parameters results in a default
  "basic" configuration with an empty AU-header section and an empty
  auxiliary section in each RTP packet.

  MIME subtype name: mpeg4-generic

  Required parameters:

  MIME format parameters are not case dependent; for clarity however,
  both upper and lower case are used in the names of the parameters
  described in this specification.




     streamType:
     The integer value that indicates the type of MPEG-4 stream that is
     carried; its coding corresponds to the values of the streamType,
     as defined in Table 9 (streamType Values) in ISO/IEC 14496-1.

     profile-level-id:
     A decimal representation of the MPEG-4 Profile Level indication.
     This parameter MUST be used in the capability exchange or session
     set-up procedure to indicate the MPEG-4 Profile and Level
     combination of which the relevant MPEG-4 media codec is capable.

     For MPEG-4 Audio streams, this parameter is the decimal value from
        Table 5 (audioProfileLevelIndication Values) in ISO/IEC 14496-
        1, indicating which MPEG-4 Audio tool subsets are required to
        decode the audio stream.

     For MPEG-4 Visual streams, this parameter is the decimal value
        from Table G-1 (FLC table for profile and level indication) of
        ISO/IEC 14496-2 [1], indicating which MPEG-4 Visual tool
        subsets are required to decode the visual stream.

     For BIFS streams, this parameter is the decimal value obtained
        from (SPLI + 256*GPLI), where:
        SPLI is the decimal value from Table 4 in ISO/IEC 14496-1 with
           the applied sceneProfileLevelIndication;
        GPLI is the decimal value from Table 7 in ISO/IEC 14496-1 with
           the applied graphicsProfileLevelIndication.

     For MPEG-J streams, this parameter is the decimal value from table
        13 (MPEGJProfileLevelIndication) in ISO/IEC 14496-1, indicating
        the profile and level of the MPEG-J stream.

     For OD streams, this parameter is the decimal value from table 3
        (ODProfileLevelIndication) in ISO/IEC 14496-1, indicating the
        profile and level of the OD stream.

     For IPMP streams, this parameter has either the decimal value 0,
        indicating an unspecified profile and level, or a value larger
        than zero, indicating an MPEG-4 IPMP profile and level as
        defined in a future MPEG-4 specification.

     For Clock Reference streams and Object Content Info streams, this
        parameter has the decimal value zero, indicating that profile
        and level information is conveyed through the OD framework.
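
     For BIFS streams, the composition of profile-level-id from SPLI
     and GPLI given above can be sketched as follows.  The function
     names are hypothetical, and both indications are assumed to fit
     in 8 bits, which is what makes the value uniquely decomposable.

```python
def bifs_profile_level_id(spli: int, gpli: int) -> int:
    """Combine scene (SPLI) and graphics (GPLI) indications into one value."""
    if not (0 <= spli <= 255 and 0 <= gpli <= 255):
        raise ValueError("SPLI and GPLI are assumed to be 8-bit values")
    return spli + 256 * gpli


def split_bifs_profile_level_id(value: int):
    """Recover (SPLI, GPLI) from a BIFS profile-level-id."""
    return value % 256, value // 256
```

     With the illustrative inputs SPLI=1 and GPLI=2, the signaled
     value is 1 + 256*2 = 513.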







     config:
     A hexadecimal representation of an octet string that expresses the
     media payload configuration.  Configuration data is mapped onto
     the hexadecimal octet string on an MSB-first basis.  The first bit
     of the configuration data SHALL be located at the MSB of the first
     octet.  In the last octet, if necessary to achieve octet
     alignment, up to 7 zero-valued padding bits shall follow the
     configuration data.

     For MPEG-4 Audio streams, config is the audio object type specific
        decoder configuration data AudioSpecificConfig(), as defined in
        ISO/IEC 14496-3.  For Structured Audio, the
        AudioSpecificConfig() may be conveyed by other means, not
        defined by this specification.  If the AudioSpecificConfig() is
        conveyed by other means for Structured Audio, then the config
        MUST be a quoted empty hexadecimal octet string, as follows:
        config="".

        Note that a future mode of using this RTP payload format for
        Structured Audio may define such other means.

     For MPEG-4 Visual streams, config is the MPEG-4 Visual
        configuration information as defined in subclause 6.2.1,
        "Start codes", of ISO/IEC 14496-2.  The configuration
        information indicated by this parameter SHALL be the same as
        the configuration information in the corresponding MPEG-4
        Visual stream, except for first-half-vbv-occupancy and
        latter-half-vbv-occupancy, if they exist, which may vary in
        the repeated configuration information inside an MPEG-4 Visual
        stream (see 6.2.1, "Start codes", of ISO/IEC 14496-2).

     For BIFS streams, this is the BIFSConfig() information as defined
        in ISO/IEC 14496-1.  Version 1 of BIFSConfig is defined in
        section 9.3.5.2, and version 2 is defined in section 9.3.5.3.
        The MIME format parameter objectType signals the version of
        BIFSConfig.

     For IPMP streams, this is either a quoted empty hexadecimal octet
        string, indicating the absence of any decoder configuration
        information (config=""), or the IPMPConfiguration() as will be
        defined in a future MPEG-4 IPMP specification.

     For Object Content Info (OCI) streams, this is the
        OCIDecoderConfiguration() information of the OCI stream, as
        defined in section 8.4.2.4 in ISO/IEC 14496-1.






     For OD streams, Clock Reference streams and MPEG-J streams, this
        is a quoted empty hexadecimal octet string (config=""), as no
        information on the decoder configuration is required.
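
     The MSB-first mapping and zero padding described for the config
     parameter can be sketched as follows.  The function name and the
     use of a string of '0' and '1' characters as input are
     assumptions made for illustration only.

```python
def bits_to_config_hex(bits: str) -> str:
    """Map a bit string onto a hexadecimal octet string, MSB first.

    The first bit lands in the MSB of the first octet; up to 7
    zero-valued padding bits are appended to reach octet alignment.
    An empty input gives the empty string, as in config="".
    """
    if set(bits) - {"0", "1"}:
        raise ValueError("bits must contain only '0' and '1'")
    padding = (-len(bits)) % 8
    padded = bits + "0" * padding
    octets = [int(padded[i:i + 8], 2) for i in range(0, len(padded), 8)]
    return "".join(format(octet, "02X") for octet in octets)
```

     For example, the 13 bits 0001000110110 are padded with three
     zero bits and expressed as 11B0.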

     mode:
     The mode in which this specification is used.  The following modes
     can be signaled:

     mode=generic,
     mode=CELP-cbr,
     mode=CELP-vbr,
     mode=AAC-lbr and
     mode=AAC-hbr.

     Other modes are expected to be defined in future RFCs.  See also
     sections 3.3.7 and 4.2 of RFC 3640.

  Optional general parameters:

     objectType:
     The decimal value from Table 8 in ISO/IEC 14496-1, indicating the
     value of the objectTypeIndication of the transported stream.  For
     BIFS streams, this parameter MUST be present to signal the version
     of BIFSConfig().  Note that objectTypeIndication may signal
     a non-MPEG-4 stream and that the RTP payload format defined in
     this document may not be suitable for carrying a stream that is
     not defined by MPEG-4.  The objectType parameter SHOULD NOT be set
     to a value that signals a stream that cannot be carried by this
     payload format.

     constantSize:
     The constant size in octets of each Access Unit for this stream.
     The constantSize and the sizeLength parameters MUST NOT be
     simultaneously present.

     constantDuration:
     The constant duration of each Access Unit for this stream,
     measured with the same units as the RTP time stamp.

     maxDisplacement:
     The decimal representation of the maximum displacement in time of
     an interleaved AU, as defined in section 3.2.3.3, expressed in
     units of the RTP time stamp clock.

     This parameter MUST be present when interleaving is applied.






     de-interleaveBufferSize:
     The decimal representation in number of octets of the size of the
     de-interleave buffer, described in section 3.2.3.3.  When
     interleaving, this parameter MUST be present if the calculation of
     the de-interleave buffer size given in 3.2.3.3 and based on
     maxDisplacement and rate(max) under-estimates the size of the
     de-interleave buffer.  If this calculation does not under-estimate
     the size of the de-interleave buffer, then the
     de-interleaveBufferSize parameter SHOULD NOT be present.

  Optional configuration parameters:

     sizeLength:
     The number of bits on which the AU-size field is encoded in the
     AU-header.  The sizeLength and the constantSize parameters MUST
     NOT be simultaneously present.

     indexLength:
     The number of bits on which the AU-Index is encoded in the first
     AU-header.  The default value of zero indicates the absence of the
     AU-Index field in each first AU-header.

     indexDeltaLength:
     The number of bits on which the AU-Index-delta field is encoded in
     any non-first AU-header.  The default value of zero indicates the
     absence of the AU-Index-delta field in each non-first AU-header.

     CTSDeltaLength:
     The number of bits on which the CTS-delta field is encoded in the
     AU-header.

     DTSDeltaLength:
     The number of bits on which the DTS-delta field is encoded in the
     AU-header.

     randomAccessIndication:
     A decimal value of zero or one, indicating whether the RAP-flag is
     present in the AU-header.  The decimal value of one indicates
     presence of the RAP-flag, the default value zero indicates its
     absence.

     streamStateIndication:
     The number of bits on which the Stream-state field is encoded in
     the AU-header.  This parameter MAY be present when transporting
     MPEG-4 system streams, and SHALL NOT be present for MPEG-4 audio
     and MPEG-4 video streams.





     auxiliaryDataSizeLength:
     The number of bits that is used to encode the auxiliary-data-size
     field.

  Applications MAY use more parameters, in addition to those defined
  above.  Each additional parameter MUST be registered with IANA to
  ensure that there is not a clash of names.  Each additional parameter
  MUST be accompanied by a specification in the form of an RFC, MPEG
  standard, or other permanent and readily available reference (the
  "Specification Required" policy defined in RFC 2434 [6]).  Receivers
  MUST tolerate the presence of such additional parameters, but these
  parameters SHALL NOT impact the decoding of receivers that comply
  with this specification.

  Encoding considerations:
  This MIME subtype is defined for RTP transport only.  System
  bitstreams MUST be generated according to MPEG-4 Systems
  specifications (ISO/IEC 14496-1).  Video bitstreams MUST be generated
  according to MPEG-4 Visual specifications (ISO/IEC 14496-2).  Audio
  bitstreams MUST be generated according to MPEG-4 Audio specifications
  (ISO/IEC 14496-3).  The RTP packets MUST be packetized according to
  the RTP payload format defined in RFC 3640.

  Security considerations:
  As defined in section 5 of RFC 3640.

  Interoperability considerations:
  MPEG-4 provides a large and rich set of tools for the coding of
  visual objects.  For effective implementation of the standard,
  subsets of the MPEG-4 tool sets have been provided for use in
  specific applications.  These subsets, called 'Profiles', limit the
  size of the tool set a decoder is required to implement.  In order to
  restrict computational complexity, one or more 'Levels' are set for
  each Profile.  A Profile@Level combination allows:

      .  a codec builder to implement only the subset of the standard
         that is needed, while maintaining interworking with other
         MPEG-4 devices that implement the same combination, and

      .  checking whether MPEG-4 devices comply with the standard
         ('conformance testing').

  A stream SHALL be compliant with the MPEG-4 Profile@Level specified
  by the parameter "profile-level-id".  Interoperability between a
  sender and a receiver is achieved by specifying the parameter
  "profile-level-id" in MIME content.  In the capability
  exchange/announcement procedure, the sender and the receiver may
  mutually set this parameter to the same value.



  Published specification:
  The specifications for MPEG-4 streams are presented in ISO/IEC
  14496-1, 14496-2, and 14496-3.  The RTP payload format is described
  in RFC 3640.

  Applications which use this media type:
  Multimedia streaming and conferencing tools.

  Additional information: none

  Magic number(s): none

  File extension(s):
  None.  A file format with the extension .mp4 has been defined for
  MPEG-4 content, but it is not directly correlated with this MIME
  type, whose sole purpose is RTP transport.

  Macintosh File Type Code(s): none

  Person & email address to contact for further information:
  Authors of RFC 3640, IETF Audio/Video Transport working group.

  Intended usage: COMMON

  Author/Change controller:
  Authors of RFC 3640, IETF Audio/Video Transport working group.

4.2.  Registration of Mode Definitions with IANA

  This specification can be used in a number of modes.  The mode of
  operation is signaled using the "mode" MIME parameter, with the
  initial set of values specified in section 4.1.  New modes may be
  defined at any time, as described in section 3.3.7.  These modes MUST
  be registered with IANA, to ensure that there is not a clash of
  names.

  A new mode registration MUST be accompanied by a specification in the
  form of an RFC, MPEG standard, or other permanent and readily
  available reference (the "Specification Required" policy defined in
  RFC 2434 [6]).

4.3.  Concatenation of Parameters

  Multiple parameters SHOULD be expressed as a MIME media type string,
  in the form of a semicolon-separated list of parameter=value pairs
  (for parameter usage examples, see sections 3.3.2 through 3.3.6).





4.4.  Usage of SDP

4.4.1.  The a=fmtp Keyword

  It is assumed that one typical way to transport the above-described
  parameters associated with this payload format is via an SDP message
  [5], for example, transported to the client in reply to an RTSP
  DESCRIBE [8] or via SAP [11].  In that case, the (a=fmtp) keyword
  MUST be used as described in RFC 2327 [5], section 6, the syntax then
  being:

  a=fmtp:<format> <parameter name>=<value>[; <parameter name>=<value>]
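
  A receiver could split such a line into its parameters as sketched
  below.  This is a minimal illustration, not a complete SDP parser:
  the function name is hypothetical, the line is assumed to be already
  unwrapped into a single line, and parameter names are folded to
  lower case because they are not case dependent.

```python
def parse_fmtp(line: str):
    """Return (format, {parameter name: value}) for one a=fmtp line."""
    prefix = "a=fmtp:"
    if not line.startswith(prefix):
        raise ValueError("not an a=fmtp line")
    fmt, _, rest = line[len(prefix):].partition(" ")
    params = {}
    for item in rest.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing semicolon
        name, sep, value = item.partition("=")
        if not sep:
            raise ValueError("parameter without '=': " + item)
        # Strip surrounding quotes so that config="" becomes an empty string.
        params[name.strip().lower()] = value.strip().strip('"')
    return fmt, params
```

  Unknown parameter names are kept rather than rejected, since
  receivers MUST tolerate the presence of additional parameters.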

5.  Security Considerations

  RTP packets using the payload format defined in this specification
  are subject to the security considerations discussed in the RTP
  specification [2].  This implies that confidentiality of the media
  streams is achieved by encryption.  Because the data compression used
  with this payload format is applied end-to-end, encryption may be
  performed on the compressed data so there is no conflict between the
  two operations.  The packet processing complexity of this payload
  type (i.e., excluding media data processing) does not exhibit any
  significant non-uniformity in the receiver side to cause a denial-
  of-service threat.

  However, it is possible to inject non-compliant MPEG streams (Audio,
  Video, and Systems) so that the receiver/decoder's buffers are
  overloaded, which might compromise the functionality of the receiver
  or even crash it.  This is especially true for end-to-end systems
  like MPEG, where the buffer models are precisely defined.

  MPEG-4 Systems support stream types including commands that are
  executed on the terminal, like OD commands, BIFS commands, etc. and
  programmatic content like MPEG-J (Java(TM) Byte Code) and MPEG-4
  scripts.  It is possible to use one or more of the above in a manner
  non-compliant to MPEG to crash the receiver or make it temporarily
  unavailable.  Senders that transport MPEG-4 content SHOULD ensure
  that such content is MPEG compliant, as defined in the compliance
  part of ISO/IEC 14496 [1].  Receivers that support MPEG-4 content
  should prevent malfunctioning of the receiver in case of
  non-MPEG-compliant content.

  Authentication mechanisms can be used to validate the sender and the
  data to prevent security problems due to non-compliant malignant
  MPEG-4 streams.





  In ISO/IEC 14496-1, a security model is defined for MPEG-4 Systems
  streams carrying MPEG-J access units that comprise Java(TM) classes
  and objects.  MPEG-J defines a set of Java APIs and a secure
  execution model.  MPEG-J content can call this set of APIs and
  Java(TM) methods from a set of Java packages supported in the
  receiver within the defined security model.  According to this
  security model, downloaded byte code is forbidden to load libraries,
  define native methods, start programs, read or write files, or read
  system properties. Receivers can implement intelligent filters to
  validate the buffer requirements or parametric (OD, BIFS, etc.) or
  programmatic (MPEG-J, MPEG-4 scripts) commands in the streams.
  However, this can increase the complexity significantly.

  Implementors of MPEG-4 streaming over RTP who also implement MPEG-4
  scripts (subset of ECMAScript) MUST ensure that the action of such
  scripts is limited solely to the domain of the single presentation in
  which they reside (thus disallowing session to session communication,
  access to local resources and storage, etc).  Though loading static
  network-located resources (such as media) into the presentation
  should be permitted, network access by scripts MUST be restricted to
  such a (media) download.

6.  Acknowledgements

  This document evolved into RFC 3640 after several revisions, thanks
  to contributions from people in the ISMA forum, the IETF AVT Working
  Group, and the 4-on-IP ad-hoc group within MPEG.  The authors wish to
  thank all people involved, particularly Andrea Basso, Stephen Casner,
  M. Reha Civanlar, Carsten Herpel, John Lazaro, Zvi Lifshitz,
  Young-kwon Lim, Alex MacAulay, Bill May, Colin Perkins, Dorairaj V,
  and Stephan Wenger, for their valuable comments and support.




















APPENDIX: Usage of this Payload Format

Appendix A.  Interleave Analysis

A.  Examples of Delay Analysis with Interleave

A.1.  Introduction

  Interleaving issues are discussed in this appendix.  Some general
  notes are provided on de-interleaving and error concealment, while a
  number of interleaving patterns are examined, in particular for
  determining the size of the de-interleave buffer and the maximum
  displacement of access units in time.  In these examples, the maximum
  displacement is cited in terms of an access unit count, for ease of
  reading.  In actual streams, it is signaled in units of the RTP time
  stamp clock.

A.2.  De-interleaving and Error Concealment

  This appendix does not describe any details on de-interleaving and
  error concealment, as the control of the AU decoding and error
  concealment process has little to do with interleaving.  If the next
  AU to be decoded is present and there is sufficient storage available
  for the decoded AU, then decode it immediately.  If not, wait.  When
  the decoding deadline is reached (i.e., the time when decoding must
  begin in order to be completed by the time the AU is to be
  presented), or if the decoder is some hardware that presents a
  constant delay between initiation of decoding of an AU and
  presentation of that AU, then decoding must begin at that deadline
  time.

  If the next AU to be decoded is not present when the decoding
  deadline is reached, then that AU is lost so the receiver must take
  whatever error concealment measures are deemed appropriate.  The
  play-out delay may need to be adjusted at that point (especially if
  other AUs have also missed their deadline recently).  Or, if it was a
  momentary delay, and maintaining the latency is important, then the
  receiver should minimize the glitch and continue processing with the
  next AU.

A.3.  Simple Group Interleave

A.3.1.  Introduction

  An example of regular interleave is when packets are formed into
  groups.  If the 'stride' of the interleave (the distance between
  interleaved AUs) is N, packet 0 could contain AU(0), AU(N), AU(2N),
  and so on; packet 1 could contain AU(1), AU(1+N), AU(1+2N), and so
  on.  If there are M access units in a packet, then there are M*N
  access units in the group.

  An example with N=M=3 follows; note that this is the same example as
  given in section 2.5 and that a fixed time duration per Access Unit
  is assumed:

  Packet   Time stamp   Carried AUs      AU-Index, AU-Index-delta
  P(0)     T[0]         0, 3, 6          0, 2, 2
  P(1)     T[1]         1, 4, 7          0, 2, 2
  P(2)     T[2]         2, 5, 8          0, 2, 2
  P(3)     T[9]         9,12,15          0, 2, 2

  In this example, the AU-Index is present in the first AU-header and
  coded with the value 0, as required for fixed duration AUs.  The
  position of the first AU of each packet within the group is defined
  by the RTP time stamp, while the AU-Index-delta field indicates the
  position of subsequent AUs relative to the first AU in the packet.
  All AU-Index-delta fields are coded with the value N-1, equal to 2 in
  this example.  Hence the RTP time stamp and the AU-Index-delta are
  used to reconstruct the original order.  See also section 3.2.3.2.
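
  The packet formation in this example can be sketched as follows.
  The function name and the tuple layout are assumptions made for
  illustration; the time stamp is represented by the index of the
  first AU in the packet, and a fixed duration per AU is assumed.

```python
def group_interleave(stride: int, per_packet: int, groups: int = 1):
    """Form simple group-interleaved packets.

    Returns one tuple per packet:
    (index of first AU, carried AU numbers, AU-Index/AU-Index-delta values).
    """
    packets = []
    group_size = stride * per_packet  # M*N access units per group
    for group in range(groups):
        base = group * group_size
        for p in range(stride):
            carried = [base + p + k * stride for k in range(per_packet)]
            # AU-Index is 0; each AU-Index-delta is coded as stride - 1.
            index_fields = [0] + [stride - 1] * (per_packet - 1)
            packets.append((carried[0], carried, index_fields))
    return packets
```

  With stride N=3 and M=3 AUs per packet, the first four packets
  reproduce the rows P(0) through P(3) of the table above.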

A.3.2.  Determining the De-interleave Buffer Size

  For the regular pattern as in this example, Figure 6 in section
  3.2.3.3 shows that the de-interleave buffer stores at most 4 AUs.  A
  de-interleaveBufferSize value that is at least equal to the total
  number of octets of any 4 "early" AUs that are stored at the same
  time may be signaled.

A.3.3.  Determining the Maximum Displacement

  For the regular pattern as in this example, Figure 7 in section
  3.2.3.3 shows that the maximum displacement in time equals 5 AU
  periods.  Hence, the minimum maxDisplacement value that must be
  signaled is 5 AU periods.  If each AU has the same size, this
  maxDisplacement value over-estimates the de-interleave buffer size
  by one AU.  However, note that in case of variable AU sizes, the
  total size of any 4 "early" AUs that must be stored at the same time
  may exceed maxDisplacement times the maximum bitrate, in which case
  the de-interleaveBufferSize must be signaled.
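
  Both quantities can be derived from the order in which AUs arrive.
  The sketch below is an illustration under two assumptions inferred
  from the figures in this appendix: an AU is "early" with respect to
  an arriving AU if it arrived before it but has a higher AU number,
  and the displacement of an arriving AU is its distance, in AU
  periods, to the earliest lower-numbered AU not yet present.  The
  function names are hypothetical.

```python
def max_early_aus(order):
    """Largest number of "early" AUs held when any AU in `order` arrives."""
    worst = 0
    for position, au in enumerate(order):
        # AUs already received that come after this AU in the original order.
        early = [a for a in order[:position] if a > au]
        worst = max(worst, len(early))
    return worst


def max_displacement_in_aus(order):
    """Maximum displacement, counted in AU periods rather than RTP ticks."""
    worst = 0
    seen = set()
    for au in order:
        # Lower-numbered AUs that have not arrived yet.
        missing = [a for a in range(au) if a not in seen]
        if missing:
            worst = max(worst, au - min(missing))
        seen.add(au)
    return worst
```

  For the arrival order 0, 3, 6, 1, 4, 7, 2, 5, 8 of this example,
  these functions give 4 "early" AUs and a displacement of 5 AU
  periods.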










A.4.  More Subtle Group Interleave

A.4.1.  Introduction

  Another example of forming packets with group interleave is given
  below.  In this example, the packets are formed such that the loss of
  two subsequent RTP packets does not cause the loss of two subsequent
  AUs.  Note that in this example, the RTP time stamps of packet 3 and
  packet 4 are earlier than the RTP time stamps of packets 1 and 2,
  respectively; a fixed time duration per Access Unit is assumed.

  Packet   Time stamp   Carried AUs      AU-Index, AU-Index-delta
  0        T[0]         0,  5            0, 4
  1        T[2]         2,  7            0, 4
  2        T[4]         4,  9            0, 4
  3        T[1]         1,  6            0, 4
  4        T[3]         3,  8            0, 4
  5        T[10]       10, 15            0, 4
  and so on ..

  In this example, the AU-Index is present in the first AU-header and
  coded with the value 0, as required for AUs with a fixed duration.
  To reconstruct the original order, the RTP time stamp and the AU-
  Index-delta (coded with the value 4) are used.  See also section
  3.2.3.2.

A.4.2.  Determining the De-interleave Buffer Size

  From Figure 8, it can be determined that at most 5 "early" AUs are
  to be stored.  If the AUs are of constant size, then the required
  storage equals 5 times the AU size.  The minimum size of the
  de-interleave buffer equals the maximum total number of octets of
  the "early" AUs that are to be stored at the same time.  This gives
  the minimum value of the de-interleaveBufferSize that may be
  signaled.

                             +--+--+--+--+--+--+--+--+--+--+
  Interleaved AUs            | 0| 5| 2| 7| 4| 9| 1| 6| 3| 8|
                             +--+--+--+--+--+--+--+--+--+--+
                               -  -  5  -  5  -  2  7  4  9
                                           7     4  9  5
  "Early" AUs                                    5     6
                                                 7     7
                                                 9     9

  Figure 8: Storage of "early" AUs in the de-interleave buffer per
            interleaved AU.





A.4.3.  Determining the Maximum Displacement

  From Figure 9, it can be seen that the maximum displacement in time
  equals 8 AU periods.  Hence the minimum maxDisplacement value to be
  signaled is 8 AU periods.

                                   +--+--+--+--+--+--+--+--+--+--+
  Interleaved AUs                  | 0| 5| 2| 7| 4| 9| 1| 6| 3| 8|
                                   +--+--+--+--+--+--+--+--+--+--+

  Earliest not yet present AU        -  1  1  1  1  1  -  3  -  -

  Figure 9: For each AU in the interleaving pattern, the earliest of
            any earlier AUs not yet present

  If each AU has the same size, this maxDisplacement value
  over-estimates the de-interleave buffer size by three AUs.
  However, in case of variable AU sizes, the total size of any 5
  "early" AUs stored at the same time may exceed maxDisplacement times
  the maximum bitrate, in which case de-interleaveBufferSize must be
  signaled.
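
  The comparison described above can be sketched as follows.  This is
  an illustration only: the function name, the AU sizes, the clock
  rate and the bitrate are hypothetical, the maximum bitrate is
  assumed to be given in bits per second, and the estimate is assumed
  to be maxDisplacement (converted from RTP ticks to seconds) times
  that bitrate, expressed in octets.

```python
def needs_buffer_size_param(order, sizes, max_displacement_ticks,
                            clock_rate, max_bitrate):
    """Compare the real "early" AU storage with the maxDisplacement estimate.

    order: AU numbers in arrival order; sizes: AU number -> size in octets.
    Returns (must_signal, worst_case_octets, estimated_octets).
    """
    worst_octets = 0
    for position, au in enumerate(order):
        # Total size of AUs received earlier that follow this AU in time.
        early_octets = sum(sizes[a] for a in order[:position] if a > au)
        worst_octets = max(worst_octets, early_octets)
    # Assumed estimate: displacement in seconds times bitrate, in octets.
    estimated_octets = max_displacement_ticks * max_bitrate / (8 * clock_rate)
    return worst_octets > estimated_octets, worst_octets, estimated_octets
```

  With ten AUs of 100 octets each, 1024 ticks per AU at 48000 Hz and
  a bitrate of 37500 bits per second, the estimate is 800 octets
  against 500 octets actually needed, which matches the
  over-estimation by three AUs noted above.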

A.5.  Continuous Interleave

A.5.1.  Introduction

  In continuous interleave, once the scheme is 'primed', the number of
  AUs in a packet exceeds the 'stride' (the distance between them).
  This shortens the buffering needed, smooths the data flow, and gives
  slightly larger packets -- and thus lower overhead -- for the same
  interleave.  For example, here is a continuous interleave also over a
  stride of 3 AUs, but with 4 AUs per packet, for a run of 21 AUs (AU 0
  through AU 20).  This shows both how the scheme 'starts up' and how
  it finishes.  Once again, the example assumes fixed time duration per
  Access Unit.

  Packet   Time-stamp   Carried AUs         AU-Index, AU-Index-delta
  0        T[0]                      0      0
  1        T[1]                  1   4      0  2
  2        T[2]              2   5   8      0  2  2
  3        T[3]          3   6   9  12      0  2  2  2
  4        T[7]          7  10  13  16      0  2  2  2
  5        T[11]        11  14  17  20      0  2  2  2
  6        T[15]        15  18              0  2
  7        T[19]        19                  0

  In this example, the AU-Index is present in the first AU-header and
  coded with the value 0, as required for AUs with a fixed duration.
  To reconstruct the original order, the RTP time stamp and the
  AU-Index-delta (coded with the value 2) are used.  See also section
  3.2.3.2.  Note that this example has RTP time stamps in increasing
  order.
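
  The table above can be reproduced with the sketch below.  The
  function name and the construction (packets whose first AU advances
  by the number of AUs per packet, with AUs outside the run dropped
  during start-up and finish) are assumptions inferred from this one
  example, not a rule taken from the specification.  The sketch also
  assumes that the stride and the number of AUs per packet have no
  common factor, so that every AU is carried exactly once.

```python
def continuous_interleave(stride: int, per_packet: int, last_au: int):
    """List the AUs carried by each packet for AUs 0 through last_au."""
    packets = []
    # Start early enough that AU 0 occupies the last slot of the first packet.
    start = -(per_packet - 1) * stride
    while start <= last_au:
        carried = [start + k * stride for k in range(per_packet)
                   if 0 <= start + k * stride <= last_au]
        if carried:
            packets.append(carried)
        start += per_packet
    return packets
```

  Called with a stride of 3, 4 AUs per packet and a last AU of 20,
  this yields the eight packets listed in the table; the first AU of
  each packet gives its time stamp.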

A.5.2.  Determining the De-interleave Buffer Size

  For this example, the de-interleave buffer size can be derived from
  Figure 10.  The maximum number of "early" AUs is 3.  If the AUs are
  of constant size, then the de-interleave buffer size equals 3 times
  the AU size.  Compared to the example in A.3, for constant size AUs
  the de-interleave buffer size is reduced from 4 to 3 times the AU
  size, while maintaining the same 'stride'.

                       +--+--+--+--+--+--+--+--+--+--+--+--+--+--+-
  Interleaved AUs      | 0| 1| 4| 2| 5| 8| 3| 6| 9|12| 7|10|13|16|
                       +--+--+--+--+--+--+--+--+--+--+--+--+--+--+-
                         -  -  -  4  -  -  4  8  -  -  8 12  -  -
                                           5           9
  "Early" AUs                              8          12


  Figure 10: Storage of "early" AUs in the de-interleave buffer per
             interleaved AU.

A.5.3.  Determining the Maximum Displacement

  For this example, the maximum displacement has a value of 5 AU
  periods.  See Figure 11.  Compared to the example in A.3, the maximum
  displacement does not decrease, though in fact less de-interleave
  buffering is required.

                       +--+--+--+--+--+--+--+--+--+--+--+--+--+--+-
  Interleaved AUs      | 0| 1| 4| 2| 5| 8| 3| 6| 9|12| 7|10|13|16|
                       +--+--+--+--+--+--+--+--+--+--+--+--+--+--+-
  Earliest not yet
       present AU        -  -  2  -  3  3  -  -  7  7  -  - 11 11


  Figure 11: For each AU in the interleaving pattern, the earliest of
             any earlier AUs not yet present
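
  The derivation of Figure 11 can be sketched as follows.  This is a
  non-normative illustration in Python.  The function names
  earliest_missing and max_displacement are assumptions introduced
  here, AUs are assumed to be numbered consecutively from 0, and the
  displacement is expressed in AU periods rather than in time stamp
  units.

```python
def earliest_missing(interleaved):
    """For each received AU, return the earliest AU that precedes it in
    the original order but has not been received yet (None if there is
    no such AU)."""
    seen = set()
    result = []
    for au in interleaved:
        missing = [a for a in range(au) if a not in seen]
        result.append(min(missing) if missing else None)
        seen.add(au)
    return result


def max_displacement(interleaved):
    """Largest distance, in AU periods, between a received AU and the
    earliest preceding AU that is not yet present."""
    pairs = zip(interleaved, earliest_missing(interleaved))
    return max((au - m for au, m in pairs if m is not None), default=0)


# Interleaving pattern of Figure 11.
INTERLEAVED = [0, 1, 4, 2, 5, 8, 3, 6, 9, 12, 7, 10, 13, 16]
displacement = max_displacement(INTERLEAVED)
```

  For this pattern the largest distance occurs, for example, when AU 8
  is received while AU 3 is not yet present.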

References

Normative References

  [1]  ISO/IEC International Standard 14496 (MPEG-4), "Information
       technology - Coding of audio-visual objects", January 2000.

  [2]  Schulzrinne, H., Casner, S., Frederick, R. and V. Jacobson,
       "RTP:  A Transport Protocol for Real-Time Applications", RFC
       3550, July 2003.

  [3]  Freed, N., Klensin, J. and J. Postel, "Multipurpose Internet
       Mail Extensions (MIME) Part Four: Registration Procedures", BCP
       13, RFC 2048, November 1996.

  [4]  Bradner, S., "Key words for use in RFCs to Indicate Requirement
       Levels", BCP 14, RFC 2119, March 1997.

  [5]  Handley, M. and V. Jacobson, "SDP: Session Description
       Protocol", RFC 2327, April 1998.

  [6]  Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA
       Considerations Section in RFCs", BCP 26, RFC 2434, October 1998.

Informative References

  [7]  Hoffman, D., Fernando, G., Goyal, V. and M. Civanlar, "RTP
       Payload Format for MPEG1/MPEG2 Video", RFC 2250, January 1998.

  [8]  Schulzrinne, H., Rao, A. and R. Lanphier, "Real-Time Session
       Protocol (RTSP)", RFC 2326, April 1998.

  [9]  Perkins, C. and O. Hodson, "Options for Repair of Streaming
       Media", RFC 2354, June 1998.

  [10] Schulzrinne, H. and J. Rosenberg, "An RTP Payload Format for
       Generic Forward Error Correction", RFC 2733, December 1999.

  [11] Handley, M., Perkins, C. and E. Whelan, "Session Announcement
       Protocol", RFC 2974, October 2000.

  [12] Kikuchi, Y., Nomura, T., Fukunaga, S., Matsui, Y. and H. Kimata,
       "RTP Payload Format for MPEG-4 Audio/Visual Streams", RFC 3016,
       November 2000.

Authors' Addresses

  Jan van der Meer
  Philips Electronics
  Prof Holstlaan 4
  Building WAH-1
  5600 JZ Eindhoven
  Netherlands

  EMail: [email protected]


  David Mackie
  Apple Computer, Inc.
  One Infinite Loop, MS:302-3KS
  Cupertino, CA 95014

  EMail: [email protected]


  Viswanathan Swaminathan
  Sun Microsystems Inc.
  2600 Casey Avenue
  Mountain View, CA 94043

  EMail: [email protected]


  David Singer
  Apple Computer, Inc.
  One Infinite Loop, MS:302-3MT
  Cupertino, CA 95014

  EMail: [email protected]


  Philippe Gentric
  Philips Electronics
  51 rue Carnot
  92156 Suresnes
  France

  EMail: [email protected]

Full Copyright Statement

  Copyright (C) The Internet Society (2003).  All Rights Reserved.

  This document and translations of it may be copied and furnished to
  others, and derivative works that comment on or otherwise explain it
  or assist in its implementation may be prepared, copied, published
  and distributed, in whole or in part, without restriction of any
  kind, provided that the above copyright notice and this paragraph are
  included on all such copies and derivative works.  However, this
  document itself may not be modified in any way, such as by removing
  the copyright notice or references to the Internet Society or other
  Internet organizations, except as needed for the purpose of
  developing Internet standards in which case the procedures for
  copyrights defined in the Internet Standards process must be
  followed, or as required to translate it into languages other than
  English.

  The limited permissions granted above are perpetual and will not be
  revoked by the Internet Society or its successors or assignees.

  This document and the information contained herein is provided on an
  "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
  TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
  BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
  HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
  MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgement

  Funding for the RFC Editor function is currently provided by the
  Internet Society.