Network Working Group                                              A. Li
Request for Comments: 3558                                          UCLA
Category: Standards Track                                      July 2003


     RTP Payload Format for Enhanced Variable Rate Codecs (EVRC)
                  and Selectable Mode Vocoders (SMV)

Status of this Memo

  This document specifies an Internet standards track protocol for the
  Internet community, and requests discussion and suggestions for
  improvements.  Please refer to the current edition of the "Internet
  Official Protocol Standards" (STD 1) for the standardization state
  and status of this protocol.  Distribution of this memo is unlimited.

Copyright Notice

  Copyright (C) The Internet Society (2003).  All Rights Reserved.

Abstract

  This document describes the RTP payload format for Enhanced Variable
  Rate Codec (EVRC) Speech and Selectable Mode Vocoder (SMV) Speech.
  Two sub-formats are specified for different application scenarios.  A
  bundled/interleaved format is included to reduce the effect of packet
  loss on speech quality and amortize the overhead of the RTP header
  over more than one speech frame.  A non-bundled format is also
  supported for conversational applications.

Table of Contents

  1. Introduction ................................................... 2
  2. Background ..................................................... 2
  3. The Codecs Supported ........................................... 3
     3.1. EVRC ...................................................... 3
     3.2. SMV ....................................................... 3
     3.3. Other Frame-Based Vocoders ................................ 4
  4. RTP/Vocoder Packet Format ...................................... 4
     4.1. Interleaved/Bundled Packet Format ......................... 5
     4.2. Header-Free Packet Format ................................. 6
     4.3. Determining the Format of Packets ......................... 7
  5. Packet Table of Contents Entries and Codec Data Frame Format ... 7
     5.1. Packet Table of Contents entries .......................... 7
     5.2. Codec Data Frames ......................................... 8
  6. Interleaving Codec Data Frames ................................. 9
  7. Bundling Codec Data Frames .................................... 12
  8. Handling Missing Codec Data Frames ............................ 12



Li                          Standards Track                     [Page 1]

RFC 3558          RTP Payload Format for EVRC and SMV          July 2003


  9. Implementation Issues ......................................... 12
     9.1. Interleaving Length ...................................... 12
     9.2. Validation of Received Packets ........................... 13
     9.3. Processing the Late Packets .............................. 13
  10. Mode Request ................................................. 13
  11. Storage Format ............................................... 14
  12. IANA Considerations .......................................... 15
     12.1. Registration of Media Type EVRC ......................... 15
     12.2. Registration of Media Type EVRC0 ........................ 16
     12.3. Registration of Media Type SMV .......................... 17
     12.4. Registration of Media Type SMV0 ......................... 18
  13. Mapping to SDP Parameters .................................... 19
  14. Security Considerations ...................................... 20
  15. Adding Support of Other Frame-Based Vocoders ................. 20
  16. Acknowledgements ............................................. 21
  17. References ................................................... 21
     17.1 Normative ................................................ 21
     17.2 Informative .............................................. 22
  18. Author's Address ............................................. 22
  19. Full Copyright Statement ..................................... 23

1. Introduction

  This document describes how speech compressed with EVRC [1] or SMV
  [2] may be formatted for use as an RTP payload type.  The format is
  also extensible to other codecs that generate a similar set of frame
  types.  Two methods are provided to packetize the codec data frames
  into RTP packets: an interleaved/bundled format and a zero-header
  format.  The sender may choose the best format for each application
  scenario, based on network conditions, bandwidth availability, delay
  requirements, and packet-loss tolerance.

  The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
  "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
  document are to be interpreted as described in RFC 2119 [3].

2. Background

  The 3rd Generation Partnership Project 2 (3GPP2) has published two
  standards which define speech compression algorithms for CDMA
  applications: EVRC [1] and SMV [2].  EVRC is currently deployed in
  millions of first and second generation CDMA handsets.  SMV is the
  preferred speech codec standard for CDMA2000, and will be deployed in
  third generation handsets in addition to EVRC.  Improvements and new
  codecs will keep emerging as technology improves, and future handsets
  will likely support multiple codecs.

  The formats of the EVRC and SMV codec frames are very similar.  Many
  other vocoders also share common characteristics, and have many
  similar application scenarios.  This parallelism enables an RTP
  payload format to be designed for EVRC and SMV that may also support
  other, similar vocoders with minimal additional specification work.
  This can simplify the protocol for transporting vocoder data frames
  through RTP and reduce the complexity of implementations.

3. The Codecs Supported

3.1. EVRC

  The Enhanced Variable Rate Codec (EVRC) [1] compresses each 20
  milliseconds of 8000 Hz, 16-bit sampled speech input into output
  frames in one of the three different sizes: Rate 1 (171 bits), Rate
  1/2 (80 bits), or Rate 1/8 (16 bits).  In addition, there are two
  zero bit codec frame types: null frames and erasure frames.  Null
  frames are produced as a result of the vocoder running at rate 0.
  Null frames are zero bits long and are normally not transmitted.
  Erasure frames are substituted by the receiver, as input to the
  codec, for lost or damaged frames.  Erasure frames are also zero
  bits long and are normally not transmitted.

  The codec chooses the output frame rate based on analysis of the
  input speech and the current operating mode (either normal or one of
  several reduced rate modes).  For typical speech patterns, this
  results in an average output of 4.2 kilobits/second for normal mode
  and a lower average output for reduced rate modes.

3.2. SMV

  The Selectable Mode Vocoder (SMV) [2] compresses each 20 milliseconds
  of 8000 Hz, 16-bit sampled speech input into output frames of one of
  the four different sizes: Rate 1 (171 bits), Rate 1/2 (80 bits), Rate
  1/4 (40 bits), or Rate 1/8 (16 bits).  In addition, there are two
  zero bit codec frame types: null frames and erasure frames.  Null
  frames are produced as a result of the vocoder running at rate 0.
  Null frames are zero bits long and are normally not transmitted.
  Erasure frames are substituted by the receiver, as input to the
  codec, for lost or damaged frames.  Erasure frames are also zero
  bits long and are normally not transmitted.

  The SMV codec can operate in six modes.  Each mode may produce frames
  of any of the rates (full rate to 1/8 rate) for varying percentages
  of time, based on the characteristics of the speech samples and the
  selected mode.  The SMV mode can change on a frame-by-frame basis.
  The SMV codec does not need additional information other than the
  codec data frames to correctly decode the data of various modes;
  therefore, the mode of the encoder does not need to be transmitted
  with the encoded frames.

  The SMV codec chooses the output frame rate based on analysis of the
  input speech and the current operating mode.  For typical speech
  patterns, this results in an average output of 4.2 kilobits/second
  for Mode 0 in two way conversation (approximately 50% active speech
  time and 50% in eighth rate while listening) and lower for other
  reduced rate modes.  SMV is more bandwidth efficient than EVRC.  EVRC
  is equivalent in performance to SMV mode 1.

3.3. Other Frame-Based Vocoders

  Other frame-based vocoders can be carried in the packet format
  defined in this document, as long as they possess the following
  properties:

     o The codec is frame-based;
     o blank and erasure frames are supported;
     o the total number of rates is less than 17;
     o the maximum full rate frame can be transported in a single RTP
       packet using this specific format.

  Vocoders with the characteristics listed above can be transported
  using the packet format specified in this document with some
  additional specification work; the pieces that must be defined are
  listed in Section 15.

4. RTP/Vocoder Packet Format

  The vocoder speech data may be transmitted in either of the two RTP
  packet formats specified in the following two subsections, as
  appropriate for the application scenario.  In the packet format
  diagrams shown in this document, bit 0 is the most significant bit.

4.1. Interleaved/Bundled Packet Format

  This format is used to send one or more vocoder frames per packet.
  Interleaving or bundling MAY be used.  The RTP packet for this format
  is as follows:

   0                   1                   2                   3
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |                      RTP Header [4]                           |
  +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
  |R|R| LLL | NNN | MMM |  Count  |  TOC  |  ...  |  TOC  |padding|
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |        one or more codec data frames, one per TOC entry       |
  |                             ....                              |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

  The RTP header has the expected values as described in the RTP
  specification [4].  The RTP timestamp is in 1/8000 of a second units
  for EVRC and SMV.  For any other vocoders that use this packet
  format, the timestamp unit needs to be defined explicitly.  The M bit
  should be set as specified in the applicable RTP profile, for
  example, RFC 3551 [5].  Note that RFC 3551 [5] specifies that if the
  sender does not suppress silence, the M bit will always be zero.
  When multiple codec data frames are present in a single RTP packet,
  the timestamp is that of the oldest data represented in the RTP
  packet.  The assignment of an RTP payload type for this packet format
  is outside the scope of this document; it is specified by the RTP
  profile under which this payload format is used.

  The first octet of an Interleaved/Bundled format packet is the
  Interleave Octet.  The second octet contains the Mode Request and
  Frame Count fields.  The Table of Contents (ToC) field then follows.
  The fields are specified as follows:

  Reserved (RR): 2 bits
     Reserved bits.  MUST be set to zero by sender, SHOULD be ignored
     by receiver.

  Interleave Length (LLL): 3 bits
     Indicates the length of interleave; a value of 0 indicates
     bundling, a special case of interleaving.  See Section 6 and
     Section 7 for more detailed discussion.

  Interleave Index (NNN): 3 bits
     Indicates the index within an interleave group.  MUST have a value
     less than or equal to the value of LLL.  Values of NNN greater
     than the value of LLL are invalid.  Packets with invalid NNN values
     SHOULD be ignored by the receiver.

  Mode Request (MMM): 3 bits
     The Mode Request field is used to signal Mode Request information.
     See Section 10 for details.

  Frame Count (Count): 5 bits
     The number of ToC fields (and vocoder frames) present in the
     packet is the value of the frame count field plus one.  A value of
     zero indicates that the packet contains one ToC field, while a
     value of 31 indicates that the packet contains 32 ToC fields.

  Padding (padding): 0 or 4 bits
     This padding ensures that codec data frames start on an octet
     boundary.  When the number of ToC entries (the value of the Count
     field plus one) is odd, the sender MUST add 4 bits of padding
     following the last ToC entry.  When the number of ToC entries is
     even, the sender MUST NOT add padding bits.  If padding is
     present, the padding bits MUST be set to zero by sender, and
     SHOULD be ignored by receiver.

  The Table of Contents field (ToC) provides information on the codec
  data frame(s) in the packet.  There is one ToC entry for each codec
  data frame.  The detailed formats of the ToC field and codec data
  frames are specified in Section 5.

  Multiple data frames may be included within an Interleaved/Bundled
  packet using interleaving or bundling as described in Section 6 and
  Section 7.
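
  As an illustration of the field layout above, the following Python
  sketch parses and builds the payload header of an Interleaved/Bundled
  packet.  It is a minimal example and not part of this specification:
  the names PayloadHeader, parse_payload_header and
  build_payload_header are hypothetical, the input is assumed to be
  the RTP payload with the RTP header already removed, and raising an
  exception is merely one way for a receiver to ignore a packet whose
  NNN value is invalid.

```python
from typing import List, NamedTuple


class PayloadHeader(NamedTuple):
    lll: int          # Interleave Length
    nnn: int          # Interleave Index
    mmm: int          # Mode Request
    toc: List[int]    # one 4-bit frame type per codec data frame
    data_offset: int  # octet offset of the first codec data frame


def parse_payload_header(payload: bytes) -> PayloadHeader:
    """Parse the header of an Interleaved/Bundled payload."""
    if len(payload) < 3:
        raise ValueError("too short for the header and one ToC entry")
    lll = (payload[0] >> 3) & 0x07   # bits 2-4; reserved bits 0-1 ignored
    nnn = payload[0] & 0x07          # bits 5-7 of the Interleave Octet
    mmm = (payload[1] >> 5) & 0x07   # bits 0-2 of the second octet
    count = payload[1] & 0x1F        # Frame Count field
    if nnn > lll:
        raise ValueError("invalid packet: NNN exceeds LLL")
    n_toc = count + 1                # number of ToC entries and frames
    toc_octets = (n_toc + 1) // 2    # two entries per octet, padded if odd
    if len(payload) < 2 + toc_octets:
        raise ValueError("too short for the signaled ToC")
    toc = []
    for i in range(n_toc):
        octet = payload[2 + i // 2]
        toc.append(octet >> 4 if i % 2 == 0 else octet & 0x0F)
    return PayloadHeader(lll, nnn, mmm, toc, 2 + toc_octets)


def build_payload_header(lll: int, nnn: int, mmm: int, toc: List[int]) -> bytes:
    """Build the two header octets followed by the (padded) ToC."""
    if not (0 <= nnn <= lll <= 7 and 0 <= mmm <= 7 and 1 <= len(toc) <= 32):
        raise ValueError("field out of range")
    if not all(0 <= t <= 15 for t in toc):
        raise ValueError("ToC entries are 4 bits wide")
    out = bytearray([(lll << 3) | nnn, (mmm << 5) | (len(toc) - 1)])
    entries = list(toc) + [0] * (len(toc) % 2)  # 4 zero padding bits if odd
    for high, low in zip(entries[0::2], entries[1::2]):
        out.append((high << 4) | low)
    return bytes(out)
```

  The codec data frames start at data_offset, which is always a whole
  number of octets because the ToC is padded to an octet boundary.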

4.2. Header-Free Packet Format

  The Header-Free Packet Format is designed for maximum bandwidth
  efficiency and low latency.  Only one codec data frame can be sent in
  each Header-Free format packet.  None of the payload header fields
  (LLL, NNN, MMM, Count) nor ToC entries are present.  The codec rate
  for the data frame can be determined from the length of the codec
  data frame, since there is only one codec data frame in each
  Header-Free packet.

  Use of the RTP header fields for Header-Free RTP/Vocoder Packet
  Format is the same as described in Section 4.1 for
  Interleaved/Bundled RTP/Vocoder Packet Format.  The detailed format
  of the codec data frame is specified in Section 5.

   0                   1                   2                   3
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |                      RTP Header [4]                           |
  +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
  |                                                               |
  +          ONLY one codec data frame            +-+-+-+-+-+-+-+-+
  |                                               |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
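
  Because a Header-Free packet carries exactly one codec data frame,
  a receiver can infer the rate from the payload length alone.  The
  following Python sketch shows one possible way to do so for EVRC
  and SMV, using the frame sizes listed in Section 5.1.  The names
  RATE_BY_LENGTH and header_free_rate are hypothetical and are not
  defined by this specification.

```python
# Codec data frame length in octets -> rate (EVRC/SMV, Section 5.1).
RATE_BY_LENGTH = {2: "1/8", 5: "1/4", 10: "1/2", 22: "1"}


def header_free_rate(frame: bytes, codec: str = "SMV") -> str:
    """Infer the rate of the single frame in a Header-Free payload."""
    rate = RATE_BY_LENGTH.get(len(frame))
    # EVRC has no Rate 1/4 frames, so a 5-octet payload is unexpected.
    if rate is None or (codec == "EVRC" and rate == "1/4"):
        raise ValueError(f"unexpected {codec} frame length: {len(frame)}")
    return rate
```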

4.3. Determining the Format of Packets

  All receivers SHOULD be able to process both packet formats.  The
  sender MAY choose to use one or both packet formats.

  A receiver MUST have prior knowledge of the packet format to
  correctly decode the RTP packets.  When packets of both formats are
  used within the same session, different RTP payload type values MUST
  be used for each format to distinguish the packet formats.  The
  association of payload type number with the packet format is done
  out-of-band, for example by SDP during the setup of a session.

5. Packet Table of Contents Entries and Codec Data Frame Format

5.1. Packet Table of Contents entries

  Each codec data frame in an Interleaved/Bundled packet has a
  corresponding Table of Contents (ToC) entry.  The ToC entry indicates
  the rate of the codec frame.  (Header-Free packets MUST NOT have a
  ToC field.)

  Each ToC entry occupies four bits.  The format of the bits is
  indicated below:

      0 1 2 3
     +-+-+-+-+
     |fr type|
     +-+-+-+-+

  Frame Type: 4 bits
     The frame type indicates the type of the corresponding codec data
     frame in the RTP packet.

  For EVRC and SMV codecs, the frame type values and size of the
  associated codec data frame are described in the table below:

  Value   Rate      Total codec data frame size (in octets)
  ---------------------------------------------------------
    0     Blank      0    (0 bit)
    1     1/8        2    (16 bits)
    2     1/4        5    (40 bits; not valid for EVRC)
    3     1/2       10    (80 bits)
    4     1         22    (171 bits; 5 padded at end with zeros)
    5     Erasure    0    (SHOULD NOT be transmitted by sender)

  All values not listed in the above table MUST be considered reserved.
  A ToC entry with a reserved Frame Type value SHOULD be considered
  invalid.  Note that the EVRC codec does not have 1/4 rate frames,
  thus frame type value 2 MUST be considered a reserved value when the
  EVRC codec is in use.

  Other vocoders that use this packet format need to specify their own
  table of frame types and corresponding codec data frames.
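
  The frame type table above can be used to split the codec data
  that follows the ToC into individual frames.  The Python sketch
  below is one possible approach for EVRC and SMV; FRAME_SIZE and
  split_frames are hypothetical names, and the "data" argument is
  assumed to hold only the octets that follow the ToC (and its
  padding, if any).

```python
# Frame type value -> codec data frame size in octets (EVRC/SMV).
FRAME_SIZE = {0: 0, 1: 2, 2: 5, 3: 10, 4: 22, 5: 0}


def split_frames(toc, data, codec="SMV"):
    """Return a list of (frame_type, frame_bytes), one per ToC entry."""
    frames, pos = [], 0
    for frame_type in toc:
        # Unlisted values are reserved; value 2 is reserved for EVRC.
        if frame_type not in FRAME_SIZE or (codec == "EVRC" and frame_type == 2):
            raise ValueError(f"reserved frame type {frame_type} for {codec}")
        size = FRAME_SIZE[frame_type]
        frames.append((frame_type, data[pos:pos + size]))
        pos += size
    if pos != len(data):
        raise ValueError("payload length does not match the ToC")
    return frames
```

  A packet for which split_frames raises an exception would be treated
  by the caller as invalid.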

5.2. Codec Data Frames

  The output of the vocoder MUST be converted into codec data frames
  for inclusion in the RTP payload.  The conversions for EVRC and SMV
  codecs are specified below.  (Note: Because the EVRC codec does not
  have Rate 1/4 frames, the specification of Rate 1/4 frames does not
  apply to EVRC codec data frames).  Other vocoders that use this packet
  format need to specify how to convert vocoder output data into
  frames.

  The codec output data bits as numbered in EVRC and SMV are packed
  into octets.  The lowest numbered bit (bit 1 for Rate 1, Rate 1/2,
  Rate 1/4 and Rate 1/8) is placed in the most significant bit
  (internet bit 0) of octet 1 of the codec data frame, the second
  lowest bit is placed in the second most significant bit of the first
  octet, the third lowest in the third most significant bit of the
  first octet, and so on.  This continues until all of the bits have
  been placed in the codec data frame.

  The remaining unused bits of the last octet of the codec data frame
  MUST be set to zero.  Note that in EVRC and SMV this is only
  applicable to Rate 1 frames (171 bits) as the Rate 1/2 (80 bits),
  Rate 1/4 (40 bits, SMV only) and Rate 1/8 frames (16 bits) fit
  exactly into a whole number of octets.

  Following is a detailed listing showing a Rate 1 EVRC/SMV codec
  output frame converted into a codec data frame:

  The codec data frame for an EVRC/SMV codec Rate 1 frame is 22 octets
  long.  Bits 1 through 171 from the EVRC/SMV codec Rate 1 frame are
  placed as indicated, with bits marked with "Z" set to zero.  EVRC/SMV
  codec Rate 1/8, Rate 1/4 and Rate 1/2 frames are converted similarly,
  but do not require zero padding because they align on octet
  boundaries.

                       Rate 1 codec data frame

   0                   1                   2                   3
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
  |0|0|0|0|0|0|0|0|0|1|1|1|1|1|1|1|1|1|1|2|2|2|2|2|2|2|2|2|2|3|3|3|
  |1|2|3|4|5|6|7|8|9|0|1|2|3|4|5|6|7|8|9|0|1|2|3|4|5|6|7|8|9|0|1|2|
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  :                                                               :
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1| | | | | |
  |4|4|4|4|4|5|5|5|5|5|5|5|5|5|5|6|6|6|6|6|6|6|6|6|6|7|7|Z|Z|Z|Z|Z|
  |5|6|7|8|9|0|1|2|3|4|5|6|7|8|9|0|1|2|3|4|5|6|7|8|9|0|1| | | | | |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
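
  The bit placement shown above can be sketched in Python as follows.
  This is an illustrative example only: pack_codec_bits is a
  hypothetical name, and the codec output is assumed to be available
  as a sequence of 0/1 values in which element 0 holds codec bit 1.

```python
def pack_codec_bits(bits):
    """Pack codec output bits into octets, most significant bit first."""
    out = bytearray((len(bits) + 7) // 8)  # unused trailing bits stay zero
    for i, bit in enumerate(bits):
        if bit:
            out[i // 8] |= 0x80 >> (i % 8)
    return bytes(out)
```

  For a Rate 1 frame, 171 input bits yield 22 octets, with the five
  least significant bits of the last octet left at zero.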

6. Interleaving Codec Data Frames

  As indicated in Section 4.1, more than one codec data frame MAY be
  included in a single Interleaved/Bundled packet by a sender.  This is
  accomplished by interleaving or bundling.

  Bundling is used to spread the transmission overhead of the RTP and
  payload header over multiple vocoder frames.  Interleaving
  additionally reduces the listener's perception of data loss by
  spreading such loss over non-consecutive vocoder frames.  EVRC, SMV,
  and similar vocoders are able to compensate for an occasional lost
  frame, but speech quality degrades exponentially with consecutive
  frame loss.

  Bundling is signaled by setting the LLL field to zero and the Count
  field to greater than zero.  Interleaving is indicated by setting the
  LLL field to a value greater than zero.

  The discussion of general interleaving also applies, with reduced
  complexity, to bundling (which can be viewed as a reduced case of
  interleaving).  The bundling case is discussed in detail in Section 7.

  Senders MAY support interleaving and/or bundling.  All receivers that
  support Interleave/Bundling packet format MUST support both
  interleaving and bundling.

  Given a time-ordered sequence of output frames from the codec
  numbered 0..n, a bundling value B (the value in the Count field plus
  one), and an interleave length L where n = B * (L+1) - 1, the output
  frames are placed into RTP packets as follows (the values of the
  fields LLL and NNN are indicated for each RTP packet):

  First RTP Packet in Interleave group:
     LLL=L, NNN=0
     Frame 0, Frame L+1, Frame 2(L+1), Frame 3(L+1), ... for a total of
     B frames

  Second RTP Packet in Interleave group:
     LLL=L, NNN=1
     Frame 1, Frame 1+L+1, Frame 1+2(L+1), Frame 1+3(L+1), ... for a
     total of B frames

  This continues to the last RTP packet in the interleave group:

  (L+1)th RTP Packet in Interleave group:
     LLL=L, NNN=L
     Frame L, Frame L+L+1, Frame L+2(L+1), Frame L+3(L+1), ... for a
     total of B frames

  Within each interleave group, the RTP packets making up the
  interleave group MUST be transmitted in value-increasing order of the
  NNN field.  While this does not guarantee reduced end-to-end delay on
  the receiving end, when packets are delivered in order by the
  underlying transport, delay will be reduced to the minimum possible.
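
  The frame placement described above can be sketched in Python as
  follows.  The functions interleave and deinterleave are hypothetical
  names used only for illustration; frames may be any Python objects,
  and a frame position for which no packet was received is reported
  as None so that the caller can substitute an erasure frame.

```python
def interleave(frames, L, B):
    """Distribute B * (L + 1) time-ordered frames over L + 1 packets.

    Returns (LLL, NNN, frames_in_packet) tuples in transmission order.
    """
    if len(frames) != B * (L + 1):
        raise ValueError("an interleave group needs B * (L + 1) frames")
    # Packet N carries frames N, N + (L+1), N + 2(L+1), ...
    return [(L, n, frames[n::L + 1]) for n in range(L + 1)]


def deinterleave(packets, L, B):
    """Rebuild the time-ordered frame list from the received packets."""
    out = [None] * (B * (L + 1))
    for _, n, packet_frames in packets:
        for k, frame in enumerate(packet_frames):
            out[n + k * (L + 1)] = frame
    return out
```

  With L = 0 the group collapses to a single packet holding B
  consecutive frames, which is the bundling case.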

  Receivers MAY signal the maximum number of codec data frames (i.e.,
  the maximum acceptable bundling value B) they can handle in a single
  RTP packet using the OPTIONAL maxptime RTP mode parameter identified
  in Section 12.

  Receivers MAY signal the maximum interleave length (i.e., the maximum
  acceptable LLL value in the Interleaving Octet) they will accept
  using the OPTIONAL maxinterleave RTP mode parameter identified in
  Section 12.

  The parameters maxptime and maxinterleave are exchanged at the
  initial setup of the session.  In one-to-one sessions, the sender
  MUST respect these values set by the receiver, and MUST NOT
  interleave/bundle more packets than what the receiver signals that it
  can handle.  This ensures that the receiver can allocate a known
  amount of buffer space that will be sufficient for all
  interleaving/bundling used in that session.  During the session, the
  sender may decrease the bundling value or interleaving length (so
  that less buffer space is required at the receiver), but never exceed
  the maximum value set by the receiver.  This prevents the situation
  where a receiver needs to allocate more buffer space in the middle of
  a session but is unable to do so.

  Additionally, senders have the following restrictions:

  o  MUST NOT bundle more codec data frames in a single RTP packet than
     indicated by maxptime (see Section 12) if it is signaled.

  o  SHOULD NOT bundle more codec data frames in a single RTP packet
     than will fit in the MTU of the underlying network.

  o  Once beginning a session with a given maximum interleaving value
     set by maxinterleave in Section 12, MUST NOT increase the
     interleaving value (LLL) to exceed the maximum interleaving value
     that is signaled.

  o  MAY change the interleaving value, but MUST do so only between
     interleave groups.

  o  Silence suppression MUST only be used between interleave groups.
     A ToC with Frame Type 0 (Blank Frame, Section 5.1) MUST be used
     within interleaving groups if the codec outputs a blank frame.
     The M bit in the RTP header is not set for these blank frames, as
     the stream is continuous in time.  Because there is only one time
     stamp for each RTP packet, silence suppression used within an
     interleave group would cause ambiguities when reconstructing the
     speech at the receiver side, and thus is prohibited.

  Given an RTP packet with sequence number S, interleave length (field
  LLL) L, interleave index value (field NNN) N, and bundling value B,
  the interleave group consists of this RTP packet and other RTP
  packets with sequence numbers from S-N mod 65536 to S-N+L mod 65536
  inclusive.  In other words, the interleave group always consists of
  L+1 RTP packets with sequential sequence numbers.  The bundling value
  for all RTP packets in an interleave group MUST be the same.
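
  As a small worked example of the sequence number arithmetic above,
  the following Python sketch lists the sequence numbers that make up
  the interleave group of a received packet.  The function name
  interleave_group_seqs is hypothetical.

```python
def interleave_group_seqs(S, N, L):
    """Sequence numbers of the L + 1 packets in the group containing S."""
    first = (S - N) % 65536  # RTP sequence numbers wrap at 16 bits
    return [(first + i) % 65536 for i in range(L + 1)]
```

  For instance, a packet with S = 1, N = 2 and L = 3 belongs to the
  group whose sequence numbers are 65535, 0, 1 and 2.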

  The receiver determines the expected bundling value for all RTP
  packets in an interleave group by the number of codec data frames
  bundled in the first RTP packet of the interleave group received.
  Note that this may not be the first RTP packet of the interleave
  group if packets are delivered out of order by the underlying
  transport.

7. Bundling Codec Data Frames

  As discussed in Section 6, the bundling of codec data frames is a
  special reduced case of interleaving with LLL value in the Interleave
  Octet set to 0.

  Bundling codec data frames indicates that multiple data frames are
  included consecutively in a packet, because the interleaving length
  (LLL) is 0.  The interleaving group is thus reduced to a single RTP
  packet, and the reconstruction of the codec data frames from RTP
  packets becomes a much simpler process.

  Furthermore, the additional restrictions on senders are reduced to:

  o  MUST NOT bundle more codec data frames in a single RTP packet than
     indicated by maxptime (see Section 12) if it is signaled.

  o  SHOULD NOT bundle more codec data frames in a single RTP packet
     than will fit in the MTU of the underlying network.

8. Handling Missing Codec Data Frames

  The vocoders covered by this payload format support erasure frames as
  an indication when frames are not available.  The erasure frames are
  normally used internally by a receiver to advance the state of the
  voice decoder by exactly one frame time for each missing frame.
  Using the information from packet sequence number, time stamp, and
  the M bit, the receiver can detect missing codec data frames from RTP
  packet loss and/or silence suppression, and generate corresponding
  erasure frames.  Erasure frames MUST also be used in storage format
  to record missing frames.
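
  One possible way to count the frames missing between two packets
  is sketched below in Python.  The sketch rests on several
  assumptions that are not requirements of this specification: the
  codec is EVRC or SMV (20 ms frames at an 8000 Hz timestamp clock,
  hence 160 timestamp units per frame), the stream is bundled only
  (LLL = 0), and the two packets are presented in transmission order.
  The names SAMPLES_PER_FRAME and missing_frames are hypothetical.

```python
SAMPLES_PER_FRAME = 160  # 20 ms at the 8000 Hz RTP timestamp clock


def missing_frames(prev_ts, prev_frame_count, ts):
    """Frames missing between two consecutive bundled packets."""
    expected = (prev_ts + prev_frame_count * SAMPLES_PER_FRAME) % 2**32
    gap = (ts - expected) % 2**32  # RTP timestamps wrap at 32 bits
    return gap // SAMPLES_PER_FRAME
```

  The receiver would then pass that many erasure frames to the decoder
  before decoding the frames of the newly received packet.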

9. Implementation Issues

9.1. Interleaving Length

  The vocoder interpolates the missing speech content when given an
  erasure frame.  However, the best quality is perceived by the
  listener when erasure frames are not consecutive.  This makes
  interleaving desirable as it increases speech quality when packet
  loss occurs.

  On the other hand, interleaving can greatly increase the end-to-end
  delay.  Where an interactive session is desired, either
  Interleaved/Bundled packet format with interleaving length (field
  LLL) 0 or Header-Free packet format is RECOMMENDED.

  When end-to-end delay is not a primary concern, an interleaving
  length (field LLL) of 4 or 5 is RECOMMENDED as it offers a reasonable
  compromise between robustness and latency.

9.2. Validation of Received Packets

  When receiving an RTP packet, the receiver SHOULD check the validity
  of the ToC fields and match the length of the packet with what is
  indicated by the ToC fields.  If any invalidity or mismatch is
  detected, it is RECOMMENDED to discard the received packet to avoid
  potential severe degradation of the speech quality.  The discarded
  packet is treated following the same procedure as a lost packet, and
  the discarded data will be replaced with erasure frames.

  On receipt of an RTP packet with an invalid value of the LLL or NNN
  fields, the RTP packet SHOULD be treated as lost by the receiver for
  the purpose of generating erasure frames as described in Section 8.

  On receipt of an RTP packet in an interleave group with other than
  the expected frame count value, the receiver MAY discard codec data
  frames off the end of the RTP packet or add erasure codec data frames
  to the end of the packet in order to manufacture a substitute packet
  with the expected bundling value.  The receiver MAY instead choose to
  discard the whole interleave group.
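
  The first of these options, manufacturing a substitute packet with
  the expected bundling value, might look as follows in Python.  This
  is only a sketch: ERASURE and normalize_bundle are hypothetical
  names, and each frame is assumed to be represented as a
  (frame_type, data) pair.

```python
ERASURE = (5, b"")  # frame type 5 (Erasure) carries no data


def normalize_bundle(frames, expected):
    """Trim or pad a packet's frame list to the expected bundling value."""
    if len(frames) > expected:
        return frames[:expected]  # discard frames off the end
    return frames + [ERASURE] * (expected - len(frames))
```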

9.3. Processing the Late Packets

  Assume that the receiver has begun playing frames from an interleave
  group.  The time has come to play frame x from packet n of the
  interleave group.  Further assume that packet n of the interleave
  group has not been received.  As described in Section 8, an erasure
  frame will be sent to the receiving vocoder.

  Now, assume that packet n of the interleave group arrives before
  frame x+1 of that packet is needed.  Receivers should use frame x+1
  of the newly received packet n rather than substituting an erasure
  frame.  In other words, just because packet n was not available the
  first time it was needed to reconstruct the interleaved speech, the
  receiver should not assume it is not available when it is
  subsequently needed for interleaved speech reconstruction.

10.  Mode Request

  The Mode Request signal requests a particular encoding mode for the
  speech encoding in the reverse direction.  All implementations are
  RECOMMENDED to honor the Mode Request signal.  The Mode Request
  signal SHOULD only be used in one-to-one sessions.  In multi-party
  sessions, any received Mode Request signals SHOULD be ignored.

  In addition, the Mode Request signal MAY also be sent through non-RTP
  means, which is out of the scope of this specification.

  The three-bit Mode Request field is used to signal the receiver to
  set its audio encoder to a particular encoding mode.  If the Mode
  Request field is set to a valid value in RTP packets from node A to
  node B, it is a request for node B to change to the requested
  encoding mode for its audio encoder and therefore the bit rate of the
  RTP stream from node B to node A.  Once a node sets this field to a
  value, it SHOULD continue to set the field to the same value in
  subsequent packets until the requested mode is different.  This
  design helps to eliminate the scenario of getting the codec stuck in
  an unintended state if one of the packets that carries the Mode
  Request is lost.  An otherwise silent node MAY send an RTP packet
  containing a blank frame in order to send a Mode Request.
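
  A minimal sketch of the sender and receiver handling described in
  this section follows.  The names are hypothetical, the starting
  value of 0 is an assumption, and checking a value against the codec
  tables (which define which values are valid) is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModeRequestState:
    """Tracks the Mode Request value a node places in outgoing packets."""
    current: int = 0  # assumed starting value; meanings are codec-specific

    def request(self, mode: int) -> None:
        if not 0 <= mode <= 7:
            raise ValueError("Mode Request is a three-bit field (0..7)")
        self.current = mode

    def field_for_next_packet(self) -> int:
        # Repeat the same value in every packet so that a single lost
        # packet cannot leave the peer's encoder in an unintended mode.
        return self.current


def handle_received_mode_request(mode: int,
                                 one_to_one: bool) -> Optional[int]:
    """Return the mode to apply to the local encoder, or None to ignore."""
    if not one_to_one:
        return None  # multi-party sessions ignore the signal
    return mode
```

  Keeping the requested value as sender state, rather than sending it
  once, is what makes the signal robust to packet loss.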

  Each codec type using this format SHOULD define its own
  interpretation of the Mode Request field.  Codecs SHOULD follow the
  convention that higher values of the three-bit field correspond to an
  equal or lower average output bit rate.

  For the EVRC codec, the Mode Request field MUST be interpreted
  according to Tables 2.2.1.2-1 and 2.2.1.2-2 of the EVRC codec
  specifications [1].

  For the SMV codec, the Mode Request field MUST be interpreted
  according to Table 2.2-2 of the SMV codec specifications [2].

11.  Storage Format

  The storage format is used for storing speech frames, e.g., as a file
  or e-mail attachment.

  The file begins with a magic number to identify the vocoder that is
  used.  The magic number for EVRC corresponds to the ASCII character
  string "#!EVRC\n", i.e., "0x23 0x21 0x45 0x56 0x52 0x43 0x0A".  The
  magic number for SMV corresponds to the ASCII character string
  "#!SMV\n", i.e., "0x23 0x21 0x53 0x4D 0x56 0x0A".

  The codec data frames are stored in consecutive order, with a single
  ToC entry field, extended to one octet, prefixing each codec data
  frame.  The ToC field is extended to one octet by setting the four
  most significant bits of the octet to zero.  For example, a ToC value
  of 4 (a full-rate frame) is stored as 0x04.

  Speech frames lost in transmission and non-received frames MUST be
  stored as erasure frames (frame type 5, see definition in Section
  5.1) to maintain synchronization with the original media.
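
  The storage layout described above can be sketched as a small
  writer.  This is an illustration under stated assumptions: the
  function and constant names are hypothetical, each frame is supplied
  as a (ToC value, codec data octets) pair that is already octet
  aligned, None marks a frame that was lost or not received, and an
  erasure frame is assumed to carry no codec data.

```python
from typing import Iterable, Optional, Tuple

EVRC_MAGIC = b"#!EVRC\n"
SMV_MAGIC = b"#!SMV\n"
TOC_ERASURE = 5  # frame type 5, as stated in the text above

Frame = Tuple[int, bytes]  # (ToC value, codec data frame octets)


def encode_storage(magic: bytes,
                   frames: Iterable[Optional[Frame]]) -> bytes:
    """Serialize frames; None marks a lost or non-received frame."""
    out = bytearray(magic)
    for frame in frames:
        if frame is None:
            # Store an erasure so the file stays in sync with the
            # original media timeline.
            out.append(TOC_ERASURE)
            continue
        toc, data = frame
        if not 0 <= toc <= 0x0F:
            raise ValueError("ToC value must fit in four bits")
        out.append(toc)  # the four most significant bits are zero
        out += data
    return bytes(out)


# One full-rate frame (ToC 4) followed by one lost frame.  The 22-octet
# payload length is a placeholder chosen for this example.
blob = encode_storage(EVRC_MAGIC, [(4, bytes(22)), None])
```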



12.  IANA Considerations

  Four new MIME sub-types as described in this section have been
  registered by the IANA.

  The MIME names for the EVRC and SMV codecs are allocated from the
  IETF tree since all the vocoders covered are expected to be widely
  used for Voice-over-IP applications.

12.1.  Registration of Media Type EVRC

  Media Type Name:           audio

  Media Subtype Name:        EVRC

  Required Parameter:        none

  Optional parameters:
     The following parameters apply to RTP transfer only.

     ptime:    Defined as usual for RTP audio (see RFC 2327).

     maxptime: The maximum amount of media which can be encapsulated in
        each packet, expressed as time in milliseconds.  The time SHALL
        be calculated as the sum of the time the media present in the
        packet represents.  The time SHOULD be a multiple of the
        duration of a single codec data frame (20 msec).  If not
        signaled, the default maxptime value SHALL be 200 milliseconds.

     maxinterleave: Maximum value of the interleaving length (field
        LLL in the Interleaving Octet).  The interleaving lengths used
        in the entire session MUST NOT exceed this maximum value.  If
        not signaled, the maxinterleave length SHALL be 5.

  Encoding considerations:
     This type is defined for transfer of EVRC-encoded data via RTP
     using the Interleaved/Bundled packet format specified in Sections
     4.1, 6, and 7 of RFC 3558.  It is also defined for other transfer
     methods using the storage format specified in Section 11 of RFC
     3558.

  Security considerations:
     See Section 14 "Security Considerations" of RFC 3558.

  Public specification:
     The EVRC vocoder is specified in 3GPP2 C.S0014.  Transfer methods
     are specified in RFC 3558.

  Additional information:
     The following information applies for storage format only.

     Magic number: #!EVRC\n (see Section 11 of RFC 3558)
     File extensions: evc, EVC
     Macintosh file type code: none
     Object identifier or OID: none

  Intended usage:
     COMMON.  It is expected that many VoIP applications (as well as
     mobile applications) will use this type.

  Person & email address to contact for further information:
     Adam Li
     [email protected]

  Author/Change controller:
     Adam Li
     [email protected]
     IETF Audio/Video Transport Working Group

12.2. Registration of Media Type EVRC0

  Media Type Name:           audio

  Media Subtype Name:        EVRC0

  Required Parameter:        none

  Optional parameters:       none

  Encoding considerations:
     This type is only defined for transfer of EVRC-encoded data via
     RTP using the Header-Free packet format specified in Section 4.2
     of RFC 3558.

  Security considerations:
     See Section 14 "Security Considerations" of RFC 3558.

  Public specification:
     The EVRC vocoder is specified in 3GPP2 C.S0014.  Transfer methods
     are specified in RFC 3558.

  Additional information:    none

  Intended usage:
     COMMON.  It is expected that many VoIP applications (as well as
     mobile applications) will use this type.



  Person & email address to contact for further information:
     Adam Li
     [email protected]

  Author/Change controller:
     Adam Li
     [email protected]
     IETF Audio/Video Transport Working Group

12.3. Registration of Media Type SMV

  Media Type Name:           audio

  Media Subtype Name:        SMV

  Required Parameter:        none

  Optional parameters:
     The following parameters apply to RTP transfer only.

     ptime:    Defined as usual for RTP audio (see RFC 2327).

     maxptime: The maximum amount of media which can be encapsulated
        in each packet, expressed as time in milliseconds.  The time
        SHALL be calculated as the sum of the time the media present
        in the packet represents.  The time SHOULD be a multiple of the
        duration of a single codec data frame (20 msec).  If not
        signaled, the default maxptime value SHALL be 200
        milliseconds.

     maxinterleave: Maximum value of the interleaving length (field
        LLL in the Interleaving Octet).  The interleaving lengths used
        in the entire session MUST NOT exceed this maximum value.  If
        not signaled, the maxinterleave length SHALL be 5.

  Encoding considerations:
     This type is defined for transfer of SMV-encoded data via RTP
     using the Interleaved/Bundled packet format specified in Sections
     4.1, 6, and 7 of RFC 3558.  It is also defined for other transfer
     methods using the storage format specified in Section 11 of RFC
     3558.

  Security considerations:
     See Section 14 "Security Considerations" of RFC 3558.

  Public specification:
     The SMV vocoder is specified in 3GPP2 C.S0030-0 v2.0.
     Transfer methods are specified in RFC 3558.



  Additional information:
     The following information applies to storage format only.

     Magic number: #!SMV\n (see Section 11 of RFC 3558)
     File extensions: smv, SMV
     Macintosh file type code: none
     Object identifier or OID: none

  Intended usage:
     COMMON.  It is expected that many VoIP applications (as well as
     mobile applications) will use this type.

  Person & email address to contact for further information:
     Adam Li
     [email protected]

  Author/Change controller:
     Adam Li
     [email protected]
     IETF Audio/Video Transport Working Group

12.4. Registration of Media Type SMV0

  Media Type Name:           audio

  Media Subtype Name:        SMV0

  Required Parameter:        none

  Optional parameters:       none

  Encoding considerations:
     This type is only defined for transfer of SMV-encoded data via RTP
     using the Header-Free packet format specified in Section 4.2 of
     RFC 3558.

  Security considerations:
     See Section 14 "Security Considerations" of RFC 3558.

  Public specification:
     The SMV vocoder is specified in 3GPP2 C.S0030-0 v2.0.  Transfer
     methods are specified in RFC 3558.

  Additional information:    none

  Intended usage:
     COMMON.  It is expected that many VoIP applications (as well as
     mobile applications) will use this type.



  Person & email address to contact for further information:
     Adam Li
     [email protected]

  Author/Change controller:
     Adam Li
     [email protected]
     IETF Audio/Video Transport Working Group

13.  Mapping to SDP Parameters

  Please note that this section applies to the RTP transfer only.

  The information carried in the MIME media type specification has a
  specific mapping to fields in the Session Description Protocol (SDP)
  [6], which is commonly used to describe RTP sessions.  When SDP is
  used to specify sessions employing the EVRC or SMV codec, the mapping
  is as follows:

     o  The MIME type ("audio") goes in SDP "m=" as the media name.

     o  The MIME subtype (payload format name) goes in SDP "a=rtpmap"
        as the encoding name.

     o  The parameters "ptime" and "maxptime" go in the SDP "a=ptime"
        and "a=maxptime" attributes, respectively.

     o  The parameter "maxinterleave" goes in the SDP "a=fmtp"
        attribute by copying it directly from the MIME media type
        string as "maxinterleave=value".
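
  The mapping listed above can be sketched as a small helper that
  turns media type parameters into SDP lines.  The function name and
  its arguments are hypothetical, and the default clock rate of 8000
  is an assumption taken from the rtpmap lines used in this section's
  examples.

```python
from typing import List, Optional


def sdp_media_lines(port: int, payload_type: int, subtype: str,
                    clock_rate: int = 8000,
                    ptime: Optional[int] = None,
                    maxptime: Optional[int] = None,
                    maxinterleave: Optional[int] = None) -> List[str]:
    """Build the SDP lines for one audio stream."""
    lines = [
        # MIME type "audio" becomes the media name in "m=".
        f"m=audio {port} RTP/AVP {payload_type}",
        # MIME subtype becomes the encoding name in "a=rtpmap".
        f"a=rtpmap:{payload_type} {subtype}/{clock_rate}",
    ]
    if maxinterleave is not None:
        # Copied directly as "maxinterleave=value" into "a=fmtp".
        lines.append(f"a=fmtp:{payload_type} maxinterleave={maxinterleave}")
    if ptime is not None:
        lines.append(f"a=ptime:{ptime}")
    if maxptime is not None:
        lines.append(f"a=maxptime:{maxptime}")
    return lines


evrc = sdp_media_lines(49120, 97, "EVRC", maxinterleave=2, maxptime=80)
```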

  Some examples of SDP session descriptions for EVRC and SMV encodings
  follow below.

  Example of usage of EVRC:

     m=audio 49120 RTP/AVP 97
     a=rtpmap:97 EVRC/8000
     a=fmtp:97 maxinterleave=2
     a=maxptime:80

  Example of usage of SMV:

     m=audio 49122 RTP/AVP 99
     a=rtpmap:99 SMV0/8000
     a=fmtp:99

  Note that the payload format (encoding) names are commonly shown in
  upper case.  MIME subtypes are commonly shown in lower case.  These
  names are case-insensitive in both places.  Similarly, parameter
  names are case-insensitive both in MIME types and in the default
  mapping to the SDP a=fmtp attribute.
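
  On the receiving side, the same attributes can be read back with the
  defaults given in the EVRC and SMV registrations (maxptime of 200
  milliseconds, maxinterleave of 5, 20 msec per codec data frame).
  The sketch below is illustrative only: the function name is
  hypothetical, it handles a single media description, and it does no
  validation beyond integer parsing.

```python
from typing import Dict

FRAME_MS = 20               # duration of a single codec data frame
DEFAULT_MAXPTIME_MS = 200   # default when maxptime is not signaled
DEFAULT_MAXINTERLEAVE = 5   # default when maxinterleave is not signaled


def session_limits(sdp: str, payload_type: int) -> Dict[str, int]:
    """Extract maxptime and maxinterleave, applying the defaults."""
    maxptime = DEFAULT_MAXPTIME_MS
    maxinterleave = DEFAULT_MAXINTERLEAVE
    for raw in sdp.splitlines():
        line = raw.strip()
        lower = line.lower()
        if lower.startswith("a=maxptime:"):
            maxptime = int(line.split(":", 1)[1])
        elif lower.startswith("a=fmtp:"):
            fields = line.split(":", 1)[1].replace(";", " ").split()
            if not fields or fields[0] != str(payload_type):
                continue  # fmtp line for a different payload type
            for item in fields[1:]:
                name, _, value = item.partition("=")
                # Parameter names are case-insensitive.
                if name.lower() == "maxinterleave" and value:
                    maxinterleave = int(value)
    return {
        "maxptime": maxptime,
        "maxinterleave": maxinterleave,
        "max_frames_per_packet": maxptime // FRAME_MS,
    }


limits = session_limits(
    "m=audio 49120 RTP/AVP 97\n"
    "a=rtpmap:97 EVRC/8000\n"
    "a=fmtp:97 maxinterleave=2\n"
    "a=maxptime:80\n",
    payload_type=97,
)
```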

14.  Security Considerations

  RTP packets using the payload format defined in this specification
  are subject to the security considerations discussed in the RTP
  specification [4], and any appropriate profile (for example [5]).
  This implies that confidentiality of the media streams is achieved by
  encryption.  Because the data compression used with this payload
  format is applied end-to-end, encryption may be performed after
  compression so there is no conflict between the two operations.

  A potential denial-of-service threat exists for data encoding using
  compression techniques that have non-uniform receiver-end
  computational load.  The attacker can inject pathological datagrams
  into the stream which are complex to decode and cause the receiver to
  become overloaded.  However, the encodings covered in this document
  do not exhibit any significant non-uniformity.

  As with any IP-based protocol, in some circumstances, a receiver may
  be overloaded simply by the receipt of too many packets, either
  desired or undesired.  Network-layer authentication may be used to
  discard packets from undesired sources, but the processing cost of
  the authentication itself may be too high.  In a multicast
  environment, pruning of specific sources may be implemented in future
  versions of IGMP [7] and in multicast routing protocols to allow a
  receiver to select which sources are allowed to reach it.

  Interleaving may affect encryption.  Depending on the encryption
  scheme used, there may be restrictions on, for example, the time when
  keys can be changed.  Specifically, the key change may need to occur at
  the boundary between interleave groups.

15.  Adding Support of Other Frame-Based Vocoders

  As described above, the RTP packet format defined in this document is
  very flexible and designed to be usable by other frame-based
  vocoders.

  Additional vocoders using this format MUST have properties as
  described in Section 3.3.

  For an eligible vocoder to use the payload format mechanisms defined
  in this document, a new RTP payload format document needs to be
  published as a standards track RFC.  That document can simply refer
  to this document and then specify the following parameters:

     o Define the unit used for the RTP timestamp;
     o Define the meaning of the Mode Request bits;
     o Define corresponding codec data frame type values for ToC;
     o Define the conversion procedure for the vocoder's output data
       frames;
     o Define a magic number for storage format, and complete the
       corresponding MIME registration.

16.  Acknowledgements

  The following authors have made significant contributions to this
  document: Adam H. Li, John D. Villasenor, Dong-Seek Park, Jeong-Hoon
  Park, Keith Miller, S. Craig Greer, David Leon, Nikolai Leung,
  Marcello Lioy, Kyle J. McKay, Magdalena L. Espelien, Randall Gellens,
  Tom Hiller, Peter J. McCann, Stinson S. Mathai, Michael D. Turner,
  Ajay Rajkumar, Dan Gal, Magnus Westerlund, Lars-Erik Jonsson, Greg
  Sherwood, and Thomas Zeng.

17.  References

17.1 Normative

  [1]  3GPP2 C.S0014, "Enhanced Variable Rate Codec, Speech Service
       Option 3 for Wideband Spread Spectrum Digital Systems", January
       1997.

  [2]  3GPP2 C.S0030-0 v2.0, "Selectable Mode Vocoder, Service Option
       for Wideband Spread Spectrum Communication Systems", May 2002.

  [3]  Bradner, S., "Key words for use in RFCs to Indicate Requirement
       Levels", BCP 14, RFC 2119, March 1997.

  [4]  Schulzrinne, H., Casner, S., Jacobson, V. and R. Frederick,
       "RTP: A Transport Protocol for Real-Time Applications", RFC
       3550, July 2003.

  [5]  Schulzrinne, H. and S. Casner, "RTP Profile for Audio and Video
       Conferences with Minimal Control", RFC 3551, July 2003.

  [6]  Handley, M. and V. Jacobson, "SDP: Session Description
       Protocol", RFC 2327, April 1998.

17.2 Informative

  [7]  Deering, S., "Host Extensions for IP Multicasting", STD 5, RFC
       1112, August 1989.

18.  Author's Address

  Adam H. Li
  Image Communication Lab
  Electrical Engineering Department
  University of California
  Los Angeles, CA 90095
  USA

  Phone: +1 310 825 5178
  EMail: [email protected]

19.  Full Copyright Statement

  Copyright (C) The Internet Society (2003).  All Rights Reserved.

  This document and translations of it may be copied and furnished to
  others, and derivative works that comment on or otherwise explain it
  or assist in its implementation may be prepared, copied, published
  and distributed, in whole or in part, without restriction of any
  kind, provided that the above copyright notice and this paragraph are
  included on all such copies and derivative works.  However, this
  document itself may not be modified in any way, such as by removing
  the copyright notice or references to the Internet Society or other
  Internet organizations, except as needed for the purpose of
  developing Internet standards in which case the procedures for
  copyrights defined in the Internet Standards process must be
  followed, or as required to translate it into languages other than
  English.

  The limited permissions granted above are perpetual and will not be
  revoked by the Internet Society or its successors or assigns.

  This document and the information contained herein is provided on an
  "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
  TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
  BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
  HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
  MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgement

  Funding for the RFC Editor function is currently provided by the
  Internet Society.