Network Working Group                                        T. Turletti
Request for Comments: 2032                                           MIT
Category: Standards Track                                     C. Huitema
                                                               Bellcore
                                                           October 1996


              RTP Payload Format for H.261 Video Streams

Status of this Memo

  This document specifies an Internet standards track protocol for the
  Internet community, and requests discussion and suggestions for
  improvements.  Please refer to the current edition of the "Internet
  Official Protocol Standards" (STD 1) for the standardization state
  and status of this protocol.  Distribution of this memo is unlimited.

Table of Contents

  1. Abstract .............................................    1
  2. Purpose of this document .............................    2
  3. Structure of the packet stream .......................    2
  3.1 Overview of the ITU-T recommendation H.261 ..........    2
  3.2 Considerations for packetization ....................    3
  4. Specification of the packetization scheme ............    4
  4.1 Usage of RTP ........................................    4
  4.2 Recommendations for operation with hardware codecs ..    6
  5. Packet loss issues ...................................    7
  5.1 Use of optional H.261-specific control packets ......    8
  5.2 H.261 control packets definition ....................    9
  5.2.1 Full INTRA-frame Request (FIR) packet .............    9
  5.2.2 Negative ACKnowledgements (NACK) packet ...........    9
  6. Security Considerations ..............................   10
   Authors' Addresses .....................................   10
   Acknowledgements .......................................   10
   References .............................................   11

1.  Abstract

  This memo describes a scheme to packetize an H.261 video stream for
  transport using the Real-time Transport Protocol, RTP, with any of
  the underlying protocols that carry RTP.

  This specification is a product of the Audio/Video Transport working
  group within the Internet Engineering Task Force.  Comments are
  solicited and should be addressed to the working group's mailing list
  at [email protected] and/or the authors.




Turletti & Huitema          Standards Track                     [Page 1]

RFC 2032           RTP Payload Format for H.261 Video       October 1996


2.  Purpose of this document

  The ITU-T recommendation H.261 [6] specifies the encodings used by
  ITU-T compliant video-conference codecs. Although these encodings
  were originally specified for fixed data rate ISDN circuits,
  experiments [3],[8] have shown that they can also be used over
  packet-switched networks such as the Internet.

  The purpose of this memo is to specify the RTP payload format for
  encapsulating H.261 video streams in RTP [1].

3.  Structure of the packet stream

3.1.  Overview of the ITU-T recommendation H.261

  The H.261 coding is organized as a hierarchy of groupings.  The video
  stream is composed of a sequence of images, or frames, which are
  themselves organized as a set of Groups of Blocks (GOB). Note that
  H.261 "pictures" are referred as "frames" in this document.  Each GOB
  holds a set of 3 lines of 11 macro blocks (MB). Each MB carries
  information on a group of 16x16 pixels: luminance information is
  specified for 4 blocks of 8x8 pixels, while chrominance information
  is given by two "red" and "blue" color difference components at a
  resolution of only 8x8 pixels.  These components and the codes
  representing their sampled values are as defined in the ITU-R
  Recommendation 601 [7].

  This grouping is used to specify information at each level of the
  hierarchy:

  -    At the frame level, one specifies information such as the
       delay from the previous frame, the image format, and
       various indicators.

  -    At the GOB level, one specifies the GOB number and the
       default quantifier that will be used for the MBs.

  -    At the MB level, one specifies which blocks are present
       and which did not change, and optionally a quantifier and
       motion vectors.

  Blocks which have changed are encoded by computing the discrete
  cosine transform (DCT) of their coefficients, which are then
  quantized and Huffman encoded (Variable Length Codes).

  The H.261 Huffman encoding includes a special "GOB start" pattern,
  composed of 15 zeroes followed by a single 1, that cannot be imitated
  by any other code words. This pattern is included at the beginning of



Turletti & Huitema          Standards Track                     [Page 2]

RFC 2032           RTP Payload Format for H.261 Video       October 1996


  each GOB header (and also at the beginning of each frame header) to
  mark the separation between two GOBs, and is in fact used as an
  indicator that the current GOB is terminated. The encoding also
  includes a stuffing pattern, composed of seven zeroes followed by
  four ones; that stuffing pattern can only be entered between the
  encoding of MBs, or just before the GOB separator.

3.2.  Considerations for packetization

  H.261 codecs designed for operation over ISDN circuits produce a bit
  stream composed of several levels of encoding specified by H.261 and
  companion recommendations.  The bits resulting from the Huffman
  encoding are arranged in 512-bit frames, containing 2 bits of
  synchronization, 492 bits of data and 18 bits of error correcting
  code.  The 512-bit frames are then interlaced with an audio stream
  and transmitted over px64 kbps circuits according to specification
  H.221 [5].

  When transmitting over the Internet, we will directly consider the
  output of the Huffman encoding. All the bits produced by the Huffman
  encoding stage will be included in the packet. We will not carry the
  512-bit frames, as protection against bit errors can be obtained by
  other means. Similarly, we will not attempt to multiplex audio and
  video signals in the same packets, as UDP and RTP provide a much more
  efficient way to achieve multiplexing.

  Directly transmitting the result of the Huffman encoding over an
  unreliable stream of UDP datagrams would, however, have poor error
  resistance characteristics. The result of the hierachical structure
  of H.261 bit stream is that one needs to receive the information
  present in the frame header to decode the GOBs, as well as the
  information present in the GOB header to decode the MBs.  Without
  precautions, this would mean that one has to receive all the packets
  that carry an image in order to properly decode its components.

  If each image could be carried in a single packet, this requirement
  would not create a problem. However, a video image or even one GOB by
  itself can sometimes be too large to fit in a single packet.
  Therefore, the MB is taken as the unit of fragmentation.  Packets
  must start and end on a MB boundary, i.e. a MB cannot be split across
  multiple packets.  Multiple MBs may be carried in a single packet
  when they will fit within the maximal packet size allowed. This
  practice is recommended to reduce the packet send rate and packet
  overhead.

  To allow each packet to be processed independently for efficient
  resynchronization in the presence of packet losses, some state
  information from the frame header and GOB header is carried with each



Turletti & Huitema          Standards Track                     [Page 3]

RFC 2032           RTP Payload Format for H.261 Video       October 1996


  packet to allow the MBs in that packet to be decoded.  This state
  information includes the GOB number in effect at the start of the
  packet, the macroblock address predictor (i.e. the last MBA encoded
  in the previous packet), the quantizer value in effect prior to the
  start of this packet (GQUANT, MQUANT or zero in case of a beginning
  of GOB) and the reference motion vector data (MVD) for computing the
  true MVDs contained within this packet. The bit stream cannot be
  fragmented between a GOB header and MB 1 of that GOB.

  Moreover, since the compressed MB may not fill an integer number of
  octets, the data header contains two three-bit integers, SBIT and
  EBIT, to indicate the number of unused bits in the first and last
  octets of the H.261 data, respectively.

4.  Specification of the packetization scheme

4.1.  Usage of RTP

  The H.261 information is carried as payload data within the RTP
  protocol. The following fields of the RTP header are specified:

  -    The payload type should specify H.261 payload format (see
       the companion RTP profile document RFC 1890).

  -    The RTP timestamp encodes the sampling instant of the
       first video image contained in the RTP data packet. If a
       video image occupies more than one packet, the timestamp
       will be the same on all of those packets. Packets from
       different video images must have different timestamps so
       that frames may be distinguished by the timestamp. For
       H.261 video streams, the RTP timestamp is based on a
       90kHz clock. This clock rate is a multiple of the natural
       H.261 frame rate (i.e. 30000/1001 or approx. 29.97 Hz).
       That way, for each frame time, the clock is just
       incremented by the multiple and this removes inaccuracy
       in calculating the timestamp. Furthermore, the initial
       value of the timestamp is random (unpredictable) to make
       known-plaintext attacks on encryption more difficult, see
       RTP [1]. Note that if multiple frames are encoded in a
       packet (e.g. when there are very little changes between
       two images), it is necessary to calculate display times
       for the frames after the first using the timing
       information in the H.261 frame header. This is required
       because the RTP timestamp only gives the display time of
       the first frame in the packet.

  -    The marker bit of the RTP header is set to one in the
       last packet of a video frame, and otherwise, must be



Turletti & Huitema          Standards Track                     [Page 4]

RFC 2032           RTP Payload Format for H.261 Video       October 1996


       zero. Thus, it is not necessary to wait for a following
       packet (which contains the start code that terminates the
       current frame) to detect that a new frame should be
       displayed.

  The H.261 data will follow the RTP header, as in:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   .                                                               .
   .                          RTP header                           .
   .                                                               .
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          H.261  header                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          H.261 stream ...                     .
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

  The H.261 header is defined as following:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |SBIT |EBIT |I|V| GOBN  |   MBAP  |  QUANT  |  HMVD   |  VMVD   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

  The fields in the H.261 header have the following meanings:

  Start bit position (SBIT): 3 bits
    Number of most significant bits that should be ignored
    in the first data octet.

  End bit position (EBIT): 3 bits
    Number of least significant bits that should be ignored
    in the last data octet.

  INTRA-frame encoded data (I): 1 bit
    Set to 1 if this stream contains only INTRA-frame coded
    blocks. Set to 0 if this stream may or may not contain
    INTRA-frame coded blocks. The sense of this bit may not
    change during the course of the RTP session.

  Motion Vector flag (V): 1 bit
    Set to 0 if motion vectors are not used in this stream.
    Set to 1 if motion vectors may or may not be used in
    this stream. The sense of this bit may not change during
    the course of the session.



Turletti & Huitema          Standards Track                     [Page 5]

RFC 2032           RTP Payload Format for H.261 Video       October 1996


  GOB number (GOBN): 4 bits
    Encodes the GOB number in effect at the start of the
    packet. Set to 0 if the packet begins with a GOB header.

  Macroblock address predictor (MBAP): 5 bits
    Encodes the macroblock address predictor (i.e. the last
    MBA encoded in the previous packet). This predictor ranges
    from 0-32 (to predict the valid MBAs 1-33), but because
    the bit stream cannot be fragmented between a GOB header
    and MB 1, the predictor at the start of the packet can
    never be 0. Therefore, the range is 1-32, which is biased
    by -1 to fit in 5 bits. For example, if MBAP is 0, the
    value of the MBA predictor is 1. Set to 0 if the packet
    begins with a GOB header.

  Quantizer (QUANT): 5 bits
    Quantizer value (MQUANT or GQUANT) in effect prior to the
    start of this packet. Set to 0 if the packet begins with
    a GOB header.

  Horizontal motion vector data (HMVD): 5 bits
    Reference horizontal motion vector data (MVD). Set to 0
    if V flag is 0 or if the packet begins with a GOB header,
    or when the MTYPE of the last MB encoded in the previous
    packet was not MC. HMVD is encoded as a 2's complement
    number, and `10000' corresponding to the value -16 is
    forbidden (motion vector fields range from +/-15).

  Vertical motion vector data (VMVD): 5 bits
    Reference vertical motion vector data (MVD). Set to 0 if
    V flag is 0 or if the packet begins with a GOB header, or
    when the MTYPE of the last MB encoded in the previous
    packet was not MC. VMVD is encoded as a 2's complement
    number, and `10000' corresponding to the value -16 is
    forbidden (motion vector fields range from +/-15).

  Note that the I and V flags are hint flags, i.e. they can be inferred
  from the bit stream. They are included to allow decoders to make
  optimizations that would not be possible if these hints were not
  provided before bit stream was decoded.  Therefore, these bits cannot
  change for the duration of the stream. A conformant implementation
  can always set V=1 and I=0.

4.2.  Recommendations for operation with hardware codecs

  Packetizers for hardware codecs can trivially figure out GOB
  boundaries using the GOB-start pattern included in the H.261 data.
  (Note that software encoders already know the boundaries.) The



Turletti & Huitema          Standards Track                     [Page 6]

RFC 2032           RTP Payload Format for H.261 Video       October 1996


  cheapest packetization implementation is to packetize at the GOB
  level all the GOBs that fit in a packet.  But when a GOB is too
  large, the packetizer has to parse it to do MB fragmentation. (Note
  that only the Huffman encoding must be parsed and that it is not
  necessary to fully decompress the stream, so this requires relatively
  little processing; example implementations can be found in some
  public H.261 codecs such as IVS [4] and VIC [9].) It is recommended
  that MB level fragmentation be used when feasible in order to obtain
  more efficient packetization. Using this fragmentation scheme reduces
  the output packet rate and therefore reduces the overhead.

  At the receiver, the data stream can be depacketized and directed to
  a hardware codec's input.  If the hardware decoder operates at a
  fixed bit rate, synchronization may be maintained by inserting the
  stuffing pattern between MBs (i.e., between packets) when the packet
  arrival rate is slower than the bit rate.

5.  Packet loss issues

  On the Internet, most packet losses are due to network congestion
  rather than transmission errors. Using UDP, no mechanism is available
  at the sender to know if a packet has been successfully received. It
  is up to the application, i.e.  coder and decoder, to handle the
  packet loss. Each RTP packet includes a a sequence number field which
  can be used to detect packet loss.

  H.261 uses the temporal redundancy of video to perform compression.
  This differential coding (or INTER-frame coding) is sensitive to
  packet loss. After a packet loss, parts of the image may remain
  corrupt until all corresponding MBs have been encoded in INTRA-frame
  mode (i.e. encoded independently of past frames). There are several
  ways to mitigate packet loss:

  (1)  One way is to use only INTRA-frame encoding and MB level
       conditional replenishment. That is, only MBs that change
       (beyond some threshold) are transmitted.

  (2)  Another way is to adjust the INTRA-frame encoding
       refreshment rate according to the packet loss observed by
       the receivers. The H.261 recommendation specifies that a
       MB is INTRA-frame encoded at least every 132 times it is
       transmitted. However, the INTRA-frame refreshment rate
       can be raised in order to speed the recovery when the
       measured loss rate is significant.

  (3)  The fastest way to repair a corrupted image is to request
       an INTRA-frame coded image refreshment after a packet
       loss is detected. One means to accomplish this is for the



Turletti & Huitema          Standards Track                     [Page 7]

RFC 2032           RTP Payload Format for H.261 Video       October 1996


       decoder to send to the coder a list of packets lost. The
       coder can decide to encode every MB of every GOB of the
       following video frame in INTRA-frame mode (i.e. Full
       INTRA-frame encoded), or if the coder can deduce from the
       packet sequence numbers which MBs were affected by the
       loss, it can save bandwidth by sending only those MBs in
       INTRA-frame mode. This mode is particularly efficient in
       point-to-point connection or when the number of decoders
       is low.  The next section specifies how the refresh
       function may be implemented.

  Note that the method (1) is currently implemented in the VIC
  videoconferencing software [9]. Methods (2) and (3) are currently
  implemented in the IVS videoconferencing software [4].

5.1.  Use of optional H.261-specific control packets

  This specification defines two H.261-specific RTCP control packets,
  "Full INTRA-frame Request" and "Negative Acknowledgement", described
  in the next section.  Their purpose is to speed up refreshment of the
  video in those situations where their use is feasible.  Support of
  these H.261-specific control packets by the H.261 sender is optional;
  in particular, early experiments have shown that the usage of this
  feature could have very negative effects when the number of sites is
  very large. Thus, these control packets should be used with caution.

  The H.261-specific control packets differ from normal RTCP packets in
  that they are not transmitted to the normal RTCP destination
  transport address for the RTP session (which is often a multicast
  address).  Instead, these control packets are sent directly via
  unicast from the decoder to the coder.  The destination port for
  these control packets is the same port that the coder uses as a
  source port for transmitting RTP (data) packets.  Therefore, these
  packets may be considered "reverse" control packets.

  As a consequence, these control packets may only be used when no RTP
  mixers or translators intervene in the path from the coder to the
  decoder.  If such intermediate systems do intervene, the address of
  the coder would no longer be present as the network-level source
  address in packets received by the decoder, and in fact, it might not
  be possible for the decoder to send packets directly to the coder.

  Some reliable multicast protocols use similar NACK control packets
  transmitted over the normal multicast distribution channel, but they
  typically use random delays to prevent a NACK implosion problem [2].
  The goal of such protocols is to provide reliable multicast packet
  delivery at the expense of delay, which is appropriate for
  applications such as a shared whiteboard.



Turletti & Huitema          Standards Track                     [Page 8]

RFC 2032           RTP Payload Format for H.261 Video       October 1996


  On the other hand, interactive video transmission is more sensitive
  to delay and does not require full reliability.  For video
  applications it is more effective to send the NACK control packets as
  soon as possible, i.e. as soon as a loss is detected, without adding
  any random delays. In this case, multicasting the NACK control
  packets would generate useless traffic between receivers since only
  the coder will use them.  But this method is only effective when the
  number of receivers is small. e.g. in IVS [4] the H.261 specific
  control packets are used only in point-to-point connections or in
  point-to-multipoint connections when there are less than 10
  participants in the conference.

5.2.  H.261 control packets definition

5.2.1.  Full INTRA-frame Request (FIR) packet

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |V=2|P|   MBZ   |  PT=RTCP_FIR  |           length              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                              SSRC                             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

  This packet indicates that a receiver requires a full encoded image
  in order to either start decoding with an entire image or to refresh
  its image and speed the recovery after a burst of lost packets. The
  receiver requests the source to force the next image in full "INTRA-
  frame" coding mode, i.e. without using differential coding. The
  various fields are defined in the RTP specification [1]. SSRC is the
  synchronization source identifier for the sender of this packet. The
  value of the packet type (PT) identifier is the constant RTCP_FIR
  (192).

5.2.2.  Negative ACKnowledgements (NACK) packet

  The format of the NACK packet is as follow:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |V=2|P|   MBZ   | PT=RTCP_NACK  |           length              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                              SSRC                             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |              FSN              |              BLP              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+




Turletti & Huitema          Standards Track                     [Page 9]

RFC 2032           RTP Payload Format for H.261 Video       October 1996


  The various fields T, P, PT, length and SSRC are defined in the RTP
  specification [1]. The value of the packet type (PT) identifier is
  the constant RTCP_NACK (193). SSRC is the synchronization source
  identifier for the sender of this packet.

  The two remaining fields have the following meanings:

  First Sequence Number (FSN): 16 bits
    Identifies the first sequence number lost.

  Bitmask of following lost packets (BLP): 16 bits
    A bit is set to 1 if the corresponding packet has been lost,
    and set to 0 otherwise. BLP is set to 0 only if no packet
    other than that being NACKed (using the FSN field) has been
    lost. BLP is set to 0x00001 if the packet corresponding to
    the FSN and the following packet have been lost, etc.

6.  Security Considerations

  Security issues are not discussed in this memo.

Authors' Addresses

  Thierry Turletti
  INRIA - RODEO Project
  2004 route des Lucioles
  BP 93, 06902 Sophia Antipolis
  FRANCE

  EMail: [email protected]


  Christian Huitema
  MCC 1J236B Bellcore
  445 South Street
  Morristown, NJ 07960-6438

  EMail: [email protected]

Acknowledgements

  This memo is based on discussion within the AVT working group chaired
  by Stephen Casner. Steve McCanne, Stephen Casner, Ronan Flood, Mark
  Handley, Van Jacobson, Henning G.  Schulzrinne and John Wroclawski
  provided valuable comments.  Stephen Casner and Steve McCanne also
  helped greatly with getting this document into readable form.





Turletti & Huitema          Standards Track                    [Page 10]

RFC 2032           RTP Payload Format for H.261 Video       October 1996


References

  [1]  Schulzrinne, H., Casner, S., Frederick, R., and
       V. Jacobson, "RTP: A Transport Protocol for Real-Time
       Applications", RFC 1889, January 1996.

  [2]  Sridhar Pingali, Don Towsley and James F. Kurose, A
       comparison of sender-initiated and receiver-initiated
       reliable multicast protocols, IEEE GLOBECOM '94.

  [3]  Thierry Turletti, H.261 software codec for
       videoconferencing over the Internet INRIA Research Report
       no 1834, January 1993.

  [4]  Thierry Turletti, INRIA Videoconferencing tool (IVS),
       available by anonymous ftp from zenon.inria.fr in the
       "rodeo/ivs/last_version" directory. See also URL
       <http://www.inria.fr/rodeo/ivs.html>.

  [5]  Frame structure for Audiovisual Services for a 64 to 1920
       kbps Channel in Audiovisual Services ITU-T (International
       Telecommunication Union - Telecommunication
       Standardisation Sector) Recommendation H.221, 1990.

  [6]  Video codec for audiovisual services at p x 64 kbit/s
       ITU-T (International Telecommunication Union -
       Telecommunication Standardisation Sector) Recommendation
       H.261, 1993.

  [7]  Digital Methods of Transmitting Television Information
       ITU-R (International Telecommunication Union -
       Radiocommunication Standardisation Sector) Recommendation
       601, 1986.

  [8]  M.A Sasse, U. Bilting, C-D Schulz, T. Turletti, Remote
       Seminars through MultiMedia Conferencing: Experiences
       from the MICE project, Proc. INET'94/JENC5, Prague, June
       1994, pp. 251/1-251/8.

  [9]  Steve MacCanne, Van Jacobson, VIC Videoconferencing tool,
       available by anonymous ftp from ee.lbl.gov in the
       "conferencing/vic" directory.









Turletti & Huitema          Standards Track                    [Page 11]