
             Unexpected Places You Can And Can’t Use Null Bytes

  Feb 1, 2017

  The traditional way of representing strings in C is using
  [3]null-terminated character arrays. Common C library functions like
  strcpy, printf, etc. detect how long strings are by sequentially
  scanning memory until a null byte is found. This complicates situations
  where the string data itself should contain literal null bytes.
  Generally this doesn’t mean that it’s impossible to use strings with
  embedded null characters, just that you have to be more careful when
  doing so. Typically this is done by using functions that explicitly
  take the length of the data.

  For instance, one may run into problems using [4]printf(3) to write a
  string containing null bytes to stdout; but the limitation can be
  worked around using [5]fwrite(3) which accepts a parameter describing
  how large the data buffer is:
// INCORRECT: won't work as expected if str contains null bytes
printf("%s", str);

// OK: no problems with embedded null bytes here
fwrite(str, sizeof(char), nbytes, stdout);
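
  To make the difference concrete, here is a minimal sketch. The helper
  names write_bytes and visible_length are mine, made up for
  illustration; only fwrite and strlen are real library calls.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write exactly nbytes from buf, embedded null
 * bytes included. Returns the number of bytes actually written. */
size_t write_bytes(const char *buf, size_t nbytes, FILE *out) {
    assert(buf != NULL && out != NULL);
    return fwrite(buf, sizeof(char), nbytes, out);
}

/* What a "%s"-style scan sees: everything before the first null byte. */
size_t visible_length(const char *buf) {
    assert(buf != NULL);
    return strlen(buf);
}
```

  For the seven bytes foo\0bar, visible_length reports 3, while
  write_bytes(buf, 7, stdout) emits all seven.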

  However, there are a few places where you don’t have this option, and
  embedded null bytes just won’t work. In these situations you really,
  truly can’t use strings that contain null bytes. In this article I’m
  going to give a few examples of places where you cannot use null bytes;
  and one surprising place where you can.

Filesystem Paths

  On Unix, a path is defined to be a null-terminated C string. This means
  you can’t have a file whose name is foo\0bar or any other string
  containing null bytes.

  To open a file on Unix you use the open(2) syscall. That function has
  the following signature:
int open(const char *pathname, int flags, mode_t mode);

  As you can see, there’s no parameter describing the length of the
  path—the kernel treats the path parameter as a regular null-terminated
  C string. If you were to supply foo\0bar as the path, there’s no way
  the kernel would be able to disambiguate this from the string foo. You
  can confirm this by looking at [6]fs/open.c in the Linux kernel, which
  is the file that defines open(2) and most of the other file-oriented
  system calls. Look for the line that starts with SYSCALL_DEFINE3(open,
  and you’ll see there’s no trickery involved here. Again, there are
  quite a few other system calls operating on file names defined in this
  file, and all of them define paths using const char * parameters.
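
  If you’d rather see this from userspace than read kernel source,
  something like the following sketch should do. The helper
  open_creates is hypothetical, and it assumes you pass it a path in a
  directory you can write to.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: create a file with open(2) using `path`, then
 * check whether a file named `expected` now exists. Whatever open(2)
 * created is removed again before returning.
 * Returns 1 if `expected` exists, 0 if not, -1 if open(2) failed. */
int open_creates(const char *path, const char *expected) {
    assert(path != NULL && expected != NULL);
    int fd = open(path, O_CREAT | O_WRONLY, 0600);
    if (fd < 0)
        return -1;
    close(fd);

    struct stat st;
    int found = (stat(expected, &st) == 0);
    unlink(path);   /* the kernel sees the same truncated name here */
    return found;
}
```

  Calling open_creates("/tmp/demo_foo\0bar", "/tmp/demo_foo") should
  return 1: the file that gets created is demo_foo, because the kernel
  never looks past the first null byte.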

  While we’re on the topic, it’s worth noting that the only other
  restriction on filenames is that they cannot contain a /, which is
  the character used to separate directories. Filenames can contain
  arbitrary other binary data, including spaces and newlines, and there’s
  no defined character encoding.
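
  Those two rules are simple enough to write down as a check. This is
  just a sketch: is_valid_filename is a made-up name, it takes an
  explicit length so that it can actually see embedded null bytes, and
  I’m assuming an empty name should be rejected. It ignores length
  limits like NAME_MAX and the special names . and .. entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if the len bytes at name could be a single
 * filename under the two rules above (no null byte, no slash). */
bool is_valid_filename(const char *name, size_t len) {
    assert(name != NULL);
    if (len == 0)
        return false;               /* assumption: empty names are rejected */
    for (size_t i = 0; i < len; i++) {
        if (name[i] == '\0' || name[i] == '/')
            return false;
    }
    return true;
}
```

  With this, "a b\nc" passes (spaces and newlines are fine), while
  "foo\0bar" and "foo/bar" are both rejected.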

Command Arguments

  C programs define an entry point called main() with the following
  function prototype:
int main(int argc, char **argv);

  The parameter argc contains the number of arguments, and argv is an
  array of null-terminated strings representing the command line
  arguments for the program (with a final null element). Suppose you
  were to try to encode a null byte in one of the argv parameters. The
  issue is that there is no parameter specifying the lengths of the
  strings in argv. Therefore there’s no way for the invoked program to
  know if the argv parameters have embedded null bytes or not.

  You might wonder if this is just a limitation of the C API. For
  instance, what if you write a program in assembly? Is there another way
  to access the argument parameters and get their size?

  Actually, the answer is no. Here’s one way to think about it. The
  prototype for execve(2) is like this:
int execve(const char *filename, char *const argv[], char *const envp[]);

  The parameters to execve need to be passed through to the new process.
  To do this, the kernel must copy this data into the memory of the new
  process. Since the kernel is taking regular char * types (or in this
  case, arrays of them), it has to assume they’re null-delimited when
  copying them.
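
  Here’s a rough userspace stand-in for that copying step. To be
  clear, this is not kernel code, and copy_args is a name I made up; it
  only illustrates that the sole length information available is the
  distance to the first null byte.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative stand-in for the kernel's copy step (not kernel code):
 * copy each argv string into dst, terminator included, stopping at the
 * NULL sentinel. Returns the number of bytes copied, or 0 if dst is
 * too small. */
size_t copy_args(char *dst, size_t dst_size, char *const argv[]) {
    assert(dst != NULL && argv != NULL);
    size_t used = 0;
    for (size_t i = 0; argv[i] != NULL; i++) {
        /* All we can do is scan for the first null byte. */
        size_t n = strlen(argv[i]) + 1;
        if (n > dst_size - used)
            return 0;
        memcpy(dst + used, argv[i], n);
        used += n;
    }
    return used;
}
```

  If one of the arguments is backed by the bytes foo\0bar, only foo and
  its terminator get copied; the bar part never reaches the new
  process.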

  If you have a program that needs to be able to work with embedded null
  bytes for parameters, you should have a way to specify such parameters
  either on stdin, or via a file; or preferably, both.
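
  One way to do that is to slurp the whole stream and carry the byte
  count around separately. A minimal sketch, where read_all is a
  hypothetical helper and the initial 4096-byte capacity is arbitrary:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read every remaining byte from `in` into a
 * freshly allocated buffer. The byte count goes to *len, so embedded
 * null bytes survive. Returns NULL on allocation or read failure; the
 * caller frees the result. */
char *read_all(FILE *in, size_t *len) {
    assert(in != NULL && len != NULL);
    size_t cap = 4096, used = 0;
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;
    for (;;) {
        used += fread(buf + used, 1, cap - used, in);
        if (used < cap)
            break;                  /* short read: EOF or error */
        char *grown = realloc(buf, cap * 2);
        if (grown == NULL) {
            free(buf);
            return NULL;
        }
        buf = grown;
        cap *= 2;
    }
    if (ferror(in)) {
        free(buf);
        return NULL;
    }
    *len = used;
    return buf;
}
```

  The same function works for stdin and for a file opened with fopen,
  which makes it easy to support both.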

Environment Variables

  If you were paying close attention above, you’ll recall that execve(2)
  actually takes three parameters. The third parameter is a list of
  environment variables for the new process. In fact, most C compilers on
  Unix systems will allow you to define main() with the prototype:
int main(int argc, char **argv, char **envp);

  Under the hood, library calls like [7]getenv(3) and [8]setenv(3) are
  implemented by accessing this environment array (or a copy of it).

  You can’t have null bytes in environment variables (or their values)
  for exactly the same reason that you can’t have them in argv
  parameters.
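
  You can watch the truncation happen without even starting a new
  process, since setenv(3) also takes plain C strings. A small sketch;
  stored_env_length is a made-up helper name:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: store `value` under `name` with setenv(3), read
 * it back with getenv(3), and return the length of what was actually
 * stored. Returns (size_t)-1 if either call fails. */
size_t stored_env_length(const char *name, const char *value) {
    assert(name != NULL && value != NULL);
    if (setenv(name, value, 1) != 0)
        return (size_t)-1;
    const char *stored = getenv(name);
    return stored != NULL ? strlen(stored) : (size_t)-1;
}
```

  Passing the bytes foo\0bar as the value stores just foo, so the
  helper returns 3.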

Bonus: “Abstract” Unix Domain Sockets

  In this “bonus” section I’m going to describe an unexpected place where
  you can use embedded null bytes. Linux implements an esoteric,
  non-standard extension for AF_UNIX sockets that allows you to use null
  bytes in a surprising way. This feature is documented in the Linux man
  page for [9]unix(7).

  Here’s how it works. The system calls bind(2) and connect(2) accept a
  sockaddr struct describing the address to connect to or bind on, as
  well as another parameter called addrlen that describes the size of the
  sockaddr struct. The value for the addrlen parameter is normally just
  the sizeof the concrete struct that addr points to:
int bind(int sockfd, const struct sockaddr *addr, socklen_t addrlen);
int connect(int sockfd, const struct sockaddr *addr, socklen_t addrlen);

  This unusual addrlen parameter is required because the socket address
  needs to be polymorphic across different types of socket structures.
  The sockets API supports multiple socket families that all have
  different addressing schemes. For instance, the address of an IPv4
  socket uses a 32-bit IPv4 address, the address of an IPv6 socket uses
  a 128-bit IPv6 address, and the address of a Unix socket normally uses
  a filesystem path. These are all represented with different underlying
  struct types. For instance, an IPv4 socket uses a struct sockaddr_in,
  an IPv6 socket uses a struct sockaddr_in6, and a Unix socket uses a
  struct sockaddr_un.

  Since C doesn’t have real polymorphism, what you do is declare a
  concrete type like sockaddr_in or sockaddr_un and then supply bind(2)
  and connect(2) with a pointer to that struct, cast as a sockaddr *. The
  true length of the underlying socket address is given via addrlen. In a
  more modern language you’d implement this polymorphism using
  inheritance or some type of abstract interface, but C doesn’t have
  these capabilities. The addrlen parameter can be thought of as a clever
  hack to work around this limitation of the C language.
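
  As a sketch of what this looks like from the receiving side, here are
  two hypothetical helpers (family_of and full_addrlen are my names,
  not part of any API). The first recovers the family through the
  generic pointer; the second gives the struct size a caller would
  normally pass as addrlen.

```c
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Hypothetical helper: recover the address family from a generic
 * pointer, as a callee must before interpreting the rest of the struct. */
sa_family_t family_of(const struct sockaddr *addr) {
    assert(addr != NULL);
    return addr->sa_family;
}

/* Hypothetical helper: the full struct size normally passed as addrlen
 * for each family; 0 for families this sketch does not know about. */
socklen_t full_addrlen(sa_family_t family) {
    switch (family) {
    case AF_INET:  return sizeof(struct sockaddr_in);
    case AF_INET6: return sizeof(struct sockaddr_in6);
    case AF_UNIX:  return sizeof(struct sockaddr_un);
    default:       return 0;
    }
}
```

  A caller fills in, say, a struct sockaddr_un, casts its address to
  const struct sockaddr *, and passes sizeof(struct sockaddr_un)
  alongside it.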

  For Unix sockets on Linux, the sockaddr_un type is defined like this:
struct sockaddr_un {
   sa_family_t sun_family;               /* AF_UNIX */
   char        sun_path[108];            /* pathname */
};

  Normally you’d put a regular null-terminated filesystem path in the
  sun_path field.

  The “abstract” socket feature on Linux instead works like this: you set
  the first byte in sun_path to be \0, and then put up to 107 additional
  bytes after it. Then the addrlen parameter to bind(2) or connect(2) is
  set to be sizeof(sa_family_t), which is two, plus the number of bytes
  you put into sun_path, including the initial null byte.
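
  Building such an address might look like the sketch below.
  make_abstract_addr is a hypothetical helper, and I’ve used
  offsetof(struct sockaddr_un, sun_path) rather than hardcoding
  sizeof(sa_family_t); on Linux the two should come out the same.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Hypothetical helper: fill addr with an abstract socket name made of
 * name_len arbitrary bytes (null bytes allowed). Returns the addrlen
 * to pass to bind(2) or connect(2), or 0 if the name does not fit. */
socklen_t make_abstract_addr(struct sockaddr_un *addr,
                             const void *name, size_t name_len) {
    assert(addr != NULL && name != NULL);
    if (name_len > sizeof(addr->sun_path) - 1)
        return 0;
    memset(addr, 0, sizeof(*addr));
    addr->sun_family = AF_UNIX;
    addr->sun_path[0] = '\0';       /* marks the name as abstract */
    memcpy(addr->sun_path + 1, name, name_len);
    /* Family field, plus the leading null byte, plus the name itself. */
    return (socklen_t)(offsetof(struct sockaddr_un, sun_path)
                       + 1 + name_len);
}
```

  Note that the name is passed with an explicit length, so a name like
  a\0b (three bytes) is perfectly fine here.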

  The kernel looks at the first two bytes of the struct that addr points
  to, which always hold a sa_family_t representing the socket family. If it
  sees AF_UNIX, it then computes the size of the value in sun_path using
  addrlen - 2. In this way the kernel can explicitly tell how large the
  value stored in sun_path is, which is why using an initial null byte is
  possible. If the first byte in sun_path is zero then the kernel
  considers the socket name to be “abstract”. Such a socket will exist in
  memory in the kernel, but does not correspond to a filesystem path.

  Abstract sockets have a few interesting uses. One useful property is
  that they’re reference counted by the kernel. For regular
  Unix sockets defined in the filesystem, you may need to take care to
  remove stale sockets from the filesystem after use. Abstract sockets
  don’t have this problem: once an abstract socket is no longer in use by
  any process, the kernel automatically cleans it up.
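
  Here’s a Linux-only sketch you can use to poke at this behavior.
  bind_abstract is a hypothetical helper, and the choice of
  SOCK_STREAM is arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical helper (Linux only): bind a new stream socket to the
 * abstract name given by name/name_len. Returns the socket fd, or -1
 * on error. */
int bind_abstract(const void *name, size_t name_len) {
    assert(name != NULL);
    struct sockaddr_un addr;
    if (name_len > sizeof(addr.sun_path) - 1)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    memcpy(addr.sun_path + 1, name, name_len);  /* sun_path[0] stays '\0' */
    socklen_t len = (socklen_t)(offsetof(struct sockaddr_un, sun_path)
                                + 1 + name_len);

    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    if (bind(fd, (const struct sockaddr *)&addr, len) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

  While the returned descriptor is open, binding the same name a second
  time should fail. After you close it the name can be bound again
  right away, with nothing to unlink.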

References

  1. https://eklitzke.org/index.rss
  2. https://eklitzke.org/atom.xml?type=blog
  3. https://en.wikipedia.org/wiki/Null-terminated_string
  4. http://man7.org/linux/man-pages/man3/printf.3.html
  5. http://man7.org/linux/man-pages/man3/fwrite.3.html
  6. https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/fs/open.c
  7. http://man7.org/linux/man-pages/man3/getenv.3.html
  8. http://man7.org/linux/man-pages/man3/setenv.3.html
  9. http://man7.org/linux/man-pages/man7/unix.7.html
 10. https://eklitzke.org/
 11. mailto:[email protected]
 12. https://eklitzke.org/0x157EFCACBC648422.asc
 13. https://eklitzke.org/index.rss
 14. https://github.com/eklitzke