Unexpected Places You Can And Can’t Use Null Bytes
Feb 1, 2017
The traditional way of representing strings in C is using
[3]null-terminated character arrays. Common C library functions like
strcpy and printf determine how long a string is by sequentially
scanning memory until a null byte is found. This complicates situations
where the string data itself should contain literal null bytes.
Generally this doesn’t mean that it’s impossible to use strings with
embedded null characters, just that you have to be more careful when
doing so. Typically this is done by using methods that explicitly
specify the length of strings.
For instance, one may run into problems using [4]printf(3) to write a
string containing null bytes to stdout; but the limitation can be
worked around using [5]fwrite(3) which accepts a parameter describing
how large the data buffer is:
// INCORRECT: won't work as expected if str contains null bytes
printf("%s", str);
// OK: no problems with embedded null bytes here
fwrite(str, sizeof(char), nbytes, stdout);
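To make the difference concrete, here is a minimal sketch in C. The helper names write_bytes and c_string_length are my own, made up for illustration; they are not part of any library. The first passes an explicit length to fwrite(3), and the second reports the length that printf or strcpy would infer by scanning for a null byte.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes exactly nbytes from buf to out, embedded
 * null bytes included, and returns the number of bytes written. */
size_t write_bytes(const char *buf, size_t nbytes, FILE *out) {
    assert(buf != NULL && out != NULL);
    return fwrite(buf, sizeof(char), nbytes, out);
}

/* Hypothetical helper: the length that functions like printf("%s", ...)
 * and strcpy would see, i.e. the bytes before the first null byte. */
size_t c_string_length(const char *buf) {
    assert(buf != NULL);
    return strlen(buf);
}
```

For a seven-byte buffer holding foo\0bar, c_string_length sees only the three bytes of foo, while write_bytes(buf, 7, stdout) emits all seven.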
However, there are a few places where you don’t have this option, and
embedded null bytes just won’t work. In these situations you really,
truly can’t use strings that contain null bytes. In this article I’m
going to give a few examples of places where you cannot use null bytes;
and one surprising place where you can.
Filesystem Paths
On Unix, a path is defined to be a null-terminated C string. This means
you can’t have a file whose name is foo\0bar or any other string
containing null bytes.
To open a file on Unix you use the open(2) syscall. That function has the
following signature:
int open(const char *pathname, int flags, mode_t mode);
As you can see, there’s no parameter describing the length of the
path—the kernel treats the path parameter as a regular null-terminated
C string. If you were to supply foo\0bar as the path, there’s no way
the kernel would be able to disambiguate this from the string foo. You
can confirm this by looking at [6]fs/open.c in the Linux kernel, which
is the file that defines open(2) and most of the other file-oriented
system calls. Look for the line that starts with SYSCALL_DEFINE3(open,
and you’ll see there’s no trickery involved here. Again, there are
quite a few other system calls operating on file names defined in this
file, and all of them define paths using const char * parameters.
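You can also see the truncation from userspace. The following is a rough sketch, with helper names (create_file, file_exists) that I made up for this example. It hands open(2) whatever buffer it is given; if that buffer contains an embedded null byte, the kernel only ever sees the part before it.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: creates an empty file at pathname. Only the
 * bytes before the first null byte in pathname reach the kernel.
 * Returns 0 on success, -1 on failure. */
int create_file(const char *pathname) {
    assert(pathname != NULL);
    int fd = open(pathname, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0) {
        return -1;
    }
    return close(fd);
}

/* Hypothetical helper: returns 1 if something exists at pathname. */
int file_exists(const char *pathname) {
    struct stat st;
    assert(pathname != NULL);
    return stat(pathname, &st) == 0;
}
```

Calling create_file with a buffer containing /tmp/foo\0bar should leave behind a file named /tmp/foo, since nothing after the null byte is ever looked at.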
While we’re on the topic, it’s worth noting that the only other
restriction on filenames is that they cannot contain a /, which is
the character used to separate directories in a path. Filenames can contain
arbitrary other binary data, including spaces and newlines, and there’s
no defined character encoding.
Command Arguments
C programs define an entry point called main() with the following
function prototype:
int main(int argc, char **argv);
The parameter argc contains the number of arguments, and argv is an
array of null-terminated strings representing the command line
arguments for the program (with a final null pointer element). Suppose you
were to try to encode a null byte in one of the argv parameters. The
issue is that there is no parameter specifying the lengths of the
strings in argv. Therefore there’s no way for the invoked program to
know if the argv parameters have embedded null bytes or not.
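Here is a small sketch of what the invoked program has to work with. The helper names count_args and arg_length are hypothetical. The first walks argv up to its final null pointer; the second measures an argument the only way available, by scanning for a null byte.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts arguments by walking argv until the
 * terminating NULL pointer, without consulting argc. */
int count_args(char **argv) {
    assert(argv != NULL);
    int n = 0;
    while (argv[n] != NULL) {
        n++;
    }
    return n;
}

/* Hypothetical helper: the only length a program can recover for an
 * argument, since no explicit length is passed alongside argv. */
size_t arg_length(const char *arg) {
    assert(arg != NULL);
    return strlen(arg);
}
```

If an argument buffer held foo\0bar, arg_length would report three, and the trailing bar would be invisible to the program.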
You might wonder if this is just a limitation of the C API. For
instance, what if you write a program in assembly? Is there another way
to access the argument parameters and get their size?
Actually, the answer is no. Here’s one way to think about it. The
prototype for execve(2) is like this:
int execve(const char *filename, char *const argv[], char *const envp[]);
The parameters to execve need to be passed through to the new process.
To do this, the kernel must copy this data into the memory of the new
process. Since the kernel is taking regular char * types (or in this
case, arrays of them), it has to assume they’re null-delimited when
copying them.
If you have a program that needs to be able to work with embedded null
bytes for parameters, you should have a way to specify such parameters
either on stdin, or via a file; or preferably, both.
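As a sketch of the stdin or file approach, the hypothetical read_all helper below (my own name, not a library function) reads a stream to EOF and reports the byte count separately, so embedded null bytes survive intact. It assumes the whole input fits in memory.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads every byte from in until EOF into a heap
 * buffer and stores the byte count in *len. Returns NULL on failure.
 * The caller frees the result. The buffer is NOT null-terminated;
 * always use *len to know how much data it holds. */
char *read_all(FILE *in, size_t *len) {
    assert(in != NULL && len != NULL);
    size_t cap = 4096, used = 0;
    char *buf = malloc(cap);
    if (buf == NULL) {
        return NULL;
    }
    for (;;) {
        used += fread(buf + used, 1, cap - used, in);
        if (used < cap) {
            break; /* short read: EOF or error */
        }
        cap *= 2;
        char *grown = realloc(buf, cap);
        if (grown == NULL) {
            free(buf);
            return NULL;
        }
        buf = grown;
    }
    if (ferror(in)) {
        free(buf);
        return NULL;
    }
    *len = used;
    return buf;
}
```

A program could call read_all(stdin, &len) when no file is named and read_all on an fopen(3) handle otherwise, which covers both input paths with the same code.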
Environment Variables
If you were paying close attention above, you’ll recall that execve(2)
actually takes three parameters. The third parameter is a list of
environment variables for the new process. In fact, most C compilers on
Unix systems will allow you to define main() with the prototype:
int main(int argc, char **argv, char **envp);
Under the hood, library calls like [7]getenv(3) and [8]setenv(3) are
implemented by accessing this environment array (or a copy of it).
You can’t have null bytes in environment variables (or their values)
for exactly the same reason that you can’t have them in argv
parameters.
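The same truncation is easy to observe with setenv(3) and getenv(3). This is a minimal sketch; stored_value_length is a hypothetical helper name of mine, and the variable name passed in is up to the caller.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: stores value under name in the environment,
 * reads it back, and returns the length of what was actually stored.
 * Returns -1 if setting or reading the variable fails. */
long stored_value_length(const char *name, const char *value) {
    assert(name != NULL && value != NULL);
    if (setenv(name, value, 1) != 0) {
        return -1;
    }
    const char *stored = getenv(name);
    if (stored == NULL) {
        return -1;
    }
    return (long)strlen(stored);
}
```

Passing a value buffer containing foo\0bar should yield three, because setenv(3) takes a plain const char * and stops copying at the first null byte.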
Bonus: “Abstract” Unix Domain Sockets
In this “bonus” section I’m going to describe an unexpected place where
you can use embedded null bytes. Linux implements an esoteric,
non-standard extension for AF_UNIX sockets that allows you to use null
bytes in a surprising way. This feature is documented in the Linux man
page for [9]unix(7).
Here’s how it works. The system calls bind(2) and connect(2) accept a
sockaddr struct describing the address to connect to or bind on, as
well as another parameter called addrlen that describes the size of the
sockaddr struct. The value for the addrlen parameter is normally just
sizeof applied to the address struct being passed:
int bind(int sockfd, const struct sockaddr *addr, socklen_t addrlen);
int connect(int sockfd, const struct sockaddr *addr, socklen_t addrlen);
The unusual addrlen parameter is required because the socket address
needs to be polymorphic across different types of socket structures.
The sockets API supports multiple address families that all have
different addressing schemes. For instance, the address of an IPv4
socket uses a 32-bit IPv4 address, the address of an IPv6 socket uses
a 128-bit IPv6 address, and the address of a Unix socket normally uses
a filesystem path. These are all represented with different underlying
struct types. For instance, an IPv4 socket uses a struct sockaddr_in,
an IPv6 socket uses a struct sockaddr_in6, and a Unix socket uses a
struct sockaddr_un.
Since C doesn’t have real polymorphism, what you do is declare a
concrete type like sockaddr_in or sockaddr_un and then supply bind(2)
and connect(2) with a pointer to that struct, cast as a sockaddr *. The
true length of the underlying socket address is given via addrlen. In a
more modern language you’d implement this polymorphism using
inheritance or some type of abstract interface, but C doesn’t have
these capabilities. The addrlen parameter can be thought of as a clever
hack to work around this limitation of the C language.
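To illustrate the cast-and-length pattern, here is a rough sketch. Both helpers (make_unix_addr and family_name) are hypothetical names I picked for this example. One builds a concrete sockaddr_un, and the other inspects any address through a generic sockaddr pointer, the way a callee receiving addr and addrlen would.

```c
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Hypothetical helper: builds a Unix socket address for a regular,
 * null-terminated filesystem path (truncated if it is too long). */
struct sockaddr_un make_unix_addr(const char *path) {
    struct sockaddr_un addr;
    assert(path != NULL);
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
    return addr;
}

/* Hypothetical helper: looks at a generic address the way a callee
 * would, using only the family field that every concrete address
 * struct starts with, plus the caller-supplied length. */
const char *family_name(const struct sockaddr *addr, socklen_t addrlen) {
    assert(addr != NULL);
    if (addrlen < sizeof(sa_family_t)) {
        return "too short";
    }
    switch (addr->sa_family) {
    case AF_INET:  return "IPv4";
    case AF_INET6: return "IPv6";
    case AF_UNIX:  return "Unix";
    default:       return "unknown";
    }
}
```

A caller would write something like family_name((const struct sockaddr *)&addr, sizeof(addr)), which is the same cast and length pair that bind(2) and connect(2) expect.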
For Unix sockets on Linux, the sockaddr_un type is defined like this:
struct sockaddr_un {
sa_family_t sun_family; /* AF_UNIX */
char sun_path[108]; /* pathname */
};
Normally you’d put a regular null-terminated filesystem path in the
sun_path field.
The “abstract” socket feature on Linux instead works like this: you set
the first byte in sun_path to be \0, and then put up to 107 additional
bytes after it. Then the addrlen parameter to bind(2) or connect(2) is
set to be sizeof(sa_family_t), which is two, plus the number of bytes
you put into sun_path, including the initial null byte.
The kernel looks at the first two bytes in the addr pointer, which
always hold a sa_family_t representing the socket family type. If it
sees AF_UNIX, it then computes the size of the value in sun_path using
addrlen - 2. In this way the kernel can explicitly tell how large the
value stored in sun_path is, which is why using an initial null byte is
possible. If the first byte in sun_path is zero then the kernel
considers the socket name to be “abstract”. Such a socket will exist in
memory in the kernel, but does not correspond to a filesystem path.
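The steps above can be sketched roughly as follows. This is Linux-only, and bind_abstract is a hypothetical helper name of mine. I compute the offset of sun_path with offsetof rather than hard-coding the size of sa_family_t; on Linux the two should agree.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical helper (Linux only): binds a new AF_UNIX stream socket
 * to an abstract name of name_len bytes, which may itself contain
 * null bytes. Returns the socket fd, or -1 on failure. */
int bind_abstract(const char *name, size_t name_len) {
    struct sockaddr_un addr;
    assert(name != NULL);
    if (name_len > sizeof(addr.sun_path) - 1) {
        return -1; /* at most 107 bytes fit after the leading null */
    }

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    /* sun_path[0] stays '\0', which marks the name as abstract. */
    memcpy(addr.sun_path + 1, name, name_len);

    /* Family field, plus the leading null byte, plus the name bytes. */
    socklen_t addrlen =
        (socklen_t)(offsetof(struct sockaddr_un, sun_path) + 1 + name_len);

    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }
    if (bind(fd, (const struct sockaddr *)&addr, addrlen) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Because the length is explicit, a call such as bind_abstract("foo\0bar", 7) binds a name that includes the embedded null byte and the bytes after it.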
Abstract sockets have a few interesting uses. One useful property
is that they’re reference counted by the kernel. For regular
Unix sockets defined in the filesystem, you may need to take care to
remove stale sockets from the filesystem after use. Abstract sockets
don’t have this problem: once an abstract socket is no longer in use by
any process, the kernel automatically cleans it up.
References
1. https://eklitzke.org/index.rss
2. https://eklitzke.org/atom.xml?type=blog
3. https://en.wikipedia.org/wiki/Null-terminated_string
4. http://man7.org/linux/man-pages/man3/printf.3.html
5. http://man7.org/linux/man-pages/man3/fwrite.3.html
6. https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/fs/open.c
7. http://man7.org/linux/man-pages/man3/getenv.3.html
8. http://man7.org/linux/man-pages/man3/setenv.3.html
9. http://man7.org/linux/man-pages/man7/unix.7.html
10. https://eklitzke.org/
11. mailto:[email protected]
12. https://eklitzke.org/0x157EFCACBC648422.asc
13. https://eklitzke.org/index.rss
14. https://github.com/eklitzke