Volume 0x0d, Issue 0x42, Phile #0x0F of 0x11

==Phrack Inc.==

|=-----------------------------------------------------------------------=|
|=--------------=[ Linux Kernel Heap Tampering Detection ]=--------------=|
|=-----------------------------------------------------------------------=|
|=------------------=[ Larry H. <[email protected]> ]=----------------=|
|=-----------------------------------------------------------------------=|

------[ Index

1 - History and background of the Linux kernel heap allocators

1.1 - SLAB
1.2 - SLOB
1.3 - SLUB
1.4 - SLQB
1.5 - The future

2 - Introduction: What is KERNHEAP?

3 - Integrity assurance for kernel heap allocators

3.1 - Meta-data protection against full and partial overwrites
3.2 - Detection of arbitrary free pointers and freelist corruption
3.3 - Overview of NetBSD and OpenBSD kernel heap safety checks
3.4 - Microsoft Windows 7 kernel pool allocator safe unlinking

4 - Sanitizing memory of the look-aside caches

5 - Deterrence of IPC based kmalloc() overflow exploitation

6 - Prevention of copy_to_user() and copy_from_user() abuse

7 - Prevention of vsyscall overwrites on x86_64

8 - Developing the right regression testsuite for KERNHEAP

9 - The Inevitability of Failure

9.1 - Subverting SELinux and the audit subsystem
9.2 - Subverting AppArmor

10 - References

11 - Thanks and final statements

12 - Source code

------[ 1. History and background of the Linux kernel heap allocators

Before discussing what KERNHEAP is, along with its internals and design,
we will take a brief look at the background and history of the Linux
kernel heap allocators.

In 1994, Jeff Bonwick from Sun Microsystems presented the SunOS 5.4
kernel heap allocator at USENIX Summer [1]. This allocator produced higher
performance results thanks to its use of caches to hold invariable state
information about the objects, and reduced fragmentation significantly,
grouping similar objects together in caches. When memory was under stress,
the allocator could check the caches for unused objects and let the system
reclaim the memory (that is, shrinking the caches on demand).

We will refer to these units composing the caches as "slabs". A slab
comprises contiguous pages of memory. Each page in the slab holds chunks
(objects or buffers) of the same size. This minimizes internal
fragmentation, since a slab will only contain same-sized chunks, and
only the 'trailing' or free space in the page will be wasted, until it
is required for a new allocation. The following diagram shows the
layout of Bonwick's slab allocator:

    +-------+
    | CACHE |
    +-------+    +---------+
    | CACHE |----|  EMPTY  |
    +-------+    +---------+    +------+      +------+
                 | PARTIAL |----| SLAB |------| PAGE | (objects)
                 +---------+    +------+      +------+    +-------+
                 |  FULL   |      ...             |-------| CHUNK |
                 +---------+                              +-------+
                                                          | CHUNK |
                                                          +-------+
                                                          | CHUNK |
                                                          +-------+
                                                             ...

These caches operated in a LIFO manner: when an allocation was requested
for a given size, the allocator would look for the first available free
object in the appropriate slab. This saved the cost of page allocation
and creation of the object altogether.

"A slab consists of one or more pages of virtually contiguous
memory carved up into equal-size chunks, with a reference count
indicating how many of those chunks have been allocated."
Page 5, 3.2 Slabs. [1]

Each slab was managed with a kmem_slab structure, which contained its
reference count, freelist of chunks and linkage to the associated
kmem_cache. Each chunk had a header defined as the kmem_bufctl (chunks
are commonly referred to as buffers in the paper and implementation),
which contained the freelist linkage, address to the buffer and a
pointer to the slab it belongs to. The following diagram shows the
layout of a slab:

           .-------------------.
           | SLAB (kmem_slab)  |
           `-------+--+--------'
                  /    \
          +----+---+--+-----+
          | bufctl | bufctl |
          +-.-'----+.-'-----+
        _.-'      .-'
    +-.-'------.-'-----------------+
    |        |        | ':>=jJ6XKNM|
    | buffer | buffer | Unused XQNM|
    |        |        | ':>=jJ6XKNM|
    +------------------------------+
    [           Page (s)           ]

For chunk sizes smaller than 1/8 of a page (ex. 512 bytes for x86), the
meta-data of the slab is contained within the page, at the very end.
The rest of space is then divided in equally sized chunks. Because all
buffers have the same size, only linkage information is required,
allowing the rest of values to be computed at runtime, saving space.
The freelist pointer is stored at the end of the chunk. Bonwick
states that this is because the end of a data structure tends to be
less active than the beginning, and because it lets debugging work even
after a use-after-free has overwritten data in the buffer, as long as
the freelist pointer remains intact. In deliberate attack scenarios
this is obviously a flawed assumption. An additional word was also
reserved to hold a pointer to state information used by objects
initialized through a constructor.

For larger allocations, the meta-data resides out of the page.

The freelist management was simple: each cache maintained a circular
doubly-linked list sorted to put the empty slabs (all buffers
allocated) first, then the partial slabs (free and allocated buffers)
and finally the full slabs (reference counter set to zero, all buffers
free). The cache freelist pointer points to the first non-empty slab,
and each slab then contains its own freelist. Bonwick chose this
approach to simplify the memory reclaiming process.

The process of reclaiming memory started at the original
kmem_cache_free() function, which verified the reference counter. If
its value was zero (all buffers free), it moved the full slab to the
tail of the freelist with the rest of full slabs. Section 4 explains
the intrinsic details of hardware cache side effects and optimization.
It is an interesting read due to the hardware used at the time the
paper was written. In order to optimize cache utilization and bus
balance, Bonwick devised 'slab coloring'. Slab coloring is simple: when
a slab is created, the buffer address starts at a different offset
(referred to as the color) from the slab base (since a slab is an
allocated page or pages, this is always aligned to page size).
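
As a rough illustration of coloring, the sketch below cycles the
starting offset of each new slab through the space left unused in the
slab. The structure and function names are hypothetical, and the step
is assumed to be a cache line size picked by the caller; neither comes
from the SunOS sources.

```c
#include <stddef.h>

/* Hypothetical per-cache coloring state; all names are illustrative. */
struct color_state {
    size_t align;   /* color step, e.g. the cache line size (non-zero) */
    size_t range;   /* number of distinct colors available */
    size_t next;    /* color index to hand to the next slab */
};

/* Derive the color range from the space left after packing buffers. */
static void color_init(struct color_state *cs, size_t unused, size_t align)
{
    cs->align = align;
    cs->range = unused / align + 1;
    cs->next = 0;
}

/*
 * Return the byte offset (from the slab base) at which the first
 * buffer of a newly created slab starts, then advance to the next
 * color, wrapping around once the range is exhausted.
 */
static size_t color_next_offset(struct color_state *cs)
{
    size_t off = cs->next * cs->align;

    cs->next = (cs->next + 1) % cs->range;
    return off;
}
```

With 100 unused bytes and a 32-byte step, successive slabs would start
their buffers at offsets 0, 32, 64 and 96 before wrapping back to 0.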

It is interesting to note that Bonwick already studied different
approaches to detect kernel heap corruption, and implemented them in
the SunOS 5.4 kernel (possibly predating every other kernel in terms of
heap corruption detection). Furthermore, Bonwick noted the performance
impact of these features was minimal.

"Programming errors that corrupt the kernel heap - such as
modifying freed memory, freeing a buffer twice, freeing an
uninitialized pointer, or writing beyond the end of a buffer - are
often difficult to debug. Fortunately, a thoroughly instrumented
kernel memory allocator can detect many of these problems."
Page 10, 6. Debugging features. [1]

The audit mode enabled storage of the user of every allocation (an
equivalent of the Linux feature that will be briefly described in
the allocator subsections) and provided these traces when corruption
was detected.

Invalid free pointers were detected using a hash lookup in the
kmem_cache_free() function. Once an object was freed, and after the
destructor was called, it filled the space with 0xdeadbeef. When the
object was allocated again, the pattern would be verified to see
that no modifications occurred (that is, detection of use-after-free
conditions, or write-after-free more specifically). Allocated objects
were filled with 0xbaddcafe, which marked them as uninitialized.

Redzone checking was also implemented to detect overwrites past the end
of an object, adding a guard value at that position. This was verified
upon free.
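
The redzone idea can be modeled in userland in a few lines. This is
only a sketch: rz_alloc() and rz_free() are made-up names, the guard
value is an arbitrary assumption, and malloc() stands in for the slab
allocator.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumed guard value; any fixed pattern works for this model. */
#define RZ_GUARD 0xfeedfaceU

/* Allocate size bytes and place a guard word right after them. */
static void *rz_alloc(size_t size)
{
    uint32_t guard = RZ_GUARD;
    unsigned char *p = malloc(size + sizeof(guard));

    if (p == NULL)
        return NULL;
    memcpy(p + size, &guard, sizeof(guard));
    return p;
}

/*
 * Release a buffer obtained from rz_alloc(). Returns 0 when the guard
 * is intact and -1 when something wrote past the end of the object.
 * The memory is released in both cases to keep the model simple.
 */
static int rz_free(void *ptr, size_t size)
{
    uint32_t guard;
    int ret;

    memcpy(&guard, (unsigned char *)ptr + size, sizeof(guard));
    ret = (guard == RZ_GUARD) ? 0 : -1;
    free(ptr);
    return ret;
}
```

A single byte written past the end is enough to trip the check on
free, but nothing is noticed before that point, and a fixed guard can
be rewritten by anyone who knows its value.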

Finally, a simple but possibly effective approach to detect memory
leaks used the timestamps from the audit log to find allocations which
had been online for a suspiciously long time. In modern times, this
could be implemented using a kernel thread. SunOS did it from userland
via /dev/kmem, which would be unacceptable in security terms.

For more information about the concepts of slab allocation, refer to
Bonwick's paper [1], which provides an in-depth overview of the theory
and implementation.

---[ 1.1 SLAB

The SLAB allocator in Linux (mm/slab.c) was written by Mark Hemment
in 1996-1997, and further improved through the years by Manfred
Spraul and others. The design closely follows the one presented by
Bonwick for his Solaris allocator. It was first integrated in the 2.2
series. This subsection will avoid describing more theory than strictly
necessary, but those interested in a more in-depth overview of SLAB
can refer to "Understanding the Linux Virtual Memory Manager" by
Mel Gorman, and its eighth chapter "Slab Allocator" [X].

The caches are defined as a kmem_cache structure, comprised of
(most commonly) page sized slabs, containing initialized objects.
Each cache holds its own GFP flags, the order of pages per slab
(2^n), the number of objects (chunks) per slab, coloring offsets
and range, a pointer to a constructor function, a printable name
and linkage to other caches. Optionally, if enabled, it can define
a set of fields to hold statistics and debugging related
information.

Each kmem_cache has an array of kmem_list3 structures, which contain
the information about partial, full and free slab lists:

    struct kmem_list3 {
        struct list_head slabs_partial;
        struct list_head slabs_full;
        struct list_head slabs_free;
        unsigned long free_objects;
        unsigned int free_limit;
        unsigned int colour_next;
        ...
        unsigned long next_reap;
        int free_touched;
    };

These structures are initialized with kmem_list3_init(), setting
all the reference counters to zero and preparing the list3 to be
linked to its respective cache nodelists list for the proper NUMA
node. This can be found in cpuup_prepare() and kmem_cache_init().

The "reaping" or draining of the cache free lists is done with the
drain_freelist() function, which returns the total number of slabs
released, initiated via cache_reap(). A slab is released using
slab_destroy(), and allocated with the cache_grow() function for a
given NUMA node, flags and cache.

The cache contains the doubly-linked lists for the partial, full
and free lists, and a free object count in free_objects.

A slab is defined with the following structure:

    struct slab {
        struct list_head list;   /* linkage/pointer to freelist */
        unsigned long colouroff; /* color / offset */
        void *s_mem;             /* start address of first object */
        unsigned int inuse;      /* num of objs active in slab */
        kmem_bufctl_t free;      /* first free chunk (or none) */
        unsigned short nodeid;   /* NUMA node id for nodelists */
    };

The list member points to the freelist the slab belongs to:
partial, full or empty. The s_mem member is used to calculate the
address of a specific object together with the color offset. The free
member identifies the first free object in the slab (or none). The
cache of the slab is tracked in the page structure.
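
To make the roles of s_mem and free more concrete, here is a
simplified userland model of an index-based freelist handing out
objects. It is a sketch written from the description above, not a copy
of mm/slab.c: the index array is embedded in the structure for
brevity, coloring is ignored, and all mini_* names are invented.

```c
#include <stddef.h>

typedef unsigned int bufctl_t;

#define BUFCTL_END    ((bufctl_t)~0U)  /* marks the end of the freelist */
#define OBJS_PER_SLAB 4

struct mini_slab {
    unsigned char *s_mem;            /* address of the first object */
    size_t obj_size;                 /* size of every object */
    unsigned int inuse;              /* objects currently allocated */
    bufctl_t free;                   /* index of the first free object */
    bufctl_t bufctl[OBJS_PER_SLAB];  /* bufctl[i]: free index after i */
};

static void mini_slab_init(struct mini_slab *s, void *mem, size_t obj_size)
{
    bufctl_t i;

    s->s_mem = mem;
    s->obj_size = obj_size;
    s->inuse = 0;
    s->free = 0;
    for (i = 0; i < OBJS_PER_SLAB; i++)
        s->bufctl[i] = i + 1;
    s->bufctl[OBJS_PER_SLAB - 1] = BUFCTL_END;
}

/* Take the first free object, or return NULL when the slab is full. */
static void *mini_slab_get(struct mini_slab *s)
{
    void *obj;

    if (s->free == BUFCTL_END)
        return NULL;
    obj = s->s_mem + s->obj_size * s->free;
    s->free = s->bufctl[s->free];
    s->inuse++;
    return obj;
}

/* Give an object back; it becomes the new head of the freelist. */
static void mini_slab_put(struct mini_slab *s, void *obj)
{
    bufctl_t idx = (bufctl_t)(((unsigned char *)obj - s->s_mem) / s->obj_size);

    s->bufctl[idx] = s->free;
    s->free = idx;
    s->inuse--;
}
```

Note that mini_slab_put() trusts both the pointer it is given and the
index array; neither is validated in this model.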

The function used to retrieve the cache a potential object belongs
to is virt_to_cache(), which itself relies on page_get_cache() on a
page structure pointer. It checks that the Slab page flag is set,
and takes the lru.next pointer of the head page (for compatibility
with compound pages; this is no different for normal pages). The
cache is set with page_set_cache(). The behavior to assign pages to
a slab and cache can be seen in slab_map_pages().

The internal function used for cache shrinking is __cache_shrink(),
called from kmem_cache_shrink() and during cache destruction. SLAB
clearly scales poorly: on NUMA systems with a large number of nodes,
substantial time will be spent walking the nodelists, draining each
freelist, and so forth. In the process, it is most likely that some
of those nodes won't be under memory pressure.

Slab management data is stored inside the slab itself when the size
is under 1/8 of PAGE_SIZE (512 bytes for x86, same as Bonwick's
allocator). This is done by alloc_slabmgmt(), which either stores
the management structure within the slab, or allocates space for it
from the kmalloc caches (slabp_cache within the kmem_cache
structure, assigned with kmem_find_general_cachep() given the slab
size). Again, this is reflected in slab_destroy() which takes care
of freeing the off-slab management structure when applicable.

The interesting security impact of this logic in managing control
structures is that slabs with their meta-data stored off-slab, in
one of the general kmalloc caches, will be exposed to potential
abuse (ex. in a slab overflow scenario in some adjacent object, the
freelist pointer could be overwritten to leverage a
write4-primitive during unlinking). This is one of the loopholes
which KERNHEAP, as described in this paper, will close or at the very
least do everything feasible to deter reliable exploitation.
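
The unlinking problem is easy to see in a userland model of a
doubly-linked list. The sketch below is illustrative only (the
list_node type and the function names are made up, and this is not the
KERNHEAP code): the unchecked variant performs two writes through
pointers read from the node itself, while the checked variant first
confirms that both neighbours still point back at the node.

```c
#include <stddef.h>

struct list_node {
    struct list_node *next;
    struct list_node *prev;
};

static void list_init(struct list_node *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert n right after head. */
static void list_add(struct list_node *n, struct list_node *head)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/*
 * Unchecked unlink: if n->next and n->prev were overwritten, these
 * two stores write controlled values to controlled addresses.
 */
static void list_unlink_unchecked(struct list_node *n)
{
    n->next->prev = n->prev;
    n->prev->next = n->next;
}

/*
 * Checked unlink: refuse to touch memory unless both neighbours
 * agree that n sits between them. Returns 0 on success, -1 when the
 * linkage is inconsistent.
 */
static int list_unlink_checked(struct list_node *n)
{
    if (n->next->prev != n || n->prev->next != n)
        return -1;
    list_unlink_unchecked(n);
    return 0;
}
```

The check still dereferences the suspect pointers for reading, so it
turns a controlled write into a failed comparison at best; in a kernel
context the failure path would have to stop the operation entirely.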

Since the basic technical aspects of the SLAB allocator are now
covered, the reader can refer to mm/slab.c in any current kernel
release for further information.

---[ 1.2 SLOB

SLOB was released in November 2005, after being developed since 2003
by Matt Mackall for use in embedded systems due to its smaller memory
footprint. It lacks the complexity of all the other allocators.

The granularity of the SLOB allocator supports objects as small as 2
bytes in size, though this is subject to architecture-dependent
restrictions (alignment, etc). The author notes that this will
normally be 4 bytes for 32-bit architectures, and 8 bytes on 64-bit.

The chunks (referred to as blocks in his comments at mm/slob.c) are
referenced from a singly-linked list within each page. His approach to
reduce fragmentation is to place all objects within three distinct
lists: under 256 bytes, under 1024 bytes and then any other objects
of size greater than 1024 bytes.

The allocation algorithm is a classic next-fit, returning the first
slab containing enough chunks to hold the object. Released objects are
re-introduced into the freelist in address order.

The kmalloc and kfree layer (that is, the public API exposed from
SLOB) places a 4 byte header in objects within page size, or uses the
lower level page allocator directly if greater in size to allocate
compound pages. In such cases, it stores the size in the page
structure (in page->private). This poses a problem when detecting the
size of an allocated object, since essentially the slob_page and
page structures are the same: it is a union and the values of the
structure members overlap. Size is enforced to match, but using the
wrong place to store a custom value means a corrupted page state.

Before put_page() or free_pages(), SLOB clears the Slob bit, resets
the mapcount atomically and sets the mapping to NULL, then the page
is released back to the low-level page allocator. This prevents the
overlapping fields from leading to the aforementioned corrupted
state situation. This hack allows both SLOB and the page
allocator meta-data to coexist, allowing a lower memory footprint
and overhead.

---[ 1.3 SLUB aka The Unqueued Allocator

SLUB is the default allocator in several GNU/Linux distributions at
the moment, including Ubuntu and Fedora. It was developed by
Christopher Lameter and merged into the -mm tree in early 2007.

"SLUB is a slab allocator that minimizes cache line usage
instead of managing queues of cached objects (SLAB approach).
Per cpu caching is realized using slabs of objects instead of
queues of objects. SLUB can use memory efficiently and has
enhanced diagnostics." CONFIG_SLUB documentation, Linux kernel.

The SLUB allocator was the first to introduce merging, the concept
of grouping slabs of similar properties together, reducing the
number of caches present in the system and internal fragmentation.

This, however, has detrimental security side effects which are
explained in section 3.1. Fortunately, even without a patched
kernel, merging can be disabled at runtime.

The debugging facilities are far more flexible than those in SLAB.
They can be enabled at runtime using a boot command line option,
and per-cache.

DMA caches are created on demand, or not created at all if support
isn't required.

Another important change is the lack of SLAB's per-node partial
lists. SLUB has a single partial list, which prevents partially
free-allocated slabs from being scattered around, reducing
internal fragmentation in such cases, since otherwise those node
local lists would only be filled when allocations happen in that
particular node.

Its cache reaping has better performance than SLAB's, especially on
SMP systems, where it scales better. It does not require walking
the lists every time a slab is to be pushed into the partial list.
For non-SMP systems it doesn't use reaping at all.

Meta-data is stored using the page structure, instead of within
the beginning of each slab, allowing better data alignment. Again,
this reduces internal fragmentation since objects can be
packed tightly together without leaving unused trailing space in
the page(s). Memory requirements to hold control structures are much
lower than SLAB's, as Lameter explains:

"SLAB Object queues exist per node, per CPU. The alien cache
queue even has a queue array that contain a queue for each
processor on each node. For very large systems the number of
queues and the number of objects that may be caught in those
queues grows exponentially. On our systems with 1k nodes /
processors we have several gigabytes just tied up for storing
references to objects for those queues This does not include
the objects that could be on those queues."

To sum it up in a single paragraph: SLUB is a clever allocator
which is designed for modern systems, to scale well, work reliably
in SMP environments and reduce memory footprint of control and
meta-data structures and internal/external fragmentation. This
makes SLUB the best current target for KERNHEAP development.

---[ 1.4 SLQB

The SLQB allocator was developed by Nick Piggin to provide better
scalability and avoid fragmentation as much as possible. It goes to
great lengths to avoid allocation of compound pages, which is
optimal when memory starts running low. Overall, it is a
per-CPU allocator.

The structures used to define the caches are slightly different,
and it shows that the allocator has been designed from the ground
up to scale on high-end systems. It tries to optimize remote
freeing situations (when an object is freed in a different node/CPU
than it was allocated at). This is relevant to NUMA environments,
mostly. Objects more likely to be subjected to this situation are
long-lived ones, on systems with large numbers of processors.

It defines a slqb_page structure which "overloads" the lower level
page structure, in the same fashion as SLOB does. Instead of an
unused padding, it introduces kmem_cache_list and freelist pointers.

For each lookaside cache, each CPU has a LIFO list of the objects
local to that node (used for local allocation and freeing), free
and partial page lists, a queue for objects being freed remotely
and a queue of already free objects that come from the remote free
queues of other CPUs. Locking is minimal, but sufficient to control
cross-CPU access to these queues.

Some of the debugging facilities include tracking the user of the
allocated object (storing the caller address, cpu, pid and the
timestamp). This track structure is stored within the allocated
object space, which makes it subject to partial or full overwrites,
thus unsuitable for security purposes like similar facilities in
other allocators (SLAB and SLUB, since SLOB is impaired for
debugging).

Back to SLQB-specific changes, the use of a kmem_cache_cpu
structure per CPU can be observed. A December 2008 article at
LWN.net by Jonathan Corbet provides a summary of the
significance of this structure:

"Within that per-CPU structure one will find a number of lists
of objects. One of those (freelist) contains a list of
available objects; when a request is made to allocate an
object, the free list will be consulted first. When objects are
freed, they are returned to this list. Since this list is part
of a per-CPU data structure, objects normally remain on the
same processor, minimizing cache line bouncing. More
importantly, the allocation decisions are all done per-CPU,
with no bad cache behavior and no locking required beyond the
disabling of interrupts. The free list is managed as a stack,
so allocation requests will return the most recently freed
objects; again, this approach is taken in an attempt to
optimize memory cache behavior." [5]
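
The stack behaviour described in the quote can be sketched with a
singly-linked list threaded through the free objects themselves. This
is a simplified, single-threaded model with invented names; it leaves
out the interrupt disabling mentioned above and every other queue the
real allocator keeps.

```c
#include <stddef.h>

/* A free object stores the link to the next free object in place. */
struct free_obj {
    struct free_obj *next;
};

/* Hypothetical per-CPU freelist: a stack head and an object count. */
struct cpu_freelist {
    struct free_obj *head;
    unsigned long nr;
};

/* Free: push the object on top of the stack. */
static void fl_push(struct cpu_freelist *fl, void *obj)
{
    struct free_obj *f = obj;

    f->next = fl->head;
    fl->head = f;
    fl->nr++;
}

/* Allocate: pop the most recently freed object, or NULL if empty. */
static void *fl_pop(struct cpu_freelist *fl)
{
    struct free_obj *f = fl->head;

    if (f == NULL)
        return NULL;
    fl->head = f->next;
    fl->nr--;
    return f;
}
```

Because the link lives inside the freed object, a write-after-free on
that object directly controls what a later fl_pop() returns, which is
the kind of meta-data exposure the following sections deal with.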

In order to cope with memory stress situations, the freelists
can be flushed to return unused partial objects back to the page
allocator when necessary. This works by moving the object to the
remote freelist (rlist) from the CPU-local freelist, and keeping a
reference in the remote_free list.

The SLQB allocator is well described in depth in the aforementioned
article and the source code comments. Feel free to refer to these
sources for more in-depth information about its design and
implementation. The original RFC and patch can be found at
http://lkml.org/lkml/2008/12/11/417

---[ 1.5 The future

As architectures and computing platforms evolve, so will the
allocators in the Linux kernel. The current development process
doesn't contribute to a more stable, smaller set of options, and it
will be inevitable to see new allocators introduced into the kernel
mainline, possibly specialized for certain environments.

In the short term, SLUB will remain the default, and there seems to
be an intention to remove SLOB. It is unclear if SLQB will see
widespread deployment.

Newly developed allocators will require careful assessment, since
KERNHEAP is tied to certain assumptions about their internals. For
instance, we depend on the ability to track object sizes properly,
and it remains untested for some obscure architectures, NUMA
systems and so forth. Even a simple allocator like SLOB posed a
challenge to implement safety checks, since the internals are
greatly convoluted. Thus, it's uncertain if future ones will
require a redesign of the concepts composing KERNHEAP.

------[ 2. Introduction: What is KERNHEAP?

As of April 2009, no operating system has implemented any form of
hardening in its kernel heap management interfaces. Attacks against the
SLAB allocator in Linux have been documented and made available to the
public as early as 2005, and used to develop highly reliable exploits
to abuse different kernel vulnerabilities involving heap allocated
buffers. The first public exploit making use of kmalloc() exploitation
techniques was the MCAST_MSFILTER exploit by twiz [10].

In January 2009, an obscure, unadvertised advisory surfaced about a
buffer overflow in the SCTP implementation in the Linux kernel, which
could be abused remotely, provided that an SCTP based service was
listening on the target host. More specifically, the issue was located
in the code which processes the stream numbers contained in FORWARD-TSN
chunks.

During a SCTP association, a client sends an INIT chunk specifying a
number of inbound and outbound streams, which causes the kernel in the
server to allocate space for them via kmalloc(). After the association
is made effective (involving the exchange of INIT-ACK, COOKIE-ECHO and
COOKIE-ACK chunks), the attacker can send a FORWARD-TSN chunk with
more streams than those specified initially in the INIT chunk, leading
to the overflow condition which can be used to overwrite adjacent heap
objects with attacker controlled data. The vulnerability itself had
certain quirks and requirements which made it a good candidate for a
complex exploit, unlikely to be available to the general public, thus
restricted to more technically adept circles on kernel exploitation.
Nonetheless, reliable exploits for this issue were developed and
successfully used in different scenarios (including all major
distributions, such as Red Hat with SELinux enabled, and Ubuntu with
AppArmor).

At some point, Brad Spengler expressed interest in a potential protection
against this vulnerability class, and asked the author what kind of
measures could be taken to prevent new kernel-land heap related bugs
from being exploited. Shortly afterwards, KERNHEAP was born.

After development started, a fully remote exploit against the SCTP flaw
surfaced, developed by sgrakkyu [15]. In private discussions with a few
individuals, a technique for executing a successful attack remotely was
proposed: overwrite a syscall pointer to an attacker controlled
location (like a hook) to safely execute our payload out of the
interrupt context. This is exactly what sgrakkyu implemented for
x86_64, using the vsyscall table, which bypasses CONFIG_DEBUG_RODATA
(read-only .rodata) restrictions altogether. His exploit exposed not
only the flawed nature of the vulnerability classification process of
several organizations, the hypocritical and unethical handling of
security flaws of the Linux kernel developers, but also the futility of
SELinux and other security models against kernel vulnerabilities.

In order to prevent and detect exploitation of this class of security
flaws in the kernel, a new set of protections had to be designed and
implemented: KERNHEAP.

KERNHEAP encompasses different concepts to prevent and detect heap
overflows in the Linux kernel, as well as other well known heap related
vulnerabilities, namely double frees, partial overwrites, etc.

These concepts have been implemented introducing modifications into the
different allocators, as well as common interfaces, not only
preventing generic forms of memory corruption but also hardening
specific areas of the kernel which have been used or could be
potentially used to leverage attacks corrupting the heap. For instance,
the IPC subsystem, the copy_to_user() and copy_from_user() APIs and
others.

This is still ongoing research and the Linux kernel is an ever evolving
project which poses significant challenges. The inclusion of new
allocators will always pose a risk for new issues to surface, requiring
these protections to be adapted, or new ones developed for them.

------[ 3. Integrity assurance for kernel heap allocators

---[ 3.1 Meta-data protection against full and partial overwrites

Given the current (yet ever changing) upstream design of the kernel
allocators (SLUB, SLAB, SLOB, future SLQB, etc.), we assume:

1. A set of caches exists which hold dynamically allocated slabs,
composed of one or more physically contiguous pages, containing
same size chunks.

2. These are initialized by default or created explicitly, always
with a known size. For example, multiple default caches exist to
hold slabs of common sizes which are powers of two (32, 64,
128, 256 and so forth).

3. These caches grow or shrink in size as required by the
allocator.

4. At the end of a kmem cache life, it must be destroyed and its
slabs released. The linked list of slabs is implicitly trusted
in this context.

5. The caches can be allocated contiguously, or adjacent to an
actual chain of slabs from another cache. Because the current
kmem_cache structure holds potentially harmful information
(including a pointer to the constructor of the cache), this
could be leveraged in an attack to subvert the execution flow.

6. The debugging facilities of these allocators provide merely
informational value with their error detection mechanisms, which
are also inherently insecure. They are not enabled by default
and have an extremely high performance impact (accounting for a
50 to 70% slowdown). In addition, they leak information which
could be invaluable for a local attacker (ex. fixed known
values).

We are facing multiple issues in this scenario. First, the kernel
developers expect third parties to handle situations like a cache
being destroyed while an object is being allocated. Albeit highly
unusual, such circumstances (like {6}) can arise provided the right
conditions are present.

In order to prevent {5} from being abused, we are left with two
realistic possibilities to deter a potential attack: randomizing
the allocator routines (see ASLR from the PaX documentation in [7] for
the concept) or introducing a guard (known in modern times as a
'cookie') which contains information to validate the integrity of the
kmem_cache structure.

Thus, a decision was made to introduce a guard which works in
'cascade':

    +--------------+
    | global guard |------------------+
    +--------------| kmem_cache guard |------------+
                   +------------------| slab guard | ...
                                      +------------+

The idea is simple: break down every potential path of abuse and add
integrity information to each lower level structure. By deploying a
check which relies on all the upper level guards, we can detect
corruption of the data at any stage. In addition, this makes the safety
checks more resilient against information leaks, since an attacker will
be forced to access and read a wider range of values than one single
cookie. Such data could be out of range to the context of the execution
path being abused.
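
A minimal sketch of the cascade, under several assumptions, is shown
below. The mixing function is a stand-in (it is not the hash KERNHEAP
uses), the g_cache and g_slab structures are hypothetical and reduced
to a couple of fields, and the choice of which fields feed each guard
is ours for the sake of the example.

```c
#include <stdint.h>

/* Stand-in mixing step. This is NOT the hash used by KERNHEAP. */
static uint32_t mix32(uint32_t h, uint32_t v)
{
    h ^= v;
    h *= 0x9e3779b1U;
    h ^= h >> 15;
    return h;
}

/* Fold a pointer into 32 bits so a guard is bound to an address. */
static uint32_t fold_ptr(const void *p)
{
    uint64_t v = (uint64_t)(uintptr_t)p;

    return (uint32_t)(v ^ (v >> 32));
}

struct g_cache {               /* hypothetical, reduced kmem_cache */
    uint32_t guard;
    uint32_t obj_size;
};

struct g_slab {                /* hypothetical, reduced slab */
    uint32_t guard;
    struct g_cache *cache;
    void *s_mem;
};

/* Cache guard: depends on the global guard and the cache itself. */
static uint32_t cache_guard(uint32_t global, const struct g_cache *c)
{
    return mix32(mix32(global, fold_ptr(c)), c->obj_size);
}

/* Slab guard: depends on the cache guard, hence on the global one. */
static uint32_t slab_guard(uint32_t global, const struct g_slab *s)
{
    uint32_t upper = cache_guard(global, s->cache);

    return mix32(mix32(upper, fold_ptr(s)), fold_ptr(s->s_mem));
}

/* Returns 1 when both stored guards match freshly computed values. */
static int slab_verify(uint32_t global, const struct g_slab *s)
{
    return s->cache->guard == cache_guard(global, s->cache) &&
           s->guard == slab_guard(global, s);
}
```

In this model, forging a valid slab guard requires knowing the global
guard as well as the addresses and contents of both structures, so a
single leaked value is not enough.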

The global guard is initialized at the kernheap_init()
function, called from init/main.c during kernel start. In order to
gather entropy for its value, we need to initialize the random32 PRNG
earlier than in a default, upstream kernel. On x86, this is done with
the rdtsc xor'd with the jiffies value, and then seeded multiple times
during different stages of the kernel initialization, ensuring we have
a decent amount of entropy to avoid an easily predictable result.

Unfortunately, an architecture-independent method to seed the PRNG
hasn't been devised yet. Right now this is specific to platforms with a
working get_cycles() implementation (otherwise it falls back to a more
insecure seeding using different counters), though it is intended to
support all architectures where PaX is currently supported.

The slab and kmem_cache structures are defined in mm/slab.c and
mm/slub.c for the SLAB and SLUB allocators, respectively. The kernel
developers have chosen to make their type information static to those
files, and not available in the mm/slab.h header file. Since the
available allocators generally have different internals, they only
export a common API (even though a few functions remain no-ops, for
example in SLOB).

A guard field has been added at the start of the kmem_cache structure,
and other structures might be modified to include a similar field
(depending on the allocator). The approach is to add a guard anywhere
where it can provide balanced performance (including memory footprint)
and security results.

In order to calculate the final checksum used in each kmem_cache and
their slabs, a high performance, yet collision resistant hash function
was required. This instantly left options such as the CRC family, FNV,
etc. out, since they are inefficient for our purposes. Therefore,
Murmur2 was chosen [9]. It's an exceptionally fast, yet simple
algorithm created by Austin Appleby, currently used by libmemcached and
other software.

Custom optimized versions were developed to calculate hashes for the
slab and cache structures, taking advantage of the fact that only a
relatively small set of word values need to be hashed.
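
For reference, the publicly available MurmurHash2 reduces to the
following when the input is a whole number of 32-bit words, since the
byte-wise tail handling disappears. The optimized KERNHEAP variants
are not shown in this section, so treat this as a sketch of the
general idea; the function name and the use of a guard value as the
seed are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define MURMUR2_M 0x5bd1e995U   /* multiplication constant */
#define MURMUR2_R 24            /* shift constant */

/*
 * MurmurHash2 over nwords 32-bit words. Equivalent to hashing the
 * same buffer as nwords * 4 bytes with the reference algorithm on a
 * machine that reads each word natively.
 */
static uint32_t murmur2_words(const uint32_t *words, size_t nwords,
                              uint32_t seed)
{
    uint32_t h = seed ^ (uint32_t)(nwords * 4);
    size_t i;

    for (i = 0; i < nwords; i++) {
        uint32_t k = words[i];

        k *= MURMUR2_M;
        k ^= k >> MURMUR2_R;
        k *= MURMUR2_M;

        h *= MURMUR2_M;
        h ^= k;
    }

    /* Final avalanche so the last words affect every output bit. */
    h ^= h >> 13;
    h *= MURMUR2_M;
    h ^= h >> 15;
    return h;
}
```

Seeding the hash with a value unknown to the attacker is what turns
the plain checksum into a guard; with a known seed the result can be
recomputed by anyone able to read the hashed fields.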

The coverage of the guard checks is obviously limited to the meta-data,
but yields reliable protection for all objects of 1/8 page size and any
adjacent ones, during allocation and release operations. The
copy_from_user() and copy_to_user() functions have been modified to
include a slab and cache integrity check as well, which is orthogonal
to the boundary enforcement modifications explained in another section
of this paper.

The redzone approach used by the SLAB/SLUB/SLQB allocators relies on
fixed known values to detect certain scenarios (explained in the next
subsection). The values are 64 bits long:

#define RED_INACTIVE 0x09F911029D74E35BULL
#define RED_ACTIVE 0xD84156C5635688C0ULL

This is clearly suitable for debugging purposes, but largely
inefficient for security. An immediate improvement would be to generate
these values on runtime, but then it is still possible to avoid writing
over them and still modify the meta-data. This is exactly what is being
prevented by using a checksum guard, which depends on a runtime
generated cookie (at boot time). The examples below show an overwrite
of an object in the kmalloc-64 cache (named `size-64' under SLAB):

slab error in verify_redzone_free(): cache `size-64': memory outside
object was overwritten
Pid: 6643, comm: insmod Not tainted 2.6.29.2-grsec #1
Call Trace:
[<c0889a81>] __slab_error+0x1a/0x1c
[<c088aee9>] cache_free_debugcheck+0x137/0x1f5
[<c088ba14>] kfree+0x9d/0xd2
[<c0802f22>] syscall_call+0x7/0xb
df271338: redzone 1:0xd84156c5635688c0, redzone 2:0x4141414141414141.

Slab corruption: size-64 start=df271398, len=64
Redzone: 0x4141414141414141/0x9f911029d74e35b.
Last user: [<c08d1da5>](free_rb_tree_fname+0x38/0x6f)
000: 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41
010: 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41
020: 41 41 41 41 41 41 41 41 6b 6b 6b 6b 6b 6b 6b 6b
Prev obj: start=df271340, len=64

Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
Last user: [<c08d1e55>](ext3_htree_store_dirent+0x34/0x124)
000: 48 8e 78 08 3b 49 86 3d a8 1f 27 df e0 10 27 df
010: a8 14 27 df 00 00 00 00 62 d3 03 00 0c 01 75 64
Next obj: start=df2713f0, len=64

Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
Last user: [<c08d1da5>](free_rb_tree_fname+0x38/0x6f)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b

The trail of 0x6B bytes can be observed in the output above. This is
the SLAB_POISON feature. Poisoning is the approach that will be
described in the next subsection. It's basically overwriting the object
contents with a known value to detect modifications post-release or
uninitialized usage. The values are defined (like the redzone ones) in
include/linux/poison.h:

#define POISON_INUSE 0x5a
#define POISON_FREE 0x6b
#define POISON_END 0xa5
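
The poisoning scheme can be modelled in a few lines of userland C. The
two helpers below are a simplified sketch written for this paper, not
the mm/slab.c routines: they only assume what the dump above shows, a
run of POISON_FREE bytes closed by a single POISON_END byte.

```c
#include <stddef.h>
#include <string.h>

#define POISON_FREE 0x6b
#define POISON_END  0xa5

/* Fill a released object with the free pattern and mark its last byte. */
static void toy_poison_obj(unsigned char *obj, size_t size)
{
    memset(obj, POISON_FREE, size);
    obj[size - 1] = POISON_END;
}

/* Return 1 if the pattern is intact, 0 if the object was written to
 * after release. */
static int toy_check_poison_obj(const unsigned char *obj, size_t size)
{
    for (size_t i = 0; i + 1 < size; i++)
        if (obj[i] != POISON_FREE)
            return 0;
    return obj[size - 1] == POISON_END;
}
```

Both helpers walk the whole object, which is where the cost of this
approach comes from, and the pattern is known to everyone.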

KERNHEAP performs validation of the cache guards at allocation and
release related functions. This allows detection of corruption in the
chain of guards and results in a system halt and a stack dump.

The safety checks are triggered from kfree() and kmem_cache_free(),
kmem_cache_destroy() and other places. Additional checkpoints are being
considered, since taking a wrong approach could lead to TOCTOU issues,
again depending on the allocator. In SLUB, merging is disabled to avoid
the potentially detrimental effects (to security) of this feature. This
might kill one of the most attractive points of SLUB, but merging comes
at the cost of letting objects be neighbors to other objects which
would have been placed elsewhere out of reach, allowing overflow
conditions to produce likely exploitable conditions. Even with guard
checks in place, this is still a scenario to be avoided.

One additional change, first introduced by PaX, is to change the
address of the ZERO_SIZE_PTR. In mainline kernel, this address points
to 0x00000010. An address reachable in userland is clearly a bad idea
in security terms, and PaX wisely solves this by setting it to
0xfffffc00, and modifying the ZERO_OR_NULL_PTR macro. This protects
against a situation in which kmalloc is called with a zero size (for
example due to an integer overflow in a length parameter) and the
pointer is used to read or write information from or to userland.
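
The effect of moving ZERO_SIZE_PTR can be shown with two pairs of
macros. They are written from memory of the mainline and PaX
definitions and renamed to avoid confusion, so the exact expressions
should be treated as an assumption rather than a copy of either source.

```c
#include <stdint.h>

/* Mainline-style: a low address, reachable through a userland mapping. */
#define MAINLINE_ZERO_SIZE_PTR ((void *)16)
#define MAINLINE_ZERO_OR_NULL_PTR(x) \
    ((uintptr_t)(x) <= (uintptr_t)MAINLINE_ZERO_SIZE_PTR)

/* Hardened-style: -1024, which is 0xfffffc00 on a 32-bit system. */
#define HARDENED_ZERO_SIZE_PTR ((void *)(uintptr_t)-1024L)

/* Subtracting 1 wraps NULL around to the highest address, so a single
 * unsigned comparison covers both NULL and the high sentinel range. */
#define HARDENED_ZERO_OR_NULL_PTR(x) \
    ((uintptr_t)(x) - 1 >= (uintptr_t)HARDENED_ZERO_SIZE_PTR - 1)
```

With the hardened pair, address 16 is no longer special, and a
dereference of the sentinel lands at the very top of the address space
instead of a page that userland may be able to map.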

---[ 3.2 Detection of arbitrary free pointers and freelist corruption

In the history of heap related memory corruption vulnerabilities, a
more obscure class of flaws has long been known, albeit less
publicized: arbitrary pointer and double free issues.

The idea is simple: a programming mistake leads to an exploitable
condition in which the state of the heap allocator can be made
inconsistent when an already freed object is being released again, or
an arbitrary pointer is passed to the free function. This is a strictly
allocator internals-dependent scenario, but generally the goal is to
control a function pointer (for example, a constructor/destructor
function used for object initialization, which is later called) or a
write-n primitive (a single byte, four bytes and so forth).

In practice, these vulnerabilities can pose a true challenge for
exploitation, since thorough knowledge of the allocator and state of
the heap is required. Manipulating the free list (known as the
freelist in the kernel) might cause the state of the heap to be
unstable post-exploitation and thwart cleanup efforts or graceful
returns. In addition, another thread might try to access it or perform
operations (such as an allocation) which yield a page fault.

In an environment with 2.6.29.2 (grsecurity patch applied, full PaX
feature set enabled except for KERNEXEC, RANDKSTACK and UDEREF) and the
SLAB allocator, the following scenarios could be observed:

1. An object is allocated and shortly afterwards, the object is
released via kfree(). Another allocation follows, and a pointer
referencing the previous allocation is passed to kfree(),
therefore the newly allocated object is released instead due to the
LIFO nature of the allocator.

void *a = kmalloc(64, GFP_KERNEL);
foo_t *b = (foo_t *) a;

/* ... */
kfree(a);
a = kmalloc(64, GFP_KERNEL);
/* ... */
kfree(b);

2. An object is allocated, and two successive calls to kfree() take
place with no allocation in-between.

void *a = kmalloc(64, GFP_KERNEL);
foo_t *b = (foo_t *) a;

kfree(a);
kfree(b);

In both cases we are releasing an object twice, but the state of the
allocator changes slightly. Also, there could be more than just a
single allocation in-between (for example, if this condition existed
within filesystem or network stack code) leading to less predictable
results. The more obvious result of the first scenario is corruption of
the freelist, and a potential information leak or arbitrary access to
memory in the second (for instance, if an attacker could force a new
allocation before the incorrectly released object is used, he could
control the information stored there).
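
Why a double free is dangerous can be seen in a toy LIFO allocator.
The model below is an assumption-laden sketch: it keeps the free list
inside the objects themselves (closer to SLUB than to SLAB) and
performs no validation at all. All toy_* names are hypothetical.

```c
#include <stddef.h>

#define TOY_OBJS 4
#define TOY_SIZE 64

/* A free object stores the pointer to the next free object in its
 * first word; an allocated object uses the same space for data. */
union toy_obj {
    union toy_obj *next;
    unsigned char  data[TOY_SIZE];
};

static union toy_obj toy_pool[TOY_OBJS];
static union toy_obj *toy_head;

/* Thread all objects onto the free list. */
static void toy_init(void)
{
    toy_head = NULL;
    for (int i = TOY_OBJS - 1; i >= 0; i--) {
        toy_pool[i].next = toy_head;
        toy_head = &toy_pool[i];
    }
}

/* Pop the most recently released object (LIFO). */
static void *toy_alloc(void)
{
    union toy_obj *o = toy_head;

    if (o)
        toy_head = o->next;
    return o;
}

/* Push without any validation, which is what makes a double free
 * harmful: the object ends up linked to itself. */
static void toy_free(void *p)
{
    union toy_obj *o = p;

    o->next = toy_head;
    toy_head = o;
}
```

After toy_free() is called twice on the same pointer, the next two
calls to toy_alloc() return the same address, so two unrelated users
end up sharing one object.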

The following output can be observed in a system using the SLAB
allocator with its debugging facilities enabled:

slab error in verify_redzone_free(): cache `size-64': double free detected
Pid: 4078, comm: insmod Not tainted 2.6.29.2-grsec #1
Call Trace:
[<c0889a81>] __slab_error+0x1a/0x1c
[<c088aee9>] cache_free_debugcheck+0x137/0x1f5
[<c088ba14>] kfree+0x9d/0xd2
[<c0802f22>] syscall_call+0x7/0xb
df2e42e0: redzone 1:0x9f911029d74e35b, redzone 2:0x9f911029d74e35b.

The debugging facilities of SLAB and SLUB provide a redzone-based
approach to detect the first scenario, but introduce a performance
impact while being useless security-wise, since the system won't halt
and the state of the allocator will be left unstable. Therefore, their
value is only informational and useful for debugging purposes, not as a
security measure. The redzone values are also static.

The other approach taken by the debugging facilities is poisoning, as
mentioned in the previous subsection. An object is 'poisoned' with a
value, which can be checked at different places to detect if the object
is being used uninitialized or post-release. This rudimentary but
effective method is implemented upstream in a manner which makes it
inefficient for security purposes.

Currently, upstream poisoning is clearly oriented to debugging. It
writes a single-byte pattern in the whole object space, marking the end
with a known value. This incurs a significant performance impact.

KERNHEAP performs the following safety checks at the time of this
writing:

1. During cache destruction:

a) The guard value is verified.

b) The entire cache is walked, verifying the freelists for
potential corruption. Reference counters, guards, validity of
pointers and other structures are checked. If any mismatch is
found, a system halt ensues.

c) The pointer to the cache itself is changed to ZERO_SIZE_PTR.
This should not affect any well behaving (that is, not broken)
kernel code.

2. After successful kfree, a word value is written to the memory
and pointer location is changed to ZERO_SIZE_PTR. This will
trigger a distinctive page fault if the pointer is accessed
again somewhere. Currently this operation could be invasive for
drivers or code with dubious coding practices.

3. During allocation, if the word value at the start of the
to-be-returned object doesn't match our post-free value, a
system halt ensues.
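
Checks (2) and (3) can be approximated as follows. This is only a
sketch of the idea, assuming a single marker word generated at runtime;
the kh_* names and the fixed marker value are made up for the example,
and the check returns a status where KERNHEAP would halt the system.

```c
#include <stdint.h>
#include <string.h>

/* Runtime-generated post-free value (hypothetical; fixed for the demo). */
static uintptr_t kh_free_marker = (uintptr_t)0x9e3779b9u;

/* On release: stamp the first word of the object with the marker. */
static void kh_mark_freed(void *obj)
{
    memcpy(obj, &kh_free_marker, sizeof(kh_free_marker));
}

/* On allocation: 1 if the word is untouched since release, 0 if it was
 * modified in the meantime. */
static int kh_check_freed(const void *obj)
{
    uintptr_t w;

    memcpy(&w, obj, sizeof(w));
    return w == kh_free_marker;
}
```

Compared with poisoning the whole object, a single word costs one
store on release and one comparison on allocation.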

The object-level guard values (equivalent to the redzoning) are
calculated at runtime. This deters bypassing of the checks via fake
objects, resulting from a slab overflow scenario. It does introduce a
low performance impact on setup and verification, minimized by the use
of inline functions, instead of external definitions like those used
for some of the more general cache checks.

The effectiveness of the reference counter checks is orthogonal
to the deployment of PaX's REFCOUNT, which protects many object
reference counters against overflows (including SLAB/SLUB).

Safe unlinking is enforced in all LIST_HEAD based linked lists, which
obviously includes the partial/empty/full lists for SLAB and several
other structures (including the freelists) in other allocators. If a
corrupted entry is being unlinked, a system halt is forced. The values
used for list pointer poisoning have been changed to point to
non-userland-reachable addresses (this change has been taken from PaX).
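
A minimal userland sketch of safe unlinking on a circular doubly linked
list follows. The toy_list names are hypothetical and the function
returns an error where the text above forces a system halt.

```c
#include <stddef.h>

struct toy_list {
    struct toy_list *next, *prev;
};

/* An empty list is a head pointing at itself. */
static void toy_list_init(struct toy_list *h)
{
    h->next = h->prev = h;
}

/* Insert n right after the head h. */
static void toy_list_add(struct toy_list *n, struct toy_list *h)
{
    n->next = h->next;
    n->prev = h;
    h->next->prev = n;
    h->next = n;
}

/* Unlink only if both neighbours still point back at the entry.
 * Returns 0 on success, -1 if the entry looks corrupted. */
static int toy_list_del_checked(struct toy_list *e)
{
    if (e->prev->next != e || e->next->prev != e)
        return -1;
    e->next->prev = e->prev;
    e->prev->next = e->next;
    e->next = e->prev = NULL;
    return 0;
}
```

The check defeats the classic unlink write primitive, since a forged
next or prev pointer must reference memory that already points back at
the entry being removed.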

The use-after-free and double-free detection mechanisms in KERNHEAP are
still under development, and it's very likely that substantial design
changes will occur after the release of this paper.

---[ 3.3 Overview of NetBSD and OpenBSD kernel heap safety checks

At the moment KERNHEAP exclusively covers the Linux kernel, but it is
interesting to observe the approaches taken by other projects to detect
kernel heap integrity issues. In this section we will briefly analyze
the NetBSD and OpenBSD kernels, which are largely the same code base
with regard to the kernel malloc implementation and diagnostic checks.

Both currently implement rudimentary but effective measures to detect
use-after-free and double-free scenarios, albeit these are only enabled as
part of the DIAGNOSTIC and DEBUG configurations.

The following source code is taken from NetBSD 4.0 and should be almost
identical to OpenBSD. Their approach to detect use-after-free relies on
copying a known 32-bit value (WEIRD_ADDR, from kern/kern_malloc.c):

/*
* The WEIRD_ADDR is used as known text to copy into free objects so
* that modifications after frees can be detected.
*/
#define WEIRD_ADDR ((uint32_t) 0xdeadbeef)
...

void *malloc(unsigned long size, struct malloc_type *ksp, int flags)
...
{
...
#ifdef DIAGNOSTIC
/*
* Copy in known text to detect modification
* after freeing.
*/
end = (uint32_t *)&cp[copysize];
for (lp = (uint32_t *)cp; lp < end; lp++)
*lp = WEIRD_ADDR;
freep->type = M_FREE;
#endif /* DIAGNOSTIC */

The following checks are the counterparts in free(), which call panic() when
the checks fail, causing a system halt (this obviously has a better security
benefit than just the information approach taken by Linux's SLAB
diagnostics):

#ifdef DIAGNOSTIC
...
if (__predict_false(freep->spare0 == WEIRD_ADDR)) {
for (cp = kbp->kb_next; cp;
cp = ((struct freelist *)cp)->next) {
if (addr != cp)
continue;
printf("multiply freed item %p\n", addr);
panic("free: duplicated free");
}
}
...
copysize = size < MAX_COPY ? size : MAX_COPY;
end = (int32_t *)&((caddr_t)addr)[copysize];
for (lp = (int32_t *)addr; lp < end; lp++)
*lp = WEIRD_ADDR;
freep->type = ksp;
#endif /* DIAGNOSTIC */

Once the object is released, the 32-bit value is copied, along with the
type information to detect the potential origin of the problem. This should be
enough to catch basic forms of freelist corruption.

It's worth noting that the freelist_sanitycheck() function provides
integrity checking for the freelist, but is enclosed in an ifdef 0 block.

The problem affecting these diagnostic checks is the use of known values:
much like Linux's own SLAB redzoning and poisoning, they might be easily
bypassed in a deliberate attack scenario. They still remain slightly more
effective due to the system halt enforced upon detection, which isn't
present in Linux.

Other sanity checks are done with the reference counters in free():

if (ksp->ks_inuse == 0)
panic("free 1: inuse 0, probable double free");

And validating (with a simple address range test) if the pointer being
freed looks sane:

if (__predict_false((vaddr_t)addr < vm_map_min(kmem_map) ||
(vaddr_t)addr >= vm_map_max(kmem_map)))
panic("free: addr %p not within kmem_map", addr);

Ultimately, users of either NetBSD or OpenBSD might want to enable
KMEMSTATS or DIAGNOSTIC configurations to provide basic protection against
heap corruption in those systems.

---[ 3.4 Microsoft Windows 7 kernel pool allocator safe unlinking

On 26 May 2009, a suspiciously timed article was published by Peter
Beck from the Microsoft Security Engineering Center (MSEC) Security
Science team, about the inclusion of safe unlinking into the Windows 7
kernel pool (the equivalent to the slab allocators in Linux).

This has received a great deal of publicity for a change which amounts
to two lines of effective code, and surprisingly enough, was already
present in non-retail versions of Vista. In addition, safe unlinking
has been present in other heap allocators for a long time: in the GNU
libc since at least 2.3.5 (proposed by Stefan Esser originally to Solar
Designer for the Owl libc) and the Linux kernel since 2006
(CONFIG_DEBUG_LIST).

While it is out of scope for this paper to explain the internals of the
Windows kernel pool allocator, this section will provide a short
overview of it. For true insight the slides by Kostya Kortchinsky,
"Exploiting Kernel Pool Overflows" [14], can provide a thorough look at
it from a sound security perspective.

The allocator is very similar to SLAB and the API to obtain allocations
and release them is straightforward (nt!ExAllocatePool(WithTag),
nt!ExFreePool(WithTag) and so forth). The default pools (sort of a
kmem_cache equivalent) are the (two) paged, non-paged and session paged
ones. Non-paged for physical memory allocations and paged for pageable
memory. The structure defining a pool can be seen below:

kd> dt nt!_POOL_DESCRIPTOR
+0x000 PoolType : _POOL_TYPE
+0x004 PoolIndex : Uint4B
+0x008 RunningAllocs : Uint4B
+0x00c RunningDeAllocs : Uint4B
+0x010 TotalPages : Uint4B
+0x014 TotalBigPages : Uint4B
+0x018 Threshold : Uint4B
+0x01c LockAddress : Ptr32 Void
+0x020 PendingFrees : Ptr32 Void
+0x024 PendingFreeDepth : Int4B
+0x028 ListHeads : [512] _LIST_ENTRY

The most important member in the structure is ListHeads, which contains
512 linked lists, to hold the free chunks. The granularity of
the allocator is 8 bytes for Windows XP and up, and 32 bytes for
Windows 2000. The maximum allocation size possible is 4080 bytes.
LIST_ENTRY is exactly the same as LIST_HEAD in Linux.

Each chunk contains an 8 byte header. The chunk header is defined as
follows for Windows XP and up:

kd> dt nt!_POOL_HEADER
+0x000 PreviousSize : Pos 0, 9 Bits
+0x000 PoolIndex : Pos 9, 7 Bits
+0x002 BlockSize : Pos 0, 9 Bits
+0x002 PoolType : Pos 9, 7 Bits
+0x000 Ulong1 : Uint4B
+0x004 ProcessBilled : Ptr32 _EPROCESS
+0x004 PoolTag : Uint4B
+0x004 AllocatorBackTraceIndex : Uint2B
+0x006 PoolTagHash : Uint2B

The PreviousSize contains the value of the BlockSize of the previous
chunk, or zero if it's the first. This value could be checked during
unlinking for additional safety, but this isn't the case (their checks
are limited to validity of prev/next pointers relative to the entry
being deleted). PoolType is zero if free, and PoolTag contains four
printable characters to identify the user of the allocation. This isn't
authenticated nor verified in any way, therefore it is possible to
provide a bogus tag to one of the allocation or free APIs.
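
The numbers above fit together, as a small calculation shows. The
sketch below assumes that BlockSize counts 8-byte units with the 8-byte
header included, and that the same value selects the ListHeads entry;
neither assumption is taken from Microsoft documentation, and the
pool_* names are hypothetical.

```c
#include <stddef.h>

#define POOL_GRANULARITY 8u     /* bytes, Windows XP and up */
#define POOL_HEADER_SIZE 8u     /* bytes, size of the chunk header */
#define POOL_MAX_REQUEST 4080u  /* largest request served by the pool */

/* Number of granularity units needed for a request, header included.
 * Returns 0 when the request is too large for the pool lists. */
static unsigned pool_block_size(size_t bytes)
{
    if (bytes > POOL_MAX_REQUEST)
        return 0;
    return (unsigned)((bytes + POOL_HEADER_SIZE + POOL_GRANULARITY - 1)
                      / POOL_GRANULARITY);
}
```

Under these assumptions the largest request, 4080 bytes, maps to 511,
which is the highest value a 9-bit BlockSize field can hold and the
last of the 512 list heads.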

For small allocations, the pool allocator uses lookaside caches, with a
maximum BlockSize of 256 bytes.

Kostya's approach to abuse pool allocator overflows involves the
classic write-4 primitive through unlinking of a fake chunk under his
control. For the rest of information about the allocator internals,
please refer to his excellent slides [14].

The minimal change introduced by Microsoft to enable safe unlinking in
Windows 7 was already present in Vista non-retail builds, thus it is
likely that the announcement was merely a marketing exercise.
Furthermore, Beck states that this allows detecting "memory corruption
at the earliest opportunity", which isn't necessarily correct, since a
more complete solution was possible (for example, verifying that
pointers belong to actual freelist chunks). Such checks might incur a
higher performance overhead, but provide far more consistent
protection.

The affected API is RemoveEntryList(), and the result of unlinking an
entry with incorrect prev/next pointers will be a BugCheck:

Flink = Entry->Flink;
Blink = Entry->Blink;
if (Flink->Blink != Entry) KeBugCheckEx(...);
if (Blink->Flink != Entry) KeBugCheckEx(...);

It's unlikely that there will be further changes to the pool allocator
for Windows 7, but there's still time for this to change before release
date.

------[ 4. Sanitizing memory of the look-aside caches

The objects and data contained in slabs allocated within the kmem
caches could be of sensitive nature, including but not limited to:
cryptographic secrets, PRNG state information, network information,
userland credentials and potentially useful internal kernel state
information to leverage an attack (including our guards or cookie
values).

In addition, neither kfree() nor kmalloc() zero memory, thus allowing
the information to stay there for an indefinite time, unless they are
overwritten after the space is claimed in an allocation procedure. This
is a security risk by itself, since an attacker could essentially rely
on this condition to "spray" the kernel heap with his own fake
structures or machine instructions to further improve the reliability
of his attack.

PaX already provides a feature to sanitize memory upon release, at a
performance cost of roughly 3%. This is an opt-all policy, thus it
is not possible to choose in a fine-grained manner what memory is
sanitized and what isn't. Also, it works at the lowest level possible,
the page allocator. While this is a safe approach and ensures that all
allocated memory is properly sanitized, it is desirable to be able to
opt-in voluntarily to have your newly allocated memory treated as
sensitive.

Hence, a GFP_SENSITIVE flag has been introduced. While a security
conscious developer could zero memory on his own, the availability of a
flag to assure this behavior (as well as other enhancements and safety
checks) is convenient. Also, the performance cost is negligible, if
any, since the flag could be applied to specific allocations or caches
altogether.

The low level page allocator uses a PG_sensitive flag internally, with
the associated SetPageSensitive, ClearPageSensitive and PageSensitive
macros. These changes have been introduced in the linux/page-flags.h
header and mm/page_alloc.c.

SLAB / kmalloc layer Low-level page allocator
include/linux/slab.h include/linux/page-flags.h

+----------------. +--------------+
| SLAB_SENSITIVE | ->| PG_sensitive |
+----------------. | +--------------+
| | |-> SetPageSensitive
| +---------------+ | |-> ClearPageSensitive
\---> | GFP_SENSITIVE |-/ |-> PageSensitive
+---------------+ ...

This will prevent the aforementioned leak of information post-release,
and provide an easy to use mechanism for third-party developers to take
advantage of the additional assurance provided by this feature.

In addition, another loophole that has been removed is related to
situations in which successive allocations are done via kmalloc(), and
the information is still accessible through the newly allocated object.
This happens when the slab is never released back to the page
allocator, since slabs can live for an indefinite amount of time
(there's no assurance as to when the cache will go through shrinkage or
reaping). Upon release, the cache can be checked for the SLAB_SENSITIVE
flag, the page can be checked for the PG_sensitive bit, and the
allocation flags can be checked for GFP_SENSITIVE.
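
The release-time decision can be sketched in userland as follows. The
flag values, the sens_* names and the structure are placeholders for
illustration; the page-level PG_sensitive check is left out because it
has no userland equivalent.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical flag values; the real bits are defined by the patch. */
#define TOY_SLAB_SENSITIVE 0x1u
#define TOY_GFP_SENSITIVE  0x2u

struct sens_cache {
    unsigned flags;     /* cache-wide flags, e.g. TOY_SLAB_SENSITIVE */
    size_t   obj_size;  /* size of each object in the cache */
};

/* Zero the object before it goes back to the cache if either the cache
 * or the allocation was marked sensitive. A real implementation must
 * also make sure the compiler cannot optimize the clearing away. */
static void sens_release(const struct sens_cache *c, void *obj,
                         unsigned gfp)
{
    if ((c->flags & TOY_SLAB_SENSITIVE) || (gfp & TOY_GFP_SENSITIVE))
        memset(obj, 0, c->obj_size);
    /* ...the object would be handed back to the cache here... */
}
```

Clearing at this point means the next kmalloc() user of the same slot
cannot read the previous contents, even if the slab never returns to
the page allocator.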

Currently, the following interfaces have been modified to operate with
this flag when appropriate:

- IPC kmem cache
- Cryptographic subsystem (CryptoAPI)
- TTY buffer and auditing API
- WEP encryption and decryption in mac80211 (key storage only)
- AF_KEY sockets implementation
- Audit subsystem

The RBAC engine in grsecurity can be modified to add support for
enabling the sensitive memory flag per-process. Also, a group id based
check could be added, configurable via sysctl. This will allow
fine-grained policy or group based deployment of the current and future
benefits of this flag. SELinux and any other policy based security
frameworks could benefit from this feature as well.

This patchset has been proposed to the mainline kernel developers as of
May 21st 2009 (see http://patchwork.kernel.org/patch/25062). It
received feedback from Alan Cox and Rik van Riel and a different
approach was used after some developers objected to the use of a page
flag, since the functionality can be provided to SLAB/SLUB allocators
and the VMA interfaces without the use of a page flag. Also, the naming
changed to CONFIDENTIAL, to avoid confusion with the term 'sensitive'.

Unfortunately, without a page bit, it's impossible to track down what
pages shall be sanitized upon release, and provide fine-grained control
over these operations, making the gfp flag almost useless, as well as
other interesting features, like sanitizing pages locked via mlock().
The mainline kernel developers oppose the introduction of a new page
flag, even though SLUB and SLOB introduced their own flags when they
were merged, and this wasn't frowned upon in such cases. Hopefully this
will change in the future, and allow a more complete approach to be
merged in mainline at some point.

Despite the fact that Ingo Molnar, Pekka Enberg and Peter Zijlstra
completely missed the point about the initially proposed patches,
new ones performing selective sanitization were sent following up their
recommendations of a completely flawed approach. This case serves as a
good example of how kernel developers without security knowledge or
experience make decisions that negatively impact conscious users of the
Linux kernel as a whole.

Hopefully, in order to provide a reliable protection, the upstream
approach will finally be selective sanitization using kzfree(),
allowing us to redefine it to kfree() in the appropriate header file,
and use something that actually works. Fixing a broken implementation
is an undesirable burden often found when dealing with the 2.6 branch
of the kernel, as usual.

------[ 5. Deterrence of IPC based kmalloc() overflow exploitation

In addition to the rest of the features which provide a generic
protection against common scenarios of kernel heap corruption, a
modification has been introduced to deter a specific local attack for
abusing kmalloc() overflows successfully. This technique is currently
the only public approach to kernel heap buffer overflow exploitation
and relies on the following circumstances:

1. The attacker has local access to the system and can use the IPC
subsystem, more specifically, create, destroy and perform
operations on semaphores.

2. The attacker is able to abuse an allocate-overflow-free situation
which can be leveraged to overwrite adjacent objects, also
allocated via kmalloc() within the same kmem cache.

3. The attacker can trigger the overflow in the right timing to
ensure that the adjacent object overwritten is under his
control. In this case, the shmid_kernel structure (used
internally within the IPC subsystem), leading to a userland
pointer dereference, pointing at attacker controlled structures.

4. Ultimately, when these attacker controlled structures are used
by the IPC subsystem, a function pointer is called. Since the
attacker controls this information, this is essentially a
game-over scenario. The kernel will execute arbitrary code of
the attacker's choice and this will lead to elevation of
privileges.

Currently, PaX UDEREF [8] on x86 provides solid protection against
(3) and (4). The attacker will be unable to force the kernel into
executing instructions located in the userland address space. A
specific class of vulnerabilities, kernel NULL pointer dereferences
(which were, for a long time, overlooked and not considered exploitable
by most of the public players in the security community, with few
exceptions) were mostly eradicated (thanks to both UDEREF and further
restrictions imposed on mmap(), later implemented by Red Hat and
accepted into mainline, albeit containing flaws which made the
restriction effectively useless).

On systems where using UDEREF is unbearable for performance or
functionality reasons (for example, virtualization), a workaround to
harden the IPC subsystem was necessary. Hence, a set of simple safety
checks were devised for the shmid_kernel structure, and the allocation
helper functions have been modified to use their own private cache.

The function pointer verification checks if the pointers located within
the file structure, are actually addresses within the kernel text range
(including modules).
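
The range test itself is simple. The sketch below assumes a table of
executable regions (core kernel text plus one entry per module); the
structure and function names are hypothetical and the addresses in the
usage note are merely borrowed from the traces shown in this paper.

```c
#include <stdint.h>

/* Hypothetical description of one executable region: [start, end). */
struct text_range {
    uintptr_t start;
    uintptr_t end;
};

/* Return 1 if addr falls inside any known text region, 0 otherwise. */
static int addr_in_text(uintptr_t addr, const struct text_range *r, int n)
{
    for (int i = 0; i < n; i++)
        if (addr >= r[i].start && addr < r[i].end)
            return 1;
    return 0;
}
```

With regions such as 0xc0800000-0xc0c00000 (core text) and
0xe0800000-0xe0900000 (a module), a pointer like 0xc088ba14 passes,
while a userland address such as 0x08048000 is rejected before it is
ever called.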

The internal allocation procedures of the IPC code make use of both
vmalloc() and kmalloc(), for sizes greater than a page or lower than a
page, respectively. Thus, the size for the cache objects is PAGE_SIZE,
which might be suboptimal in terms of memory space, but does not impact
performance. These changes have been tested using the IBM ipc_stress
test suite distributed in the Linux Test Project sources, with
successful results (can be obtained from http://ltp.sourceforge.net).

------[ 6. Prevention of copy_to_user() and copy_from_user() abuse

A vast amount of kernel vulnerabilities involving information leaks to
userland, as well as buffer overflows when copying data from userland,
are caused by signedness issues (meaning integer overflows, reference
counter overflows, et cetera). The common scenario is an invalid
integer passed to the copy_to_user() or copy_from_user() functions.

During the development of KERNHEAP, a question was raised about these
functions: Is there an existing, reliable API which allows retrieval of
the target buffer information in both copy-to and copy-from scenarios?

Introducing size awareness in these functions would provide a simple,
yet effective method to deter both information leaks and buffer
overflows through them. Obviously, like in every security system, the
effectiveness of this approach is orthogonal to the deployment of other
measures, to prevent potential corner cases and rare situations useful
for an attacker to bypass the safety checks.

The current kernel heap allocators (including SLOB) provide a function
to retrieve the size of a slab object, as well as testing the validity
of a pointer to see if it's within the known caches (excluding SLOB
which required this function to be written since it's essentially a
no-op in upstream sources). These functions are ksize() and
kmem_validate_ptr() respectively (in each pertinent allocator source:
mm/slab.c, mm/slub.c and mm/slob.c).

In order to detect whether a buffer is stack or heap based in the
kernel, the object_is_on_stack() function (from include/linux/sched.h)
can be used. The drawback of these functions is the computational cost
of looking up the page where this buffer is located, checking its
validity wherever applicable (in the case of kmem_validate_ptr() this
involves validating against a known cache) and performing other tasks
to determine the validity and properties of the buffer. Nonetheless,
the performance impact might be negligible and reasonable for the
additional assurance provided with these changes.
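
A size-aware copy can be modelled in userland as shown below. The
descriptor stands in for what ksize() and the pointer validation helper
would report about a slab object; it and the function name are
assumptions for this example, not the PaX or grsecurity code.

```c
#include <stddef.h>
#include <string.h>

/* Toy description of a heap object: where it starts and how big it is. */
struct toy_heap_obj {
    unsigned char *base;
    size_t         size;
};

/* Copy n bytes starting at src, refusing any request that would leave
 * the object. Returns 0 on success, -1 on a boundary violation. */
static int checked_copy_out(void *dst, const unsigned char *src, size_t n,
                            const struct toy_heap_obj *o)
{
    if (src < o->base || src > o->base + o->size)
        return -1;
    if (n > (size_t)(o->base + o->size - src))
        return -1;
    memcpy(dst, src, n);
    return 0;
}
```

A negative length that was converted to an unsigned type shows up here
as a huge n and is rejected by the second test, which is the typical
signedness scenario described above.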

Brad Spengler devised this idea, developed and introduced the checks
into the latest test patches as of April 27th (test10 to test11 from
PaX and the grsecurity counterparts for the current kernel stable
release, 2.6.29.1).

A reliable method to detect stack-based objects is still being
considered for implementation, and might require access to meta-data
used for debuggers or future GCC built-ins.

------[ 7. Prevention of vsyscall overwrites on x86_64

This technique is used in sgrakkyu's exploit for CVE-2009-0065. It
involves overwriting an x86_64 specific location within a top memory
allocated page, containing the vsyscall mapping. This mapping is used
to implement a high performance entry point for the gettimeofday()
system call, and other functionality.

An attacker can target this mapping by means of an arbitrary write-N
primitive and overwrite the machine instructions there to produce a
reliable return vector, for both remote and local attacks. For remote
attacks the attacker will likely use an offset-aware approach for
reliability, but locally it can be used to execute an offset-less
attack, and force the kernel into dereferencing userland memory. This
is problematic since presently PaX does not support UDEREF on x86_64
and the performance cost of its implementation could be significant,
making abuse a safe bet even against hardened environments.

Therefore, contrary to past popular belief, x86_64 systems are more
exposed than i386 in this regard.

During conversations with the PaX Team, some difficulties came to
attention regarding potential approaches to deter this technique:

1. Modifying the location of the vsyscall mapping will break
compatibility. Thus, glibc and other userland software would
require further changes. See arch/x86/kernel/vmlinux_64.lds.S
and arch/x86/kernel/vsyscall_64.c

2. The vsyscall page is defined within the ld linker script for
x86_64 (arch/x86/kernel/vmlinux_64.lds.S). It is defined by
default (as of 2.6.29.3) within the boundaries of the .data
section, thus writable for the kernel. The userland mapping
is read-execute only.

3. Removing vsyscall support might have a large performance impact
on applications making extensive use of gettimeofday().

4. Some data has to be written in this region, therefore it can't
be permanently read-only.

PaX provides a write-protect mechanism used by KERNEXEC, together with
its definition for an actual working read-only .rodata implementation.
Moving the vsyscall within the .rodata section provides reliable
protection against this technique. In order to prevent sections from
overlapping, some changes had to be introduced, since the section has
to be aligned to page size. In non-PaX kernels, .rodata is only
protected if the CONFIG_DEBUG_RODATA option is enabled.

The PaX Team solved (4) using pax_open_kernel() and pax_close_kernel()
to allow writes temporarily. This has some performance impact but is
most likely far lower than removing vsyscall support completely.
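
The open/write/close pattern can be mirrored in userland with
mprotect(). This is only an analogy under stated assumptions: the
kernel mechanism is different, and the ro_* names are invented for the
example. It shows data that stays read-only except during a short,
explicit update window.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

static long  *ro_page;  /* one page of normally read-only data */
static size_t ro_len;

/* Map one page, store the initial value, then drop write permission. */
static int ro_init(long initial)
{
    ro_len = (size_t)sysconf(_SC_PAGESIZE);
    ro_page = mmap(NULL, ro_len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (ro_page == MAP_FAILED)
        return -1;
    ro_page[0] = initial;
    return mprotect(ro_page, ro_len, PROT_READ);
}

/* Temporarily allow writes, update the value, and lock the page again. */
static int ro_update(long value)
{
    if (mprotect(ro_page, ro_len, PROT_READ | PROT_WRITE) != 0)
        return -1;                                /* "open" */
    ro_page[0] = value;
    return mprotect(ro_page, ro_len, PROT_READ);  /* "close" */
}
```

Any write outside ro_update() faults, so an arbitrary write primitive
can no longer modify the data at a time of the attacker's choosing.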

This deters abuse of the vsyscall page on x86_64, and prevents
offset-based remote and offset-less local exploits from leveraging a
reliable attack against a kernel vulnerability. Nonetheless, protection
against this avenue of attack is still work in progress.

------[ 8. Developing the right regression testsuite for KERNHEAP

Shortly after the initial development process started, it became
evident that a decent set of regression tests was required to check if
the implementation worked as expected. While using single loadable
modules for each test was a straightforward solution, in the long term,
having a real tool to perform thorough testing seemed the most logical
approach.

Hence, KHTEST has been developed. It's composed of a kernel module
which communicates with a userland Python program over Netlink sockets.
The ctypes API is used to handle the low level structures that define
commands and replies. The kernel module exposes internal APIs to the
userland process, such as:

- kmalloc
- kfree
- memset and memcpy
- copy_to_user and copy_from_user

Using this interface, allocation and release of kernel memory can be
controlled with a simple Python script, allowing efficient development
of testcases:

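e = KernHeapTester()
size = 64                 # matches the kmalloc(64, ...) call logged below
addr = e.kmalloc(size)
e.kfree(addr)
e.kfree(addr)             # second release of the same object: double free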

When this test runs on an unprotected 2.6.29.2 system (SLAB as
allocator, debugging capabilities enabled) the following output can be
observed in the kernel message buffer, with a subsequent BUG on cache
reaping:

KERNHEAP test-suite loaded.
run_cmd_kmalloc: kmalloc(64, 000000b0) returned 0xDF1BEC30
run_cmd_kfree: kfree(0xDF1BEC30)
run_cmd_kfree: kfree(0xDF1BEC30)
slab error in verify_redzone_free(): cache `size-64': double free detected
Pid: 3726, comm: python Not tainted 2.6.29.2-grsec #1
Call Trace:
[<c0889a81>] __slab_error+0x1a/0x1c
[<c088aee9>] cache_free_debugcheck+0x137/0x1f5
[<e082f25c>] ? run_cmd_kfree+0x1e/0x23 [kernheap_test]
[<c088ba14>] kfree+0x9d/0xd2
[<e082f25c>] run_cmd_kfree+0x1e/0x23

kernel BUG at mm/slab.c:2720!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/kernel/uevent_seqnum
Pid: 10, comm: events/0 Not tainted (2.6.29.2-grsec #1) VMware Virtual Platform
EIP: 0060:[<c088ac00>] EFLAGS: 00010092 CPU: 0
EIP is at slab_put_obj+0x59/0x75
EAX: 0000004f EBX: df1be000 ECX: c0828819 EDX: c197c000
ESI: 00000021 EDI: df1bec28 EBP: dfb3deb8 ESP: dfb3de9c
DS: 0068 ES: 0068 FS: 00d8 GS: 0000 SS: 0068
Process events/0 (pid: 10, ti=dfb3c000 task=dfb3ae30 task.ti=dfb3c000)
Stack:
c0bc24ee c0bc1fd7 df1bec28 df800040 df1be000 df8065e8 df800040 dfb3dee0
c088b42d 00000000 df1bec28 00000000 00000001 df809db4 df809db4 00000001
df809d80 dfb3df00 c088be34 00000000 df8065e8 df800040 df8065e8 df800040
Call Trace:
[<c088b42d>] ? free_block+0x98/0x103
[<c088be34>] ? drain_array+0x85/0xad
[<c088beba>] ? cache_reap+0x5e/0xfe
[<c083586a>] ? run_workqueue+0xc4/0x18c
[<c088be5c>] ? cache_reap+0x0/0xfe
[<c0838593>] ? kthread+0x0/0x59
[<c0803717>] ? kernel_thread_helper+0x7/0x10
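
Why an unchecked double free is dangerous can be shown with a toy
model. The sketch below is not the SLAB code: ToyCache is a made-up
class that keeps a LIFO list of free object addresses, which is enough
to show the same object being handed out twice, and how tracking live
objects rejects the second release.

```python
class ToyCache:
    """LIFO free list over a fixed set of object addresses (toy model)."""

    def __init__(self, base, obj_size, count):
        self.free = [base + i * obj_size for i in range(count)]
        self.live = set()

    def alloc(self):
        addr = self.free.pop()
        self.live.add(addr)
        return addr

    def free_unchecked(self, addr):
        # No validation: the address is queued even if already free.
        self.live.discard(addr)
        self.free.append(addr)

    def free_checked(self, addr):
        # Reject releases of objects that are not currently allocated.
        if addr not in self.live:
            raise RuntimeError("invalid free of %#x" % addr)
        self.live.remove(addr)
        self.free.append(addr)


cache = ToyCache(base=0x1000, obj_size=64, count=4)
a = cache.alloc()
cache.free_unchecked(a)
cache.free_unchecked(a)      # the double free goes unnoticed
first = cache.alloc()
second = cache.alloc()
aliased = first == second    # two owners of one object
```

In the toy model the corruption stays silent until two callers write to
the same object. A real allocator fails later and elsewhere, as the
cache_reap() trace above illustrates.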

The following code presents a more complex test to evaluate a
double-free situation which will put a random kmalloc cache into an
unpredictable state:

import os
import random

e = KernHeapTester()
addrs = []
kmalloc_sizes = [ 32, 64, 96, 128, 196, 256, 1024, 2048, 4096]

# Allocate 1024 objects of randomly chosen sizes.
i = 0
while i < 1024:
    addr = e.kmalloc(random.choice(kmalloc_sizes))
    addrs.append(addr)
    i += 1

random.seed(os.urandom(32))
random.shuffle(addrs)
# Free one random object early; the loop below frees it a second time.
e.kfree(random.choice(addrs))
random.shuffle(addrs)

for addr in addrs:
    e.kfree(addr)

On a KERNHEAP protected host:

Kernel panic - not syncing: KERNHEAP: Invalid kfree() in (objp
df38e000) by python:3643, UID:0 EUID:0

The testsuite sources (including both the Python module and the LKM for
the 2.6 series, tested with 2.6.29) are included along with this paper.
Adding support for new kernel APIs should be a trivial task, requiring
only modification of the packet handler and the appropriate addition of
a new command structure. Potential improvements include the use of a
shared memory page instead of Netlink responses, to avoid impacting the
allocator state or conflicting with our tests.

------[ 9. The Inevitability of Failure

In 1998, members (Loscocco, Smalley et al.) of the Information Assurance
Group at the NSA published a paper titled "The Inevitability of Failure:
The Flawed Assumption of Security in Modern Computing Environments"
[12].

The paper explains how modern computing systems lacked the necessary
features and capabilities for providing true assurance, to prevent
compromise of the information contained in them. As systems were
becoming more and more connected to networks, which were growing
exponentially, the exposure of these systems grew proportionally.
Therefore, the state of the art in security had to progress at a
similar pace.

From an academic standpoint, it is interesting to observe that more
than 10 years later, the state of the art in security hasn't evolved
dramatically, but threats have gone well beyond the initial
expectations.

"Although public awareness of the need for security
in computing systems is growing rapidly, current
efforts to provide security are unlikely to succeed.
Current security efforts suffer from the flawed
assumption that adequate security can be provided in
applications with the existing security mechanisms of
mainstream operating systems. In reality, the need for
secure operating systems is growing in today's computing
environment due to substantial increases in
connectivity and data sharing." Page 1, [12]

Most of the authors of this paper were involved in the development of
the Flux Advanced Security Kernel (FLASK), at the University of Utah.
Flask itself has its roots in the Distributed Trusted Operating System
(DTOS), an original joint project carried out in 1992 and 1993 by the
National Security Agency and what was then known as Secure Computing
Corporation (SCC) (acquired by McAfee in 2008). DTOS inherited the
development and design ideas of a previous project named DTMach
(Distributed Trusted Mach) which aimed to introduce a flexible access
control framework into the GNU Mach microkernel. Type Enforcement was
first introduced in DTMach, and superseded in Flask by a more flexible
design which allowed far greater granularity (supporting mixing of
different types of labels, beyond only types, such as sensitivity,
roles and domains).

Type Enforcement is a simple concept: a Mandatory Access Control (MAC)
takes precedence over a Discretionary Access Control (DAC) to contain
subjects (processes, users) from accessing or manipulating objects
(files, sockets, directories), based on the decision made by the
security system upon a policy and subject's attached security context.
A subject can undergo a transition from one security context to another
(for example, due to role change) if it's explicitly allowed by the
policy. This design allows fine-grained, albeit complex, decision
making.

Essentially, MAC means that everything is forbidden unless explicitly
allowed by a policy. Moreover, the MAC framework is fully integrated
into the system internals in order to catch every possible data access
situation and store state information.
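
The default-deny and explicit-transition ideas can be condensed into a
few lines. The following is a toy model only, not the SELinux policy
engine: the rule tables, the allowed() and transition() helpers, and
the init_t, httpd_exec_t and etc_t labels are picked for illustration.

```python
# Allow rules: (subject type, object type, object class) -> operations.
ALLOW = {
    ("httpd_t", "httpd_sys_content_t", "file"): {"read", "getattr"},
}

# Permitted transitions: (old domain, entrypoint type) -> new domain.
TRANSITIONS = {
    ("init_t", "httpd_exec_t"): "httpd_t",
}


def allowed(subject_type, object_type, object_class, perm):
    """Default deny: anything without an explicit rule is refused."""
    rule = ALLOW.get((subject_type, object_type, object_class), set())
    return perm in rule


def transition(domain, entrypoint_type):
    """Return the new domain, or the old one if no rule permits it."""
    return TRANSITIONS.get((domain, entrypoint_type), domain)


can_read_content = allowed("httpd_t", "httpd_sys_content_t", "file", "read")
can_read_etc = allowed("httpd_t", "etc_t", "file", "read")
new_domain = transition("init_t", "httpd_exec_t")
```

Note that the absence of a rule and an explicit refusal are the same
outcome here, which is the essence of the "forbidden unless allowed"
stance described above.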

The true benefits of these systems could be exercised mostly in
military or government environments, where models such as Multi-Level
Security (MLS) are far more applicable than for the general public.

Flask was implemented in the Fluke research operating system (using the
OSKit framework) and ultimately led to the development of SELinux, a
modification of the Linux kernel, initially standalone and ported
afterwards to use the Linux Security Modules (LSM) framework when its
inclusion into mainline was rejected by Linus Torvalds. Flask is also
the basis for TrustedBSD and OpenSolaris FMAC. Apple's XNU kernel,
albeit largely based off FreeBSD (which includes TrustedBSD
modifications since 6.0), implements its own security mechanism
(non-MAC) known as Seatbelt, with its own policy language.

While the development of these systems represents a significant step
towards more secure operating systems, without doubt, the real-world
perspective is of a slightly more bleak nature. These systems have
steep learning curves (their policy languages are powerful but complex,
their nature is intrinsically complicated and there's little freely
available support for them, plus the communities dedicated to them are
fairly small and generally oriented towards development), impose strict
restrictions on the system and applications, and in several cases,
might be overkill for the average user or administrator.

A security system which requires (expensive, lengthy) specialized
training is dramatically prone to being disabled by most of its
potential users. This is the reality of SELinux in Fedora and other
systems. The default policies aren't realistic and users will need to
write their own modules if they want to use custom software. In
addition, the solution to this problem was less than optimal: the
targeted (now modular) policy was born.

The SELinux targeted policy (used by default in Fedora 10) is
essentially a contradiction of the premises of MAC altogether. Most
applications run under the unconfined_t domain, while a small set of
daemons and other tools run confined under their own domains. While
this allows basic, usable security to be deployed (on a related note,
XNU Seatbelt follows a similar approach, although unsuccessfully), its
effectiveness at stopping determined attackers is doubtful.

For instance, the Apache web server daemon (httpd) runs under the
httpd_t domain, and is allowed to access only those files labeled with
the httpd_sys_content_t type. In a PHP local file include scenario this
will prevent an attacker from loading system configuration files, but
won't prevent him from reading passwords from a PHP configuration file
which could provide credentials to connect to the back-end database
server, and further compromise the system by obtaining any access
information stored there. In a relatively more complex scenario, a PHP
code execution vulnerability could be leveraged to access the apache
process file descriptors, and perhaps abuse a vulnerability to leak
memory or inject code to intercept requests. Either way, if an attacker
obtains unconfined_t access, it's a game over situation. This is
acknowledged in [13], along with an interesting quotation about the
managerial decisions that led to the targeted policy being developed:

"SELinux can not cause the phones to ring"
"SELinux can not cause our support costs to rise."
Strict Policy Problems, slide 5. [13]

---[ 9.1 Subverting SELinux and the audit subsystem

Fedora comes with SELinux enabled by default, using the targeted
policy. In remote and local kernel exploitation scenarios, disabling
SELinux and the audit framework is desirable, or outright necessary if
MLS or more restrictive policies are used.

In March 2007, Brad Spengler sent a message to a public mailing-list,
announcing the availability of an exploit abusing a kernel NULL pointer
dereference (more specifically, an offset from NULL) which disabled all
LSM modules atomically, including SELinux. tee42-24tee.c exploited a
vulnerability in the tee() system call, which was silently fixed by
Jens Axboe from SUSE (as "[patch 25/45] splice: fix problems with
sys_tee()").

Its approach to disabling SELinux locally was extremely reliable and
simple at the same time. Once the kernel continues execution at the
code in userland, using shellcode is unnecessary. This applies only to
local exploits normally, and allows offset-less exploitation, resulting
in greater reliability. All the LSM disabling logic in tee42-24tee.c is
written in C, which can be easily integrated in other local exploits.

The disable_selinux() function has two different stages independent
of each other. The first finds the selinux_enabled 32-bit integer,
through a linear memory search that looks for a cmp opcode within the
selinux_ctxid_to_string() function (defined in selinux/exports.c and
present only in older kernels). In current kernels, a suitable
replacement is the selinux_string_to_sid() function.

Once the address of selinux_enabled is found, its value is set to zero.
This is the first step towards disabling SELinux. Currently, additional
targets should be selinux_enforcing (to disable enforcement mode) and
selinux_mls_enabled.

The next step is the atomic disabling of all LSM modules. This stage
also relies on finding an old function of the LSM framework,
unregister_security(), which replaced the security_ops with
dummy_security_ops (a set of default hooks that perform simple DAC
without any further checks), provided that the current security_ops
matched the ops parameter.

This function has disappeared in current kernels, but setting the
security_ops to default_security_ops achieves the same effect, and it
should be reasonably easy to find another function to use as reference
in the memory search. This change was likely part of the facelift that
LSM underwent to remove the possibility of using the framework in
loadable kernel modules.

With proper fine-tuning and changes to perform additional opcode
checks, it should be just as easy to write SELinux/LSM disabling
functionality for recent kernels that works across different
architectures.

For remote exploitation, a typical offset-based approach like that used
in sgrakkyu's sctp_houdini.c exploit (against x86_64) should be reliable
and painless. Simply write a zero value to selinux_enforcing,
selinux_enabled and selinux_mls_enabled (although the first is enough).
Furthermore, if we already know the address of security_ops and
default_security_ops, we can disable LSMs altogether that way too.

If an attacker has enough permissions to control a SCTP listener or run
his own, then remote exploitation on x86_64 platforms can be made
completely reliable against unknown kernels through the use of the
vsyscall exploitation technique, to return control to the
attacker-controlled listener at a previously mapped, fixed address of
his choice. In this scenario, offset-less SELinux/LSM disabling
functionality can be used.

Fortunately, this isn't even necessary since most Linux distributions
still ship with world-readable /boot mount points, and their package
managers don't do anything to solve this when new kernel packages are
installed:

Ubuntu 8.04 (Hardy Heron)
-rw-r--r-- 1 root 413K /boot/abi-2.6.24-24-generic
-rw-r--r-- 1 root 79K /boot/config-2.6.24-24-generic
-rw-r--r-- 1 root 8.0M /boot/initrd.img-2.6.24-24-generic
-rw-r--r-- 1 root 885K /boot/System.map-2.6.24-24-generic
-rw-r--r-- 1 root 62M /boot/vmlinux-debug-2.6.24-24-generic
-rw-r--r-- 1 root 1.9M /boot/vmlinuz-2.6.24-24-generic

Fedora release 10 (Cambridge)
-rw-r--r-- 1 root 84K /boot/config-2.6.27.21-170.2.56.fc10.x86_64
-rw------- 1 root 3.5M /boot/initrd-2.6.27.21-170.2.56.fc10.x86_64.img
-rw-r--r-- 1 root 1.4M /boot/System.map-2.6.27.21-170.2.56.fc10.x86_64
-rwxr-xr-x 1 root 2.6M /boot/vmlinuz-2.6.27.21-170.2.56.fc10.x86_64

Perhaps, one easy step before including complex MAC policy based
security frameworks, would be to learn how to use DAC properly. Contact
your nearest distribution security officer for more information.
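
From the administrator's side, checking for this exposure takes a few
lines. The following is a minimal audit sketch (the world_readable()
helper is our own name, not part of any distribution tool) which lists
regular files in a directory that any local user can read:

```python
import os
import stat


def world_readable(directory="/boot"):
    """Return (path, mode string) for files readable by any user."""
    hits = []
    for entry in sorted(os.listdir(directory)):
        path = os.path.join(directory, entry)
        st = os.lstat(path)
        # Only regular files with the "other read" bit set are reported.
        if stat.S_ISREG(st.st_mode) and st.st_mode & stat.S_IROTH:
            hits.append((path, stat.filemode(st.st_mode)))
    return hits
```

Calling world_readable() with no arguments inspects /boot. Any
System.map or kernel image it reports hands out symbol addresses to
every local user, and tightening the mode with chmod is the obvious
remedy.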

---[ 9.2 Subverting AppArmor

Ubuntu and SUSE decided to bundle AppArmor (aka SubDomain) instead
(Novell acquired Immunix in May 2005, only to lay off their developers
in September 2007, leaving AppArmor development "open for the
community"). AppArmor is completely different than SELinux in both
design and implementation.

It uses pathname based security, instead of using filesystem object
labeling. This represents a significant security drawback itself, since
different policies can apply to the same object when it's accessed by
different names. For example, through a symlink. In other words, the
security decision making logic can be forced into using a less secure
policy by accessing the object through a pathname that matches an
existing policy. It's been argued that labeling-based approaches are
due to requirements of secrecy and information containment, but in
practice, security itself amounts to information containment.
Theory-related discussions aside, this section will provide a basic
overview on how AppArmor policy enforcement works, and some techniques
that might be suitable in local and remote exploitation scenarios to
disable it.
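
The difference between both models can be illustrated with a toy
example. Nothing below is AppArmor or SELinux code: the tables, the
user_t and shadow_t labels, the /tmp/alias name and the prefix matching
are simplifications made up for this sketch (a real pathname-based
system matches globs and treats links in its own way).

```python
# Toy filesystem: two names refer to the same object (inode 7).
NAMES = {"/etc/shadow": 7, "/tmp/alias": 7}

# Label-based policy: the decision is attached to the object itself.
LABELS = {7: "shadow_t"}
LABEL_RULES = {("user_t", "shadow_t"): set()}   # no access at all

# Pathname-based policy: the decision depends on the name being used.
PATH_RULES = {"/etc/shadow": set(), "/tmp/": {"read"}}


def label_decision(subject, path, perm):
    label = LABELS[NAMES[path]]
    return perm in LABEL_RULES.get((subject, label), set())


def path_decision(path, perm):
    # Longest matching prefix wins; no match at all means deny.
    matches = [p for p in PATH_RULES if path.startswith(p)]
    if not matches:
        return False
    return perm in PATH_RULES[max(matches, key=len)]


by_label = [label_decision("user_t", p, "read") for p in sorted(NAMES)]
by_path = [path_decision(p, "read") for p in sorted(NAMES)]
```

With labels, both names yield the same answer because the object
carries the decision. With pathnames, the answer changes with the name
used to reach the object.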

The simplest method to disable AppArmor is to target the 32-bit
integers used to determine if it's initialized or enabled. In case
the system being targeted runs a stock kernel, the task of accessing
these symbols is trivial, although an offset-dependent exploit is
certainly suboptimal:

c03fa7ac D apparmorfs_profiles_op
c03fa7c0 D apparmor_path_max
(Determines the maximum length of paths before access is rejected
by default)

c03fa7c4 D apparmor_enabled
(Determines if AppArmor is currently enabled - used at runtime)

c04eb918 B apparmor_initialized
(Determines if AppArmor was enabled at boot time)

c04eb91c B apparmor_complain
(The equivalent to SELinux permissive mode, no enforcement)

c04eb924 B apparmor_audit
(Determines if the audit subsystem will be used to log messages)

c04eb928 B apparmor_logsyscall
(Determines if system call logging is enabled - used at runtime)

A NULL-write primitive suffices to overwrite the values of any of those
integers. But for local or shellcode based exploitation, a function
exists that can disable AppArmor at runtime, apparmor_disable(). This
function is straightforward and reasonably easy to fingerprint:

0xc0200e60 mov eax,0xc03fad54
0xc0200e65 call 0xc031bcd0 <mutex_lock>
0xc0200e6a call 0xc0200110 <aa_profile_ns_list_release>
0xc0200e6f call 0xc01ff260 <free_default_namespace>
0xc0200e74 call 0xc013e910 <synchronize_rcu>
0xc0200e79 call 0xc0201c30 <destroy_apparmorfs>
0xc0200e7e mov eax,0xc03fad54
0xc0200e83 call 0xc031bc80 <mutex_unlock>
0xc0200e88 mov eax,0xc03bba13
0xc0200e8d mov DWORD PTR ds:0xc04eb918,0x0
0xc0200e97 jmp 0xc0200df0 <info_message>

It takes a lock to prevent modifications to the profile list, and
releases the list. Afterwards, it unloads apparmorfs and releases the
lock, resetting the apparmor_initialized variable. This method is
not stealthy by any means. A message will be printed to the kernel
message buffer notifying that AppArmor has been unloaded, and the
absence of the apparmor directory within /sys/kernel (or the
mount-point of the sysfs) can be easily observed.

The apparmor_audit variable should preferably be reset to turn off
logging to the audit subsystem (which can be disabled itself as
explained in the previous section).

Both AppArmor and SELinux should be disabled together with their
logging facilities, since disabling enforcement alone will turn off
their effective restrictions, but denied operations will still get
recorded. Therefore, it's recommended to reset apparmor_logsyscall,
apparmor_audit, apparmor_enabled and apparmor_complain all together.

Another viable option, albeit slightly more complex, is to target the
internals of AppArmor, more specifically, the profile list. The main
data structure related to profiles in AppArmor is 'aa_profile' (defined
in apparmor.h):

struct aa_profile {
        char *name;
        struct list_head list;
        struct aa_namespace *ns;

        int exec_table_size;
        char **exec_table;
        struct aa_dfa *file_rules;
        struct {
                int hat;
                int complain;
                int audit;
        } flags;
        int isstale;

        kernel_cap_t set_caps;
        kernel_cap_t capabilities;
        kernel_cap_t audit_caps;
        kernel_cap_t quiet_caps;

        struct aa_rlimit rlimits;
        unsigned int task_count;

        struct kref count;
        struct list_head task_contexts;
        spinlock_t lock;
        unsigned long int_flags;
        u16 network_families[AF_MAX];
        u16 audit_network[AF_MAX];
        u16 quiet_network[AF_MAX];
};

The definition in the header file is well commented, thus we will look
only at the interesting fields from an attacker's perspective. The
flags structure contains relevant fields:

1. audit: checked by the PROFILE_AUDIT macro, used to determine if
an event shall be passed to the audit subsystem.

2. hat: checked by the PROFILE_IS_HAT macro, used to determine if
this profile is a subprofile ('hat').

3. complain: checked by the PROFILE_COMPLAIN macro, used to
determine if this profile is in complain/non-enforcement mode
(for example in aa_audit(), from main.c). Events are logged but
no policy is enforced.

Of the flags, the immediately useful ones are audit and complain, but
the hat flag is interesting nonetheless. AppArmor supports 'hats',
subprofiles which are used for transitions from a different profile to
enable different permissions for the same subject. A subprofile
belongs to a profile and has its hat flag set. This is worth looking at
if, for example, altering the hat flag leads to a subprofile being
handled differently (e.g. it remains set even though the normal
behavior would be to fall back to the original profile). Investigating
this possibility in depth is out of the scope of this article.

The task_contexts holds a list of the tasks confined by the profile
(the number of tasks is stored in task_count). This is an interesting
target for overwrites, and a look at the aa_unconfine_tasks() function
shows the logic to unconfine all tasks associated with a given profile.
The change itself is done by aa_change_task_context() with NULL
parameters. Each task has an associated context (struct
aa_task_context) which contains references to the applied profile, the
magic cookie, the previous profile, its task struct and other
information. The task context is retrieved using an inlined function:

static inline struct aa_task_context
*aa_task_context(struct task_struct *task)
{
        return (struct aa_task_context *) rcu_dereference(task->security);
}

And after this dissertation on AppArmor internals, the long awaited
method to unconfine tasks is unveiled: set task->security to NULL. It's
that simple, but it would have been unfair to provide the answer
without a little analytical effort. It should be noted that this method
likely works for most LSM based solutions, unless they specifically
handle the case of a NULL security context with a denial response.
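
The distinction matters from a design standpoint, and a toy model makes
it explicit. The code below is not taken from any LSM: Task and both
hook functions are hypothetical names, and they only contrast a hook
that treats a missing context as "unconfined" with one that treats it
as an error.

```python
class Task:
    """Toy task carrying an optional per-task security context."""

    def __init__(self, name, security=None):
        self.name = name
        self.security = security


def hook_fail_open(task, perm):
    ctx = task.security
    if ctx is None:
        # Missing context is taken to mean "not confined": allow.
        return True
    return perm in ctx["allowed"]


def hook_fail_closed(task, perm):
    ctx = task.security
    if ctx is None:
        # Missing context is treated as an anomaly: deny.
        return False
    return perm in ctx["allowed"]


confined = Task("httpd", security={"allowed": {"read"}})
stripped = Task("httpd", security=None)
open_result = hook_fail_open(stripped, "write")
closed_result = hook_fail_closed(stripped, "write")
```

A framework built like the second hook has to attach a context to every
task, including the legitimately unconfined ones, which is the price of
not failing open.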

The serialized profiles passed to the kernel are unpacked by the
aa_unpack_profile() function (defined in module_interface.c).

Finally, these structures are allocated within one of the standard kmem
caches, via kmalloc. AppArmor does not use a private cache, therefore
it is feasible to reach these structures in a slab overflow scenario.

The approach to abusing AppArmor isn't really different from that of
any other kernel security framework, technical details aside.

------[ 10. References

[1] "The Slab Allocator: An Object-Caching Kernel Memory Allocator"
Jeff Bonwick, Sun Microsystems. USENIX Summer, 1994.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.29.4759

[2] "Anatomy of the Linux slab allocator" M. Tim Jones, Consultant
Engineer, Emulex Corp. 15 May 2007, IBM developerWorks.
http://www.ibm.com/developerworks/linux/library/l-linux-slab-allocator

[3] "Magazines and vmem: Extending the slab allocator to many CPUs
and arbitrary resources" Jeff Bonwick, Sun Microsystems. In Proc.
2001 USENIX Technical Conference. USENIX Association.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.97.708

[4] "The Linux Slab Allocator" Brad Fitzgibbons, 2000.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.29.4759

[5] "SLQB - and then there were four" Jonathan Corbet, 16 December 2008.
http://lwn.net/Articles/311502/

[6] "Kmalloc Internals: Exploring Linux Kernel Memory Allocation"
Sean.
http://jikos.jikos.cz/Kmalloc_Internals.html

[7] "Address Space Layout Randomization" PaX Team, 2003.
http://pax.grsecurity.net/docs/aslr.txt

[8] In-depth description of PaX UDEREF, the PaX Team.
http://grsecurity.net/~spender/uderef.txt

[9] "MurmurHash2" Austin Appleby, 2007.
http://murmurhash.googlepages.com

[10] "Attacking the Core : Kernel Exploiting Notes" sgrakkyu and twiz,
Phrack #64 file 6.
http://phrack.org/issues.html?issue=64&id=6&mode=txt

[11] "Sysenter and the vsyscall page" The Linux kernel. Andries
Brouwer, 2003.
http://www.win.tue.nl/~aeb/linux/lk/lk-4.html

[12] "The Inevitability of Failure: The Flawed Assumption of Security in
Modern Computing Environments" Peter A. Loscocco, Stephen D.
Smalley, Patrick A. Muckelbauer, Ruth C. Taylor, S. Jeff Turner,
John F. Farrell. In Proceedings of the 21st National Information
Systems Security Conference.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.117.5890

[13] "Targeted vs Strict policy History and Strategy" Dan Walsh. 3 March
2005. In Proceedings of the 2005 SELinux Symposium.
http://selinux-symposium.org/2005/presentations/session4/4-1-walsh.pdf

[14] "Exploiting Kernel Pool Overflows" Kostya Kortchinsky. 11 June
2008. In Proceedings of SyScan'08 Hong Kong.
http://immunitysec.com/downloads/KernelPool.odp

[15] "When a "potential D.o.S." means a one-shot remote kernel exploit:
the SCTP story" sgrakkyu. 27 April 2009.
http://kernelbof.blogspot.com/2009/04/kernel-memory-corruptions-are-not-just.html

------[ 11. Thanks and final statements

"For there is nothing hid, which shall not be manifested; neither was
any thing kept secret, but that it should come abroad."
Mark IV:XXII

The research and work for KERNHEAP has been conducted by Larry
Highsmith of Subreption LLC. Thanks to Brad Spengler, for his
contributions to the otherwise collapsing Linux security in the past
decade, the PaX Team (for the same reason, and their
behind-the-iron-curtain support, technical acumen and patience). Thanks
to the editorial staff, for letting me publish this work in a
convenient technical channel away from the encumbrances and
distractions present in other forums, where facts and truth can't be
expressed non-distilled, for those morally obligated to do so. Thanks
to sgrakkyu for his feedback, attitude and technical discussions on
kernel exploitation.

The decision of SUSE and Canonical to choose AppArmor over more
complete solutions like grsecurity will clearly take a toll on their
security in the long term. This applies to Fedora and Red Hat
Enterprise Linux as well, although SELinux is well suited for federal
customers, which are a relevant part of their user base. The problem,
though, is
the inability of SELinux to contemplate kernel vulnerabilities in its
threat model, and the lack of sound and well informed interest on
developing such protections from the side of the Linux kernel
developers. Hopefully, as time passes on and the current maintainers
grow older, younger developers will come to replace them in their
management roles. If they get over past mistakes and don't inherit old
grudges and conflicts of interest, there's hope the Linux kernel will
be more receptive to security patches which actually provide effective
protections, for the benefit of the whole community.

Paraphrasing the last words of a character from an Alexandre Dumas
novel: until the future deigns to reveal the fate of Linux security
to us, all wisdom can be summed up in these two words: Wait and hope.

Last but not least, it should be noted that currently no true mechanism
exists to enforce kernel security protections, and thus, KERNHEAP and
grsecurity could also fall prey to more or less realistic attacks. The
requirements to do this go beyond the capabilities of currently
available hardware, and Trusted Computing seems to be taking a more
DRM-oriented direction, which serves some commercial interests well,
but leaves security lagging behind for another ten years.

We present the next kernel security technology from yesterday, to be
found independently implemented by OpenBSD, Red Hat, Microsoft or all
of them at once, tomorrow.

"And ye shall know the truth, and the truth shall make you free."
John VIII:XXXII

------[ 12. Source code

begin 644 kernheap_phrack-66.tgz
M'XL(`%U3/$H``^P]:U?;2++SU?H5'<]A(Q%C;&-,$H8YAX#)<,+K8+/#W"Q'
M1\AMK$66-)),8'=S?_NMJN[6VV!GD^S,7;09+/6C7EW=5=6OO>6A-^%68`:3
MT+)OUWJ]]1^^]M."9VMS$W_;6YNM[*]Z?FBW.^VMUF:[NP'EVMU6>^,'MOG5
M*:EX9E%LA8S]$+G6`P_GEWLJ_T_ZW);;/^)VR&.76[=-^ZO@P`;N=;OSVK_3
M[6Y1^[>[W?96IPWMW]MH0?NWO@KV)Y[_\O9?7]78*MOS@X?0N9G$3-\S6*?5
M>L,&L^N0!['C>^SH:*_)=EV749&(A3SBX1T?-;'J.0=-B?B(S;P1#UD\X2SF
MX31B_I@^,G!.`^ZQ@3\+;<Z.')M[$6?ZX'1P9+"[=K.%X-8U[4?'L]W9B+.?
MHGCD^,W)S_DDU[DNIH6.=Y-/&]M>[&*2!LT;.S:S?2^*@<;(N?&`6-?W;L2?
MP(J!7H_ML-9]M]79[6QVNKUWO79OL[MW<72TG4!P/-?Q.+OSG9&J9(X=U]43
MH/8$-&GU>C9NL,CY!S=CYG+/T/ZIU0I%`H!:<\9,AWRV0H7]L5XFSC"T6@WZ
MXBSTL$8`1`+T;:WV:>*X(+N`_<1T2&&O")/!`%5M5:^`Q%8-%AA07U(.,`#<
MJYU'<$.1S]IG37.\F$TMQ]/QQ0IO[(;D8A4^[H@]R>WT&E^@'I8<<>@PVR76
MB7[!/,("!C8$V>,`6C$>Z]#"/`P;K'X163?\+5N)V$<$"_)GQ^^NV$<"C%\P
M3D7-J[]Y]0:2=?>Q=84TU_B]$^O]R\.A>;![>'1QWA><:#5!'LC`BGU'IRKM
M*P-46&_#&`2_^(.%!89LN0Z"A@80)-9_A69GENOZMA5SMC)CUP\QCYB^,CLV
MFD20P-7(HR(8V%P[(%"LK8MBB!,%0ED[[`343L@D`$GXH5X7I>N/LZ>(^Q\>
M^M`?4%/&T"&MF*T$S6:3`560A*6G?!KQ6"=%;2E:,4.!V/<]CGSDF0Z=V,ET
MF-@'P,"SZUX:*?R&RA=UL_V$<A=$=A!R_C@38RBAR_='00U<S@.$-?9#MC)"
MM?&]4030J*6HL1%(A.5T]:G)?L=:V]@)_M/#]#=[*NR_2EH#G8ZCF1/S?],E
M7,;_V^CVP/YO];JM9__O>SR+M;_M.MR+OU0-EFK_SM8/+7`%VYO/[?\]GJ7:
M_W:"2<W@83D<C[=_N]>!8"]I?WAO=3I;6YUG__][/#^^6)]%X?JUXZUS[XX%
M#_'$][1ZO:YE8@)[F9C@I#\\.CSYP"+?ON4Q.-XC\(/L^"$`#VG$QX[G(("(
M6=Z(^1`AA.S:`0`^`'(\RWU;1KRU!G]>L]VC(80-WNR^P<XXN!7LKTTVL.XX
MZ.:=]C7#$&V?1W;H4.FWVG#B1`S^(131#\B3^-`_/_FEOWL&G-\`\Q%"QMZ!
MH0@[1+ZGTYGGH(<8:>`K877L6=QE4W\T`_?=OT//AN7EU2"YA/SWF0-06>C[
M,0-/Y@[\_1L!*)QY33:<H$,<`S&(%ZGS,<2)0]]EM]*Y-!KLEKPD>.$QN.UW
MC@7^_C1PN8;4`:*(C4-_RF;0=BY\(ER`A=4_1<0-`;[F"'LTLV.0+P>'S(X=
M$/M#4_O-G[$I-16TJVI.I"L4DD#($8DK1L!"%\#5?8!ZGYQHTB1-TX@(J2)`
MGQ_&;%4D2B62B;L'II16@PU.]SZ8Y[N_-I0`S8M!_QQ3,>_(Q+?^$-_-P<G^
MNXL#>CW?^ZM\A=K[[\]WCS4)VX_46T">(P2B;,P_L?<'9VSL6C>19IKP;OZZ
M>SBLU6H8+K9;,NWP5*9T5,K!0*:\;FGXC=K2/Z(T/87#[email protected]&``9H[2
M'&>0:TK?S),C\_WYZ<49PGKS)DT_WGU_N">QMB"4L:[M4:;:WO&^.>P/AE0@
MG_SA>/?HZ'0/<]J%G(/S?A_3._GTX_[QH$^@-DH9>V>_848WG[%W>O:;.3S%
MG,V*G(/STV/,ZQ4I`WB[>[_TS;WS_NZP#R6VYI40/.RPU_,*$"\@LWGY^R"=
M\]/?4`HM[>3H>/`>9'II'O5/DO9NM337N;9KE+"W?W2DU_&[&?G-7MW0--NU
MHHC=3NSI:#(*]4$<0J>9A=QXJ]4@".+N*#(A^OL(<1Q&=3>.76_4;'-&`7>#
M4K$7E!(#:P1IA40(^6_B"945`;BAU:X4#9X[C6Z>IB$#8P9*O]$I40&I[9Y,
M)54L)T?\=TE=%D3@C'*I1>*>H`QHQ^J*#PET9,66(`!#;HC9<^TD<&1;P7P:
MD<(`8`O(1"M"NFK0)8C(TH`#Z>)$S*4B):/,IAKSOR86D4%P2<%$:TI5RRM$
M03/_P4-?I&85`FS"S(VSI4N2$O;J&S`Q&J&-?A2WG!7Y+L@+';@H5IPJ$\EB
MTHQU*\FU@X=O0.X(M+5,:T3^TD(\E$BUP:$ST<?X!M3.T<"OR03Z(-B&3U`_
MMJ:.^R#&A6@"CD1I[,ZE%D9'D7H3^K,@'6'SH^8S'7DZ/H`W_0O$J4-H:]`M
M__KOX)@B*>,1$''B>URK@1+!.RF3AA.[8_"R,`(Q33WB[AB<T<@<\>O9S<Z!
MY4;$2`TSF@1#VG9T0?62[TG>8]G[!"^@5A,5>8QU_0`&E6J'M+>YN=$SGBZO
MO%997M$882L`F4F#&+F<)L@4<OVH><-C>"]F"^%"B9)KF2\G&A+*I2)(2'!3
M.4$(.=*E[!KL^B'D8SV%8C344D,F#0F":""!]`*\+&P!`=MV_8@3T;70<B!(
MZ]_;(H33ZX3,8&,+XJ(1A!+QA*V,ZFR%Z0J807("Z*J%4\#T"60/PQG'&75H
M^'(FZ4/"9X!K+R(P:)Z%/(X?SD**<W0@!**P'1RAA8*1M<=*I$PP6$'-C!^0
M-@+X^'LB#-,AM\%*KGH"4EEWH;(HQYU>M\%&OHF6=J?5$$'"3AII)+C)4\W[
M"(;(:J:6'<O@K\P@8)@F0@^1B)@P32*%U,48D<&%6,D*2(^H%NCD.7[K\TO3
M_#M5:@K?(9$'N0E"&J1(14G+N%<Q*@PPKNC`F[8@W1@N&'+0D+Y!BK'!;!2[
M,!P[+45`*FQ9HT!"39(@$D5M2!,O,A5-/J39DQ"G/UZQEW^[;[74?R\7EKL(
MT1(5DNZ"X`"M(^A1:!,'E<1C844\%L>6AQ^9(@PI:DUH5S"SH(A%L)B02&Y"
M[`M/01`:^PV:H0CB<`ZIJ6]1H#;VB[1*0%].KXQA%U3E7.F,*LO&SO",A/W!
MN<;X?!F^T_+S.(?*$N<`1A[).6`G9F4U:3$PE:UEOIK2!<PG)BZCDBW9CV0D
M)B.5IH&54@6S_"=TT!\3XV`E?HFU27$[R]I-FGK)%\**-2RDH.2SP0FBW-;]
MV!;_R^<+464$G945RB/7>"I%RE4T2L))R@1`;;#0^J3@AMR^TZ41]OP8L\@0
M2O&3"R5R(2?/_HLB^U21#*3^TO%L/P3@N'$`"]]9[HR_8/H-X%BY-UZBG2Y"
M-(PG,2,?B%CQE$4I%)-*3)UH:L7V1"$<E1"20.;@4\!Q);[*-&4%E+.K3>K&
M<J#7Y2^H)N`UC!KY&8^!3[H-\\,G2@U/RT0D(\*C9&A:D66E_U)=H`=)!T/8
M-*D8F")<)+&#0O8Z*&U(@V21HY'W1"WI.F).QG>D'/!'Q:R:_+1RG$UQ82R
M;PHE`&_2V],A3==UVEL`-K,+(T/;8']A_ZMWU]K4U+DZB+WH&6=\T=1?E-KU
M<?6*#4!*.*6=;O<`N=,&GIE0L`JJ,(H1J'/].M<0TO?W1K%?=)]1OJ(Y:*.&
M]*FMC"]M&:K?4U].!CL1]Z138,(9]I!<Z:YC>525^3A]]=$2_^1`(:'LL+4V
MB4B@4CU(,J53*=([email protected]:6/&E:>*&(M-\^593$CZ?>1Y*5\>@1779
M3`,9Y":@[A7"/DCG3>597\_&;2C"F\K7[75%:B>7VNZ$G"9UI59:[TN$RQ`
M6+'2>GU)="!8)*"B/("JKM"1%#?3W3;M!NXQ@[]$53:G0Q(69%$ZNE\B7=3#
M&EIMP@$GVA2:%S(#_>4N>Q=:=YR=\$_L5S]T1R]>`HCI0Q1C8&B'W(JY*3;(
MF6(OC2Z;EF`E3(%P"JZ(P)Z.'Z)\HAFJNJH/XV^H$]XFC?:8`P)7NW3:1NZS
M0Y])@!=`./__>(_-'_E9:OT_L,+8L=PU7+G\%$+&8EL!GMK_"__2]?\6KO_#
M[_/^W^_R?/WU_V^W$C^@M6LFM9`E6DA3+G(IN;2:&X+7[$\?66(6FUI48MZZ
MJ*G&,X'R5&',3#:6YA33.<1J:R6J7%N14YC0H8K"Q-31Q"B3J/@%#;PN,%T7
M<0!DX.S&QRO\$GN#P?KJE&ZPGUB[1Y9:31<2;8DQ))\I]74(]:[<VSI*O1TK
M!N,%!NY>S+*)W:UJ$D_RJRR:F"%I"<Y$/I+2M((`74VJAJ1"SP/[CB1A]L>4
MYO7.59K]*IF<4A2^N&))4Q0($W6,9)8MH4EDH`7>?-U@KPD_;DP@H>!N8D3\
M-BL'P**VH680+,&SE#,9O81GX2PHIVWF%71&Z`64+#I-=>4TU1.GJ5YTFNH5
M3A,J84E]R?PB<N/9[/YW/TO9_Y$_NW;Y&FGT$KL`G[+_&_+\C]C_V0'[WVUM
M=)_M__=X_H3V'S?(V3Z^WC.AD6)!0&Q86\0/6,3D[Q-D-`"E]<5O;_*I*/&#
MIM:[`1:SG!K"[C]FS_-P\W:,Z6,GC&(C;\^JS=5<8P@U;\`>+01$3@K0>IN:
M]Z;V:Z?B<YB8_9%\27^F5I-LT3H5);*-#L;##?:F1R$S_,&WSB9^PM`!KZTN
MI'9;;WI76K5\E?J0A-M"FHH`X3\YZ#8!M$<<)Z%137OB@^KJ.3J%@T!\Y)V>
M&N)YA?OL4"P"0,3Y"*WV3'SK&QUQ[DOF3B!L=X4PHW3:5\DX3X0H]&CMK,]#
MJ6_G^BK?R4.)J9^`?$M=#C)_9!<>;E95VVYG4;']L(P`T4RT*H'Z[.0\\2QE
M_W&*:"WVUW"":(T."'^-^']C8Z-3./_1Q>QG^_\=GC^3_1_2AO1WH35Z&>56
MBQGJ(K,GW+Z-Y![V:.+/W!&+@;P;VF8?6)YC-S1?4&2-_F[9.*;(`X:T@X1(
M]6^!ID]XP/.:$US@)/9AW/FF$PPH[*%_`;P<`<KOZFN4!2EG%IZ:,$!1+>9^
M/#:=`-H%!,AV,*HF%YZ(LRNS*<CO=)<C)%$*H01EDC"]:FH!DA\A26;?VQN"
M)"$Y3QP$KIRBQRT?[<.3@].C_NZ'EQF(A3EZ`3F=HU=PTVGZ)$5L2X/635:6
MI8>`/RB?U&G"D]"T*RV>!H\0F0(K.HJHPOF5,I3O&OXQZ-BPFDM)0324OR38
M>97)R?)'!!E&U@W*[1HI%56[=0K$@N<B"C1Q51C/KB22PA0#5YGE\E:&KV%V
MX,CJ"/MD17*L>$$=Y\?LJDB*Z>.C2*\>=[@K'>TG.D76GR/7^G'H2^JW@$Z5
M_E,S6X5Q\WE>:]EG,?\O*44S[<O>"O.X_]?I=CK)^=_61H?.?_8VGL]_?I='
MW/^2'&9,FAQ2_U`7PR#$U=55MG_*3DZ'[&+09\-?#@?L^'3_XJC/3D_8+CL[
MAX_A(;P/?AL,^\=8H7BGC(L'2->KKHP1.>)L9E4.^F#Y=(_'Z[CCI9P*_T&E
MVRHPT2W:C&H$?NR,'ZJR'H%W&T_`.H^JLN[DQJ2*K#BNQ#.=TITY/](Y7<Y*
M9PO%R<+==WL@]'*I](AAJSHS.6C8GI,OCAMVJG/5H<.-N=ET]+!;G9T<0-Q\
M)%\<0^S-H[YP&''KJ7*"V]=/%2.NWSQ52AU/;,\1[OG%B?EA^`N0ME]K5PB8
M-E"=02.^>8-W&N'1#29V);'L#45T`1#MR-LNIF)P4TH,K-%V<@,0N%C;VF>P
M[E8,/>QZ!DZCJ>NF&8!IX2/3I"MP4MQJZQQ+[Q"2-PCE<=`^]&PJ^E:X"[U4
M4NP37XX&FCLNB0"=IJ7@"&=I'J"<A&ITK,O^V+U:%H,-H5()PRB*2X*(0OO+
MFR793%C&56J*Q2A8`+]2V/?](>KSF1X8-5W7Z;ZM52/`;7UJUUM6>3$8R)`N
M]H&B/HFTZ);B%;8*XR[>%.2,@!KX"^\"LKA0*R>H[)U?,;OCX;4?@5+*-`48
M!GZVBO;2Q-=B]BTTEFE;-ABZ51(IOB:%]OL'AR=]\_ABV+_4"0:U+PCFWLAC
MQXV!@BE3#/12(%9X(Z[>*K*^JK9HZ^4L`R_*VL8Y`D)E0M>[U?]2PB]S9]Z<
M?*TF[9%)%QQ$L9Z(H2&V>J_]#.).WD'<N'T0`*=2,2EV2"0CRY8O7I*B(*93
M6=!=9`16R.CW!DM:,[E\#=[5_6LB$\_M;L_3#)FJ-G"R5<^=;%?)-[B-T_38
M`C#R?36.;G.JA-LBZ63PX&QWKZ]+>DC^``3W9Z;B$),G&7EDSO+(Z\&H3NYZ
M,(R5;FGWO=D_/V?UE>BM.A(%879R05E"O+CQRC3',\^&7K>MI;?+T15B(`D\
M)"/6,6ZO==K"62;D!>1A&'CC`Q9_%I-&N).$W;.+H4[M3RU$K2,R]D]/^@V6
MR@$JK?U,$C=I7,$5&`2&#95`V]\=[NI04MY<ACLC,5_>-I8`(YF2KN%)+>K@
ME"!XDKT?$\24ANCVE"`WQ7*Z8P_:$!M&]#531+&%'@A\W<8-5A?S>F::7T\$
M=#C`!M$!6.$*/S(V)%N4X;82^8FX:5"*`EIP%N)^:X0E95VL"&)_^VA_`AJ-
M4@\"?LSL4;"*KM!@53TA2<U:;1S8@DS_DD-I14$23]HW@#8A],H!O6J@5Z"H
MH==7V>7EY5MFTQ3O-8=_>,V<F*;E%%P$U@U7'<`/V82'G(T</)*`77A=7<,G
M/61=$%1U#U^N]0H7YZ6UTAV[C:+6-N98K5PM8+FZ&!W@""I@)`)1$E%3F\0^
MV>XHX+8S=M`>BSD_7#B/0!Q<3H&+4W&.!WTX9')T6]=J508D:4F#/4HOLB5'
M?>&*(9R<7V`DIQF1,^B?-+,JWFD0,-3]D-+P&H61#N=%Q5"G`*W,&CBGZ-X;
MZ5;LUOV*>YD;[L1>_#E($ULEJ"8:,N9&'?+=>T?=KTD#6V;THC%.W'Z8Z,1=
MYH;">=U0G&/\LDY(;FO:!1>3&"&D&3_WTLB+1XJ#)@O)1HG"R@'+9W^NO!95
M\:6.2WX18]*/7I(SB5*RUJ!9S<Y]`P]GS&4SJQ'VQ]:5RDNLM`1:(0)5M%SM
M2<G0,<POE0SZ_\M+!E'F)`-]9;YHP)G/]14\)IKE,!FX"G+!>JR05JH\IR?(
M6?PO%$P:KSQIDC)%_WVCE`#+W.SZ)S8I*3M"_:L,049^BYL"Y=RU%Q[:<ZLZ
M7T-UA>(R,)1)%@M\0$TWG,F+V2((2'EJ`Z$6+GYE25E.Y;WO9$>29<'G_O/<
M?]+^DRX4?[T>-*<#J1L$*SM/2L<?J_N0Z,P)OY<D!*5[V^6G(P)3]>F/QS*A
M?+EYSM\5J08+DONH<PV%J[#[LVF06RO'Y2BZ.E'<>4V-1'&X8$&C#7RZH(AV
M*6+@RIQ7K\2<`(6?#OO7O]@+*+2"*_DB@W(<VH&H2*FKJ[)K-63IU0YSMK5,
M]DJK.\(;MR&3Y@H^:ZF^U<'C[ER*Z[@_.E>Y2\@3R)]S4UD4+4,(QIT[3H=2
M%QZLA+.#IQ3#D$X)H^@W.L@Z?F8T`$>@M4)_2NX%R(UD!-NF.T-D]"UO$-$+
M98SR/$1Z>?]/<_KN_/D9T`P_%'M\\?]'`.\6FCLMP];:4JZ(#V"O_3SG]/UR
M",5VIT]A,M5/X=,ET4$*DAL5)%*B*^TR-EZ1(/O8?')11C]3L>4(Q-HLK5RF
M;!EJ(F#7GDB*Z.P_$6-;4<7BU=M,#R@,JGB;`!:9WVPU.3"+L1J^KT-NX:1@
M!;+L?0(Y>>U4F-`TW"<IE.9QL.>059\3M">3ZGA3A/$D:;@BM2AAM"&]0):(
M:ZN)$G'KLB2)Q;\%:1).08$H%9164J6"SB\@:^_LM\7)`N>B3!9%A//(HHAO
M6;*RUT0\35?&;\F1]G_M77MSVS82OW^M3\%HFD32R19?(J6X[=1QG-93)\[$
MN=Y=DXZ&DBA%9[V&E)WX;O+=;W?Q)$%*<N)SKXTPTZ01E\`"6(`+[&]WY:&L
MD+>,)O,9[*&M]2X89'KO7;.8-_8^V4`EU_(Z(K&LUM%P,V\QF6;CI:$#;>S-
M^;-STKJT[@SC470U76W8RJ*Y=36_G"\^S/F&UI1;<='6IC9;4\');+^2CT_*
MF`,?;D/]RN@#=+-MZ@/ZE[_<2(.Z0,(LJS(W#KS,/S]9*XS-/\[,8,&T!HR3
MR^_C21IK.8U"?N5)&:FC=J7J_S:G?^B!94BY2G13T-'9Z8\O<_7+=A/VQ1-5
MLXKXH8K_=E@1/,*(P.^F4L570EWH?<)6!U.L+1-XF^ET%H:_PYC<-<,:\PAY
M[CWO'1W_O+8N6VF'\&-O>36=LJ<)[]RG=5-?.NWK;92F]+`)W&"[S*JD#%EM
M9>!M]!.="*3PP61>1U-AG9+&3I0?/B#L`J''@+*U1X37GN.*,H.3YP`9I-;H
MG6F2/MHD>%./P9O$*5AK^19V0/2AYGSR>.IK%,Z3TY>_')WQQ2O-25E;)>\E
M-X#M`Q7]7BTZ$S.SKH@A0SP?2OU05'Z+SB3Q>(+!39`=BU[?MC/F1IA!O>T3
M[,V:+J)A/)1)A&3\,A0`8P?K]3`;4TY\*$&3%)^<@"0,%*=F,F<5QS!NR>)&
M#4Z]^.A8Q#K*O,8\\,K`;$R@31D'&DY`+)N]``(F?[VST^.3EQ<GM>H&T)YS
M8%?56T=_>_/3^>O,2V=GQ]:W"!.,DL'['U+Y`%VWOL=7?V\\Y-=6ML/_OH@N
MXQ$LP<]K8X/_E^U[#N)_$?8;AB'B?QW/;N_PO_=1%OU_[<_PZB<+\5Y4*G!V
M!`5R!C-O[1];K>FDS\&Q:>N;6HKQGV#'B6;P-*FW^E>3Z=!Z\=TWM5=_?U;G
M&4Y2]&Z*HSE4D\R`#+[M!POX[Q+_6&!(-.N@P?Y:S98].#NGE!FF<0#O'PSD
M6R\8)G<6)<!D*OZ9WLSP#='6P2(9HC\5[H'08'HU7,`G/H6GN:Y=+BILIQ14
MR<P@JGPU[@-%^5\'T?P.L[]N]O_T/5_F?VWC^H<__5W\AWLIN_ROV^9_Y9?_
MH$WW9M%'O)I16M[ST[,3JZ$!25>+%>J+E3TT"(P6T/%:M;5,%H,6O#>9CQ:@
MLE83"1@KR?M)+U9UE=;FRNP(%^FH-H):7L2S-]C:$XO*PROK\BG4_HA8H$R9
M+,[>2!BE6(!&>JR2GA;#%K`5X>^W/LUM,S>V_0C-4__)@X)9+MKE80[+30IN
M/)2Z;CP4AZVE`2"2^6S-5+B%=$:"7!SPQA)'7*0J98=;$[:-.$&\@_EK'L,$
MS>W33=1>.6_L+H3L,;Q/O!V9F=1ZOH!%(R60>:)./_*<JO+*F_M%-I:L1FV$
MV!WS)VF:V9#0U\IG],T<<D<:CO\CPSF"G.-OXD>0`L/FA>8?BK1!QMCH(W"E
MKQ")B(0?2O+\'D?SQW3$3B;Q=6PMW]^DDP&()KR^2&ZH/P_>S;=,?GL!`CO/
MV-,6(U$3.6.WAO$UKD&1_A7Y8G;-RAZLAM,1RU2&!\ZT27G*6'ZS]QAE,YY'
M?3R'XHX#YUM1%T+H;A!#QPYY5!':0H>S.!T+3W3XZP,<OV<QU(\S,KF,J:DG
M1/XJ68R3:&:IW.OHNLYQO(,!XO1D:_UX]2&&G1334>U_#YLU[)Q[=`5(R1CX
M9L.IH8OGO=?/SE^>_5-N-D-8*W9VIU$;3>D0\X5&H@$5R`FERRB4#?0I1H`Z
MQI)])+<'D$7FLZL:E:TB>35W%<NND;1]Q[QM9)77>*86DLLF$UFJBW$(2X$+
MK#1(L6UPR#C1.D\/,N?]/WG27ZT4Z'_H6(,KY*[4OXWZG\W]/U'_"T('];_`
M#7?ZWWV4G?ZWT_^X_J?T`OR?*!DC$(>V7/0VNM8!,3``36L<+9G-):/*30[S
M.E^CL5PE*0X&>P,5%]N&IFO8MF*!1@*;A?W990-1K%#D-(HO4RF*/WCL,]1J
MP%?X!K_![R>@Y4YF$Q:$*OT0+9G-#1F$+XT'O9##P4TJBO]HM9A@QZ[?.K_5
M"_I-"LR*0IYQ>WX-7VSQ*O&50JVN49>N0?1VD0BQ"C=\VKEI'&NA:_I;-*]4
MKU_ALU&L>644K3Q\*=N6AF1";MY.?E.#(DBX`,CG>K<+^YU3+O(,]]%4Q.!6
MH'IS[IML(3&0'&^*U''&A9!^X%S,HCZ*2(P#J5C^=)NNZ]U[D.T>YUT$[]!X
M/]!99?TF&(;^RR?.B/C]Z]1Y=D65HON_"%-T_SL&>2830.M+VUB?_YT*U_]<
M._1=T/]"S['_8K7OHH.;RE>N_VTS_X@6/%A]7'UN&YOB__H!L_\$KMMN!QC_
MS[']7?S_>RDON$K$YCQ:B33VJ;"$5RJ@5+Y$.\^>!<H^^?3M6><4H`X>[6]7
ML!)4X\B+E9PS]\Y_/H"*8M]V**\VTUXE_-]M!W5)$WANK&@DB>UW=!K/,6GP
M:BI#XYLT':?K9FA"22-(/#<,LFWU-9XMF0W&<R5-&PY17=LW:0)?Z[O?'Q31
M=`-5C^O9?5_1Z(EG%(WKC#HF#0RB:@O7E&W2M!U7XR?J=A2-;`K55=F6XX6^
M28.3H?$3PA@:-!1<3O'3'A30X&1H_'1=I-%%\/AF,,6P?==C)'F>P,^W$T,L
M?5!X8-WC?@<]"[NLL>%PT!]4\A0TT(ZG:$+;H*&!=CU/T/2''8/&<VV@Z822
MIF.VY76`'1!`2>.;;?DXT*'G.[*U:.0;5#2MON(ZBDR.VB%P[7=5/6V3H\`'
MKMN*H\@Q.0IMX+JMZND.3'Y@_0#7OJ()37XZN'P"-4)=U^2G@R(4JGYU8I.?
M;@`\AQU%T^7\:)-*(MUIVY+&[U0JY9NAQ-;\V13D;;[_7X;^V(S_\%R???_M
M=N![^/VW0U`)=M__>R@2_Z'-^A\=`O)[C^D?J6RS_@W1N*5E8/WZ=URO[6?/
M?Z#R^#O\U[V4_^?[_[N,U;@V[.&:0(DI[$I3(\9C)EIBQL)`%^9LF^KA1:+U
MG?7X'X\/,[#AQC*)KT>HN1X6!XN(,5)XKQ\-J8J\`72I82WPQGTN8T$LW\[I
MPC#[1EWGI\AE&*\*JT^1V8=75HW2-Y)W9YV9S'F,8W9!.DGQ:EWL",,'`J4P
M;UK8.OS)_):9]X.!*LDP3@';Z31(K*5&1\>YG@JWW+S!81NGV.48+T'S[BGD
M%(M#EG=1K.(5^<FS`^L==IM=K8ZY)28[.\CEI"[O_;4J\!23>UW>MDZ*W6%H
M,,3H*OM2?F"`JFDU%E.&F,'C=)';,1ZXT9@;#ZG>6D$D+_Q=AO)B@;\8UAXK
MAOHQEA:0'"H;`36M"533>G7TXTGOXO17,B?(`WXM'_:'FEK+;Z^7N2*X-;\T
MS5@YS"?PG@]1UFI8K_`I!V00HJ./0`HIS633,<'V5O[NXMV*#`N&_++!T<>#
MQ^G*#)Y]VR%3SD1ZCY]D63TZ.WG]QJH^-\._,1&@RH35RUR?2@2O12(;(84Y
MHVE^#6IP,":1++HEZH5O`_\W,[*#%M;A`4)#UHL?0WGI81XRPE?@Y5_92^?<
M3H),2!@*_@-3QCM-JRIX>3B]JE=E+<2S3HW6I\?O;-B_B\7BH1.D*`U-ZFZA
M3!#+(G#.ETTD>6T+SNO;S.7EY\ZE@?,JFUKXE,6SY:HWG*2(C2+[+'XON<-+
M38OL<:GFGT45/'IS_N+T>"M9V#3_NJQ<9F3AR]O&D!<XJ^@D=W8*,S_&G>,J
M)?4FQ>,/2]?1M!;P2_)ADF*V+OQ,4MX/W%.P#63Q`?&8WY@VBZNX'0-QO4MI
MI;ZEEY.E10G!1,LB(`=/WH##N5PE%J:.7,%V&8TPV![S^Z;>4="%[_2W&\@?
MX2[-G]>O$&`(U<L5#:X6MXY"C:`C+V^OH&)LKX"-7(08WA"/GZB)BA1EAO(C
M269NA)HLKUG!6[Q\BV5^N>4REU=:-1'U%,>45K7$9_*EVFKEAD*'YT$+#0S)
MBZZT'&)7)TS,,(Z&<$3SV0`QM14>]'IXWE]-YCWZI<>C"=9LMNA(<M1D\!&F
M2#[TZ'N&P%/<,[8W:8SH^&WTN$PKS,%0!G1Q_!:1JGFH[_48F8:_>%B7O#89
M:+HDKX:`":@5L'\S7(IB30`5RJAA7<M'2(=M@TCKOVG>TS5\W+*">OGTT\%<
M*8QJGW;M3(=I;)HR3K86K1966OH6^<1>8GX]RJU':?4\%Z3%Z\!O/B:):3LN
M_!$&F(;/1NT_M.%1&&!>74S.U\%\?-W`9OGXH">:#V-57.6#8IRYRX=_TV4^
MB#O46-WZ2E]X:AK@"BYZO&=U&#_AG<IPHQQP4=G#_F[>@M7=\4.^!].`<6#%
MNKUXZ\UX3X;,S,IYKBT9GN=*.UVP5^&4(=9H!AE=$I&'NS_G[W=R'M`F^QN-
MAF)25./ZM',K(OPO,R-^[I3#MX+`TQ,.VJ;/+4=NBS[U")^=6O$$O\T65U.&
M]!Y(S'RQLCXLDDL><`GV7`ZE+CB-B3:Y!L9[S$\36!$V?:#>-A1IM&D>ECV4
M1X*6NP71%B2-;>II='3T^Z7>,SQ[3Q%X/IXN^M'4NC@[>MJ[.'EY<?KF])<3
M-O"4ZFP:8W;&&/T%%NF*W_ND,4NB1F.BCZFADGJ*3^-9X)<_ZP;ESV#K*G^H
M3X/Q$/:V-=5RJ%=)O6B/+GUJ3ETY"9NZ_-JM[.5V>Q.]5>"<;BSNK'_Z;19W
MSL$\HY/HKN:%^TG.V[R0K2*'\_.+LP*7\K,H`1Y_FHS?IS-4EZ1/.:YBY58.
M>OC*$J[E.[M`0=D6_P/Z\F=#@#;A?T*[+?`_@>TXB/]IMW?W__=2\NOP"3>F
M65-^0?]X<\B'Q]8JHD"0[*ASL,:._K\&%8E;6P*8N,-28)%&%]CH0E4"+LK2
M>:4`HRQ=4`HRRM)U2X%&63J]'UFPD49'@".W!'"DTSE^,!J4@(YT.M#ZAP6@
M(L+#9-IUNGX9^"@['QVG#("DT]G#42D(2:>+O&!0!D3*M-ONEH*1,G3AJ!20
ME*'KV/<`2NJ&?-#:Y:`DQ-1PFE)0$H?E`$TI*,EK^Y*F#)3D.XJF#)3D=V1;
MI9"DMJ]HRB!)`1)RFC)(4A#:DJ8,DA1ZCJ`IAR2I<2Z%)'4"R4\I)*GK2II2
M2%*70Y*0I@R2Y-AM5=%7BDG:E5W9E5W9E5W9E5W9E5W9E5W9E;LO_P4A):,A
$`,@`````
`
end

--------[ EOF