PEP: 603
Title: Adding a frozenmap type to collections
Version: $Revision$
Last-Modified: $Date$
Author: Yury Selivanov <[email protected]>
Discussions-To: https://discuss.python.org/t/pep-603-adding-a-frozenmap-type-to-collections/2318/
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 12-Sep-2019
Post-History: 12-Sep-2019


Abstract
========

A *persistent data structure* is defined as a data structure that
preserves the previous version of the data when the data is modified.
Such data structures are effectively *immutable*, as operations on
them do not update the structure in-place, but instead always yield
a new updated structure (see [0]_ for more details.)

This PEP proposes to add a new fully persistent and immutable mapping
type called ``frozenmap`` to the ``collections`` module.

The bulk of ``frozenmap``'s reference implementation is already
used in CPython to implement the ``contextvars`` module.


Rationale
=========

Python has two immutable collection types: ``tuple`` and
``frozenset``.  These types can be used to represent immutable lists
and sets. However, a way to represent immutable *mappings* does not yet
exist, and this PEP proposes a ``frozenmap`` to implement an
immutable *mapping*.

The proposed ``frozenmap`` type:

* implements the ``collections.abc.Mapping`` protocol,
* supports pickling, and
* provides an API for efficient creation of "modified" versions.

The following use cases illustrate why an immutable mapping is
desirable:

* Immutable mappings are hashable which allows their use
 as dictionary keys or set elements.

 This hashable property permits functions decorated with
 ``@functools.lru_cache()`` to accept immutable mappings as
 arguments. Unlike an immutable mapping, passing a plain ``dict``
 to such a function results in error.

* Immutable mappings can hold complex state. Since immutable mappings
 can be copied by reference, transactional mutation of state can be
 efficiently implemented.

* Immutable mappings can be used to safely share dictionaries across
 thread and asynchronous task boundaries. The immutability makes it
 easier to reason about threads and asynchronous tasks.

Lastly, CPython [1]_ already contains the main portion of the C code
required for the ``frozenmap`` implementation.  The C code already
exists to implement the ``contextvars`` module (see :pep:`567` for
more details.) Exposing this C code via a public collection type
drastically increases the number of users of the code.  This leads to
increased code quality by discovering bugs and improving performance
which without a ``frozenmap`` collection would be very challenging
because most programs use the ``contextvars`` module indirectly.


Specification
=============

A new public immutable type ``frozenmap`` is added to the
``collections`` module.

Construction
------------

``frozenmap`` implements a ``dict``-like construction API:

* ``frozenmap()`` creates a new empty immutable mapping;

* ``frozenmap(**kwargs)`` creates a mapping from ``**kwargs``, e.g.
 ``frozenmap(x=10, y=0, z=-1)``

* ``frozenmap(collection)`` creates a mapping from the passed
 ``collection`` object. The passed ``collection`` object can be:

 - a ``dict``,
 - another ``frozenmap``,
 - an object with an ``items()`` method that is expected to return
   a series of key/value tuples, or
 - an iterable of key/value tuples.

Data Access
-----------

``frozenmap`` implements the ``collection.abc.Mapping`` protocol.
Therefore, getters, membership checks, and iteration work the same
way that they would for a ``dict``::

 m = frozenmap(foo='bar')

 assert m['foo'] == 'bar'
 assert m.get('foo') == 'bar'
 assert 'foo' in m

 assert 'baz' not in m
 assert m.get('baz', 'missing') == 'missing'

 assert m == m
 assert m != frozenmap()  # m is not equal to an empty frozenmap

 assert len(m) == 1

 # etc.

Mutation
--------

``frozenmap`` instances are immutable. That said, it is possible
to efficiently produce mutated *copies* of the immutable instance.

The complexity of mutation operations is O(log N) and the resulting
``frozenmap`` copies often consume very little additional memory due
to the use of structural sharing (read [6]_ for more details.)

frozenmap.including(key, value)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The method creates a new ``frozenmap`` copy with a new *key* / *value*
pair::

 m = frozenmap(foo=1)
 m2 = m.including('bar', 100)

 print(m)   # will print frozenmap({'foo': 1})
 print(m2)  # will print frozenmap({'foo': 1, 'bar': 100})

frozenmap.excluding(key)
^^^^^^^^^^^^^^^^^^^^^^^^

The method produces a copy of the ``frozenmap`` which does not
include a deleted *key*::

 m = frozenmap(foo=1, bar=100)

 m2 = m.excluding('foo')

 print(m)   # will print frozenmap({'foo': 1, 'bar': 100})
 print(m2)  # will print frozenmap({'bar': 1})

 m3 = m.excluding('spam')  # will throw a KeyError('spam')

frozenmap.union(mapping=None, \*\*kw)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The method produces a copy of the ``frozenmap`` and adds or modifies
multiple key/values for the created copy.  The signature of
the method matches the signature of the ``frozenmap`` constructor::

 m = frozenmap(foo=1)

 m2 = m.union({'spam': 'ham'})
 print(m2)  # will print frozenmap({'foo': 1, 'spam': 'ham'})

 m3 = m.union(foo=100, y=2)
 print(m3)  # will print frozenmap({'foo': 100, 'y': 2})

 print(m)   # will print frozenmap({'foo': 1})

Calling the ``union()`` method to add/replace N keys is more efficient
than calling the ``including()`` method N times.

frozenmap.mutating()
^^^^^^^^^^^^^^^^^^^^

The method allows efficient copying of a ``frozenmap`` instance with
multiple modifications applied.  This method is especially useful
when the frozenmap in question contains thousands of key/value pairs
and there's a need to update many of them in a performance-critical
section of the code.

The ``frozenmap.mutating()`` method returns a mutable dict-like
copy of the ``frozenmap`` object: an instance of
``collections.FrozenMapCopy``.

The ``FrozenMapCopy`` objects:

* are copy-on-write views of the data of ``frozenmap`` instances
 they were created from;

* are mutable, although any mutations on them do not affect the
 ``frozenmap`` instances they were created from;

* can be passed to the ``frozenmap`` constructor; creating a
 frozenmap from a ``FrozenMapCopy`` object is an O(1)
 operation;

* have O(log N) complexity for get/set operations; creating
 them is an O(1) operation;

* have a ``FrozenMapCopy.close()`` method that prevents any
 further access/mutation of the data;

* can be used as a context manager.

The below example illustrates how ``mutating()`` can be used with
a context manager::

 numbers = frozenmap((i, i ** 2) for i in range(1_000_000))

 with numbers.mutating() as copy:
     for i in numbers:
         if not (numbers[i] % 997):
             del copy[i]

     numbers_without_997_multiples = frozenmap(copy)

     # at this point, *numbers* still has 1_000_000 key/values, and
     # *numbers_without_997_multiples* is a copy of *numbers* without
     # values that are multiples of 997.

     for i in numbers:
         if not (numbers[i] % 593):
             del copy[i]

     numbers_without_593_multiples = frozenmap(copy)

     print(copy[10])  # will print 100.

 print(copy[10])  # This will throw a ValueError as *copy*
                  # has been closed when the "with" block
                  # was executed.

Iteration
---------

As ``frozenmap`` implements the standard ``collections.abc.Mapping``
protocol, so all expected methods of iteration are supported::

 assert list(m) == ['foo']
 assert list(m.items()) == [('foo', 'bar')]
 assert list(m.keys()) == ['foo']
 assert list(m.values()) == ['bar']

Iteration in ``frozenmap``, unlike in ``dict``, does not preserve the
insertion order.

Hashing
-------

``frozenmap`` instances can be hashable just like ``tuple`` objects::

 hash(frozenmap(foo='bar'))  # works
 hash(frozenmap(foo=[]))     # will throw an error

Typing
------

It is possible to use the standard typing notation for frozenmaps::

 m: frozenmap[str, int] = frozenmap()


Implementation
==============

The proposed ``frozenmap`` immutable type uses a Hash Array Mapped
Trie (HAMT) data structure. Functional programming languages,
like Clojure, use HAMT to efficiently implement immutable hash tables,
vectors, and sets.

HAMT
----

The key design contract of HAMT is the guarantee of a predictable
*value* when given the hash of a *key*. For a pair of *key* and *value*,
the hash of the *key* can be used to determine the location of
*value* in the hash map tree.

Immutable mappings implemented with HAMT have O(log N) performance
for ``set()`` and ``get()`` operations.  This efficiency is possible
because mutation operations only affect one branch of the tree,
making it possible to reuse non-mutated branches, and, therefore,
avoiding copying of unmodified data.

Read more about HAMT in [5]_.  The CPython implementation [1]_ has a
fairly detailed description of the algorithm as well.

Performance
-----------

. figure:: pep-0603-hamt_vs_dict.png
  :align: center
  :width: 100%
  :class: invert-in-dark-mode

  Figure 1.  Benchmark code can be found here: [3]_.

The above chart demonstrates that:

* ``frozenmap`` implemented with HAMT displays near O(1) performance
 for all benchmarked dictionary sizes.

* ``dict.copy()`` becomes less efficient when using around
 100-200 items.

. figure:: pep-0603-lookup_hamt.png
  :align: center
  :width: 100%
  :class: invert-in-dark-mode

  Figure 2.  Benchmark code can be found here: [4]_.

Figure 2 compares the lookup costs of ``dict`` versus a HAMT-based
immutable mapping.  HAMT lookup time is ~30% slower than Python dict
lookups on average. This performance difference exists since traversing
a shallow tree is less efficient than lookup in a flat continuous array.

Further to that, quoting [6]_: "[using HAMT] means that in practice
while insertions, deletions, and lookups into a persistent hash array
mapped trie have a computational complexity of O(log n), for most
applications they are effectively constant time, as it would require
an extremely large number of entries to make any operation take more
than a dozen steps."


Design Considerations
=====================

Why "frozenmap" and not "FrozenMap"
-----------------------------------

The lower-case "frozenmap" resonates well with the ``frozenset``
built-in as well as with types like ``collections.defaultdict``.


Why "frozenmap" and not "frozendict"
------------------------------------

"Dict" has a very specific meaning in Python:

* a dict is a concrete implementation of ``abc.MutableMapping`` with
 O(1) get and set operations (``frozenmap`` has O(log N) complexity);

* Python dicts preserve insertion order.

The proposed ``frozenmap`` does not have these mentioned
properties. Instead, ``frozenmap`` has an O(log N) cost of set/get
operations, and it only implements the ``abc.Mapping`` protocol.


Implementation
==============

The full implementation of the proposed ``frozenmap`` type is
available at [2]_.  The package includes C and pure Python
implementations of the type.

See also the HAMT collection implementation as part of the
CPython project tree here: [1]_.


References
==========

. [0] https://en.wikipedia.org/wiki/Persistent_data_structure

. [1] https://github.com/python/cpython/blob/3.8/Python/hamt.c

. [2] https://github.com/MagicStack/immutables

. [3] https://gist.github.com/1st1/be5a1c10aceb0775d0406e879cf87344

. [4] https://gist.github.com/1st1/dbe27f2e14c30cce6f0b5fddfc8c437e

. [5] https://en.wikipedia.org/wiki/Hash_array_mapped_trie#cite_note-bagwell-1

. [6] https://en.wikipedia.org/wiki/Persistent_data_structure#Trees


Acknowledgments
===============

I thank Carol Willing, Ɓukasz Langa, Larry Hastings, and
Guido van Rossum for their feedback, ideas, edits, and discussions
around this PEP.


Copyright
=========

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.


.
  Local Variables:
  mode: indented-text
  indent-tabs-mode: nil
  sentence-end-double-space: t
  fill-column: 70
  coding: utf-8
  End: