PEP: 756

PEP: 756
Title: Add PyUnicode_Export() and PyUnicode_Import() C functions
Author: Victor Stinner <[email protected]>
PEP-Delegate: C API Working Group
Discussions-To: https://discuss.python.org/t/63891
Status: Withdrawn
Type: Standards Track
Created: 13-Sep-2024
Python-Version: 3.14
Post-History: `14-Sep-2024 <https://discuss.python.org/t/63891>`__
Resolution: `29-Oct-2024 <https://discuss.python.org/t/63891/62>`__

. highlight:: c

Abstract
========

Add functions to the limited C API version 3.14:

* ``PyUnicode_Export()``: export a Python str object as a ``Py_buffer``
view.
* ``PyUnicode_Import()``: import a Python str object.

On CPython, ``PyUnicode_Export()`` has an *O*\ (1) complexity: no memory
is copied and no conversion is done.

Rationale
=========

PEP 393
-------

:pep:`393` "Flexible String Representation" changed string internals in
Python 3.3 to use three formats:

* ``PyUnicode_1BYTE_KIND``: Unicode range [U+0000; U+00ff],
UCS-1, 1 byte/character.
* ``PyUnicode_2BYTE_KIND``: Unicode range [U+0000; U+ffff],
UCS-2, 2 bytes/character.
* ``PyUnicode_4BYTE_KIND``: Unicode range [U+0000; U+10ffff],
UCS-4, 4 bytes/character.

A Python ``str`` object must always use the most compact format. For
example, a string which only contains ASCII characters must use the
UCS-1 format.

The ``PyUnicode_KIND()`` function can be used to know the format used by
a string.

One of the following functions can be used to access data:

* ``PyUnicode_1BYTE_DATA()`` for ``PyUnicode_1BYTE_KIND``.
* ``PyUnicode_2BYTE_DATA()`` for ``PyUnicode_2BYTE_KIND``.
* ``PyUnicode_4BYTE_DATA()`` for ``PyUnicode_4BYTE_KIND``.

To get the best performance, a C extension should have 3 code paths for
each of these 3 string native formats.

Limited C API
-------------

:pep:`393` functions such as ``PyUnicode_KIND()`` and
``PyUnicode_1BYTE_DATA()`` are excluded from the limited C API. It's not
possible to write code specialized for UCS formats. A C extension using
the limited C API can only use less efficient code paths and string
formats.

For example, the `MarkupSafe project
<https://markupsafe.palletsprojects.com/>`_ has a C extension
specialized for UCS formats for best performance, and so cannot use the
limited C API.

Specification
=============

API
---

Add the following API to the limited C API version 3.14::

int32_t PyUnicode_Export(
PyObject *unicode,
int32_t requested_formats,
Py_buffer *view);
PyObject* PyUnicode_Import(
const void *data,
Py_ssize_t nbytes,
int32_t format);

#define PyUnicode_FORMAT_UCS1 0x01 // Py_UCS1*
#define PyUnicode_FORMAT_UCS2 0x02 // Py_UCS2*
#define PyUnicode_FORMAT_UCS4 0x04 // Py_UCS4*
#define PyUnicode_FORMAT_UTF8 0x08 // char*
#define PyUnicode_FORMAT_ASCII 0x10 // char* (ASCII string)

The ``int32_t`` type is used instead of ``int`` to have a well defined
type size and not depend on the platform or the compiler.
See `Avoid C-specific Types
<https://github.com/capi-workgroup/api-evolution/issues/10>`_ for the
longer rationale.

PyUnicode_Export()
------------------

API::

int32_t PyUnicode_Export(
PyObject *unicode,
int32_t requested_formats,
Py_buffer *view)

Export the contents of the *unicode* string in one of the *requested_formats*.

* On success, fill *view*, and return a format (greater than ``0``).
* On error, set an exception, and return ``-1``.
*view* is left unchanged.

After a successful call to ``PyUnicode_Export()``,
the *view* buffer must be released by ``PyBuffer_Release()``.
The contents of the buffer are valid until they are released.

The buffer is read-only and must not be modified.

The ``view->len`` member must be used to get the string length. The
buffer should end with a trailing NUL character, but it's not
recommended to rely on that because of embedded NUL characters.

*unicode* and *view* must not be NULL.

Available formats:

=================================== ======== ===========================
Constant Identifier Value Description
=================================== ======== ===========================
``PyUnicode_FORMAT_UCS1`` ``0x01`` UCS-1 string (``Py_UCS1*``)
``PyUnicode_FORMAT_UCS2`` ``0x02`` UCS-2 string (``Py_UCS2*``)
``PyUnicode_FORMAT_UCS4`` ``0x04`` UCS-4 string (``Py_UCS4*``)
``PyUnicode_FORMAT_UTF8`` ``0x08`` UTF-8 string (``char*``)
``PyUnicode_FORMAT_ASCII`` ``0x10`` ASCII string (``Py_UCS1*``)
=================================== ======== ===========================

UCS-2 and UCS-4 use the native byte order.

*requested_formats* can be a single format or a bitwise combination of the
formats in the table above.
On success, the returned format will be set to a single one of the requested
formats.

Note that future versions of Python may introduce additional formats.

No memory is copied and no conversion is done.

. _export-complexity:

Export complexity
-----------------

On CPython, an export has a complexity of *O*\ (1): no memory is copied
and no conversion is done.

To get the best performance on CPython and PyPy, it's recommended to
support these 4 formats::

(PyUnicode_FORMAT_UCS1 \
| PyUnicode_FORMAT_UCS2 \
| PyUnicode_FORMAT_UCS4 \
| PyUnicode_FORMAT_UTF8)

PyPy uses UTF-8 natively and so the ``PyUnicode_FORMAT_UTF8`` format is
recommended. It requires a memory copy, since PyPy ``str`` objects can
be moved in memory (PyPy uses a moving garbage collector).

Py_buffer format and item size
------------------------------

``Py_buffer`` uses the following format and item size depending on the
export format:

========================== ================== ============
Export format Buffer format Item size
========================== ================== ============
``PyUnicode_FORMAT_UCS1`` ``"B"`` 1 byte
``PyUnicode_FORMAT_UCS2`` ``"=H"`` 2 bytes
``PyUnicode_FORMAT_UCS4`` ``"=I"`` 4 bytes
``PyUnicode_FORMAT_UTF8`` ``"B"`` 1 byte
``PyUnicode_FORMAT_ASCII`` ``"B"`` 1 byte
========================== ================== ============

PyUnicode_Import()
------------------

API::

PyObject* PyUnicode_Import(
const void *data,
Py_ssize_t nbytes,
int32_t format)

Create a Unicode string object from a buffer in a supported format.

* Return a reference to a new string object on success.
* Set an exception and return ``NULL`` on error.

*data* must not be NULL. *nbytes* must be positive or zero.

See ``PyUnicode_Export()`` for the available formats.

UTF-8 format
------------

CPython 3.14 doesn't use the UTF-8 format internally and doesn't support
exporting a string as UTF-8. The ``PyUnicode_AsUTF8AndSize()`` function
can be used instead.

The ``PyUnicode_FORMAT_UTF8`` format is provided for compatibility with
alternate implementations which may use UTF-8 natively for strings.

ASCII format
------------

When the ``PyUnicode_FORMAT_ASCII`` format is request for export, the
``PyUnicode_FORMAT_UCS1`` export format is used for ASCII strings.

The ``PyUnicode_FORMAT_ASCII`` format is mostly useful for
``PyUnicode_Import()`` to validate that a string only contains ASCII
characters.

Surrogate characters and embedded NUL characters
------------------------------------------------

Surrogate characters are allowed: they can be imported and exported.

Embedded NUL characters are allowed: they can be imported and exported.

Implementation
==============

https://github.com/python/cpython/pull/123738

Backwards Compatibility
=======================

There is no impact on the backward compatibility, only new C API
functions are added.

Usage of PEP 393 C APIs
=======================

A code search on PyPI top 7,500 projects (in March 2024) shows that
there are many projects importing and exporting UCS formats with the
regular C API.

PyUnicode_FromKindAndData()
---------------------------

25 projects call ``PyUnicode_FromKindAndData()``:

* **Cython** (3.0.9)
* Levenshtein (0.25.0)
* PyICU (2.12)
* PyICU-binary (2.7.4)
* PyQt5 (5.15.10)
* PyQt6 (6.6.1)
* aiocsv (1.3.1)
* asyncpg (0.29.0)
* biopython (1.83)
* catboost (1.2.3)
* cffi (1.16.0)
* mojimoji (0.0.13)
* mwparserfromhell (0.6.6)
* numba (0.59.0)
* **numpy** (1.26.4)
* orjson (3.9.15)
* pemja (0.4.1)
* pyahocorasick (2.0.0)
* pyjson5 (1.6.6)
* rapidfuzz (3.6.2)
* regex (2023.12.25)
* srsly (2.4.8)
* tokenizers (0.15.2)
* ujson (5.9.0)
* unicodedata2 (15.1.0)

PyUnicode_4BYTE_DATA()
----------------------

21 projects call ``PyUnicode_2BYTE_DATA()`` and/or
``PyUnicode_4BYTE_DATA()``:

* **Cython** (3.0.9)
* **MarkupSafe** (2.1.5)
* Nuitka (2.1.2)
* PyICU (2.12)
* PyICU-binary (2.7.4)
* PyQt5_sip (12.13.0)
* PyQt6_sip (13.6.0)
* biopython (1.83)
* catboost (1.2.3)
* cement (3.0.10)
* cffi (1.16.0)
* duckdb (0.10.0)
* **mypy** (1.9.0)
* **numpy** (1.26.4)
* orjson (3.9.15)
* pemja (0.4.1)
* pyahocorasick (2.0.0)
* pyjson5 (1.6.6)
* pyobjc-core (10.2)
* sip (6.8.3)
* wxPython (4.2.1)

Rejected Ideas
==============

Reject embedded NUL characters and require trailing NUL character
-----------------------------------------------------------------

In C, it's convenient to have a trailing NUL character. For example,
the ``for (; *str != 0; str++)`` loop can be used to iterate on
characters and ``strlen()`` can be used to get a string length.

The problem is that a Python ``str`` object can embed NUL characters.
Example: ``"ab\0c"``. If a string contains an embedded NUL character,
code relying on the NUL character to find the string end truncates the
string. It can lead to bugs, or even security vulnerabilities.
See a previous discussion in the issue `Change PyUnicode_AsUTF8()
to return NULL on embedded null characters
<https://github.com/python/cpython/issues/111089>`_.

Rejecting embedded NUL characters require to scan the string which has
an *O*\ (*n*) complexity.

Reject surrogate characters
---------------------------

Surrogate characters are characters in the Unicode range [U+D800;
U+DFFF]. They are disallowed by UTF codecs such as UTF-8. A Python
``str`` object can contain arbitrary lone surrogate characters. Example:
``"\uDC80"``.

Rejecting surrogate characters prevents exporting a string which contains
such a character. It can be surprising and annoying since the
``PyUnicode_Export()`` caller doesn't control the string contents.

Allowing surrogate characters allows to export any string and so avoid
this issue. For example, the UTF-8 codec can be used with the
``surrogatepass`` error handler to encode and decode surrogate
characters.

Conversions on demand
---------------------

It would be convenient to convert formats on demand. For example,
convert UCS-1 and UCS-2 to UCS-4 if an export to only UCS-4 is
requested.

The problem is that most users expect an export to require no memory
copy and no conversion: an *O*\ (1) complexity. It is better to have an
API where all operations have an *O*\ (1) complexity.

Export to UTF-8
---------------

CPython 3.14 has a cache to encode a string to UTF-8. It is tempting to
allow exporting to UTF-8.

The problem is that the UTF-8 cache doesn't support surrogate
characters. An export is expected to provide the whole string content,
including embedded NUL characters and surrogate characters. To export
surrogate characters, a different code path using the ``surrogatepass``
error handler is needed and each export operation has to allocate a
temporary buffer: *O*\ (n) complexity.

An export is expected to have an *O*\ (1) complexity, so the idea to
export UTF-8 in CPython was abadonned.

Discussions
===========

* https://discuss.python.org/t/63891
* https://github.com/capi-workgroup/decisions/issues/33
* https://github.com/python/cpython/issues/119609

Copyright
=========

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.