What does it do? It opens a file, reads it, and then prints the length
of what it read plus "an integer representing the Unicode code point
of" the first character in that file.
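Roughly speaking, a minimal sketch of such a test.py could look like
this (the original script isn't reproduced here, so treat this as an
assumption about its shape):

#!/usr/bin/env python3

# Read the file and print the length of the result plus the code
# point of its first character in hex.
with open('foo', 'r') as fp:
    content = fp.read()

print(f'{len(content)}, 0x{ord(content[0]):X}')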
So, after consulting "man ascii" and doing an "echo a >foo", you'd
expect this to print "2, 0x61" (length of 2 due to the final newline).
*A lot* of Python code I've seen and written does a simple
"open(filename, mode)". But, uhm ... The type of the variable "content"
is "str", which, in Python, means a "sequence of Unicode code points".
In other words, Python *decodes* the file you're reading on the fly. But
according to which encoding? ASCII? UTF-8? Something else? Well, you
can't tell from the code alone; it is platform-specific and depends on
your locale.
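To see what "platform-specific" means on your machine: without an
explicit encoding, open() falls back to the locale's preferred
encoding, which you can inspect like this (a quick sketch):

import locale

# This is the codec that text-mode open() picks by default.
print(locale.getpreferredencoding(False))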
Let's make the example more obvious. At a shell prompt, do this:
$ printf '\360\237\220\247\n' >foo
This writes 5 bytes to the file. On my system, running the Python script
now prints:
$ ./test.py
2, 0x1F427
Python decoded the file and "content" is now a "str" of length 2. It
holds exactly one Unicode code point for the penguin emoji, plus the
final newline. In my case, Python decoded the file using UTF-8, because
I'm using a UTF-8 locale (`en_US.UTF-8`).
But if you use another locale, you might get this:
$ LANG=en_US.ISO-8859-1 ./test.py
5, 0xF0
Completely different result.
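You can reproduce the difference without touching a file by decoding
the same five bytes with both codecs (a small sketch):

# The exact bytes that the printf above wrote.
data = b'\xf0\x9f\x90\xa7\n'

# UTF-8 collapses the first four bytes into one code point ...
print(len(data.decode('UTF-8')))        # 2
# ... ISO-8859-1 maps every single byte to its own code point.
print(len(data.decode('ISO-8859-1')))   # 5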
If you want to force Python to use UTF-8, you have to do this:
with open('foo', 'r', encoding='UTF-8') as fp:
    content = fp.read()
Now the calls above show this:
$ ./test.py
2, 0x1F427
$ LANG=en_US.ISO-8859-1 ./test.py
2, 0x1F427
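If you're ever unsure which codec a text-mode file object ended up
with, you can also just ask it (a quick sketch):

with open('foo', 'r') as fp:
    # .encoding reports the codec in use, whether it was given
    # explicitly or picked from the locale.
    print(fp.encoding)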
To be honest, I wasn't aware of this platform-specific behaviour and I
assumed that Python defaulted to UTF-8 here. Well, now I know that it
doesn't.