Many Linux distributions have switched, or will soon switch, to Python 3 as the default interpreter behind the python command. At the same time, many people are still used to Python 2 and might not be aware of the changes that come with that switch. String and byte handling in particular differs from Python 2. This article covers some common pitfalls and how to avoid them in Python 3.
A simple example
Let's start with a simple piece of code.
$ python2 -c 'print("A \x80")' | xxd
00000000: 4120 800a
All we want is to generate the byte sequence "A" + " " + 0x80 + newline. We pipe the result directly into xxd so that we can check whether we get what we want. As we can see, the xxd output looks fine.
Now we do the same again, this time with Python 3.
$ python3 -c 'print("A \x80")' | xxd
00000000: 4120 c280 0a
As you can see, for some reason another byte, 0xc2, was added between our byte 0x20 (space) and byte 0x80. The attentive geek might already have spotted what's going on: Python 3 treated our input as a text string and encoded it as UTF-8 on output. We can easily verify that manually.
The Unicode code point that represents our character is U+0080. To convert this to UTF-8 we need to follow an encoding rule as described in this Wikipedia article. In our case we have to use the two-byte pattern 110xxxxx 10xxxxxx. All we need to do is shift the bits of the Unicode code point into this pattern from the right, replacing only the "x" bits (step 1), and then fill the remaining "x" bits with 0 (step 2). Let's do it...
Hex:            0x0080
Bin:            1000 0000
Encoding rule:  110x xxxx 10xx xxxx
Step 1:         110x xx10 1000 0000
Step 2/Result:  1100 0010 1000 0000 = 0xc2 0x80
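The two steps above can be sketched in Python. This is only an illustration: the helper name encode_two_byte_utf8 is made up for this article, and it deliberately handles nothing but the two-byte range U+0080 to U+07FF. The result is compared against Python's built-in str.encode().

```python
def encode_two_byte_utf8(code_point):
    """Encode a code point in U+0080..U+07FF with the 110xxxxx 10xxxxxx rule."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only the two-byte range U+0080..U+07FF is handled here")
    # The low six bits fill the "x" bits of the continuation byte (10xxxxxx).
    low = 0b10000000 | (code_point & 0b00111111)
    # The remaining bits fill the leading byte (110xxxxx); unused "x" bits stay 0.
    high = 0b11000000 | (code_point >> 6)
    return bytes([high, low])


encoded = encode_two_byte_utf8(0x80)
print(encoded.hex())                      # c280
print(encoded == "\x80".encode("utf-8"))  # True
```

Code points outside that range need the one-, three- or four-byte patterns, which this sketch leaves out.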
So this looks exactly like the output Python 3 gave us. The question now is: why did that happen?
Why is that?
In Python 3, strings are sequences of Unicode code points, and they are encoded when written out, on most modern systems as UTF-8. So Python 3 encoded our "string" as UTF-8 before it printed it. What we gave Python were not only ASCII characters but also the non-ASCII character 0x80. As you might know, ASCII is basically a subset of UTF-8: all ASCII characters (everything less than or equal to 0x7f) look the same in UTF-8, and only characters from 0x80 upward look different. Now that we know that, how can we achieve the same with Python 3 as we did with Python 2?
The answer is: if you want to use arbitrary byte sequences in Python 3, you need to wrap your bytes in b' '. Let's try it.
$ python3 -c "print(b'A \x80')" | xxd
00000000: 6227 4120 5c78 3830 270a                 b'A \x80'.
Unfortunately this doesn't work. As you can see, instead of 0x41 0x20 0x80 0x0a we get the literal representation b'A \x80'. This is not what we wanted. It turns out the proper way to do it is this:
$ python3 -c 'import sys; sys.stdout.buffer.write(b"A \x80\x0a")' | xxd
00000000: 4120 800a
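To make the difference between the two attempts a bit more tangible, here is a minimal sketch. The helper write_raw is a made-up name for this article; it simply writes bytes to a binary stream and falls back to sys.stdout.buffer. An in-memory io.BytesIO stream stands in for stdout so the written bytes can be inspected.

```python
import io
import sys

payload = b"A \x80\n"

# print() converts its argument with str(); for bytes that is the literal form.
print(str(payload))           # b'A \x80\n'


def write_raw(data, stream=None):
    """Write bytes unchanged to a binary stream (stdout's buffer by default)."""
    if stream is None:
        stream = sys.stdout.buffer
    stream.write(data)
    stream.flush()


# Use an in-memory binary stream here so the result can be checked.
sink = io.BytesIO()
write_raw(payload, sink)
print(sink.getvalue().hex())  # 4120800a
```

Calling write_raw(payload) without a stream sends the same four bytes to the terminal or pipe.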
Now that we solved this mystery, let's have a look at another piece of code.
Playing with os.system()...
Imagine the following hacky example.
#!/usr/bin/env python

import os

cmd='echo'
arg=b'\x80'
out='/tmp/foobar'
os.system(cmd+' '+arg+' > '+out)
So we want Python to run an echo command with byte 0x80 as its argument. Then we want to write that byte to /tmp/foobar.
With python2 you get:
$ cat /tmp/foobar | xxd
00000000: 80
Looks like what we expected. Now the same code interpreted with python3...
$ python3 ./test.py
Traceback (most recent call last):
  File "./test.py", line 8, in <module>
    os.system(cmd+' '+arg+' > '+out)
TypeError: must be str, not bytes
...we get a TypeError and no output file, because Python 3 refuses to concatenate str and bytes objects. As we have just learned, text strings and arbitrary bytes are different things in Python 3.
Instead you need to do it like this:
#!/usr/bin/env python

import os

cmd=b'echo -n'
arg=b'\x80'
out=b'/tmp/foobar'
os.system(cmd+b' '+arg+b' > '+out)
Let's check with xxd:
$ cat /tmp/foobar | xxd
00000000: 80
That looks right.
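If some parts of a command arrive as str and others as bytes, one possible approach is to convert everything to bytes before joining. The sketch below assumes a made-up helper called build_command and uses os.fsencode(), which encodes str with the filesystem encoding and passes bytes through unchanged. It only builds the command and does not run it.

```python
import os


def build_command(*parts):
    """Join command parts with spaces, converting str parts to bytes first."""
    return b" ".join(os.fsencode(part) for part in parts)


command = build_command("echo -n", b"\x80", ">", "/tmp/foobar")
print(command)  # b'echo -n \x80 > /tmp/foobar'

# Concatenating the two types directly fails in Python 3:
try:
    "echo " + b"\x80"
except TypeError as error:
    print("TypeError:", error)
```

The resulting bytes object could then be handed to os.system() just like in the listing above.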
What have we learned? Don't mix strings with byte sequences in Python 3 or you will end up with a mess. :)