Python 2 vs Python 3: Bytes and Strings

Many Linux distributions have already switched, or will soon switch, to Python 3 as the default interpreter behind the python command. At the same time, many people are still used to Python 2 and might not be aware of the changes that come with that switch. String and byte handling in particular differs from Python 2. This article covers some common pitfalls and how to avoid them in Python 3.

A simple example

Let's start with a simple piece of code.

$ python2 -c 'print("A \x80")' | xxd 
00000000: 4120 800a   

So all we want is to generate the byte sequence "A" + " " + 0x80 + newline. We pipe the result directly into xxd so that we can check whether we get what we want. As we can see, the xxd output looks fine.

Now we do the same again, this time with Python 3.

$ python3 -c 'print("A \x80")' | xxd
00000000: 4120 c280 0a   

As you can see, for some reason another byte, 0xc2, was added between our byte 0x20 (space) and byte 0x80. The attentive geek might already have spotted what's going on: Python 3 encoded our string as UTF-8. We can easily verify that manually.
The Unicode code point that corresponds to the character "\x80" is U+0080. To convert it to UTF-8 we need to follow the encoding rules described in the Wikipedia article on UTF-8. For code points from U+0080 to U+07FF the two-byte pattern applies: 110xxxxx 10xxxxxx. All we need to do is shift the bits of the code point into this pattern from the right, replacing only the "x" bits (step 1), and then fill the remaining "x" bits with 0 (step 2). Let's do it...

Hex: 0x0080
Bin: 1000 0000

Encoding Rule: 110x xxxx 10xx xxxx
Step 1:        110x xx10 1000 0000
Step 2/Result: 1100 0010 1000 0000
                = 0xc2     = 0x80 
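
The same two steps can be checked in a few lines of Python. This is only a sketch: the helper name encode_two_byte is made up for illustration, and it handles nothing but the two-byte range U+0080 to U+07FF. The result is compared against Python's built-in UTF-8 encoder.

```python
def encode_two_byte(code_point):
    """Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only the two-byte UTF-8 range is handled here")
    # First byte: marker 110xxxxx plus the upper 5 payload bits.
    first = 0b11000000 | (code_point >> 6)
    # Second byte: marker 10xxxxxx plus the lower 6 payload bits.
    second = 0b10000000 | (code_point & 0b00111111)
    return bytes([first, second])


manual = encode_two_byte(0x80)
builtin = "\x80".encode("utf-8")
print(manual.hex(), builtin.hex())  # c280 c280
```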

So this looks exactly like the output Python 3 gave us. The question now is why that happened.

Why is that?

In Python 3 a string (str) is a sequence of Unicode code points, and print() encodes it with the encoding of standard output, which is UTF-8 on most modern Linux systems. So Python 3 encoded our "string" as UTF-8 before writing it out. In Python 2, by contrast, a plain string literal is a byte string, so the byte 0x80 was written unchanged. What we gave Python 3 were not only ASCII characters but also the non-ASCII character 0x80. As you might know, ASCII is a subset of UTF-8: all ASCII characters (everything up to and including 0x7f) are encoded as the same single byte in UTF-8, and only characters from 0x80 upwards look different. Now that we know that, how can we achieve the same result with Python 3 as we did with Python 2?
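
A short sketch of that difference, using only the built-in str.encode() method. The variable names are illustrative. The Latin-1 line is an aside that is not part of the original example: Latin-1 maps the code points U+0000 to U+00FF to the single byte of the same value, which is one way to turn such a string back into the raw byte.

```python
text = "A \x80"                        # a str: three Unicode code points

# ASCII characters encode to the same single byte in UTF-8 ...
ascii_part = "A ".encode("utf-8")      # b'A '
# ... while U+0080 needs two bytes.
utf8 = text.encode("utf-8")            # b'A \xc2\x80'

# Latin-1 keeps every code point below 0x100 as one byte.
latin1 = text.encode("latin-1")        # b'A \x80'

print(len(text), len(utf8), len(latin1))  # 3 4 3
```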

The solution...

The answer is: if you want to use arbitrary byte sequences in Python 3, you need to write them as a bytes literal, b'...'. Let's try it.

$ python3 -c "print(b'A \x80')" | xxd                                 
00000000: 6227 4120 5c78 3830 270a                 b'A \x80'.

Unfortunately this doesn't work. As you can see, instead of 0x41 0x20 0x80 0x0a we get the literal representation b'A \x80', because print() converts the bytes object to its text form first. This is not what we wanted. It turns out the proper way to do it is this:

$ python3 -c 'import sys; sys.stdout.buffer.write(b"A \x80\x0a")' | xxd
00000000: 4120 800a
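
The print() result earlier comes from print() calling str() on its argument, which for a bytes object yields its literal representation. The sketch below shows that and wraps the sys.stdout.buffer.write() approach in a small helper. The name write_raw and its stream parameter are assumptions for illustration, added so the helper can be tried against an in-memory buffer instead of the terminal.

```python
import io
import sys


def write_raw(data, stream=None):
    """Write bytes unchanged to a binary stream (default: stdout's buffer)."""
    if stream is None:
        stream = sys.stdout.buffer
    stream.write(data)
    stream.flush()


# This is the text print() would emit, not the raw bytes.
as_text = str(b"A \x80")

# Capture the raw bytes in memory to inspect them.
captured = io.BytesIO()
write_raw(b"A \x80\n", captured)
raw = captured.getvalue()
```

Calling write_raw(b"A \x80\n") without a stream should behave like the one-liner above.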

Now that we solved this mystery, let's have a look at another piece of code.

Playing with os.system()...

Imagine the following hacky example.

#!/usr/bin/env python

import os

cmd = 'echo -n'      # -n suppresses echo's trailing newline
arg = b'\x80'        # the b prefix is accepted by Python 2, where this is a plain str
out = '/tmp/foobar'
os.system(cmd+' '+arg+' > '+out)

So we want Python to run an echo command with the byte 0x80 as its argument and redirect the output to /tmp/foobar.

With python2 you get:

$ cat /tmp/foobar| xxd
00000000: 80   

Looks like what we expected. Now the same code interpreted with python3...

$ python3 ./test.py          
Traceback (most recent call last):
  File "./test.py", line 8, in <module>
    os.system(cmd+' '+arg+' > '+out)
TypeError: must be str, not bytes

...we get nothing but a TypeError, because Python 3 refuses to concatenate str and bytes objects, whatever their content. As we have just learned, text and arbitrary bytes are two different types.

Instead you need to do it like this:

#!/usr/bin/env python

import os

# Every part of the command is a bytes object, so concatenation works.
cmd = b'echo -n'
arg = b'\x80'
out = b'/tmp/foobar'
os.system(cmd+b' '+arg+b' > '+out)

Let's check with xxd again...

$ cat /tmp/foobar| xxd
00000000: 80   

That looks right.
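
If some of the pieces arrive as str (a file name read from a config file, say), they have to be converted before they can be concatenated with bytes. A minimal sketch, assuming a hypothetical helper named build_command; it relies on os.fsencode(), which returns bytes unchanged and encodes str with the filesystem encoding.

```python
import os


def build_command(cmd, arg, out):
    """Return a shell command line as bytes; each part may be str or bytes."""
    # os.fsencode() passes bytes through and encodes str to bytes.
    parts = [os.fsencode(p) for p in (cmd, arg, out)]
    return parts[0] + b" " + parts[1] + b" > " + parts[2]


# Mixing the two types directly fails regardless of the content.
try:
    "echo -n " + b"\x80"
    mixed_ok = True
except TypeError:
    mixed_ok = False

command = build_command("echo -n", b"\x80", "/tmp/foobar")
# os.system(command) would then run it through the shell.
```

The helper only builds the command line; it does no shell quoting, so it is suitable for fixed, trusted parts like the ones in this example.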

Lessons learned

What have we learned? Don't mix strings with byte sequences in Python 3: decide which of the two you are working with and convert explicitly, or you will end up with a mess. :)
