bytes != str ... a few notes
John Machin
2008-12-15 11:03:35 UTC
In Python 3, b'abc' != 'abc' (and rightly so). I have scribbled down
some notes (below) which may be useful to others. Is there a "porting
tips" wiki or other public place for posting this kind of thing?

A couple of minor points on other topics:

1. It would have been nice if 3.x ord(a_byte) just quietly returned
a_byte; porters would not have needed to change anything.

2. bytes.join() and bytearray.join() exist and work (on an iterable
which may contain a mixture of bytes and bytearray objects) just as
extrapolation from str.join would lead you to expect, but the help needs
a little fixing and there's no mention of them in the Library Reference
Manual. I've raised a bug report: http://bugs.python.org/issue4669
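
Both minor points can be checked in a 3.x interpreter. The following is only a sketch: byte_ord is a hypothetical helper name (nothing like it exists in the standard library), shown as one way a porter might paper over the ord() difference.

```python
# Python 3 sketch; byte_ord is a hypothetical helper, not a stdlib function.

def byte_ord(b):
    """Return the integer value of a byte, whether b is already an int
    (as produced by indexing bytes in 3.x) or a length-1 bytes/str."""
    return b if isinstance(b, int) else ord(b)

data = b"\xd0\xcf"
first = data[0]            # indexing bytes yields an int in 3.x

try:
    ord(first)             # ord() does not accept an int
    ord_accepts_int = True
except TypeError:
    ord_accepts_int = False

# join() accepts a mixture of bytes and bytearray items; the result has
# the type of the separator it was called on.
parts = [b"abc", bytearray(b"def")]
joined_bytes = b"-".join(parts)
joined_bytearray = bytearray(b"-").join(parts)
```

With a helper like this, code that loops over a bytes object can call byte_ord() on each element without caring whether it received an int or a one-character string.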

Cheers,
John

=== Comparing bytes objects with str objects ===

In Python 3.x, a bytes object will never ever compare equal to a str object.

Porter's problem (example):

data = open(fpath, "rb").read(8)
OLE2_SIG = "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1"
if data == OLE2_SIG:
    # This point is unreachable in 3.x, because data is bytes (it was
    # read from a file opened in binary mode) and OLE2_SIG is a str
    # object.
    pass

Solution for "simple" porting:
OLE2_SIG = b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1"
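
To see the difference without a real file, here is a small 3.x sketch. The io.BytesIO stream is only a stand-in for open(fpath, "rb"), and the names STR_SIG and BYTES_SIG are made up for the illustration.

```python
import io

# Stand-in for open(fpath, "rb"): a binary stream that starts with the signature.
stream = io.BytesIO(b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1" + b"rest of file")
data = stream.read(8)

STR_SIG = "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1"     # str: never equal to bytes in 3.x
BYTES_SIG = b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1"  # bytes: compares as intended

matches_str = (data == STR_SIG)      # silently False in 3.x
matches_bytes = (data == BYTES_SIG)  # True
```

Note that the str comparison does not raise; it just evaluates to False, which is why the bug is easy to miss.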

A tentative solution when maintaining one codebase which runs as is on
2.x and from which 3.x code is generated:

# ... excerpt from "include file"
if python_version >= (3, 0):
    def STR2BYTES(x, encoding='latin1'):
        return x.encode(encoding)
else:
    def STR2BYTES(x):
        return x

# ... changed code
OLE2_SIG = STR2BYTES("\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1")
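
For anyone wanting to try the excerpt as is, here is a self-contained sketch. It assumes python_version is the (major, minor) tuple taken from sys.version_info, which the excerpt does not show, and it gives the 2.x branch the same (ignored) encoding parameter so both branches share one signature; that last detail is my variation, not part of the excerpt.

```python
import sys

# Assumption: python_version is the (major, minor) tuple of the running interpreter.
python_version = sys.version_info[:2]

if python_version >= (3, 0):
    def STR2BYTES(x, encoding='latin1'):
        # 3.x: a str literal must be encoded to get bytes.
        return x.encode(encoding)
else:
    def STR2BYTES(x, encoding='latin1'):
        # 2.x: a plain str literal is already a byte string.
        return x

OLE2_SIG = STR2BYTES("\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1")
```

latin1 is a convenient default here because it maps each code point below 256 to the byte with the same value, so the \x escapes in the literal come through unchanged.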

How to find cases of this problem:
1. Can't be detected by Python 2.6 -3 option.
2. Can't be detected/handled by 2to3 script.
3. Is detected by Python 3.x -b (warn) and -bb (error) command-line
options to check for bytes/str [in]equality comparisons. [Aside: these
options are documented in the expected place but not mentioned in the
porting notes
(http://docs.python.org/dev/py3k/whatsnew/3.0.html#porting-to-python-3-0)]
4. Should be detected by your tests but the point where the test fails
may be some distance from the actual comparison.
5. Search your code for bytesy things like \x and \0.
6. Read your code (but turn 2.x mindset off because if you don't, the
code will look just fine!).
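
Point 3 in the list above can be tried out with a short script that launches child interpreters with -b and -bb. This is a sketch for a recent 3.x (subprocess.run with capture_output needs 3.7 or later); SNIPPET and run_with are names invented for the illustration.

```python
import subprocess
import sys

# A one-line program that performs a bytes/str comparison.
SNIPPET = "print(b'abc' == 'abc')"

def run_with(flag):
    """Run SNIPPET in a child interpreter started with the given flag."""
    return subprocess.run(
        [sys.executable, flag, "-c", SNIPPET],
        capture_output=True,
        text=True,
    )

warned = run_with("-b")    # comparison still runs; a BytesWarning goes to stderr
failed = run_with("-bb")   # the BytesWarning is raised as an error instead
```

Under -b the child prints False and exits normally, with the warning on stderr; under -bb it exits with a nonzero status and a traceback that names BytesWarning.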

=== end of screed ===
Michael Watkins
2008-12-15 18:01:50 UTC
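Post by John Machin
=== Comparing bytes objects with str objects ===
A tentative solution when maintaining one codebase which runs as is on
2.x and from which 3.x code is generated:
    if python_version >= (3, 0):
        def STR2BYTES(x, encoding='latin1'):
            return x.encode(encoding)
    else:
        def STR2BYTES(x): return x
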
Perhaps your STR2BYTES function should test whether "x" is already a
byte string, to avoid recasting errors. As it stands, if "x" is recast
later down the road by some other chunk of code that is oblivious to the
earlier conversion, the call fails in 3.x:

>>> STR2BYTES(something_already_a_byte_string)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in STR2BYTES
AttributeError: 'bytes' object has no attribute 'encode'

An approach similar to yours is what the authors of Durus, a ZODB-like
Python Object Database, have done. They add an isinstance(s, byte_string)
test to avoid any attempt at re-encoding a byte string (which would lead
to an AttributeError in 3.x, since a bytes object has no "encode"
method).

Sadly the same is not true in 2.x, where str objects do have an "encode"
method.

Browse the relevant module:

http://www.mems-exchange.org/software/durus/Durus-3.8.tar.gz/Durus-3.8/utils.py

Or peek at this snippet from within::

    if sys.version < "3":
        from __builtin__ import xrange
        from __builtin__ import str as byte_string
        def iteritems(x):
            return x.iteritems()
        def next(x):
            return x.next()
        from cStringIO import StringIO as BytesIO
        from cPickle import dumps, loads, Unpickler, Pickler
    else:
        xrange = range
        from builtins import next, bytearray, bytes
        byte_string = (bytearray, bytes)
        def iteritems(x):
            return x.items()
        from io import BytesIO
        from pickle import dumps, loads, Unpickler, Pickler

    def as_bytes(s):
        """Return a byte_string produced from the string s."""
        if isinstance(s, byte_string):
            return s
        else:
            return s.encode('latin1')

    empty_byte_string = as_bytes("")
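
The payoff of the isinstance test is that as_bytes can safely be applied twice. Here is a cut-down, 3.x-only sketch of the same idea (it drops the 2.x branch, so it is not the Durus module itself):

```python
# 3.x-only sketch; byte_string mirrors the tuple used in the 3.x branch above.
byte_string = (bytearray, bytes)

def as_bytes(s):
    """Return s unchanged if it is already bytes-like, else encode it."""
    if isinstance(s, byte_string):
        return s
    return s.encode('latin1')

once = as_bytes("\xD0\xCF")   # str in, bytes out
twice = as_bytes(once)        # already bytes: returned as is, no AttributeError
```

Code that does not know whether a value has been converted yet can therefore call as_bytes unconditionally.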

I wish it were as easy as searching for '\x'-y looking literals to find
areas that will work in 2 but fail in 3. That's a start but there are
little surprises to find elsewhere. Consider the following, which of course
works in 2.x but fails in 3.x:

>>> import hashlib
>>> x = ":".join(('1','plus','two'))
>>> x
'1:plus:two'
>>> hashlib.md5(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object supporting the buffer API required
>>> hashlib.md5(as_bytes(x))
>>> hashlib.md5(as_bytes(x)).hexdigest()
'4e3a3a8075a6982177c24af5179ec82c'

Failing code and failing unit tests ought to pick up most of these sorts
of issues.
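
As a script rather than an interactive session, the hashlib case looks roughly like this in 3.x. It is a sketch: the exact wording of the TypeError message differs between 3.x releases, so only the exception type is checked, and encode('latin1') stands in for the as_bytes helper.

```python
import hashlib

x = ":".join(('1', 'plus', 'two'))

try:
    hashlib.md5(x)                 # a str is rejected in 3.x
    str_accepted = True
except TypeError:
    str_accepted = False

# Encoding first gives hashlib the bytes it needs.
digest = hashlib.md5(x.encode('latin1')).hexdigest()
```

A unit test that feeds a str into the hashing code would trip over the TypeError straight away, which is the kind of failure the tests should surface.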
