
I have recently started parsing binary data with Python, but I am confused by the way Python treats "byte" items. Take, for example, the following interpreter session:

>>> f = open('somefile.gz', 'rb')
>>> f
<open file 'somefile.gz', mode 'rb' at 0xb77f4d88>
>>> bytes = f.read()
>>> bytes[0]
'\x1f'
>>> len(bytes[0])
1
>>> int(bytes[0])    # <---- calling __str__ automatically on bytes[0]?
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '\x1f'

The above session shows that bytes[0] has a length of 1, but its displayed representation is a hexadecimal escape. No worries, but when I try to treat bytes[0] as a single byte, I get funky behaviour.

If I want to parse/interpret a binary stream based on some specification, where the specification includes values given in hexadecimal, binary and decimal, how would I go about doing that?

An example would be: "the first two bytes are \xbeef, the next is a decimal 8, followed by a packed bit field where each of the 8 bits of the byte represents some flag". I guess there are a few modules out there which make this task easy, but I'd want to do it from scratch.

I have seen references to the struct module, but is there no way of checking the bytes read directly without introducing a new module? Something like bytes[0] == 0xbeef ?

Can someone please help me out with how folks normally parse binary data conforming to a specification using Python? Thanks.

    Don't worry about "introducing a new module". Many modules of the Python standard library contain core functionality that is simply separated out into a namespace of its own, but is an integral part of the language. Many of the modules are even compiled into the interpreter (at least when using CPython). Commented Feb 11, 2012 at 14:46

2 Answers

5

You're using Python 2.x. Prior to Python 3.0, reading a file, even a binary file, returns a string. What you're calling a "bytes" object is really a string. Indexing into a string as you do with "bytes[0]" just returns a 1-character string.

The struct module would probably be best suited to what you want, but you can do what you ask without it if you really want to:

"Something like bytes[0] == 0xbeef ?"

This won't work because 0xbeef is an integer that needs two bytes, whereas bytes[0] is only a single byte (a 1-character string). You can compare a two-byte slice instead:

bytes[0:2] == b'\xbe\xef' 
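If you want to compare individual bytes against integers such as 0xbe without struct, one option is to wrap the data in a bytearray, which yields integers when indexed in both Python 2.6+ and Python 3. The following is only a sketch: the sample bytes and the variable names are made up for illustration and stand in for whatever f.read() returns.

```python
# Made-up sample data standing in for the result of f.read().
data = b'\xbe\xef\x08\xa5'

# Slices compare against byte-string literals in both Python 2 and 3.
magic_ok = data[0:2] == b'\xbe\xef'

# Indexing a bytearray yields ints in both Python 2.6+ and Python 3.
buf = bytearray(data)
first_is_be = buf[0] == 0xbe
count = buf[2]

# Combine two bytes by hand, treating the first one as most significant
# (big-endian), to get a single 16-bit integer.
magic = (buf[0] << 8) | buf[1]
```

Doing the shift by hand makes the byte order explicit, which is the part that a plain comparison of single bytes cannot express.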

In Python 3.x, things work a little bit more like you'd expect. Reading a binary file returns a bytes object that behaves like a sequence of 1-byte unsigned integers, not as a string.
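As a rough illustration of the Python 3 behaviour (this sketch is Python 3 only), here are the first three bytes of a gzip header typed in as a literal rather than read from a file; the variable names are just for illustration.

```python
# Python 3 only: first three bytes of a gzip header
# (magic bytes 0x1f 0x8b, then compression method 8).
data = b'\x1f\x8b\x08'

first = data[0]      # an int in Python 3 (a 1-character str in Python 2)
is_gzip = data[0] == 0x1f and data[1] == 0x8b
method = data[2]     # 8 means "deflate" in the gzip format

# A one-byte slice is still a bytes object, not an int.
one_byte = data[0:1]

# int.from_bytes (Python 3.2+) combines several bytes into one integer.
magic = int.from_bytes(data[0:2], 'big')
```

Note the difference between indexing (data[0], an int) and slicing (data[0:1], a bytes object); mixing the two up is a common source of comparisons that are silently False.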


4 Comments

Wait a sec, I thought bytes were Python 3.x only, what is that b specifier in front of the hex values?
@sasuke: If you just open up your Python 2.7 console and try to put it in there, you'll notice that it also works there. Thus, it's upwards compatible to write it in this way (a Good Thing).
@SvenMarnach Nobody here is claiming that bytes are mutable. Perhaps you're confusing == with =? We're talking about testing whether the first two bytes in a sequence equal some value, not mutating the sequence.
Sorry, not sure how I got this one wrong. (Had to do a bogus edit to be able to take back my downvote.)
2

If you want to parse binary data, check out the struct module. Here is an example from the docs:

>>> from struct import *
>>> pack('hhl', 1, 2, 3)
'\x00\x01\x00\x02\x00\x00\x00\x03'
>>> unpack('hhl', '\x00\x01\x00\x02\x00\x00\x00\x03')
(1, 2, 3)
>>> calcsize('hhl')
8

Learn more about unpack :)

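So if you want to read the first 2 bytes as an unsigned short and test it against 0xbeef (unpack always returns a tuple, so take its first element; '>' selects big-endian byte order, so the result does not depend on the machine):

import struct
struct.unpack('>H', bytes[0:2])[0] == 0xbeef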
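To tie this back to the kind of specification described in the question (a 16-bit magic value, a one-byte count, then a byte of eight flags), here is a minimal sketch using struct. The 4-byte layout, the function name parse_header, and the choice of big-endian order with bit 0 as the least significant flag are all assumptions for illustration, not part of any real format.

```python
import struct

def parse_header(data):
    """Parse a hypothetical 4-byte header: 16-bit magic, count byte, flags byte."""
    # '>' = big-endian, 'H' = unsigned 16-bit, 'B' = unsigned 8-bit.
    magic, count, flags = struct.unpack('>HBB', data[:4])
    if magic != 0xbeef:
        raise ValueError('bad magic: 0x%04x' % magic)
    # Expand the packed flags byte into 8 booleans, bit 0 (least significant) first.
    flag_bits = [bool((flags >> i) & 1) for i in range(8)]
    return count, flag_bits

# Made-up sample: magic 0xbeef, count 8, flags 0b00000101.
count, flag_bits = parse_header(b'\xbe\xef\x08\x05')
```

With the sample bytes above, count is 8 and only flags 0 and 2 are set. Validating the magic value first means the rest of the parser can assume it is looking at the expected format.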

5 Comments

Hmm...but I haven't used ord anywhere? Maybe I was unclear: is there a way wherein I can do bytes[0] == 0xbeef or more specifically, why isn't it allowed?
@sasuke: Strings in Python are immutable, so you can't change them. (Moreover, even if you could, you couldn't assign a 16-bit number to its first byte because it simply wouldn't fit.) I suggest using a list while building your data, and finally pack() it into a binary string.
Well, bytes[0] is 1 byte long, while 0xbeef is 2 bytes long. In addition, bytes[0] is a string while 0xbeef is an int. Without struct, you need to do something like map(ord, bytes[0:2]) == [0xbe, 0xef], but that doesn't handle little/big endian for you.
So if you want to read a short (2 bytes long) and do the test, you can do: struct.unpack('>H', bytes[0:2])[0] == 0xbeef.
Sorry for the confusion guys, I meant comparing stuff like bytes[0] == 0xbe. @Sven: For the time being, I'm looking to "unpack" stuff from binary stream and make sure it follows the specification.
