Any byte sequences that can not be present in valid x86 code?

Question

Any byte sequences that can not be present in valid x86 code?

I'm looking for a byte sequence (or sequences), to inject into an x86 program compiled using GCC, that can not show up in the binary as a by product of compilation.

The reason is that I want these byte sequences to act as "labels", so that I can recognize them later during inspection.

Is it possible to construct patterns of bytes, so that, searching through the binary, these patterns will not show up except with very small probability (I prefer probability zero). In other words, I want to minimize the number of false positives!

Do you have control of the program's source code? If you can ensure that there aren't any goto loops, you can insert code like A: jmp B followed later by B: jmp A. It's unlikely that any compiler would generate such code in the absence of gotos. — Mark Plotnick
– Mark Plotnick, Commented Apr 30, 2017 at 2:08

Margaret Bloom · Accepted Answer · 2017-04-29 16:29:28Z

There are sequences that today are not a valid encoding of any instruction.
Rather than digging in the opcode table present in the Intel Manual 2 you can exploit two facts of the x86 architecture:

These should also be more stable across generations than reserved opcodes.

The sequence 666666666666666666666666666666 (15 operand-size override prefixes, but any prefix will do) will generate an #UD exception because it is invalid.

For what it's worth, there is a specific instruction that fulfills the role of invalid instruction: ud2.
It's presence in a binary module is possible but its more idiomatic than an invalid encoding and it is standard, for example Linux uses it to mark a bug for if ud2 is the execution flow, the code behind it cannot be valid.

That said, if I got you right, that's not going to be useful to you.
You want to skip the process of decoding the instructions and scan the code section of the binary instead.

There is no guarantee that the code section will contain only code, for example ARM compilers generate literal pools - that's definitively uncommon on x86 though.
However the compilers usually align functions to a specific boundary (usually 16 bytes), this can be done in several ways - like stretching the previous function or with a mere padding.
This padding can be a sequence of bytes of any value - hence arbitrary bytes can be present in the code section.

Long story short, there is no universal byte sequence that appear with probability zero in the code section.
Everything that it's not in the execution flow can have any value.

We will deal with probability later, for now lets assume the 66..66h appears rarely enough in an executable.
You can't just use it directly, as 66..66h can be part of two instructions and thus be a valid sequence:

mov rax, 6666666666666666h db 66h, 66h, 66h , 66h db 66h, 66h, 66h nop

is valid.
This is due to the immediate operands of instructions - the biggest immediate can be 8 bytes in length (as today), so the sequence must be lengthen to 15 + 8 = 23 bytes.
If you really want to be safe again future features, you can use a sequence of 14 + 15 = 29 bytes (for the 15-byte instruction length limit).

It's possible to find 23/29 bytes of value 66h in the code section or in the whole binary.
But how probable is that?
If the bytes in a binary were uniformly random then the probability would be astronomically small: 256^-23 = 2^-184.

Well, the point is that the bytes in a binary are not uniformly random.
You can open a file with an embedded icon to confirm that.

You can make the probability arbitrarily small by stretching the sequence - it's up to you to find a compromise between the length and an acceptable number of false positives.

It's unclear what you want to do but here some advice:

Most, if not all, building tools support generating a map file.
It is a file with all the symbols/names and their addresses.
If you could use actual labels (with a prefix and a random suffix) you'd collect them easily after the build.
Most output formats can be enriched with meta-information.
You can add an ELF/PE section with a table of offsets to the locations you want to mark.

Thanks for an excellent answer. What I want to do is to embed watermark pieces into the executable. However, to be of any use, I must somehow be able to recognize them again even in the presence of an attacker. I cannot rely on debugging symbols, since these can just be stripped. Also, I cannot just write down the positions of the pieces within the binary, since the attacker could shift the binary to relocate things. Therefore, I've chosen to mark my pieces with a byte sequence, but in order to retrieve as few false positives, I need a sequence that's not likely to be present.
@Shuzheng I'm a bit out of context, you can surely try a few things and see what works best :)
Don't you need a sequence of length 16 to ensure invalidity? I mean, there could be an instruction (starting with 1 byte) taking up 14 of its bytes?
@Shuzheng With 15 prefixes there is no space left for an opcode, thus the sequence is invalid.
What I mean is: could there be some instruction starting at a position before the 0x66 sequence that takes up some of sequences' bytes? In this case, since an instruction can at most be 15 bytes long, an instruction starting one byte before the sequence could (maybe) take up 14 of the 15 bytes? Since, two prefixes of the same group generates an error, I would say we need a 0x66 sequence of 16 bytes to ensure the next instruction has two of these prefix's?

Collectives™ on Stack Overflow

Any byte sequences that can not be present in valid x86 code?

1 Answer 1

5 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

5 Comments

Linked

Related