Revision 58c8ef06-cd01-4508-a58f-9968157b58b0 - Retrocomputing Stack Exchange

The 8087 has two components: a control unit and an execution unit. The control unit handles bus operations and interactions with the 8086; the execution unit performs the actual floating-point operations. As you surmise, the execution unit can’t handle two operations at once, so the execution flow has to wait until one floating-point operation completes before the next one can be submitted. That’s the first synchronisation requirement. The second arises when an 8086 instruction reads memory that the 8087 is supposed to write: the 8086 has to wait for the 8087 to finish, otherwise the read is likely to return rubbish.

To make things simple, the rule of thumb is that every 8087 instruction needs a `wait` or `fwait` (opcode 0x9B) *before* it. This is easier than placing the wait after the instruction, although an `fwait` is also required after FPU instructions that write to memory. Some instructions, in particular `finit` and instructions that only reflect the state of the control unit (`fstsw`, `fstcw`, `fldcw`, `fstenv`, and `fldenv`), include a `wait` opcode in their documented encoding, so they don’t need an explicit `wait` in front of them. They don’t need a subsequent `wait` either, because the execution unit is ready with no delay. You can spot these instructions by looking for their non-waiting alternative, whose mnemonic starts with `fn` (`fninit` etc.).

Commercial 8086 assemblers would generally insert `fwait`s as appropriate on their own, so most programmers wouldn’t have had to take care of this. 286 and later CPUs handle the waits in hardware and don’t need explicit `fwait`s in the program.
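
The insertion rule such an assembler applies can be sketched in a few lines. This is only an illustration in Python, not how any particular assembler worked. The `insert_waits` name and the pre-split list of encoded instructions are assumptions made for the example. It relies on 8087 (ESC) instructions starting with an opcode byte in the range 0xD8 to 0xDF, and it only covers the "wait before" rule; the `fwait` needed before the 8086 reads a stored result is written explicitly in the input.

```python
WAIT = 0x9B  # encoding of wait / fwait


def is_esc(instruction: bytes) -> bool:
    """Return True if the instruction starts with an ESC opcode (0xD8-0xDF)."""
    return 0xD8 <= instruction[0] <= 0xDF


def insert_waits(instructions: list[bytes]) -> bytes:
    """Prefix every 8087 instruction with a wait byte, leaving others unchanged.

    `instructions` is a hypothetical pre-split list of encoded instructions;
    a real assembler works from mnemonics, not from raw bytes.
    """
    out = bytearray()
    for ins in instructions:
        if is_esc(ins):
            out.append(WAIT)
        out += ins
    return bytes(out)


program = [
    bytes([0xD9, 0x07]),  # fld  dword ptr [bx]
    bytes([0xD8, 0xC1]),  # fadd st, st(1)
    bytes([0xD9, 0x1F]),  # fstp dword ptr [bx]  (writes memory)
    bytes([WAIT]),        # explicit fwait: let the store finish
    bytes([0x8B, 0x07]),  # mov  ax, [bx]        (8086 reads the result)
]

encoded = insert_waits(program)
print(encoded.hex(" "))
```

The sketch deliberately ignores the `fn` forms, which a real assembler would leave without a prefix, and it does not handle segment-override prefixes in front of an ESC opcode.
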
`wait` and `fwait` have the same encoding, 0x9B, but assemblers and linkers can be made to handle them differently: `wait` always ends up as 0x9B, whereas `fwait` can be handled in a way that works with emulated FPUs. The 0x9B instruction tells the CPU to wait for its `/TEST` pin to be low; on systems with no FPU, this will wait forever. Since FPUs were rare on PCs, most programs that could make use of an FPU also included code that worked without one. Instead of implementing everything twice, a common technique was to write FPU-dependent code and have an emulator handle the FPU instructions if no FPU was present. Assemblers could be told to set code up for such an emulator; they would then emit an `int` instead of `fwait`, and the emulator would replace the `int` if an FPU was found. See [What is the protocol for x87 floating point emulation in MS-DOS?][1] for details.

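
The patching step can be illustrated with a small Python model. It assumes the convention usually described for DOS compilers, in which `int 34h` through `int 3Bh` stand in for `fwait` followed by the ESC opcodes 0xD8 through 0xDF. Those interrupt numbers are not given in this answer, so treat them, and the hypothetical `patch_back` helper, as assumptions and check the linked question for the real protocol.

```python
INT_OPCODE = 0xCD      # opcode of the `int imm8` instruction
WAIT = 0x9B            # wait / fwait
FIRST_EMU_INT = 0x34   # assumed: int 34h..3Bh map to ESC opcodes D8h..DFh
FIRST_ESC = 0xD8


def patch_back(code: bytearray, offset: int) -> bool:
    """Rewrite an emulator `int` at `offset` into `fwait` + ESC opcode.

    Models what a handler might do once it has found a real FPU.
    Returns True if the two bytes were patched, False if they were
    not one of the assumed emulator interrupts.
    """
    if code[offset] != INT_OPCODE:
        return False
    index = code[offset + 1] - FIRST_EMU_INT
    if not 0 <= index <= 7:
        return False
    code[offset] = WAIT
    code[offset + 1] = FIRST_ESC + index
    return True


# int 35h followed by ModRM 07h: the emulated form of `fld dword ptr [bx]`
code = bytearray([0xCD, 0x35, 0x07])
patched = patch_back(code, 0)
print(patched, code.hex(" "))
```

In this model the patched bytes are an ordinary `fwait` plus FPU instruction, so later executions of the same code would no longer go through the interrupt.
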
For further information, see the 8087 Numeric Data Processor supplement in the [Intel iAPX86,88 User’s Manual][2]. See also https://retrocomputing.stackexchange.com/q/9173/79 and [Norbert Juffa’s FAQ][3].

[1]: https://reverseengineering.stackexchange.com/q/12272/11644
[2]: http://www.bitsavers.org/components/intel/8086/1981_iAPX_86_88_Users_Manual.pdf
[3]: https://dougx.net/gaming/coproc.html