The 8087 has two components: a control unit and an execution unit. The control unit handles bus operations and interactions with the 8086; the execution unit performs the actual floating-point operations. As you surmise, the execution unit can only work on one operation at a time, so execution has to wait until one floating-point operation completes before the next one can be submitted. That’s the first synchronisation requirement. The second arises when an 8086 instruction reads memory that the 8087 is supposed to write: the 8086 has to wait for the 8087 to finish, otherwise the read is likely to return rubbish.

To make things simple, the rule of thumb is that all 8087 instructions need a `wait` or `fwait`¹ *before* them, except instructions that only reflect the state of the control unit (`fstsw`, `fstcw`, `fldcw`, `fstenv`, and `fldenv`). Placing the wait before the instruction is easier than placing it after, but an `fwait` is also required after FPU instructions that write to memory, before the 8086 reads the result.
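As a rough sketch of that rule of thumb, here is how a hand-synchronised sequence might look in MASM-style syntax. It assumes an assembler that does *not* insert waits on its own, and the labels `a`, `b` and `result` are hypothetical 64-bit variables made up for the illustration.

```asm
; Hypothetical operands, for illustration only
a       dq  1.5
b       dq  2.25
result  dq  ?

        fwait                       ; make sure the 8087 is idle before giving it work
        fld   qword ptr [a]         ; push a onto the FPU stack
        fwait                       ; wait for the load to complete
        fadd  qword ptr [b]         ; st(0) = a + b
        fwait                       ; wait for the addition to complete
        fstp  qword ptr [result]    ; store the sum to memory and pop the stack
        fwait                       ; the store must finish before the 8086 reads result
        mov   ax, word ptr [result] ; now safe: low word of the stored value
```

The first three `fwait`s cover the first requirement (one floating-point operation at a time); the last one covers the second (the 8086 must not read `result` until the 8087 has written it). With an assembler that adds waits automatically, the explicit ones before each FPU instruction would be redundant.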

Commercial 8086 assemblers would generally insert `fwait`s as appropriate on their own, so most programmers wouldn’t have had to take care of this. 286 and later CPUs handle the waits in hardware and don’t need explicit `fwait`s in the program.

For details, see for example Robert L. Hummel’s *The Processor and Coprocessor*.

---

<sup>¹ `fwait` is preferable since it works with FPU emulators.</sup>