The 8087 has two components: a control unit and an execution unit. The control unit handles bus operations and interactions with the 8086; the execution unit performs the actual floating-point operations. As you surmise, the execution unit can only handle one operation at a time, so the execution flow has to wait until one floating-point operation completes before the next one can be submitted. That’s the first synchronisation requirement. The second arises when an 8086 instruction reads memory that the 8087 is supposed to write: the 8086 has to wait for the 8087 to finish, otherwise the read is likely to return rubbish.
To make things simple, the rule of thumb is that every 8087 instruction needs a wait or fwait (opcode 0x9B) before it. Placing the wait before is easier than placing it after, but an fwait is also required after an FPU instruction that writes to memory, before the 8086 reads the result. Some instructions, in particular finit and the instructions that only reflect the state of the control unit (fstsw, fstcw, fldcw, fstenv, and fldenv), include a wait opcode in their documented encoding, so they don’t need an explicit wait in front of them. They don’t need a subsequent wait either, since the execution unit is ready with no delay. You can spot these instructions by looking for their non-waiting alternate, whose mnemonic starts with fn (fninit etc.).
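As a rough illustration of that rule of thumb, here is a small Python sketch of what an assembler does when it inserts the waits for you. The function name `add_waits` and the sample program are made up for this example, and the byte encodings are hand-assembled, so treat them as assumptions to check against the manual rather than as authoritative.

```python
FWAIT = 0x9B  # opcode shared by wait and fwait


def is_esc(opcode: int) -> bool:
    """ESC opcodes 0xD8..0xDF are the ones picked up by the coprocessor."""
    return 0xD8 <= opcode <= 0xDF


def add_waits(instructions):
    """Prefix each coprocessor instruction with 0x9B, except no-wait (fn*) forms.

    `instructions` is a list of (mnemonic, encoding-without-wait) pairs.
    """
    out = bytearray()
    for mnemonic, encoding in instructions:
        if is_esc(encoding[0]) and not mnemonic.startswith("fn"):
            out.append(FWAIT)
        out += encoding
    return bytes(out)


# Hypothetical sample program; encodings assembled by hand for illustration.
program = [
    ("fninit", bytes([0xDB, 0xE3])),               # no-wait form, left alone
    ("fld dword ptr [bx]", bytes([0xD9, 0x07])),
    ("fadd st, st(1)", bytes([0xD8, 0xC1])),
    ("fstp dword ptr [bx]", bytes([0xD9, 0x1F])),  # the 8087 writes memory here
    ("fwait", bytes([FWAIT])),                     # explicit wait before the 8086 reads [bx]
    ("mov ax, [bx]", bytes([0x8B, 0x07])),
]

print(add_waits(program).hex(" "))
```

The explicit fwait before the mov is the "after" case: the prefix rule alone would let the 8086 read [bx] while the 8087 is still storing to it.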
Commercial 8086 assemblers would generally insert fwaits as appropriate on their own, so most programmers wouldn’t have had to take care of this. 286 and later CPUs handle the waits in hardware and don’t need explicit fwaits in the program.
wait and fwait have the same encoding, 0x9B, but assemblers and linkers can be made to handle them differently: wait always ends up as 0x9B, but fwait can be handled in a way that works with emulated FPUs. The 0x9B instruction tells the CPU to wait for its /TEST pin to be low; on systems with no FPU, this will wait forever. Since FPUs were rare on PCs, most programs that could make use of an FPU also included code that worked without one. Instead of implementing everything twice, a common technique was to write FPU-dependent code and have an emulator handle the FPU instructions if no FPU was present. Assemblers could be told to set code up for such an emulator; they would then emit a nop before fwait, to leave enough room for a two-byte interrupt instruction, along with a linker fix-up pointing to the instruction. The linker would resolve this fix-up appropriately, depending on the FPU library used to build the final executable. See “What is the protocol for x87 floating point emulation in MS-DOS?” for details. With Borland’s Turbo Assembler, this involves the /e and /r command-line options, or the EMUL/NOEMUL directives. (But with assemblers and linkers capable of doing this, you wouldn’t explicitly specify wait/fwait in general anyway.)
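To show why two bytes are enough, here is a minimal Python sketch of the fix-up arithmetic. It assumes the convention usually described for Microsoft- and Borland-style emulation libraries, which is not spelled out above: the linker adds a 16-bit constant to the word at the fix-up location, turning the wait plus ESC opcode into int 34h to 3Bh, and nop plus fwait into int 3Dh. The constants and the names FIDRQQ and FIWRQQ are given as I understand that convention; the helper `apply_fixup` is invented for this example. Verify all of it against the linked question before relying on it.

```python
# Assumed emulator fix-up addends (not taken from the text above).
FIDRQQ = 0x5C32  # applied to "9B Dx": wait + ESC opcode
FIWRQQ = 0xA23D  # applied to "90 9B": nop + standalone fwait


def apply_fixup(code: bytes, offset: int, addend: int) -> bytes:
    """Add `addend` to the little-endian 16-bit word at `offset`, wrapping at 16 bits."""
    word = int.from_bytes(code[offset:offset + 2], "little")
    patched = (word + addend) & 0xFFFF
    return code[:offset] + patched.to_bytes(2, "little") + code[offset + 2:]


# wait + fld dword ptr [bx]  (9B D9 07) becomes int 35h followed by the ModRM byte
print(apply_fixup(bytes([0x9B, 0xD9, 0x07]), 0, FIDRQQ).hex(" "))

# nop + fwait  (90 9B) becomes int 3Dh
print(apply_fixup(bytes([0x90, 0x9B]), 0, FIWRQQ).hex(" "))
```

When the program is linked for a real FPU instead, the same symbols would resolve to zero under this convention, so the original bytes are left unchanged.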
For further information, see the 8087 Numeric Data Processor supplement in the Intel iAPX86,88 User’s Manual. See also How did the 8086 interface with the 8087 FPU coprocessor? and Norbert Juffa’s FAQ.