In PSOLA, a pitch estimate is used to space out or overlap segments in time by integer multiples of the estimated pitch period. Previous frames (or pitch-period-length sub-segments of them) can be duplicated as needed so that no gaps are left between frames. The segments can then be concatenated or cross-faded. This lengthens or shortens the composited result, and thus changes its duration. The intermediate result can then be resampled to change the pitch (possibly undoing some or all of the duration change) and to provide the number of samples needed at the output sample rate.
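The two stages described above (overlap-add at pitch-period spacing, then resampling) can be sketched roughly as follows. This is only an illustration under strong assumptions: the pitch period is constant and already known as a whole number of samples, pitch marks are placed on a fixed grid rather than detected, and the resampler is plain linear interpolation with no anti-alias filtering. The names `psola_stretch` and `resample_linear` are made up for this example.

```python
import math

def hann(n):
    # Periodic Hann window; copies spaced n/2 samples apart sum to 1.
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def psola_stretch(x, period, factor):
    """Time-stretch x by `factor`, assuming a constant, known pitch period (in samples)."""
    win = hann(2 * period)
    # Analysis pitch marks on a fixed grid (assumption: a real system detects them).
    marks = list(range(period, len(x) - period, period))
    n_out = int(round(len(marks) * factor))
    out = [0.0] * ((n_out + 1) * period)
    for k in range(n_out):
        # Map each output segment back to an input segment. Segments are
        # repeated when factor > 1 and skipped when factor < 1.
        src = marks[min(int(k / factor), len(marks) - 1)]
        dst = (k + 1) * period  # synthesis mark, one period after the previous one
        for i in range(2 * period):
            out[dst - period + i] += win[i] * x[src - period + i]  # cross-fade by overlap-add
    return out

def resample_linear(x, ratio):
    """Step through x at `ratio` input samples per output sample (linear interpolation)."""
    n = int((len(x) - 1) / ratio) + 1
    out = []
    for j in range(n):
        pos = j * ratio
        i = min(int(pos), len(x) - 1)
        frac = pos - i
        nxt = x[min(i + 1, len(x) - 1)]
        out.append((1 - frac) * x[i] + frac * nxt)
    return out

# Usage: raise the pitch by an octave while keeping roughly the same duration.
period = 100
x = [math.sin(2 * math.pi * n / period) for n in range(1000)]
stretched = psola_stretch(x, period, 2.0)   # about twice as long, same pitch
shifted = resample_linear(stretched, 2.0)   # back to about the original length, pitch doubled
```

Stretching by 2 and then reading the result out at twice the rate leaves the duration roughly unchanged while halving the period, which is the "undo some or all of the time duration change" step.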
There will be some jitter, because frames are usually shifted only by whole pitch periods, not by fractions of one. Buffering may be required to cover this time jitter; the amount is related to the longest period (lowest pitch) expected to be handled.
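As a rough worked example of that buffer sizing, the longest period in samples is the sample rate divided by the lowest expected pitch. The helper name `jitter_buffer_samples` and its `periods` safety multiplier are assumptions for illustration; how many periods of slack are actually needed depends on the implementation.

```python
import math

def jitter_buffer_samples(sample_rate_hz, lowest_pitch_hz, periods=1):
    """Samples of buffering needed to absorb `periods` whole pitch periods of jitter."""
    longest_period = math.ceil(sample_rate_hz / lowest_pitch_hz)
    return periods * longest_period

# Usage: an 80 Hz lowest pitch at 48 kHz (both values chosen only as an example).
samples = jitter_buffer_samples(48000, 80)
latency_ms = 1000.0 * samples / 48000
```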
The "clever math" might be in the choice and implementation of a pitch estimation algorithm that is accurate enough but still low-latency. If the pitch cannot be estimated (because of consonants, polyphony, etc.), apparent quality may suffer.
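To make the estimation step concrete, here is a minimal sketch of one simple approach, normalized autocorrelation with a voicing threshold. It is not claimed to be what any particular PSOLA product uses; the function name `estimate_period` and the threshold of 0.5 are assumptions, and real-time systems typically use faster and more robust methods. It returns `None` when no lag correlates strongly enough, which is the "pitch cannot be estimated" case.

```python
import math

def estimate_period(frame, min_lag, max_lag, threshold=0.5):
    """Return the lag (in samples) with the highest normalized autocorrelation,
    or None if no lag scores above `threshold` (treated as unvoiced)."""
    best_lag, best_score = None, threshold
    for lag in range(min_lag, max_lag + 1):
        a = frame[:-lag]   # frame without its last `lag` samples
        b = frame[lag:]    # same frame delayed by `lag` samples
        num = sum(p * q for p, q in zip(a, b))
        den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
        score = num / den if den > 0 else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

# Usage: a sine with a 100-sample period, searched over lags 40..160.
frame = [math.sin(2 * math.pi * n / 100) for n in range(400)]
period = estimate_period(frame, 40, 160)
```

The lag range sets both the pitch range that can be detected and the minimum frame length, so the lowest pitch to be handled also drives latency here.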