The 8051 was designed the way it was in order to support multi-threaded programming, at a time before "threads" were in most people's vocabulary. The 8051 references didn't call this out explicitly, because the term "threads" and everything related to it wasn't in their vocabulary at the time either. But multi-threaded programming has been around, in one form or another, since at least the 1960s.
In today's language: the registers R0-R7 are meant to be used as thread-local registers, so that when you context-switch from one thread to another, you basically only need to save A, B, DPTR and PSW before switching the stack pointer SP and the R0-R7 register window. The base address of the R0-R7 window is just PSW&0x18 (thus: 0x00, 0x08, 0x10 or 0x18), so the window switch already happens when PSW is context-switched. All of that makes for very fast thread switching. You could literally write all your interrupt handlers as simple thread switchers and use the low interrupt priority as a de facto level 2 priority for threads, with the bottom level 1 priority being code that runs outside of any interrupt handler. Level 3 would be code running inside a high-priority interrupt handler, which would be mostly reserved for time-critical tasks and spooling.
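To make the PSW&0x18 rule concrete, here is a minimal Python model of how a register name Rn maps to an internal-RAM address. The names `RS_MASK`, `bank_base` and `reg_address` are made up for this illustration; they are not part of any 8051 toolchain.

```python
# Minimal model of the 8051 register-window addressing (illustrative only).
RS_MASK = 0x18  # PSW bits 4 and 3 (RS1, RS0) select the active register bank


def bank_base(psw: int) -> int:
    """Internal-RAM address of R0 for the bank selected by this PSW value."""
    return psw & RS_MASK


def reg_address(psw: int, n: int) -> int:
    """Internal-RAM address that Rn refers to under this PSW value."""
    if not 0 <= n <= 7:
        raise ValueError("register index must be in 0..7")
    return bank_base(psw) + n


# The four possible windows, one per RS1:RS0 combination.
windows = [bank_base(rs << 3) for rs in range(4)]
```

Note that the other PSW bits (carry, overflow and so on) have no effect on the result, which is why restoring a thread's PSW is enough to restore its window.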
Unfortunately, hardly anyone has used R0-R7 that way, though it has been done. For instance, the original firmware for some of the Rise Tools Displays was entirely thread-based, in 8051 assembly, and used R0-R7 that way. It was also multiprocessor-based, with synchronization between units taking place through the 8051 synchronous communications mode.
By way of contrast, the Stepper Motor Driver Demo here was thread-based, but treated R0-R7 as synonymous with addresses 0x00-0x07, which is not the best way to use those registers. It could be adapted and improved upon (particularly the handlers in the run-time kernel) so that it doesn't need to separately stack any of R0-R7 when handling an interrupt. If you want practice, try modifying it accordingly and setting it up in a simulator. The design and rationales are contained in The 'Drive' Reference. You'll need the assembler CAS for it.
So, in answer to your question: you shouldn't have to do any pushes or pops on R0-R7 at all. Instead, you set them up in different areas of internal RAM for different processes (regardless of whether your design is thread-based or not). When you switch to another process, it all comes down to reloading the PSW for the process in question, and the rest is taken care of by that alone, since PSW&0x18 sets the base address of the R0-R7 register window. The demo application just cited, in contrast, did "push 0" and "push 1", along with "pop 1" and "pop 0", which is exactly what you want to avoid doing.
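As a rough sketch of that kind of switch, the Python model below saves and restores only A, B, DPTR, PSW and SP, and lets the register window follow the PSW. Everything in it is hypothetical: the names `Context`, `Cpu` and `switch` and the stack addresses 0x30 and 0x40 are assumptions chosen for the example. It models the idea only, not the code of either firmware mentioned above.

```python
from dataclasses import dataclass

RS_MASK = 0x18  # PSW bits 4 and 3 select the R0-R7 window


@dataclass
class Context:
    """Per-thread state that has to be saved; R0-R7 are deliberately absent."""
    a: int = 0
    b: int = 0
    dptr: int = 0
    psw: int = 0
    sp: int = 0x07  # the 8051 resets SP to 0x07


class Cpu:
    """Toy model: 128 bytes of internal RAM plus the registers named above."""

    def __init__(self) -> None:
        self.iram = bytearray(128)
        self.a = self.b = self.dptr = 0
        self.psw = 0
        self.sp = 0x07

    def get_r(self, n: int) -> int:
        return self.iram[(self.psw & RS_MASK) + n]

    def set_r(self, n: int, value: int) -> None:
        self.iram[(self.psw & RS_MASK) + n] = value & 0xFF


def switch(cpu: Cpu, outgoing: Context, incoming: Context) -> None:
    """Save the running thread, load the next one; no R0-R7 traffic at all."""
    outgoing.a, outgoing.b = cpu.a, cpu.b
    outgoing.dptr, outgoing.psw, outgoing.sp = cpu.dptr, cpu.psw, cpu.sp
    cpu.a, cpu.b = incoming.a, incoming.b
    cpu.dptr, cpu.psw, cpu.sp = incoming.dptr, incoming.psw, incoming.sp


# Two threads, each with its own bank and its own (assumed) stack area.
cpu = Cpu()
t0 = Context(psw=0x00, sp=0x30)  # bank 0
t1 = Context(psw=0x08, sp=0x40)  # bank 1
cpu.psw, cpu.sp = t0.psw, t0.sp  # start out running t0

cpu.set_r(0, 0x11)               # t0 writes its R0 (internal RAM 0x00)
switch(cpu, t0, t1)
cpu.set_r(0, 0x22)               # t1 writes its R0 (internal RAM 0x08)
switch(cpu, t1, t0)              # back in t0: its R0 was never touched
```

Each thread's stack is placed above the register banks here because the 8051 stack grows upward from SP, so a stack left at the reset value of 0x07 would run into bank 1.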
Similarly, the need to do pushes and pops within a process (or thread) is minimized by having 8 such registers around, in addition to the large internal register area. Register windows and large register files are design features characteristic of RISC processors, which gives the 8051 some RISC attributes in its design. Compare it to the 8080/8085 and 8086 evolutionary lineage, and you'll see that it followed an entirely different design direction.