Timeline for Why did x86 support self-modifying code without explicit flushes in the 80s and 90s?

Current License: CC BY-SA 4.0

20 events

when	what		by	license	comment
Dec 11, 2022 at 21:55	comment	added	Peter Cordes		Nate and @another-dave: IDK if it would make sense or be plausible, but a system could maybe exist where code can be loaded from disk via DMA, but there's no guaranteed / well-defined way to store new machine code and jump to it (without writing to disk and loading back in). Perhaps because of a lack of cache-flush instructions, but DMA is cache-coherent including I-cache (possible but silly). Or because of separate memory spaces for code + data, like a true Harvard with the only connection being external storage.
Dec 9, 2022 at 22:48	comment	added	Nate Eldredge		@another-dave: Right. At the lowest level, what the CPU sees is that we store some bytes in memory, then at some later time, we try to execute those bytes. It has to be possible on any computer that's going to be able to load programs from mass storage; the only question is, what do you have to do in between. For classic x86 the answer is "jump". For modern x86, "nothing". For ARM, a cache flush procedure. For something with W^X, change page table permissions and flush TLBs. Et cetera. So, which of those counts as "supporting self-modifying code"?
Dec 8, 2022 at 12:55	comment	added	dave		You're right as to terminology, of course, but I think the architectural point is precisely that the boundary is blurred.
Dec 8, 2022 at 11:57	comment	added	TonyM		@another-dave, "Anything that loads code into memory is 'self-modifying' in some sense" That's not what the term 'self-modifying code' means, though. Programs and overlays aren't SMC, neither are dynamically constructed jump tables etc. SMC is a program that writes to parts of itself, usually small parts, typically to obtain a fine speed improvement or space saving in a restricted system. I'm sure we could all get a long discussion about the exact boundary where the term does and doesn't apply but it's very distinct from loading separate program files into memory.
Dec 4, 2022 at 20:31	comment	added	Peter Cordes		@Joshua: Normally when kernel folks talk about FP, they're including SIMD registers as well. Everything non-integer, everything that doesn't always get saved at interrupt / system-call entry points. Most ISAs don't have separate registers for SIMD, they just do scalar FP in the same registers as SIMD, like modern x86 using SSE for scalar math. That brings up an interesting point: x87 is used rarely enough in modern Unix/Linux systems that a kernel could use it internally and do lazy save/restore of user-space. Maybe. Some video codecs still have MMX code though.
Dec 4, 2022 at 20:23	comment	added	Joshua		@PeterCordes: It's also possible it didn't actually merge. It's pretty useless on x64 because yeah SSE2 unless you use the long double type.
Dec 4, 2022 at 20:01	comment	added	Peter Cordes		@Joshua: Is it possible you were misinterpreting some details or something, and it was actually saying BSD was changing from lazy to eager FP context saving? So it saves/restores on every user-space context switch (but still not on interrupts or system calls). Linux used to do lazy FP context switching, and set a control register bit so FP instructions would fault and trigger restore of the FP state for this process. But with more processes using SSE2 all the time (e.g. memcpy), and higher cost of interrupts vs. extra stores, it doesn't futz around with that. I'd expect BSD does the same.
Dec 4, 2022 at 19:55	comment	added	Joshua		@PeterCordes: I saw the news article, it was BSD. I'm getting annoyed that I can't find it again.
Dec 4, 2022 at 14:39	comment	added	Peter Cordes		Same with long mode but worse: HW still needs to handle 32-bit compat mode, so the HW snooping exists, nothing to gain from changing. (Except maybe a distant future where 32-bit mode is mostly not used, then HW might finally drop pipeline snooping and make 32-bit mode disastrously slow, with every `jmp` serializing? Or making explicit-coherency instructions usable in 32-bit mode, and a control-register bit to control supporting coherent I-cache or not?)
Dec 4, 2022 at 14:36	comment	added	Peter Cordes		Then, if every program ever written for 386 uses those correctly, despite the fact that no testing on 386 can verify that since it would still Just Work on 386, they'll work correctly on future CPUs with non-coherent I-caches / pipelines. But if any commercially-relevant software ignores it or has bugs in their implementation, hardware will have to cater to it to run that software. It also still has to support SMC in real mode efficiently enough, so needs hardware to track that. Also, x86 has always had cache-coherent DMA since 8086 / 386 didn't need special instructions after DMA.
Dec 4, 2022 at 14:33	comment	added	Peter Cordes		@NateEldredge: For Protected Mode or Long Mode to have changed the semantics for self-modifying code, they'd have to have introduced new instructions that would (in future CPUs) flush / invalidate caches, and defined semantics for when programs were required to use them. Like ARM / MIPS / PowerPC have, with syncing data cache back to a shared cache, and invalidating instruction caches, before jumping to a recently-written address.
Dec 4, 2022 at 13:16	comment	added	Peter Cordes		@Joshua: What kernels do you have in mind that actually emulate FPU instructions? That's not at all how it works under Linux; when a rare piece of code (like software RAID5 or RAID6) wants to use those registers, it calls `kernel_fpu_begin()` to trigger a save of the FPU/SIMD state. Then SIMD instructions like `vpxor` can run without corrupting user-space state. See "Why am I able to perform floating point operations inside a Linux kernel module?": getting this wrong leads to silent corruption of user-space, not trapping to emulation.
Dec 4, 2022 at 5:35	comment	added	Joshua		@Mark: What's funny is FPU emulation is back. It turns out it's faster to use FPU in kernel mode all the time and only save/restore FPU registers on context switch than save/restore every system call. The balance went the other way because there are so few FP instructions in kernel mode.
Dec 4, 2022 at 3:11	comment	added	Nate Eldredge		@Neil: For coprocessor instructions in particular, I suppose one could wire up the WAIT pin to generate an external interrupt. But I don't know if the PC or other systems did so. In later machines, this behavior was built in, independent of the "invalid instruction" exception which was also added.
Dec 4, 2022 at 0:54	comment	added	Neil		@another-dave According to reverseengineering.stackexchange.com/a/12277 the original 8086 didn't have a trap for invalid instructions.
Dec 4, 2022 at 0:19	comment	added	dave		@Mark - can't you just compile floating-point instructions and then handle the trap (by emulating the instruction) if there's no FPU? That's how it's been done on non-Intel systems.
Dec 3, 2022 at 21:55	comment	added	Mark		Prior to the Pentium CPUs, the most common form of self-modifying code was floating-point emulation: the program would be compiled with interrupt calls everywhere that a floating-point instruction was needed. When one of those interrupts was hit, it would either be replaced with the appropriate FPU instruction (if an FPU was present), or the emulation code would run (if it wasn't). The Pentium made this obsolete by always including an FPU.
Dec 3, 2022 at 13:06	comment	added	dave		Anything that loads code into memory is 'self-modifying' in some sense, so you can't actually forbid it. But explicit flushing is not generally a problem in such cases.
Dec 3, 2022 at 13:03	vote	accept	rwallace
Dec 7, 2022 at 0:25
Dec 3, 2022 at 5:54	history	answered	Nate Eldredge	CC BY-SA 4.0
