
I have read the Intel 64 and IA-32 Architectures SDM, Vol. 3A, 9.2 MEMORY ORDERING, but one question keeps bothering me.

If I first write to a memory address and then send an inter-processor interrupt (IPI) with the x2APIC, sending the IPI doesn't involve a memory write (it's just a wrmsr). When another core receives the IPI and reads that memory address, will it read the correct value?

For example:

Initially x = 0

Processor 0:

mov [_x], 1
wrmsr        # use x2APIC to send IPI

Processor 1:

# receive IPI; in the interrupt service routine:
mov r1, [_x]

Is r1 = 0 allowed?

3 Answers


That is an interesting question. On the face of it, one would think that since WRMSR is a serializing instruction it flushes the preceding memory writes and all is well. Even then, to quote the manual:

These instructions force the processor to complete all modifications to flags, registers, and memory by previous instructions and to drain all buffered writes to memory before the next instruction is fetched and executed.

(Emphasis mine)

It doesn't say anything about the ordering with respect to sending the IPI, as that is part of the current instruction, not the next one. So in theory the other core could execute the mov r1, [_x] while the originating core is still draining its buffered writes, though this is very unlikely given that the target core first has to take the interrupt, which probably has much higher latency.

As @harold mentioned, this point is moot since WRMSR is not always serializing. Reading the footnote that I initially missed:

WRMSR to the IA32_TSC_DEADLINE MSR (MSR index 6E0H) and the X2APIC MSRs (MSR indices 802H to 83FH) are not serializing.

So there is absolutely no guarantee that the write to x is flushed.


6 Comments

It's possibly worse than that since: "An execution of WRMSR to any non-serializing MSR is not serializing. Non-serializing MSRs include ... any of the x2APIC MSRs"
Nice find! That's what I get for not reading the footnote. Thanks!
Thanks for your answer! It seems to be more interesting now :)
So, will mov [_x], 1; sfence; wrmsr make it safe?
@untitled: No, think of sfence as just putting a divider on the "conveyor belt" that is the store buffer. It can exec and even retire from the ROB while older stores are still not committed to L1d cache. See Why is (or isn't?) SFENCE + LFENCE equivalent to MFENCE?

From Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3A: System Programming Guide, Part 1

11.12.3 MSR Access in x2APIC Mode

To allow for efficient access to the APIC registers in x2APIC mode, the serializing semantics of WRMSR are relaxed when writing to the APIC registers. Thus, system software should not use “WRMSR to APIC registers in x2APIC mode” as a serializing instruction. Read and write accesses to the APIC registers will occur in program order. A WRMSR to an APIC register may complete before all preceding stores are globally visible; software can prevent this by inserting a serializing instruction or the sequence MFENCE;LFENCE before the WRMSR.

The RDMSR instruction is not serializing and this behavior is unchanged when reading APIC registers in x2APIC mode. System software accessing the APIC registers using the RDMSR instruction should not expect a serializing behavior. (Note: The MMIO-based xAPIC interface is mapped by system software as an un-cached region. Consequently, read/writes to the xAPIC-MMIO interface have serializing semantics in the xAPIC mode.)

However, I still don't know whether this works the same way on AMD processors.
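For reference, a minimal sketch of Intel's recommended sequence in GNU C inline asm, assuming ring-0 execution; wrmsr64 is an illustrative helper (not a real kernel API), 830H is the x2APIC ICR index from the SDM, and the ICR payload encoding is omitted:

#include <stdint.h>

/* Illustrative helper: wrmsr takes the MSR index in ECX and the
 * 64-bit value split across EDX:EAX. */
static inline void wrmsr64(uint32_t msr, uint64_t val)
{
    asm volatile("wrmsr"
                 :: "c"(msr), "a"((uint32_t)val), "d"((uint32_t)(val >> 32))
                 : "memory");
}

volatile int x;

/* Processor 0: publish the store, then send the IPI. */
void store_then_send_ipi(uint64_t icr)
{
    x = 1;                                       /* the store to publish */
    asm volatile("mfence; lfence" ::: "memory"); /* drain store buffer,
                                                    then stall execution */
    wrmsr64(0x830, icr);                         /* x2APIC ICR: non-serializing */
}

If the fence pair completes before the WRMSR executes, the store to x is already globally visible by the time the IPI is raised, so r1 = 0 is no longer possible.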

9 Comments

Using xchg for the store, followed by lfence, should be cheaper. xchg is a full barrier, and its own store commits to cache as part of execution since it's an atomic RMW. Following it with lfence should also work on AMD with Spectre mitigation enabled, which makes lfence work as an execution barrier like on Intel, instead of a nop. Is LFENCE serializing on AMD processors? mfence;lfence will almost certainly work on AMD as well, if you want to use the slower sequence Intel's manual mentions.
@PeterCordes Thanks very much! What about lock cmpxchg, lock add, lock xadd, lock inc, followed by lfence?
Those are all equivalent to xchg with its implicit lock prefix in how they work as an atomic RMW. The fact that there's an ALU operation between the load + store doesn't make things any harder or easier for the microcode.
@PeterCordes Since lfence doesn't itself commit anything from the memory-order buffer to L1d cache, even with "MFENCE;LFENCE" from the sender the target CPU may not observe the sender's stores when in the IPI handler. Is this right?
@ChangbinDu: store ; mfence ; lfence ; wrmsr will commit the store to L1d cache before sending the IPI. lfence waits for the mfence to retire (and thus for the store buffer to drain) before any later instructions can even start executing. (mfence alone is fine on some CPUs like Skylake with updated microcode, where it includes an execution barrier even for non-memory ops.) This answer is quoting Intel's manual as guaranteeing that MFENCE;LFENCE is safe for this purpose on Intel CPUs. (xchg-store ; lfence ; wrmsr would I think also be safe.)
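A sketch of the cheaper variant the comments describe, with the actual WRMSR to the ICR elided (the point being that xchg's implicit lock prefix makes it a full barrier whose store commits during execution):

volatile int x;

void publish_store_before_ipi(void)
{
    int one = 1;
    /* xchg with a memory operand has an implicit lock prefix:
     * the store to x commits to L1d as part of the atomic RMW. */
    asm volatile("xchg %0, %1" : "+r"(one), "+m"(x) :: "memory");
    /* lfence: no later instruction (including the wrmsr) starts
     * executing until the xchg has completed. */
    asm volatile("lfence" ::: "memory");
    /* ... wrmsr to the x2APIC ICR here, as in the earlier sketch ... */
}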

It is possible that Processor 1 doesn't observe the value written by Processor 0 when processing the IPI. Atomic operations or memory barriers are required to make the write visible to the other core across the IPI.

In the current Linux kernel, this is handled by the lock-less list (llist) APIs, which use atomic operations plus full barriers. The IPI request is queued onto the list by llist_add(). Then, in the IPI handler, llist_del_all() is guaranteed to fetch the node that was just added. (Older kernel versions used a spinlock instead.)

void __smp_call_single_queue(int cpu, struct llist_node *node)
{
    ...
    /*
     * The list addition should be visible to the target CPU when it pops
     * the head of the list to pull the entry off it in the IPI handler
     * because of normal cache coherency rules implied by the underlying
     * llist ops.
     *
     * If IPIs can go out of order to the cache coherency protocol
     * in an architecture, sufficient synchronisation should be added
     * to arch code to make it appear to obey cache coherency WRT
     * locking and barrier primitives. Generic code isn't really
     * equipped to do the right thing...
     */
    if (llist_add(node, &per_cpu(call_single_queue, cpu)))
        send_call_function_single_ipi(cpu);
}
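For context, a simplified sketch (not the literal kernel source) of the receiving side; the real handler path ends up in the kernel's flush_smp_call_function_queue(), and the walk over the detached list is elided here:

/* Sketch of the IPI handler side: llist_del_all() atomically
 * detaches the entire list (an xchg of the head pointer), which
 * pairs with the atomic cmpxchg inside llist_add() on the sender. */
static void sketch_flush_call_single_queue(void)
{
    struct llist_head *head = this_cpu_ptr(&call_single_queue);
    struct llist_node *entry = llist_del_all(head);

    /* ... walk 'entry' and run each queued csd->func(csd->info) ... */
}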

There's another synchronization step before writing the X2APIC registers, to make prior stores globally visible:

  • A barrier (mfence + lfence) is required before sending an IPI over the X2APIC, because writing the x2APIC MSRs is not serializing on Intel CPUs. The mfence drains the store buffer, and the lfence drains the ROB (no later instruction starts executing until all prior instructions have retired).

  • AMD CPUs don't need a fence before writing the x2APIC/TSC_DEADLINE MSRs, because AMD does not relax WRMSR's serializing behavior for them.

Here is the code in the Linux kernel that triggers an IPI for a cross-core function call. It issues "mfence; lfence" before sending the IPI to the target CPU. On AMD CPUs, weak_wrmsr_fence() is a no-op.

static void x2apic_send_IPI(int cpu, int vector)
{
    u32 dest = per_cpu(x86_cpu_to_apicid, cpu);

    /* x2apic MSRs are special and need a special fence: */
    weak_wrmsr_fence();
    __x2apic_send_IPI_dest(dest, vector, APIC_DEST_PHYSICAL);
}

/*
 * Make previous memory operations globally visible before
 * a WRMSR.
 *
 * MFENCE makes writes visible, but only affects load/store
 * instructions. WRMSR is unfortunately not a load/store
 * instruction and is unaffected by MFENCE. The LFENCE ensures
 * that the WRMSR is not reordered.
 *
 * Most WRMSRs are full serializing instructions themselves and
 * do not require this barrier. This is only required for the
 * IA32_TSC_DEADLINE and X2APIC MSRs.
 */
static inline void weak_wrmsr_fence(void)
{
    alternative("mfence; lfence", "", ALT_NOT(X86_FEATURE_APIC_MSRS_FENCE));
}

By the way, a similar barrier on aarch64 is dsb ishst.
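As a sketch of that arm64 pattern (hypothetical sequence; on GICv3 the software-generated interrupt is raised via the ICC_SGI1R_EL1 system register, and that write is elided here):

volatile int x;

void arm64_store_then_sgi(void)
{
    x = 1;
    /* dsb ishst: all earlier stores (inner shareable domain) must
     * complete before any instruction after the barrier executes. */
    asm volatile("dsb ishst" ::: "memory");
    /* ... write ICC_SGI1R_EL1 to raise the SGI on the target CPU ... */
}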

