It is possible that Processor 1 does not observe the value written by Processor 0 by the time it processes the IPI. Locking or memory barriers are required so that the data written before the IPI is visible to the target CPU when the IPI handler runs.
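To make the failure mode concrete, here is a hypothetical sketch (not kernel code; send_ipi() and consume() are made-up stand-ins for the architecture's raw IPI primitive and the handler's consumer):

/* Hypothetical sketch of the race. send_ipi() stands in for the
 * architecture's raw IPI primitive and does not order memory. */
extern void send_ipi(int cpu);
extern void consume(int value);

static int req;                  /* data the IPI handler consumes */

void cpu0_send(void)
{
	req = 42;                /* (1) publish the request */
	send_ipi(1);             /* (2) the IPI may reach CPU 1 while (1)
	                          *     still sits in CPU 0's store buffer */
}

void cpu1_ipi_handler(void)
{
	int r = req;             /* may observe the stale, pre-(1) value */
	consume(r);              /* hypothetical consumer */
}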
In the current Linux kernel, this ordering is provided by the lock-less list (llist) APIs, which are built on atomic operations with full-barrier semantics. The IPI request is queued to the target CPU's list with llist_add(); in the IPI handler, llist_del_all() atomically takes the whole list, which guarantees that the node just added (and the data written before it) is observed. (Older kernel versions used a spinlock instead.)
void __smp_call_single_queue(int cpu, struct llist_node *node)
{
	...

	/*
	 * The list addition should be visible to the target CPU when it pops
	 * the head of the list to pull the entry off it in the IPI handler
	 * because of normal cache coherency rules implied by the underlying
	 * llist ops.
	 *
	 * If IPIs can go out of order to the cache coherency protocol
	 * in an architecture, sufficient synchronisation should be added
	 * to arch code to make it appear to obey cache coherency WRT
	 * locking and barrier primitives. Generic code isn't really
	 * equipped to do the right thing...
	 */
	if (llist_add(node, &per_cpu(call_single_queue, cpu)))
		send_call_function_single_ipi(cpu);
}
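For the receive side, here is a simplified sketch of the same llist scheme. The struct csd layout below is a hypothetical stand-in for the kernel's struct __call_single_data, and the real handler, flush_smp_call_function_queue(), deals with more cases (synchronous calls, irq_work, tracing):

#include <linux/llist.h>
#include <linux/percpu.h>

/* Hypothetical csd layout, for illustration only. */
struct csd {
	struct llist_node llist;
	void (*func)(void *info);
	void *info;
};

DECLARE_PER_CPU(struct llist_head, call_single_queue);

static void handle_call_function_single_ipi(void)
{
	/* llist_del_all() is an atomic xchg of the list head. It pairs
	 * with the full barrier implied by llist_add() on the sending
	 * CPU, so every queued csd -- and the data its func will read --
	 * is visible here. */
	struct llist_node *entry = llist_del_all(this_cpu_ptr(&call_single_queue));
	struct csd *csd, *next;

	entry = llist_reverse_order(entry);  /* llist is LIFO; restore FIFO order */
	llist_for_each_entry_safe(csd, next, entry, llist)
		csd->func(csd->info);        /* run each queued cross-call */
}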
There is another synchronization point before writing the x2APIC registers, needed to make prior stores globally visible.
An mfence; lfence barrier is required before sending an IPI over x2APIC, because writing the x2APIC MSRs is not a serializing instruction on Intel CPUs. The mfence drains the store buffer, making earlier stores globally visible; the lfence waits for all preceding instructions to complete (draining the ROB), so the subsequent WRMSR cannot begin executing early.
AMD CPUs don't need a fence before writing the x2APIC/TSC_DEADLINE MSRs, because AMD does not relax WRMSR ordering for them: on AMD, a WRMSR to these registers is not reordered ahead of earlier stores.
Here is the Linux kernel code that triggers the IPI for a cross-core function call. It executes "mfence; lfence" before sending the IPI to the target CPU; on AMD CPUs, weak_wrmsr_fence() is patched into a nop.
static void x2apic_send_IPI(int cpu, int vector)
{
	u32 dest = per_cpu(x86_cpu_to_apicid, cpu);

	/* x2apic MSRs are special and need a special fence: */
	weak_wrmsr_fence();
	__x2apic_send_IPI_dest(dest, vector, APIC_DEST_PHYSICAL);
}

/*
 * Make previous memory operations globally visible before
 * a WRMSR.
 *
 * MFENCE makes writes visible, but only affects load/store
 * instructions. WRMSR is unfortunately not a load/store
 * instruction and is unaffected by MFENCE. The LFENCE ensures
 * that the WRMSR is not reordered.
 *
 * Most WRMSRs are full serializing instructions themselves and
 * do not require this barrier. This is only required for the
 * IA32_TSC_DEADLINE and X2APIC MSRs.
 */
static inline void weak_wrmsr_fence(void)
{
	alternative("mfence; lfence", "",
		    ALT_NOT(X86_FEATURE_APIC_MSRS_FENCE));
}
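For completeness, __x2apic_send_IPI_dest() ultimately boils down to a WRMSR to the x2APIC Interrupt Command Register (MSR 0x830). Here is a condensed, hypothetical sketch of what the fence-plus-IPI pair amounts to on an Intel CPU (the ICR encoding is simplified to a physical-destination, fixed-delivery write):

#include <stdint.h>

#define X2APIC_ICR_MSR 0x830  /* x2APIC Interrupt Command Register */

/* Condensed sketch, not kernel code: fence, then write the ICR.
 * Destination APIC ID goes in ICR bits 63:32, the vector in bits 7:0. */
static inline void x2apic_ipi_sketch(uint32_t apicid, uint8_t vector)
{
	uint64_t icr = ((uint64_t)apicid << 32) | vector;

	/* mfence drains the store buffer; lfence keeps the WRMSR from
	 * executing before the mfence has completed. */
	asm volatile("mfence; lfence" ::: "memory");
	asm volatile("wrmsr" :: "c"(X2APIC_ICR_MSR),
	                        "a"((uint32_t)icr), "d"((uint32_t)(icr >> 32)));
}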
By the way, a similar barrier on aarch64 is dsb ishst, which waits for all prior stores to complete within the inner-shareable domain before execution continues.
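A condensed, hypothetical sketch of its use before triggering an IPI on arm64 (raise_sgi() is a made-up stand-in for the GIC SGI register write):

/* Hypothetical arm64 sketch: order the request store before the IPI. */
extern void raise_sgi(int cpu);              /* hypothetical IPI trigger */

static inline void publish_then_ipi(volatile int *req, int cpu)
{
	*req = 1;                                /* publish the request */
	asm volatile("dsb ishst" ::: "memory");  /* prior stores complete first */
	raise_sgi(cpu);
}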