I am using the textbook Randal E. Bryant, David R. O’Hallaron - Computer Systems. A Programmer’s Perspective [3rd ed.] (2016, Pearson), and there is a section I don't understand very well.
C code:

    void write_read(long *src, long *dst, long n)
    {
        long cnt = n;
        long val = 0;
        while (cnt) {
            *dst = val;
            val = (*src)+1;
            cnt--;
        }
    }

Inner loop of write_read:

    # src in %rdi, dst in %rsi, val in %rax
    .L3:
      movq %rax, (%rsi)    # Write val to dst
      movq (%rdi), %rax    # t = *src
      addq $1, %rax        # val = t+1
      subq $1, %rdx        # cnt--
      jne .L3              # If != 0, goto loop

Given this code, the textbook gives this diagram to describe the program flow:
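To make the two situations that matter here concrete, this is a small sketch that calls write_read once with dst pointing somewhere other than src and once with dst aliasing src. The helper run_case and the starting values -10 and 17 are my own illustrative choices, not something given above.

```c
#include <assert.h>

/* write_read as in the question. */
void write_read(long *src, long *dst, long n) {
    long cnt = n;
    long val = 0;
    while (cnt) {
        *dst = val;
        val = (*src) + 1;
        cnt--;
    }
}

/* Hypothetical helper: src is always &a[0]; dst is &a[dst_index].
 * dst_index == 1 means the store and the load touch different addresses,
 * dst_index == 0 means every load reads what the previous store wrote. */
void run_case(int dst_index, long a[2]) {
    assert(dst_index == 0 || dst_index == 1);
    a[0] = -10;  /* arbitrary starting values */
    a[1] = 17;
    write_read(&a[0], &a[dst_index], 3);
}
```

Calling run_case(1, a) leaves a as {-10, -9}, because each load sees the untouched a[0]. Calling run_case(0, a) leaves a as {2, 17}, because each load must see the value stored just before it. The second case is the one where the conditional dependency between the store and the load actually applies.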
This is the explanation given, for those who don't have access to the textbook:
Figure 5.35 shows a data-flow representation of this loop code. The instruction movq %rax,(%rsi) is translated into two operations: The s_addr instruction computes the address for the store operation, creates an entry in the store buffer, and sets the address field for that entry. The s_data operation sets the data field for the entry. As we will see, the fact that these two computations are performed independently can be important to program performance. This motivates the separate functional units for these operations in the reference machine.

In addition to the data dependencies between the operations caused by the writing and reading of registers, the arcs on the right of the operators denote a set of implicit dependencies for these operations. In particular, the address computation of the s_addr operation must clearly precede the s_data operation.
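The split described in the quote can be pictured as a toy software model. This is only a sketch under my own assumptions: the StoreEntry struct, its field names, and the commit step are hypothetical stand-ins for hardware behavior, not anything defined by the textbook.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one store-buffer entry (names are illustrative). */
typedef struct {
    long *addr;        /* filled in by s_addr */
    long  data;        /* filled in by s_data */
    bool  addr_valid;
    bool  data_valid;
} StoreEntry;

/* s_addr: create the entry and set its address field.
 * It needs only %rsi, so it does not have to wait for val in %rax. */
void s_addr(StoreEntry *e, long *dst) {
    e->addr = dst;
    e->addr_valid = true;
    e->data_valid = false;
}

/* s_data: set the data field of an entry that s_addr already created. */
void s_data(StoreEntry *e, long val) {
    assert(e->addr_valid);  /* the implicit s_addr -> s_data ordering */
    e->data = val;
    e->data_valid = true;
}

/* Assumed later step: memory changes only once the completed entry is written out. */
void commit(StoreEntry *e) {
    assert(e->addr_valid && e->data_valid);
    *e->addr = e->data;
    e->addr_valid = false;
    e->data_valid = false;
}
```

In this model, one movq %rax,(%rsi) corresponds to one s_addr call followed by one s_data call on the same entry. Neither call touches memory itself; that is the point of keeping the entry in a buffer.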
In addition, the load operation generated by decoding the instruction movq (%rdi), %rax must check the addresses of any pending store operations, creating a data dependency between it and the s_addr operation. The figure shows a dashed arc between the s_data and load operations. This dependency is conditional: if the two addresses match, the load operation must wait until the s_data has deposited its result into the store buffer, but if the two addresses differ, the two operations can proceed independently.
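The address check in this paragraph can be sketched the same way. Again this is a rough model under stated assumptions: PendingStore, LoadResult and try_load are hypothetical names, every pending entry is assumed to have had its s_addr executed already, and buf[0] is assumed to be the oldest store.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pending store whose address is already known. */
typedef struct {
    const long *addr;
    long        data;
    bool        data_valid;  /* true once s_data has run */
} PendingStore;

typedef enum { LOAD_FROM_MEMORY, LOAD_FORWARDED, LOAD_MUST_WAIT } LoadResult;

/* Model of the load: compare src against every pending store address. */
LoadResult try_load(const PendingStore *buf, size_t n,
                    const long *src, long *out) {
    /* Scan youngest-first so the most recent matching store wins. */
    for (size_t i = n; i-- > 0; ) {
        if (buf[i].addr == src) {
            if (!buf[i].data_valid)
                return LOAD_MUST_WAIT;   /* addresses match, s_data not done */
            *out = buf[i].data;          /* take the value from the buffer */
            return LOAD_FORWARDED;
        }
    }
    *out = *src;                         /* no match: independent of the store */
    return LOAD_FROM_MEMORY;
}
```

Only the load performs this check, and only against stores that are still pending. When src and dst differ, the loop finds no match and the load proceeds without waiting for s_data, which is the "conditional" part of the dashed arc.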
a) What I am not really clear about is why, after the line movq %rax,(%rsi), a load needs to be done once s_data has been called. I'm assuming that when s_data is called, the value of %rax is stored at the address held in %rsi? Does this mean that after every s_data there needs to be a load?
b) It doesn't really show in the diagram, but from what I understand of the explanation in the book, the line movq (%rdi), %rax requires its own set of s_addr and s_data? So is it accurate to say that every movq requires an s_addr and an s_data, followed by a check of whether the addresses match, before the load is performed?
I'm quite confused about these parts and would appreciate it if someone could explain how s_addr and s_data work with load, and when these operations are required. Thank you!
rsi and rdi are not incremented, so the code simply copies the same byte time and time again. Shouldn't those movq instructions be lodsq and stosq?
[...] lodsq or stosq, in this exercise we're given movq to work with :) and we have to 'break it down' to s_addr/s_data as shown in the diagram...
@TonyK [...] C code, exactly the same applies. I know it's only an example, but it doesn't have to be so pointless! (Also, the two snippets behave very differently if cnt is zero.)
[...] s_addr uop not executed yet, regardless of s_data). github.com/travisdowns/uarch-bench/wiki/… has some SKL experiments
[...] do{}while() style loop is just the loop, omitting a check for cnt == 0 to skip over the whole loop that a compiler would have put around it. Or else it's from a build where cnt was a compile-time constant and thus known non-zero. Since the question isn't about that, it's not a big deal, I don't think the question needs more clutter.