
I am new to assembly language and cache design and recently our professor gave us a question about writing assembly language instructions to make computers with specific cache design run faster. I have no clue how to use assembly to improve performance. Can I get any hints?

The two cache designs are like this:

Cache A: 128 sets, 2-way set associative, 32-byte blocks, write-through, and no-write-allocate.

Cache B: 256 sets, direct-mapped, 32-byte blocks, write-back, and write-allocate.

The question is:

Describe a little assembly language program snippet, two instructions are sufficient, that makes Computer A (which uses the Cache A design) run as much faster than Computer B (which uses the Cache B design) as possible.

And there is another question asking the opposite:

Write a little assembly language program snippet, two instructions are sufficient, that makes Computer B run as much faster than Computer A as possible.

1 Answer


To be slow with the direct-mapped cache but fast with the associative cache, your best bet is probably 2 loads¹.

Create a conflict miss due to cache aliasing on that machine but not on the other, i.e. 2 loads that can't both hit in cache back-to-back because they index the same set.

Assume the snippet will be run in a loop, or that the cache is already hot for some other reason before your snippet runs. You can probably also assume that a register holds a valid pointer with some known alignment relative to a 32-byte cache-line boundary, i.e. you can set pre-conditions for your snippet.
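As a concrete sketch (the 8 KiB stride and the register names are my assumptions, not part of the question): both caches hold 8 KiB total, so two addresses 8 KiB apart index the same set in each geometry, but only 2-way Cache A can keep both lines resident at once. The index arithmetic is easy to check:

```python
# Set-index arithmetic for the two geometries given in the question:
# 32-byte blocks; Cache A = 128 sets x 2 ways, Cache B = 256 sets direct-mapped.
BLOCK = 32

def set_index(addr, num_sets):
    return (addr // BLOCK) % num_sets

# Two loads 8 KiB apart, e.g. a loop body like:  lw r1,(r0) ; lw r2,8192(r0)
a, b = 0x0000, 0x2000

print(set_index(a, 256) == set_index(b, 256))  # True: same set in B, so the two
                                               # lines evict each other every trip
print(set_index(a, 128) == set_index(b, 128))  # True: same set in A as well, but
                                               # its 2 ways hold both lines: hits
```

Any stride that is a multiple of the 8 KiB cache size works the same way; the point is that direct-mapped B thrashes on every iteration while A hits after the first.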


Footnote 1: Or maybe stores, but load misses more obviously need to stall the CPU, because they can't be hidden by a store buffer, only deferred by scoreboarding so the stall happens when the load result is actually used.


To make the write-through / no-write-allocate cache run slow, maybe store and then load an adjacent address, or the address you just stored. On a write-back / write-allocate cache, the load will hit. (But only after waiting for the store miss to bring the data into cache.)

Reloading the same address you just stored could be fast on both machines if there's also a store buffer with store-forwarding.

And subsequent runs of the same snippet will get cache hits, because the load would allocate the line in cache.
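A tiny model of that allocation difference (my own sketch; the instruction names are illustrative): after a store miss, a no-write-allocate cache leaves the line absent, so a following load of an adjacent address misses too, while a write-allocate cache has already pulled the line in:

```python
def adjacent_load_hits(write_allocate):
    """Model sw r1,(r0) followed by lw r2,4(r0), with the line initially uncached."""
    line_resident = False
    # The store misses. Write-allocate (Cache B) fetches the line into cache;
    # no-write-allocate (Cache A) sends the data to memory and allocates nothing.
    if write_allocate:
        line_resident = True
    # The load to the same 32-byte line hits only if the store allocated it.
    return line_resident

print(adjacent_load_hits(write_allocate=True))   # Cache B: True (the load hits)
print(adjacent_load_hits(write_allocate=False))  # Cache A: False (a second miss)
```

Loading an adjacent address rather than the stored one is what defeats store-forwarding, so Cache A really does pay for a full load miss.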

If your machine is CISC with post-increment addressing modes, there's more you can do with just 2 instructions if you imagine them as a loop body. It's unclear what kind of pre-conditions you're supposed to / allowed to assume for the cache.

Just 2 stores to the same line or even same address can demonstrate the cost of write-through: with write-back + write-allocate, you'll get a hit on the 2nd store.


2 Comments

Is it possible to add write-back to Cache A in the first case, and 2-way set associativity to Cache B in the second, to make them run faster?
@JishengYu: Oh, I hadn't noticed the different allocation and write-through/back policy. Yes, read after write of the same location is probably a lot slower on the write-through / no-write-allocate cache. Or of an adjacent location, to defeat store-forwarding.
