
I've been looking at "single-cycle" processors such as PicoRV32, and I've noticed that, barring textbook examples with magical fully combinational (and separate) instruction and data memories, such as those in "Digital Design and Computer Architecture" by Harris & Harris, it doesn't seem possible to actually create a true single-cycle processor.

Looking at the state machines of these processors, such as PicoRV32, I've been trying to wrap my head around how a more realistic processor would work.

[Figure: PicoRV32 state machine diagram]

My current understanding is that, assuming all memory returns in one cycle, the absolute fastest a RISC CPU with a von Neumann style memory (shared instruction and data memory) could run would be two cycles per instruction.

  1. Cycle 1: Fetch the current instruction from memory. Since it takes one cycle to get the instruction from memory, we can only wait during this cycle.
  2. Cycle 2: Decode, execute, and writeback.

Some instructions, such as loads, would require three cycles.

  1. Cycle 1: Fetch the current instruction from memory.
  2. Cycle 2: Decode, and request data from memory
  3. Cycle 3: Execute/writeback to register file

Stores would also have to be three cycles.

  1. Cycle 1: Fetch the current instruction from memory.
  2. Cycle 2: Decode/execute and read the store data from the register file
  3. Cycle 3: Write data to the memory
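The cycle counts above can be sketched as a small simulation (a hypothetical model of the scheme described here, not PicoRV32's actual FSM): every instruction spends one cycle in fetch and one in decode/execute, and loads and stores spend one extra cycle on the data-memory access.

```python
# Hypothetical multi-cycle FSM: every instruction passes through FETCH
# and DECODE_EXECUTE; loads and stores need an extra MEM cycle.
def cycles_for(instr_type):
    states = ["FETCH", "DECODE_EXECUTE"]      # common to all instructions
    if instr_type in ("load", "store"):
        states.append("MEM")                  # extra data-memory cycle
    return len(states)

program = ["alu", "load", "alu", "store"]
total = sum(cycles_for(i) for i in program)
print(total)  # 2 + 3 + 2 + 3 = 10 cycles
```

With a single memory port, the fetch of the next instruction cannot overlap the MEM cycle of the current one, which is why the load/store cost shows up directly in the total.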

I'm not entirely sure whether my understanding is correct. The state machine for the PicoRV32 processor seems to require at least four cycles (fetch, decode, register-file read, execute/store/load/shift), probably for the sake of a more straightforward state machine, but I was wondering if my scheme is possible.

  • Does this presentation help? – Commented Jan 8 at 7:24
  • What's a cycle, exactly, as you see it? The clock itself? Or, for example, would the 12-clock 8051 single-cycle period work for you? Do you permit co-active pipelines? Or are they forbidden? Do you have access to multi-port von Neumann memories (such as are sometimes used with video circuitry)? Or are you only allowed a single port to memory? Do you need rock-solid 1-cycle per instruction as is found in the ADSP-21xx DSP or the MIPS R2000? – Commented Jan 8 at 7:33
  • @periblepsis one cycle is one clock period. So no 12-clock single cycle period. No pipeline at all. And single port to memory (though given my one-cycle restriction on memory reads, I'm not sure how having multiple ports would help). – Commented Jan 8 at 7:35
  • @itisyeetimetoday Have a look at the ADSP-2111 or ADSP-2105. Tell me what about them doesn't qualify? – Commented Jan 8 at 7:36
  • This seems like the "No true Scotsman" fallacy. You have invented your own idea of what a "true single-cycle processor" means (which you conclude is impossible) and then use it to argue that no CPU is single-cycle. – Commented Jan 8 at 7:57

2 Answers


The main misconception you have is that "single cycle" implies single-cycle instruction latency. Fast CPUs have instruction latencies on the order of 10 cycles as that's the pipeline depth, yet can execute a new instruction every cycle.

Some instructions, such as loads, would require three cycles.

In a pipeline, all three of those operations happen in parallel, just not for the same instruction. While instruction i executes, instruction i+2 is being fetched, instruction i+1's data is being loaded, and so on.
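This overlap can be sketched numerically (an idealized model with no hazards or stalls, not any specific CPU): with a d-stage pipeline, instruction k enters the first stage at cycle k and retires d cycles later, so n instructions complete in n + d - 1 cycles.

```python
# Idealized d-stage pipeline with no hazards: instruction k occupies
# stage s at cycle k + s, so the last of n instructions retires at
# cycle n + d - 1.
def total_cycles(n_instructions, depth):
    return n_instructions + depth - 1

d = 3                        # e.g. fetch, decode/execute, writeback
print(total_cycles(1, d))    # 3: one instruction still takes d cycles
print(total_cycles(100, d))  # 102: ~1 instruction per cycle in steady state
```

The per-instruction latency never drops below d, but the throughput approaches one instruction per cycle as n grows, which is the sense in which such a machine is "single cycle".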

In a non-pipelined design, data "percolates" from the read ports of the data, code, and register memories, through the combinatorial core, into the write ports.

There are two ways single cycle can be done:

  1. Faster: pipelining, where multiple operations - such as code fetch, decode, data fetch, data write - all happen in parallel. A single instruction takes multiple cycles, but because a new instruction starts every cycle, the instruction rate is one per cycle. The instruction latency, however, is d cycles, where d is the pipeline depth.

  2. Slower: no pipeline, Harvard architecture, double-ported RAM and registers. New results, writes, and control register contents are stored on the rising edge of the clock. Everything else is combinatorial logic. Instruction and data flow through combinatorial decode, execute, etc. Results settle by the end of the cycle and get latched then. The state retained by the CPU, i.e. stored in flip-flops etc., is minimal. For a CPU without flags, e.g. RISC-V, the only state is the program counter, register contents, and memory contents.

    Speed of execution suffers since cycle length and instruction latency are necessarily the same. Power consumption potentially suffers since there are many spurious transitions as the instruction and data percolate through the combinatorial logic. The number of flip-flops is minimized, though. So, for a homebrew CPU built from discrete logic - NAND gates or even individual transistors - this may be a good trade of speed for low circuit complexity.
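The second option can be sketched as a tiny functional model (a hypothetical two-instruction machine, invented here for illustration): the entire fetch/decode/execute/memory path is one pure function of the current state, and the only thing that happens "on the clock edge" is committing the new state.

```python
# Sketch of option 2: all work for an instruction is combinational; the
# only state committed at the clock edge is PC, registers, and memory.
# Instruction encoding (op, rd, rs, imm) is made up for this example.
def step(state):
    pc, regs, mem, imem = state
    op, rd, rs, imm = imem[pc]       # combinational instruction fetch
    if op == "addi":                 # combinational decode + ALU
        result = regs[rs] + imm
    elif op == "lw":                 # combinational data-memory read
        result = mem[regs[rs] + imm]
    regs = {**regs, rd: result}      # committed at the next rising edge
    return (pc + 1, regs, mem, imem)

imem = {0: ("addi", "x1", "x0", 5), 1: ("lw", "x2", "x0", 0)}
state = (0, {"x0": 0}, {0: 42}, imem)
state = step(step(state))
print(state[1])  # {'x0': 0, 'x1': 5, 'x2': 42}
```

Each call to `step` models one clock cycle; because the memories are read combinationally inside the function, an `lw` completes in the same "cycle" it was fetched, which is exactly what real synchronous RAMs make difficult.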


The state machine for the PicoRV32 processor seems to require four cycles minimum

If it has a control FSM, it is not a "single-cycle" processor according to the simplified, didactic reference architecture presented in the Harris books. I've also been unable to find any claim in the PicoRV32 documentation that it is a "single-cycle" microarchitecture in that sense.

Check this section in the RISC-V version of the book (mostly the same in the other editions I know):

7.3.5 Performance Analysis

Recall from Equation 7.1 that the execution time of a program is the product of the number of instructions, the cycles per instruction, and the cycle time. Each instruction in the single-cycle processor takes one clock cycle, so the clock cycles per instruction (CPI) is 1. The cycle time is set by the critical path. In our processor, the lw instruction is the most time-consuming and involves the critical path shown in Figure 7.17. As indicated by heavy blue lines, the critical path starts with the PC loading a new address on the rising edge of the clock. The instruction memory then reads the new instruction (1), and the register file reads rs1 as SrcA. While the register file is reading (2), the immediate field is sign-extended based on ImmSrc and selected at the SrcB multiplexer (path highlighted in gray). The ALU adds SrcA and SrcB to find the memory address. The data memory reads (3) from this address, and the Result multiplexer selects ReadData as Result. Finally, Result must set up at the register file before the next rising clock edge so that it can be properly written. (4)

(1) The instruction memory used is read only and combinational

(2) Also a combinational read from the register file

(3) The read operation of the data memory is also combinational

(4) At the next clock edge, the result will be stored and the next instruction will be loaded
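The cycle-time consequence of that critical path can be illustrated with arithmetic (the component delays below are made up for illustration; the book uses its own delay table, but the structure of the sum is the same): for lw, one cycle must traverse both memories, the register file read, the ALU, two multiplexers, and the flop overheads.

```python
# Illustrative (made-up) component delays in picoseconds for the lw
# critical path: PC clk-to-Q, instruction memory, register-file read,
# SrcB mux, ALU, data memory, Result mux, register-file setup.
t_pcq, t_imem, t_rf_read = 40, 250, 150
t_mux, t_alu, t_dmem, t_rf_setup = 30, 200, 250, 50

t_cycle = (t_pcq + t_imem + t_rf_read + t_mux
           + t_alu + t_dmem + t_mux + t_rf_setup)
print(t_cycle)  # 1000 ps: a 1 GHz ceiling for these hypothetical delays
```

Because CPI is 1 but every instruction pays for the slowest one, the single-cycle machine's clock is limited by this full lw path even for instructions that never touch data memory.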

Referenced Figure (red markings on the synchronous parts are mine):

[Figure 7.17: critical path of the lw instruction]

Full reference: "Digital Design and Computer Architecture - RISC-V Edition", Sarah L. Harris, David Harris, 2022

