We implemented a 3-stage pipeline with the following stages: FD (fetch/decode), X (execute), and MWB (memory/writeback).

In the FD stage, the IMEM and the PC register were clocked on the same edge. Likewise, the DMEM and the pipeline registers between the X and MWB stages were clocked together.
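As a minimal illustration of this clocking scheme, the FD stage can be sketched as a single clocked block in which the PC register and a synchronous IMEM read share one edge. The signal names and the behavioral IMEM model below are illustrative assumptions, not the actual RTL.

```verilog
// Sketch only: PC register and IMEM read clocked together in the FD stage.
module fd_stage (
    input             clk,
    input      [31:0] pc_next,
    output reg [31:0] pc,
    output reg [31:0] instr
);
    // Behavioral stand-in for the IMEM block (hypothetical depth).
    reg [31:0] imem [0:1023];

    always @(posedge clk) begin
        // The PC register and the IMEM read update on the same clock edge.
        pc    <= pc_next;
        instr <= imem[pc_next[11:2]];   // word-aligned fetch
    end
endmodule
```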
The following components are part of the FD stage:
In the X stage, we have the following components:
The MWB stage includes:
We went with a simple branch-not-taken strategy. On a misprediction, we inserted a NOP (no-operation) and reloaded the previous PC and instruction.
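A minimal sketch of this recovery path is shown below, assuming hypothetical signal names (branch_taken, branch_target, imem_dout) rather than the ones in our RTL: when a branch resolves as taken in the X stage, the wrong-path fetch is squashed to a NOP and the PC is redirected.

```verilog
// Sketch of branch-not-taken recovery; names and reset value are assumptions.
module branch_recovery (
    input             clk,
    input             reset,
    input             branch_taken,     // branch resolved as taken in X
    input      [31:0] branch_target,
    input      [31:0] imem_dout,        // instruction just fetched
    output reg [31:0] pc,
    output reg [31:0] instr_fd          // instruction presented to decode
);
    localparam [31:0] NOP = 32'h0000_0013;   // addi x0, x0, 0

    always @(posedge clk) begin
        if (reset) begin
            pc       <= 32'h0000_0000;
            instr_fd <= NOP;
        end else if (branch_taken) begin
            pc       <= branch_target;   // redirect fetch
            instr_fd <= NOP;             // one-cycle bubble instead of a stall
        end else begin
            pc       <= pc + 32'd4;      // default: predict not taken
            instr_fd <= imem_dout;
        end
    end
endmodule
```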
Initially, we used a no-cache architecture that read directly from memory. We later swapped it for a direct-mapped cache.

Here is the FSM for the write-through cache we implemented.
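As a rough illustration of the same control flow in RTL, here is a minimal write-through cache controller sketch. The states (IDLE, READ_MISS, WRITE_MEM), their encodings, and the signal names are illustrative assumptions rather than the exact FSM in our design.

```verilog
// Illustrative write-through cache FSM sketch; all names are assumptions.
module wt_cache_fsm (
    input      clk,
    input      reset,
    input      cpu_req,        // CPU issues a read or write
    input      cpu_we,         // 1 = write, 0 = read
    input      hit,            // tag match and valid
    input      mem_ready,      // main-memory handshake
    output reg cache_we,       // update the data/tag arrays
    output reg mem_req,        // request to main memory
    output reg stall           // freeze the pipeline while waiting
);
    localparam IDLE      = 2'd0,   // service hits in a single cycle
               READ_MISS = 2'd1,   // fetch the line from memory
               WRITE_MEM = 2'd2;   // write through to memory

    reg [1:0] state, next;

    always @(posedge clk)
        state <= reset ? IDLE : next;

    always @(*) begin
        // Defaults for every output each cycle.
        next     = state;
        cache_we = 1'b0;
        mem_req  = 1'b0;
        stall    = 1'b0;
        case (state)
            IDLE: if (cpu_req) begin
                if (cpu_we)    begin next = WRITE_MEM; stall = 1'b1; end
                else if (!hit) begin next = READ_MISS; stall = 1'b1; end
                // Read hit: data is returned without stalling.
            end
            READ_MISS: begin
                mem_req = 1'b1;
                stall   = 1'b1;
                if (mem_ready) begin
                    cache_we = 1'b1;   // refill the line
                    next     = IDLE;
                end
            end
            WRITE_MEM: begin
                mem_req = 1'b1;        // write through to main memory
                stall   = 1'b1;
                if (mem_ready) begin
                    cache_we = hit;    // also update the cached copy on a write hit
                    next     = IDLE;
                end
            end
            default: next = IDLE;
        endcase
    end
endmodule
```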

One of the critical paths we anticipated lies between the pipeline register that stores the ALU output at the X-to-MWB boundary and the dout of the DMEM cache. This path corresponds to a load instruction: the signals pass through the logic that determines the DCACHE read/write enable signals, access the DCACHE, and then traverse the load MUX and the WB MUX. Because of the complexity of this sequence, we expected this path to have the longest delay. Below is the post-PAR timing histogram:

To address the delays and optimize performance, we implemented forwarding for various dependencies:
We implemented forwarding for 1-cycle and 2-cycle dependencies to minimize stalls, particularly in the no-cache configuration. Additionally, with the direct-mapped cache, we disabled the clock on the pipeline registers until both the DCACHE and ICACHE had completed the necessary transactions with memory (particularly for reads).
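As an illustration, a forwarding mux for one ALU operand might look like the sketch below. The signal names (mwb_rd, wb_rd, and so on) are assumptions for the example: a 1-cycle-old result is forwarded from the X/MWB pipeline register, a 2-cycle-old result from the value being written back this cycle, and otherwise the operand comes from the register file.

```verilog
// Forwarding-mux sketch for a 3-stage pipeline; names are hypothetical.
module fwd_mux (
    input  [4:0]  rs,            // source register of the instruction in X
    input  [31:0] rf_rdata,      // register-file read data
    input  [4:0]  mwb_rd,        // destination of the instruction in MWB
    input  [31:0] mwb_result,    // its result (1-cycle-old value)
    input         mwb_regwrite,
    input  [4:0]  wb_rd,         // destination written back this cycle
    input  [31:0] wb_data,       // its value (2-cycle-old value)
    input         wb_regwrite,
    output [31:0] operand
);
    wire fwd1 = mwb_regwrite && (mwb_rd != 5'd0) && (mwb_rd == rs);
    wire fwd2 = wb_regwrite  && (wb_rd  != 5'd0) && (wb_rd  == rs);

    // Priority: the youngest producer wins.
    assign operand = fwd1 ? mwb_result :
                     fwd2 ? wb_data    :
                            rf_rdata;
endmodule
```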
We began by pipelining all of our datapath signals along with a few key control signals, so that whenever a signal was needed at a later stage we could feed it directly to the ALU or to the cache read/write address ports. The primary motivation was to prevent stalls arising from 1-cycle and 2-cycle dependencies, and this strategy proved effective in reducing performance bottlenecks.
Subsequently, we removed any unnecessary signals that had been pipelined in order to minimize power consumption and area usage. Since our design uses a 3-stage pipeline, we anticipated the possibility of a branch misprediction. In the event of a misprediction (where the default assumption was that the branch was not taken), we chose to pass a NOP instruction rather than stalling for 2 cycles. After the NOP, we could reload the correct PC and instruction, as we had a copy of the previous values, thus ensuring minimal disruption to pipeline flow.
We acknowledged that adding registers throughout the pipeline would result in both a power and area tradeoff. However, we determined that this tradeoff was worthwhile in order to achieve a performance boost. Specifically, in the case of cache-based memory, we had to disable the clock to all registers, effectively freezing the pipeline, while cache interactions were processed. This approach allowed us to avoid unnecessary delays caused by waiting for memory access, improving overall system performance.
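A minimal sketch of this freeze mechanism is shown below. It holds every pipeline register with a clock-enable that is deasserted while either cache is still talking to memory (a common FPGA-friendly equivalent of gating clk itself); the busy signal names are hypothetical.

```verilog
// Pipeline-freeze sketch: hold state while either cache transaction is in flight.
module pipe_reg #(parameter W = 32) (
    input              clk,
    input              icache_busy,   // ICACHE transaction in flight
    input              dcache_busy,   // DCACHE transaction in flight
    input      [W-1:0] d,
    output reg [W-1:0] q
);
    wire freeze = icache_busy || dcache_busy;

    always @(posedge clk)
        if (!freeze)        // keep the current value while the caches work
            q <= d;
endmodule
```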
Here is the performance of our design on the benchmark tests that were provided to us:
| Test Name | No Cache | Cache | Difference | % Change | Ratio (Cache / No Cache) |
|---|---|---|---|---|---|
| cachetest.out | 2359232 | 4377780 | 2018548 | 85.56% | 1.85559538 |
| final.out | 4441 | 9701 | 5260 | 118.44% | 2.184417924 |
| fib.out | 4196 | 8976 | 4780 | 113.92% | 2.139180172 |
| sum.out | 13678643 | 28082641 | 14403998 | 105.30% | 2.053028287 |
| replace.out | 13678661 | 28338923 | 14660262 | 107.18% | 2.071761483 |
We achieved a post-PAR slack of 0.107 ns at a 20 ns clock period (50 MHz). From post-PAR simulation of the critical path, we estimate a power consumption of 23.2 mW.
Here is the floorplan of our design. We estimate an area of around 1.18 mm² for the entire design with the direct-mapped cache.
