Pipelined ARM

Lab 3 is simple to describe but difficult to do: pipeline your Lab 2 design. The pipeline needs at least 5 stages:

Fetch: read the instruction from code memory

Decode/Register access: decode the instruction and read the register file

Execute: execute the instruction

Memory access: read from or write to memory

Write back: Write the results back to the register file

Optionally, you can implement a 6-stage pipeline in which the write to the second destination register is performed in the 6th stage. The Lab 2 solution supported the second write, so if you start from the Lab 2 solution instead of your own Lab 2 design, you need to either implement a six-stage pipeline or drop support for writing the second operand.
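As a rough mental model of the five stages listed above, the pipeline can be pictured as a row of slots that all shift by one on each clock edge. The sketch below is only an illustration in C++, not the lab's Verilog; the names Slot, Pipeline, and tick are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Stage order follows the list above; all other names are hypothetical.
constexpr std::size_t FETCH = 0, DECODE = 1, EXECUTE = 2, MEMORY = 3,
                      WRITEBACK = 4, NUM_STAGES = 5;

// Contents of one stage. valid == false models an empty slot (a bubble).
struct Slot {
  bool valid;
  uint32_t instr;
};

struct Pipeline {
  std::array<Slot, NUM_STAGES> slot{};  // all slots start empty

  // One clock edge: every instruction moves one stage forward and `fetched`
  // enters the fetch stage. Returns whatever just left write back.
  Slot tick(Slot fetched) {
    Slot retired = slot[WRITEBACK];
    for (std::size_t s = WRITEBACK; s > FETCH; --s) slot[s] = slot[s - 1];
    slot[FETCH] = fetched;
    return retired;
  }
};
```

In this model an instruction needs five clock edges to travel from fetch to write back, but once the pipeline is full a different instruction can finish on every clock.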

FAQ

When is the due date? TBD. This is the last lab. But there's another homework, and I need to look at the schedule. THIS LAB IS HARD. Please start early. I received a lot of questions in the last week of Lab 2. This is not good. If you start this lab 4-5 days before it is due you will find it super stressful. Pipelining introduces a lot more moving parts to the design. Go slowly and methodically, and have lots of test cases.

Do I need to use my lab 2 solution or can I start from the provided lab 2 solution? You can start from your solution, the provided solution or mix and match components.

Do I need to handle data hazards via forwarding? Absolutely. That's what this lab is all about.
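To make the idea of forwarding concrete, here is a minimal C++ sketch of the operand-select logic that the execute stage might use. It assumes a 5-stage pipeline in which results can be taken from the memory and write-back stages; StageResult and forward are hypothetical names, and your Verilog will look different.

```cpp
#include <cstdint>

// Hypothetical summary of the instruction currently sitting in a later stage.
struct StageResult {
  bool     writes_reg;  // the instruction will write a register
  unsigned rd;          // its destination register number
  uint32_t value;       // the result value available in that stage
};

// Pick the value of source register `rs` as seen by the execute stage.
// The instruction in the memory stage is younger than the one in write back,
// so it takes priority; otherwise fall back to the register-file value.
uint32_t forward(unsigned rs, uint32_t regfile_value,
                 const StageResult& mem, const StageResult& wb) {
  if (mem.writes_reg && mem.rd == rs) return mem.value;
  if (wb.writes_reg && wb.rd == rs) return wb.value;
  return regfile_value;
}
```

Note that a load in the memory stage does not have its data yet, so this sketch only covers producers whose result is already computed; the load case is the one that needs a stall.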

What do I do with the instructions that are fetched after a taken branch? You need to implement the ARM32 or ARM64 ISA correctly. In this ISA that means you need to squash those instructions (make them have no effect on architecturally visible state).
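One common way to squash is to carry a valid bit along with each instruction and gate every architectural write with it. The C++ sketch below assumes the branch is resolved in the execute stage, so the two younger instructions are in fetch and decode; that placement, and the names PipeReg, squash_younger, and reg_write_enable, are assumptions made for illustration.

```cpp
#include <cstdint>

// One pipeline register. valid == false marks a bubble, which must not
// write registers, memory, or flags.
struct PipeReg {
  bool     valid;
  uint32_t instr;
};

// Assumed to be called when a branch in execute resolves as taken.
// The instructions behind it are turned into bubbles.
void squash_younger(PipeReg& fetch, PipeReg& decode) {
  fetch.valid = false;
  decode.valid = false;
}

// Every write enable is gated by the valid bit of the instruction.
bool reg_write_enable(const PipeReg& r, bool wants_write) {
  return r.valid && wants_write;
}
```

The same gating would apply to memory writes and flag updates, since those are architecturally visible too.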

Do my memories need to be clocked for read? Yes.
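The practical consequence of a clocked read is that the data shows up one clock edge after the address is presented, which is why the read naturally occupies its own pipeline stage. A small C++ model of that behavior, with the hypothetical names SyncReadMem, poke, clock, and rdata:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Model of a memory with a registered (synchronous) read port: the address
// is sampled on the clock edge and the data is visible only after that edge.
class SyncReadMem {
 public:
  explicit SyncReadMem(std::size_t words) : mem_(words, 0) {}

  // Backdoor used to preload contents for this example.
  void poke(std::size_t addr, uint32_t data) { mem_[addr] = data; }

  // One rising clock edge: latch the word at `addr` into the output register.
  void clock(std::size_t addr) { rdata_ = mem_[addr]; }

  // Output register: reflects the address presented at the previous edge.
  uint32_t rdata() const { return rdata_; }

 private:
  std::vector<uint32_t> mem_;
  uint32_t rdata_ = 0;
};
```

Until clock() has been called with an address, rdata() still holds the old value; a combinational (unclocked) read would instead change as soon as the address changed.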

Can I do this lab one stage at a time? Yes. That's how I intend to create the solution. I'll probably start at fetch, and pipeline that one stage. Then do the next, and so on. Always ensuring a working design as I go.

For those of you using the open source tools, I wanted to tell you how I debug things. I use the BX board and the Python debug script if it’s a simple thing. But Lab 3 has pushed the limit of “simple” for me, so I’ve resorted to creating waveforms in simulation and viewing them on my laptop. To generate the waveforms I use Verilator, and to view them I use GTKWave. Here are two files that I use that make this easy. Note that even if you are doing the software side of this class, this will be useful. Verilator and GTKWave are much simpler tools (to me) than Quartus, etc.
 
Good luck!
 
-Mark
 
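/// This file “makesim.sh” creates a trace of the CPU
makesim.sh:

#!/bin/bash
# remove the output of any previous run
rm -rf obj_dir
rm -f cpu.vcd
# translate cpu.v to C++ with tracing enabled, using sim.cpp as the testbench
verilator --cc --trace cpu.v --exe sim.cpp
# build the simulator, then run it to produce cpu.vcd
make -j -C obj_dir/ -f Vcpu.mk Vcpu
obj_dir/Vcpu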

///// This file is the C++ file needed for verilator

sim.cpp:

#include <cstdlib>
#include "Vcpu.h"
#include "verilated.h"
#include "verilated_vcd_c.h"

int main(int argc, char **argv, char **env) {
  int i;
  int clk;
  Verilated::commandArgs(argc, argv);
  // init top verilog instance
  Vcpu* top = new Vcpu;
  // init trace dump
  Verilated::traceEverOn(true);
  VerilatedVcdC* tfp = new VerilatedVcdC;
  top->trace (tfp, 99);
  tfp->open ("cpu.vcd");
  // initialize simulation inputs
  top->clk_in = 1;
  top->nreset = 0;
  // run simulation for 1000 clock periods
  for (i=0; i<1000; i++) {
    // hold nreset low for the first 3 clock periods
    top->nreset = (i > 2);
    // dump variables into VCD file and toggle clock
    for (clk=0; clk<2; clk++) {
      tfp->dump (2*i+clk);
      top->clk_in = !top->clk_in;
      top->eval ();
    }
    // stop early if the design called $finish, but still close the trace
    if (Verilated::gotFinish())  break;
  }
  tfp->close();
  delete top;
  exit(0);
}

You will need to “stall” the pipeline now and then, for example on a LOAD/USE dependence. Personally, I find it easier to write these things as “advance” rather than “stall”. That is, I don’t compute stalls, I compute whether a pipeline stage can advance. You do what you want of course, but that’s just how my brain works. As an aside, when I teach this in class, I think “stall”. It’s when I go to implement it in Verilog that I think “advance”.
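Here is one way the “advance” formulation could look for the LOAD/USE case, written as a small C++ function rather than Verilog. It assumes a load's data is not available until after the memory stage, so an instruction in decode that reads the load's destination has to wait one cycle. The structs HazardInputs and Advance and the function compute_advance are hypothetical names for illustration.

```cpp
// Hypothetical summary of what the hazard logic needs to know.
struct HazardInputs {
  bool     ex_valid;     // there is a real instruction in execute
  bool     ex_is_load;   // ...and it is a load
  unsigned ex_rd;        // destination register of that load
  bool     id_valid;     // there is a real instruction in decode
  bool     id_uses_rn;   // it reads its first source register
  bool     id_uses_rm;   // it reads its second source register
  unsigned id_rn;
  unsigned id_rm;
};

// One "can advance" bit per stage that might have to hold.
struct Advance {
  bool fetch;
  bool decode;
  bool bubble_to_execute;  // execute receives a bubble instead of decode's instruction
};

Advance compute_advance(const HazardInputs& h) {
  bool load_use = h.ex_valid && h.ex_is_load && h.id_valid &&
                  ((h.id_uses_rn && h.id_rn == h.ex_rd) ||
                   (h.id_uses_rm && h.id_rm == h.ex_rd));
  Advance a;
  a.decode = !load_use;          // decode holds its instruction on a load/use
  a.fetch = a.decode;            // fetch can only advance if decode frees its slot
  a.bubble_to_execute = load_use;
  return a;
}
```

Each pipeline register then updates only when its stage's advance bit is set, and the stages past the hazard keep advancing so the load can complete.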
 
At the end of the day, my pipelined processor implementation took about 500 more LUTs than the non-pipelined version.  Total size was ~ 3800 LUTs including USB interface, or about 2500 LUTs for the processor itself.  Don’t be concerned if yours is 1000 or so LUTs more than the non-pipelined version.  There’s a lot of variation in how you can design these things.