Lab 2 E-Mail follow ups from Professor (Updated Feb.5)

FAQ
Memories
C/Assembly to Verilog
PC/memory map/etc
Testcode update
More FAQs
Verilog and signness for shift Link

Announcements(FAQ) on Lab 2 Feb.2:

Do I have to implement the cpsr register?
YES. You should implement the 4 arithmetic bits.

How do I implement the condition bits?
This is part of the work of this lab, to understand it. But here’s a hint: one of the bits is C for carry out. A simple trick to get this bit is to zero extend the numbers from 32 to 33 bits… Here’s a link that you might find helpful: http://teaching.idallen.com/dat2343/10f/notes/040_overflow.txt This is also directly from the ARM instruction manual:
"The data processing operations may be classified as logical or arithmetic. The logical operations (AND, EOR, TST, TEQ, ORR, MOV, BIC, MVN) perform the logical action on all corresponding bits of the operand or operands to produce the result. If the S bit is set (and Rd is not R15, see below) the V flag in the CPSR will be unaffected, the C flag will be set to the carry out from the barrel shifter (or preserved when the shift operation is LSL #0), the Z flag will be set if and only if the result is all zeros, and the N flag will be set to the logical value of bit 31 of the result.

The arithmetic operations (SUB, RSB, ADD, ADC, SBC, RSC, CMP, CMN) treat each operand as a 32 bit integer (either unsigned or 2’s complement signed, the two are equivalent). If the S bit is set (and Rd is not R15) the V flag in the CPSR will be set if an overflow occurs into bit 31 of the result; this may be ignored if the operands were considered unsigned, but warns of a possible error if the operands were 2’s complement signed. The C flag will be set to the carry out of bit 31 of the ALU, the Z flag will be set if and only if the result was zero, and the N flag will be set to the value of bit 31 of the result (indicating a negative result if the operands are considered to be 2’s complement signed).” Here’s the manual: http://vision.gel.ulaval.ca/~jflalonde/cours/1001/h17/docs/arm-instructionset.pdf
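The zero-extension hint and the manual's flag rules for arithmetic operations can be checked against a small C reference model. This is only a sketch: the names alu_out_t, add_with_carry, alu_add and alu_sub are made up for illustration, and a 64-bit C integer stands in for the 33-bit value you would use in Verilog.

```c
#include <stdint.h>

/* Hypothetical reference model; none of these names come from the lab. */
typedef struct {
    uint32_t result;
    unsigned n, z, c, v;
} alu_out_t;

/* Compute a + b + carry_in in a wider type, the C analogue of zero-extending
   both operands from 32 to 33 bits. Bit 32 of the wide sum is the carry out. */
alu_out_t add_with_carry(uint32_t a, uint32_t b, unsigned carry_in)
{
    uint64_t wide = (uint64_t)a + (uint64_t)b + (uint64_t)(carry_in & 1u);
    alu_out_t out;

    out.result = (uint32_t)wide;
    out.n = (out.result >> 31) & 1u;        /* bit 31 of the result */
    out.z = (out.result == 0u);             /* result is all zeros */
    out.c = (unsigned)((wide >> 32) & 1u);  /* carry out of bit 31 */
    /* Signed overflow: both operands have the same sign and the result's
       sign differs from it. */
    out.v = ((~(a ^ b) & (a ^ out.result)) >> 31) & 1u;
    return out;
}

/* ADD is a + b. SUB is a + ~b + 1, so C set means "no borrow". */
alu_out_t alu_add(uint32_t a, uint32_t b) { return add_with_carry(a, b, 0u); }
alu_out_t alu_sub(uint32_t a, uint32_t b) { return add_with_carry(a, ~b, 1u); }
```

For example, alu_sub(5, 5) gives a zero result with Z and C both set, which is the flag state a CMP of two equal registers should leave behind.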

Do I have to support the COND field on instructions?
YES. Once you implement cpsr this becomes rather easy.
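
A minimal C sketch of the COND check, written as a reference model rather than Verilog. The function name cond_passes and its argument order are made up for illustration; the encodings are the standard ARM condition codes. Code 1111 is treated as "never execute" here, which is an assumption you should check against the manual.

```c
#include <stdint.h>

/* Returns 1 if the COND field (bits 31:28) of inst is satisfied by the
   current flags, 0 otherwise. Hypothetical helper for illustration. */
int cond_passes(uint32_t inst, unsigned n, unsigned z, unsigned c, unsigned v)
{
    /* Normalize the flags to 0 or 1 so the comparisons below are safe. */
    n = !!n; z = !!z; c = !!c; v = !!v;

    switch ((inst >> 28) & 0xFu) {
    case 0x0: return z;                 /* EQ: equal */
    case 0x1: return !z;                /* NE: not equal */
    case 0x2: return c;                 /* CS: unsigned higher or same */
    case 0x3: return !c;                /* CC: unsigned lower */
    case 0x4: return n;                 /* MI: negative */
    case 0x5: return !n;                /* PL: positive or zero */
    case 0x6: return v;                 /* VS: overflow */
    case 0x7: return !v;                /* VC: no overflow */
    case 0x8: return c && !z;           /* HI: unsigned higher */
    case 0x9: return !c || z;           /* LS: unsigned lower or same */
    case 0xA: return n == v;            /* GE: signed greater or equal */
    case 0xB: return n != v;            /* LT: signed less than */
    case 0xC: return !z && (n == v);    /* GT: signed greater than */
    case 0xD: return z || (n != v);     /* LE: signed less or equal */
    case 0xE: return 1;                 /* AL: always */
    default:  return 0;                 /* 1111: assumed never (see above) */
    }
}
```

In hardware the same table becomes a case statement on inst[31:28] whose output gates the register write, memory write and PC update for that instruction.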

How do I do signed arithmetic in Verilog?
Bit vectors in Verilog have no signedness; they are basically unsigned. When I write Verilog I try not to write “-” for subtract. Instead I manually do 2’s complement math, e.g. C = A + ~B + 1 instead of C = A - B. What I have found is that this generally pushes through a wider variety of synthesis tools into smaller designs. For example, yosys makes more compact designs this way. I haven’t checked in on Altera or Synopsys lately, although I hope they do a better job :) Remember the mantra on synthesis tools: trust but verify. For those doing the HW option on the BX board, the icebox_stat tool is a useful thing to use, or remove the -q flag from the SConstruct file on place and route and you can get a report of resource usage.

Do I have to implement byte orientated loads and stores as well as words?
NO. But you’ll find this isn’t that difficult to do if you want to go ahead and implement it.

Do I have to implement the shift/rotate on immediate and registers?
NO. You can if you like, however. I did in the solution set just to see how large it would be. It’s not bad. You can implement it and it won’t take all that many LUTs because it’s not a general purpose multiplier.
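
For anyone who does choose to implement it, the immediate form of the second operand is the simplest piece: when bit 25 of a data processing instruction is set, the low 8 bits are rotated right by twice the 4-bit rotate field. Here is a small C sketch of that decode, intended as a reference model only; the helper names ror32 and decode_imm_operand are hypothetical.

```c
#include <stdint.h>

/* Rotate a 32-bit value right. The amount is reduced mod 32, and a rotate
   by 0 is handled separately to avoid an undefined 32-bit shift in C. */
uint32_t ror32(uint32_t x, unsigned amount)
{
    amount &= 31u;
    if (amount == 0u)
        return x;
    return (x >> amount) | (x << (32u - amount));
}

/* Decode the immediate operand of a data processing instruction:
   bits 7:0 are the 8-bit value, bits 11:8 are the rotate field. */
uint32_t decode_imm_operand(uint32_t inst)
{
    uint32_t imm8 = inst & 0xFFu;
    unsigned rot  = (inst >> 8) & 0xFu;
    return ror32(imm8, 2u * rot);
}
```

With a rotate field of 0 the operand is just the 8-bit value, which is why a disassembler may print such an operand as "#255, 0".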

One last thing, you do not have to implement the pre and post increment or decrement addressing mode. Those are just silly and no other architecture has them. There’s nothing to learn there and two ported register files are expensive.

Memories (Feb.3/4)

I figured out something unfortunate about Yosys and ICE FPGAs.  It will not infer block RAMS for asynchronous RAMS.  It will only infer them for asynchronous ROMS :(   So two options for lab 2:

Option 1: Make your data memory very very very tiny.  Like 8 words.

Option 2: Or make your memories synchronous.

module test (
 input clk, wen, ren,
 input [7:0] waddr, raddr,
 input [31:0] wdata,
 output reg [31:0] rdata
);
 reg [31:0] mem [0:255];
 always @(posedge clk) begin
 if (wen)
  mem[waddr] <= wdata;
 rdata <= mem[raddr];
 end
endmodule

 
If you are using Altera parts I don't think this is an issue.  Last I remember their BRAMs could infer asynchronous RAMs.  But I'll have to investigate that before I know for sure.
 
Note that the solution for Lab 2 (which I'm working on now) will use *synchronous* RAMs, because that's closer to what you need for Lab 3.  
 
Following up on this most recent email, yosys is not very smart about inferring memories.  If you have the least bit of complex logic around the read and write ports and enables, then it will get confused.  Here’s an example of how to fix it (this is the register file from the solution I’m working on for lab 2).  Note how it pulls out the control logic into an always @(*) block and leaves the posedge block extremely simple.
 
 I realized in the middle of the night that my sample register file has phases in it and is writing two registers.  You do not have to do this.  I decided to implement the base offset update addressing mode in my solution, just for kicks.  This requires two write ports to the register file.  Again you do /not/ have to do this in your processor.
 
reg [3:0] rf_rs1;
reg [3:0] rf_rs2;
reg [3:0] rf_rs3;
reg [3:0] rf_ws1;
reg [3:0] rf_ws2;
reg [31:0] rf_wd1;
reg [31:0] rf_wd2;
reg rf_we1;
reg rf_we2;
reg [31:0] rf_d1;
reg [31:0] rf_d2;
reg [31:0] rf_d3;

reg [31:0] rf_d1_raw;
reg [31:0] rf_d2_raw;
reg [31:0] rf_d3_raw;
reg [31:0] rf_wd;
reg [3:0] rf_ws;
reg read_reg_file;
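reg write_reg_file;

// Register file storage: 16 x 32 bits (declaration assumed; it is not shown in the original excerpt)
reg [31:0] rf [0:15];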

always @(posedge clk) begin
 if (read_reg_file) begin
  rf_d1_raw <= rf[rf_rs1];
  rf_d2_raw <= rf[rf_rs2];
  rf_d3_raw <= rf[rf_rs3];
 end
 if (write_reg_file)
  rf[rf_ws] <= rf_wd;
end

always @(*) begin
 // Defaults on every path so that no latches are inferred
 read_reg_file = 1'b0;
 write_reg_file = 1'b0;
 rf_ws = rf_ws1;
 rf_wd = rf_wd1;
 if (phase == phase_regread)
  read_reg_file = 1'b1;
 if (phase == phase_regwrite1 && rf_we1) begin
  write_reg_file = 1'b1;
  rf_ws = rf_ws1;
  rf_wd = rf_wd1;
 end
 else if (phase == phase_regwrite2 && rf_we2) begin
  // Second write phase: use the second address/data pair
  write_reg_file = 1'b1;
  rf_ws = rf_ws2;
  rf_wd = rf_wd2;
 end

 ///////// TODO: possibly this is pc_plus8 not pc
 rf_d1 = (rf_rs1 == r15) ? pc : rf_d1_raw;
 rf_d2 = (rf_rs2 == r15) ? pc : rf_d2_raw;
 rf_d3 = (rf_rs3 == r15) ? pc : rf_d3_raw;

end

More about C/Assembly to Verilog (Feb.4)

I took the snow day this morning to write up the necessary Makefile, linker script and conversion tool to get code that you can write in C or assembly into hex files that you can load into your Verilog modules.  This is a nice way to test your code, but there’s a learning curve.  I hope the learning curve is worth it and you take this on for your projects.   I’m going to try and explain what is going on here in this email as well as how you can adapt the code to your own uses.  It will not work “out of the box” because everyone’s system is different and everyone’s Verilog is different.  This is about a lecture’s worth of material so, a pretty good topic to tackle on a snow day ;)
 
Tool chains (C compilers, assemblers, linkers, etc) are specific to an “environment”.  For example, that means a C compiler for x86 Linux produces binaries that are not compatible with x86 Windows.  The GNU toolchain can be targeted to just about anything, and for ARM you can target it enough so that it produces code that will run on your stripped down ARM processor, assuming it doesn’t generate instructions you haven’t implemented (more on this below).  There are three types of files to note here:
 
 * source files (C, assembly, etc)  (you write these in a text editor)
 * object files   (you compile source files with gcc to produce these)
 * executables  (you link together object files to produce this)
 
Source files are the most straightforward.  They are your assembly file or, if you like, C code you write to do what you want.  Object files are just the binary representation of these source files.  For C they have been transformed into the target assembly instruction set (in binary form).  Assembly source files are, somewhat by definition, already specific to an instruction set but are in human-readable form.  Assembling an assembly file produces a binary representation of it.  An executable file is all of your object files (and libraries of code too) linked together into a file that an operating system can load as a process.  Rather confusingly, executables are also referred to as “binaries”, even though object files are also a binary format in this process.  Nevertheless, when someone says “do you have a binary?” what they are really saying is “do you have an executable?”
 
For your processor, you do have an “environment”.  It’s just not what the ARM tool chains target by default.  There is no operating system.  There is a memory configuration you are creating.  But there is a way to use this tool chain to help you get working programs into your processor.  First you need to produce programs that behave as if they had a memory map compatible with your processor, and then you need to get the executable file into a format you can import into your Verilog project.  I’ll cover these two topics next.
 
Memory map: processes execute on an OS assuming a particular memory map.  The OS is expected to provide this.  What this means is “code is at this address”, “data is at this address” and “stack is over there”, etc.  I’m not sure what the ARM32 tools will default to (something for ARM Linux I imagine), but it surely is /not/ what your Verilog ARM processor expects.  If your ARM processor is like my solution set (doesn’t have to be, but just using the solution as an example), the code lives in one Verilog array and the data lives in another (actually my data is in four arrays, more on that later).  Fetches are directed at the code array, while load and store instructions are directed at the data array.  Both of these are zero-indexed.  Meaning, code address 0 loads the first element of the code array and data address zero loads the first byte from the data array.  This is /not/ how a modern processor works: there, code and data live in the same address space.  You could do this for Lab 2 if you wanted, but I don’t recommend it, because by Lab 3 you will need to undo it.  It’s best for the labs that you keep the code and data spaces separate.
 
So what to do?  The “trick” is to rely on the high order bits to separate your code and data from each other.  For example, in my processor and in the testcode.tar file I posted on the class webpage, I assume code lives at address 0x00000000, and data lives at address 0x00000400.  This is still a very small memory footprint (only 1024 bytes of code!), but that’s ok.  You can easily munge things around if you want more code.  So as long as the data array is less than or equal to 1024 bytes, accesses to address 0x400 will automatically wrap around to 0x000 in the data array.  But from the program code perspective it will “think” it’s accessing address 0x400.  You could put Verilog code in to do a bounds check or whatever if you like, but I don’t see the need for now.
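
The wrap-around is nothing more than dropping the high-order address bits. A tiny C model of it is sketched here, assuming a 1024-byte data array; DATA_BYTES and data_offset are made-up names for illustration. In Verilog the same effect falls out of indexing the array with only the low address bits.

```c
#include <stdint.h>

/* Assumed size of the data array in bytes; must be a power of two. */
#define DATA_BYTES 1024u

/* Map a program-visible data address onto an offset into the data array by
   keeping only the low-order bits. */
uint32_t data_offset(uint32_t addr)
{
    return addr & (DATA_BYTES - 1u);
}
```

So a program linked with its data at 0x400 reads offset 0 of the array when it loads from 0x400, and offset 4 when it loads from 0x404.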
 
An important element of the memory map is that program binaries (executables) consist of more than just code and data.  They in fact contain several “segments”.  Some of these segments are code (on Unix systems the code segment has historically been called “.text”).  Others do in fact contain data (thankfully named just “.data”).  But another important segment is *read only data* named something like “.rodata”.  The gcc compiler puts things like strings in the .rodata segment, and by default on many systems (ARM included) .rodata gets appended next to code in the program address space.  This is not what you want, since you want it to be up near your data so you can actually load from it.   Thankfully the linker can do this for you.  You just need to tell it the name of each segment you care about and where it should be.  This is done with what is called a “linker script”.  There’s a file in the testcode.tar file called “ld.script”.  This is the file that tells the linker where to place your code, data, and rodata.  If you want to use a different memory layout than my solution then you will need to hack the ld.script file.
 
Conversion to Verilog: The second step you need to do to use the GNU tools to program your processor is to get the binary into your Verilog.  This is thankfully easy with Verilog, once you get the files formatted correctly.  Verilog contains a command you can put in an initial block to read hex digits from a file.  Here’s a nice link to the topic: https://timetoexplore.net/blog/initialize-memory-in-verilog    The tricky bit is to get the object file data out.  Here I provided a bash script to help you do that (elftohex.sh).  This script takes in a program binary filename and produces 5 files: code.hex, data0.hex, data1.hex, data2.hex, and data3.hex.
 
Why 5 files and not 2?  I decided to support both word and byte level loads in the lab 2 solution.  The easiest way to do this was to use four arrays in Verilog — one for each low order two bits of the data address.  Byte loads are directed from the appropriate arrays, and word loads come from all four.  If you did not do this, that is fine, it wasn’t required as part of the lab 2 requirements.  But you will need to edit the elftohex script to produce the appropriate array types that you need.
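
To make the four-array idea concrete, here is a small C model of it. This is a sketch under assumptions: little-endian byte order (lane 0 holds bits 7:0 of each word), a made-up depth of 256 rows per lane, and hypothetical names (data_mem_t, store_word, load_word, load_byte).

```c
#include <stdint.h>

#define LANE_DEPTH 256u  /* assumed number of rows in each byte lane */

/* Four byte-wide arrays, one per value of the low two address bits. */
typedef struct {
    uint8_t lane[4][LANE_DEPTH];
} data_mem_t;

/* A word store writes one byte into the same row of all four lanes. */
void store_word(data_mem_t *m, uint32_t addr, uint32_t value)
{
    uint32_t row = (addr >> 2) % LANE_DEPTH;
    for (unsigned i = 0; i < 4; i++)
        m->lane[i][row] = (uint8_t)(value >> (8u * i));
}

/* A word load reads the same row from all four lanes and reassembles it. */
uint32_t load_word(const data_mem_t *m, uint32_t addr)
{
    uint32_t row = (addr >> 2) % LANE_DEPTH;
    uint32_t value = 0;
    for (unsigned i = 0; i < 4; i++)
        value |= (uint32_t)m->lane[i][row] << (8u * i);
    return value;
}

/* A byte load picks the lane from the low two bits of the address. */
uint8_t load_byte(const data_mem_t *m, uint32_t addr)
{
    return m->lane[addr & 3u][(addr >> 2) % LANE_DEPTH];
}
```

Each lane in this model corresponds to one of the data0.hex through data3.hex files, which is why a design without byte loads can get by with a single word-wide array and a single data file.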
 
Things you will most certainly need to do yourself to use this code: As I mentioned, much of this stuff you cannot just pick up and run with unchanged.  Your tools and setup and Verilog will be different than mine.  To use this stuff you’ll need to hack on it.  At a minimum I expect you will need to: edit ld.script to adjust the memory layout; hack elftohex.sh to produce the appropriate array types for the appropriate section names; and hack your Verilog to import the arrays as needed.
 
Why bother?  Well for one thing if you can run code spewed out from the compiler then there’s a pretty good chance your processor is working!  That’s the first thing.  Second thing is it’ll let you write larger and more varied test cases faster than by hand.  Finally, it’s just fun.  You’re building a real processor.  Why not get it connected up to the GNU tool chain and program it that way :)

PC / memory map / etc (Feb. 5)

I realized in the middle of the night I left out an important part of the memory map discussion.  Processors need to start somewhere.  We discussed this briefly at the start of the quarter, and it was supposed to be part of Monday’s lecture. But given its saliency I thought I’d email something out.
 
Processors start somewhere.  Where they start is architecture dependent.  Meaning, x86 processors start at a different location than ARM processors.  Usually the same type of processor will start at the same address.  With emphasis on the word *usually*.  There are exceptions to every rule in computer engineering.
 
For the solution in lab 1 and lab 2 I just made my ARM processor start at location zero.  This is done in the always block that updates the PC:
 
always @(posedge clk) begin
 if (!nreset)
  pc <= 32'd0;
 else begin
  // Good stuff here
 end
end
 
Now once you’ve chosen a start address your code has to conform to it.  The testcode.tar file posted on the website will set up an executable, as part of the link stage, to start at location 0.  It does this by meeting three requirements: (1) the linker is told to put the code (.text) segment at address 0; (2) the linker is linking for ARM Linux and on such systems programs start at a label called “_start”, so there is a start.s file with a label at the top called _start: ; (3) the start.o object is put first on the command line of objects to link.  The linker is straightforward in its operation and so puts this object file first in the code segment.  If you violate any of these requirements your test code will not function correctly.
 
What is _start ?  C programmers are usually told that programs start at a function called main().  But this is not true.  Programs start at _start, inside of libc (or on some systems inside of a library called ld.so or object ld.o).  This start section of the code does a few things, such as initialize the libc library itself.  On Windows systems it may parse the command line string into the argument array that main() expects.  On a Linux system the stack is already set up by the operating system.  But on the processor you are building there is no operating system, so the start.s file I posted chooses an address for the stack, near the top of the data arrays in the lab 2 solution.  It also sets up the frame pointer because gcc is using the frame pointer by default.  After doing this it then invokes main() like it was a function.  When main returns, ordinarily the _start block would invoke the exit() system call, telling the operating system to terminate the process.  But again, since you have no OS, the _start block that I provided just jumps back to location 0 and effectively restarts the code.  Data is unchanged, however, so you can actually write test code to detect that it is re-entering itself (if you want).

Testcode Update(Feb.5)

Been hacking on the lab 2 solution.  Let me /strongly/ encourage everyone to be working hard on lab 2.  It’s a lot of work to get all the little bits of the instruction semantics correct.
 
In the process of debugging my solution with C code I discovered something interesting that required an update to the testcode.tar file (just now updated on the website; if you downloaded it before, please re-download).  ARM code, unless you are running on the newer ARM cores, expects read access to the code segment.  The compiler frequently uses the code segment to store constants.  For example, consider this C code:
 
int array[100] = { 4, 5, 6 };

int x = 0x12;
volatile unsigned char *debug_port;

int main() {
 int register i;
 debug_port = (unsigned char *) 0xffff0000;

 for (i = 0; i < 10; i++) {
  (*debug_port) = (unsigned char) i;
  array[i] = array[i] + i;
 }
}

 
This gets compiled into:
 
00000018 <main>:
  18:   e59f303c    ldr r3, [pc, #60]   ; 5c <main+0x44>
  1c:   e59f203c    ldr r2, [pc, #60]   ; 60 <main+0x48>
  20:   e5832000    str r2, [r3]
  24:   e59f1038    ldr r1, [pc, #56]   ; 64 <main+0x4c>
  28:   e3a03000    mov r3, #0, 0
  2c:   e59fc028    ldr ip, [pc, #40]   ; 5c <main+0x44>
  30:   e59c2000    ldr r2, [ip]
  34:   e20300ff    and r0, r3, #255, 0 ; 0xff
  38:   e5c20000    strb r0, [r2]
  3c:   e5b12004    ldr r2, [r1, #4]!
  40:   e0822003    add r2, r2, r3
  44:   e5812000    str r2, [r1]
  48:   e2833001    add r3, r3, #1, 0
  4c:   e353000a    cmp r3, #10, 0
  50:   1afffff6    bne 30 <main+0x18>
  54:   e3a00000    mov r0, #0, 0
  58:   e12fff1e    bx lr
  5c:   0000119c    .word 0x0000119c
  60:   ffff0000    .word 0xffff0000
  64:   00001000    .word 0x00001000
 
See the .words at the end of the function?  That's the ARM way of getting a large constant into a register.  Specifically this line of C code "debug_port = (unsigned char *) 0xffff0000;" is handled by this assembly instruction: "ldr r2, [pc, #60]" which is loading a data item from the *code* segment.  This is a problem in the design I've been doing where the code and data are stored in separate Verilog arrays and both are essentially zero based.
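
The address of each literal follows from the ARM rule that reading the PC yields the address of the current instruction plus 8. A short C sketch of that calculation for a PC-relative LDR with an immediate offset is below; ldr_literal_address is a hypothetical helper name, and it assumes the instruction really is that form of LDR.

```c
#include <stdint.h>

/* Compute the address accessed by "ldr rd, [pc, #imm]".
   inst_addr is the address of the LDR itself, inst is its encoding.
   Bits 11:0 hold the offset and bit 23 (U) selects add or subtract. */
uint32_t ldr_literal_address(uint32_t inst_addr, uint32_t inst)
{
    uint32_t pc_value = inst_addr + 8u;   /* the PC reads as address + 8 */
    uint32_t offset   = inst & 0xFFFu;
    unsigned up       = (inst >> 23) & 1u;

    return up ? pc_value + offset : pc_value - offset;
}
```

Applied to the listing, the instruction at 0x1c (e59f203c) gives 0x1c + 8 + 60 = 0x60, which is the .word holding 0xffff0000.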
 
To solve this problem I decided in my design, and in the testcode.tar infrastructure I posted, to duplicate the code in the code array and in the data array.  And then stick the actual data up higher.  Again, my solution supports both word and byte loads, so you may not need all these lines, but I initialize my arrays like this:
 
initial begin
 $readmemh("testcode/code.hex", code_mem);

 // Mirror code into data segment
 $readmemh("testcode/code_data0.hex", data0, 0);
 $readmemh("testcode/code_data1.hex", data1, 0);
 $readmemh("testcode/code_data2.hex", data2, 0);
 $readmemh("testcode/code_data3.hex", data3, 0);

 // Put the actual data up by 4096 bytes (1024 X 4)
 $readmemh("testcode/data0.hex", data0, 1024);
 $readmemh("testcode/data1.hex", data1, 1024);
 $readmemh("testcode/data2.hex", data2, 1024);
 $readmemh("testcode/data3.hex", data3, 1024);
end
 
The new testcode.tar file is set up to support a 4K code space and a 4K data space.  The actual Verilog arrays are 4K for code and 8K for data, supporting the mirroring of the code space into the data space.
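
How a load address lands in this layout can be sketched in C. This assumes the four byte-lane arrays described earlier, each 2048 rows deep (8K bytes total), with the code mirror in the first 4096 bytes; the names CODE_BYTES, LANE_ROWS, data_lane, data_row and in_code_mirror are made up for illustration.

```c
#include <stdint.h>

#define CODE_BYTES 4096u  /* code mirrored at the bottom of the data arrays */
#define LANE_ROWS  2048u  /* assumed: 8K bytes of storage over 4 byte lanes */

/* Which of the four byte arrays holds this byte address. */
unsigned data_lane(uint32_t addr)
{
    return addr & 3u;
}

/* Row within each byte array: the word address, wrapped to the array depth. */
uint32_t data_row(uint32_t addr)
{
    return (addr >> 2) % LANE_ROWS;
}

/* True when the load hits the mirrored copy of the code (a literal pool). */
int in_code_mirror(uint32_t addr)
{
    return addr < CODE_BYTES;
}
```

A literal-pool load from 0x5c reads row 23 of the mirrored code, while data linked at 0x1000 starts at row 1024, matching the start index passed to $readmemh for the data files.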
 
I'll explain more in class on this, but this duplication isn't necessary in practice.  In reality both the code and data are cached in separate arrays (caches) which ultimately are backed by main memory.  But we're not implementing caches so we have this little bobble in the design.  For lab 2, it's easy to fix this bobble by doing the PC instruction fetch in a separate cycle from the memory load.  I almost did this.  But in lab 3, when we pipeline the design, it's nice to be able to both fetch and load from memory at the same time.  So I ultimately decided to just choose the duplicating route for now.

More FAQs(Feb.5)

* Do I have to implement all arithmetic instructions?  NO.  But like I said you will not find it hard to do so.  It’ll add a fraction more lines to your code.
 
* Do I have to implement condition codes?  YES.  Except you do not have to implement them for the logical operators correctly (see next question).
 
* Do I have to implement the rotate and shifts for the ALU operand?  NO.   Implementing them does not require that many lines of Verilog (perhaps 25-50) but implementing them correctly takes time.  I’m still debugging them in the lab 2 solution :)
 
* Do I have to implement write back of base register on load or store?  NO.  It is not difficult to implement this, however.
 
* Do I have to use the C to Verilog stuff to test my processor?  NO.  I did this because I’m lazy and thought it would be easier to test my solution this way.  But you certainly don’t have to if you feel confident in your own testing methodology.
 
Finally, I found the following useful for tracking down bugs in my design (code below my signature).  I’m sending this out to give you some ideas on how you might debug your own designs.  I expanded the test code to have 8 bytes of output.  The solution to lab 2 is multi-cycle, to deal with the synchronous RAMs.  What I did was tweak the output to the ports to output useful information about the result of the previous phase.  I also altered these depending on what type of bug I was trying to fix.  For example, when I had a bug in the memory system I altered what was displayed during the regwrite1 and regwrite2 phases to show the memory address and the result from memory (for a load).  By tracking the processor data flow phase by phase and looking at the key outputs from the prior phase I was able to fix a lot of bugs :)
 
-Mark
 
  // This is a multi-cycle execution engine.  These are the phases.
  // (Declared first so that the code below does not use them before they are defined.)
  localparam phase_fetch = 3'b000;
  localparam phase_regread = 3'b001;
  localparam phase_mem = 3'b010;
  localparam phase_regwrite1 = 3'b011;
  localparam phase_regwrite2 = 3'b100;
  localparam phase_fault = 3'b111;
  reg [2:0] phase;

  // Select the 32-bit word to display, depending on the current phase
  reg [31:0] debug_word;
  always @(*) begin
    case (phase)
      phase_fetch:      debug_word = pc;
      phase_regread:    debug_word = inst;
      phase_mem:        debug_word = alu_result;
      phase_regwrite1:  debug_word = rf_wd1;
      phase_regwrite2:  debug_word = rf_wd2;
      default:          debug_word = 32'd0;
    endcase
  end

  // Debug outputs
  assign debug_port1 = pc[7:0];
  assign debug_port2 = {zero, last_phase, zero, phase };
  assign debug_port3 = program_debug;
  assign debug_port4 = debug_word[31:24];
  assign debug_port5 = debug_word[23:16];
  assign debug_port6 = debug_word[15:8];
  assign debug_port7 = debug_word[7:0];