Lab 2 Follow ups from Professor

Do I have to implement the cpsr register?
YES. You should implement the 4 arithmetic bits (N, Z, C, and V).

How do I implement the condition bits?
Understanding them is part of the work of this lab. But here’s a hint: one of the bits is C, for carry out. A simple trick to get this bit is to zero extend the numbers from 32 to 33 bits… Here’s a link that you might find helpful: http://teaching.idallen.com/dat2343/10f/notes/040_overflow.txt The following is also taken directly from the ARM instruction manual:
"The data processing operations may be classified as logical or arithmetic. The logical operations (AND, EOR, TST, TEQ, ORR, MOV, BIC, MVN) perform the logical action on all corresponding bits of the operand or operands to produce the result. If the S bit is set (and Rd is not R15, see below) the V flag in the CPSR will be unaffected, the C flag will be set to the carry out from the barrel shifter (or preserved when the shift operation is LSL #0), the Z flag will be set if and only if the result is all zeros, and the N flag will be set to the logical value of bit 31 of the result.

The arithmetic operations (SUB, RSB, ADD, ADC, SBC, RSC, CMP, CMN) treat each operand as a 32 bit integer (either unsigned or 2’s complement signed, the two are equivalent). If the S bit is set (and Rd is not R15) the V flag in the CPSR will be set if an overflow occurs into bit 31 of the result; this may be ignored if the operands were considered unsigned, but warns of a possible error if the operands were 2’s complement signed. The C flag will be set to the carry out of bit 31 of the ALU, the Z flag will be set if and only if the result was zero, and the N flag will be set to the value of bit 31 of the result (indicating a negative result if the operands are considered to be 2’s complement signed)." Here’s the manual: http://vision.gel.ulaval.ca/~jflalonde/cours/1001/h17/docs/arm-instructionset.pdf
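To make the manual's wording concrete, here is a minimal C sketch of the four flags for a flag-setting ADD. It is a reference model you could compare your Verilog against, not the lab solution. The names flags_t and adds_model are made up for illustration, and a 64-bit add stands in for the 33-bit zero extension trick.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the four arithmetic CPSR bits. */
typedef struct {
    bool n, z, c, v;
} flags_t;

/* Reference model of a flag-setting ADD (adds_model is an illustrative name). */
uint32_t adds_model(uint32_t a, uint32_t b, flags_t *f)
{
    /* Zero extend both operands past 32 bits so the carry lands in bit 32. */
    uint64_t wide = (uint64_t)a + (uint64_t)b;
    uint32_t result = (uint32_t)wide;

    f->c = ((wide >> 32) & 1u) != 0;   /* carry out of bit 31 */
    f->z = (result == 0);              /* result is all zeros */
    f->n = ((result >> 31) & 1u) != 0; /* bit 31 of the result */
    /* Signed overflow: both operands share a sign that the result lacks. */
    f->v = (((uint32_t)(~(a ^ b) & (a ^ result))) >> 31) != 0;
    return result;
}
```

In Verilog the same idea would be a 33-bit sum of the two zero-extended operands, with bit 32 giving C.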

Do I have to support the COND field on instructions?
YES. Once you implement cpsr this becomes rather easy.
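As a rough illustration of why this is easy once the flags exist, here is a C sketch of the condition check. The function name cond_passes is hypothetical. The encodings are the standard ARM condition codes in bits 31:28, and treating 0xF as "never execute" is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when an instruction with the given COND field (bits 31:28)
 * should execute under the current N, Z, C, V flags. */
bool cond_passes(uint32_t instr, bool n, bool z, bool c, bool v)
{
    switch (instr >> 28) {
    case 0x0: return z;               /* EQ */
    case 0x1: return !z;              /* NE */
    case 0x2: return c;               /* CS */
    case 0x3: return !c;              /* CC */
    case 0x4: return n;               /* MI */
    case 0x5: return !n;              /* PL */
    case 0x6: return v;               /* VS */
    case 0x7: return !v;              /* VC */
    case 0x8: return c && !z;         /* HI */
    case 0x9: return !c || z;         /* LS */
    case 0xA: return n == v;          /* GE */
    case 0xB: return n != v;          /* LT */
    case 0xC: return !z && (n == v);  /* GT */
    case 0xD: return z || (n != v);   /* LE */
    case 0xE: return true;            /* AL */
    default:  return false;           /* 0xF: assumed "never" in this sketch */
    }
}
```

In hardware this is a 16-way case statement on the top four instruction bits that gates the register and memory write enables.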

How do I do signed arithmetic in Verilog?
By default, bit vectors in Verilog have no signedness; they are basically unsigned. When I write Verilog I try not to write “-” for subtract. Instead I manually do 2’s complement math, e.g. C = A + ~B + 1 instead of C = A - B. What I have found is that this generally comes out of a wider variety of synthesis tools as a smaller design. For example, yosys makes more compact designs this way. I haven’t checked in on Altera or Synopsys lately, although I hope they do a better job :) Remember the mantra on synthesis tools: trust but verify. For those doing the HW option on the BX board, the icebox_stat tool is useful, or you can remove the -q flag from the place and route step in the SConstruct file to get a report of resource usage.
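The A + ~B + 1 identity is easy to sanity check in software. Below is a small C sketch; the function name sub_twos_complement is made up. It also returns bit 32 of the widened sum, which on ARM is the C flag after a subtract (set when no borrow occurred).

```c
#include <stdbool.h>
#include <stdint.h>

/* Subtract without using '-': add the 2's complement of b.
 * carry_out receives bit 32 of the widened sum, which is 1 when
 * no borrow occurred (a >= b as unsigned values). */
uint32_t sub_twos_complement(uint32_t a, uint32_t b, bool *carry_out)
{
    uint64_t wide = (uint64_t)a + (uint64_t)(uint32_t)~b + 1u;
    *carry_out = ((wide >> 32) & 1u) != 0;
    return (uint32_t)wide;
}
```

The result matches ordinary 32-bit subtraction for every input, so the same adder can serve both ADD and SUB.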

Do I have to implement byte-oriented loads and stores as well as word ones?
NO. But you’ll find this isn’t that difficult to do if you want to go ahead and implement it.

Do I have to implement the shift/rotate on immediate and registers?
NO. You can if you like, however. I did in the solution set just to see how large it would be. It’s not bad. You can implement it and it won’t take all that many LUTs because it’s not a general purpose multiplier.

One last thing: you do not have to implement the pre- and post-increment or decrement addressing modes. Those are just silly and no other architecture has them. There’s nothing to learn there, and register files with two write ports are expensive.

Memories (Feb.3/4)

If you are using Altera parts I don't think this is an issue. Last I remember their BRAMs could infer asynchronous RAMs. But I'll have to investigate that before I know for sure.

Note that the solution for Lab 2 (which I'm working on now) will use *synchronous* RAMs, because that's closer to what you need for Lab 3.

Following up on this most recent email, yosys is not very smart about inferring memories. If you have the least bit of complex logic around the read and write ports and enables, then it will get confused. Here’s an example of how to fix it (this is the register file from the solution I’m working on for lab 2). Note how it pulls out the control logic into an always @(*) block and leaves the posedge block extremely simple.

I realized in the middle of the night that my sample register file has phases in it and is writing two registers. You do not have to do this. I decided to implement the base offset update addressing mode in my solution, just for kicks. This requires two write ports to the register file. Again you do /not/ have to do this in your processor.

reg [3:0] rf_rs1;
reg [3:0] rf_rs2;
reg [3:0] rf_rs3;
reg [3:0] rf_ws1;
reg [3:0] rf_ws2;
reg [31:0] rf_wd1;
reg [31:0] rf_wd2;
reg rf_we1;
reg rf_we2;
reg [31:0] rf_d1;
reg [31:0] rf_d2;
reg [31:0] rf_d3;

reg [31:0] rf_d1_raw;
reg [31:0] rf_d2_raw;
reg [31:0] rf_d3_raw;
reg [31:0] rf_wd;
reg [3:0] rf_ws;
reg read_reg_file;
reg write_reg_file;

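// Keep the clocked block trivial so yosys can infer a memory.
always @(posedge clk) begin
    if (read_reg_file) begin
        rf_d1_raw <= rf[rf_rs1];
        rf_d2_raw <= rf[rf_rs2];
        rf_d3_raw <= rf[rf_rs3];
    end
    if (write_reg_file)
        rf[rf_ws] <= rf_wd;
end

// All control logic lives here. rf, true, false, r15, phase and the
// phase_* constants are assumed to be defined elsewhere in the design.
always @(*) begin
    read_reg_file = false;
    write_reg_file = false;
    // Default the write port so no latches are inferred.
    rf_ws = rf_ws1;
    rf_wd = rf_wd1;

    if (phase == phase_regread)
        read_reg_file = true;
    if (phase == phase_regwrite1 && rf_we1) begin
        write_reg_file = true;
        rf_ws = rf_ws1;
        rf_wd = rf_wd1;
    end
    else if (phase == phase_regwrite2 && rf_we2) begin
        // Second write (base register update) uses the second port's signals.
        write_reg_file = true;
        rf_ws = rf_ws2;
        rf_wd = rf_wd2;
    end

    ///////// TODO: possibly this is pc_plus8 not pc
    rf_d1 = (rf_rs1 == r15) ? pc : rf_d1_raw;
    rf_d2 = (rf_rs2 == r15) ? pc : rf_d2_raw;
    rf_d3 = (rf_rs3 == r15) ? pc : rf_d3_raw;
end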

More about C/Assembly to Verilog(Feb.4)

I took the snow day this morning to write up the necessary Makefile, linker script and conversion tool to get code that you can write in C or assembly into hex files that you can load into your Verilog modules. This is a nice way to test your code, but there’s a learning curve. I hope the learning curve is worth it and you take this on for your projects. I’m going to try and explain what is going on here in this email as well as how you can adapt the code to your own uses. It will not work “out of the box” because everyone’s system is different and everyone’s Verilog is different. This is about a lecture’s worth of material so, a pretty good topic to tackle on a snow day ;)

Source files are the most straightforward. They are your assembly files or, if you like, the C code you write to do what you want. Object files are just the binary representation of these source files. For C, the source has been transformed into the target instruction set (in binary form). Assembly source files are, somewhat by definition, already specific to an instruction set, but in human-readable form. Assembling an assembly file produces a binary representation of it. An executable file is all of your object files (and libraries of code too) linked together into a file that an operating system can load as a process. Rather confusingly, executables are also referred to as “binaries”, even though object files are also a binary format in this process. Nevertheless, when someone says “do you have a binary?” what they are really saying is “do you have an executable?”

Memory map: processes execute on an OS assuming a particular memory map. The OS is expected to provide this. What this means is “code is at this address”, “data is at this address” and “stack is over there”, etc. I’m not sure what the ARM32 tools will default to (something for ARM Linux I imagine), but it surely is /not/ what your Verilog ARM processor expects. If your ARM processor is like my solution set (doesn’t have to be, but just using the solution as an example), the code lives in one Verilog array and the data lives in another (actually my data is in four arrays, more on that later). Fetches are directed at the code array, while load and store instructions are directed at the data array. Both of these are zero-indexed, meaning code address 0 loads the first element of the code array and data address 0 loads the first byte from the data array. This is /not/ how a modern processor works: there, code and data live in the same address space. You could do this for Lab 2 if you wanted, but I don’t recommend it, because by Lab 3 you will need to undo it. It’s best for the labs that you keep the code and data spaces separate.

So what to do? The “trick” is to rely on the high order bits to separate your code and data from each other. For example, in my processor and in the testcode.tar file I posted on the class webpage, I assume code lives at address 0x00000000, and data lives at address 0x00000400. This is still a very small memory footprint (only 1024 bytes of code!), but that’s ok. You can easily munge things around if you want more code. So as long as the data array is less than or equal to 1024 bytes, accesses to address 0x400 will automatically wrap around to 0x000 in the data array. But from the program code perspective it will “think” it’s accessing address 0x400. You could put Verilog code in to do a bounds check or whatever if you like, but I don’t see the need for now.
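The wrap-around is just what happens when only the low address bits index the array. Here is a tiny C sketch of that mapping. It assumes a flat 1024-byte data array whose size is a power of two; the names DATA_BYTES and data_index are made up for illustration.

```c
#include <stdint.h>

#define DATA_BYTES 1024u /* assumed array size; must be a power of two */

/* Map a program address to an index into the data array by keeping only
 * the low-order bits, the way a Verilog array indexed by addr[9:0] would. */
uint32_t data_index(uint32_t addr)
{
    return addr & (DATA_BYTES - 1u);
}
```

So a store to 0x400 lands at index 0 and a store to 0x404 lands at index 4, with no explicit subtraction in the hardware.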

An important element of the memory map is that program binaries (executables) consist of more than just code and data. They in fact contain several “segments”. Some of these segments are code (on Unix systems the code segment has historically been called “.text”). Others do in fact contain data (thankfully named just “.data”). But another important segment is *read only data*, named something like “.rodata”. The gcc compiler puts things like strings in the .rodata segment, and by default on many systems (ARM included) .rodata gets appended next to code in the program address space. This is not what you want, since you want it to be up near your data so you can actually load from it. Thankfully the linker can do this for you. You just need to tell it the name of each segment you care about and where it should be. This is done with what is called a “linker script”. There’s a file in the testcode.tar file called “ld.script”. This is the file that tells the linker where to place your code, data, and rodata. If you want to use a different memory layout than my solution then you will need to hack the ld.script file.

Conversion to Verilog: The second step in using the GNU tools to program your processor is to get the binary into your Verilog. This is thankfully easy with Verilog, once you get the files formatted correctly. Verilog has a system task, $readmemh, that you can put in an initial block to read hex digits from a file. Here’s a nice link to the topic: https://timetoexplore.net/blog/initialize-memory-in-verilog The tricky bit is to get the object file data out. Here I provided a bash script to help you do that (elftohex.sh). This script takes in a program binary filename and produces 5 files: code.hex, data0.hex, data1.hex, data2.hex, and data3.hex.

PC / memory map / etc (Feb. 5)

I realized in the middle of the night I left out an important part of the memory map discussion. Processors need to start somewhere. We discussed this briefly at the start of the quarter, and it was supposed to be part of Monday’s lecture. But given its saliency I thought I’d email something out.

Processors start somewhere. Where they start is architecture dependent. Meaning, x86 processors start at a different location than ARM processors. Usually the same type of processor will start at the same address. With emphasis on the word *usually*. There are exceptions to every rule in computer engineering.

For the solution in lab 1 and lab 2 I just made my ARM processor start at location zero. This is done in the always block that updates the PC:

always @(posedge clk) begin
    if (!nreset)
        pc <= 32'd0;
    else begin
        // Good stuff here
    end
end

Now once you’ve chosen a start address your code has to conform to it. The testcode.tar file posted on the website will set up an executable, as part of the link stage, to start at location 0. It does this by meeting three requirements: (1) the linker is told to put the code (.text) segment at address 0; (2) the linker is linking for ARM Linux, and on such systems programs start at a label called “_start”, so there is a start.s file with a label at the top called _start: ; (3) the start.o object is put first on the command line of objects to link. The linker is straightforward in its operation and puts this object file first in the code segment. If you violate any of these requirements your test code will not function correctly.

What is _start? C programmers are usually told that programs start at a function called main(). But this is not true. Programs start at _start, inside of libc (or on some systems inside of a library called ld.so or object ld.o). This start section of the code does a few things, such as initialize the libc library itself. On Windows systems it may parse the command line string into the argument array that main() expects. On a Linux system the stack is already set up by the operating system. But on the processor you are building there is no operating system, so the start.s file I posted chooses an address for the stack, near the top of the data arrays in the lab 2 solution. It also sets up the frame pointer because gcc is using the frame pointer by default. After doing this it then invokes main() like it was a function. When main returns, ordinarily the _start block would invoke the exit() system call, telling the operating system to terminate the process. But again, since you have no OS, the _start block that I provided just jumps back to location 0 and effectively restarts the code. Data is unchanged, however, so you can actually write test code to detect that it is re-entering itself (if you want).

Testcode Update(Feb.5)

Been hacking on the lab 2 solution. Let me /strongly/ encourage everyone to be working hard on lab 2. It’s a lot of work to get all the little bits of the instruction semantics correct.

In the process of debugging my solution with C code I discovered something interesting that required an update to the testcode.tar file (just now updated on the website; if you downloaded it before, please re-download). ARM code, unless you are running on the newer ARM cores, expects read access to the code segment, because the compiler frequently stores constants there. For example, consider this C code:

int array[100] = { 4, 5, 6 };

int x = 0x12;
volatile unsigned char *debug_port;

int main() {
    int register i;
    debug_port = (unsigned char *) 0xffff0000;

    for (i = 0; i < 10; i++) {
        (*debug_port) = (unsigned char) i;
        array[i] = array[i] + i;
    }
}

This gets compiled into:

00000018 <main>:
  18: e59f303c  ldr  r3, [pc, #60]   ; 5c <main+0x44>
  1c: e59f203c  ldr  r2, [pc, #60]   ; 60 <main+0x48>
  20: e5832000  str  r2, [r3]
  24: e59f1038  ldr  r1, [pc, #56]   ; 64 <main+0x4c>
  28: e3a03000  mov  r3, #0, 0
  2c: e59fc028  ldr  ip, [pc, #40]   ; 5c <main+0x44>
  30: e59c2000  ldr  r2, [ip]
  34: e20300ff  and  r0, r3, #255, 0 ; 0xff
  38: e5c20000  strb r0, [r2]
  3c: e5b12004  ldr  r2, [r1, #4]!
  40: e0822003  add  r2, r2, r3
  44: e5812000  str  r2, [r1]
  48: e2833001  add  r3, r3, #1, 0
  4c: e353000a  cmp  r3, #10, 0
  50: 1afffff6  bne  30 <main+0x18>
  54: e3a00000  mov  r0, #0, 0
  58: e12fff1e  bx   lr
  5c: 0000119c  .word 0x0000119c
  60: ffff0000  .word 0xffff0000
  64: 00001000  .word 0x00001000

See the .words at the end of the function? That's the ARM way of getting a large constant into a register. Specifically this line of C code "debug_port = (unsigned char *) 0xffff0000;" is handled by this assembly instruction: "ldr r2, [pc, #60]" which is loading a data item from the *code* segment. This is a problem in the design I've been doing where the code and data are stored in separate Verilog arrays and both are essentially zero based.
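If you want to check where one of these pc-relative loads actually points, the arithmetic is short. The C sketch below handles only the simple "ldr Rd, [pc, #imm]" form seen in the listing; the function name ldr_literal_address is made up. It relies on the ARM convention that pc reads as the instruction's own address plus 8.

```c
#include <stdint.h>

/* For an "ldr Rd, [pc, #imm]" encoding, return the address it reads.
 * instr_addr is the address of the instruction itself. */
uint32_t ldr_literal_address(uint32_t instr_addr, uint32_t instr)
{
    uint32_t offset = instr & 0xFFFu;   /* 12-bit immediate offset */
    uint32_t up = (instr >> 23) & 1u;   /* U bit: 1 = add, 0 = subtract */
    uint32_t pc = instr_addr + 8u;      /* pc reads as instruction address + 8 */
    return up ? pc + offset : pc - offset;
}
```

For the instruction at 0x1c (e59f203c) this gives 0x1c + 8 + 60 = 0x60, which is the .word holding 0xffff0000.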

To solve this problem I decided in my design, and in the testcode.tar infrastructure I posted, to put a copy of the code in both the code array and the data array, and then stick the actual data up higher. Again, my solution supports both word and byte loads, so you may not need all these lines, but I initialize my arrays like this:

initial begin
    $readmemh("testcode/code.hex", code_mem);

    // Mirror code into data segment
    $readmemh("testcode/code_data0.hex", data0, 0);
    $readmemh("testcode/code_data1.hex", data1, 0);
    $readmemh("testcode/code_data2.hex", data2, 0);
    $readmemh("testcode/code_data3.hex", data3, 0);

    // Put the actual data up by 4096 bytes (1024 X 4)
    $readmemh("testcode/data0.hex", data0, 1024);
    $readmemh("testcode/data1.hex", data1, 1024);
    $readmemh("testcode/data2.hex", data2, 1024);
    $readmemh("testcode/data3.hex", data3, 1024);
end

The new testcode.tar file is set up to support a 4K code space and a 4K data space. The actual Verilog arrays are 4K for code and 8K for data, supporting the mirroring of the code space into the data space.
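To see how a byte address could land in one of the four data arrays, here is a C sketch of the mapping. It is only a guess at the layout: it assumes each of data0 to data3 is one byte wide and 2048 entries deep (8K total), that the low two address bits pick the array, and that the remaining bits pick the entry. The names LANE_DEPTH, data_slot_t, and locate_byte are made up.

```c
#include <stdint.h>

#define LANE_DEPTH 2048u /* assumed: 8K bytes split across 4 byte-wide arrays */

typedef struct {
    uint32_t lane;  /* which of data0..data3 (assumed: addr[1:0]) */
    uint32_t index; /* entry within that array (assumed: addr[12:2]) */
} data_slot_t;

/* Map a byte address to an array number and an entry within that array. */
data_slot_t locate_byte(uint32_t addr)
{
    data_slot_t s;
    s.lane = addr & 3u;
    s.index = (addr >> 2) & (LANE_DEPTH - 1u);
    return s;
}
```

Under these assumptions address 0x1000 maps to entry 1024, which lines up with a load offset of 1024 for data placed 4096 bytes up, and the literal at 0x5c maps to entry 23 of the mirrored code.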

I'll explain more in class on this, but this duplication isn't necessary in practice. In reality both the code and data are cached in separate arrays (caches) which ultimately are backed by main memory. But we're not implementing caches so we have this little bobble in the design. For lab 2, it's easy to fix this bobble by doing the PC instruction fetch in a separate cycle from the memory load. I almost did this. But in lab 3, when we pipeline the design, it's nice to be able to both fetch and load from memory at the same time. So I ultimately decided to just choose the duplicating route for now.

More FAQs(Feb.5)

Finally, I found the following useful for tracking down bugs in my design (code below my signature). I’m sending this out to give you some ideas on how you might debug your own designs. I expanded the test code to have 8 bytes of output. The solution to lab 2 is multi-cycle, to deal with the synchronous RAMs. What I did was tweak the output to the ports to output useful information about the result of the previous phase. I also altered these depending on what type of bug I was trying to fix. For example, when I had a bug in the memory system I altered what was displayed during the regwrite1 and regwrite2 phases to show the memory address and the result from memory (for a load). By tracking the processor data flow phase by phase and looking at the key outputs from the prior phase I was able to fix a lot of bugs :)

// Debug outputs
reg [31:0] debug_word;  // declared before the assigns that use it

assign debug_port1 = pc[7:0];
assign debug_port2 = {zero, last_phase, zero, phase };
assign debug_port3 = program_debug;
assign debug_port4 = debug_word[31:24];
assign debug_port5 = debug_word[23:16];
assign debug_port6 = debug_word[15:8];
assign debug_port7 = debug_word[7:0];

// Show the key result of the previous phase.
always @(*) begin
    case (phase)
        phase_fetch:     debug_word = pc;
        phase_regread:   debug_word = inst;
        phase_mem:       debug_word = alu_result;
        phase_regwrite1: debug_word = rf_wd1;
        phase_regwrite2: debug_word = rf_wd2;
        default:         debug_word = 32'd0;
    endcase
end

Lab 2 E-Mail follow ups from Professor (Updated Feb.5)

Announcements(FAQ) on Lab 2 Feb.2:
