Verify backend on Verilog simulator
So far we have an LLVM backend that compiles C or assembly, shown as the white part of Fig. 59. If the program has no global variables, the ELF object file can be dumped to a hex file with llvm-objdump -d, which was completed in Chapter ELF Support.
This chapter implements the Cpu0 instructions in Verilog, shown as the gray part of the same figure. With this Verilog machine, we can run the hex program on the Cpu0 Verilog simulator on a PC and see the results of executing the Cpu0 instructions.
Create Verilog simulator of Cpu0
Verilog is an IEEE standard language for IC design. Many books and documents cover it, and free material is available on the Web [1] [2] [3] [4] [5]. Verilog is also called Verilog HDL, which should not be confused with VHDL. VHDL is a language with the same purpose that competes with Verilog; see [6] for a VHDL reference.
The example code lbdex/verilog/cpu0.v is the Cpu0 design in Verilog. Appendix A describes how to download and install the Icarus Verilog tool on both iMac and Linux. cpu0.v is a simple design of only a few hundred lines of code. The implementation is not pipelined, but by simulating the delay slot (the SIMULATE_DELAY_SLOT part of the code), the exact machine cycles of a pipelined machine can be calculated.
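The delay slot simulation can be pictured with a small software model. The following C++ is only a sketch that mirrors the PCSet task and the Fetch state of cpu0.v; the names DelaySlotPc and traceFetches are hypothetical, and the branch target is passed as an absolute address for simplicity, whereas cpu0.v computes it relative to the PC.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model of the program-counter logic guarded by
// SIMULATE_DELAY_SLOT in cpu0.v; field names mirror the Verilog registers.
struct DelaySlotPc {
    uint32_t pc = 0;                   // `PC in cpu0.v
    bool nextInstIsDelaySlot = false;
    uint32_t delaySlotNextPC = 0;

    // Mirrors task PCSet: a taken branch does not change pc immediately.
    void pcSet(uint32_t target) {
        nextInstIsDelaySlot = true;
        delaySlotNextPC = target;
    }

    // Mirrors the Fetch state: returns the address being fetched (pc0)
    // and then advances pc, either sequentially or to the branch target.
    uint32_t fetch() {
        uint32_t pc0 = pc;
        if (nextInstIsDelaySlot) {
            nextInstIsDelaySlot = false;
            pc = delaySlotNextPC;
        } else {
            pc = pc + 4;
        }
        return pc0;
    }
};

// Fetch `count` instructions, treating the instruction at `branchAt` as a
// taken branch to `target`. Returns the fetched addresses in order.
std::vector<uint32_t> traceFetches(uint32_t branchAt, uint32_t target,
                                   int count) {
    assert(count >= 0);
    DelaySlotPc cpu;
    std::vector<uint32_t> fetched;
    for (int i = 0; i < count; ++i) {
        uint32_t pc0 = cpu.fetch();
        fetched.push_back(pc0);
        if (pc0 == branchAt) cpu.pcSet(target);  // Execute stage of the branch
    }
    return fetched;
}
```

Calling traceFetches(8, 100, 5) yields the addresses 0, 4, 8, 12, 100: the instruction at address 12, right after the branch, is still fetched before control reaches the target.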
Verilog has a C-like syntax, and this book is about compilers, so we list cpu0.v and its build commands below without explanation. We expect readers can understand the Verilog code with a little patience. Computer architectures provide two types of I/O: memory-mapped I/O and instruction (port-mapped) I/O. Cpu0 uses memory-mapped I/O, with the memory address `IOADDR serving as the output port (this text assumes 0x80000; the cpu0.v listing below defines `IOADDR as 'hff000000). When Cpu0 executes the instruction “st $ra, cx($rb)” and cx($rb) equals the output port address, it displays the content as follows,
ST : begin
...
if (R[b]+c16 == `IOADDR) begin
outw(R[a]);
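The same idea can be sketched in software. The C++ below is an illustrative model only, not part of lbdex: OutputPort, storeByte and printViaPort are hypothetical names, and kIoAddr simply copies the `IOADDR value from the cpu0.v listing. It models the byte store case (SB with outc), which is what print_char() relies on.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Assumed to match `IOADDR in the cpu0.v listing.
constexpr uint32_t kIoAddr = 0xff000000u;

// Hypothetical model of the memory-mapped output port.
struct OutputPort {
    std::string terminal;  // characters "displayed" so far

    // Model of a byte store: only a store to kIoAddr reaches the port.
    void storeByte(uint32_t addr, uint8_t value) {
        if (addr == kIoAddr) terminal.push_back(static_cast<char>(value));
    }
};

// Write every character of `text` to the port, then store one byte to an
// ordinary address to show that it is not displayed.
std::string printViaPort(const std::string& text) {
    OutputPort port;
    for (char c : text) port.storeByte(kIoAddr, static_cast<uint8_t>(c));
    port.storeByte(0x1000u, 'X');  // ordinary address: nothing is displayed
    return port.terminal;
}
```

With this model, printViaPort("68") returns "68"; the extra store to address 0x1000 leaves the terminal content unchanged.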
lbdex/verilog/cpu0.v
// https://www.francisz.cn/download/IEEE_Standard_1800-2012%20SystemVerilog.pdf
// configurable values below
`define SIMULATE_DELAY_SLOT
// cpu032I memory limit, jsub:24-bit
`define MEMSIZE 'h1000000
`define MEMEMPTY 8'hFF
`define NULL 8'h00
`define IOADDR 'hff000000 // IO mapping address
`define TIMEOUT #3000000000
// Operand width
`define INT32 2'b11 // 32 bits
`define INT24 2'b10 // 24 bits
`define INT16 2'b01 // 16 bits
`define BYTE 2'b00 // 8 bits
`define EXE 3'b000
`define RESET 3'b001
`define ABORT 3'b010
`define IRQ 3'b011
`define ERROR 3'b100
// Reference web: http://ccckmit.wikidot.com/ocs:cpu0
module cpu0(input clock, reset, input [2:0] itype, output reg [2:0] tick,
output reg [31:0] ir, pc, mar, mdr, inout [31:0] dbus,
output reg m_en, m_rw, output reg [1:0] m_size,
input cfg);
reg signed [31:0] R [0:15];
reg signed [31:0] C0R [0:1]; // co-processor 0 register
// High and Low part of 64 bit result
reg [7:0] op;
reg [3:0] a, b, c;
reg [4:0] c5;
reg signed [31:0] c12, c16, c24, Ra, Rb, Rc, pc0; // pc0: instruction pc
reg [31:0] uc16, URa, URb, URc, HI, LO, CF, tmp;
reg [63:0] cycles;
// register name
`define SP R[13] // Stack Pointer
`define LR R[14] // Link Register
`define SW R[15] // Status Word
// C0 register name
`define PC C0R[0] // Program Counter
`define EPC C0R[1] // exception PC value
// SW Flags
`define I2 `SW[16] // Hardware Interrupt 1, IO1 interrupt, status,
// 1: in interrupt
`define I1 `SW[15] // Hardware Interrupt 0, timer interrupt, status,
// 1: in interrupt
`define I0 `SW[14] // Software interrupt, status, 1: in interrupt
`define I `SW[13] // Interrupt, 1: in interrupt
`define I2E `SW[12] // Hardware Interrupt 1, IO1 interrupt, Enable
`define I1E `SW[11] // Hardware Interrupt 0, timer interrupt, Enable
`define I0E `SW[10] // Software Interrupt Enable
`define IE `SW[9] // Interrupt Enable
`define M `SW[8:6] // Mode bits, itype
`define D `SW[5] // Debug Trace
`define V `SW[3] // Overflow
`define C `SW[2] // Carry
`define Z `SW[1] // Zero
`define N `SW[0] // Negative flag
`define LE CF[0] // Endian bit, Big Endian:0, Little Endian:1
// Instruction Opcode
parameter [7:0] NOP=8'h00,LD=8'h01,ST=8'h02,LB=8'h03,LBu=8'h04,SB=8'h05,
LH=8'h06,LHu=8'h07,SH=8'h08,ADDiu=8'h09,MOVZ=8'h0A,MOVN=8'h0B,ANDi=8'h0C,
ORi=8'h0D,XORi=8'h0E,LUi=8'h0F,
ADDu=8'h11,SUBu=8'h12,ADD=8'h13,SUB=8'h14,CLZ=8'h15,CLO=8'h16,MUL=8'h17,
AND=8'h18,OR=8'h19,XOR=8'h1A,NOR=8'h1B,
ROL=8'h1C,ROR=8'h1D,SHL=8'h1E,SHR=8'h1F,
SRA=8'h20,SRAV=8'h21,SHLV=8'h22,SHRV=8'h23,ROLV=8'h24,RORV=8'h25,
`ifdef CPU0II
SLTi=8'h26,SLTiu=8'h27, SLT=8'h28,SLTu=8'h29,
`endif
CMP=8'h2A,
CMPu=8'h2B,
JEQ=8'h30,JNE=8'h31,JLT=8'h32,JGT=8'h33,JLE=8'h34,JGE=8'h35,
JMP=8'h36,
`ifdef CPU0II
BEQ=8'h37,BNE=8'h38,
`endif
JALR=8'h39,BAL=8'h3A,JSUB=8'h3B,RET=8'h3C,
MULT=8'h41,MULTu=8'h42,DIV=8'h43,DIVu=8'h44,
MFHI=8'h46,MFLO=8'h47,MTHI=8'h48,MTLO=8'h49,
MFC0=8'h50,MTC0=8'h51,C0MOV=8'h52;
reg [0:0] inExe = 0;
reg [2:0] state, next_state;
reg [2:0] st_taskInt, ns_taskInt;
parameter Reset=3'h0, Fetch=3'h1, Decode=3'h2, Execute=3'h3, MemAccess=3'h4,
WriteBack=3'h5;
integer i;
`ifdef SIMULATE_DELAY_SLOT
reg [0:0] nextInstIsDelaySlot;
reg [0:0] isDelaySlot;
reg signed [31:0] delaySlotNextPC;
`endif
//transform data from the memory to little-endian form
task changeEndian(input [31:0] value, output [31:0] changeEndian); begin
changeEndian = {value[7:0], value[15:8], value[23:16], value[31:24]};
end endtask
// Read Memory Word
task memReadStart(input [31:0] addr, input [1:0] size); begin
mar = addr; // read(m[addr])
m_rw = 1; // Access Mode: read
m_en = 1; // Enable read
m_size = size;
end endtask
// Read Memory Finish, get data
task memReadEnd(output [31:0] data); begin
mdr = dbus; // get memory, dbus = m[addr]
data = mdr; // return to data
m_en = 0; // read complete
end endtask
// Write memory -- addr: address to write, data: data to write
task memWriteStart(input [31:0] addr, input [31:0] data, input [1:0] size);
begin
mar = addr; // write(m[addr], data)
mdr = data;
m_rw = 0; // access mode: write
m_en = 1; // Enable write
m_size = size;
end endtask
task memWriteEnd; begin // Write Memory Finish
m_en = 0; // write complete
end endtask
task regSet(input [3:0] i, input [31:0] data); begin
if (i != 0) R[i] = data;
end endtask
task C0regSet(input [3:0] i, input [31:0] data); begin
if (i < 2) C0R[i] = data;
end endtask
task PCSet(input [31:0] data); begin
`ifdef SIMULATE_DELAY_SLOT
nextInstIsDelaySlot = 1;
delaySlotNextPC = data;
`else
`PC = data;
`endif
end endtask
task retValSet(input [3:0] i, input [31:0] data); begin
if (i != 0)
`ifdef SIMULATE_DELAY_SLOT
R[i] = data + 4;
`else
R[i] = data;
`endif
end endtask
task regHILOSet(input [31:0] data1, input [31:0] data2); begin
HI = data1;
LO = data2;
end endtask
// output a word to Output port (equal to display the word to terminal)
task outw(input [31:0] data); begin
if (`LE) begin // Little Endian
changeEndian(data, data);
end
if (data[7:0] != 8'h00) begin
$write("%c", data[7:0]);
if (data[15:8] != 8'h00)
$write("%c", data[15:8]);
if (data[23:16] != 8'h00)
$write("%c", data[23:16]);
if (data[31:24] != 8'h00)
$write("%c", data[31:24]);
end
end endtask
// output a character (a byte)
task outc(input [7:0] data); begin
$write("%c", data);
end endtask
task taskInterrupt(input [2:0] iMode); begin
if (inExe == 0) begin
case (iMode)
`RESET: begin
`PC = 0; tick = 0; R[0] = 0; `SW = 0; `LR = -1;
`IE = 0; `I0E = 1; `I1E = 1; `I2E = 1;
`I = 0; `I0 = 0; `I1 = 0; `I2 = 0; inExe = 1;
`LE = cfg;
cycles = 0;
end
`ABORT: begin `PC = 4; end
`IRQ: begin `PC = 8; `IE = 0; inExe = 1; end
`ERROR: begin `PC = 12; end
endcase
end
$display("taskInterrupt(%3b)", iMode);
end endtask
task taskExecute; begin
tick = tick+1;
case (state)
Fetch: begin // Tick 1 : instruction fetch, throw PC to address bus,
// memory.read(m[PC])
memReadStart(`PC, `INT32);
pc0 = `PC;
`ifdef SIMULATE_DELAY_SLOT
if (nextInstIsDelaySlot == 1) begin
isDelaySlot = 1;
nextInstIsDelaySlot = 0;
`PC = delaySlotNextPC;
end
else begin
if (isDelaySlot == 1) isDelaySlot = 0;
`PC = `PC+4;
end
`else
`PC = `PC+4;
`endif
next_state = Decode;
end
Decode: begin // Tick 2 : instruction decode, ir = m[PC]
memReadEnd(ir); // IR = dbus = m[PC]
{op,a,b,c} = ir[31:12];
c24 = $signed(ir[23:0]);
c16 = $signed(ir[15:0]);
uc16 = ir[15:0];
c12 = $signed(ir[11:0]);
c5 = ir[4:0];
Ra = R[a];
Rb = R[b];
Rc = R[c];
URa = R[a];
URb = R[b];
URc = R[c];
next_state = Execute;
end
Execute: begin // Tick 3 : instruction execution
case (op)
NOP: ;
// load and store instructions
LD: memReadStart(Rb+c16, `INT32); // LD Ra,[Rb+Cx]; Ra<=[Rb+Cx]
ST: memWriteStart(Rb+c16, Ra, `INT32); // ST Ra,[Rb+Cx]; Ra=>[Rb+Cx]
// LB Ra,[Rb+Cx]; Ra<=(byte)[Rb+Cx]
LB: memReadStart(Rb+c16, `BYTE);
// LBu Ra,[Rb+Cx]; Ra<=(byte)[Rb+Cx]
LBu: memReadStart(Rb+c16, `BYTE);
// SB Ra,[Rb+Cx]; Ra=>(byte)[Rb+Cx]
SB: memWriteStart(Rb+c16, Ra, `BYTE);
LH: memReadStart(Rb+c16, `INT16); // LH Ra,[Rb+Cx]; Ra<=(2bytes)[Rb+Cx]
LHu: memReadStart(Rb+c16, `INT16); // LHu Ra,[Rb+Cx]; Ra<=(2bytes)[Rb+Cx]
// SH Ra,[Rb+Cx]; Ra=>(2bytes)[Rb+Cx]
SH: memWriteStart(Rb+c16, Ra, `INT16);
// Conditional move
MOVZ: if (Rc==0) regSet(a, Rb); // move if Rc equal to 0
MOVN: if (Rc!=0) regSet(a, Rb); // move if Rc not equal to 0
// Mathematic
ADDiu: regSet(a, Rb+c16); // ADDiu Ra, Rb+Cx; Ra<=Rb+Cx
CMP: begin
if (Rb < Rc) `N=1; else `N=0;
// `N=(Rb-Rc<0); // why not work for bash make.sh cpu032I el Makefile.builtins?
`Z=(Rb-Rc==0);
end // CMP Rb, Rc; SW=(Rb >=< Rc)
CMPu: begin
if (URb < URc) `N=1; else `N=0;
`Z=(URb-URc==0);
end // CMPu URb, URc; SW=(URb >=< URc)
ADDu: regSet(a, Rb+Rc); // ADDu Ra,Rb,Rc; Ra<=Rb+Rc
ADD: begin regSet(a, Rb+Rc); if (a < Rb) `V = 1; else `V = 0;
if (`V) begin `I0 = 1; `I = 1; end
end
// ADD Ra,Rb,Rc; Ra<=Rb+Rc
SUBu: regSet(a, Rb-Rc); // SUBu Ra,Rb,Rc; Ra<=Rb-Rc
SUB: begin regSet(a, Rb-Rc); if (Rb < 0 && Rc > 0 && a >= 0)
`V = 1; else `V =0;
if (`V) begin `I0 = 1; `I = 1; end
end // SUB Ra,Rb,Rc; Ra<=Rb-Rc
CLZ: begin
for (i=0; (i<32)&&((Rb&32'h80000000)==32'h00000000); i=i+1) begin
Rb=Rb<<1;
end
regSet(a, i);
end
CLO: begin
for (i=0; (i<32)&&((Rb&32'h80000000)==32'h80000000); i=i+1) begin
Rb=Rb<<1;
end
regSet(a, i);
end
MUL: regSet(a, Rb*Rc); // MUL Ra,Rb,Rc; Ra<=Rb*Rc
DIVu: regHILOSet(URa%URb, URa/URb); // DIVu URa,URb; HI<=URa%URb;
// LO<=URa/URb
// without exception overflow
DIV: begin regHILOSet(Ra%Rb, Ra/Rb);
if ((Ra < 0 && Rb < 0) || (Ra == 0)) `V = 1;
else `V =0; end // DIV Ra,Rb; HI<=Ra%Rb; LO<=Ra/Rb; With overflow
AND: regSet(a, Rb&Rc); // AND Ra,Rb,Rc; Ra<=(Rb and Rc)
ANDi: regSet(a, Rb&uc16); // ANDi Ra,Rb,c16; Ra<=(Rb and c16)
OR: regSet(a, Rb|Rc); // OR Ra,Rb,Rc; Ra<=(Rb or Rc)
ORi: regSet(a, Rb|uc16); // ORi Ra,Rb,c16; Ra<=(Rb or c16)
XOR: regSet(a, Rb^Rc); // XOR Ra,Rb,Rc; Ra<=(Rb xor Rc)
NOR: regSet(a, ~(Rb|Rc)); // NOR Ra,Rb,Rc; Ra<=(Rb nor Rc)
XORi: regSet(a, Rb^uc16); // XORi Ra,Rb,c16; Ra<=(Rb xor c16)
LUi: regSet(a, uc16<<16);
SHL: regSet(a, Rb<<c5); // Shift Left; SHL Ra,Rb,Cx; Ra<=(Rb << Cx)
SRA: regSet(a, (Rb>>>c5)); // Shift Right with signed bit fill;
// https://stackoverflow.com/questions/39911655/how-to-synthesize-hardware-for-sra-instruction
SHR: regSet(a, Rb>>c5); // Shift Right with 0 fill;
// SHR Ra,Rb,Cx; Ra<=(Rb >> Cx)
SHLV: regSet(a, Rb<<Rc); // Shift Left; SHLV Ra,Rb,Rc; Ra<=(Rb << Rc)
SRAV: regSet(a, (Rb>>>Rc)); // Shift Right with signed bit fill;
SHRV: regSet(a, Rb>>Rc); // Shift Right with 0 fill;
// SHRV Ra,Rb,Rc; Ra<=(Rb >> Rc)
ROL: regSet(a, (Rb<<c5)|(Rb>>(32-c5))); // Rotate Left;
ROR: regSet(a, (Rb>>c5)|(Rb<<(32-c5))); // Rotate Right;
ROLV: begin // Can set Rc to -32<=Rc<=32 more efficiently.
while (Rc < -32) Rc=Rc+32;
while (Rc > 32) Rc=Rc-32;
regSet(a, (Rb<<Rc)|(Rb>>(32-Rc))); // Rotate Left;
end
RORV: begin
while (Rc < -32) Rc=Rc+32;
while (Rc > 32) Rc=Rc-32;
regSet(a, (Rb>>Rc)|(Rb<<(32-Rc))); // Rotate Right;
end
MFLO: regSet(a, LO); // MFLO Ra; Ra<=LO
MFHI: regSet(a, HI); // MFHI Ra; Ra<=HI
MTLO: LO = Ra; // MTLO Ra; LO<=Ra
MTHI: HI = Ra; // MTHI Ra; HI<=Ra
MULT: {HI, LO}=Ra*Rb; // MULT Ra,Rb; HI<=((Ra*Rb)>>32);
// LO<=((Ra*Rb) and 0x00000000ffffffff);
// with exception overflow
MULTu: {HI, LO}=URa*URb; // MULTu URa,URb; HI<=((URa*URb)>>32);
// LO<=((URa*URb) and 0x00000000ffffffff);
// without exception overflow
MFC0: regSet(a, C0R[b]); // MFC0 a, b; Ra<=C0R[Rb]
MTC0: C0regSet(a, Rb); // MTC0 a, b; C0R[a]<=Rb
C0MOV: C0regSet(a, C0R[b]); // C0MOV a, b; C0R[a]<=C0R[b]
`ifdef CPU0II
// set
SLT: if (Rb < Rc) R[a]=1; else R[a]=0;
SLTu: if (URb < URc) R[a]=1; else R[a]=0;
SLTi: if (Rb < c16) R[a]=1; else R[a]=0;
SLTiu: if (URb < uc16) R[a]=1; else R[a]=0;
// Branch Instructions
BEQ: if (Ra==Rb) PCSet(`PC+c16);
BNE: if (Ra!=Rb) PCSet(`PC+c16);
`endif
// Jump Instructions
JEQ: if (`Z) PCSet(`PC+c24); // JEQ Cx; if SW(=) PC PC+Cx
JNE: if (!`Z) PCSet(`PC+c24); // JNE Cx; if SW(!=) PC PC+Cx
JLT: if (`N) PCSet(`PC+c24); // JLT Cx; if SW(<) PC PC+Cx
JGT: if (!`N&&!`Z) PCSet(`PC+c24); // JGT Cx; if SW(>) PC PC+Cx
JLE: if (`N || `Z) PCSet(`PC+c24); // JLE Cx; if SW(<=) PC PC+Cx
JGE: if (!`N || `Z) PCSet(`PC+c24); // JGE Cx; if SW(>=) PC PC+Cx
JMP: `PC = `PC+c24; // JMP Cx; PC <= PC+Cx
JALR: begin retValSet(a, `PC); PCSet(Rb); end // JALR Ra,Rb; Ra<=PC; PC<=Rb
BAL: begin `LR = `PC; `PC = `PC+c24; end // BAL Cx; LR<=PC; PC<=PC+Cx
JSUB: begin retValSet(14, `PC); PCSet(`PC+c24); end // JSUB Cx; LR<=PC; PC<=PC+Cx
RET: begin PCSet(Ra); end // RET; PC <= Ra
default :
$display("%4dns %8x : OP code %8x not support", $stime, pc0, op);
endcase
if (`IE && `I && (`I0E && `I0 || `I1E && `I1 || `I2E && `I2)) begin
`EPC = `PC;
next_state = Fetch;
inExe = 0;
end else
next_state = MemAccess;
end
MemAccess: begin
case (op)
ST, SB, SH :
memWriteEnd(); // write memory complete
endcase
next_state = WriteBack;
end
WriteBack: begin // Read/Write finish, close memory
case (op)
LB, LBu :
memReadEnd(R[a]); //read memory complete
LH, LHu :
memReadEnd(R[a]);
LD : begin
memReadEnd(R[a]);
if (`D)
$display("%4dns %8x : %8x m[%-04x+%-04x]=%8x SW=%8x", $stime, pc0,
ir, R[b], c16, R[a], `SW);
end
endcase
case (op)
LB : begin
if (R[a] > 8'h7f) R[a]=R[a]|32'hffffff80;
end
LH : begin
if (R[a] > 16'h7fff) R[a]=R[a]|32'hffff8000;
end
endcase
case (op)
MULT, MULTu, DIV, DIVu, MTHI, MTLO :
if (`D)
$display("%4dns %8x : %8x HI=%8x LO=%8x SW=%8x", $stime, pc0, ir, HI,
LO, `SW);
ST : begin
if (`D)
$display("%4dns %8x : %8x m[%-04x+%-04x]=%8x SW=%8x", $stime, pc0,
ir, R[b], c16, R[a], `SW);
if (R[b]+c16 == `IOADDR) begin
outw(R[a]);
end
end
SB : begin
if (`D)
$display("%4dns %8x : %8x m[%-04x+%-04x]=%c SW=%8x, R[a]=%8x",
$stime, pc0, ir, R[b], c16, R[a][7:0], `SW, R[a]);
if (R[b]+c16 == `IOADDR) begin
if (`LE)
outc(R[a][7:0]);
else
outc(R[a][7:0]);
end
end
MFC0, MTC0 :
if (`D)
$display("%4dns %8x : %8x R[%02d]=%-8x C0R[%02d]=%-8x SW=%8x",
$stime, pc0, ir, a, R[a], a, C0R[a], `SW);
C0MOV :
if (`D)
$display("%4dns %8x : %8x C0R[%02d]=%-8x C0R[%02d]=%-8x SW=%8x",
$stime, pc0, ir, a, C0R[a], b, C0R[b], `SW);
default :
if (`D) // Display the written register content
$display("%4dns %8x : %8x R[%02d]=%-8x SW=%8x", $stime, pc0, ir,
a, R[a], `SW);
endcase
if (`PC < 0) begin
$display("total cpu cycles = %-d", cycles);
$display("RET to PC < 0, finished!");
$finish;
end
next_state = Fetch;
end
endcase
end endtask
always @(posedge clock) begin
if (inExe == 0 && (state == Fetch) && (`IE && `I) && (`I0E && `I0)) begin
// software int
`M = `IRQ;
taskInterrupt(`IRQ);
m_en = 0;
state = Fetch;
end else if (inExe == 0 && (state == Fetch) && (`IE && `I) &&
((`I1E && `I1) || (`I2E && `I2)) ) begin
`M = `IRQ;
taskInterrupt(`IRQ);
m_en = 0;
state = Fetch;
end else if (inExe == 0 && itype == `RESET) begin
// Condition itype == `RESET must after the other `IE condition
taskInterrupt(`RESET);
`M = `RESET;
state = Fetch;
end else begin
`ifdef TRACE
`D = 1; // Trace register content at beginning
`endif
taskExecute();
state = next_state;
end
pc = `PC;
cycles = cycles + 1;
end
endmodule
module memory0(input clock, reset, en, rw, input [1:0] m_size,
input [31:0] abus, dbus_in, output [31:0] dbus_out,
output cfg);
reg [31:0] mconfig [0:0];
reg [7:0] m [0:`MEMSIZE-1];
reg [31:0] data;
integer i;
`define LE mconfig[0][0:0] // Endian bit, Big Endian:0, Little Endian:1
initial begin
// erase memory
for (i=0; i < `MEMSIZE; i=i+1) begin
m[i] = `MEMEMPTY;
end
// load config from file to memory
$readmemh("cpu0.config", mconfig);
// load program from file to memory
$readmemh("cpu0.hex", m);
// display memory contents
`ifdef TRACE
for (i=0; i < `MEMSIZE && (m[i] != `MEMEMPTY || m[i+1] != `MEMEMPTY ||
m[i+2] != `MEMEMPTY || m[i+3] != `MEMEMPTY); i=i+4) begin
$display("%8x: %8x", i, {m[i], m[i+1], m[i+2], m[i+3]});
end
`endif
end
always @(clock or abus or en or rw or dbus_in)
begin
if (abus >= 0 && abus <= `MEMSIZE-4) begin
if (en == 1 && rw == 0) begin // r_w==0:write
data = dbus_in;
if (`LE) begin // Little Endian
case (m_size)
`BYTE: {m[abus]} = dbus_in[7:0];
`INT16: {m[abus], m[abus+1] } = {dbus_in[7:0], dbus_in[15:8]};
`INT24: {m[abus], m[abus+1], m[abus+2]} =
{dbus_in[7:0], dbus_in[15:8], dbus_in[23:16]};
`INT32: {m[abus], m[abus+1], m[abus+2], m[abus+3]} =
{dbus_in[7:0], dbus_in[15:8], dbus_in[23:16], dbus_in[31:24]};
endcase
end else begin // Big Endian
case (m_size)
`BYTE: {m[abus]} = dbus_in[7:0];
`INT16: {m[abus], m[abus+1] } = dbus_in[15:0];
`INT24: {m[abus], m[abus+1], m[abus+2]} = dbus_in[23:0];
`INT32: {m[abus], m[abus+1], m[abus+2], m[abus+3]} = dbus_in;
endcase
end
end else if (en == 1 && rw == 1) begin // r_w==1:read
if (`LE) begin // Little Endian
case (m_size)
`BYTE: data = {8'h00, 8'h00, 8'h00, m[abus]};
`INT16: data = {8'h00, 8'h00, m[abus+1], m[abus]};
`INT24: data = {8'h00, m[abus+2], m[abus+1], m[abus]};
`INT32: data = {m[abus+3], m[abus+2], m[abus+1], m[abus]};
endcase
end else begin // Big Endian
case (m_size)
`BYTE: data = {8'h00 , 8'h00, 8'h00, m[abus] };
`INT16: data = {8'h00 , 8'h00, m[abus], m[abus+1]};
`INT24: data = {8'h00 , m[abus], m[abus+1], m[abus+2]};
`INT32: data = {m[abus], m[abus+1], m[abus+2], m[abus+3]};
endcase
end
end else
data = 32'hZZZZZZZZ;
end else
data = 32'hZZZZZZZZ;
end
assign dbus_out = data;
assign cfg = mconfig[0][0:0];
endmodule
module main;
reg clock;
reg [2:0] itype;
wire [2:0] tick;
wire [31:0] pc, ir, mar, mdr, dbus;
wire m_en, m_rw;
wire [1:0] m_size;
wire cfg;
cpu0 cpu(.clock(clock), .itype(itype), .pc(pc), .tick(tick), .ir(ir),
.mar(mar), .mdr(mdr), .dbus(dbus), .m_en(m_en), .m_rw(m_rw), .m_size(m_size),
.cfg(cfg));
memory0 mem(.clock(clock), .reset(reset), .en(m_en), .rw(m_rw),
.m_size(m_size), .abus(mar), .dbus_in(mdr), .dbus_out(dbus), .cfg(cfg));
initial
begin
clock = 0;
itype = `RESET;
`TIMEOUT $finish;
end
always #10 clock=clock+1;
endmodule
lbdex/verilog/Makefile
#TRACE=-D TRACE
all:
iverilog ${TRACE} -o cpu0Is cpu0.v
iverilog ${TRACE} -D CPU0II -o cpu0IIs cpu0.v
.PHONY: clean
clean:
rm -rf cpu0.hex cpu0Is cpu0IIs
rm -f *~ cpu0.config
Since the Cpu0 Verilog machine supports both big and little endian, a wire (cfg) connects the memory module and the cpu module. The endian information is stored in the ROM of the memory module, which sends it to the CPU at start-up according to the following code,
lbdex/verilog/cpu0.v
assign cfg = mconfig[0][0:0];
...
wire cfg;
cpu0 cpu(.clock(clock), .itype(itype), .pc(pc), .tick(tick), .ir(ir),
.mar(mar), .mdr(mdr), .dbus(dbus), .m_en(m_en), .m_rw(m_rw), .m_size(m_size),
.cfg(cfg));
memory0 mem(.clock(clock), .reset(reset), .en(m_en), .rw(m_rw),
.m_size(m_size), .abus(mar), .dbus_in(mdr), .dbus_out(dbus), .cfg(cfg));
Instead of doing the endian conversion in the memory module, it could be done in the CPU module, with the memory module always returning big-endian data. I am not a professional engineer in FPGA/CPU hardware design, but according to the book “Computer Architecture: A Quantitative Approach”, some operations may leave no slack in the time of the execution stage. Any endian swap there would lengthen the clock cycle and hurt CPU performance, so I put the endian conversion in the memory module. In a system with a bus, I think it would be placed in the bus system.
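To make the byte ordering concrete, here is a minimal C++ sketch of what the memory0 module does for a 32-bit (`INT32) access, together with the byte reversal of task changeEndian. The function names storeWord and loadWord are hypothetical; only changeEndian corresponds to a name in cpu0.v.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Same byte reversal as task changeEndian in cpu0.v.
uint32_t changeEndian(uint32_t value) {
    return ((value & 0x000000ffu) << 24) | ((value & 0x0000ff00u) << 8) |
           ((value & 0x00ff0000u) >> 8)  | ((value & 0xff000000u) >> 24);
}

// Store a 32-bit word as 4 bytes, following the `INT32 write case:
// little endian puts the low byte at the lowest address, big endian the
// high byte.
std::array<uint8_t, 4> storeWord(uint32_t word, bool littleEndian) {
    std::array<uint8_t, 4> m{};
    for (int i = 0; i < 4; ++i) {
        int shift = littleEndian ? 8 * i : 8 * (3 - i);
        m[i] = static_cast<uint8_t>(word >> shift);
    }
    return m;
}

// Load the word back, following the `INT32 read case.
uint32_t loadWord(const std::array<uint8_t, 4>& m, bool littleEndian) {
    uint32_t word = 0;
    for (int i = 0; i < 4; ++i) {
        int shift = littleEndian ? 8 * i : 8 * (3 - i);
        word |= static_cast<uint32_t>(m[i]) << shift;
    }
    return word;
}
```

For the word 0x12345678, the little-endian layout is 78 56 34 12 and the big-endian layout is 12 34 56 78; reading the bytes back with the same setting returns the original word either way, which is why the CPU side does not need to care where the swap happens.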
Verify backend
Now let’s compile ch_run_backend.cpp as below. Code grows from low to high addresses while the stack grows from high to low addresses, so $sp is set to 0x7fffc on the assumption that cpu0.v uses 0x80000 bytes of memory.
lbdex/input/start.h
#ifndef _START_H_
#define _START_H_
#include "config.h"
#define SET_SW \
asm("andi $sw, $zero, 0"); \
asm("ori $sw, $sw, 0x1e00"); // enable all interrupts
#define initRegs() \
asm("addiu $1, $zero, 0"); \
asm("addiu $2, $zero, 0"); \
asm("addiu $3, $zero, 0"); \
asm("addiu $4, $zero, 0"); \
asm("addiu $5, $zero, 0"); \
asm("addiu $t9, $zero, 0"); \
asm("addiu $7, $zero, 0"); \
asm("addiu $8, $zero, 0"); \
asm("addiu $9, $zero, 0"); \
asm("addiu $10, $zero, 0"); \
SET_SW; \
asm("addiu $fp, $zero, 0");
#endif
lbdex/input/boot.cpp
#include "start.h"
// boot:
asm("boot:");
// asm("_start:");
asm("jmp 12"); // RESET: jmp RESET_START;
asm("jmp 4"); // ERROR: jmp ERR_HANDLE;
asm("jmp 4"); // IRQ: jmp IRQ_HANDLE;
asm("jmp -4"); // ERR_HANDLE: jmp ERR_HANDLE; (loop forever)
// RESET_START:
initRegs();
asm("addiu $gp, $ZERO, 0");
asm("addiu $lr, $ZERO, -1");
INIT_SP;
asm("mfc0 $3, $pc");
asm("addiu $3, $3, 0x8"); // Assume main() entry point is at the next next
// instruction.
asm("jr $3");
asm("nop");
lbdex/input/print.h
#ifndef _PRINT_H_
#define _PRINT_H_
#include "start.h"
void print_char(const char c);
void dump_mem(unsigned char *str, int n);
void print_string(const char *str);
void print_integer(int x);
#endif
lbdex/input/print.cpp
#include "print.h"
#include "itoa.cpp"
// For memory IO
void print_char(const char c)
{
char *p = (char*)IOADDR;
*p = c;
return;
}
void print_string(const char *str)
{
const char *p;
for (p = str; *p != '\0'; p++)
print_char(*p);
print_char(*p);
print_char('\n');
return;
}
// For memory IO
void print_integer(int x)
{
char str[INT_DIGITS + 2];
itoa(str, x);
print_string(str);
return;
}
lbdex/input/ch_nolld.h
#include "debug.h"
#include "boot.cpp"
#include "print.h"
int test_nolld();
lbdex/input/ch_nolld.cpp
#define TEST_ROXV
#define RUN_ON_VERILOG
#include "print.cpp"
#include "ch4_1_math.cpp"
#include "ch4_1_rotate.cpp"
#include "ch4_1_mult2.cpp"
#include "ch4_1_mod.cpp"
#include "ch4_1_div.cpp"
#include "ch4_2_logic.cpp"
#include "ch7_1_localpointer.cpp"
#include "ch7_1_char_short.cpp"
#include "ch7_1_bool.cpp"
#include "ch7_1_longlong.cpp"
#include "ch7_1_vector.cpp"
#include "ch8_1_ctrl.cpp"
#include "ch8_2_deluselessjmp.cpp"
#include "ch8_2_select.cpp"
#include "ch9_1_longlong.cpp"
#include "ch9_3_vararg.cpp"
#include "ch9_3_stacksave.cpp"
#include "ch9_3_bswap.cpp"
#include "ch9_3_alloc.cpp"
#include "ch11_2.cpp"
// Test build only for the following files on build-run_backend.sh since it
// needs lld linker support.
// Test in build-slink.sh
#include "ch6_1.cpp"
#include "ch9_1_struct.cpp"
#include "ch9_1_constructor.cpp"
#include "ch9_3_template.cpp"
#include "ch12_inherit.cpp"
void test_asm_build()
{
#include "ch11_1.cpp"
#ifdef CPU032II
#include "ch11_1_2.cpp"
#endif
}
int test_rotate()
{
int a = test_rotate_left1(); // rolv 4, 30 = 1
int b = test_rotate_left(); // rol 8, 30 = 2
int c = test_rotate_right(); // rorv 1, 30 = 4
return (a+b+c);
}
int test_nolld()
{
bool pass = true;
int a = 0;
a = test_math();
print_integer(a); // a = 68
if (a != 68) pass = false;
a = test_rotate();
print_integer(a); // a = 7
if (a != 7) pass = false;
a = test_mult();
print_integer(a); // a = 0
if (a != 0) pass = false;
a = test_mod();
print_integer(a); // a = 0
if (a != 0) pass = false;
a = test_div();
print_integer(a); // a = 253
if (a != 253) pass = false;
a = test_local_pointer();
print_integer(a); // a = 3
if (a != 3) pass = false;
a = (int)test_load_bool();
print_integer(a); // a = 1
if (a != 1) pass = false;
a = test_andorxornotcomplement();
print_integer(a); // a = 13
if (a != 13) pass = false;
a = test_setxx();
print_integer(a); // a = 3
if (a != 3) pass = false;
a = test_signed_char();
print_integer(a); // a = -126
if (a != -126) pass = false;
a = test_unsigned_char();
print_integer(a); // a = 130
if (a != 130) pass = false;
a = test_signed_short();
print_integer(a); // a = -32766
if (a != -32766) pass = false;
a = test_unsigned_short();
print_integer(a); // a = 32770
if (a != 32770) pass = false;
long long b = test_longlong();
print_integer((int)(b >> 32)); // 393307
if ((int)(b >> 32) != 393307) pass = false;
print_integer((int)b); // 16777218
if ((int)(b) != 16777218) pass = false;
a = test_cmplt_short();
print_integer(a); // a = -3
if (a != -3) pass = false;
a = test_cmplt_long();
print_integer(a); // a = -4
if (a != -4) pass = false;
a = test_control1();
print_integer(a); // a = 51
if (a != 51) pass = false;
a = test_DelUselessJMP();
print_integer(a); // a = 2
if (a != 2) pass = false;
a = test_movx_1();
print_integer(a); // a = 3
if (a != 3) pass = false;
a = test_movx_2();
print_integer(a); // a = 1
if (a != 1) pass = false;
print_integer(2147483647); // test mod % (mult) from itoa.cpp
print_integer(-2147483648); // test mod % (multu) from itoa.cpp
a = test_sum_longlong();
print_integer(a); // a = 9
if (a != 9) pass = false;
a = test_va_arg();
print_integer(a); // a = 12
if (a != 12) pass = false;
a = test_stacksaverestore(100);
print_integer(a); // a = 5
if (a != 5) pass = false;
a = test_bswap();
print_integer(a); // a = 0
if (a != 0) pass = false;
a = test_alloc();
print_integer(a); // a = 31
if (a != 31) pass = false;
a = test_inlineasm();
print_integer(a); // a = 49
if (a != 49) pass = false;
return pass;
}
lbdex/input/ch_run_backend.cpp
#include "ch_nolld.h"
int main()
{
bool pass = true;
pass = test_nolld();
return pass;
}
#include "ch_nolld.cpp"
lbdex/input/functions.sh
prologue() {
if [ $ARG_NUM == 0 ]; then
echo "usage: bash $sh_name cpu_type endian"
echo " cpu_type: cpu032I or cpu032II"
echo " endian: eb (big endian, default) or el (little endian)"
echo "for example:"
echo " bash build-slinker.sh cpu032I eb"
exit 1;
fi
if [ $CPU != cpu032I ] && [ $CPU != cpu032II ]; then
echo "1st argument is cpu032I or cpu032II"
exit 1
fi
OS=`uname -s`
echo "OS =" ${OS}
TOOLDIR=~/llvm/test/build/bin
CLANG=~/llvm/test/build/bin/clang
CPU=$CPU
echo "CPU =" "${CPU}"
if [ "$ENDIAN" != "" ] && [ $ENDIAN != el ] && [ $ENDIAN != eb ]; then
echo "2nd argument is eb (big endian, default) or el (little endian)"
exit 1
fi
if [ "$ENDIAN" == eb ]; then
ENDIAN=
fi
echo "ENDIAN =" "${ENDIAN}"
bash clean.sh
}
isLittleEndian() {
echo "ENDIAN = " "$ENDIAN"
if [ "$ENDIAN" == "LittleEndian" ] ; then
LE="true"
elif [ "$ENDIAN" == "BigEndian" ] ; then
LE="false"
else
echo "!ENDIAN unknown"
exit 1
fi
}
elf2hex() {
${TOOLDIR}/llvm-objdump -elf2hex -le=$LE a.out > ../verilog/cpu0.hex
if [ $LE == "true" ] ; then
echo "1 /* 0: big endian, 1: little endian */" > ../verilog/cpu0.config
else
echo "0 /* 0: big endian, 1: little endian */" > ../verilog/cpu0.config
fi
cat ../verilog/cpu0.config
}
epilogue() {
ENDIAN=`${TOOLDIR}/llvm-readobj -h a.out|grep "DataEncoding"|awk '{print $2}'`
isLittleEndian;
elf2hex;
}
lbdex/input/build-run_backend.sh
#!/usr/bin/env bash
# for example:
# bash build-run_backend.sh cpu032I el
# bash build-run_backend.sh cpu032II eb
source functions.sh
sh_name=build-run_backend.sh
ARG_NUM=$#
CPU=$1
ENDIAN=$2
DEFFLAGS=""
if [ "$CPU" == cpu032II ] ; then
DEFFLAGS=${DEFFLAGS}" -DCPU032II"
fi
echo ${DEFFLAGS}
prologue;
# ch8_2_select_global_pic.cpp just for compile build test only, without running
# on verilog.
$CLANG ${DEFFLAGS} -target mips-unknown-linux-gnu -c ch8_2_select_global_pic.cpp \
-emit-llvm -o ch8_2_select_global_pic.bc
${TOOLDIR}/llc -march=cpu0${ENDIAN} -mcpu=${CPU} -relocation-model=pic \
-filetype=obj ch8_2_select_global_pic.bc -o ch8_2_select_global_pic.cpu0.o
$CLANG ${DEFFLAGS} -target mips-unknown-linux-gnu -c ch_run_backend.cpp \
-emit-llvm -o ch_run_backend.bc
echo "${TOOLDIR}/llc -march=cpu0${ENDIAN} -mcpu=${CPU} -relocation-model=static \
-filetype=obj -enable-cpu0-tail-calls ch_run_backend.bc -o ch_run_backend.cpu0.o"
${TOOLDIR}/llc -march=cpu0${ENDIAN} -mcpu=${CPU} -relocation-model=static \
-filetype=obj -enable-cpu0-tail-calls ch_run_backend.bc -o ch_run_backend.cpu0.o
# print must be on a single line, otherwise the output will split into 2 lines
${TOOLDIR}/llvm-objdump --section=.text -d ch_run_backend.cpu0.o | tail -n +8| awk \
'{print "/* " $1 " */\t" $2 " " $3 " " $4 " " $5 "\t/* " $6"\t" $7" " $8" " $9" " $10 "\t*/"}' \
> ../verilog/cpu0.hex
ENDIAN=`${TOOLDIR}/llvm-readobj -h ch_run_backend.cpu0.o|grep "DataEncoding"|awk '{print $2}'`
isLittleEndian;
if [ $LE == "true" ] ; then
echo "1 /* 0: big ENDIAN, 1: little ENDIAN */" > ../verilog/cpu0.config
else
echo "0 /* 0: big ENDIAN, 1: little ENDIAN */" > ../verilog/cpu0.config
fi
cat ../verilog/cpu0.config
To run the program at this point, without a linker implementation, boot.cpp must be placed at the beginning of the code, and main() of ch_run_backend.cpp must come immediately after it.
Let’s run Chapter11_2/ with llvm-objdump -d on the input file ch_run_backend.cpp to generate the hex file via build-run_backend.sh, then feed the hex file to the cpu0Is Verilog simulator to get the output below.
Remember that ch_run_backend.cpp has to be compiled with the option clang -target mips-unknown-linux-gnu, since the example code ch9_3_vararg.cpp, which uses varargs, needs this option. For the other example codes, this option and the default option make no difference.
JonathantekiiMac:input Jonathan$ pwd
/Users/Jonathan/llvm/test/lbdex/input
JonathantekiiMac:input Jonathan$ bash build-run_backend.sh cpu032I eb
JonathantekiiMac:input Jonathan$ cd ../verilog
JonathantekiiMac:verilog Jonathan$ pwd
/Users/Jonathan/llvm/test/lbdex/verilog
JonathantekiiMac:verilog Jonathan$ make
JonathantekiiMac:verilog Jonathan$ ./cpu0Is
WARNING: cpu0Is.v:386: $readmemh(cpu0.hex): Not enough words in the file for the
taskInterrupt(001)
68
7
0
0
253
3
1
13
3
-126
130
-32766
32770
393307
16777218
3
4
51
2
3
1
2147483647
-2147483648
15
5
0
31
49
total cpu cycles = 50645
RET to PC < 0, finished!
JonathantekiiMac:input Jonathan$ bash build-run_backend.sh cpu032II eb
JonathantekiiMac:input Jonathan$ cd ../verilog
JonathantekiiMac:verilog Jonathan$ ./cpu0IIs
...
total cpu cycles = 48335
RET to PC < 0, finished!
The “total cpu cycles” is calculated by this Verilog simulator so that the performance of the backend compiler and the CPU can be reviewed. Only CPU cycles are counted in this implementation since the I/O cycle time is unknown. As explained in chapter “Control flow statements”, cpu032II, which uses the instructions slt and beq, performs better than cpu032I, which uses cmp and jeq. The instruction “jmp” has no delay slot, so it is the better choice for the dynamic linker implementation.
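The difference between the two runs above can be quantified with a few lines of arithmetic. The C++ below is just a sketch; the two constants are copied from the simulator output shown above, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Cycle counts copied from the two simulator runs above.
constexpr uint64_t kCyclesCpu032I  = 50645;
constexpr uint64_t kCyclesCpu032II = 48335;

// Absolute number of cycles saved by the improved run.
uint64_t cyclesSaved(uint64_t base, uint64_t improved) {
    assert(improved <= base);
    return base - improved;
}

// Saving relative to the base run, in percent.
double percentSaved(uint64_t base, uint64_t improved) {
    return 100.0 * static_cast<double>(cyclesSaved(base, improved)) /
           static_cast<double>(base);
}
```

For this test program, cyclesSaved(kCyclesCpu032I, kCyclesCpu032II) is 2310 cycles, which is a saving of roughly 4.6 percent. Other programs with a different mix of compare and branch instructions will give different numbers.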
You can trace the binary code in memory and the destination register changed by each executed instruction by uncommenting TRACE in the Makefile as below,
lbdex/verilog/Makefile
TRACE=-D TRACE
JonathantekiiMac:raw Jonathan$ ./cpu0Is
WARNING: cpu0.v:386: $readmemh(cpu0.hex): Not enough words in the file for the
requested range [0:28671].
00000000: 2600000c
00000004: 26000004
00000008: 26000004
0000000c: 26fffffc
00000010: 09100000
00000014: 09200000
...
taskInterrupt(001)
1530ns 00000054 : 02ed002c m[28620+44 ]=-1 SW=00000000
1610ns 00000058 : 02bd0028 m[28620+40 ]=0 SW=00000000
...
RET to PC < 0, finished!
As the result above shows, cpu0.v first dumps the memory after reading the input file cpu0.hex. Next, it runs the instructions from address 0 and prints each destination register value in the fourth column. The first column is the time in nanoseconds, the second is the instruction address, and the third is the instruction content. Most of the example codes in the previous chapters have now been verified by printing their variables with print_integer().
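The instruction content in the third column can be decoded by hand in the same way the Decode state of cpu0.v splits ir into op, a, b, c and c16. The following C++ sketch does that split; the struct Decoded and the function decode are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Fields extracted in the Decode state of cpu0.v.
struct Decoded {
    uint8_t op;   // ir[31:24]
    uint8_t a;    // ir[23:20]
    uint8_t b;    // ir[19:16]
    uint8_t c;    // ir[15:12]
    int32_t c16;  // ir[15:0], sign-extended
};

Decoded decode(uint32_t ir) {
    Decoded d;
    d.op = static_cast<uint8_t>(ir >> 24);
    d.a  = static_cast<uint8_t>((ir >> 20) & 0xf);
    d.b  = static_cast<uint8_t>((ir >> 16) & 0xf);
    d.c  = static_cast<uint8_t>((ir >> 12) & 0xf);
    // Sign-extend the 16-bit immediate without relying on narrowing casts.
    int32_t low16 = static_cast<int32_t>(ir & 0xffffu);
    d.c16 = (ir & 0x8000u) ? low16 - 0x10000 : low16;
    return d;
}
```

Applied to the first trace line, decode(0x02ed002c) gives op 0x02 (ST in the opcode table of cpu0.v), a = 14 (the link register), b = 13 (the stack pointer) and c16 = 44, which agrees with the “m[28620+44 ]=-1” shown in that line.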
Since the cpu0.v machine is written in Verilog, it should be able to run on a real FPGA device (though I have never done it). The real output hardware interface/port depends on the output device, such as RS232, a speaker, an LED, and so on. You have to implement the I/O interface/port when you program the FPGA and wire an I/O device to the I/O port. By running the compiled code on the Verilog simulator, the output of the Cpu0 backend is verified and the CPU cycles are calculated. Currently this Cpu0 Verilog program is not a pipelined architecture, but its instruction set allows it to be implemented as a pipelined model. The cycle count of a Cpu0 pipelined model would be more than 1/5 of the “total cpu cycles” displayed above, since dependences exist between instructions. Although the Verilog simulator is slow at running a whole-system program and does not count cache and I/O cycles, it is a simple and easy way to verify ideas about a CPU design at an early stage with small program patterns. A whole-system simulator is complex to create; even though the wiki web site here [7] includes tools for creating simulators, it takes a lot of effort.
To generate cpu032I little-endian code, run the following command. The file build-run_backend.sh writes the endian information to ../verilog/cpu0.config as below.
JonathantekiiMac:input Jonathan$ bash build-run_backend.sh cpu032I el
../verilog/cpu0.config
1 /* 0: big endian, 1: little endian */
The following files test more features.
lbdex/input/ch_nolld2.h
#include "debug.h"
#include "boot.cpp"
#include "print.h"
int test_nolld2();
lbdex/input/ch_nolld2.cpp
#include "print.cpp"
#include "ch9_3_alloc.cpp"
int test_nolld2()
{
bool pass = true;
int a = 0;
a = test_alloc();
print_integer(a); // a = 31
if (a != 31) pass = false;
return pass;
}
lbdex/input/ch_run_backend2.cpp
#include "ch_nolld2.h"
int main()
{
bool pass = true;
pass = test_nolld2();
return pass;
}
#include "ch_nolld2.cpp"
lbdex/input/build-run_backend2.sh
#!/usr/bin/env bash
source functions.sh
sh_name=build-run_backend2.sh
ARG_NUM=$#
CPU=$1
ENDIAN=$2
DEFFLAGS=""
if [ "$CPU" == cpu032II ] ; then
DEFFLAGS=${DEFFLAGS}" -DCPU032II"
fi
echo ${DEFFLAGS}
prologue;
${CLANG} ${DEFFLAGS} -c ch_run_backend2.cpp \
-emit-llvm -o ch_run_backend2.bc
${TOOLDIR}/llc -march=cpu0${ENDIAN} -mcpu=${CPU} -relocation-model=static \
-filetype=obj ch_run_backend2.bc -o ch_run_backend2.cpu0.o
# print must be on a single line, otherwise the output will split into 2 lines
${TOOLDIR}/llvm-objdump -d ch_run_backend2.cpu0.o | tail -n +8| awk \
'{print "/* " $1 " */\t" $2 " " $3 " " $4 " " $5 "\t/* " $6"\t" $7" " $8" " $9" " $10 "\t*/"}' \
> ../verilog/cpu0.hex
ENDIAN=`${TOOLDIR}/llvm-readobj -h ch_run_backend2.cpu0.o|grep "DataEncoding"|awk '{print $2}'`
isLittleEndian;
if [ $LE == "true" ] ; then
echo "1 /* 0: big endian, 1: little endian */" > ../verilog/cpu0.config
else
echo "0 /* 0: big endian, 1: little endian */" > ../verilog/cpu0.config
fi
cat ../verilog/cpu0.config
JonathantekiiMac:input Jonathan$ bash build-run_backend2.sh cpu032II el
...
JonathantekiiMac:input Jonathan$ cd ../verilog
JonathantekiiMac:verilog Jonathan$ ./cpu0IIs
...
31
...
Other llvm based tools for Cpu0 processor
You can find the Cpu0 ELF linker implementation, based on lld, the official LLVM linker project, as well as elf2hex, which is modified from the llvm-objdump driver, at http://jonathan2251.github.io/lbt/index.html.