Verify backend on Verilog simulator

Until now, we have an llvm backend to compile C or assembly as the blue part of Fig. 33. If without global variable, the elf obj can be dumped to hex file via llvm-objdump -d which finished in Chapter ELF Support.

_images/18.png

Fig. 33 Cpu0 backend without linker

This chapter will implement Cpu0 instructions by Verilog language as the red part of Fig. 33. With this Verilog machine, we can run this hex program on the Cpu0 Verilog machine on PC and see the Cpu0 instructions execution result.

Create verilog simulator of Cpu0

Verilog language is an IEEE standard in IC design. There are a lot of books and documents for this language. Free documents exist in Web sites [1] [2] [3] [4] [5]. Verilog also called as Verilog HDL but not VHDL. VHDL is the same purpose language which compete against Verilog. About VHDL reference here [6]. Example code, lbdex/verilog/cpu0.v, is the Cpu0 design in Verilog. In Appendix A, we have downloaded and installed Icarus Verilog tool both on iMac and Linux. The cpu0.v is a simple design with only few hundreds lines of code totally. This implementation hasn’t the pipeline features, but through implement the delay slot simulation (SIMULATE_DELAY_SLOT part of code), the exact pipeline machine cycles can be calculated.

Verilog is a C like language in syntex and this book is a compiler book, so we list the cpu0.v as well as the building command without explanation as below. We expect readers can understand the Verilog code just with a little patience in reading it. There are two type of I/O according computer architecture. One is memory mapped I/O, the other is instruction I/O. Cpu0 uses memory mapped I/O where memory address 0x80000 as the output port. When meet the instruction “st $ra, cx($rb)”, where cx($rb) is 0x80000, Cpu0 displays the content as follows,

ST : begin
  ...
  if (R[b]+c16 == `IOADDR) begin
    outw(R[a]);

lbdex/verilog/cpu0.v

`define SIMULATE_DELAY_SLOT
`define MEMSIZE   'h80000
`define MEMEMPTY   8'hFF
`define NULL       8'h00
`define IOADDR    'h80000  // IO mapping address

// Operand width
`define INT32 2'b11     // 32 bits
`define INT24 2'b10     // 24 bits
`define INT16 2'b01     // 16 bits
`define BYTE  2'b00     // 8  bits

`define EXE 3'b000
`define RESET 3'b001
`define ABORT 3'b010
`define IRQ 3'b011
`define ERROR 3'b100

// Reference web: http://ccckmit.wikidot.com/ocs:cpu0
module cpu0(input clock, reset, input [2:0] itype, output reg [2:0] tick, 
            output reg [31:0] ir, pc, mar, mdr, inout [31:0] dbus, 
            output reg m_en, m_rw, output reg [1:0] m_size, 
            input cfg);
  reg signed [31:0] R [0:15];
  reg signed [31:0] C0R [0:1]; // co-processor 0 register
  // High and Low part of 64 bit result
  reg [7:0] op;
  reg [3:0] a, b, c;
  reg [4:0] c5;
  reg signed [31:0] c12, c16, c24, Ra, Rb, Rc, pc0; // pc0: instruction pc
  reg [31:0] uc16, URa, URb, URc, HI, LO, CF, tmp;
  reg [63:0] cycles;

  // register name
  `define SP   R[13]   // Stack Pointer
  `define LR   R[14]   // Link Register
  `define SW   R[15]   // Status Word

  // C0 register name
  `define PC   C0R[0]   // Program Counter
  `define EPC  C0R[1]  // exception PC value

  // SW Flage
  `define I2   `SW[16] // Hardware Interrupt 1, IO1 interrupt, status, 
                      // 1: in interrupt
  `define I1   `SW[15] // Hardware Interrupt 0, timer interrupt, status, 
                      // 1: in interrupt
  `define I0   `SW[14] // Software interrupt, status, 1: in interrupt
  `define I    `SW[13] // Interrupt, 1: in interrupt
  `define I2E  `SW[12]  // Hardware Interrupt 1, IO1 interrupt, Enable
  `define I1E  `SW[11]  // Hardware Interrupt 0, timer interrupt, Enable
  `define I0E  `SW[10]  // Software Interrupt Enable
  `define IE   `SW[9]  // Interrupt Enable
  `define M    `SW[8:6]  // Mode bits, itype
  `define D    `SW[5]  // Debug Trace
  `define V    `SW[3]  // Overflow
  `define C    `SW[2]  // Carry
  `define Z    `SW[1]  // Zero
  `define N    `SW[0]  // Negative flag
  
  `define LE   CF[0]  // Endian bit, Big Endian:0, Little Endian:1
  // Instruction Opcode 
  parameter [7:0] NOP=8'h00,LD=8'h01,ST=8'h02,LB=8'h03,LBu=8'h04,SB=8'h05,
  LH=8'h06,LHu=8'h07,SH=8'h08,ADDiu=8'h09,MOVZ=8'h0A,MOVN=8'h0B,ANDi=8'h0C,
  ORi=8'h0D,XORi=8'h0E,LUi=8'h0F,
  CMP=8'h10,
  ADDu=8'h11,SUBu=8'h12,ADD=8'h13,SUB=8'h14,CLZ=8'h15,CLO=8'h16,MUL=8'h17,
  AND=8'h18,OR=8'h19,XOR=8'h1A,
  ROL=8'h1B,ROR=8'h1C,SRA=8'h1D,SHL=8'h1E,SHR=8'h1F,
  SRAV=8'h20,SHLV=8'h21,SHRV=8'h22,ROLV=8'h23,RORV=8'h24,
`ifdef CPU0II
  SLTi=8'h26,SLTiu=8'h27, SLT=8'h28,SLTu=8'h29,
  BEQ=8'h37,BNE=8'h38,
`endif
  JEQ=8'h30,JNE=8'h31,JLT=8'h32,JGT=8'h33,JLE=8'h34,JGE=8'h35,
  JMP=8'h36,
  JALR=8'h39,BAL=8'h3A,JSUB=8'h3B,RET=8'h3C,
  MULT=8'h41,MULTu=8'h42,DIV=8'h43,DIVu=8'h44,
  MFHI=8'h46,MFLO=8'h47,MTHI=8'h48,MTLO=8'h49,
  MFC0=8'h50,MTC0=8'h51,C0MOV=8'h52;

  reg [0:0] inExe = 0;
  reg [2:0] state, next_state; 
  reg [2:0] st_taskInt, ns_taskInt; 
  parameter Reset=3'h0, Fetch=3'h1, Decode=3'h2, Execute=3'h3, MemAccess=3'h4, 
            WriteBack=3'h5;
  integer i;
`ifdef SIMULATE_DELAY_SLOT
  reg [0:0] nextInstIsDelaySlot;
  reg [0:0] isDelaySlot;
  reg signed [31:0] delaySlotNextPC;
`endif

  //transform data from the memory to little-endian form
  task changeEndian(input [31:0] value, output [31:0] changeEndian); begin
    changeEndian = {value[7:0], value[15:8], value[23:16], value[31:24]};
  end endtask

  // Read Memory Word
  task memReadStart(input [31:0] addr, input [1:0] size); begin 
    mar = addr;     // read(m[addr])
    m_rw = 1;     // Access Mode: read 
    m_en = 1;     // Enable read
    m_size = size;
  end endtask

  // Read Memory Finish, get data
  task memReadEnd(output [31:0] data); begin
    mdr = dbus; // get momory, dbus = m[addr]
    data = mdr; // return to data
    m_en = 0; // read complete
  end endtask

  // Write memory -- addr: address to write, data: date to write
  task memWriteStart(input [31:0] addr, input [31:0] data, input [1:0] size); 
  begin 
    mar = addr;    // write(m[addr], data)
    mdr = data;
    m_rw = 0;    // access mode: write
    m_en = 1;     // Enable write
    m_size  = size;
  end endtask

  task memWriteEnd; begin // Write Memory Finish
    m_en = 0; // write complete
  end endtask

  task regSet(input [3:0] i, input [31:0] data); begin
    if (i != 0) R[i] = data;
  end endtask

  task C0regSet(input [3:0] i, input [31:0] data); begin
    if (i < 2) C0R[i] = data;
  end endtask

  task PCSet(input [31:0] data); begin
  `ifdef SIMULATE_DELAY_SLOT
    nextInstIsDelaySlot = 1;
    delaySlotNextPC = data;
  `else
    `PC = data;
  `endif
  end endtask

  task retValSet(input [3:0] i, input [31:0] data); begin
    if (i != 0)
    `ifdef SIMULATE_DELAY_SLOT
      R[i] = data + 4;
    `else
      R[i] = data;
    `endif
  end endtask
  
  task regHILOSet(input [31:0] data1, input [31:0] data2); begin
    HI = data1;
    LO = data2;
  end endtask

  // output a word to Output port (equal to display the word to terminal)
  task outw(input [31:0] data); begin
    if (`LE) begin // Little Endian
      changeEndian(data, data);
    end 
    if (data[7:0] != 8'h00) begin
      $write("%c", data[7:0]);
      if (data[15:8] != 8'h00) 
        $write("%c", data[15:8]);
      if (data[23:16] != 8'h00) 
        $write("%c", data[23:16]);
      if (data[31:24] != 8'h00) 
        $write("%c", data[31:24]);
    end
  end endtask

  // output a character (a byte)
  task outc(input [7:0] data); begin
    $write("%c", data);
  end endtask

  task taskInterrupt(input [2:0] iMode); begin
  if (inExe == 0) begin
    case (iMode)
      `RESET: begin 
        `PC = 0; tick = 0; R[0] = 0; `SW = 0; `LR = -1;
        `IE = 0; `I0E = 1; `I1E = 1; `I2E = 1;
        `I = 0; `I0 = 0; `I1 = 0; `I2 = 0; inExe = 1;
        `LE = cfg;
        cycles = 0;
      end
      `ABORT: begin `PC = 4; end
      `IRQ:   begin `PC = 8; `IE = 0; inExe = 1; end
      `ERROR: begin `PC = 12; end
    endcase
  end
  $display("taskInterrupt(%3b)", iMode);
  end endtask

  task taskExecute; begin
    tick = tick+1;
    case (state)
    Fetch: begin  // Tick 1 : instruction fetch, throw PC to address bus, 
                  // memory.read(m[PC])
      memReadStart(`PC, `INT32);
      pc0  = `PC;
   `ifdef SIMULATE_DELAY_SLOT
     if (nextInstIsDelaySlot == 1) begin
       isDelaySlot = 1;
       nextInstIsDelaySlot = 0;
       `PC = delaySlotNextPC;
     end
     else begin
       if (isDelaySlot == 1) isDelaySlot = 0;
       `PC = `PC+4;
     end
   `else
     `PC = `PC+4;
   `endif
      next_state = Decode;
    end
    Decode: begin  // Tick 2 : instruction decode, ir = m[PC]
      memReadEnd(ir); // IR = dbus = m[PC]
      {op,a,b,c} = ir[31:12];
      c24 = $signed(ir[23:0]);
      c16 = $signed(ir[15:0]);
      uc16 = ir[15:0];
      c12 = $signed(ir[11:0]);
      c5  = ir[4:0];
      Ra = R[a];
      Rb = R[b];
      Rc = R[c];
      URa = R[a];
      URb = R[b];
      URc = R[c];
      next_state = Execute;
    end
    Execute: begin // Tick 3 : instruction execution
      case (op)
      NOP:   ;
      // load and store instructions
      LD:    memReadStart(Rb+c16, `INT32);      // LD Ra,[Rb+Cx]; Ra<=[Rb+Cx]
      ST:    memWriteStart(Rb+c16, Ra, `INT32); // ST Ra,[Rb+Cx]; Ra=>[Rb+Cx]
      // LB Ra,[Rb+Cx]; Ra<=(byte)[Rb+Cx]
      LB:    memReadStart(Rb+c16, `BYTE);
      // LBu Ra,[Rb+Cx]; Ra<=(byte)[Rb+Cx]
      LBu:   memReadStart(Rb+c16, `BYTE);
      // SB Ra,[Rb+Cx]; Ra=>(byte)[Rb+Cx]
      SB:    memWriteStart(Rb+c16, Ra, `BYTE);
      LH:    memReadStart(Rb+c16, `INT16); // LH Ra,[Rb+Cx]; Ra<=(2bytes)[Rb+Cx]
      LHu:   memReadStart(Rb+c16, `INT16); // LHu Ra,[Rb+Cx]; Ra<=(2bytes)[Rb+Cx]
      // SH Ra,[Rb+Cx]; Ra=>(2bytes)[Rb+Cx]
      SH:    memWriteStart(Rb+c16, Ra, `INT16);
      // Conditional move
      MOVZ:  if (Rc==0) regSet(a, Rb);             // move if Rc equal to 0
      MOVN:  if (Rc!=0) regSet(a, Rb);             // move if Rc not equal to 0
      // Mathematic 
      ADDiu: regSet(a, Rb+c16);                   // ADDiu Ra, Rb+Cx; Ra<=Rb+Cx
      CMP:   begin `N=(Rb-Rc<0);`Z=(Rb-Rc==0); end // CMP Rb, Rc; SW=(Rb >=< Rc)
      ADDu:  regSet(a, Rb+Rc);               // ADDu Ra,Rb,Rc; Ra<=Rb+Rc
      ADD:   begin regSet(a, Rb+Rc); if (a < Rb) `V = 1; else `V = 0; 
        if (`V) begin `I0 = 1; `I = 1; end
      end
                                             // ADD Ra,Rb,Rc; Ra<=Rb+Rc
      SUBu:  regSet(a, Rb-Rc);               // SUBu Ra,Rb,Rc; Ra<=Rb-Rc
      SUB:   begin regSet(a, Rb-Rc); if (Rb < 0 && Rc > 0 && a >= 0) 
             `V = 1; else `V =0; 
        if (`V) begin `I0 = 1; `I = 1; end
      end         // SUB Ra,Rb,Rc; Ra<=Rb-Rc
      CLZ:   begin
        for (i=0; (i<32)&&((Rb&32'h80000000)==32'h00000000); i=i+1) begin
            Rb=Rb<<1;
        end
        regSet(a, i);
      end
      CLO:   begin
        for (i=0; (i<32)&&((Rb&32'h80000000)==32'h80000000); i=i+1) begin
            Rb=Rb<<1;
        end
        regSet(a, i);
      end
      MUL:   regSet(a, Rb*Rc);               // MUL Ra,Rb,Rc;     Ra<=Rb*Rc
      DIVu:  regHILOSet(URa%URb, URa/URb);   // DIVu URa,URb; HI<=URa%URb; 
                                             // LO<=URa/URb
                                             // without exception overflow
      DIV:   begin regHILOSet(Ra%Rb, Ra/Rb); 
             if ((Ra < 0 && Rb < 0) || (Ra == 0)) `V = 1; 
             else `V =0; end  // DIV Ra,Rb; HI<=Ra%Rb; LO<=Ra/Rb; With overflow
      AND:   regSet(a, Rb&Rc);               // AND Ra,Rb,Rc; Ra<=(Rb and Rc)
      ANDi:  regSet(a, Rb&uc16);             // ANDi Ra,Rb,c16; Ra<=(Rb and c16)
      OR:    regSet(a, Rb|Rc);               // OR Ra,Rb,Rc; Ra<=(Rb or Rc)
      ORi:   regSet(a, Rb|uc16);             // ORi Ra,Rb,c16; Ra<=(Rb or c16)
      XOR:   regSet(a, Rb^Rc);               // XOR Ra,Rb,Rc; Ra<=(Rb xor Rc)
      XORi:  regSet(a, Rb^uc16);             // XORi Ra,Rb,c16; Ra<=(Rb xor c16)
      LUi:   regSet(a, uc16<<16);
      SHL:   regSet(a, Rb<<c5);     // Shift Left; SHL Ra,Rb,Cx; Ra<=(Rb << Cx)
      SRA:   regSet(a, (Rb&'h80000000)|(Rb>>c5)); 
                                // Shift Right with signed bit fill;
                                // SHR Ra,Rb,Cx; Ra<=(Rb&0x80000000)|(Rb>>Cx)
      SHR:   regSet(a, Rb>>c5);     // Shift Right with 0 fill; 
                                    // SHR Ra,Rb,Cx; Ra<=(Rb >> Cx)
      SHLV:  regSet(a, Rb<<Rc);     // Shift Left; SHLV Ra,Rb,Rc; Ra<=(Rb << Rc)
      SRAV:  regSet(a, (Rb&'h80000000)|(Rb>>Rc)); 
                                // Shift Right with signed bit fill;
                                // SHRV Ra,Rb,Rc; Ra<=(Rb&0x80000000)|(Rb>>Rc)
      SHRV:  regSet(a, Rb>>Rc);     // Shift Right with 0 fill; 
                                    // SHRV Ra,Rb,Rc; Ra<=(Rb >> Rc)
      ROL:   regSet(a, (Rb<<c5)|(Rb>>(32-c5)));     // Rotate Left;
      ROR:   regSet(a, (Rb>>c5)|(Rb<<(32-c5)));     // Rotate Right;
      ROLV:  begin // Can set Rc to -32<=Rc<=32 more efficently.
        while (Rc < -32) Rc=Rc+32;
        while (Rc > 32) Rc=Rc-32;
        regSet(a, (Rb<<Rc)|(Rb>>(32-Rc)));     // Rotate Left;
      end
      RORV:  begin 
        while (Rc < -32) Rc=Rc+32;
        while (Rc > 32) Rc=Rc-32;
        regSet(a, (Rb>>Rc)|(Rb<<(32-Rc)));     // Rotate Right;
      end
      MFLO:  regSet(a, LO);         // MFLO Ra; Ra<=LO
      MFHI:  regSet(a, HI);         // MFHI Ra; Ra<=HI
      MTLO:  LO = Ra;               // MTLO Ra; LO<=Ra
      MTHI:  HI = Ra;               // MTHI Ra; HI<=Ra
      MULT:  {HI, LO}=Ra*Rb;        // MULT Ra,Rb; HI<=((Ra*Rb)>>32); 
                                    // LO<=((Ra*Rb) and 0x00000000ffffffff);
                                    // with exception overflow
      MULTu: {HI, LO}=URa*URb;      // MULT URa,URb; HI<=((URa*URb)>>32); 
                                    // LO<=((URa*URb) and 0x00000000ffffffff);
                                    // without exception overflow
      MFC0:  regSet(a, C0R[b]);     // MFC0 a, b; Ra<=C0R[Rb]
      MTC0:  C0regSet(a, Rb);       // MTC0 a, b; C0R[a]<=Rb
      C0MOV: C0regSet(a, C0R[b]);   // C0MOV a, b; C0R[a]<=C0R[b]
   `ifdef CPU0II
      // set
      SLT:   if (Rb < Rc) R[a]=1; else R[a]=0;
      SLTu:  if (Rb < Rc) R[a]=1; else R[a]=0;
      SLTi:  if (Rb < c16) R[a]=1; else R[a]=0;
      SLTiu: if (Rb < c16) R[a]=1; else R[a]=0;
      // Branch Instructions
      BEQ:   if (Ra==Rb) PCSet(`PC+c16);
      BNE:   if (Ra!=Rb) PCSet(`PC+c16);
    `endif
      // Jump Instructions
      JEQ:   if (`Z) PCSet(`PC+c24);            // JEQ Cx; if SW(=) PC  PC+Cx
      JNE:   if (!`Z) PCSet(`PC+c24);           // JNE Cx; if SW(!=) PC PC+Cx
      JLT:   if (`N) PCSet(`PC+c24);            // JLT Cx; if SW(<) PC  PC+Cx
      JGT:   if (!`N&&!`Z) PCSet(`PC+c24);      // JGT Cx; if SW(>) PC  PC+Cx
      JLE:   if (`N || `Z) PCSet(`PC+c24);      // JLE Cx; if SW(<=) PC PC+Cx    
      JGE:   if (!`N || `Z) PCSet(`PC+c24);     // JGE Cx; if SW(>=) PC PC+Cx
      JMP:   `PC = `PC+c24;                     // JMP Cx; PC <= PC+Cx
      JALR:  begin retValSet(a, `PC); PCSet(Rb); end    // JALR Ra,Rb; Ra<=PC; PC<=Rb
      BAL:   begin `LR = `PC; `PC = `PC+c24; end // BAL Cx; LR<=PC; PC<=PC+Cx
      JSUB:  begin retValSet(14, `PC); PCSet(`PC+c24); end // JSUB Cx; LR<=PC; PC<=PC+Cx
      RET:   begin PCSet(Ra); end               // RET; PC <= Ra
      default : 
        $display("%4dns %8x : OP code %8x not support", $stime, pc0, op);
      endcase
      if (`IE && `I && (`I0E && `I0 || `I1E && `I1 || `I2E && `I2)) begin
        `EPC = `PC;
        next_state = Fetch;
        inExe = 0;
      end else
        next_state = MemAccess;
    end
    MemAccess: begin
      case (op)
      ST, SB, SH  :
        memWriteEnd();                // write memory complete
      endcase
      next_state = WriteBack;
    end
    WriteBack: begin // Read/Write finish, close memory
      case (op)
      LB, LBu  :
        memReadEnd(R[a]);        //read memory complete
      LH, LHu  :
        memReadEnd(R[a]);
      LD  : begin
        memReadEnd(R[a]);
        if (`D)
          $display("%4dns %8x : %8x m[%-04x+%-04x]=%8x  SW=%8x", $stime, pc0, 
                   ir, R[b], c16, R[a], `SW);
      end
      endcase
      case (op)
      LB  : begin 
        if (R[a] > 8'h7f) R[a]=R[a]|32'hffffff80;
      end
      LH  : begin 
        if (R[a] > 16'h7fff) R[a]=R[a]|32'hffff8000;
      end
      endcase
      case (op)
      MULT, MULTu, DIV, DIVu, MTHI, MTLO :
        if (`D)
          $display("%4dns %8x : %8x HI=%8x LO=%8x SW=%8x", $stime, pc0, ir, HI, 
                   LO, `SW);
      ST : begin
        if (`D)
          $display("%4dns %8x : %8x m[%-04x+%-04x]=%8x  SW=%8x", $stime, pc0, 
                   ir, R[b], c16, R[a], `SW);
        if (R[b]+c16 == `IOADDR) begin
          outw(R[a]);
        end
      end
      SB : begin
        if (`D)
          $display("%4dns %8x : %8x m[%-04x+%-04x]=%c  SW=%8x, R[a]=%8x", 
                   $stime, pc0, ir, R[b], c16, R[a][7:0], `SW, R[a]);
        if (R[b]+c16 == `IOADDR) begin
          if (`LE)
            outc(R[a][7:0]);
          else
            outc(R[a][7:0]);
        end
      end
      MFC0, MTC0 :
        if (`D)
          $display("%4dns %8x : %8x R[%02d]=%-8x  C0R[%02d]=%-8x SW=%8x", 
                   $stime, pc0, ir, a, R[a], a, C0R[a], `SW);
      C0MOV :
        if (`D)
          $display("%4dns %8x : %8x C0R[%02d]=%-8x C0R[%02d]=%-8x SW=%8x", 
                   $stime, pc0, ir, a, C0R[a], b, C0R[b], `SW);
      default :
        if (`D) // Display the written register content
          $display("%4dns %8x : %8x R[%02d]=%-8x SW=%8x", $stime, pc0, ir, 
                   a, R[a], `SW);
      endcase
      if (`PC < 0) begin
        $display("total cpu cycles = %-d", cycles);
        $display("RET to PC < 0, finished!");
        $finish;
      end
      next_state = Fetch;
    end
    endcase
  end endtask

  always @(posedge clock) begin
    if (inExe == 0 && (state == Fetch) && (`IE && `I) && (`I0E && `I0)) begin
    // software int
      `M = `IRQ;
      taskInterrupt(`IRQ);
      m_en = 0;
      state = Fetch;
    end else if (inExe == 0 && (state == Fetch) && (`IE && `I) && 
                 ((`I1E && `I1) || (`I2E && `I2)) ) begin
      `M = `IRQ;
      taskInterrupt(`IRQ);
      m_en = 0;
      state = Fetch;
    end else if (inExe == 0 && itype == `RESET) begin
    // Condition itype == `RESET must after the other `IE condition
      taskInterrupt(`RESET);
      `M = `RESET;
      state = Fetch;
    end else begin
    `ifdef TRACE
      `D = 1; // Trace register content at beginning
    `endif
      taskExecute();
      state = next_state;
    end
    pc = `PC;
    cycles = cycles + 1;
  end
endmodule

module memory0(input clock, reset, en, rw, input [1:0] m_size, 
               input [31:0] abus, dbus_in, output [31:0] dbus_out, 
               output cfg);
  reg [31:0] mconfig [0:0];
  reg [7:0] m [0:`MEMSIZE-1];
  reg [31:0] data;

  integer i;

  `define LE  mconfig[0][0:0]   // Endian bit, Big Endian:0, Little Endian:1

  initial begin
  // erase memory
    for (i=0; i < `MEMSIZE; i=i+1) begin
       m[i] = `MEMEMPTY;
    end
  // load config from file to memory
    $readmemh("cpu0.config", mconfig);
  // load program from file to memory
    $readmemh("cpu0.hex", m);
  // display memory contents
    `ifdef TRACE
      for (i=0; i < `MEMSIZE && (m[i] != `MEMEMPTY || m[i+1] != `MEMEMPTY || 
         m[i+2] != `MEMEMPTY || m[i+3] != `MEMEMPTY); i=i+4) begin
        $display("%8x: %8x", i, {m[i], m[i+1], m[i+2], m[i+3]});
      end
    `endif
  end

  always @(clock or abus or en or rw or dbus_in) 
  begin
    if (abus >= 0 && abus <= `MEMSIZE-4) begin
      if (en == 1 && rw == 0) begin // r_w==0:write
        data = dbus_in;
        if (`LE) begin // Little Endian
          case (m_size)
          `BYTE:  {m[abus]} = dbus_in[7:0];
          `INT16: {m[abus], m[abus+1] } = {dbus_in[7:0], dbus_in[15:8]};
          `INT24: {m[abus], m[abus+1], m[abus+2]} = 
                  {dbus_in[7:0], dbus_in[15:8], dbus_in[23:16]};
          `INT32: {m[abus], m[abus+1], m[abus+2], m[abus+3]} = 
                  {dbus_in[7:0], dbus_in[15:8], dbus_in[23:16], dbus_in[31:24]};
          endcase
        end else begin // Big Endian
          case (m_size)
          `BYTE:  {m[abus]} = dbus_in[7:0];
          `INT16: {m[abus], m[abus+1] } = dbus_in[15:0];
          `INT24: {m[abus], m[abus+1], m[abus+2]} = dbus_in[23:0];
          `INT32: {m[abus], m[abus+1], m[abus+2], m[abus+3]} = dbus_in;
          endcase
        end
      end else if (en == 1 && rw == 1) begin // r_w==1:read
        if (`LE) begin // Little Endian
          case (m_size)
          `BYTE:  data = {8'h00,     8'h00,     8'h00,     m[abus]};
          `INT16: data = {8'h00,     8'h00,     m[abus+1], m[abus]};
          `INT24: data = {8'h00,     m[abus+2], m[abus+1], m[abus]};
          `INT32: data = {m[abus+3], m[abus+2], m[abus+1], m[abus]};
          endcase
        end else begin // Big Endian
          case (m_size)
          `BYTE:  data = {8'h00  , 8'h00,     8'h00,     m[abus]  };
          `INT16: data = {8'h00  , 8'h00,     m[abus],   m[abus+1]};
          `INT24: data = {8'h00  , m[abus],   m[abus+1], m[abus+2]};
          `INT32: data = {m[abus], m[abus+1], m[abus+2], m[abus+3]};
          endcase
        end
      end else
        data = 32'hZZZZZZZZ;
    end else 
      data = 32'hZZZZZZZZ;
  end
  assign dbus_out = data;
  assign cfg = mconfig[0][0:0];
endmodule

module main;
  reg clock;
  reg [2:0] itype;
  wire [2:0] tick;
  wire [31:0] pc, ir, mar, mdr, dbus;
  wire m_en, m_rw;
  wire [1:0] m_size;
  wire cfg;

  cpu0 cpu(.clock(clock), .itype(itype), .pc(pc), .tick(tick), .ir(ir),
  .mar(mar), .mdr(mdr), .dbus(dbus), .m_en(m_en), .m_rw(m_rw), .m_size(m_size),
  .cfg(cfg));

  memory0 mem(.clock(clock), .reset(reset), .en(m_en), .rw(m_rw), 
  .m_size(m_size), .abus(mar), .dbus_in(mdr), .dbus_out(dbus), .cfg(cfg));

  initial
  begin
    clock = 0;
    itype = `RESET;
    #300000000 $finish;
  end

  always #10 clock=clock+1;

endmodule

lbdex/verilog/Makefile

#TRACE=-D TRACE
all:
	iverilog ${TRACE} -o cpu0Is cpu0.v 
	iverilog ${TRACE} -D CPU0II -o cpu0IIs cpu0.v 

.PHONY: clean
clean:
	rm -rf cpu0.hex cpu0Is cpu0IIs 
	rm -f *~ cpu0.config

Since Cpu0 Verilog machine supports both big and little endian, the memory and cpu module both have a wire connectting each other. The endian information stored in ROM of memory module, and memory module send the information when it is up according the following code,

lbdex/verilog/cpu0.v

assign cfg = mconfig[0][0:0];
...
wire cfg;

cpu0 cpu(.clock(clock), .itype(itype), .pc(pc), .tick(tick), .ir(ir),
.mar(mar), .mdr(mdr), .dbus(dbus), .m_en(m_en), .m_rw(m_rw), .m_size(m_size),
.cfg(cfg));

memory0 mem(.clock(clock), .reset(reset), .en(m_en), .rw(m_rw),
.m_size(m_size), .abus(mar), .dbus_in(mdr), .dbus_out(dbus), .cfg(cfg));

Instead of setting endian tranfer in memory module, the endian transfer can also be set in CPU module, and memory moudle always return with big endian. I am not an professional engineer in FPGA/CPU hardware design. But according book “Computer Architecture: A Quantitative Approach”, some operations may have no tolerance in time of execution stage. Any endian swap will make the clock cycle time longer and affect the CPU performance. So, I set the endian transfer in memory module. In system with bus, it will be set in bus system I think.

Verify backend

Now let’s compile ch_run_backend.cpp as below. Since code size grows up from low to high address and stack grows up from high to low address. $sp is set at 0x7fffc because assuming cpu0.v uses 0x80000 bytes of memory.

lbdex/input/start.h

#define SET_SW \
asm("andi $sw, $zero, 0"); \
asm("ori  $sw, $sw, 0x1e00"); // enable all interrupts

#define initRegs() \
  asm("addiu $1,  $zero, 0"); \
  asm("addiu $2,  $zero, 0"); \
  asm("addiu $3,  $zero, 0"); \
  asm("addiu $4,  $zero, 0"); \
  asm("addiu $5,  $zero, 0"); \
  asm("addiu $t9, $zero, 0"); \
  asm("addiu $7,  $zero, 0"); \
  asm("addiu $8,  $zero, 0"); \
  asm("addiu $9,  $zero, 0"); \
  asm("addiu $10, $zero, 0"); \
  SET_SW;                     \
  asm("addiu $fp, $zero, 0");

lbdex/input/boot.cpp


#include "start.h"

// boot:
  asm("boot:");
//  asm("_start:");
  asm("jmp 12"); // RESET: jmp RESET_START;
  asm("jmp 4");  // ERROR: jmp ERR_HANDLE;
  asm("jmp 4");  // IRQ: jmp IRQ_HANDLE;
  asm("jmp -4"); // ERR_HANDLE: jmp ERR_HANDLE; (loop forever)

// RESET_START:
  initRegs();
  asm("addiu $gp, $ZERO, 0");
  asm("addiu $lr, $ZERO, -1");
  
  asm("lui $sp, 0x7");
  asm("addiu $sp, $sp, 0xfffc");
  asm("mfc0 $3, $pc");
  asm("addiu $3, $3, 0x8"); // Assume main() entry point is at the next next 
                             // instruction.
  asm("jr $3");
  asm("nop");

lbdex/input/print.h

#ifndef _PRINT_H_
#define _PRINT_H_

#define OUT_MEM 0x80000

void print_char(const char c);
void dump_mem(unsigned char *str, int n);
void print_string(const char *str);
void print_integer(int x);
#endif

lbdex/input/print.cpp

#include "print.h"
#include "itoa.cpp"

// For memory IO
void print_char(const char c)
{
  char *p = (char*)OUT_MEM;
  *p = c;

  return;
}

void print_string(const char *str)
{
  const char *p;

  for (p = str; *p != '\0'; p++)
    print_char(*p);
  print_char(*p);
  print_char('\n');

  return;
}

// For memory IO
void print_integer(int x)
{
  char str[INT_DIGITS + 2];
  itoa(str, x);
  print_string(str);

  return;
}

lbdex/input/ch_nolld.h


#include "debug.h"
#include "boot.cpp"

#include "print.h"

int test_nolld();

lbdex/input/ch_nolld.cpp

#define TEST_ROXV
#define RUN_ON_VERILOG

#include "print.cpp"

#include "ch4_1_math.cpp"
#include "ch4_1_rotate.cpp"
#include "ch4_1_mult2.cpp"
#include "ch4_1_mod.cpp"
#include "ch4_1_div.cpp"
#include "ch4_2_logic.cpp"
#include "ch7_1_localpointer.cpp"
#include "ch7_1_char_short.cpp"
#include "ch7_1_bool.cpp"
#include "ch7_1_longlong.cpp"
#include "ch7_1_vector.cpp"
#include "ch8_1_ctrl.cpp"
#include "ch8_2_deluselessjmp.cpp"
#include "ch8_2_select.cpp"
#include "ch9_3_vararg.cpp"
#include "ch9_3_stacksave.cpp"
#include "ch9_3_bswap.cpp"
#include "ch9_3_alloc.cpp"
#include "ch11_2.cpp"

// Test build only for the following files on build-run_backend.sh since it 
// needs lld linker support.
// Test in build-slink.sh
#include "ch6_1.cpp"
#include "ch9_1_struct.cpp"
#include "ch9_1_constructor.cpp"
#include "ch9_3_template.cpp"
#include "ch12_inherit.cpp"

void test_asm_build()
{
  #include "ch11_1.cpp"
#ifdef CPU032II
  #include "ch11_1_2.cpp"
#endif
}

int test_rotate()
{
  int a = test_rotate_left1(); // rolv 4, 30 = 1
  int b = test_rotate_left(); // rol 8, 30  = 2
  int c = test_rotate_right(); // rorv 1, 30 = 4
  
  return (a+b+c);
}

int test_nolld()
{
  bool pass = true;
  int a = 0;

  a = test_math();
  print_integer(a);  // a = 74
  if (a != 74) pass = false;
  a = test_rotate();
  print_integer(a);  // a = 7
  if (a != 7) pass = false;
  a = test_mult();
  print_integer(a);  // a = 0
  if (a != 0) pass = false;
  a = test_mod();
  print_integer(a);  // a = 0
  if (a != 0) pass = false;
  a = test_div();
  print_integer(a);  // a = 253
  if (a != 253) pass = false;
  a = test_local_pointer();
  print_integer(a);  // a = 3
  if (a != 3) pass = false;
  a = (int)test_load_bool();
  print_integer(a);  // a = 1
  if (a != 1) pass = false;
  a = test_andorxornot();
  print_integer(a); // a = 14
  if (a != 14) pass = false;
  a = test_setxx();
  print_integer(a); // a = 3
  if (a != 3) pass = false;
  a = test_signed_char();
  print_integer(a); // a = -126
  if (a != -126) pass = false;
  a = test_unsigned_char();
  print_integer(a); // a = 130
  if (a != 130) pass = false;
  a = test_signed_short();
  print_integer(a); // a = -32766
  if (a != -32766) pass = false;
  a = test_unsigned_short();
  print_integer(a); // a = 32770
  if (a != 32770) pass = false;
  long long b = test_longlong();
  print_integer((int)(b >> 32)); // 393307
  if ((int)(b >> 32) != 393307) pass = false;
  print_integer((int)b); // 16777222
  if ((int)(b) != 16777222) pass = false;
  a = test_cmplt_short();
  print_integer(a); // a = 2
  if (a != 2) pass = false;
  a = test_cmplt_long();
  print_integer(a); // a = 4
  if (a != 4) pass = false;
  a = test_control1();
  print_integer(a);	// a = 51
  if (a != 51) pass = false;
  a = test_DelUselessJMP();
  print_integer(a); // a = 2
  if (a != 2) pass = false;
  a = test_movx_1();
  print_integer(a); // a = 3
  if (a != 3) pass = false;
  a = test_movx_2();
  print_integer(a); // a = 1
  if (a != 1) pass = false;
  print_integer(2147483647); // test mod % (mult) from itoa.cpp
  print_integer(-2147483648); // test mod % (multu) from itoa.cpp
  a = test_vararg();
  print_integer(a); // a = 15
  if (a != 15) pass = false;
  a = test_stacksaverestore(100);
  print_integer(a); // a = 5
  if (a != 5) pass = false;
  a = test_bswap();
  print_integer(a); // a = 0
  if (a != 0) pass = false;
  a = test_alloc();
  print_integer(a); // a = 31
  if (a != 31) pass = false;
  a = test_inlineasm();
  print_integer(a); // a = 49
  if (a != 49) pass = false;

  return pass;
}

lbdex/input/ch_run_backend.cpp


#include "ch_nolld.h"

int main()
{
  bool pass = true;
  pass = test_nolld();

  return pass;
}

#include "ch_nolld.cpp"

lbdex/input/functions.sh

prologue() {
  if [ $argNum == 0 ]; then
    echo "useage: bash $sh_name cpu_type endian"
    echo "  cpu_type: cpu032I or cpu032II"
    echo "  endian: be (big endian, default) or le (little endian)"
    echo "for example:"
    echo "  bash build-slinker.sh cpu032I be"
    exit 1;
  fi
  if [ $arg1 != cpu032I ] && [ $arg1 != cpu032II ]; then
    echo "1st argument is cpu032I or cpu032II"
    exit 1
  fi

  OS=`uname -s`
  echo "OS =" ${OS}

  if [ "$OS" == "Linux" ]; then
    TOOLDIR=~/llvm/test/cmake_debug_build/bin
  else
    TOOLDIR=~/llvm/test/cmake_debug_build/Debug/bin
  fi

  CPU=$arg1
  echo "CPU =" "${CPU}"

  if [ "$arg2" != "" ] && [ $arg2 != le ] && [ $arg2 != be ]; then
    echo "2nd argument is be (big endian, default) or le (little endian)"
    exit 1
  fi
  if [ "$arg2" == "" ] || [ $arg2 == be ]; then
    endian=
  else
    endian=el
  fi
  echo "endian =" "${endian}"

  bash clean.sh
}

isLittleEndian() {
  echo "endian = " "$endian"
  if [ "$endian" == "LittleEndian" ] ; then
    le="true"
  elif [ "$endian" == "BigEndian" ] ; then
    le="false"
  else
    echo "!endian unknown"
    exit 1
  fi
}

elf2hex() {
  ${TOOLDIR}/llvm-objdump -elf2hex -le=${le} a.out > ../verilog/cpu0.hex
  if [ ${le} == "true" ] ; then
    echo "1   /* 0: big endian, 1: little endian */" > ../verilog/cpu0.config
  else
    echo "0   /* 0: big endian, 1: little endian */" > ../verilog/cpu0.config
  fi
  cat ../verilog/cpu0.config
}

epilogue() {
  endian=`${TOOLDIR}/llvm-readobj -h a.out|grep "DataEncoding"|awk '{print $2}'`
  isLittleEndian;
  elf2hex;
}

lbdex/input/build-run_backend.sh

#!/usr/bin/env bash

source functions.sh

sh_name=build-run_backend.sh
argNum=$#
arg1=$1
arg2=$2

DEFFLAGS=""
if [ "$arg1" == cpu032II ] ; then
  DEFFLAGS=${DEFFLAGS}" -DCPU032II"
fi
echo ${DEFFLAGS}

prologue;

# ch8_2_select_global_pic.cpp just for compile build test only, without running 
# on verilog.
clang ${DEFFLAGS} -target mips-unknown-linux-gnu -c ch8_2_select_global_pic.cpp \
-emit-llvm -o ch8_2_select_global_pic.bc
${TOOLDIR}/llc -march=cpu0${endian} -mcpu=${CPU} -relocation-model=pic \
-filetype=obj ch8_2_select_global_pic.bc -o ch8_2_select_global_pic.cpu0.o

clang ${DEFFLAGS} -target mips-unknown-linux-gnu -c ch_run_backend.cpp \
-emit-llvm -o ch_run_backend.bc
${TOOLDIR}/llc -march=cpu0${endian} -mcpu=${CPU} -relocation-model=static \
-filetype=obj -enable-cpu0-tail-calls ch_run_backend.bc -o ch_run_backend.cpu0.o
# print must at the same line, otherwise it will spilt into 2 lines
${TOOLDIR}/llvm-objdump -d ch_run_backend.cpu0.o | tail -n +8| awk \
'{print "/* " $1 " */\t" $2 " " $3 " " $4 " " $5 "\t/* " $6"\t" $7" " $8" " $9" " $10 "\t*/"}' \
 > ../verilog/cpu0.hex

if [ "$arg2" == le ] ; then
  echo "1   /* 0: big endian, 1: little endian */" > ../verilog/cpu0.config
else
  echo "0   /* 0: big endian, 1: little endian */" > ../verilog/cpu0.config
fi
cat ../verilog/cpu0.config

To run program without linker implementation at this point, the boot.cpp must be set at the beginning of code, and the main() of ch_run_backend.cpp is immediately after it. Let’s run Chapter11_2/ with llvm-objdump -d for input file ch_run_backend.cpp to generate the hex file via build-run_bacekend.sh, then feed hex file to cpu0Is Verilog simulator to get the output result as below. Remind ch_run_backend.cpp have to be compiled with option clang -target mips-unknown-linux-gnu since the example code ch9_3_vararg.cpp which uses the vararg needs to be compiled with this option. Other example codes have no differences between this option and default option.

JonathantekiiMac:input Jonathan$ pwd
/Users/Jonathan/llvm/test/lbdex/input
JonathantekiiMac:input Jonathan$ bash build-run_backend.sh cpu032I be
JonathantekiiMac:input Jonathan$ cd ../verilog cd ../verilog
JonathantekiiMac:input Jonathan$ pwd
/Users/Jonathan/llvm/test/lbdex/verilog
JonathantekiiMac:verilog Jonathan$ make
JonathantekiiMac:verilog Jonathan$ ./cpu0Is
WARNING: cpu0Is.v:386: $readmemh(cpu0.hex): Not enough words in the file for the
taskInterrupt(001)
74
7
0
0
253
3
1
14
3
-126
130
-32766
32770
393307
16777222
2
4
51
2
2147483647
-2147483648
7
120
15
5
0
31
49
total cpu cycles = 50645
RET to PC < 0, finished!

JonathantekiiMac:input Jonathan$ bash build-run_backend.sh cpu032II be
JonathantekiiMac:input Jonathan$ cd ../verilog
JonathantekiiMac:verilog Jonathan$ ./cpu0IIs
...
total cpu cycles = 48335
RET to PC < 0, finished!

The “total cpu cycles” is calculated in this verilog simualtor so that the backend compiler and CPU performance can be reviewed. Only the CPU cycles are counted in this implemenation since I/O cycles time is unknown. As explained in chapter “Control flow statements”, cpu032II which uses instructions slt and beq has better performance than cmp and jeq in cpu032I. Instructions “jmp” has no delay slot so it is better used in dynamic linker implementation.

You can trace the memory binary code and destination register changed at every instruction execution by unmark TRACE in Makefile as below,

lbdex/verilog/Makefile

TRACE=-D TRACE
JonathantekiiMac:raw Jonathan$ ./cpu0Is
WARNING: cpu0.v:386: $readmemh(cpu0.hex): Not enough words in the file for the
requested range [0:28671].
00000000: 2600000c
00000004: 26000004
00000008: 26000004
0000000c: 26fffffc
00000010: 09100000
00000014: 09200000
...
taskInterrupt(001)
1530ns 00000054 : 02ed002c m[28620+44  ]=-1          SW=00000000
1610ns 00000058 : 02bd0028 m[28620+40  ]=0           SW=00000000
...
RET to PC < 0, finished!

As above result, cpu0.v dumps the memory first after reads input file cpu0.hex. Next, it runs instructions from address 0 and print each destination register value in the fourth column. The first column is the nano seconds of timing. The second is instruction address. The third is instruction content. Now, most example codes depicted in the previous chapters are verified by print the variable with print_integer().

Since the cpu0.v machine is created by Verilog language, suppose it can run on real FPGA device. The real output hardware interface/port is hardware output device dependent, such as RS232, speaker, LED, .... You should implement the I/O interface/port when you want to program FPGA and wire I/O device to the I/O port. Through running the compiled code on Verilog simulator, Cpu0 backend compiled result and CPU cycles are verified and calculated. Though the Verilog simulator is slow for running the whole system program and not include the cycles counting in cache and I/O, it is a simple and easy way to verify your idea about CPU design at begging stage with small program pattern. The overall system simulator is complex to create. Even wiki web site here [7] include tools for creating the simulator, it needs a lot of effort.

To generate cpu032I as well as little endian code, you can run with the following command. File build-run_backend.sh write the endian information to ../verilog/cpu0.config as below.

JonathantekiiMac:input Jonathan$ bash build-run_backend.sh cpu032I le

../verilog/cpu0.config

1   /* 0: big endian, 1: little endian */

The following files test more features.

lbdex/input/ch_nolld2.h


#include "debug.h"
#include "boot.cpp"

#include "print.h"

int test_nolld2();

lbdex/input/ch_nolld2.cpp


#include "print.cpp"

#include "ch9_3_alloc.cpp"

int test_nolld2()
{
  bool pass = true;
  int a = 0;

  a = test_alloc();
  print_integer(a);  // a = 31
  if (a != 31) pass = false;
  return pass;
}

lbdex/input/ch_run_backend2.cpp


#include "ch_nolld2.h"

int main()
{
  bool pass = true;
  pass = test_nolld2();

  return pass;
}

#include "ch_nolld2.cpp"

lbdex/input/build-run_backend2.sh

#!/usr/bin/env bash

source functions.sh

sh_name=build-run_backend.sh
argNum=$#
arg1=$1
arg2=$2

DEFFLAGS=""
if [ "$arg1" == cpu032II ] ; then
  DEFFLAGS=${DEFFLAGS}" -DCPU032II"
fi
echo ${DEFFLAGS}

prologue;

clang ${DEFFLAGS} -c ch_run_backend2.cpp \
-emit-llvm -o ch_run_backend2.bc
${TOOLDIR}/llc -march=cpu0${endian} -mcpu=${CPU} -relocation-model=static \
-filetype=obj ch_run_backend2.bc -o ch_run_backend2.cpu0.o
${TOOLDIR}/llvm-objdump -d ch_run_backend2.cpu0.o | tail -n +8| awk \
'{print "/* " $1 " */\t" $2 " " $3 " " $4 " " $5 "\t/* " $6"\t" $7" " $8" \
" $9" " $10 "\t*/"}' > ../verilog/cpu0.hex

if [ "$arg2" == le ] ; then
  echo "1   /* 0: big endian, 1: little endian */" > ../verilog/cpu0.config
else
  echo "0   /* 0: big endian, 1: little endian */" > ../verilog/cpu0.config
fi
cat ../verilog/cpu0.config
JonathantekiiMac:input Jonathan$ bash build-run_backend.sh cpu032II le
...
JonathantekiiMac:input Jonathan$ cd ../verilog
JonathantekiiMac:verilog Jonathan$ ./cpu0IIs
...
31
...