Control flow statements

This chapter illustrates the corresponding IR for control flow statements, such as “if else”, “while” and “for” loop statements in C, and how to translate these control flow statements of llvm IR into Cpu0 instructions in section I. In section Remove useless JMP, an optimization pass of control flow for backend is introduced. It’s a simple tutorial program to let readers know how to add a backend optimization pass and program it. section Conditional instruction, includes the Conditional Instructions Handling since clang will generate specific IRs, select and select_cc, to support the backend optimiation in control flow statement.

Pipeline architecture

The following figures are from book “Computer Architecture A Quantitative Approach Fourth Edition”.

_images/5_stage_pipeline.png

Fig. 30 5 stages of pipeline

  • IF: Instruction fetch cycle; ID: Instruction decode/register fetch cycle; EX: Execution/effective address cycle; MEM: Memory access; WB: Write-back cycle.

_images/super_pipeline.png

Fig. 31 Super pipeline

Multibanked Caches as Fig. 32 to Increase Cache Bandwidth.

_images/cache_banks.png

Fig. 32 Interleaved cache banks

Block size in cache L1 is usualy 16-256 bytes. Euipped with with multibanked Caches can provide super-pipeline as Fig. 31 and superscalar (multi-issues pipeline) archtectures for fetching (4*block-size/instruction-size) instructions in a cycle.

Control flow statement

Run ch8_1_1.cpp with clang will get result as follows,

lbdex/input/ch8_1_1.cpp

int test_ifctrl()
{
  unsigned int a = 0;
  
  if (a == 0) {
    a++; // a = 1
  }
  
  return a;
}
  ...
  %0 = load i32* %a, align 4
  %cmp = icmp eq i32 %0, 0
  br i1 %cmp, label %if.then, label %if.end

; <label>:3:                                      ; preds = %0
  %1 = load i32* %a, align 4
  %inc = add i32 %1, 1
  store i32 %inc, i32* %a, align 4
  br label %if.end
  ...

The “icmp ne” stands for integer compare NotEqual, “slt” stands for Set Less Than, “sle” stands for Set Less or Equal. Run version Chapter8_1/ with llc  -view-isel-dags or -debug option, you can see the if statement is translated into (br (brcond (%1, setcc(%2, Constant<c>, setne)), BasicBlock_02), BasicBlock_01). Ignore %1, then we will get the form (br (brcond (setcc(%2, Constant<c>, setne)), BasicBlock_02), BasicBlock_01). For explanation, listing the IR DAG as follows,

Optimized legalized selection DAG: BB#0 '_Z11test_ifctrlv:entry'
  SelectionDAG has 12 nodes:
    ...
    t5: i32,ch = load<Volatile LD4[%a]> t4, FrameIndex:i32<0>, undef:i32
        t16: i32 = setcc t5, Constant:i32<0>, setne:ch
      t11: ch = brcond t5:1, t16, BasicBlock:ch<if.end 0x10305a338>
    t13: ch = br t11, BasicBlock:ch<if.then 0x10305a288>

We want to translate them into Cpu0 instruction DAGs as follows,

addiu %3, ZERO, Constant<c>
cmp %2, %3
jne BasicBlock_02
jmp BasicBlock_01

For the last IR br, we translate unconditional branch (br BasicBlock_01) into jmp BasicBlock_01 by the following pattern definition,

lbdex/chapters/Chapter8_1/Cpu0InstrInfo.td

// Unconditional branch, such as JMP
let Predicates = [Ch8_1] in {
class UncondBranch<bits<8> op, string instr_asm>:
  FJ<op, (outs), (ins jmptarget:$addr),
             !strconcat(instr_asm, "\t$addr"), [(br bb:$addr)], IIBranch> {
  let isBranch = 1;
  let isTerminator = 1;
  let isBarrier = 1;
  let hasDelaySlot = 0;
}
}
...
def JMP     : UncondBranch<0x26, "jmp">;

The pattern [(br bb:$imm24)] in class UncondBranch is translated into jmp machine instruction. The translation for the pair Cpu0 instructions, cmp and jne, is not happened before this chapter. To solve this chained IR to machine instructions translation, we define the following pattern,

lbdex/chapters/Chapter8_1/Cpu0InstrInfo.td

// brcond patterns
multiclass BrcondPatsCmp<RegisterClass RC, Instruction JEQOp, Instruction JNEOp,
  Instruction JLTOp, Instruction JGTOp, Instruction JLEOp, Instruction JGEOp,
  Instruction CMPOp> {
...
def : Pat<(brcond (i32 (setne RC:$lhs, RC:$rhs)), bb:$dst),
          (JNEOp (CMPOp RC:$lhs, RC:$rhs), bb:$dst)>;
...
def : Pat<(brcond RC:$cond, bb:$dst),
          (JNEOp (CMPOp RC:$cond, ZEROReg), bb:$dst)>;
...
}

Since the aboved BrcondPats pattern uses RC (Register Class) as operand, the following ADDiu pattern defined in Chapter2 will generate instruction addiu before the instruction cmp for the first IR, setcc(%2, Constant<c>, setne), as above.

lbdex/chapters/Chapter2/Cpu0InstrInfo.td

// Small immediates
def : Pat<(i32 immSExt16:$in),
          (ADDiu ZERO, imm:$in)>;

The definition of BrcondPats supports setne, seteq, setlt, …, register operand compare and setult, setugt, …, for unsigned int type. In addition to seteq and setne, we define setueq and setune by refering Mips code, even though we don’t find how to generate setune IR from C language. We have tried to define unsigned int type, but clang still generates setne instead of setune. The order of Pattern Search is from the order of their appearing in context. The last pattern (brcond RC:$cond, bb:$dst) means branch to $dst if $cond != 0. So we set the corresponding translation to (JNEOp (CMPOp RC:$cond, ZEROReg), bb:$dst).

The CMP instruction will set the result to register SW, and then JNE check the condition based on SW status as Fig. 33. Since SW belongs to a different register class, it will be correct even an instruction is inserted between CMP and JNE as follows,

_images/11.png

Fig. 33 JNE (CMP $r2, $r3),

cmp %2, %3
addiu $r1, $r2, 3   // $r1 register never be allocated to $SW because in
                    //  class ArithLogicI, GPROut is the output register
                    //  class and the GPROut is defined without $SW in
                    //  Cpu0RegisterInforGPROutForOther.td
jne BasicBlock_02

The reserved registers setting by the following function code we defined before,

lbdex/chapters/Chapter3_1/Cpu0RegisterInfo.cpp

BitVector Cpu0RegisterInfo::
getReservedRegs(const MachineFunction &MF) const {
//@getReservedRegs body {
  static const uint16_t ReservedCPURegs[] = {
    Cpu0::ZERO, Cpu0::AT, Cpu0::SP, Cpu0::LR, /*Cpu0::SW, */Cpu0::PC
  };
  BitVector Reserved(getNumRegs());

  for (unsigned I = 0; I < array_lengthof(ReservedCPURegs); ++I)
    Reserved.set(ReservedCPURegs[I]);

  return Reserved;
}

Although the following definition in Cpu0RegisterInfo.td has no real effect in Reserved Registers, it’s better to comment the Reserved Registers in it for readability. Setting SW in both register classes CPURegs and SR to allow access SW by RISC instructions like andi, and allow programmers use traditional assembly instruction cmp. The copyPhysReg() is called when both DestReg and SrcReg are belonging to different Register Classes.

lbdex/chapters/Chapter2/Cpu0RegisterInfo.td

//===----------------------------------------------------------------------===//

def CPURegs : RegisterClass<"Cpu0", [i32], 32, (add
  // Reserved
  ZERO, AT, 
  // Return Values and Arguments
  V0, V1, A0, A1, 
  // Not preserved across procedure calls
  T9, T0, T1,
  // Callee save
  S0, S1,
  // Reserved
  GP, FP, 
  SP, LR, SW)>;

def SR     : RegisterClass<"Cpu0", [i32], 32, (add SW)>;

lbdex/chapters/Chapter2/Cpu0RegisterInfoGPROutForOther.td


//===----------------------------------------------------------------------===//
// Register Classes
//===----------------------------------------------------------------------===//

def GPROut : RegisterClass<"Cpu0", [i32], 32, (add (sub CPURegs, SW))>;

Chapter8_1/ include support for control flow statement. Run with it as well as the following llc option, you will get the obj file. Dump it’s content by gobjdump or hexdump after as follows,

  118-165-79-206:input Jonathan$ /Users/Jonathan/llvm/test/
  build/bin/llc -march=cpu0 -mcpu=cpu032I -relocation-model=pic
  -filetype=asm ch8_1_1.bc -o -
  ...
  ld  $4, 36($fp)
  cmp $sw, $4, $3
  jne $BB0_2
  jmp $BB0_1
$BB0_1:                                 # %if.then
  ld  $4, 36($fp)
  addiu $4, $4, 1
  st  $4, 36($fp)
$BB0_2:                                 # %if.end
  ld  $4, 32($fp)
  ...
118-165-79-206:input Jonathan$ /Users/Jonathan/llvm/test/
build/bin/llc -march=cpu0 -mcpu=cpu032I -relocation-model=pic
-filetype=obj ch8_1_1.bc -o ch8_1_1.cpu0.o

118-165-79-206:input Jonathan$ hexdump ch8_1_1.cpu0.o
    // jmp offset is 0x10=16 bytes which is correct
0000080 ...................................... 10 43 00 00
0000090 31 00 00 10 36 00 00 00 ..........................

The immediate value of jne (op 0x31) is 16; The offset between jne and $BB0_2 is 20 (5 words = 5*4 bytes). Suppose the jne address is X, then the label $BB0_2 is X+20. Cpu0’s instruction set is designed as a RISC CPU with 5 stages of pipeline just like 5 stages of Mips. Cpu0 do branch instruction execution at decode stage which like mips too. After the jne instruction fetched, the PC (Program Counter) is X+4 since cpu0 update PC at fetch stage. The $BB0_2 address is equal to PC+16 for the jne branch instruction execute at decode stage. List and explain this again as follows,

              // Fetch instruction stage for jne instruction. The fetch stage
              // can be divided into 2 cycles. First cycle fetch the
              // instruction. Second cycle adjust PC = PC+4.
  jne $BB0_2  // Do jne compare in decode stage. PC = X+4 at this stage.
              // When jne immediate value is 16, PC = PC+16. It will fetch
              //  X+20 which equal to label $BB0_2 instruction, ld $4, 32($sp).
  nop
$BB0_1:                                 # %if.then
  ld  $4, 36($fp)
  addiu $4, $4, 1
  st  $4, 36($fp)
$BB0_2:                                 # %if.end
  ld  $4, 32($fp)

If Cpu0 do “jne” in execution stage, then we should set PC=PC+12, offset of ($BB0_2, jne $BB02) – 8, in this example.

In reality, the conditional branch is important in performance of CPU design. According bench mark information, every 7 instructions will meet 1 branch instruction in average. The cpu032I spends 2 instructions in conditional branch, (jne(cmp…)), while cpu032II use one instruction (bne) as follws,

JonathantekiiMac:input Jonathan$ /Users/Jonathan/llvm/test/
build/bin/llc -march=cpu0 -mcpu=cpu032I -relocation-model=pic
-filetype=asm ch8_1_1.bc -o -
  ...
      cmp     $sw, $4, $3
      jne     $sw, $BB0_2
      jmp     $BB0_1
$BB0_1:

JonathantekiiMac:input Jonathan$ /Users/Jonathan/llvm/test/
build/bin/llc -march=cpu0 -mcpu=cpu032II -relocation-model=pic
-filetype=asm ch8_1_1.bc -o -
  ...
      bne     $4, $zero, $BB0_2
      jmp     $BB0_1
$BB0_1:

Beside brcond explained in this section, above code also include DAG opcode br_jt and label JumpTable which occurs during DAG translation for some kind of program.

The ch8_1_ctrl.cpp include “nest if” “for loop”, “while loop”, “continue”, “break” and “goto”. The ch8_1_br_jt.cpp is for br_jt and JumpTable test. The ch8_1_blockaddr.cpp is for blockaddress and indirectbr test. You can run with them if you like to test more.

List the control flow statements of C, IR, DAG and Cpu0 instructions as the following table.

Table 33 Control flow statements of C, IR, DAG and Cpu0 instructions

C

if, else, for, while, goto, switch, break

IR

(icmp + (eq, ne, sgt, sge, slt, sle)0 + br

DAG

(seteq, setne, setgt, setge, setlt, setle) + brcond,

(setueq, setune, setugt, setuge, setult, setule) + brcond

cpu032I

CMP + (JEQ, JNE, JGT, JGE, JLT, JLE)

cpu032II

(SLT, SLTu, SLTi, SLTiu) + (BEG, BNE)

Long branch support

As last section, cpu032II uses beq and bne to improve performance but the jump offset reduces from 24 bits to 16 bits. If program exists more than 16 bits, cpu032II will fail to generate code. Mips backend has solution and Cpu0 hire the solution from it.

To support long branch the following code added in Chapter8_1.

lbdex/chapters/Chapter8_2/CMakeLists.txt

  Cpu0LongBranch.cpp

lbdex/chapters/Chapter8_2/Cpu0.h

  FunctionPass *createCpu0LongBranchPass(Cpu0TargetMachine &TM);

lbdex/chapters/Chapter8_2/MCTargetDesc/Cpu0MCCodeEmitter.cpp

unsigned Cpu0MCCodeEmitter::
getJumpTargetOpValue(const MCInst &MI, unsigned OpNo,
                     SmallVectorImpl<MCFixup> &Fixups,
                     const MCSubtargetInfo &STI) const {
  if (Opcode == Cpu0::JMP || Opcode == Cpu0::BAL)
  ...
}

lbdex/chapters/Chapter8_2/Cpu0AsmPrinter.h

  bool isLongBranchPseudo(int Opcode) const;

lbdex/chapters/Chapter8_2/Cpu0AsmPrinter.cpp

//- emitInstruction() must exists or will have run time error.
void Cpu0AsmPrinter::emitInstruction(const MachineInstr *MI) {
    if (I->isPseudo() && !isLongBranchPseudo(I->getOpcode()))
  ...
}
bool Cpu0AsmPrinter::isLongBranchPseudo(int Opcode) const {
  return (Opcode == Cpu0::LONG_BRANCH_LUi
          || Opcode == Cpu0::LONG_BRANCH_ADDiu);
}

lbdex/chapters/Chapter8_2/Cpu0InstrInfo.h

  virtual unsigned getOppositeBranchOpc(unsigned Opc) const = 0;

lbdex/chapters/Chapter8_2/Cpu0InstrInfo.td

let Predicates = [Ch8_2] in {
// We need these two pseudo instructions to avoid offset calculation for long
// branches.  See the comment in file Cpu0LongBranch.cpp for detailed
// explanation.

// Expands to: lui $dst, %hi($tgt - $baltgt)
def LONG_BRANCH_LUi : Cpu0Pseudo<(outs GPROut:$dst),
  (ins jmptarget:$tgt, jmptarget:$baltgt), "", []>;

// Expands to: addiu $dst, $src, %lo($tgt - $baltgt)
def LONG_BRANCH_ADDiu : Cpu0Pseudo<(outs GPROut:$dst),
  (ins GPROut:$src, jmptarget:$tgt, jmptarget:$baltgt), "", []>;
}
let isBranch = 1, isTerminator = 1, isBarrier = 1,
    hasDelaySlot = 0, Defs = [LR] in
def BAL: FJ<0x3A, (outs), (ins jmptarget:$addr), "bal\t$addr", [], IIBranch>, 
             Requires<[HasSlt]>;

lbdex/chapters/Chapter8_2/Cpu0LongBranch.cpp

//===-- Cpu0LongBranch.cpp - Emit long branches ---------------------------===//
//
//                     The LLVM Compiler Infrastructure
//
// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.
//
//===----------------------------------------------------------------------===//
//
// This pass expands a branch or jump instruction into a long branch if its
// offset is too large to fit into its immediate field.
//
// FIXME: Fix pc-region jump instructions which cross 256MB segment boundaries.
//===----------------------------------------------------------------------===//

// In 9.0.0 rename MipsLongBranch.cpp to MipsBranchExpansion.cpp

#include "Cpu0.h"

#if CH >= CH8_2

#include "MCTargetDesc/Cpu0BaseInfo.h"
#include "Cpu0MachineFunction.h"
#include "Cpu0TargetMachine.h"
#include "llvm/ADT/Statistic.h"
#include "llvm/CodeGen/MachineFunctionPass.h"
#include "llvm/CodeGen/MachineInstrBuilder.h"
#include "llvm/CodeGen/TargetInstrInfo.h"
#include "llvm/CodeGen/TargetRegisterInfo.h"
#include "llvm/IR/Function.h"
#include "llvm/Support/CommandLine.h"
#include "llvm/Support/MathExtras.h"
#include "llvm/Target/TargetMachine.h"

using namespace llvm;

#define DEBUG_TYPE "cpu0-long-branch"

STATISTIC(LongBranches, "Number of long branches.");

static cl::opt<bool> ForceLongBranch(
  "force-cpu0-long-branch",
  cl::init(false),
  cl::desc("CPU0: Expand all branches to long format."),
  cl::Hidden);

namespace {
  typedef MachineBasicBlock::iterator Iter;
  typedef MachineBasicBlock::reverse_iterator ReverseIter;

  struct MBBInfo {
    uint64_t Size, Address;
    bool HasLongBranch;
    MachineInstr *Br;

    MBBInfo() : Size(0), HasLongBranch(false), Br(nullptr) {}
  };

  class Cpu0LongBranch : public MachineFunctionPass {

  public:
    static char ID;
    Cpu0LongBranch(TargetMachine &tm)
        : MachineFunctionPass(ID), TM(tm), IsPIC(TM.isPositionIndependent()),
          ABI(static_cast<const Cpu0TargetMachine &>(TM).getABI()) {}

    StringRef getPassName() const override {
      return "Cpu0 Long Branch";
    }

    bool runOnMachineFunction(MachineFunction &F) override;

  private:
    void splitMBB(MachineBasicBlock *MBB);
    void initMBBInfo();
    int64_t computeOffset(const MachineInstr *Br);
    void replaceBranch(MachineBasicBlock &MBB, Iter Br, const DebugLoc &DL,
                       MachineBasicBlock *MBBOpnd);
    void expandToLongBranch(MBBInfo &Info);

    const TargetMachine &TM;
    MachineFunction *MF;
    SmallVector<MBBInfo, 16> MBBInfos;
    bool IsPIC;
    Cpu0ABIInfo ABI;
    unsigned LongBranchSeqSize;
  };

  char Cpu0LongBranch::ID = 0;
} // end of anonymous namespace

/// createCpu0LongBranchPass - Returns a pass that converts branches to long
/// branches.
FunctionPass *llvm::createCpu0LongBranchPass(Cpu0TargetMachine &tm) {
  return new Cpu0LongBranch(tm);
}

/// Iterate over list of Br's operands and search for a MachineBasicBlock
/// operand.
static MachineBasicBlock *getTargetMBB(const MachineInstr &Br) {
  for (unsigned I = 0, E = Br.getDesc().getNumOperands(); I < E; ++I) {
    const MachineOperand &MO = Br.getOperand(I);

    if (MO.isMBB())
      return MO.getMBB();
  }

  llvm_unreachable("This instruction does not have an MBB operand.");
}

// Traverse the list of instructions backwards until a non-debug instruction is
// found or it reaches E.
static ReverseIter getNonDebugInstr(ReverseIter B, const ReverseIter &E) {
  for (; B != E; ++B)
    if (!B->isDebugValue())
      return B;

  return E;
}

// Split MBB if it has two direct jumps/branches.
void Cpu0LongBranch::splitMBB(MachineBasicBlock *MBB) {
  ReverseIter End = MBB->rend();
  ReverseIter LastBr = getNonDebugInstr(MBB->rbegin(), End);

  // Return if MBB has no branch instructions.
  if ((LastBr == End) ||
      (!LastBr->isConditionalBranch() && !LastBr->isUnconditionalBranch()))
    return;

  ReverseIter FirstBr = getNonDebugInstr(std::next(LastBr), End);

  // MBB has only one branch instruction if FirstBr is not a branch
  // instruction.
  if ((FirstBr == End) ||
      (!FirstBr->isConditionalBranch() && !FirstBr->isUnconditionalBranch()))
    return;

  assert(!FirstBr->isIndirectBranch() && "Unexpected indirect branch found.");

  // Create a new MBB. Move instructions in MBB to the newly created MBB.
  MachineBasicBlock *NewMBB =
    MF->CreateMachineBasicBlock(MBB->getBasicBlock());

  // Insert NewMBB and fix control flow.
  MachineBasicBlock *Tgt = getTargetMBB(*FirstBr);
  NewMBB->transferSuccessors(MBB);
  if (Tgt != getTargetMBB(*LastBr))
    NewMBB->removeSuccessor(Tgt, true);
  MBB->addSuccessor(NewMBB);
  MBB->addSuccessor(Tgt);
  MF->insert(std::next(MachineFunction::iterator(MBB)), NewMBB);

  NewMBB->splice(NewMBB->end(), MBB, LastBr.getReverse(), MBB->end());
}

// Fill MBBInfos.
void Cpu0LongBranch::initMBBInfo() {
  // Split the MBBs if they have two branches. Each basic block should have at
  // most one branch after this loop is executed.
  for (auto &MBB : *MF)
    splitMBB(&MBB);

  MF->RenumberBlocks();
  MBBInfos.clear();
  MBBInfos.resize(MF->size());

  const Cpu0InstrInfo *TII =
      static_cast<const Cpu0InstrInfo *>(MF->getSubtarget().getInstrInfo());
  for (unsigned I = 0, E = MBBInfos.size(); I < E; ++I) {
    MachineBasicBlock *MBB = MF->getBlockNumbered(I);

    // Compute size of MBB.
    for (MachineBasicBlock::instr_iterator MI = MBB->instr_begin();
         MI != MBB->instr_end(); ++MI)
      MBBInfos[I].Size += TII->GetInstSizeInBytes(*MI);

    // Search for MBB's branch instruction.
    ReverseIter End = MBB->rend();
    ReverseIter Br = getNonDebugInstr(MBB->rbegin(), End);

    if ((Br != End) && !Br->isIndirectBranch() &&
        (Br->isConditionalBranch() || (Br->isUnconditionalBranch() && IsPIC)))
      MBBInfos[I].Br = &(*Br.getReverse());
  }
}

// Compute offset of branch in number of bytes.
int64_t Cpu0LongBranch::computeOffset(const MachineInstr *Br) {
  int64_t Offset = 0;
  int ThisMBB = Br->getParent()->getNumber();
  int TargetMBB = getTargetMBB(*Br)->getNumber();

  // Compute offset of a forward branch.
  if (ThisMBB < TargetMBB) {
    for (int N = ThisMBB + 1; N < TargetMBB; ++N)
      Offset += MBBInfos[N].Size;

    return Offset + 4;
  }

  // Compute offset of a backward branch.
  for (int N = ThisMBB; N >= TargetMBB; --N)
    Offset += MBBInfos[N].Size;

  return -Offset + 4;
}

// Replace Br with a branch which has the opposite condition code and a
// MachineBasicBlock operand MBBOpnd.
void Cpu0LongBranch::replaceBranch(MachineBasicBlock &MBB, Iter Br,
                                   const DebugLoc &DL,
                                   MachineBasicBlock *MBBOpnd) {
  const Cpu0InstrInfo *TII = static_cast<const Cpu0InstrInfo *>(
      MBB.getParent()->getSubtarget().getInstrInfo());
  unsigned NewOpc = TII->getOppositeBranchOpc(Br->getOpcode());
  const MCInstrDesc &NewDesc = TII->get(NewOpc);

  MachineInstrBuilder MIB = BuildMI(MBB, Br, DL, NewDesc);

  for (unsigned I = 0, E = Br->getDesc().getNumOperands(); I < E; ++I) {
    MachineOperand &MO = Br->getOperand(I);

    if (!MO.isReg()) {
      assert(MO.isMBB() && "MBB operand expected.");
      break;
    }

    MIB.addReg(MO.getReg());
  }

  MIB.addMBB(MBBOpnd);

  if (Br->hasDelaySlot()) {
    // Bundle the instruction in the delay slot to the newly created branch
    // and erase the original branch.
    assert(Br->isBundledWithSucc());
    MachineBasicBlock::instr_iterator II = Br.getInstrIterator();
    MIBundleBuilder(&*MIB).append((++II)->removeFromBundle());
  }
  Br->eraseFromParent();
}

// Expand branch instructions to long branches.
// TODO: This function has to be fixed for beqz16 and bnez16, because it
// currently assumes that all branches have 16-bit offsets, and will produce
// wrong code if branches whose allowed offsets are [-128, -126, ..., 126]
// are present.
void Cpu0LongBranch::expandToLongBranch(MBBInfo &I) {
  MachineBasicBlock::iterator Pos;
  MachineBasicBlock *MBB = I.Br->getParent(), *TgtMBB = getTargetMBB(*I.Br);
  DebugLoc DL = I.Br->getDebugLoc();
  const BasicBlock *BB = MBB->getBasicBlock();
  MachineFunction::iterator FallThroughMBB = ++MachineFunction::iterator(MBB);
  MachineBasicBlock *LongBrMBB = MF->CreateMachineBasicBlock(BB);
  const Cpu0Subtarget &Subtarget =
      static_cast<const Cpu0Subtarget &>(MF->getSubtarget());
  const Cpu0InstrInfo *TII =
      static_cast<const Cpu0InstrInfo *>(Subtarget.getInstrInfo());

  MF->insert(FallThroughMBB, LongBrMBB);
  MBB->replaceSuccessor(TgtMBB, LongBrMBB);

  if (IsPIC) {
    MachineBasicBlock *BalTgtMBB = MF->CreateMachineBasicBlock(BB);
    MF->insert(FallThroughMBB, BalTgtMBB);
    LongBrMBB->addSuccessor(BalTgtMBB);
    BalTgtMBB->addSuccessor(TgtMBB);

    unsigned BalOp = Cpu0::BAL;

    // $longbr:
    //  addiu $sp, $sp, -8
    //  st $lr, 0($sp)
    //  lui $at, %hi($tgt - $baltgt)
    //  addiu $lr, $lr, %lo($tgt - $baltgt)
    //  bal $baltgt
    //  nop
    // $baltgt:
    //  addu $at, $lr, $at
    //  addiu $sp, $sp, 8
    //  ld $lr, 0($sp)
    //  jr $at
    //  nop
    // $fallthrough:
    //

    Pos = LongBrMBB->begin();

    BuildMI(*LongBrMBB, Pos, DL, TII->get(Cpu0::ADDiu), Cpu0::SP)
      .addReg(Cpu0::SP).addImm(-8);
    BuildMI(*LongBrMBB, Pos, DL, TII->get(Cpu0::ST)).addReg(Cpu0::LR)
      .addReg(Cpu0::SP).addImm(0);

    // LUi and ADDiu instructions create 32-bit offset of the target basic
    // block from the target of BAL instruction.  We cannot use immediate
    // value for this offset because it cannot be determined accurately when
    // the program has inline assembly statements.  We therefore use the
    // relocation expressions %hi($tgt-$baltgt) and %lo($tgt-$baltgt) which
    // are resolved during the fixup, so the values will always be correct.
    //
    // Since we cannot create %hi($tgt-$baltgt) and %lo($tgt-$baltgt)
    // expressions at this point (it is possible only at the MC layer),
    // we replace LUi and ADDiu with pseudo instructions
    // LONG_BRANCH_LUi and LONG_BRANCH_ADDiu, and add both basic
    // blocks as operands to these instructions.  When lowering these pseudo
    // instructions to LUi and ADDiu in the MC layer, we will create
    // %hi($tgt-$baltgt) and %lo($tgt-$baltgt) expressions and add them as
    // operands to lowered instructions.

    BuildMI(*LongBrMBB, Pos, DL, TII->get(Cpu0::LONG_BRANCH_LUi), Cpu0::AT)
      .addMBB(TgtMBB).addMBB(BalTgtMBB);
    BuildMI(*LongBrMBB, Pos, DL, TII->get(Cpu0::LONG_BRANCH_ADDiu), Cpu0::AT)
      .addReg(Cpu0::AT).addMBB(TgtMBB).addMBB(BalTgtMBB);
    MIBundleBuilder(*LongBrMBB, Pos)
        .append(BuildMI(*MF, DL, TII->get(BalOp)).addMBB(BalTgtMBB));

    Pos = BalTgtMBB->begin();

    BuildMI(*BalTgtMBB, Pos, DL, TII->get(Cpu0::ADDu), Cpu0::AT)
      .addReg(Cpu0::LR).addReg(Cpu0::AT);
    BuildMI(*BalTgtMBB, Pos, DL, TII->get(Cpu0::LD), Cpu0::LR)
      .addReg(Cpu0::SP).addImm(0);
    BuildMI(*BalTgtMBB, Pos, DL, TII->get(Cpu0::ADDiu), Cpu0::SP)
      .addReg(Cpu0::SP).addImm(8);

    MIBundleBuilder(*BalTgtMBB, Pos)
      .append(BuildMI(*MF, DL, TII->get(Cpu0::JR)).addReg(Cpu0::AT))
      .append(BuildMI(*MF, DL, TII->get(Cpu0::NOP)));

    assert(LongBrMBB->size() + BalTgtMBB->size() == LongBranchSeqSize);
  } else {
    // $longbr:
    //  jmp $tgt
    //  nop
    // $fallthrough:
    //
    Pos = LongBrMBB->begin();
    LongBrMBB->addSuccessor(TgtMBB);
    MIBundleBuilder(*LongBrMBB, Pos)
      .append(BuildMI(*MF, DL, TII->get(Cpu0::JMP)).addMBB(TgtMBB))
      .append(BuildMI(*MF, DL, TII->get(Cpu0::NOP)));

    assert(LongBrMBB->size() == LongBranchSeqSize);
  }

  if (I.Br->isUnconditionalBranch()) {
    // Change branch destination.
    assert(I.Br->getDesc().getNumOperands() == 1);
    I.Br->RemoveOperand(0);
    I.Br->addOperand(MachineOperand::CreateMBB(LongBrMBB));
  } else
    // Change branch destination and reverse condition.
    replaceBranch(*MBB, I.Br, DL, &*FallThroughMBB);
}

static void emitGPDisp(MachineFunction &F, const Cpu0InstrInfo *TII) {
  MachineBasicBlock &MBB = F.front();
  MachineBasicBlock::iterator I = MBB.begin();
  DebugLoc DL = MBB.findDebugLoc(MBB.begin());
  BuildMI(MBB, I, DL, TII->get(Cpu0::LUi), Cpu0::V0)
    .addExternalSymbol("_gp_disp", Cpu0II::MO_ABS_HI);
  BuildMI(MBB, I, DL, TII->get(Cpu0::ADDiu), Cpu0::V0)
    .addReg(Cpu0::V0).addExternalSymbol("_gp_disp", Cpu0II::MO_ABS_LO);
  MBB.removeLiveIn(Cpu0::V0);
}

bool Cpu0LongBranch::runOnMachineFunction(MachineFunction &F) {
  const Cpu0Subtarget &STI =
      static_cast<const Cpu0Subtarget &>(F.getSubtarget());
  const Cpu0InstrInfo *TII =
      static_cast<const Cpu0InstrInfo *>(STI.getInstrInfo());
  LongBranchSeqSize =
      !IsPIC ? 2 : 10;

  if (!STI.enableLongBranchPass())
    return false;
  if (IsPIC && static_cast<const Cpu0TargetMachine &>(TM).getABI().IsO32() &&
      F.getInfo<Cpu0FunctionInfo>()->globalBaseRegSet())
    emitGPDisp(F, TII);

  MF = &F;
  initMBBInfo();

  SmallVectorImpl<MBBInfo>::iterator I, E = MBBInfos.end();
  bool EverMadeChange = false, MadeChange = true;

  while (MadeChange) {
    MadeChange = false;

    for (I = MBBInfos.begin(); I != E; ++I) {
      // Skip if this MBB doesn't have a branch or the branch has already been
      // converted to a long branch.
      if (!I->Br || I->HasLongBranch)
        continue;

      int ShVal = 4;
      int64_t Offset = computeOffset(I->Br) / ShVal;

      // Check if offset fits into 16-bit immediate field of branches.
      if (!ForceLongBranch && isInt<16>(Offset))
        continue;

      I->HasLongBranch = true;
      I->Size += LongBranchSeqSize * 4;
      ++LongBranches;
      EverMadeChange = MadeChange = true;
    }
  }

  if (!EverMadeChange)
    return true;

  // Compute basic block addresses.
  if (IsPIC) {
    uint64_t Address = 0;

    for (I = MBBInfos.begin(); I != E; Address += I->Size, ++I)
      I->Address = Address;
  }

  // Do the expansion.
  for (I = MBBInfos.begin(); I != E; ++I)
    if (I->HasLongBranch)
      expandToLongBranch(*I);

  MF->RenumberBlocks();

  return true;
}

#endif //#if CH >= CH8_2

lbdex/chapters/Chapter8_2/Cpu0MCInstLower.h

  MCOperand createSub(MachineBasicBlock *BB1, MachineBasicBlock *BB2,
                      Cpu0MCExpr::Cpu0ExprKind Kind) const;
  void lowerLongBranchLUi(const MachineInstr *MI, MCInst &OutMI) const;
  void lowerLongBranchADDiu(const MachineInstr *MI, MCInst &OutMI,
                            int Opcode,
                            Cpu0MCExpr::Cpu0ExprKind Kind) const;
  bool lowerLongBranch(const MachineInstr *MI, MCInst &OutMI) const;

lbdex/chapters/Chapter8_2/Cpu0MCInstLower.cpp

MCOperand Cpu0MCInstLower::createSub(MachineBasicBlock *BB1,
                                     MachineBasicBlock *BB2,
                                     Cpu0MCExpr::Cpu0ExprKind Kind) const {
  const MCSymbolRefExpr *Sym1 = MCSymbolRefExpr::create(BB1->getSymbol(), *Ctx);
  const MCSymbolRefExpr *Sym2 = MCSymbolRefExpr::create(BB2->getSymbol(), *Ctx);
  const MCBinaryExpr *Sub = MCBinaryExpr::createSub(Sym1, Sym2, *Ctx);

  return MCOperand::createExpr(Cpu0MCExpr::create(Kind, Sub, *Ctx));
}

void Cpu0MCInstLower::
lowerLongBranchLUi(const MachineInstr *MI, MCInst &OutMI) const {
  OutMI.setOpcode(Cpu0::LUi);

  // Lower register operand.
  OutMI.addOperand(LowerOperand(MI->getOperand(0)));

  // Create %hi($tgt-$baltgt).
  OutMI.addOperand(createSub(MI->getOperand(1).getMBB(),
                             MI->getOperand(2).getMBB(),
                             Cpu0MCExpr::CEK_ABS_HI));
}

void Cpu0MCInstLower::
lowerLongBranchADDiu(const MachineInstr *MI, MCInst &OutMI, int Opcode,
                     Cpu0MCExpr::Cpu0ExprKind Kind) const {
  OutMI.setOpcode(Opcode);

  // Lower two register operands.
  for (unsigned I = 0, E = 2; I != E; ++I) {
    const MachineOperand &MO = MI->getOperand(I);
    OutMI.addOperand(LowerOperand(MO));
  }

  // Create %lo($tgt-$baltgt) or %hi($tgt-$baltgt).
  OutMI.addOperand(createSub(MI->getOperand(2).getMBB(),
                             MI->getOperand(3).getMBB(), Kind));
}

bool Cpu0MCInstLower::lowerLongBranch(const MachineInstr *MI,
                                      MCInst &OutMI) const {
  switch (MI->getOpcode()) {
  default:
    return false;
  case Cpu0::LONG_BRANCH_LUi:
    lowerLongBranchLUi(MI, OutMI);
    return true;
  case Cpu0::LONG_BRANCH_ADDiu:
    lowerLongBranchADDiu(MI, OutMI, Cpu0::ADDiu,
                         Cpu0MCExpr::CEK_ABS_LO);
    return true;
  }
}
void Cpu0MCInstLower::Lower(const MachineInstr *MI, MCInst &OutMI) const {
  if (lowerLongBranch(MI, OutMI))
    return;
  ...
}

lbdex/chapters/Chapter8_2/Cpu0SEInstrInfo.h

  unsigned getOppositeBranchOpc(unsigned Opc) const override;

lbdex/chapters/Chapter8_2/Cpu0SEInstrInfo.cpp

/// getOppositeBranchOpc - Return the inverse of the specified
/// opcode, e.g. turning BEQ to BNE.
unsigned Cpu0SEInstrInfo::getOppositeBranchOpc(unsigned Opc) const {
  switch (Opc) {
  default:           llvm_unreachable("Illegal opcode!");
  case Cpu0::BEQ:    return Cpu0::BNE;
  case Cpu0::BNE:    return Cpu0::BEQ;
  }
}

lbdex/chapters/Chapter8_2/Cpu0TargetMachine.cpp

  void addPreEmitPass() override;
// Implemented by targets that want to run passes immediately before
// machine code is emitted. return true if -print-machineinstrs should
// print out the code after the passes.
void Cpu0PassConfig::addPreEmitPass() {
  Cpu0TargetMachine &TM = getCpu0TargetMachine();
  addPass(createCpu0LongBranchPass(TM));
  return;
}

The code of Chapter8_2 will compile the following example as follows,

lbdex/input/ch8_2_longbranch.cpp

int test_longbranch()
{
  volatile int a = 2;
  volatile int b = 1;
  int result = 0;

  if (a < b)
    result = 1;
  return result;
}

118-165-78-10:input Jonathan$ ~/llvm/test/build/bin/llc
-march=cpu0 -mcpu=cpu032II -relocation-model=pic -filetype=asm
-force-cpu0-long-branch ch8_2_longbranch.bc -o -
  ...
        .text
        .section .mdebug.abiO32
        .previous
        .file "ch8_2_longbranch.bc"
        .globl        _Z15test_longbranchv
        .align        2
        .type _Z15test_longbranchv,@function
        .ent  _Z15test_longbranchv    # @_Z15test_longbranchv
_Z15test_longbranchv:
        .frame        $fp,16,$lr
        .mask         0x00001000,-4
        .set  noreorder
        .set  nomacro
# BB#0:
        addiu $sp, $sp, -16
        st    $fp, 12($sp)            # 4-byte Folded Spill
        move   $fp, $sp
        addiu $2, $zero, 1
        st    $2, 8($fp)
        addiu $3, $zero, 2
        st    $3, 4($fp)
        addiu $3, $zero, 0
        st    $3, 0($fp)
        ld    $3, 8($fp)
        ld    $4, 4($fp)
        slt   $3, $3, $4
        bne   $3, $zero, .LBB0_3
        nop
# BB#1:
        addiu $sp, $sp, -8
        st    $lr, 0($sp)
        lui   $1, %hi(.LBB0_4-.LBB0_2)
        addiu $1, $1, %lo(.LBB0_4-.LBB0_2)
        bal   .LBB0_2
.LBB0_2:
        addu  $1, $lr, $1
        ld    $lr, 0($sp)
        addiu $sp, $sp, 8
        jr    $1
        nop
.LBB0_3:
        st    $2, 0($fp)
.LBB0_4:
        ld    $2, 0($fp)
        move   $sp, $fp
        ld    $fp, 12($sp)            # 4-byte Folded Reload
        addiu $sp, $sp, 16
        ret   $lr
        nop
        .set  macro
        .set  reorder
        .end  _Z15test_longbranchv
$func_end0:
        .size _Z15test_longbranchv, ($func_end0)-_Z15test_longbranchv

Cpu0 backend Optimization: Remove useless JMP

LLVM uses functional pass both in code generation and optimization. Following the 3 tiers of compiler architecture, LLVM do most optimization in middle tier of LLVM IR, SSA form. Beyond middle tier optimization, there are opportunities in optimization which depend on backend features. The “fill delay slot” in Mips is an example of backend optimization used in pipeline RISC machine. You can port it from Mips if your backend is a pipeline RISC with delay slot. In this section, we apply the “delete useless jmp” in Cpu0 backend optimization. This algorithm is simple and effective to be a perfect tutorial in optimization. Through this example, you can understand how to add an optimization pass and coding your complicated optimization algorithm on your backend in real project.

Chapter8_2/ supports “delete useless jmp” optimization algorithm which add codes as follows,

lbdex/chapters/Chapter8_2/CMakeLists.txt

  Cpu0DelUselessJMP.cpp

lbdex/chapters/Chapter8_2/Cpu0.h

  FunctionPass *createCpu0DelJmpPass(Cpu0TargetMachine &TM);

lbdex/chapters/Chapter8_2/Cpu0TargetMachine.cpp

// Implemented by targets that want to run passes immediately before
// machine code is emitted. return true if -print-machineinstrs should
// print out the code after the passes.
void Cpu0PassConfig::addPreEmitPass() {
  Cpu0TargetMachine &TM = getCpu0TargetMachine();
  addPass(createCpu0DelJmpPass(TM));
}

lbdex/chapters/Chapter8_2/Cpu0DelUselessJMP.cpp

//===-- Cpu0DelUselessJMP.cpp - Cpu0 DelJmp -------------------------------===//
//
//                     The LLVM Compiler Infrastructure
//
// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.
//
//===----------------------------------------------------------------------===//
//
// Simple pass to fills delay slots with useful instructions.
//
//===----------------------------------------------------------------------===//

#include "Cpu0.h"
#if CH >= CH8_2

#include "Cpu0TargetMachine.h"
#include "llvm/CodeGen/MachineFunctionPass.h"
#include "llvm/CodeGen/TargetInstrInfo.h"
#include "llvm/Support/CommandLine.h"
#include "llvm/Target/TargetMachine.h"
#include "llvm/ADT/SmallSet.h"
#include "llvm/ADT/Statistic.h"

using namespace llvm;

#define DEBUG_TYPE "del-jmp"

STATISTIC(NumDelJmp, "Number of useless jmp deleted");

static cl::opt<bool> EnableDelJmp(
  "enable-cpu0-del-useless-jmp",
  cl::init(true),
  cl::desc("Delete useless jmp instructions: jmp 0."),
  cl::Hidden);

namespace {
  struct DelJmp : public MachineFunctionPass {
    static char ID;
    DelJmp(TargetMachine &tm)
      : MachineFunctionPass(ID) { }

    StringRef getPassName() const override {
      return "Cpu0 Del Useless jmp";
    }

    bool runOnMachineBasicBlock(MachineBasicBlock &MBB, MachineBasicBlock &MBBN);
    bool runOnMachineFunction(MachineFunction &F) override {
      bool Changed = false;
      if (EnableDelJmp) {
        MachineFunction::iterator FJ = F.begin();
        if (FJ != F.end())
          FJ++;
        if (FJ == F.end())
          return Changed;
        for (MachineFunction::iterator FI = F.begin(), FE = F.end();
             FJ != FE; ++FI, ++FJ)
          // In STL style, F.end() is the dummy BasicBlock() like '\0' in 
          //  C string. 
          // FJ is the next BasicBlock of FI; When FI range from F.begin() to 
          //  the PreviousBasicBlock of F.end() call runOnMachineBasicBlock().
          Changed |= runOnMachineBasicBlock(*FI, *FJ);
      }
      return Changed;
    }

  };
  char DelJmp::ID = 0;
} // end of anonymous namespace

bool DelJmp::
runOnMachineBasicBlock(MachineBasicBlock &MBB, MachineBasicBlock &MBBN) {
  bool Changed = false;

  MachineBasicBlock::iterator I = MBB.end();
  if (I != MBB.begin())
    I--;	// set I to the last instruction
  else
    return Changed;
    
  if (I->getOpcode() == Cpu0::JMP && I->getOperand(0).getMBB() == &MBBN) {
    // I is the instruction of "jmp #offset=0", as follows,
    //     jmp	$BB0_3
    // $BB0_3:
    //     ld	$4, 28($sp)
    ++NumDelJmp;
    MBB.erase(I);	// delete the "JMP 0" instruction
    Changed = true;	// Notify LLVM kernel Changed
  }
  return Changed;

}

/// createCpu0DelJmpPass - Returns a pass that DelJmp in Cpu0 MachineFunctions
FunctionPass *llvm::createCpu0DelJmpPass(Cpu0TargetMachine &tm) {
  return new DelJmp(tm);
}

#endif

As above code, except Cpu0DelUselessJMP.cpp, other files are changed for registering class DelJmp as a functional pass. As the comment of above code, MBB is the current block and MBBN is the next block. For each last instruction of every MBB, we check whether or not it is the JMP instruction and its Operand is the next basic block. By getMBB() in MachineOperand, you can get the MBB address. For the member functions of MachineOperand, please check include/llvm/CodeGen/MachineOperand.h Now, let’s run Chapter8_2/ with ch8_2_deluselessjmp.cpp for explanation.

lbdex/input/ch8_2_deluselessjmp.cpp

int test_DelUselessJMP()
{
  int a = 1; int b = -2; int c = 3;
  
  if (a == 0) {
    a++;
  }
  if (b == 0) {
    a = a + 3;
    b++;
  } else if (b < 0) {
    a = a + b;
    b--;
  }
  if (c > 0) {
    a = a + c;
    c++;
  }
  
  return a;
}

118-165-78-10:input Jonathan$ clang -target mips-unknown-linux-gnu
-c ch8_2_deluselessjmp.cpp -emit-llvm -o ch8_2_deluselessjmp.bc
118-165-78-10:input Jonathan$ /Users/Jonathan/llvm/test/build/
bin/llc -march=cpu0 -relocation-model=static -filetype=asm -stats
ch8_2_deluselessjmp.bc -o -
  ...
        cmp   $sw, $4, $3
        jne   $sw, $BB0_2
        nop
# BB#1:
  ...
        cmp   $sw, $3, $2
        jlt   $sw, $BB0_8
        nop
# BB#7:
...
===-------------------------------------------------------------------------===
                          ... Statistics Collected ...
===-------------------------------------------------------------------------===
 ...
 2 del-jmp        - Number of useless jmp deleted
 ...

The terminal displays “Number of useless jmp deleted” by llc -stats option because we set the “STATISTIC(NumDelJmp, “Number of useless jmp deleted”)” in code. It deletes 2 jmp instructions from block “# BB#0” and “$BB0_6”. You can check it by llc -enable-cpu0-del-useless-jmp=false option to see the difference to non-optimization version. If you run with ch8_1_1.cpp, you will find 10 jmp instructions are deleted from 120 lines of assembly code, which meaning 8% improvement in speed and code size [1].

Fill Branch Delay Slot

Cpu0 instruction set is designed to be a classical RISC pipeline machine. Classical RISC machine has many perfect features [3] [4]. I change Cpu0 backend to a 5 stages of classical RISC pipeline machine with one delay slot like some of Mips model (The original Cpu0 from its author, is a 3 stages of RISC machine). With this change, the backend needs filling the NOP instruction in the branch delay slot. In order to make this tutorial simple for learning, Cpu0 backend code not fill the branch delay slot with any useful instruction for optimization. Readers can reference the MipsDelaySlotFiller.cpp to know how to insert useful instructions in backend optimization. Following code added in Chapter8_2 for NOP fill in Branch Delay Slot.

lbdex/chapters/Chapter8_2/CMakeLists.txt

  Cpu0DelaySlotFiller.cpp

lbdex/chapters/Chapter8_2/Cpu0.h

  FunctionPass *createCpu0DelaySlotFillerPass(Cpu0TargetMachine &TM);

lbdex/chapters/Chapter8_2/Cpu0TargetMachine.cpp

// Implemented by targets that want to run passes immediately before
// machine code is emitted. return true if -print-machineinstrs should
// print out the code after the passes.
void Cpu0PassConfig::addPreEmitPass() {
  Cpu0TargetMachine &TM = getCpu0TargetMachine();
  addPass(createCpu0DelaySlotFillerPass(TM));
}

lbdex/chapters/Chapter8_2/Cpu0DelaySlotFiller.cpp

//===-- Cpu0DelaySlotFiller.cpp - Cpu0 Delay Slot Filler ------------------===//
//
//                     The LLVM Compiler Infrastructure
//
// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.
//
//===----------------------------------------------------------------------===//
//
// Simple pass to fill delay slots with useful instructions.
//
//===----------------------------------------------------------------------===//

#include "Cpu0.h"
#if CH >= CH8_2

#include "Cpu0InstrInfo.h"
#include "Cpu0TargetMachine.h"
#include "llvm/ADT/BitVector.h"
#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/Statistic.h"
#include "llvm/Analysis/AliasAnalysis.h"
#include "llvm/Analysis/ValueTracking.h"
#include "llvm/CodeGen/MachineBranchProbabilityInfo.h"
#include "llvm/CodeGen/MachineFunctionPass.h"
#include "llvm/CodeGen/MachineInstrBuilder.h"
#include "llvm/CodeGen/PseudoSourceValue.h"
#include "llvm/CodeGen/TargetInstrInfo.h"
#include "llvm/Support/CommandLine.h"
#include "llvm/Target/TargetMachine.h"
#include "llvm/CodeGen/TargetRegisterInfo.h"

using namespace llvm;

#define DEBUG_TYPE "delay-slot-filler"

STATISTIC(FilledSlots, "Number of delay slots filled");

namespace {
  typedef MachineBasicBlock::iterator Iter;
  typedef MachineBasicBlock::reverse_iterator ReverseIter;

  class Filler : public MachineFunctionPass {
  public:
    Filler(TargetMachine &tm)
      : MachineFunctionPass(ID) { }

    StringRef getPassName() const override {
      return "Cpu0 Delay Slot Filler";
    }

    bool runOnMachineFunction(MachineFunction &F) override {
      bool Changed = false;
      for (MachineFunction::iterator FI = F.begin(), FE = F.end();
           FI != FE; ++FI)
        Changed |= runOnMachineBasicBlock(*FI);
      return Changed;
    }
  private:
    bool runOnMachineBasicBlock(MachineBasicBlock &MBB);

    static char ID;
  };
  char Filler::ID = 0;
} // end of anonymous namespace

static bool hasUnoccupiedSlot(const MachineInstr *MI) {
  return MI->hasDelaySlot() && !MI->isBundledWithSucc();
}

/// runOnMachineBasicBlock - Fill in delay slots for the given basic block.
/// We assume there is only one delay slot per delayed instruction.
bool Filler::runOnMachineBasicBlock(MachineBasicBlock &MBB) {
  bool Changed = false;
  const Cpu0Subtarget &STI = MBB.getParent()->getSubtarget<Cpu0Subtarget>();
  const Cpu0InstrInfo *TII = STI.getInstrInfo();

  for (Iter I = MBB.begin(); I != MBB.end(); ++I) {
    if (!hasUnoccupiedSlot(&*I))
      continue;

    ++FilledSlots;
    Changed = true;

    // Bundle the NOP to the instruction with the delay slot.
    BuildMI(MBB, std::next(I), I->getDebugLoc(), TII->get(Cpu0::NOP));
    MIBundleBuilder(MBB, I, std::next(I, 2));
  }

  return Changed;
}

/// createCpu0DelaySlotFillerPass - Returns a pass that fills in delay
/// slots in Cpu0 MachineFunctions
FunctionPass *llvm::createCpu0DelaySlotFillerPass(Cpu0TargetMachine &tm) {
  return new Filler(tm);
}

#endif

To make the basic block label remains same, statement MIBundleBuilder() needs to be inserted after the statement BuildMI(…, NOP) of Cpu0DelaySlotFiller.cpp. MIBundleBuilder() make both the branch instruction and NOP bundled into one instruction (first part is branch instruction and second part is NOP).

lbdex/chapters/Chapter3_2/Cpu0AsmPrinter.cpp

//- emitInstruction() must exists or will have run time error.
void Cpu0AsmPrinter::emitInstruction(const MachineInstr *MI) {
  //  Print out both ordinary instruction and boudle instruction
  MachineBasicBlock::const_instr_iterator I = MI->getIterator();
  MachineBasicBlock::const_instr_iterator E = MI->getParent()->instr_end();

  do {
    if (I->isPseudo() && !isLongBranchPseudo(I->getOpcode()))
      llvm_unreachable("Pseudo opcode found in emitInstruction()");

    MCInst TmpInst0;
    MCInstLowering.Lower(&*I, TmpInst0);
    OutStreamer->emitInstruction(TmpInst0, getSubtargetInfo());
  } while ((++I != E) && I->isInsideBundle()); // Delay slot check
}

In order to print the NOP, the Cpu0AsmPrinter.cpp of Chapter3_2 prints all bundle instructions in loop. Without the loop, only the first part of the bundle instruction (branch instruction only) is printed. In llvm 3.1 the basice block label remains same even if you didn’t do the bundle after it. But for some reasons, it changed in llvm at some later version and you need doing “bundle” in order to keep block label unchanged at later llvm phase.

Conditional instruction

lbdex/input/ch8_2_select.cpp


// The following files will generate IR select even compile with clang -O0.
int test_movx_1()
{
  volatile int a = 1;
  int c = 0;

  c = !a ? 1:3;

  return c;
}

int test_movx_2()
{
  volatile int a = 1;
  int c = 0;

  c = a ? 1:3;

  return c;
}

Run Chapter8_1 with ch8_2_select.cpp will get the following result.

114-37-150-209:input Jonathan$ clang -O1 -target mips-unknown-linux-gnu
-c ch8_2_select.cpp -emit-llvm -o ch8_2_select.bc
114-37-150-209:input Jonathan$ ~/llvm/test/build/bin/
llvm-dis ch8_2_select.bc -o -
...
; Function Attrs: nounwind uwtable
define i32 @_Z11test_movx_1v() #0 {
  %a = alloca i32, align 4
  %c = alloca i32, align 4
  store volatile i32 1, i32* %a, align 4
  store i32 0, i32* %c, align 4
  %1 = load volatile i32* %a, align 4
  %2 = icmp ne i32 %1, 0
  %3 = xor i1 %2, true
  %4 = select i1 %3, i32 1, i32 3
  store i32 %4, i32* %c, align 4
  %5 = load i32* %c, align 4
  ret i32 %5
}

; Function Attrs: nounwind uwtable
define i32 @_Z11test_movx_2v() #0 {
  %a = alloca i32, align 4
  %c = alloca i32, align 4
  store volatile i32 1, i32* %a, align 4
  store i32 0, i32* %c, align 4
  %1 = load volatile i32* %a, align 4
  %2 = icmp ne i32 %1, 0
  %3 = select i1 %2, i32 1, i32 3
  store i32 %3, i32* %c, align 4
  %4 = load i32* %c, align 4
  ret i32 %4
}
...

114-37-150-209:input Jonathan$ ~/llvm/test/build/bin/llc
-march=cpu0 -mcpu=cpu032I -relocation-model=static -filetype=asm
ch8_2_select.bc -o -
...
LLVM ERROR: Cannot select: 0x39f47c0: i32 = select_cc ...

As above llvm IR, ch8_2_select.bc, clang generates select IR for small basic control block (if statement only include one assign statement). This select IR is the result of optimization for CPUs with conditional instructions support. And from above error message, obviously IR select is changed to select_cc during DAG optimization stages.

Chapter8_2 supports select with the following code added and changed.

lbdex/chapters/Chapter8_2/Cpu0InstrInfo.td

let Predicates = [Ch8_2] in {
include "Cpu0CondMov.td"
} // let Predicates = [Ch8_2]

lbdex/chapters/Chapter8_2/Cpu0CondMov.td

//===-- Cpu0CondMov.td - Describe Cpu0 Conditional Moves --*- tablegen -*--===//
//
//                     The LLVM Compiler Infrastructure
//
// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.
//
//===----------------------------------------------------------------------===//
//
// This is the Conditional Moves implementation.
//
//===----------------------------------------------------------------------===//

// Conditional moves:
// These instructions are expanded in
// Cpu0ISelLowering::EmitInstrWithCustomInserter if target does not have
// conditional move instructions.
// cond:int, data:int
class CondMovIntInt<RegisterClass CRC, RegisterClass DRC, bits<8> op,
                    string instr_asm> :
  FA<op, (outs DRC:$ra), (ins DRC:$rb, CRC:$rc, DRC:$F),
     !strconcat(instr_asm, "\t$ra, $rb, $rc"), [], IIAlu> {
  let shamt = 0;
  let Constraints = "$F = $ra";
}

// select patterns
multiclass MovzPats0Slt<RegisterClass CRC, RegisterClass DRC,
                     Instruction MOVZInst, Instruction SLTOp,
                     Instruction SLTuOp, Instruction SLTiOp,
                     Instruction SLTiuOp> {
  def : Pat<(select (i32 (setge CRC:$lhs, CRC:$rhs)), DRC:$T, DRC:$F),
            (MOVZInst DRC:$T, (SLTOp CRC:$lhs, CRC:$rhs), DRC:$F)>;
  def : Pat<(select (i32 (setuge CRC:$lhs, CRC:$rhs)), DRC:$T, DRC:$F),
            (MOVZInst DRC:$T, (SLTuOp CRC:$lhs, CRC:$rhs), DRC:$F)>;
  def : Pat<(select (i32 (setge CRC:$lhs, immSExt16:$rhs)), DRC:$T, DRC:$F),
            (MOVZInst DRC:$T, (SLTiOp CRC:$lhs, immSExt16:$rhs), DRC:$F)>;
  def : Pat<(select (i32 (setuge CRC:$lh, immSExt16:$rh)), DRC:$T, DRC:$F),
            (MOVZInst DRC:$T, (SLTiuOp CRC:$lh, immSExt16:$rh), DRC:$F)>;
  def : Pat<(select (i32 (setle CRC:$lhs, CRC:$rhs)), DRC:$T, DRC:$F),
            (MOVZInst DRC:$T, (SLTOp CRC:$rhs, CRC:$lhs), DRC:$F)>;
  def : Pat<(select (i32 (setule CRC:$lhs, CRC:$rhs)), DRC:$T, DRC:$F),
            (MOVZInst DRC:$T, (SLTuOp CRC:$rhs, CRC:$lhs), DRC:$F)>;
}

multiclass MovzPats1<RegisterClass CRC, RegisterClass DRC,
                     Instruction MOVZInst, Instruction XOROp> {
  def : Pat<(select (i32 (seteq CRC:$lhs, CRC:$rhs)), DRC:$T, DRC:$F),
            (MOVZInst DRC:$T, (XOROp CRC:$lhs, CRC:$rhs), DRC:$F)>;
  def : Pat<(select (i32 (seteq CRC:$lhs, 0)), DRC:$T, DRC:$F),
            (MOVZInst DRC:$T, CRC:$lhs, DRC:$F)>;
}

multiclass MovnPats<RegisterClass CRC, RegisterClass DRC, Instruction MOVNInst,
                    Instruction XOROp> {
  def : Pat<(select (i32 (setne CRC:$lhs, CRC:$rhs)), DRC:$T, DRC:$F),
            (MOVNInst DRC:$T, (XOROp CRC:$lhs, CRC:$rhs), DRC:$F)>;
  def : Pat<(select CRC:$cond, DRC:$T, DRC:$F),
            (MOVNInst DRC:$T, CRC:$cond, DRC:$F)>;
  def : Pat<(select (i32 (setne CRC:$lhs, 0)),DRC:$T, DRC:$F),
            (MOVNInst DRC:$T, CRC:$lhs, DRC:$F)>;
}

// Instantiation of instructions.
def MOVZ_I_I     : CondMovIntInt<CPURegs, CPURegs, 0x0a, "movz">;

def MOVN_I_I     : CondMovIntInt<CPURegs, CPURegs, 0x0b, "movn">;

// Instantiation of conditional move patterns.
let Predicates = [HasSlt] in {
defm : MovzPats0Slt<CPURegs, CPURegs, MOVZ_I_I, SLT, SLTu, SLTi, SLTiu>;
}

defm : MovzPats1<CPURegs, CPURegs, MOVZ_I_I, XOR>;

defm : MovnPats<CPURegs, CPURegs, MOVN_I_I, XOR>;

lbdex/chapters/Chapter8_2/Cpu0ISelLowering.h

    SDValue lowerSELECT(SDValue Op, SelectionDAG &DAG) const;

lbdex/chapters/Chapter8_2/Cpu0ISelLowering.cpp

Cpu0TargetLowering::Cpu0TargetLowering(const Cpu0TargetMachine &TM,
                                       const Cpu0Subtarget &STI)
    : TargetLowering(TM), Subtarget(STI), ABI(TM.getABI()) {

  setOperationAction(ISD::SELECT,             MVT::i32,   Custom);
  setOperationAction(ISD::SELECT_CC,         MVT::i32, Expand);
  setOperationAction(ISD::SELECT_CC,         MVT::Other, Expand);
}
SDValue Cpu0TargetLowering::
LowerOperation(SDValue Op, SelectionDAG &DAG) const
{
  switch (Op.getOpcode())
  {
  case ISD::SELECT:             return lowerSELECT(Op, DAG);
  }
  ...
}
SDValue Cpu0TargetLowering::
lowerSELECT(SDValue Op, SelectionDAG &DAG) const
{
  return Op;
}

Set ISD::SELECT_CC to “Expand” will stop llvm optimization from merging “setcc” and “select” into one IR “select_cc” [2]. Next the LowerOperation() return Op code directly for ISD::SELECT. Finally the pattern defined in Cpu0CondMov.td will translate the select IR into conditional instruction, movz or movn. Let’s run Chapter8_2 with ch8_2_select.cpp to get the following result. Again, the cpu032II uses slt instead of cmp has a little improved in instructions number.

114-37-150-209:input Jonathan$ ~/llvm/test/build/bin/llc
-march=cpu0 -mcpu=cpu032I -relocation-model=static -filetype=asm ch8_2_select.bc -o -
...
      .type   _Z11test_movx_1v,@function
      ...
      addiu   $2, $zero, 3
      movz    $2, $3, $4
      ...
      .type   _Z11test_movx_2v,@function
      ...
      addiu   $2, $zero, 3
      movn    $2, $3, $4
      ...

The clang uses select IR in small basic block to reduce the branch cost in pipeline machine since the branch will make the pipeline “stall”. But it needs the conditional instruction support [3]. If your backend has no conditional instruction and needs clang compiler with optimization option O1 above level, you can change clang to force it generating traditional branch basic block instead of generating IR select. RISC CPU came from the advantage of pipeline and add more and more instruction when time passed. Compare Mips and ARM, the Mips has only movz and movn two instructions while ARM has many. We create Cpu0 instructions as a simple instructions RISC pipeline machine for compiler toolchain tutorial. However the cmp instruction is hired because many programmer is used to it in past and now (ARM use it). This instruction matches the thinking in assembly programming, but the slt instruction is more efficient in RISC pipleline. If you designed a backend aimed for C/C++ highlevel language, you may consider slt instead of cmp since assembly code are rare used in programming and beside, the assembly programmer can accept slt not difficultly since usually they are professional.

File ch8_2_select2.cpp will generate IR select if compile with clang -O1.

lbdex/input/ch8_2_select2.cpp


// The following files will generate IR select when compile with clang -O1 but 
// clang -O0 won't generate IR select.
volatile int a = 1;
volatile int b = 2;

int test_movx_3()
{
  int c = 0;

  if (a < b)
    return 1;
  else
    return 2;
}

int test_movx_4()
{
  int c = 0;

  if (a)
    c = 1;
  else
    c = 3;

  return c;
}

List the conditional statements of C, IR, DAG and Cpu0 instructions as the following table.

Table 34 Conditional statements of C, IR, DAG and Cpu0 instructions

C

if (a < b) c = 1; else c = 3;

c = a ? 1:3;

IR

icmp + (eq, ne, sgt, sge, slt, sle) + br

DAG

((seteq, setne, setgt, setge, setlt, setle) + setcc) + select

Cpu0

movz, movn

File ch8_2_select_global_pic.cpp mentioned in Chapter Global variables can be tested now as follows,

lbdex/input/ch8_2_select_global_pic.cpp

volatile int a1 = 1;
volatile int b1 = 2;

int gI1 = 100;
int gJ1 = 50;

int test_select_global_pic()
{
  if (a1 < b1)
    return gI1;
  else
    return gJ1;
}
JonathantekiiMac:input Jonathan$ clang -O1 -target mips-unknown-linux-gnu
-c ch8_2_select_global_pic.cpp -emit-llvm -o ch8_2_select_global_pic.bc
JonathantekiiMac:input Jonathan$ ~/llvm/test/build/bin/
llvm-dis ch8_2_select_global_pic.bc -o -
...
@a1 = global i32 1, align 4
@b1 = global i32 2, align 4
@gI1 = global i32 100, align 4
@gJ1 = global i32 50, align 4

; Function Attrs: nounwind
define i32 @_Z18test_select_globalv() #0 {
  %1 = load volatile i32* @a1, align 4, !tbaa !1
  %2 = load volatile i32* @b1, align 4, !tbaa !1
  %3 = icmp slt i32 %1, %2
  %gI1.val = load i32* @gI1, align 4
  %gJ1.val = load i32* @gJ1, align 4
  %.0 = select i1 %3, i32 %gI1.val, i32 %gJ1.val
  ret i32 %.0
}
...
JonathantekiiMac:input Jonathan$ ~/llvm/test/build/bin/
llc -march=cpu0 -mcpu=cpu032I -relocation-model=pic -filetype=asm ch8_2_select_global_pic.bc -o -
  .section .mdebug.abi32
  .previous
  .file "ch8_2_select_global_pic.bc"
  .text
  .globl  _Z18test_select_globalv
  .align  2
  .type _Z18test_select_globalv,@function
  .ent  _Z18test_select_globalv # @_Z18test_select_globalv
_Z18test_select_globalv:
  .frame  $sp,0,$lr
  .mask   0x00000000,0
  .set  noreorder
  .cpload $t9
  .set  nomacro
# BB#0:
  lui $2, %got_hi(a1)
  addu  $2, $2, $gp
  ld  $2, %got_lo(a1)($2)
  ld  $2, 0($2)
  lui $3, %got_hi(b1)
  addu  $3, $3, $gp
  ld  $3, %got_lo(b1)($3)
  ld  $3, 0($3)
  cmp $sw, $2, $3
  andi  $2, $sw, 1
  lui $3, %got_hi(gJ1)
  addu  $3, $3, $gp
  ori $3, $3, %got_lo(gJ1)
  lui $4, %got_hi(gI1)
  addu  $4, $4, $gp
  ori $4, $4, %got_lo(gI1)
  movn  $3, $4, $2
  ld  $2, 0($3)
  ld  $2, 0($2)
  ret $lr
  .set  macro
  .set  reorder
  .end  _Z18test_select_globalv
$tmp0:
  .size _Z18test_select_globalv, ($tmp0)-_Z18test_select_globalv

  .type a1,@object              # @a1
  .data
  .globl  a1
  .align  2
a1:
  .4byte  1                       # 0x1
  .size a1, 4

  .type b1,@object              # @b1
  .globl  b1
  .align  2
b1:
  .4byte  2                       # 0x2
  .size b1, 4

  .type gI1,@object             # @gI1
  .globl  gI1
  .align  2
gI1:
  .4byte  100                     # 0x64
  .size gI1, 4

  .type gJ1,@object             # @gJ1
  .globl  gJ1
  .align  2
gJ1:
  .4byte  50                      # 0x32
  .size gJ1, 4

Phi node

Since phi node is popular used in SSA form [5], llvm applies phi node in IR for optimization work either. Phi node exists for “live variable analysis”. An example for C is here [6]. As mentioned in wiki web site of reference above, through finding dominance frontiers, compiler knows where to insert phi functions. The following input let you know the benefits of phi node as follows,

lbdex/input/ch8_2_phinode.cpp

int test_phinode(int a , int b, int c)
{
  int d = 2;
  
  if (a == 0) {
    a = a+1; // a = 1
  }
  else if (b != 0) {
    a = a-1;
  }
  else if (c == 0) {
    a = a+2;
  }
  d = a + b;
  
  return d;
}

Compile it with debug build clang for O3 as follows,

114-43-212-251:input Jonathan$ ~/llvm/debug/build/bin/clang
-O3 -target mips-unknown-linux-gnu -c ch8_2_phinode.cpp -emit-llvm -o
ch8_2_phinode.bc
114-43-212-251:input Jonathan$ ~/llvm/test/build/bin/llvm-dis
ch8_2_phinode.bc -o -
...
define i32 @_Z12test_phinodeiii(i32 signext %a, i32 signext %b, i32 signext %c) local_unnamed_addr #0 {
entry:
  %cmp = icmp eq i32 %a, 0
  br i1 %cmp, label %if.end7, label %if.else

if.else:                                          ; preds = %entry
  %cmp1 = icmp eq i32 %b, 0
  br i1 %cmp1, label %if.else3, label %if.then2

if.then2:                                         ; preds = %if.else
  %dec = add nsw i32 %a, -1
  br label %if.end7

if.else3:                                         ; preds = %if.else
  %cmp4 = icmp eq i32 %c, 0
  %add = add nsw i32 %a, 2
  %add.a = select i1 %cmp4, i32 %add, i32 %a
  br label %if.end7

if.end7:                                          ; preds = %entry, %if.else3, %if.then2
  %a.addr.0 = phi i32 [ %dec, %if.then2 ], [ %add.a, %if.else3 ], [ 1, %entry ]
  %add8 = add nsw i32 %a.addr.0, %b
  ret i32 %add8
}

Because SSA form, the llvm ir for destination variable a in different basic block (if then, else) must use different name. But how does the source variable a in “d = a + b;” be named? The basic block “a = a-1;” and “a = a+2;” have different names. The basic block “a = a-1;” uses %dec and the basic block “a = a+2;” uses “%add” as destination variable name in SSA llvm ir. In order to solve the source variable name from different basic blocks in SSA form, the phi structure is created as above. The compiler option O0 as the following doesn’t apply phi node. Instead, it uses store to solve the source variable name from different basic block.

114-43-212-251:input Jonathan$ ~/llvm/debug/build/bin/clang
-O0 -target mips-unknown-linux-gnu -c ch8_2_phinode.cpp -emit-llvm -o
ch8_2_phinode.bc
114-43-212-251:input Jonathan$ ~/llvm/test/build/bin/llvm-dis
ch8_2_phinode.bc -o -
...
define i32 @_Z12test_phinodeiii(i32 signext %a, i32 signext %b, i32 signext %c) #0 {
entry:
  %a.addr = alloca i32, align 4
  %b.addr = alloca i32, align 4
  %c.addr = alloca i32, align 4
  %d = alloca i32, align 4
  store i32 %a, i32* %a.addr, align 4
  store i32 %b, i32* %b.addr, align 4
  store i32 %c, i32* %c.addr, align 4
  store i32 2, i32* %d, align 4
  %0 = load i32, i32* %a.addr, align 4
  %cmp = icmp eq i32 %0, 0
  br i1 %cmp, label %if.then, label %if.else

if.then:                                          ; preds = %entry
  %1 = load i32, i32* %a.addr, align 4
  %inc = add nsw i32 %1, 1
  store i32 %inc, i32* %a.addr, align 4
  br label %if.end7

if.else:                                          ; preds = %entry
  %2 = load i32, i32* %b.addr, align 4
  %cmp1 = icmp ne i32 %2, 0
  br i1 %cmp1, label %if.then2, label %if.else3

if.then2:                                         ; preds = %if.else
  %3 = load i32, i32* %a.addr, align 4
  %dec = add nsw i32 %3, -1
  store i32 %dec, i32* %a.addr, align 4
  br label %if.end6

if.else3:                                         ; preds = %if.else
  %4 = load i32, i32* %c.addr, align 4
  %cmp4 = icmp eq i32 %4, 0
  br i1 %cmp4, label %if.then5, label %if.end

if.then5:                                         ; preds = %if.else3
  %5 = load i32, i32* %a.addr, align 4
  %add = add nsw i32 %5, 2
  store i32 %add, i32* %a.addr, align 4
  br label %if.end

if.end:                                           ; preds = %if.then5, %if.else3
  br label %if.end6

if.end6:                                          ; preds = %if.end, %if.then2
  br label %if.end7

if.end7:                                          ; preds = %if.end6, %if.then
  %6 = load i32, i32* %a.addr, align 4
  %7 = load i32, i32* %b.addr, align 4
  %add8 = add nsw i32 %6, %7
  store i32 %add8, i32* %d, align 4
  %8 = load i32, i32* %d, align 4
  ret i32 %8
}

Compile with clang -O3 generate phi function. The phi function can assign virtual register value directly from multi basic blocks. Compile with clang -O0 doesn’t generate phi, it assigns virtual register value by loading stack slot where the stack slot is saved in each of multi basic blocks before. In this example the pointer of %a.addr point to the stack slot, and “store i32 %inc, i32* %a.addr, align 4”, “store i32 %dec, i32* %a.addr, align 4”, “store i32 %add, i32* %a.addr, align 4” in label if.then:, if.then2: and if.then5:, respectively. In other words, it needs 3 store instructions. It’s possible that compiler finds that the a == 0 is always true after optimization analysis through phi node. If so, the phi node version will bring better result because clang -O0 version uses load and store with pointer %a.addr which may cut the optimization opportunity. Compiler books discuss the Control Flow Graph (CFG) analysis through dominance frontiers calculation for setting phi node. Then compiler apply the global optimization on CFG with phi node, and remove phi node by replacing with “load store” at the end.

If you are interested in more details than the wiki web site, please refer book here [7] for phi node, or book here [8] for the dominator tree analysis if you have this book.

RISC CPU knowledge

As mentioned in the previous section, Cpu0 is a RISC (Reduced Instruction Set Computer) CPU with 5 stages of pipeline. RISC CPU is full in the world, even the X86 of CISC (Complex Instruction Set Computer) is RISC inside (It translates CISC instruction into micro-instructions which do pipeline as RISC). Knowledge with RISC concept may make you satisfied in compiler design. List these two excellent books we have read for reference. Sure, there are many books in Computer Architecture and some of them contain real RISC CPU knowledge needed, but these two are excellent and popular.

Computer Organization and Design: The Hardware/Software Interface (The Morgan Kaufmann Series in Computer Architecture and Design)

Computer Architecture: A Quantitative Approach (The Morgan Kaufmann Series in Computer Architecture and Design)

The book of “Computer Organization and Design: The Hardware/Software Interface” (there are 4 editions at the book is written) is for the introduction, while “Computer Architecture: A Quantitative Approach” is more complicate and deep in CPU architecture (there are 5 editions at the book is written).

Above two books use Mips CPU as an example since Mips is more RISC-like than other market CPUs. ARM serials of CPU dominate the embedded market especially in mobile phone and other portable devices. The following book is good which I am reading now.

ARM System Developer’s Guide: Designing and Optimizing System Software (The Morgan Kaufmann Series in Computer Architecture and Design).