# Instruction Selection and Scheduling

# The Problem

Writing a compiler is a lot of work

- Would like to reuse components whenever possible
- Would like to automate construction of components
- Front end construction is largely automated
- Middle is largely hand crafted
- (Parts of) back end can be automated

# Definitions

#### Instruction selection

- Mapping <u>IR</u> into assembly code
- Assumes a fixed storage mapping & code shape
- Combining operations, using address modes

#### Instruction scheduling

- Reordering operations to hide latencies
- Assumes a fixed program (set of operations)
- Changes demand for registers

#### Register allocation

- Deciding which values will reside in registers
- Changes the storage mapping, may add false sharing
- Concerns about placement of data & memory operations

Modern computers (still) have many ways to do anything

Consider register-to-register copy in ILOC

- Obvious operation is i2i r<sub>i</sub> ⇒ r<sub>j</sub>
- Many others exist

| addI $r_i, 0 \Rightarrow r_j$ | subI $r_i, 0 \Rightarrow r_j$ | lshiftI $r_i, 0 \Rightarrow r_j$ |
|-------------------------------|-------------------------------|----------------------------------|
| multI $r_i, 1 \Rightarrow r_j$ | divI $r_i, 1 \Rightarrow r_j$ | rshiftI $r_i, 0 \Rightarrow r_j$ |
| orI $r_i, 0 \Rightarrow r_j$ | xorI $r_i, 0 \Rightarrow r_j$ | and others |

- Human would ignore all of these
- Algorithm must look at all of them & find low-cost encoding
  - → Take context into account (busy functional unit?)

# The Goal

Want to automate generation of instruction selectors

Machine description should also help with scheduling & allocation

# The Big Picture

Need pattern matching techniques

- Must produce good code (some metric for good)
- Must run quickly

A treewalk code generator runs quickly

How good was the code?
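A treewalk generator of the kind mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: the tuple-based tree layout, the node tags `"MULT"` and `"IDENT"`, the function name `treewalk`, and the starting register number are all hypothetical choices, not taken from the slides. The point it shows is that each node is visited once and translated in isolation, which is why the approach is fast and also why it cannot notice that a constant offset could be folded into the load.

```python
from itertools import count

def treewalk(node, regs, code):
    """Emit ILOC for `node` into `code`; return the register holding its value.

    Assumed node shapes (hypothetical, for illustration only):
      ("IDENT", name, base, offset)  -- a variable at base + offset
      ("MULT", left, right)          -- a multiplication
    """
    kind = node[0]
    if kind == "IDENT":
        _, _name, base, offset = node
        r_off = next(regs)                       # register for the offset
        code.append(f"loadI {offset} => r{r_off}")
        r_val = next(regs)                       # register for the loaded value
        code.append(f"loadAO r_{base.lower()}, r{r_off} => r{r_val}")
        return r_val
    if kind == "MULT":
        left = treewalk(node[1], regs, code)     # children first (postorder)
        right = treewalk(node[2], regs, code)
        r_res = next(regs)
        code.append(f"mult r{left}, r{right} => r{r_res}")
        return r_res
    raise ValueError(f"unknown node kind: {kind}")

# a * b, with a at ARP+4 and b at ARP+8; numbering starts at r5 by assumption.
tree = ("MULT", ("IDENT", "a", "ARP", 4), ("IDENT", "b", "ARP", 8))
code = []
treewalk(tree, count(5), code)
```

Each `IDENT` costs two operations here because the walker handles the offset and the load as separate steps; a pattern-based selector can match both with a single address-immediate load.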
#### Tree

[Figure: expression tree with a × root and two children, IDENT $\langle a, \text{ARP}, 4 \rangle$ and IDENT $\langle b, \text{ARP}, 8 \rangle$]

#### Treewalk Code

$$\begin{array}{ll} \text{loadI} & 4 \Rightarrow r_5 \\ \text{loadAO} & r_{\text{arp}}, r_5 \Rightarrow r_6 \\ \text{loadI} & 8 \Rightarrow r_7 \\ \text{loadAO} & r_{\text{arp}}, r_7 \Rightarrow r_8 \\ \text{mult} & r_6, r_8 \Rightarrow r_9 \end{array}$$

#### Desired Code

$$\begin{array}{ll} \text{loadAI} & r_{\text{arp}}, 4 \Rightarrow r_5 \\ \text{loadAI} & r_{\text{arp}}, 8 \Rightarrow r_6 \\ \text{mult} & r_5, r_6 \Rightarrow r_7 \end{array}$$

Pretty easy to fix. See 1st digression in Ch. 7 (pg 317)

Tree-oriented IR suggests pattern matching on trees

- Tree-patterns as input, matcher as output
- Each pattern maps to a target-machine instruction sequence
- Use bottom-up rewrite systems

Linear IR suggests using some sort of string matching

- Strings as input, matcher as output
- Each string maps to a target-machine instruction sequence
- Use text matching or peephole matching

In practice, both work well; matchers are quite different

# Peephole Matching

- Basic idea
- Compiler can discover local improvements locally
  - → Look at a small set of adjacent operations
  - → Move a "peephole" over code & search for improvement
- Classic example: store followed by load

#### Original code

$$\begin{array}{ll} \text{storeAI} & r_1 \Rightarrow r_{\text{arp}}, 8 \\ \text{loadAI} & r_{\text{arp}}, 8 \Rightarrow r_{15} \end{array}$$

#### Improved code

$$\begin{array}{ll} \text{storeAI} & r_1 \Rightarrow r_{\text{arp}}, 8 \\ \text{i2i} & r_1 \Rightarrow r_{15} \end{array}$$

- Simple algebraic identities

#### Original code

$$\begin{array}{ll} \text{addI} & r_2, 0 \Rightarrow r_7 \\ \text{mult} & r_4, r_7 \Rightarrow r_{10} \end{array}$$

#### Improved code

$$\begin{array}{ll} \text{mult} & r_4, r_2 \Rightarrow r_{10} \end{array}$$

- Jump to a jump

#### Original code

$$\begin{array}{lll} & \text{jumpI} & \rightarrow L_{10} \\ L_{10}: & \text{jumpI} & \rightarrow L_{11} \end{array}$$

#### Improved code

$$\begin{array}{lll} L_{10}: & \text{jumpI} & \rightarrow L_{11} \end{array}$$

#### Implementing it

- Early systems used limited set of hand-coded patterns
- Window size ensured quick processing

#### Modern peephole instruction selectors

Break problem into three tasks

#### Expander

- Turns IR code into a low-level IR (LLIR)
- Operation-by-operation, template-driven rewriting
- LLIR form includes all direct effects (e.g., setting cc)
- Significant, albeit constant, expansion of size

#### Simplifier

- Looks at LLIR through window and rewrites it
- Uses forward substitution, algebraic simplification, local constant propagation, and dead-effect elimination
- Performs local optimization within window
- This is the heart of the peephole system
  - → Benefit of peephole optimization shows up in this step

#### Matcher

- Compares simplified LLIR against a library of patterns
- Picks low-cost pattern that captures effects
- Must preserve LLIR effects, may add new ones (e.g., set cc)
- Generates the assembly code output

#### Original IR Code

| OP   | Arg<sub>1</sub> | Arg<sub>2</sub> | Result |
|------|-----------------|-----------------|--------|
| mult | 2               | y               | t<sub>1</sub> |
| sub  | x               | t<sub>1</sub>   | w      |

$w = r_{20}$

#### LLIR Code (after the expander)

$$\begin{array}{l} r_{10} \leftarrow 2 \\ r_{11} \leftarrow \text{@y} \\ r_{12} \leftarrow r_{\text{arp}} + r_{11} \\ r_{13} \leftarrow \text{MEM}(r_{12}) \\ r_{14} \leftarrow r_{10} \times r_{13} \\ r_{15} \leftarrow \text{@x} \\ r_{16} \leftarrow r_{\text{arp}} + r_{15} \\ r_{17} \leftarrow \text{MEM}(r_{16}) \\ r_{18} \leftarrow r_{17} - r_{14} \\ r_{19} \leftarrow \text{@w} \\ r_{20} \leftarrow r_{\text{arp}} + r_{19} \\ \text{MEM}(r_{20}) \leftarrow r_{18} \end{array}$$

#### LLIR Code (after the simplifier)

$$\begin{array}{l} r_{13} \leftarrow \text{MEM}(r_{\text{arp}} + \text{@y}) \\ r_{14} \leftarrow 2 \times r_{13} \\ r_{17} \leftarrow \text{MEM}(r_{\text{arp}} + \text{@x}) \\ r_{18} \leftarrow r_{17} - r_{14} \\ \text{MEM}(r_{\text{arp}} + \text{@w}) \leftarrow r_{18} \end{array}$$

#### Match

$$\begin{array}{ll} \text{loadAI} & r_{\text{arp}}, \text{@y} \Rightarrow r_{13} \\ \text{multI} & r_{13}, 2 \Rightarrow r_{14} \\ \text{loadAI} & r_{\text{arp}}, \text{@x} \Rightarrow r_{17} \\ \text{sub} & r_{17}, r_{14} \Rightarrow r_{18} \\ \text{storeAI} & r_{18} \Rightarrow r_{\text{arp}}, \text{@w} \end{array}$$

- Introduced all memory operations & temporary names
- Turned out pretty good code

# Making It All Work

#### Details

- LLIR is largely machine independent
- Target machine described as LLIR → ASM pattern
- Actual pattern matching
  - → Use a hand-coded pattern matcher (gcc)
- Several important compilers use this technology
- It seems to produce good portable instruction selectors

Key strength appears to be late low-level optimization

# What Makes Code Run Fast?
- Many operations have non-zero latencies
- Modern machines can issue several operations per cycle
- Execution time is order-dependent (and has been since the 60's)

#### Assumed latencies (conservative)

| Operation | Cycles |
|-----------|--------|
| load      | 3      |
| store     | 3      |
| loadI     | 1      |
| add       | 1      |
| mult      | 2      |
| fadd      | 1      |
| fmult     | 2      |
| shift     | 1      |
| branch    | 0 to 8 |

- Loads & stores may or may not block
  - → Non-blocking ⇒ fill those issue slots
- Branch costs vary with path taken
- Scheduler should hide the latencies

$$w \leftarrow w * 2 * x * y * z$$

| Cycle | Simple schedule | Cycle | Schedule loads early |
|-------|-----------------|-------|----------------------|
| 1  | loadAI r0,@w ⇒ r1  | 1  | loadAI r0,@w ⇒ r1  |
| 4  | add r1,r1 ⇒ r1     | 2  | loadAI r0,@x ⇒ r2  |
| 5  | loadAI r0,@x ⇒ r2  | 3  | loadAI r0,@y ⇒ r3  |
| 8  | mult r1,r2 ⇒ r1    | 4  | add r1,r1 ⇒ r1     |
| 9  | loadAI r0,@y ⇒ r2  | 5  | mult r1,r2 ⇒ r1    |
| 12 | mult r1,r2 ⇒ r1    | 6  | loadAI r0,@z ⇒ r2  |
| 13 | loadAI r0,@z ⇒ r2  | 7  | mult r1,r3 ⇒ r1    |
| 16 | mult r1,r2 ⇒ r1    | 9  | mult r1,r2 ⇒ r1    |
| 18 | storeAI r1 ⇒ r0,@w | 11 | storeAI r1 ⇒ r0,@w |
| 21 | r1 is free         | 14 | r1 is free         |
|    | 2 registers, 20 cycles |    | 3 registers, 13 cycles |

Reordering operations to improve some metric is called instruction scheduling

# Instruction Scheduling (Engineer's View)

#### The Problem

Given a code fragment for some target machine and the latencies for each individual operation, reorder the operations to minimize execution time

#### The Concept

#### The task

- Produce correct code
- Minimize wasted cycles
- Avoid spilling registers
- Operate efficiently

# Instruction Scheduling (The Abstract View)

To capture properties of the code, build a dependence graph G

- Nodes $n \in G$ are operations with type(n) and delay(n)
- An edge $e = (n_1, n_2) \in G$ if & only if $n_2$ uses the result of $n_1$

|    | Op      | Operands | Result  |
|----|---------|----------|---------|
| a: | loadAI  | r0,@w    | ⇒ r1    |
| b: | add     | r1,r1    | ⇒ r1    |
| c: | loadAI  | r0,@x    | ⇒ r2    |
| d: | mult    | r1,r2    | ⇒ r1    |
| e: | loadAI  | r0,@y    | ⇒ r2    |
| f: | mult    | r1,r2    | ⇒ r1    |
| g: | loadAI  | r0,@z    | ⇒ r2    |
| h: | mult    | r1,r2    | ⇒ r1    |
| i: | storeAI | r1       | ⇒ r0,@w |

[Figure: The Dependence Graph, with edges a→b, b→d, c→d, d→f, e→f, f→h, g→h, h→i]

#### Critical Points

- All operands must be available
- Multiple operations can be <u>ready</u>
- Moving operations can lengthen register lifetimes
- Placing uses near definitions can shorten register lifetimes
- Operands can have multiple predecessors

Together, these issues make scheduling <u>hard</u> (NP-Complete)

Local scheduling is the simple case

- Restricted to straight-line code
- Consistent and predictable latencies

# The big picture

1. Build a dependence graph, P
2. Compute a *priority function* over the nodes in P
3. Use list scheduling to construct a schedule, one cycle at a time
   - a. Use a queue of operations that are ready
   - b. At each cycle
     - I. Choose a ready operation and schedule it
     - II. Update the ready queue

#### Local list scheduling

- The dominant algorithm for twenty years
- A greedy, heuristic, local technique

```
Cycle ← 1
Ready ← leaves of P
Active ← Ø
while (Ready ∪ Active ≠ Ø)
    if (Ready ≠ Ø) then
        remove an op from Ready              // removal in priority order
        S(op) ← Cycle
        Active ← Active ∪ op
    Cycle ← Cycle + 1
    for each op ∈ Active
        if (S(op) + delay(op) ≤ Cycle) then  // op has completed execution
            remove op from Active
            for each successor s of op in P
                if (s is ready) then         // if successor's operands are ready, put it on Ready
                    Ready ← Ready ∪ s
```

# Scheduling Example

1. Build the dependence graph
2. Determine priorities: longest latency-weighted path
3. Perform list scheduling

| Cycle | Op | Operation          |
|-------|----|--------------------|
| 1     | a: | loadAI r0,@w ⇒ r1  |
| 2     | c: | loadAI r0,@x ⇒ r2  |
| 3     | e: | loadAI r0,@y ⇒ r3  |
| 4     | b: | add r1,r1 ⇒ r1     |
| 5     | d: | mult r1,r2 ⇒ r1    |
| 6     | g: | loadAI r0,@z ⇒ r2  |
| 7     | f: | mult r1,r3 ⇒ r1    |
| 9     | h: | mult r1,r2 ⇒ r1    |
| 11    | i: | storeAI r1 ⇒ r0,@w |

# List scheduling breaks down into two distinct classes

#### Forward list scheduling

- Start with available operations
- Work forward in time
- Ready ⇒ all operands available

#### Backward list scheduling

- Start with no successors
- Work backward in time
- Ready ⇒ result >= all uses
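The forward list-scheduling loop and the longest latency-weighted path priority can be tried out on the running example. The following is a minimal sketch, not a definitive implementation. Its assumptions: the names `DELAY`, `OPS`, `EDGES`, `priorities`, and `list_schedule` are hypothetical; the edge list is read off the code using the "uses the result of" rule; one operation issues per cycle; and only flow dependences are modeled, so register names are ignored (which presumes the result of operation e can be renamed to a fresh register, as the schedule above does with r3).

```python
# Latencies taken from the "Assumed latencies" table.
DELAY = {"loadAI": 3, "add": 1, "mult": 2, "storeAI": 3}

# Operations a..i of the running example: label -> opcode.
OPS = {"a": "loadAI", "b": "add", "c": "loadAI", "d": "mult", "e": "loadAI",
       "f": "mult", "g": "loadAI", "h": "mult", "i": "storeAI"}

# Flow-dependence edges (producer, consumer); assumed from the code listing.
EDGES = [("a", "b"), ("b", "d"), ("c", "d"), ("d", "f"),
         ("e", "f"), ("f", "h"), ("g", "h"), ("h", "i")]

def priorities(ops, edges):
    """Longest latency-weighted path from each node to the end of the graph."""
    succs = {n: [] for n in ops}
    for u, v in edges:
        succs[u].append(v)
    memo = {}
    def rank(n):
        if n not in memo:
            memo[n] = DELAY[ops[n]] + max((rank(s) for s in succs[n]), default=0)
        return memo[n]
    return {n: rank(n) for n in ops}

def list_schedule(ops, edges):
    """Forward list scheduling; returns label -> start cycle."""
    preds = {n: [] for n in ops}
    succs = {n: [] for n in ops}
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)
    prio = priorities(ops, edges)
    start, done = {}, set()
    cycle = 1
    ready = [n for n in ops if not preds[n]]      # operations with no operands pending
    active = []
    while ready or active:
        if ready:
            op = max(ready, key=lambda n: prio[n])  # removal in priority order
            ready.remove(op)
            start[op] = cycle
            active.append(op)
        cycle += 1
        for op in list(active):
            if start[op] + DELAY[ops[op]] <= cycle:  # op has completed execution
                active.remove(op)
                done.add(op)
                for s in succs[op]:
                    # s becomes ready once all of its producers have completed
                    if s not in start and s not in ready and all(p in done for p in preds[s]):
                        ready.append(s)
    return start

schedule = list_schedule(OPS, EDGES)
for label in sorted(schedule, key=schedule.get):
    print(schedule[label], label, OPS[label])
```

Picking the highest-priority ready operation is what pulls the three independent loads to the front: they sit at the start of the longest latency-weighted paths, so their latencies overlap with each other and with the later arithmetic.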