Lecure 4/5

---
Administrative


---

Language syntax

- Tokens: anything separated by "White space" from its neighboring context 
    in a program (such as keywords, identifiers, operator symbols, literals, .etc)
  o free format (vs fixed format like in Fortran)
  o comments
    > un-nested (due to the use of regular expression for scanning) 
    > nested


- Lexer or Scanner (determine whether a token is valid) 
  o use regular expressions

    1. group input into tokens 
    2. eliminate white spaces
    3. eliminate comments

       e.g., nested comments (*    (*   *)    *) like this is allowed in ML
       balanced parentheses can NOT be expressed as regular expression

  o example:

     digit -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
     unsigned_integer -> digit digit*

    - alternation
    - concatenation
    - Kleen star

  o pragmatics
    > Do not code "keywords" such as "do", "while" and etc. into
    a DFA. Instead, the scanner looks a token up in a hash table to 
    see if the token is a keywork.
    > longest possible token:
      e.g., 3.14 or 3..5 the "dot-dot problem" in Pascal.
      When one legitimate token is a prefix of another.
      It requires looking ahead one character to distinguish b/w 3.14 and 3..5


- Parser (determine whether a string of tokens are put together grammatically correct)
  o use Context-free grammar (why called context-free?)

   e.g.,

   <exp> := <int>
          | <exp> * <exp>
          | <exp> - <exp>
          | ( <exp> )

   Note: 
     o Recursion is allowed (that explains why CFG is more powerful 
          than regular expression); 
     o The notation is extended Backus Naur Form (BNF). 
       > Non-Terminals
       > Terminals
       > Meta language symbols, such as "|"

     o Distinction b/w a meta language and the language it describes

  This grammar has ambiguity. For example, precedence

     1 * 2 - 3 can be parsed as (1 * 2) - 3 or 1 * (2 -3).

  Left recursion: if there is a non-terminal A and a derivation
    A =>+ As where s is a string of symbols (both terminals and non-terminals)

  Precedence: which kind of operators is tigher in binding

  Associativity:  (1-2)-3 does not equal (1-(2-3); 
      even (1+2)+3 does not equal 1+(2+3) when there is rounding for 
      real numbers or overflow.

  Revised version of the grammar.

<aexp> := <int>
        | (<exp>)

<bexp> := <aexp>
        | <bexp> * <bexp>

<cexp> := <bexp>
        | <cexp> - <cexp>

<exp> := <cexp>

  Now, 1 * 2 - 3 won't be interpreted as 1*(2-3) because * is treated right
  associated, namely, 1*2*3 as 1*(2*3).
  So, <bexp> := <aexp> | <aexp> * <bexp> is not left recursive anymore.

  "-" is treated left associated, namely, 1-2-3 as (1-2)-3

<cexp> := <bexp> | <cexp> - <bexp>

  This is still left recursive. 
  Modified further:

  <cexp> := <bexp> <subs> <exp>
  <subs> := episilon | - <bexp> <subs>


- Recursive descent parser

  One fn for each non-terminal.
  Each fn takes a token list as argument, and returns 
  an abstract syntax tree (AST) and a list of unconsumed tokens.

  The parser will take the longest match (i.e., consume as many tokens as possible).

  There is problem with such "greedy" strategy. For example, grammar is 
  a regular expression: exp = (a|ab)bc

  and text is a string: string = abc

  if the parser takes "ab", which is the longest match, then the parser 
  will fail at "c".  To overcome this problem, backtrackingis needed to complement 

 o An example (recursive descent parser)

  The grammar 

    <aexp> := <num> | (<cexp>)
    <bexp> := <aexp> | ~<bexp>
    <cexp> := <bexp><subs>
    <subs> := epsilon | -<bexp><subs>

  Note: (negation ~ is unary operator)

Abstract syntax tree

 datatype Exp = Num of int
                | Minus of Exp*Exp
                | Neg of Exp;

Note that "(" and ")" do not appear in AST after parsing is done.

1-~2
Minus(Num 1, (Neg 2))

To build lexer, we need to define another datatype.

 datatype Token = Num of int
                  | MINUS | NEG | LPAR | RPAR

 exception Unknow Char

 fun digit c = Char.ord c -
               Char.ord #"o"

 fun scan [] = []
   | scan (c::cs) =
      (case c of
           #"-" => MINUS :: scan cs
         | #"~" => NEG :: scan cs

         | #"~" => NEG :: scan cs
          ... case for left, right parenthesis omited here.
         | _ => if Char.isSpace c
                then scan cs
                else if Char.isdigit c then
                scanNum (digit c) cs
                else raise Unknown Char)
   (* Note "|" for case and for fun may cause confusions to ML, 
      so use parentheses.  We need to define the helper function 
      scanNum.  "and" is used to introduce mutually recursive 
      functions.
   `*)

  and scanNum n [] = [Num n]  (* case for running out of character *)
    | scanNum n(c :: cs) =
        if Char.isDigit c then
          scanNum (10*n + digit c) cs
        else Num n scan (c :: cs)

  fun lexer s = scan (explode s)

where s is a string. Or

  val lexer2 = scan o explode

lower case "o" is reserved in ML, which means function composition.

(f o g) x = f(g x)

Recursive descent parser
   Token list -> Exp * Token list

 fun parseA (NUM n::ts) = (Num n, ts)
   | praseA (LPAR :: ts) =
        ( case parseC ts of
               (exp, RPAR :: ts') => (exp, ts')
           | _ => raise Fail
        )
   | parceA _ = raise Fail
 and parseB (NEG :: ts) =
            let
                val (exp, ts') = parseB ts
            in
                (Neg exp, ts')
            end
   | parseB ts = parseA ts

 and parseC ts =
        praseSubs (praseB ts)
 and parseSubs (exp1, MINUS ::ts) =
            let
                val (exp2, ts') = parseB ts
            in
                parseSubs (MINUS (exp1, exp2), ts')
            end
    | parseSubs (exp, ts) = exp, ts)

(* or
    | parseSubs answer = answer
*)

 fun parse S =
      (case parseC (lexer s) of
             (exp, []) => exp
         (* watch remaining token list, e.g., _ *)
         | _ => raise Fail
      )