Lecure 4/5 --- Administrative --- Language syntax - Tokens: anything separated by "White space" from its neighboring context in a program (such as keywords, identifiers, operator symbols, literals, .etc) o free format (vs fixed format like in Fortran) o comments > un-nested (due to the use of regular expression for scanning) > nested - Lexer or Scanner (determine whether a token is valid) o use regular expressions 1. group input into tokens 2. eliminate white spaces 3. eliminate comments e.g., nested comments (* (* *) *) like this is allowed in ML balanced parentheses can NOT be expressed as regular expression o example: digit -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 unsigned_integer -> digit digit* - alternation - concatenation - Kleen star o pragmatics > Do not code "keywords" such as "do", "while" and etc. into a DFA. Instead, the scanner looks a token up in a hash table to see if the token is a keywork. > longest possible token: e.g., 3.14 or 3..5 the "dot-dot problem" in Pascal. When one legitimate token is a prefix of another. It requires looking ahead one character to distinguish b/w 3.14 and 3..5 - Parser (determine whether a string of tokens are put together grammatically correct) o use Context-free grammar (why called context-free?) e.g., := | * | - | ( ) Note: o Recursion is allowed (that explains why CFG is more powerful than regular expression); o The notation is extended Backus Naur Form (BNF). > Non-Terminals > Terminals > Meta language symbols, such as "|" o Distinction b/w a meta language and the language it describes This grammar has ambiguity. For example, precedence 1 * 2 - 3 can be parsed as (1 * 2) - 3 or 1 * (2 -3). Left recursion: if there is a non-terminal A and a derivation A =>+ As where s is a string of symbols (both terminals and non-terminals) Precedence: which kind of operators is tigher in binding Associativity: (1-2)-3 does not equal (1-(2-3); even (1+2)+3 does not equal 1+(2+3) when there is rounding for real numbers or overflow. Revised version of the grammar. := | () := | * := | - := Now, 1 * 2 - 3 won't be interpreted as 1*(2-3) because * is treated right associated, namely, 1*2*3 as 1*(2*3). So, := | * is not left recursive anymore. "-" is treated left associated, namely, 1-2-3 as (1-2)-3 := | - This is still left recursive. Modified further: := := episilon | - - Recursive descent parser One fn for each non-terminal. Each fn takes a token list as argument, and returns an abstract syntax tree (AST) and a list of unconsumed tokens. The parser will take the longest match (i.e., consume as many tokens as possible). There is problem with such "greedy" strategy. For example, grammar is a regular expression: exp = (a|ab)bc and text is a string: string = abc if the parser takes "ab", which is the longest match, then the parser will fail at "c". To overcome this problem, backtrackingis needed to complement o An example (recursive descent parser) The grammar := | () := | ~ := := epsilon | - Note: (negation ~ is unary operator) Abstract syntax tree datatype Exp = Num of int | Minus of Exp*Exp | Neg of Exp; Note that "(" and ")" do not appear in AST after parsing is done. 1-~2 Minus(Num 1, (Neg 2)) To build lexer, we need to define another datatype. datatype Token = Num of int | MINUS | NEG | LPAR | RPAR exception Unknow Char fun digit c = Char.ord c - Char.ord #"o" fun scan [] = [] | scan (c::cs) = (case c of #"-" => MINUS :: scan cs | #"~" => NEG :: scan cs | #"~" => NEG :: scan cs ... case for left, right parenthesis omited here. | _ => if Char.isSpace c then scan cs else if Char.isdigit c then scanNum (digit c) cs else raise Unknown Char) (* Note "|" for case and for fun may cause confusions to ML, so use parentheses. We need to define the helper function scanNum. "and" is used to introduce mutually recursive functions. `*) and scanNum n [] = [Num n] (* case for running out of character *) | scanNum n(c :: cs) = if Char.isDigit c then scanNum (10*n + digit c) cs else Num n scan (c :: cs) fun lexer s = scan (explode s) where s is a string. Or val lexer2 = scan o explode lower case "o" is reserved in ML, which means function composition. (f o g) x = f(g x) Recursive descent parser Token list -> Exp * Token list fun parseA (NUM n::ts) = (Num n, ts) | praseA (LPAR :: ts) = ( case parseC ts of (exp, RPAR :: ts') => (exp, ts') | _ => raise Fail ) | parceA _ = raise Fail and parseB (NEG :: ts) = let val (exp, ts') = parseB ts in (Neg exp, ts') end | parseB ts = parseA ts and parseC ts = praseSubs (praseB ts) and parseSubs (exp1, MINUS ::ts) = let val (exp2, ts') = parseB ts in parseSubs (MINUS (exp1, exp2), ts') end | parseSubs (exp, ts) = exp, ts) (* or | parseSubs answer = answer *) fun parse S = (case parseC (lexer s) of (exp, []) => exp (* watch remaining token list, e.g., _ *) | _ => raise Fail )