This section describes how scanners and LALR(1) parsers work to aid potential users of a software package I created called Com.Bricologica.LALRLib (LALRLib). This section assumes general software development knowledge. I'll cover scanning, parsing, and specific issues I came across as I sought to solve the specific problems I encountered. I make no guarantee that any solution is proper. I used Crafting a Compiler with C by LeBlanc and Johnson. I was lucky enough to have learned the initial bits from Prof. Johnson himself in a 300-level CS class at the University of Wisconsin in the early 90's while I was pursuing a degree in electrical engineering. First, what are scanners and what are parsers? A source file (such as a file written in C, or Java, or Ruby or any other myriad of languages) is typically a string of text characters. It is very linear—just a string of characters in a file. A scanner looks for groupings of these characters such as words and numbers, or symbol combinations like, say, "::". So a scanner takes a line of individual text characters and turns them into tokens which are groupings of characters and have meaning. The tokens are still a linear stream, but with meaning corresponding to the characters scanned. A parser takes that linear stream of tokens and looks for hierarchical structure and forms trees of tokens and nodes where everything is in the proper place in the hierarchy. So, in a nutshell: Letter/numbers/symbols/etc. -> SCANNER -> tokens -> PARSER -> parse tree Consider the following code snippet which caps x at 100:
if ( x > 100 )
{
    x = 100;
}
This seems to have hierarchical structure because of the carriage returns and spacing. But, the file is just a continuous string of characters. The word 'if', and the number '100' are made of two and three characters respectively. But they have meaning only as grouping. There is also whitespace—spaces and carriage returns. These are characters but are thrown away by the scanner rather than turned into tokens. After running this through a C scanner, the tokens would be:
  • if keyword
  • left parenthesis
  • identifier 'x'
  • right angle bracket
  • integer literal '100'
  • left curtly bracket
  • identifier 'x'
  • assignment operator
  • integer literal '100'
  • semicolon
  • right curly bracket
    • This is a token list. The parser will turn it into a tree. A tree might look something like this tree where each part owns everything indented under it:
      statement
      	if-statement
      		condition
      			expression
      				operation
      					type
      						greater-than
      					left-operand
      						variable (x)
      					right-operand
      						integer-literal (100)
      		true-statement
      			assignment
      				assignee
      					variable (x)
      				expression
      					integer-literal (100)
      The structure is contained within the tree which now contains new nodes not present in the token stream but inferred. The 'if' keyword and '=' have disappeared, absorbed into nodes. So have the parentheses, operator, semicolon, and curly brackets which were consumed while determining the hierarchical tree structure. The 'x' and '100' are still there, however. So the scanner and parser can take a text file with code and create a tree that captures the essence of what the source file contained. This document explains all the steps in that process.