Introduction

This section describes how scanners and LALR(1) parsers work to aid potential users of a software package I created called Com.Bricologica.LALRLib (LALRLib). This section assumes general software development knowledge. I'll cover scanning, parsing, and specific issues I came across as I sought to solve the specific problems I encountered. I make no guarantee that any solution is proper. I used Crafting a Compiler with C by LeBlanc and Johnson. I was lucky enough to have learned the initial bits from Prof. Johnson himself in a 300-level CS class at the University of Wisconsin in the early 90's while I was pursuing a degree in electrical engineering. First, what are scanners and what are parsers? A source file (such as a file written in C, or Java, or Ruby or any other myriad of languages) is typically a string of text characters. It is very linear—just a string of characters in a file. A scanner looks for groupings of these characters such as words and numbers, or symbol combinations like, say, "::". So a scanner takes a line of individual text characters and turns them into tokens which are groupings of characters and have meaning. The tokens are still a linear stream, but with meaning corresponding to the characters scanned. A parser takes that linear stream of tokens and looks for hierarchical structure and forms trees of tokens and nodes where everything is in the proper place in the hierarchy. So, in a nutshell: Letter/numbers/symbols/etc. -> SCANNER -> tokens -> PARSER -> parse tree Consider the following code snippet which caps x at 100:

if ( x > 100 )
{
    x = 100;
}

This seems to have hierarchical structure because of the carriage returns and spacing. But, the file is just a continuous string of characters. The word 'if', and the number '100' are made of two and three characters respectively. But they have meaning only as grouping. There is also whitespace—spaces and carriage returns. These are characters but are thrown away by the scanner rather than turned into tokens. After running this through a C scanner, the tokens would be:

if keyword
left parenthesis
identifier 'x'
right angle bracket
integer literal '100'
left curtly bracket
identifier 'x'
assignment operator
integer literal '100'
semicolon
right curly bracket

statement
	if-statement
		condition
			expression
				operation
					type
						greater-than
					left-operand
						variable (x)
					right-operand
						integer-literal (100)
		true-statement
			assignment
				assignee
					variable (x)
				expression
					integer-literal (100)