

pharoah
Tutorial: Writing a parser
A parser is a program that understands text written in a particular format. In order to write an interpreter or compiler, you'll need to code a working parser. Since I'm currently taking a compilers class, I'm learning a lot about the theory and practice behind designing these things, and I'd like to make some of that accessible to everyone here writing scripting engines and the like. I'll try to add to or update this tutorial every day until it's done. In order to prevent this thread from becoming cluttered and difficult to read, please DON'T POST HERE. Instead, use the other comment thread I've started. Please ask any questions you might have so I can improve my explanation.
2012-03-24 11:58 PM

pharoah
1. Lexical analysis
A parser recognizes patterns in the source text and matches them against the structures it understands. Anything but the simplest parser will have a scanner, or lexical analyzer, feeding it tokens from the text. Tokens are the smallest single units of information in the source; you can think of them loosely as words. For example, the expression "result = (x + y) * z" has the tokens "result", "=", "(", "x", "+", "y", ")", "*", and "z". The scanner is typically a function that returns the next token from the source each time it is called, and it will usually ignore things that the language doesn't consider important, such as spaces and tabs.

A scanner will typically have a set of token types, where every possible token of a given type is assigned the same identifier. In the example above, "result", "x", "y", and "z" aren't recognized keywords in the BASIC language, so they'd all belong to one token class, possibly one named ID or IDENTIFIER. Any reserved keywords in the language will have their own tokens, and so will important characters like parentheses, the equals sign, the colon, mathematical operators, and the like. The scanner must also recognize when it's dealing with a quoted string literal and return the whole thing as one token, or your parser will start trying to interpret the string as code.

To illustrate the process of writing a lexical analyzer, we'll develop a "toy language" that's simple but useful. Our language will be for an integer calculator which supports addition, multiplication, and parentheses, so we'll call it AMP. A typical AMP expression might look like this:

(10 + 20 + 30) * 5 * 9

We want our scanner to recognize integers, parentheses, plus, and asterisk, and to ignore whitespace. In addition, we want a special token for end of file. To do all this, we need just six token types. Here are the names we'll use:

INT  PLUS  STAR  LPAREN  RPAREN  EOF

Since we have a fixed set of token types, we can store the token type as an integer, which has a couple of advantages. The first, clearly, is that the information being stored about each token will take up very little space. The second is that we can use these numbers as array indices, which we'll touch on later. For now, let's define some constants so we don't have to remember the numbers:

CONST token.int = 1
CONST token.plus = 2
CONST token.star = 3
CONST token.lparen = 4
CONST token.rparen = 5
CONST token.eof = 6

You may have noticed at this point that we lose information about the tokens by assigning a single name to a whole set of them. Specifically, we can't tell which number we actually read when we get a token.int. There are a few ways to deal with this, but we'll use the simple and extensible approach of creating a new type:

TYPE tokenType
  class AS INTEGER
  text AS STRING * 10     'Long enough for our purposes
END TYPE

Thus, tokenType.class will hold the kind of token, while tokenType.text will hold the actual string that was recognized by the scanner.

With this framework out of the way, we're ready to start writing the scanner itself. What will this scanner look like? It'll be a couple of subs which take in a binary file number and a tokenType variable and fill that variable with information (the reason we have to do things this way is that QBasic functions cannot return user-defined types). The main sub in the scanner is called nextToken. As its name implies, this procedure simply reads from the given file until it either recognizes a token or hits something invalid.
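Before writing it, it helps to see what we expect it to produce. For the sample expression "(10 + 20 + 30) * 5 * 9" above, successive calls to the scanner should hand back roughly this sequence of (class, text) pairs:

LPAREN  "("
INT     "10"
PLUS    "+"
INT     "20"
PLUS    "+"
INT     "30"
RPAREN  ")"
STAR    "*"
INT     "5"
STAR    "*"
INT     "9"
EOF     (end of input)

Notice that the whitespace between tokens never shows up in the output at all.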
We'll ignore all whitespace for the purposes of this example:

SUB nextToken (filenum%, tok AS tokenType)
  tok.class = 0
  tok.text = ""
  DIM char AS STRING * 1

  DO
    GET filenum%, , char
    SELECT CASE char
      CASE CHR$(0)
        tok.class = token.eof
        EXIT SUB
      CASE "("
        tok.class = token.lparen
        tok.text = char
        EXIT DO
      CASE ")"
        tok.class = token.rparen
        tok.text = char
        EXIT DO
      CASE "+"
        tok.class = token.plus
        tok.text = char
        EXIT DO
      CASE "*"
        tok.class = token.star
        tok.text = char
        EXIT DO
      CASE CHR$(13), CHR$(10), CHR$(9), " "      'Whitespace
        'Do nothing
      CASE ELSE
        'Check for a possible leading minus
        digit$ = ""
        IF char = "-" THEN
          digit$ = "-"
          GET filenum%, , char
        END IF

        'We use INSTR here to check if our character is a digit
        DO WHILE INSTR("0123456789", char) > 0
          'Read any digits
          tok.class = token.int
          digit$ = digit$ + char
          IF EOF(filenum%) THEN EXIT DO
          GET filenum%, , char
        LOOP

        'Move the file pointer back one to undo reading the next character
        IF NOT EOF(filenum%) THEN SEEK filenum%, SEEK(filenum%) - 1

        'Check for an error condition (no digits)
        IF digit$ = "" OR digit$ = "-" THEN
          PRINT "Unexpected character '"; char; "'"
          ERROR 200                              'We'll use this for lexical errors
        END IF

        tok.text = digit$
        EXIT DO
    END SELECT
  LOOP
END SUB

If you run this code on a sample input file, you'll find that it does what you'd expect, returning a new token on each call until it hits the end-of-file condition. Notice that this is well under 100 lines of code, and the logic here is very simple. For more complicated applications, it's common to build a lexer that recognizes tokens based on regular expressions. The code for such a program is more easily generated by a computer than by hand, and the theory behind doing it is more advanced than what I want to get into here. For most applications, lexical analysis is simple enough that you can code up your own nextToken sub by hand.
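The tutorial doesn't include a driver program, so here's a minimal sketch of one, assuming the CONST and TYPE definitions from earlier sit at the top of the module alongside the sub above, and that the AMP source lives in a file I'm calling AMP.TXT (the filename and the output format are placeholders, not part of the tutorial):

'The CONST token.* values and the TYPE tokenType block from earlier
'go here, at module level, above this code.

DECLARE SUB nextToken (filenum%, tok AS tokenType)

DIM tok AS tokenType

OPEN "AMP.TXT" FOR BINARY AS #1        'Assumed sample file, e.g. (10 + 20 + 30) * 5 * 9

DO
  nextToken 1, tok                     'The sub from the listing above
  IF tok.class = token.eof THEN EXIT DO
  PRINT tok.class; RTRIM$(tok.text)    'Class number, then the matched text
LOOP UNTIL EOF(1)                      'Also stop if the file runs out first

CLOSE #1

Each line of output is just the numeric token class followed by the text the scanner matched; a real parser would consume the tokens instead of printing them.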
2012-03-24 11:59 PM


