Monday, 21 January 2008

Really simple lexical analysis

(Continues from the previous post)
If I ever develop my own programming language, you can bet I won't be writing the lexer by hand. Programs such as Flex will generate a lexer based on the regular expressions you feed them. However, the purpose of this exercise is to learn the fundamentals of how compilers work, so today I will write the lexer myself. It should be easy because the strings I am parsing are so simple. I'm going to write a bunch of unit tests to define the behavior I am aiming for. Here is an example:
def test_integer_operator_integer
  assert_equal([193, "*", 671], tokenize_for_value("193*671"))
end
Basically this is saying that when passed the string "193*671", I expect the tokenize_for_value function to return an array containing the values 193, "*" and 671. This is pretty simple and could be done in lots of different ways. I chose a mechanism that hopefully somewhat resembles what a proper lexer would do. First, I define two types of token:
class IntegerToken
  attr_reader :value  # let callers read the value back out

  def initialize(value)
    @value = value.to_i
  end
end

class OperatorToken
  attr_reader :value

  def initialize(value)
    @value = value
  end
end
The only difference between them at this point is that the IntegerToken converts its value to an integer (duh). I expect to extend these classes once I get to the next step but for now they will suffice. Next, I define an array of regular expressions that recognise integers and operators:
definitions = []
definitions << [/^\d+/, IntegerToken]
definitions << [/^\+/, OperatorToken]
definitions << [/^\-/, OperatorToken]
definitions << [/^\*/, OperatorToken]
definitions << [/^\//, OperatorToken]
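Each of these patterns is anchored with ^, so it can only ever match at the front of whatever input remains. A quick sanity check of my own (not from the post), using Ruby's String#[] with a regex:

```ruby
# String#[] returns the matched text, or nil when the regex
# does not match. The ^ anchor keeps matches at the front.
p "193*671"[/^\d+/]  # => "193" (the leading integer matches)
p "193*671"[/^\*/]   # => nil (the * is not at the front yet)
```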
Now we just loop our way through the string, matching each regular expression against it in turn, slicing out the characters that match and building a token from them.
tokens = []
last_size = 0
# Track the number of remaining characters in the expression.
# If we manage to go through all the regular expressions without
# getting a match then we can parse no more.
while expression.size != last_size
  last_size = expression.size
  definitions.each do |regex, token_type|
    # Match the regex against the string, remove the
    # characters that match and put them into lexeme
    lexeme = expression.slice!(regex)
    if lexeme
      # We got a match; create a new token and break out of
      # the regex loop to preserve the regex ordering.
      tokens << token_type.new(lexeme)
      break
    end
  end
end
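The workhorse here is String#slice!, which both returns the matched text and deletes it from the receiver. A quick demonstration of that destructive behaviour (my own example, not from the post):

```ruby
expression = "193*671"
lexeme = expression.slice!(/^\d+/)  # removes and returns the match
p lexeme      # => "193"
p expression  # => "*671" (the matched characters are gone)
lexeme = expression.slice!(/^\d+/)  # no digits at the front now
p lexeme      # => nil (and expression is left untouched)
```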

It would be fairly easy to add some code that reports if we had to stop with characters still remaining - that would only happen in the case of a syntax error. I'll add it later, once I have a basic version of the calculator working. Finally, I need a method that takes the string, calls tokenize and gives me back an array of the token values, to make the unit tests easy to write:
def tokenize_for_value(expression)
  tokenize(expression).map { |token| token.value }
end
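To convince myself the pieces fit together, here is a self-contained sketch stitching the snippets above into one runnable file (with value readers on the token classes so the values can be read back out; the real source file may differ in detail):

```ruby
class IntegerToken
  attr_reader :value
  def initialize(value)
    @value = value.to_i
  end
end

class OperatorToken
  attr_reader :value
  def initialize(value)
    @value = value
  end
end

# The ordered list of patterns, each anchored at the front of the input.
DEFINITIONS = [
  [/^\d+/, IntegerToken],
  [/^\+/,  OperatorToken],
  [/^\-/,  OperatorToken],
  [/^\*/,  OperatorToken],
  [/^\//,  OperatorToken]
]

def tokenize(expression)
  tokens = []
  last_size = 0
  # Keep going until a full pass shrinks the expression no further.
  while expression.size != last_size
    last_size = expression.size
    DEFINITIONS.each do |regex, token_type|
      lexeme = expression.slice!(regex)
      if lexeme
        tokens << token_type.new(lexeme)
        break
      end
    end
  end
  tokens
end

def tokenize_for_value(expression)
  tokenize(expression).map { |token| token.value }
end

p tokenize_for_value("193*671")  # => [193, "*", 671]
```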
When I run the Ruby file (source), the unit tests execute. I wrote seven, so here is the output:
>ruby calculator_source.rb