Monday, 28 January 2008

Evaluating the expression tree

Lets start off with some tests:

``````class EvaluateTests < Test::Unit::TestCase
def test_1
assert_equal(1, calculate("1"))
end

assert_equal(3, calculate("1+2"))
end

def test_subtract
assert_equal(3, calculate("7-4"))
end

def test_multiply
assert_equal(6, calculate("3*2"))
end

def test_divide
assert_equal(3, calculate("6/2"))
end

def test_negative_result
assert_equal(-1, calculate("1-2"))
end

assert_equal(14, calculate("4*3+2"))
assert_equal(14, calculate("2+3*4"))
assert_equal(10, calculate("4+3*2"))
end

def test_left_to_right
assert_equal(8, calculate("12-6+2"))
assert_equal(4, calculate("12/6*2"))
end
end
``````
Not exactly comprehensive, but its a start. I need to write a calculate method to tie together the work I've done over the last few weeks:
``````def calculate(expression)
tokens = tokenize(expression)
root_node = build_tree(tokens)
return root_node.evaluate()
end
``````
Evaluating the tree is obviously a recursive function. Here is the first bit:
``````class BinaryTreeNode
def evaluate()
return @token.evaluate(self)
end
end
``````
I decided the easiest way to do this is have the evaluate method take an argument: the tree node that contains the token. When evaluating an IntegerToken the value is simply returned, no need to use the argument:
``````class IntegerToken < Token
def evaluate(containerNode)
return @value
end
end
``````
For the OperatorToken, we want: evaluate(left child) [operator] evaluate (right child). There are lots of different ways to do this - a case statement is the simplest:
``````class OperatorToken
def evaluate(containerNode)
case @value
when "+"
return containerNode.lhs.evaluate() + containerNode.rhs.evaluate()
when "-"
return containerNode.lhs.evaluate() - containerNode.rhs.evaluate()
when "*"
return containerNode.lhs.evaluate() * containerNode.rhs.evaluate()
when "/"
return containerNode.lhs.evaluate() / containerNode.rhs.evaluate()
end
end
end
end
``````
My my isn't that boring. Another option would be to implement subclasses for OperatorToken - I wouldn't learn much from doing it though. Lets try something different instead:
``````class OperatorToken < Token

def initialize(value)
@value = value
@eval_method = Fixnum.instance_method(value)
case @value
when "+","-"
@major = false
when "*","/"
@major = true
end
end

def evaluate(containerNode)
bound_method = @eval_method.bind(containerNode.lhs.evaluate())
return bound_method.call(containerNode.rhs.evaluate())
end
end
``````
The three relevant lines of code are highlighted. But first, some background. Operators in Ruby are actually just methods. When I execute the line x = 2 + 3 I am calling the '+' method on the object '2' which is an instance of the class Fixnum. So what I am actually doing with that line is this: x = 2.+(3) . Classes in Ruby are just like normal objects, so you can call methods on them as I am doing in the first highlighted line. If the OperatorToken is constructed with "+", then I acquire an instance of the '+' method defined on the class Fixnum and store it as @eval_method. What I actually store is an instance of UnboundMethod - if you are a .NET developer, it might help to think of it as similar to MethodInfo as you can use it to invoke the method, but you will need to specify the instance you wish to invoke the method on. The second highlighted line does this - the left node is evaluated and the result is our bind target. The third line invokes the bound method and passes the result of evaluating the right node. The result is what we are after and is returned. Yes, this is much more complicated than a switch statement, and surely much slower. But I learned something, and hopefully so did you. Lets run my unit tests:
``````>ruby calculator.rb
Started
........................
Finished in 0.022 seconds.

24 tests, 37 assertions, 0 failures, 0 errors
>Exit code: 0
``````
I think its worth mentioning that I have sort of covered up some mistakes I made while implementing the tree building. I didn't initially write the left to right unit tests and so for a while my calculator could do "2+9/3*4" but not "9/3*4" (it returned zero rather than 12). I didn't notice this problem until I ran the stress tester:
``````def stress_test()
expressions = generate_expressions(100)
rejects = []
expressions.each do |exp|
rejects << exp if (eval(exp) != calculate(exp))
end
return rejects
end

def generate_expressions(count)
expressions = []
operators = ["+","-","*","/"]
count.times do
exp = (rand(100) + 1).to_s
rand(10).times do
exp << operators[rand(4)]
exp << (rand(100) + 1).to_s
end
expressions << exp
end
return expressions
end
``````
This baby will generate 100 expressions, each containing up to 10 terms between 1 and 100. It does not generate zeros because I would then have to deal with division by zero problems (I'll let that wait for version two). The eval method is not mine - it is a built in method that uses Ruby's own expression evaluation engine. This is really useful because I can pass an expression string into my program and compare the result with what Ruby came up with. If there is a discrepancy then I add it to the array of rejects and return them when done. The stress test revealed the aforementioned problem with left to right evaluation which I fixed, but it also raised something else. Consider this test:
``````def test_really_big_number
assert_equal(1000000000000000000001, calculate("1000000000000000000000+1"))
end
``````
Unfortunately, it doesn't pass:
``````>ruby calculator.rb
Started
........E................
Finished in 0.02 seconds.

1) Error:
test_really_big_number(EvaluateTests):
TypeError: bind argument must be an instance of Fixnum
calculator.rb:55:in `bind'
calculator.rb:55:in `evaluate'
calculator.rb:12:in `evaluate'
calculator.rb:118:in `calculate'
calculator.rb:209:in `test_really_big_number'

25 tests, 37 assertions, 0 failures, 1 errors
>Exit code: 1
``````
When a Fixnum gets larger than a certain size it changes into a Bignum. Unfortunately, this wreaks havoc with my evaluation code because my unbound method will only bind to a Fixnum, not a Bignum. There are some workarounds, a relatively simple one already springs to mind. However I don't think they are worth entertaining - going back to the case statement or something similar seems to be a better idea. Next post I clean up this issue and the code in general.

Pasting code into Blogger

I've been using a javascript tool to format my example code. It does two useful things:

1. Replaces html-unfriendly characters such as <>
2. Wraps it in some nice formatting tags.
You can choose whether it embeds the style information directly into the output or creates a seperate css entry. I've been using the embedding up until recently, but now I have a problem. I've discovered that Blogger's rich text compose mode annihilates my indenting so its time to switch to using the html editor instead. The embedded styles make for noisy html so I will fiddle with this post and try to get the css working instead.
``````class EvaluateTests < Test::Unit::TestCase
def test_1
assert_equal(1, calculate("1"))
end
end
``````
Edit: Yay, that works fine.

Sunday, 27 January 2008

More syntactic analysis: building an expression tree

Its been a few days since the last post in this series, mostly because this step was trickier than the previous few and partly because once I had the tree building working I couldn't resist adding the evaluation step. This was actually a good thing, because it helped me discover a mistake I made with the tree building and saved me from making a fool of myself by posting some broken crap (writing unit tests should have theoretically prevented this but I suck at them). Anyway, lets get on with it. I need to define the data structure the tree will be built from. I tried enhancing the tokens to act as the tree nodes themselves, but it made the tree manipulations more complicated. Instead I decide to keep it simple:

``````class BinaryTreeNode
attr_accessor :lhs, :rhs, :token

def initialize(token)
@token = token
end

def leaf?
return @lhs == nil && @rhs == nil
end
end
``````
Pretty self explanatory I think. The tree node holds a reference to a token and its two child nodes. The 'leaf?' method isn't really necessary but it helps simplify the building algorithm. I considered trying to download some sort of diagramming tool to explain the process, but I knew it was going to suck. So instead, you get this: Hopefully you can follow this without too much trouble. I build the tree by performing the same single operation over and over. I don't know what to call it ('create_parent' isn't particular explanatory), but this is the code:
``````def create_parent(target_node, operator, integer)
new_node = BinaryTreeNode.new(operator)
new_node.lhs = target_node
new_node.rhs = BinaryTreeNode.new(integer)
return new_node
end
``````
This method takes an existing node and makes it the left child of a new node. It then sets up the operator and the value and returns the new node. In the diagram above, this method is called on the (6) node in step 1 and the result is the tree in step 2. The tree building algorithm only needs to use this operation on two nodes:
• the root node if the operator is a + or - or if the root node is the only node or if the root operator is a * or /.
• the root node's right hand child in other cases (the operator is a * or / and the other conditions don't apply).
Once you understand the create_parent operation, you will find there is not much else to it:
``````def build_tree(tokens)
tokens = tokens.dup
root_node = BinaryTreeNode.new(tokens.shift)

while(tokens.size > 1)
operator = tokens.shift
integer = tokens.shift
if(root_node.leaf? || !operator.major || root_node.token.major)
# If the root_node is the only node or if we have a '+' or '-'
# OR if the root operator is '*' or '/' then append to root
root_node = create_parent(root_node, operator, integer)
else
# We have a * or /, append to root.rhs
root_node.rhs = create_parent(root_node.rhs, operator, integer)
end
end

return root_node
end
``````
The Operator.major method is a badly named way of determining whether the operator is addition or subtraction (minor) or multiplication or division (major). I'm sure there is some appropriately mathy term that I could use instead but it did not come to me. So while there are still tokens left to process, I take two off, and then do a create_parent with them. The logic I described above is used to figure out where the new node is created and thats it. I must admit I was pleased with the simplicity of this approach. If the next operator is addition or subtraction then you want the existing expression to be evaluated first, so simply creating a new root node and attaching the existing tree to the left does the job. If the next operator is multiplication or division, then you want it to take precedence before the root operator, UNLESS the root operator is also multiplication or division (to preserve left to right evaluation). You might be wondering what that first line of code is, 'tokens = tokens.dup'. This was a quick fix for a small issue I overlooked. I am using Array.shift to work my way through the tokens and as a result when the calling code returns, the array it passed in will be empty. The 'dup' method simply returns a clone of the array which I can modify without risk of breaking anything. I tried writing unit tests for the tree building but they end up being quite lame. Its fiddly work to verify via code that the nodes are arranged into the correct structure. I know it goes against the spirit of unit testing but I found it easier to test the tree structure simply by skipping forward to implementing evaluation and using the evaluation tests. Tomorrow I will post the wacky implementation I chose for evaluating the tree. Its probably not the best way to do it but it demonstrates some cool Ruby functionality.

Tuesday, 22 January 2008

Really simple syntactic analysis

(Continues from the previous post) So now I have an array of these token objects that I want to build into a binary tree. I said yesterday that I would not handle syntax errors in this first version but I have since realised that I was only talking about a particular type of syntax error: one that occurs at the lexical level. For example, the string "32+65-chocolate" is not an expression I expect my lexer to handle. Now if I was writing a lexer for a language that allows variable declarations, the aforementioned string might be valid because the previous line was "chocolate=6". That however, is outside the scope of this exercise so for now "chocolate" does not lex. So with that said, its time to discuss the other type of syntax error that springs to mind. Here are some examples:

• 1+
• 2*+3
• -4
Uhoh. How is "-4" a syntax error? Its time to spill the beans: I lied when I said my lexer would support integers. This first version will only support non-negative integers. The important thing is to stay focused on getting something working, no matter how basic. I'll deal with those pesky negative numbers later. Ok, so I wrote units tests for the three examples above:
``````class SyntaxTests < Test::Unit::TestCase
def test_disallow_end_with_operator
tokens = [IntegerToken.new("1"), OperatorToken.new("+")]
assert(!tokens_are_valid(tokens))
end

def test_disallow_two_consecutive_operators
tokens = [IntegerToken.new("2"), OperatorToken.new("*"), OperatorToken.new("-"), IntegerToken.new("3")]
assert(!tokens_are_valid(tokens))
end

def test_disallow_start_with_operator
tokens = [OperatorToken.new("-"), IntegerToken.new("4")]
assert(!tokens_are_valid(tokens))
end
end
``````
Ok now I need to make them pass. No problem!!
``````def tokens_are_valid(tokens)
return false
end
``````
Ok that was some savage cheats. I obviously need to define some tests for valid syntax aswell as invalid syntax:
``````def test_allow_two_numbers
tokens = [IntegerToken.new("2"), OperatorToken.new("+"), IntegerToken.new("6")]
assert(tokens_are_valid(tokens))
end

def test_allow_three_numbers
tokens = [IntegerToken.new("73"), OperatorToken.new("*"), IntegerToken.new("56"), OperatorToken.new("-"), IntegerToken.new("94")]
assert(tokens_are_valid(tokens))
end
``````
Ugh, those token constructors are taking up a lot of space aren't they? I might do something about that later. Also while I'm on the topic of token constructors, I also need to write tests to ensure that they raise exceptions if I do something silly like OperatorToken.new("6"). That will also have to wait. Getting back to the topic at hand, I now have some tests that fail:
``````>ruby calculator.rb
Started
FF..........
Finished in 0.009 seconds.

1) Failure:
test_allow_three_numbers(SyntaxTests) [calculator.rb:81]:
<false> is not true.

2) Failure:
test_allow_two_numbers(SyntaxTests) [calculator.rb:76]:
<false> is not true.

12 tests, 12 assertions, 2 failures, 0 errors
>Exit code: 1
``````
Ok, now its time to flake out a bit. I figure I'm supposed to define some sort of context-free grammar but it looks kinda hard. Lets do something way easier:
``````def tokens_are_valid(tokens)
return false if(tokens.empty?)
return false if(tokens.first.class != IntegerToken)
return false if(tokens.last.class != IntegerToken)

expect_next_is_integer = true
tokens.each do |token|
token_is_integer = (token.class == IntegerToken)
return false if(expect_next_is_integer != token_is_integer)
expect_next_is_integer = !expect_next_is_integer
end

return true
end
``````
If there are no tokens, or we do not start or end with an integer token, then return false. Then just loop through them and make sure the types alternate. I'm assuming that there are only two types of tokens - we will run into problems when it comes time to implement parentheses but that is to be expected. Ok lets give it a try:
``````>ruby calculator.rb
Started
............
Finished in 0.001 seconds.

12 tests, 12 assertions, 0 failures, 0 errors
>Exit code: 0
``````
Good job. The process of building the binary tree will be much easier now that we can ensure that our array of tokens is valid.

Monday, 21 January 2008

Really simple lexical analysis

(Continues from the previous post)
If I ever develop my own programming language, you can bet I won't be writing the lexer by hand. Programs such as Flex will generate a lexer based on the regular expressions you feed it. However the purpose of this exercise is to learn the fundamentals of how compilers work, so today I will write the lexer. It should be easy because the strings I am parsing are so simple. I'm going to write a bunch of unit tests to define the behavior I am aiming for. Here is an example:
``````def test_integer_operator_integer
assert_equal([193,"*",671], tokenize_for_value("193*671") ); end
``````
Basically this is saying that when passed the string "193*671", I expect the tokenize_for_value function to return an array that contains the values 193,"*" and 671. This is pretty simple and could be done in lots of different ways. I chose a mechanism that hopefully somewhat resembles what a proper lexer would do. First I define two types of token:
``````class IntegerToken
def initialize(value)
@value = value.to_i()
end
end

class OperatorToken
def initialize(value)
@value = value
end
end
``````
The only difference between them at this point is that the IntegerToken converts its value to an integer (duh). I expect to extend these classes once I get to the next step but for now they will suffice. Next, I define an array of regular expressions that recognise integers and operators:
``````definitions = []
definitions << [/^\d+/, IntegerToken]
definitions << [/^\+/, OperatorToken]
definitions << [/^\-/, OperatorToken]
definitions << [/^\*/, OperatorToken]
definitions << [/^\//, OperatorToken]
``````
Now we just loop our way through the string, matching the regular expressions against it and slicing out the characters that were matched successfully and build a token for them.
``````  tokens = []
lastSize = 0
# Track the number of remaining characters in the expression.
# If we manage to go through all the regular expressions without
# getting a match then we can parse no more.
while(expression.size != lastSize)
lastSize = expression.size
definitions.each do |regex, token_type|
# Do a match with the regex against the string, remove the
# characters that match and put them into lexmeme
lexmeme = expression.slice!(regex)
if (lexmeme)
# We got a match, create a new token and break out of
# the regex loop to preserve the regex ordering.
tokens << token_type.new(lexmeme)
break
end
end
end

``````
It would be fairly easy to add some code that will report if we had to finish parsing with characters still remaining - this would only happen in the case of a syntax error. I'll add that later once I have a basic version of the calculator working. Finally, I need a method that will take the string, call tokenize and give me an array of their values to make the unit tests easy to write:
``````def tokenize_for_value(expression)
tokens = tokenize(expression)
end
``````
When I run the ruby file (source) the unit tests execute. I wrote seven, so here is the output:
``````>ruby calculator_source.rb
Started
.......
Finished in 0.001 seconds.

7 tests, 7 assertions, 0 failures, 0 errors
>Exit code: 0
``````
I'm pleased with how easy it is to write and run unit tests in ruby. There was very little friction. Next up: building a binary tree out of the tokens.

Sunday, 20 January 2008

Discovering compilers

My first exercise for learning how compilers work is to write a program that takes a string containing a mathematical expression and evaluates the result e.g. "2+3*7" => 23. I will be implementing it using Ruby and will write unit tests as I go. Since my goal is to learn the underlying concepts of compiler construction, I am avoiding any libraries/functionality provided by Ruby that would make it too easy. The first implementation will ignore whitespace and accept strings containing integers and the 4 standard operators (plus, minus, multiply, divide). The second implementation will report syntax errors and accept decimal numbers and parentheses. According to my research on Wikipedia, this exercise can be broken into the following steps:

1. lexical analysis (identifying the tokens that appear in the string)
2. syntactic analysis (assembling the tokens into a tree)
3. semantic analysis (evaluating the tree for a result)
I will implement the lexical analysis step using regular expressions. They are well suited for the task and I need the practice with them. The syntactic analysis is the tricky bit because I have to follow order of operations. The semantic analysis will be trivial as it simply requires a recursive algorithm executed on the root node of the tree. Next up: Implementing lexical analysis.

Sunday, 13 January 2008

Using IRB as an engine for programming puzzle game things

So I've been screwing around with Ruby a fair bit, but haven't really started writing proper programs in it yet. However I did have this one dumb idea that amused me, perhaps you will get a kick out of it. I mentioned in the last post that Ruby has a REPL called IRB. While experimenting with how it works I had this idea of making small puzzles that can be loaded into the IRB and then solved by executing code. Here is sample output for a toy example I made:

``````C:\Users\paul>irb -r puzzle1.rb
You have begun Puzzle 1.
You are in a small room, with a single closed door.
You can see a silver_key and copper_key, both on the floor.
The goal of the puzzle is to get the door open so you can leave the room.
To inspect individual objects simply type their name.
To interact with objects, use 'noun.verb' syntax. For example, try 'door.open'.

Puzzle1 >door
The door is closed.
Avaliable actions:
close
inspect
lock
open
unlock

Puzzle1 >door.open
The door is locked.

Puzzle1 >door.unlock
Action 'unlock' expects a single argument.

Puzzle1 >door.unlock silver_key
You try to put the key in the lock but it does not fit.

Puzzle1 >door.unlock copper_key
You unlocked the door.

Puzzle1 >door.open
The door is now open.
You exit the room via the doorway. Wheee, you completed the puzzle!

C:\Users\paul>
``````
The source for this is quite ugly at the moment. I'm looking forward to digging into some of Ruby's meta-programming features that will help me clean it up. Now if player knows just a tiny bit of Ruby they can 'cheat' and finish this without using the silver_key or copper_key. I'm hoping to take this idea one step further and make puzzles that cannot be completed without this sort of 'cheating'. I'm not interested in developing typical text adventure style puzzles - I was never that much of a fan of the genre. But logic puzzles that require programming skills? That is another thing entirely. Surely lots of other people have had this same idea but I did a few searches and didn't find anything. A shame really because I imagine they could be quite fun. I will have a go at making one and we will see what happens.

Thursday, 10 January 2008

First impressions of Ruby

My first experience with Ruby was the really nice interactive tutorial. This is by far the best way to introduce people to a language that I have seen - it is really slick. My complements to Why for managing to make the first step of learning a new language so painless. However I would be remiss if I failed to mention the damn cursor on the thing - when you are using backspace it always deletes one more character than it looks like. Don't let this put you off though, give it a go! After finishing the interactive tutorial I downloaded the latest installer for windows and began playing around. Exploring Ruby is proving to be lots of fun but I am somewhat spoiled by Visual Studio - the tools and documentation included in the package are simply not as sophisticated and this is hardly surprising. The package comes with a REPL (Read-Eval-Print Loop). If you tried the interactive Ruby tutorial then you already have experience using a REPL - they are quite useful for learning a new programming language, though they are far better suited to interpreted languages than compiled languages. I find the simplicity of a REPL attractive because it makes it convenient to do short experimentation sessions. I can sit down for just 10 minutes and learn a couple of new things. Next post: RRC (Random Ruby Crap) #1

Friday, 4 January 2008

This is Steve Yegge's fault

A couple of weeks ago I discovered Steve Yegge's blog via Jeff Atwood. Steve has written many essays (you should see the size of these things, they are not mere 'posts') and in general they focus on programming languages and improving your development skills & practices. I found Steve's writing fascinating and to put it briefly, his essays are the reason this blog exists. Now, I've been trying to kick this blog off for a couple of days and this first post has been causing me some real difficulty because it keeps devolving into a flowery poem entitled "Dear Stevey, how do I love thee, let me count the ways". Let me try once again to express this succinctly and without unnecessary romantics: taken in concentrated dose, by a certain category of programmer, Steve's essays are a cure for laziness in multiple domains. Please allow me a moment of self-indulgence to provide a little background. I am a software engineering graduate and I have been developing software exclusively using Microsoft tools and environments for the entirety of my short career (5 years, including part time work during university). I suspect I have fallen into a particular stereotype that might not be well named but is at least recognisable, given these attributes:

• Windows is the only operating system I know how to use.
• Since leaving university 2 years ago I have not learned any new programming languages (I am primarily a C# developer).
• I have an unhealthy degree of affection for strong typing, static type checking and intellisense.
• Everything I know about web development is ASP.NET based.
I could go on, but I think you get the point. I am horribly Microsoft focused and lacking in perspective. I'd say the depth of my knowledge is pretty good given how long I've been working in the industry... but my breadth of knowledge? Forget about it! Its probably worth mentioning one thing that I consider to be well-founded: I am nutty about object oriented programming. I am hoping to learn first hand how programming without static typing can be a good thing, but as far as OO is concerned, you will have to pry it out of my cold, dead fingers! As you may have surmised, Steve has convinced me to start blogging. The primary focus of this blog is breaking out of the mold I described above. Reading software blogs and doing my job is not enough. Not if I ever want to work at Google (which I do). You don't become an amazing programmer simply by clocking in 9-5 for X years and reading Joel on Software (although I'm sure it helps). What embarrasses me is that it has taken me so long to figure that out. Next post: some more details about my plan.