Monday, 28 January 2008

Evaluating the expression tree

Lets start off with some tests:

class EvaluateTests < Test::Unit::TestCase
  def test_1
    assert_equal(1, calculate("1"))
  end
  
  def test_add
    assert_equal(3, calculate("1+2"))
  end
  
  def test_subtract
    assert_equal(3, calculate("7-4"))
  end
  
  def test_multiply
    assert_equal(6, calculate("3*2"))
  end
  
  def test_divide
    assert_equal(3, calculate("6/2"))
  end
  
  def test_negative_result
    assert_equal(-1, calculate("1-2"))
  end
  
  def test_multitive_before_additive
    assert_equal(14, calculate("4*3+2"))
    assert_equal(14, calculate("2+3*4"))
    assert_equal(10, calculate("4+3*2"))
  end
  
  def test_left_to_right
    assert_equal(8, calculate("12-6+2"))
    assert_equal(4, calculate("12/6*2"))
  end
end
Not exactly comprehensive, but its a start. I need to write a calculate method to tie together the work I've done over the last few weeks:
def calculate(expression)
  tokens = tokenize(expression)
  root_node = build_tree(tokens)
  return root_node.evaluate()
end
Evaluating the tree is obviously a recursive function. Here is the first bit:
class BinaryTreeNode
  def evaluate()
    return @token.evaluate(self)
  end
end
I decided the easiest way to do this is have the evaluate method take an argument: the tree node that contains the token. When evaluating an IntegerToken the value is simply returned, no need to use the argument:
class IntegerToken < Token
  def evaluate(containerNode)
    return @value
  end
end
For the OperatorToken, we want: evaluate(left child) [operator] evaluate (right child). There are lots of different ways to do this - a case statement is the simplest:
class OperatorToken
  def evaluate(containerNode)
    case @value
      when "+"
        return containerNode.lhs.evaluate() + containerNode.rhs.evaluate()
      when "-"
        return containerNode.lhs.evaluate() - containerNode.rhs.evaluate()
      when "*"
        return containerNode.lhs.evaluate() * containerNode.rhs.evaluate()
      when "/"
        return containerNode.lhs.evaluate() / containerNode.rhs.evaluate()
      end
    end
   end
end
My my isn't that boring. Another option would be to implement subclasses for OperatorToken - I wouldn't learn much from doing it though. Lets try something different instead:
class OperatorToken < Token
  attr_reader :major
  
  def initialize(value)
    @value = value
    @eval_method = Fixnum.instance_method(value)
    case @value
      when "+","-"   
        @major = false
      when "*","/"
        @major = true
      end
  end
  
  def evaluate(containerNode)
    bound_method = @eval_method.bind(containerNode.lhs.evaluate())    
    return bound_method.call(containerNode.rhs.evaluate())
  end
end
The three relevant lines of code are highlighted. But first, some background. Operators in Ruby are actually just methods. When I execute the line x = 2 + 3 I am calling the '+' method on the object '2' which is an instance of the class Fixnum. So what I am actually doing with that line is this: x = 2.+(3) . Classes in Ruby are just like normal objects, so you can call methods on them as I am doing in the first highlighted line. If the OperatorToken is constructed with "+", then I acquire an instance of the '+' method defined on the class Fixnum and store it as @eval_method. What I actually store is an instance of UnboundMethod - if you are a .NET developer, it might help to think of it as similar to MethodInfo as you can use it to invoke the method, but you will need to specify the instance you wish to invoke the method on. The second highlighted line does this - the left node is evaluated and the result is our bind target. The third line invokes the bound method and passes the result of evaluating the right node. The result is what we are after and is returned. Yes, this is much more complicated than a switch statement, and surely much slower. But I learned something, and hopefully so did you. Lets run my unit tests:
>ruby calculator.rb
Loaded suite calculator
Started
........................
Finished in 0.022 seconds.

24 tests, 37 assertions, 0 failures, 0 errors
>Exit code: 0
I think its worth mentioning that I have sort of covered up some mistakes I made while implementing the tree building. I didn't initially write the left to right unit tests and so for a while my calculator could do "2+9/3*4" but not "9/3*4" (it returned zero rather than 12). I didn't notice this problem until I ran the stress tester:
def stress_test()
  expressions = generate_expressions(100)
  rejects = []
  expressions.each do |exp|
      rejects << exp if (eval(exp) != calculate(exp))    
  end
  return rejects
end
 
def generate_expressions(count)
    expressions = []
    operators = ["+","-","*","/"]    
    count.times do
      exp = (rand(100) + 1).to_s
      rand(10).times do
        exp << operators[rand(4)]
        exp << (rand(100) + 1).to_s
      end        
      expressions << exp
    end
    return expressions
end
This baby will generate 100 expressions, each containing up to 10 terms between 1 and 100. It does not generate zeros because I would then have to deal with division by zero problems (I'll let that wait for version two). The eval method is not mine - it is a built in method that uses Ruby's own expression evaluation engine. This is really useful because I can pass an expression string into my program and compare the result with what Ruby came up with. If there is a discrepancy then I add it to the array of rejects and return them when done. The stress test revealed the aforementioned problem with left to right evaluation which I fixed, but it also raised something else. Consider this test:
def test_really_big_number
  assert_equal(1000000000000000000001, calculate("1000000000000000000000+1"))
end
Unfortunately, it doesn't pass:
>ruby calculator.rb
Loaded suite calculator
Started
........E................
Finished in 0.02 seconds.

1) Error:
test_really_big_number(EvaluateTests):
TypeError: bind argument must be an instance of Fixnum
calculator.rb:55:in `bind'
calculator.rb:55:in `evaluate'
calculator.rb:12:in `evaluate'
calculator.rb:118:in `calculate'
calculator.rb:209:in `test_really_big_number'

25 tests, 37 assertions, 0 failures, 1 errors
>Exit code: 1
When a Fixnum gets larger than a certain size it changes into a Bignum. Unfortunately, this wreaks havoc with my evaluation code because my unbound method will only bind to a Fixnum, not a Bignum. There are some workarounds, a relatively simple one already springs to mind. However I don't think they are worth entertaining - going back to the case statement or something similar seems to be a better idea. Next post I clean up this issue and the code in general.

Pasting code into Blogger

I've been using a javascript tool to format my example code. It does two useful things:

  1. Replaces html-unfriendly characters such as <>
  2. Wraps it in some nice formatting tags.
You can choose whether it embeds the style information directly into the output or creates a seperate css entry. I've been using the embedding up until recently, but now I have a problem. I've discovered that Blogger's rich text compose mode annihilates my indenting so its time to switch to using the html editor instead. The embedded styles make for noisy html so I will fiddle with this post and try to get the css working instead.
class EvaluateTests < Test::Unit::TestCase
  def test_1
    assert_equal(1, calculate("1"))
  end
end
Edit: Yay, that works fine.

Sunday, 27 January 2008

More syntactic analysis: building an expression tree

Its been a few days since the last post in this series, mostly because this step was trickier than the previous few and partly because once I had the tree building working I couldn't resist adding the evaluation step. This was actually a good thing, because it helped me discover a mistake I made with the tree building and saved me from making a fool of myself by posting some broken crap (writing unit tests should have theoretically prevented this but I suck at them). Anyway, lets get on with it. I need to define the data structure the tree will be built from. I tried enhancing the tokens to act as the tree nodes themselves, but it made the tree manipulations more complicated. Instead I decide to keep it simple:

class BinaryTreeNode
attr_accessor :lhs, :rhs, :token

def initialize(token)
@token = token
end

def leaf?
return @lhs == nil && @rhs == nil
end
end
Pretty self explanatory I think. The tree node holds a reference to a token and its two child nodes. The 'leaf?' method isn't really necessary but it helps simplify the building algorithm. I considered trying to download some sort of diagramming tool to explain the process, but I knew it was going to suck. So instead, you get this: Hopefully you can follow this without too much trouble. I build the tree by performing the same single operation over and over. I don't know what to call it ('create_parent' isn't particular explanatory), but this is the code:
def create_parent(target_node, operator, integer)
new_node = BinaryTreeNode.new(operator)
new_node.lhs = target_node
new_node.rhs = BinaryTreeNode.new(integer)
return new_node
end
This method takes an existing node and makes it the left child of a new node. It then sets up the operator and the value and returns the new node. In the diagram above, this method is called on the (6) node in step 1 and the result is the tree in step 2. The tree building algorithm only needs to use this operation on two nodes:
  • the root node if the operator is a + or - or if the root node is the only node or if the root operator is a * or /.
  • the root node's right hand child in other cases (the operator is a * or / and the other conditions don't apply).
Once you understand the create_parent operation, you will find there is not much else to it:
def build_tree(tokens)
tokens = tokens.dup
root_node = BinaryTreeNode.new(tokens.shift) 

while(tokens.size > 1)
  operator = tokens.shift
  integer = tokens.shift 
  if(root_node.leaf? || !operator.major || root_node.token.major)
    # If the root_node is the only node or if we have a '+' or '-'
    # OR if the root operator is '*' or '/' then append to root
    root_node = create_parent(root_node, operator, integer)   
  else
    # We have a * or /, append to root.rhs   
    root_node.rhs = create_parent(root_node.rhs, operator, integer)            
  end
end

 return root_node
end
The Operator.major method is a badly named way of determining whether the operator is addition or subtraction (minor) or multiplication or division (major). I'm sure there is some appropriately mathy term that I could use instead but it did not come to me. So while there are still tokens left to process, I take two off, and then do a create_parent with them. The logic I described above is used to figure out where the new node is created and thats it. I must admit I was pleased with the simplicity of this approach. If the next operator is addition or subtraction then you want the existing expression to be evaluated first, so simply creating a new root node and attaching the existing tree to the left does the job. If the next operator is multiplication or division, then you want it to take precedence before the root operator, UNLESS the root operator is also multiplication or division (to preserve left to right evaluation). You might be wondering what that first line of code is, 'tokens = tokens.dup'. This was a quick fix for a small issue I overlooked. I am using Array.shift to work my way through the tokens and as a result when the calling code returns, the array it passed in will be empty. The 'dup' method simply returns a clone of the array which I can modify without risk of breaking anything. I tried writing unit tests for the tree building but they end up being quite lame. Its fiddly work to verify via code that the nodes are arranged into the correct structure. I know it goes against the spirit of unit testing but I found it easier to test the tree structure simply by skipping forward to implementing evaluation and using the evaluation tests. Tomorrow I will post the wacky implementation I chose for evaluating the tree. Its probably not the best way to do it but it demonstrates some cool Ruby functionality.

Thursday, 24 January 2008

Android Code Day: London

In November last year, Google announced an open source operating system for mobile phones called Android. This is a big deal for me because I have wanted a "gPhone" for a while now and a phone that runs Android will be the closest I can get any time soon. I expect the mobile applications market will grow significantly over the next few years and so this is one of the fields I want to start spending time on. I was expecting to start posting about Android in a week or two, after I finished the series of articles on the parsing/calculator exercise. The reason why I'm posting about Android today is because last night I managed to register for the Android Code Day running next week at Google London. I found out about it via one of Google's official blogs and I'm quite glad I was subscribed - I checked a few minutes ago and registration has already closed. These things are popular! I'm still working on the parsing exercise. I can be kinda slow when it comes to this sort of stuff (hey that's part of why I'm doing it!) and I'm struggling to come up with the algorithm for building the binary tree while ensuring I follow order of operations. I could probably Google/Wiki it but its become a personal challenge now - I know I can do this bit without help, its just about getting my brain to shift into the right gear. "You are SUCH a geek – you know that don’t you?" - My boss's reply when I asked for the day off so I could go to the Android Code Day :)

Tuesday, 22 January 2008

Really simple syntactic analysis

(Continues from the previous post) So now I have an array of these token objects that I want to build into a binary tree. I said yesterday that I would not handle syntax errors in this first version but I have since realised that I was only talking about a particular type of syntax error: one that occurs at the lexical level. For example, the string "32+65-chocolate" is not an expression I expect my lexer to handle. Now if I was writing a lexer for a language that allows variable declarations, the aforementioned string might be valid because the previous line was "chocolate=6". That however, is outside the scope of this exercise so for now "chocolate" does not lex. So with that said, its time to discuss the other type of syntax error that springs to mind. Here are some examples:

  • 1+
  • 2*+3
  • -4
Uhoh. How is "-4" a syntax error? Its time to spill the beans: I lied when I said my lexer would support integers. This first version will only support non-negative integers. The important thing is to stay focused on getting something working, no matter how basic. I'll deal with those pesky negative numbers later. Ok, so I wrote units tests for the three examples above:
class SyntaxTests < Test::Unit::TestCase
def test_disallow_end_with_operator
tokens = [IntegerToken.new("1"), OperatorToken.new("+")]
assert(!tokens_are_valid(tokens))
end

def test_disallow_two_consecutive_operators
tokens = [IntegerToken.new("2"), OperatorToken.new("*"), OperatorToken.new("-"), IntegerToken.new("3")]
assert(!tokens_are_valid(tokens))
end

def test_disallow_start_with_operator
tokens = [OperatorToken.new("-"), IntegerToken.new("4")]
assert(!tokens_are_valid(tokens))
end
end
Ok now I need to make them pass. No problem!!
def tokens_are_valid(tokens)
return false
end
Ok that was some savage cheats. I obviously need to define some tests for valid syntax aswell as invalid syntax:
def test_allow_two_numbers
tokens = [IntegerToken.new("2"), OperatorToken.new("+"), IntegerToken.new("6")]
assert(tokens_are_valid(tokens))
end

def test_allow_three_numbers
tokens = [IntegerToken.new("73"), OperatorToken.new("*"), IntegerToken.new("56"), OperatorToken.new("-"), IntegerToken.new("94")]
assert(tokens_are_valid(tokens))
end
Ugh, those token constructors are taking up a lot of space aren't they? I might do something about that later. Also while I'm on the topic of token constructors, I also need to write tests to ensure that they raise exceptions if I do something silly like OperatorToken.new("6"). That will also have to wait. Getting back to the topic at hand, I now have some tests that fail:
>ruby calculator.rb
Loaded suite calculator
Started
FF..........
Finished in 0.009 seconds.

1) Failure:
test_allow_three_numbers(SyntaxTests) [calculator.rb:81]:
<false> is not true.

2) Failure:
test_allow_two_numbers(SyntaxTests) [calculator.rb:76]:
<false> is not true.

12 tests, 12 assertions, 2 failures, 0 errors
>Exit code: 1
Ok, now its time to flake out a bit. I figure I'm supposed to define some sort of context-free grammar but it looks kinda hard. Lets do something way easier:
def tokens_are_valid(tokens)
return false if(tokens.empty?)
return false if(tokens.first.class != IntegerToken)
return false if(tokens.last.class != IntegerToken)

expect_next_is_integer = true
tokens.each do |token|
 token_is_integer = (token.class == IntegerToken)
 return false if(expect_next_is_integer != token_is_integer)    
 expect_next_is_integer = !expect_next_is_integer
end

return true
end
If there are no tokens, or we do not start or end with an integer token, then return false. Then just loop through them and make sure the types alternate. I'm assuming that there are only two types of tokens - we will run into problems when it comes time to implement parentheses but that is to be expected. Ok lets give it a try:
>ruby calculator.rb
Loaded suite calculator
Started
............
Finished in 0.001 seconds.

12 tests, 12 assertions, 0 failures, 0 errors
>Exit code: 0
Good job. The process of building the binary tree will be much easier now that we can ensure that our array of tokens is valid.

Monday, 21 January 2008

Really simple lexical analysis

(Continues from the previous post)
If I ever develop my own programming language, you can bet I won't be writing the lexer by hand. Programs such as Flex will generate a lexer based on the regular expressions you feed it. However the purpose of this exercise is to learn the fundamentals of how compilers work, so today I will write the lexer. It should be easy because the strings I am parsing are so simple. I'm going to write a bunch of unit tests to define the behavior I am aiming for. Here is an example:
def test_integer_operator_integer
assert_equal([193,"*",671], tokenize_for_value("193*671") ); end
Basically this is saying that when passed the string "193*671", I expect the tokenize_for_value function to return an array that contains the values 193,"*" and 671. This is pretty simple and could be done in lots of different ways. I chose a mechanism that hopefully somewhat resembles what a proper lexer would do. First I define two types of token:
class IntegerToken
attr_reader :value
def initialize(value)
@value = value.to_i()
end
end

class OperatorToken
attr_reader :value
def initialize(value)
@value = value
end
end
The only difference between them at this point is that the IntegerToken converts its value to an integer (duh). I expect to extend these classes once I get to the next step but for now they will suffice. Next, I define an array of regular expressions that recognise integers and operators:
definitions = []
definitions << [/^\d+/, IntegerToken]
definitions << [/^\+/, OperatorToken]
definitions << [/^\-/, OperatorToken]
definitions << [/^\*/, OperatorToken]
definitions << [/^\//, OperatorToken]
Now we just loop our way through the string, matching the regular expressions against it and slicing out the characters that were matched successfully and build a token for them.
  tokens = []
lastSize = 0
# Track the number of remaining characters in the expression.
# If we manage to go through all the regular expressions without
# getting a match then we can parse no more.
while(expression.size != lastSize)
lastSize = expression.size
definitions.each do |regex, token_type|
# Do a match with the regex against the string, remove the
# characters that match and put them into lexmeme
lexmeme = expression.slice!(regex)
if (lexmeme)
  # We got a match, create a new token and break out of
  # the regex loop to preserve the regex ordering.
  tokens << token_type.new(lexmeme)
  break
end
end
end

return tokens
It would be fairly easy to add some code that will report if we had to finish parsing with characters still remaining - this would only happen in the case of a syntax error. I'll add that later once I have a basic version of the calculator working. Finally, I need a method that will take the string, call tokenize and give me an array of their values to make the unit tests easy to write:
def tokenize_for_value(expression)
tokens = tokenize(expression)
return tokens.map { |t| t.value }
end
When I run the ruby file (source) the unit tests execute. I wrote seven, so here is the output:
>ruby calculator_source.rb
Loaded suite calculator_source
Started
.......
Finished in 0.001 seconds.

7 tests, 7 assertions, 0 failures, 0 errors
>Exit code: 0
I'm pleased with how easy it is to write and run unit tests in ruby. There was very little friction. Next up: building a binary tree out of the tokens.

Sunday, 20 January 2008

Discovering compilers

My first exercise for learning how compilers work is to write a program that takes a string containing a mathematical expression and evaluates the result e.g. "2+3*7" => 23. I will be implementing it using Ruby and will write unit tests as I go. Since my goal is to learn the underlying concepts of compiler construction, I am avoiding any libraries/functionality provided by Ruby that would make it too easy. The first implementation will ignore whitespace and accept strings containing integers and the 4 standard operators (plus, minus, multiply, divide). The second implementation will report syntax errors and accept decimal numbers and parentheses. According to my research on Wikipedia, this exercise can be broken into the following steps:

  1. lexical analysis (identifying the tokens that appear in the string)
  2. syntactic analysis (assembling the tokens into a tree)
  3. semantic analysis (evaluating the tree for a result)
I will implement the lexical analysis step using regular expressions. They are well suited for the task and I need the practice with them. The syntactic analysis is the tricky bit because I have to follow order of operations. The semantic analysis will be trivial as it simply requires a recursive algorithm executed on the root node of the tree. Next up: Implementing lexical analysis.