Tuesday 22 January 2008

Really simple syntactic analysis

(Continues from the previous post) So now I have an array of these token objects that I want to build into a binary tree. I said yesterday that I would not handle syntax errors in this first version but I have since realised that I was only talking about a particular type of syntax error: one that occurs at the lexical level. For example, the string "32+65-chocolate" is not an expression I expect my lexer to handle. Now if I was writing a lexer for a language that allows variable declarations, the aforementioned string might be valid because the previous line was "chocolate=6". That however, is outside the scope of this exercise so for now "chocolate" does not lex. So with that said, its time to discuss the other type of syntax error that springs to mind. Here are some examples:

  • 1+
  • 2*+3
  • -4
Uhoh. How is "-4" a syntax error? Its time to spill the beans: I lied when I said my lexer would support integers. This first version will only support non-negative integers. The important thing is to stay focused on getting something working, no matter how basic. I'll deal with those pesky negative numbers later. Ok, so I wrote units tests for the three examples above:
class SyntaxTests < Test::Unit::TestCase
def test_disallow_end_with_operator
tokens = [IntegerToken.new("1"), OperatorToken.new("+")]

def test_disallow_two_consecutive_operators
tokens = [IntegerToken.new("2"), OperatorToken.new("*"), OperatorToken.new("-"), IntegerToken.new("3")]

def test_disallow_start_with_operator
tokens = [OperatorToken.new("-"), IntegerToken.new("4")]
Ok now I need to make them pass. No problem!!
def tokens_are_valid(tokens)
return false
Ok that was some savage cheats. I obviously need to define some tests for valid syntax aswell as invalid syntax:
def test_allow_two_numbers
tokens = [IntegerToken.new("2"), OperatorToken.new("+"), IntegerToken.new("6")]

def test_allow_three_numbers
tokens = [IntegerToken.new("73"), OperatorToken.new("*"), IntegerToken.new("56"), OperatorToken.new("-"), IntegerToken.new("94")]
Ugh, those token constructors are taking up a lot of space aren't they? I might do something about that later. Also while I'm on the topic of token constructors, I also need to write tests to ensure that they raise exceptions if I do something silly like OperatorToken.new("6"). That will also have to wait. Getting back to the topic at hand, I now have some tests that fail:
>ruby calculator.rb
Loaded suite calculator
Finished in 0.009 seconds.

1) Failure:
test_allow_three_numbers(SyntaxTests) [calculator.rb:81]:
<false> is not true.

2) Failure:
test_allow_two_numbers(SyntaxTests) [calculator.rb:76]:
<false> is not true.

12 tests, 12 assertions, 2 failures, 0 errors
>Exit code: 1
Ok, now its time to flake out a bit. I figure I'm supposed to define some sort of context-free grammar but it looks kinda hard. Lets do something way easier:
def tokens_are_valid(tokens)
return false if(tokens.empty?)
return false if(tokens.first.class != IntegerToken)
return false if(tokens.last.class != IntegerToken)

expect_next_is_integer = true
tokens.each do |token|
 token_is_integer = (token.class == IntegerToken)
 return false if(expect_next_is_integer != token_is_integer)    
 expect_next_is_integer = !expect_next_is_integer

return true
If there are no tokens, or we do not start or end with an integer token, then return false. Then just loop through them and make sure the types alternate. I'm assuming that there are only two types of tokens - we will run into problems when it comes time to implement parentheses but that is to be expected. Ok lets give it a try:
>ruby calculator.rb
Loaded suite calculator
Finished in 0.001 seconds.

12 tests, 12 assertions, 0 failures, 0 errors
>Exit code: 0
Good job. The process of building the binary tree will be much easier now that we can ensure that our array of tokens is valid.


  1. Hey Paul, glad to see you're blogging. I've been curious about Ruby for a while, but I haven't acted on that curiosity so for the moment the code is some crazy Moon-language to me.

    About yer unit tests, although now it's good to verify your input before trying to form it into a binary tree, it seems that adding parentheses will make it pretty attractive to validate/test while forming it into a binary tree.

  2. I've been tossing up this question about whether simply implementing parentheses now will make things easier in the long run. Its all about preserving order of operations - if I get parens working then to get order of operations all I gotta do is find the multiply and divide operators and stick parens in if there aren't any.

    You know, I'm not sure if I've followed your point. Can you elaborate further?

  3. Parentheses influence the branching, making it so that you can't just ensure alternating operators and integers. With negatives, you only need to modify that rule slightly to ensure that if you find a list of operators, the tail must be all minus.

    With parentheses:

    1. a*(b-c) -> Good
    2. (a*b)*c -> Good
    3. a*b(-c) -> Bad
    4. a*b(*c) -> Bad

    The second two are bad because the parentheses change the branching of the expression. With this altered branching, an (Operator,Integer) pair is the left branch of a tree section, and the right branch is missing. Of course, no right branch is going to fix #4.

  4. Still struggling to follow the original point you were getting at, but let me take a punt. Are you saying that the idea of trying to validate input before building a tree is going to fall through when I add support for parentheses, because once you have those its much easier to validate token by token as the tree is formed?

    I expect that adding parantheses will force me to scrap the existing syntax checking in favour of some grammar-thing. I wanted to learn the grammar stuff anyway but I didn't want to tackle it the first time through. If that is the case, do you your point will still apply?

  5. Yeah, that's my point. To verify the expression you need to ensure you've got terms on the left and right of any given operator. Parentheses complicate this in a way that makes it seem it'd be easier to do while building the expression tree.

    ...thinking about it more, you could probably still verify the expression by considering parentheses as characters that can appear anywhere, just so long as:
    * opening parens are followed by an integer(maybe with a negative sign)
    * closing parens are preceded by an integer
    * there's equal numbers of opening and closing parens.

    Verifying that a statement fits a grammar is also a pretty tree-ish exercise, so I think that my original point would apply if you were to use a grammar. No rush to use a grammar, though.

  6. Parens is harder than that. For example:

    ((3+5)*2)+1 -- open parens not followed by integer, but should be valid

    3)+(5 -- follows your rules but should be invalid

    (2+3)+3)+(2+(6+3) - extension of previous one