Monday 11 February 2008

Syntax checking with an LL(1) parser

After reading about some different types of parsers, I decided to have a go at LL(1). This means the parser works Left-to-right, generates the Leftmost derivation and looks one token ahead. The basic idea of LL(1) is to maintain a stack of symbols representing the expression being derived, using a lookup table that tells you which rule to apply based on the current symbol and the current token. Not all context-free grammars can be parsed using LL(1), but I think the calculator's grammar can, with some modification. Let's break the work into some smaller steps:

  1. Modify the grammar I described last week to support LL(1) parsing.
  2. Modify the tokens_are_valid method to use LL(1) to verify whether the token stream is well formed. Given that I already have tests for tokens_are_valid, I will know that I am on the right track once the existing tests pass with the new implementation.
  3. Add syntax tests for parentheses. Watch them fail.
  4. Modify the tokens_are_valid implementation to make the new tests pass. Once that happens, we are done with using LL(1) for syntax verification and can move on to using LL(1) to actually do the parsing.
  5. Modify the build_tree method to use LL(1) and watch the operator precedence tests fail because my grammar doesn't deal with the precedence issue.
  6. Fix the grammar to support operator precedence, make the corresponding changes to the lookup table and watch the tests pass.
  7. Add evaluation tests for parentheses. I think these babies should automatically work if I've done the other steps right.

In this post I will tackle steps 1, 2 and 3. Here is the grammar I described last week:
E => n | E+E | E-E | E*E | E/E | (E)
As it stands, this grammar will not work with LL(1) because to figure out which rule to apply, I would have to look at least 2 tokens ahead (it is also left-recursive, which LL(1) cannot cope with). For example, starting with E and given the two different inputs "1" and "1+1", it's obvious that I need to use a different rule for each input, but both of them start with the token "1". This problem can be fixed by modifying the grammar. Here goes:
E => (E) | nT
T => +E | -E | *E | /E | ε
The symbol 'ε' represents the empty string - it means that the symbol T can simply disappear without producing anything. Returning to the earlier example, there is now only one rule to apply when the next token is a number: E => nT. In the "1" case, the T disappears. In the "1+1" case, the T is expanded to +E. Here is the lookup table for this grammar:
      n     +     -     *     /     (     )     $
E     nT                          (E)
T           +E    -E    *E    /E          ε     ε
The '$' represents the end of the input string. The gaps in the table represent syntax errors.

Writing the code

There are a few issues that need addressing when it comes to implementation. The most obvious one is that I need some way to represent the symbols in the above table. Well, it turns out Ruby has a Symbol class, though it was not designed to represent grammar symbols - Ruby symbols are just named constants that have no intrinsic value (as far as I can tell). They use the :symbol_name syntax, so for example E and T are now represented as :E and :T. I amended the existing token classes to expose their corresponding symbol:
class AdditionToken < OperatorToken
  def symbol
    return :plus
  end
end
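The other token classes get the same treatment. As a sketch of a couple of them (just reopening the existing classes; the symbol names are the ones the lookup table further down expects):

class IntegerToken
  def symbol
    return :int
  end
end

class LeftParenToken
  def symbol
    return :lpar
  end
end

class RightParenToken
  def symbol
    return :rpar
  end
end

The remaining operator tokens expose :minus, :multiply and :divide in the same way.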
I need a way to determine if a symbol is terminal or not, and there is a rather nice way to do this - simply defining an extra method on the Symbol class:
class Symbol
  def non_terminal?
    return (self == :T || self == :E)
  end
end
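A quick check in irb (illustrative session) shows the effect:

irb(main):001:0> :T.non_terminal?
=> true
irb(main):002:0> :plus.non_terminal?
=> false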
Just to be clear, Symbol is not my class - I'm just adding an extra method to an existing one, so executing :T.non_terminal? will return true. Next on the agenda is the end token. We don't strictly need an end token, but it does make the algorithm simpler:
class EndToken
  def symbol
    return :end
  end
end
The lookup table can be done as a two-dimensional hash:
def create_lookup_table()
  lookup = {}
  lookup[:E] = {}
  lookup[:E][IntegerToken] = [:int, :T]
  lookup[:E][LeftParenToken] = [:lpar, :E, :rpar]
  lookup[:T] = {}
  lookup[:T][AdditionToken] = [:plus, :E]
  lookup[:T][SubtractionToken] = [:minus, :E]
  lookup[:T][MultiplicationToken] = [:multiply, :E]
  lookup[:T][DivisionToken] = [:divide, :E]
  lookup[:T][RightParenToken] = []
  lookup[:T][EndToken] = []
  return lookup
end
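To illustrate how the table behaves, here is a throwaway snippet (not part of the calculator itself):

table = create_lookup_table()
table[:E][IntegerToken]    # => [:int, :T]   expand E to nT
table[:T][EndToken]        # => []           T produces nothing (the epsilon rule)
table[:E][AdditionToken]   # => nil          a gap in the table, i.e. a syntax error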
As you can see, the table is first indexed with the non-terminal (:E or :T) and then with a token class. The result is an array of symbols, which may be empty. If an attempt is made to look up an undefined entry then nil is returned, and this is interpreted as a syntax error. Right, with that out of the way, I can begin work on tokens_are_valid. The goal is to replace the old implementation with an LL(1) algorithm and hopefully all the existing tests will still pass:
def tokens_are_valid(tokens)
  tokens = tokens.dup
  symbols = [:E]
  lookup = create_lookup_table()

  # The symbol and token arrays can be used as stacks if the contents are first reversed.
  # An end token/symbol is appended to each before they are reversed.
  tokens.push(EndToken.new())
  tokens.reverse!
  symbols.push(:end)
  symbols.reverse!

  while(tokens.size > 0)
    current_symbol = symbols.pop() # The current symbol will always be consumed. Safe to pop.
    current_token = tokens.last    # The current token won't necessarily be consumed, cannot pop it yet.
    if(current_symbol.non_terminal?)
      result = lookup[current_symbol][current_token.class]
      return false if result.nil?
      symbols.push(result.reverse)
      symbols.flatten!
    else
      return false if current_symbol != current_token.symbol
      tokens.pop()
    end
  end

  return true
end
Hopefully this is relatively self-explanatory. The bit that may need some explanation is the push(result.reverse) followed by flatten!. See, I am using arrays as stacks, but to do so I need to reverse the contents - I do this once at the beginning for the tokens and the symbols. However, it's also necessary to reverse the result of the table lookup for the same reason. Yes, I could have coded the table with the symbols backwards, but that is more confusing. Easiest just to push a reversed copy of the result onto the symbol stack. The only problem with doing this is that you end up with an array inside an array, which is why I call the flatten! method. In case you were wondering, Ruby has a convention of using the exclamation mark for methods that modify the instance you called the method on, rather than just returning a modified object. The following should demonstrate:
irb(main):001:0> x = [1,2,3]
=> [1, 2, 3]
irb(main):002:0> y = x.reverse
=> [3, 2, 1]
irb(main):003:0> x
=> [1, 2, 3]
irb(main):004:0> x.reverse!
=> [3, 2, 1]
irb(main):005:0> x
=> [3, 2, 1]
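Back to the parser. To make the mechanics concrete, here is a hand-worked trace of tokens_are_valid on the tokens for "1+2". Each row shows the state before the action; the symbol stack is written with its top on the left, and '$' stands for the end marker:

symbol stack   remaining input   action
E $            1 + 2 $           expand E => nT        (lookup[:E][IntegerToken])
int T $        1 + 2 $           match int, consume 1
T $            + 2 $             expand T => +E        (lookup[:T][AdditionToken])
plus E $       + 2 $             match plus, consume +
E $            2 $               expand E => nT
int T $        2 $               match int, consume 2
T $            $                 expand T => ε         (lookup[:T][EndToken] is [])
$              $                 match end, consume $  - the tokens are valid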
Well, my proposed change to tokens_are_valid works fine - all my tests are still passing! The next step is to write some tests that include parentheses. Here are a few (no need to bore you with the full set):
class ParenthesesSyntaxTests < Test::Unit::TestCase
  def test_surround_one_number # "(1)"
    tokens = [LeftParenToken.new(), IntegerToken.new(1), RightParenToken.new()]
    assert(tokens_are_valid(tokens))
  end

  def test_surround_operation # "(1+2)"
    tokens = [LeftParenToken.new(), IntegerToken.new(1), AdditionToken.new(), IntegerToken.new(2), RightParenToken.new()]
    assert(tokens_are_valid(tokens))
  end

  def test_left_compound # "((500*20)+16)"
    tokens = [LeftParenToken.new(), LeftParenToken.new(), IntegerToken.new(500), MultiplicationToken.new(), IntegerToken.new(20), RightParenToken.new(), AdditionToken.new(), IntegerToken.new(16), RightParenToken.new()]
    assert(tokens_are_valid(tokens))
  end

  def test_right_compound # "(16+(500*20))"
    tokens = [LeftParenToken.new(), IntegerToken.new(16), AdditionToken.new(), LeftParenToken.new(), IntegerToken.new(500), MultiplicationToken.new(), IntegerToken.new(20), RightParenToken.new(), RightParenToken.new()]
    assert(tokens_are_valid(tokens))
  end

  def test_empty_parens_error # "()"
    tokens = [LeftParenToken.new(), RightParenToken.new()]
    assert(!tokens_are_valid(tokens))
  end

  def test_wrong_order_error # ")2("
    tokens = [RightParenToken.new(), IntegerToken.new(2), LeftParenToken.new()]
    assert(!tokens_are_valid(tokens))
  end

  def test_no_operator_adjacent_to_paren_error # "1+2(3+4)"
    tokens = [IntegerToken.new(1), AdditionToken.new(), IntegerToken.new(2), LeftParenToken.new(), IntegerToken.new(3), AdditionToken.new(), IntegerToken.new(4), RightParenToken.new()]
    assert(!tokens_are_valid(tokens))
  end
end
All but one pass! Can you guess which one? It turns out this is a problem with my grammar, not the implementation. Next post I will discuss the failing test and what I have to do to fix it.
