A Monadic-Memoized Solution for the Left-Recursion Problem of Combinatory Parsers

Seminar Presentation Report
Submitted to Dr. A. K. Aggarwal
60-520, Fall 2005
Department of Computer Science, University of Windsor
Submitted by Rahmatullah Hafiz, M.Sc. candidate

INDEX
1. INTRODUCTION
2. PARSING
3. COMBINATORY PARSERS
4. THE LEFT RECURSION PROBLEM
5. RELATED PREVIOUS WORKS
6. OUR PROPOSED APPROACH
7. MONADS
8. FUTURE WORK
9. CONCLUSION
10. REFERENCES

List of Figures
Fig 2.1: Top-down parsing
Fig 4.1: Illustration of left- and non-left-recursive parsing
Fig 5.1: Problem with grammar transformation
Fig 6.1: Parse diagram of how a top-down backtracking left-recursive recognizer works
Fig 7.1: Monadic-memoized left-recursive parsing

Abstract

An alternative, efficient methodology is illustrated for solving the left-recursion problem of combinatory parsers. Memoization is employed to reduce the exponential worst-case time complexity of the traditional top-down backtracking parsing strategy, and monads are used to ensure modular propagation of the memo table. Related previous work is surveyed for comparison; it is evident that earlier approaches do not fully satisfy the requirements of combinatory parser construction. The relevant technical and theoretical background is also described.

1. INTRODUCTION

Combinatory parsers play a significant role in constructing formal- and natural-language parsing systems in functional programming languages. However, conventional parsers and parser generators share a common shortcoming: they do not provide a satisfactory solution for left-recursive grammars. Since left-recursive grammars are easy to implement, highly modular, and offer a great deal of flexibility for natural-language parsing, it is important to build parsers that can exploit their benefits correctly. Most combinatory parsers and parser generators handle left-recursive productions by transforming them into equivalent non-left-recursive productions. This approach is insufficient when parsing with an 'attribute grammar', where each production rule is associated with one or more semantic rules: once the left-recursive grammar is transformed, the new grammar can no longer carry out the actions prescribed by the previously assigned semantic rules. In this paper we investigate related previous work and present our solution to this drawback of combinatory parsers. Sections 2 and 3 describe parsing and combinatory parsing in general. Section 4 defines the common problem of parsing with left-recursive grammars. Section 5 surveys previous work, Section 6 proposes our approach to the problem, and Section 7 generalizes the concept of monads and their usefulness for constructing combinatory parsers.

2. PARSING

Parsing is the process of determining whether a sequence of input symbols can be recognized by a set of predefined rules. The computer program that performs this recognition is called a 'parser'. A parser recognizes a given input text by constructing a 'parse tree'.
The rules a parser follows to recognize an input string are formally known as a context-free grammar (CFG). A CFG G can be defined as a 4-tuple G = (Vt, Vn, P, S), where Vt is a finite set of terminals, Vn is a finite set of non-terminals, P is a finite set of production rules, and S, an element of Vn, is the distinguished starting non-terminal. Elements of P are of the form Vn -> (Vt U Vn)*.

Context-free languages (CFLs) have played a central role in natural languages since the 1950s (Chomsky) and in compilers since the 1960s (Backus). Parsing is one of the fundamental areas of computing and is necessary for both programming languages and natural languages. There are basically two kinds of parsing, bottom-up parsing and top-down parsing; we are particularly interested in top-down parsing. Aho, Sethi and Ullman (1986) give a detailed description of parsing and related methodologies in their book "Compilers: Principles, Techniques and Tools".

TOP-DOWN PARSING

In top-down parsing, recognition of a given input starts from the root of the context-free grammar and proceeds towards the leaves, and the parser tries to recognize the input from left to right. Parsers of this kind are easy to write and very structured. One drawback is that conventional top-down parsing does not support all forms of CFG. One form of top-down parsing is called recursive-descent backtracking parsing, in which the parser recursively tries all possible alternative rules to recognize the given input string. If no extra precaution is taken, recursive-descent backtracking parsers show exponential time complexity in the worst case.

Fig 2.1: Top-down parsing. [Parse tree for the input "2*5+6" under the rules exp ::= digit op exp | digit, digit ::= 0 | 1 | .. | 9, op ::= * | +.]
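To make this concrete before moving on to combinators, the following is a minimal hand-written sketch, in Haskell, of a recursive-descent backtracking recognizer for the grammar of Fig 2.1; the names digit, op and expr are illustrative, not from any particular library. A recognizer here maps an input string to every possible remaining input, so an empty string in the result list signals a complete parse.

import Data.Char (isDigit)

-- digit consumes a single decimal digit; op consumes '*' or '+'
digit :: String -> [String]
digit (c:cs) | isDigit c = [cs]
digit _                  = []

op :: String -> [String]
op (c:cs) | c `elem` "*+" = [cs]
op _                      = []

-- expr ::= digit op expr | digit
-- Both alternatives are tried in turn, which is exactly the backtracking behaviour.
expr :: String -> [String]
expr input = [rest3 | rest1 <- digit input
                    , rest2 <- op rest1
                    , rest3 <- expr rest2]
             ++ digit input

On the input of Fig 2.1 this gives expr "2*5+6" = ["","+6","*5+6"]; the empty string witnesses a parse of the whole input, while the other entries are the partial parses found by backtracking.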
3. COMBINATORY PARSERS

In general, a parser written in a functional programming language is called a 'combinatory parser'. More specifically, combinatory parsers are written in lazy functional languages using higher-order functions, and since higher-order functions are used to combine parsers, they are generally called 'combinatory parsers'. Like any other parsers, they can be used to parse both natural languages (e.g. English) and formal languages (e.g. Java). Though combinatory parsers were introduced by Burge in 1975, it was Wadler (1985) who first popularized their use.

Combinatory parsers are written and used within the same programming language as the rest of the program, so there is no gap between the grammar formalism and the actual programming language used. Users only need to learn the host language, and they enjoy all the benefits that the host language provides, such as type checking, a module system and a development environment.

A combinatory parser can be defined as a program which analyses text to determine its logical structure. For example, the parsing phase of a compiler takes a program text and produces a parse tree, which reflects the structure of the program. Many programs can be improved by having their input parsed. The acceptable form of input is usually defined by a CFG, using BNF notation. Although there are many ways of building parsers, combinatory parsing has gained widespread acceptance in lazy functional languages. In this method, parsers are modeled directly as functions; larger parsers are built piecewise from smaller parsers using higher-order functions. For example, higher-order functions can be defined for sequencing, alternation and repetition. In this way, the text of a parser closely resembles BNF notation. Parsers in this style are quick to build, and simple to understand and modify. Combinatory parsing is considerably more powerful than the commonly used methods, being able to handle ambiguous grammars and providing full backtracking if needed. In fact, we can do more than just parsing: semantic actions can be added to parsers, allowing their results to be manipulated in any way we desire, and more generally we could imagine generating some form of abstract machine code as programs are parsed.

Parsers can be built by hand, but are most often generated automatically using tools like Lex and Yacc for imperative languages, or Happy for the functional language Haskell. One drawback of that approach is that the user needs to learn a new language (Lex, Yacc or Happy) to generate a parser.

EXAMPLES OF COMBINATORY PARSERS

Lazy functional programming languages come with a diverse range of advantages over imperative or strict functional languages. Discussing all of those advantages is outside our topic; we only explain why they provide great benefits for constructing parsers. Functional programming gives us higher-order functions with which to construct combinatory parsers, since functions are first-class citizens in functional languages. A 'higher-order function' is a function which can take other function(s) as arguments and can also produce function(s) as its output. As higher-order functions can represent the BNF notation of a CFG directly, they are very suitable for constructing top-down recursive-descent fully backtracking parsers. In 1989 Frost and Launchbury showed how to construct natural-language interpreters in a lazy functional language (Miranda) using higher-order functions, and in 1992 Hutton used higher-order functions to construct a complete parsing system in the functional language Haskell. The following example, derived from Frost's and Hutton's explanations, demonstrates how higher-order functions mimic the BNF notation of a context-free grammar.

Sample context-free grammar: s ::= a s | empty

We can read this grammar in English as "s is either a then s, or empty":

s = a then s or empty

Using higher-order functions it is possible to write a recognizer almost exactly as above (`then` is a Haskell keyword and `or` clashes with the Prelude, so the combinators are named thn and orelse here):

empty input = [input]                                    -- succeed without consuming input

a ('a':xs) = [xs]                                        -- consume a single 'a'
a _        = []

(p `orelse` q) input = p input ++ q input                -- alternation
(p `thn` q) input = concat [q rest | rest <- p input]    -- sequencing

s input = (a `thn` s `orelse` empty) input

The above is a simple example of how to write a recognizer using higher-order functions. The main recognizer 's' is built from two higher-order functions (formally called combinators), 'thn' and 'orelse'. The combinator 'thn' takes two other recognizers 'p' and 'q'; here both are instantiated with the recognizer functions 'a' and 's', and the recognizer 'empty' is itself a function too. The combinator 'orelse' works the same way. The final parser is exactly equivalent to 's ::= a s | empty', and when the input string "aaa" is supplied to 's', it is recognized in all the different possible ways (by recursive-descent backtracking) using 'thn' and 'orelse':

*Main> s "aaa"
["","a","aa","aaa"]
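The earlier remark that semantic actions can be added to parsers can also be made concrete. The following is a sketch in the style of Hutton (1992) of result-carrying parsers: instead of only remaining inputs, a parser returns (result, remaining input) pairs, and a combinator applies a semantic action to every result. The names succeedP, satisfy, alt, seqP, using and digitP are illustrative, not the report's own.

type Parser a = String -> [(a, String)]

succeedP :: a -> Parser a                   -- always succeed with a given result
succeedP v input = [(v, input)]

satisfy :: (Char -> Bool) -> Parser Char    -- consume one matching character
satisfy p (c:cs) | p c = [(c, cs)]
satisfy _ _            = []

alt :: Parser a -> Parser a -> Parser a     -- alternation, as before
(p `alt` q) input = p input ++ q input

seqP :: Parser a -> Parser b -> Parser (a, b)   -- sequencing, pairing the results
(p `seqP` q) input = [((x, y), rest') | (x, rest) <- p input, (y, rest') <- q rest]

using :: Parser a -> (a -> b) -> Parser b   -- attach a semantic action
(p `using` f) input = [(f x, rest) | (x, rest) <- p input]

-- Example action: recognize a digit character and return its numeric value.
digitP :: Parser Int
digitP = satisfy (`elem` ['0'..'9']) `using` (\c -> fromEnum c - fromEnum '0')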
4. THE LEFT RECURSION PROBLEM

If a combinatory parser is written following a non-left-recursive CFG, then either the parser finishes recognizing the whole input at some point, or it terminates (indicating failure) after trying all possible rules by recursive-descent backtracking. But if the parser is written following a left-recursive context-free grammar, it normally never gets the chance to try any alternative rule, because it calls itself recursively forever. That is why a left-growing parser never terminates unless some action is taken. The following illustration shows why a left-recursive grammar does not terminate while a non-left-recursive one terminates on the same input.

Fig 4.1: Illustration of left- and non-left-recursive parsing. [For the input "aaa", the left-recursive grammar s ::= s a | a keeps expanding its leftmost s and never terminates, while the right-recursive grammar s ::= a s | a successively splits the input into ("a","aa"), ("aa","a") and ("aaa","") and terminates.]

So if a combinatory parser is written using left-recursive higher-order functions, the program keeps running until it runs out of memory. For example, if the previously defined parser is rewritten following the left-recursive grammar 's ::= s a | empty', it simply fails to recognize any input:

s input = (s `thn` a `orelse` empty) input    -- never terminates

*Main> s "aaa"
*** Exception: stack overflow

Why bother with left-recursive parsing? Below are some reasons why left-recursive grammars are worth considering as the 'rules' of a combinatory parser:
- they are easy to implement and very modular;
- in natural-language processing, expected ambiguity can be achieved easily;
- non-left-recursive grammars might not generate all possible parses, whereas a left-recursive grammar's root grows all the way down to the bottom level, ensuring all possible parses.

5. RELATED PREVIOUS WORKS

We can divide previous work into two categories: grammar-transformation approaches and approaches that lead to exponential time complexity.

Grammar transformation approach

Most of the previous work falls into this category. The idea is to transform a left-recursive grammar into a non-left-recursive grammar, and many parsers and parser generators have been constructed using this approach. Two important works in this line are:
- Frost (1992): guarding left-recursive productions with non-left-recursive productions
- Hutton (1992, 1996): grammar transformation

The grammar-transformation technique, however, suffers from two drawbacks:
- If the original grammar is associated with semantic rules, there is no known way to attach those rules to the newly transformed grammar. Natural-language parsers are often written as 'attribute grammars' (grammars that attach predefined semantic rules to each production rule); if a left-recursive attribute grammar is transformed into a non-left-recursive one, the newly formed grammar violates the predefined semantic rules.
- The way a non-left-recursive grammar constructs a parse tree differs completely from that of the equivalent left-recursive grammar, so another algorithm is needed to re-transform the non-left-recursive parse tree into the original one.
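As a concrete instance of the transformation illustrated in Fig 5.1 below, the left-recursive grammar s ::= s a | b can be rewritten as s ::= b s' with s' ::= a s' | empty, and the transformed version runs fine with the recognizer combinators of Section 3. This is a minimal sketch reusing thn, orelse, empty and a from that section; b is defined analogously.

b :: String -> [String]
b ('b':xs) = [xs]                              -- consume a single 'b'
b _        = []

s  input = (b `thn` s') input                  -- s  ::= b s'
s' input = (a `thn` s' `orelse` empty) input   -- s' ::= a s' | empty

*Main> s "baa"
["","a","aa"]

The transformed recognizer terminates, but as the figure below illustrates, the parse trees it builds are shaped completely differently from those of the original left-recursive grammar.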
Approaches leading to exponential time complexity

Some work has been done without transforming left-recursive grammars into non-left-recursive ones. Among these, Lickman's (1995) is remarkable. Inspired by the work of Wadler, he defined combinatory parsers in terms of fixed points. A fixed point of a function f is a value x such that f(x) = x; a fixed-point operator makes it possible to assign meaning to otherwise infinite chains of recursive calls. Lickman's idea was that if a left-recursive parser can be redefined in terms of a fixed-point operator, it will terminate at some point. But this termination comes with exponential worst-case time complexity, which is undesirable for any form of parsing. Lickman mentioned that by redefining the fixed-point computation it might be possible to achieve better complexity, but he did not extend his work to that point.

Below is an illustration of why grammar transformation is not desirable:

Example. Original left-recursive grammar: s ::= s a [rule 1] | b [rule 2]. Transformed grammar: s ::= b s' [rule ??], s' ::= a s' | empty [rule ??].

Fig 5.1: Problem with grammar transformation. [The original grammar derives "baa" through the left-branching tree s -> (s -> (s -> b) a) a, while the transformed grammar derives it through s -> b (s' -> a (s' -> a (s' -> empty))); the semantic rules attached to rules 1 and 2 have no counterpart in the transformed productions.]

6. OUR PROPOSED APPROACH

The basic idea is simple: if we somehow prevent a left-recursive grammar from growing infinitely, it terminates at some point. Frost and Szydlowski (1996) have already proposed a framework for constructing top-down backtracking combinatory parsers in a lazy functional language, and showed how memoization helps achieve reduced time complexity. We can extend their work to left-growing grammars by imposing some constraints.

Memoizing top-down backtracking left-recursive recognizers

Since a left-recursive production grows infinitely, a left-recursive recognizer normally does not terminate. Imposing some restrictions on the left-recursive function calls can remove this drawback. By observation, it is evident that no left- or right-recursive recognizer needs to call itself more times than the number of input tokens it is supposed to recognize. A left-growing recursive call can therefore be terminated, with an indication of failure, after it has been allowed to grow up to the length of the input string. This termination, along with the current result of recognizing a particular input with a particular production, is stored in a 'memo' table so that the same left-recursive production does not try to re-grow for the same input; instead it simply performs an update operation on the memo table and returns the previously computed results. Memoizing the left-recognizers' termination points together with the results recognized at those points ensures cubic complexity.

Procedure

a. Change the previous 'state' to [(FT, MT)], where
   FT = [([Char], [([Char], [[Char]])], (num, num))] is the memo table for left-recursive productions, and
   MT = [([Char], [([Char], [[Char]])])] is the memo table for non-left-recursive productions and terminals.
   The (num, num) pair in FT keeps track of the depth of the current recursive call and the length of the current input, respectively. Using a state monad to construct the recognizers ensures that every recursive call sees the up-to-date instance of this 'state'.
b. Define the 'failCheck' function, which performs:
   1. Termination: when a left-recognizer tries to recognize a particular input for the first time, 'failCheck' allows it to call itself recursively up to the length of that input and then terminates it by returning an empty list.
   2. Update: 'failCheck' records the termination point by saving the left-recursive production together with the input for which it terminated, and the up-to-date results of recognizing that input, in the table FT.
   3. Lookup: whenever a left-recursive production tries to recognize an input, 'failCheck' first performs a lookup on FT. If it finds a previously saved result for that production and input, it simply returns that result; otherwise it performs the operations of steps 1 and 2. A successful lookup means that the production has already 'failed' once for the same input.
c. All non-left-recursive and terminal recognizers are memoized through the 'memoize' function using table MT, as described by Frost and Szydlowski (1996).
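For readers following along in Haskell rather than Miranda, a direct transliteration of the state type in step (a) might look as follows; the type names Result, FT, MT and ParserState are our own labels, with String standing for [Char] and Int for num:

type Result = [String]            -- possible remaining inputs for one recognition attempt

-- memo table for left-recursive productions:
-- (production name, results per input, (current call depth, input length))
type FT = [(String, [(String, Result)], (Int, Int))]

-- memo table for non-left-recursive productions and terminals
type MT = [(String, [(String, Result)])]

type ParserState = [(FT, MT)]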
Example

Definition of the left-recognizer 'ms' for the grammar s ::= s s 'a' | empty (written with Miranda-style '$then' and '$orelse' combinator sections):

ms = failCheck "ms" (ms $then ms $then ma $orelse empty)
ma = memoize "ma" (term 'a')
input = "aa"

*Main> ms "aa" [([],[])]
(["","a","aa"],
 [([("ms", [("a",["","a"]), ("aa",["a","aa"])], (0,1))],
   [("ma", [("aa",["a"]), ("a",[""]), ("",[])])])])

Fig 6.1: Parse diagram of how the top-down backtracking left-recursive recognizer works. [While recognizing "aa", each attempt by 'ms' first looks the input up in FT; growth beyond the input length is terminated with [], and the table is updated with entries such as ("a",["","a"]) and ("aa",["a","aa"]), so subsequent left-recursive calls on the same input return immediately instead of re-growing.]

Definition of the 'failCheck' function (in Miranda):

failCheck name f inp [(fTable, mTable)]
    = f inp [((name, [], (1, length inp)) : fTable, mTable)],      if first_elem fTable name = []
    = (res, [(updateF (fst (newtable!0)) name inp res, mTable)]),  if first_elem fTable name ~= [] & failInp = []
    = (failRes, [(updateF fTable name inp failRes, mTable)]),      if first_elem fTable name ~= [] & failInp ~= []
      where
      failInp = lookupF name inp fTable
      failRes = lookupFR name inp fTable
      (res, newtable)
          = ([], [(dec_table fTable, mTable)]),  if deptH > length inp
          = f inp [(inc_table fTable, mTable)],  otherwise
            where
            (deptH, lengtH) = head [(d,l) | (d,l) <- [c | (a,b,c) <- fTable; a = name]]
      dec_table ((key, fails, (d,l)) : rest)
          = (key, (inp,[]) : fails, (0, length inp)) : rest,  if key = name
          = (key, fails, (d,l)) : dec_table rest,             otherwise
      inc_table ((key, fails, (d,l)) : rest)
          = (key, fails, (d+1, length inp)) : rest,  if key = name
          = (key, fails, (d,l)) : inc_table rest,    otherwise

first_elem fTable name
    = [],                                 if fTable = []
    = [a | (a,b,c) <- fTable; a = name],  otherwise

lookupF name inp fTable
    = [],                                           if res_in_table = []
    = [i | (i, res) <- res_in_table ! 0; i = inp],  otherwise
      where
      res_in_table = [pairs | (n, pairs, dl) <- fTable; n = name]

lookupFR name inp fTable
    = [],                                                 if res_in_table = []
    = [res | (i, res) <- res_in_table ! 0; i = inp] ! 0,  otherwise
      where
      res_in_table = [pairs | (n, pairs, dl) <- fTable; n = name]

updateF ((key, (finp,frec):restr, dl) : rest) name inp res
    = (key, addRes ((finp,frec):restr) inp res, dl) : rest,      if key = name
    = (key, (finp,frec):restr, dl) : updateF rest name inp res,  otherwise

addRes ((finp,frec):restr) inp res
    = (finp, nub (res ++ frec)) : restr,   if finp = inp
    = (finp,frec) : addRes restr inp res,  otherwise

Memoization reduces the worst-case time complexity from exponential to O(n^3). The problem is that lazy functional languages do not allow destructive updates or a global store for the whole program, so the 'memo' table must be passed around in such a way that every recursive parsing call can access it. If the memo table is passed around as an explicit function argument, the code gets messy and error-prone.
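To see why explicit passing is painful, here is a hedged sketch (using the ParserState and Result types sketched above, and the illustrative names StateRec and thenExplicit) of what sequencing looks like when every recognizer takes the memo table in and hands an updated one back. Mixing up which version of the state is threaded where silently breaks memoization, and every combinator must repeat this plumbing:

-- a recognizer that threads the memo table explicitly
type StateRec = ParserState -> String -> (Result, ParserState)

thenExplicit :: StateRec -> StateRec -> StateRec
thenExplicit p q st input =
    let (rests, st1) = p st input          -- run p, obtaining state st1
        step (acc, s) rest =
            let (rests', s') = q s rest    -- thread the latest state into q
            in (acc ++ rests', s')
    in foldl step ([], st1) rests          -- fold q over every alternative of p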
The alternative solution is to use monads. The concept of the monad was derived from category theory by Moggi in 1989; below is a general discussion of monads.

7. MONADS

Non-strict functional programming languages (Miranda, Haskell, etc.) are blessed with a special computational technique called the 'monad'. Many new functional programmers are puzzled at first by the concept, and the basic question arises: why do we need them? Since non-strict functional programming languages do not permit 'side effects', it is relatively complex to perform operations like IO, maintaining state, raising exceptions and error handling. A naive approach to these problems is to redefine the related recursive calls for each operation again and again; monads appeared as an easy solution to these kinds of problems. By making simple changes to an existing monad, we can perform the above-mentioned operations in a fairly easy way. Monads, to some extent, mimic an imperative-style programming environment within the scope of a pure functional language.

A monad helps the programmer construct a 'bigger' computation by combining sequential blocks of 'smaller' computations. It abstracts the smaller computations and the combination strategy away from the root computation, which is why it is easy to add new changes to an existing monad to fulfill computational requirements. Wadler (1992) gave a neat illustration of how to redefine an existing monad to satisfy different requirements of a program.

Haskell, a non-strict functional programming language, comes with many built-in monads (such as list, Maybe and IO), and the Prelude contains some monadic classes (such as Monad, MonadPlus and Functor). It also provides special syntax for monads (the 'do' notation), which gives programmers the feel of imperative-style programming in Haskell. It is important for a Haskell programmer to understand how these built-in monads work.

Any monad consists of a triple (M, unit, bind), where M is a polymorphic type constructor. The function 'unit', of type a -> M a, takes a value and returns the computation of that value. The function 'bind', of type M a -> (a -> M b) -> M b, applies the computation a -> M b to the computation M a and returns a computation M b; 'bind' ensures the sequential combination of blocks of computation.

[Conceptual view of how monads work, picturing M as a loader, unit as a tray and bind as a combiner. Picture source: Newburn 2003.]

As monads give us a structured way of computing, it is a good idea to transform the previously defined parser combinators into monadic objects. Below is an example of how to transform the old 'or' recognizer; all the other recognizers can be transformed in the same way.

Original 'or' recognizer:

(p `or` q) inp = p inp ++ q inp

Monadic version ('nub' comes from Data.List):

(p `or` q) inp = p inp `bindS` f
                 where
                 f m = q inp `bindS` g
                       where
                       g n = unitS (nub (m ++ n))
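The functions unitS and bindS used above are not defined in this report; a minimal state-monad sketch that they might correspond to, threading the parser state of Section 6 behind the scenes (an assumption in the spirit of Frost and Szydlowski 1996), is:

-- a computation that transforms the parser state and yields a value
type StateM a = ParserState -> (a, ParserState)

unitS :: a -> StateM a
unitS x st = (x, st)                     -- return a value, leaving the state untouched

bindS :: StateM a -> (a -> StateM b) -> StateM b
(m `bindS` f) st = let (x, st') = m st   -- run the first computation
                   in f x st'            -- feed its result and new state onward

Under these definitions a monadic recognizer has type String -> StateM Result, and the monadic 'or' above threads the memo table through p and q without ever mentioning it.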
Fig 7.1: Monadic-memoized left-recursive parsing. [While recognizing "aa", the left-recursive production keeps track of input length and call depth, and the terminal production 'a' looks up and updates the memo table; correct propagation of the memo table is ensured by the monad, giving O(n^3) behaviour and the result (["","a","aa"], ...) together with the final table.]

8. FUTURE WORK

We have been working on constructing a complete left-recursive parsing system, which could be helpful for parsing natural language with reduced complexity. A formal proof of the terminating functions' complexity also remains to be done. We have only experimented with the described procedure on toy grammars; it will be interesting to find out how this method works in a practical parsing system.

9. CONCLUSION

We described how to write combinatory parsers in lazy functional languages and mentioned the advantages of constructing parsers that way. We also explained why such parsers cannot be used with left-recursive context-free grammars. We then discussed the advantages of left-recursive parsing and examined some related previous works and their drawbacks. We proposed to solve the problem by imposing restrictions on the growth of left-recursive productions so that they terminate at the desired point. The use of memoization was explained, which brings the worst-case time complexity of top-down backtracking parsers down from exponential to cubic. Since lazy functional languages do not support side effects, we described how monads can be used to ensure correct propagation of the memo table throughout the whole parsing procedure. We believe our proposed solution is robust enough to construct parsing systems that are able to handle left-recursive grammars.

10. REFERENCES

1. Frost, R. A. and Launchbury, E. J. (1989) Constructing natural language interpreters in a lazy functional language. The Computer Journal, special edition on lazy functional programming, 32(2).
2. Wadler, P. (1992) Monads for functional programming. Computer and Systems Sciences, Volume 118.
3. Hutton, G. (1992) Higher-order functions for parsing. Journal of Functional Programming, 2(3):323-343, Cambridge University Press.
4. Frost, R. A. (1992) Guarded attribute grammars: top down parsing and left recursive productions. SIGPLAN Notices, 27(6):72-75.
5. Lickman, P. (1995) Parsing With Fixed Points. Master's thesis, Oxford University.
6. Frost, R. A. and Szydlowski, B. (1996) Memoizing purely functional top-down backtracking language processors. Science of Computer Programming, 27(3):263-288.
7. Hutton, G. (1998) Monadic parsing in Haskell. Journal of Functional Programming, 8(4):437-444, Cambridge University Press.
8. Frost, R. A. (2003) Monadic memoization towards correctness-preserving reduction of search. Canadian Conference on AI 2003: 66-80.