What are the main advantages and disadvantages of using an LL or an LR parser?


40

I’m building a parser for a programming language. The grammar does not need to handle all the complexities of a language like C++ or Lisp.

I have moderate knowledge of language design but little experience with the tools and methodologies used to implement languages. I have a notion of how a parser works and of the algorithms most commonly used for it. Among them I narrowed the choice down to two: LL and LR. I’ve studied the basics of both, but since my knowledge is only theoretical I’m still not comfortable stating the advantages of each one.

If you have experience with both algorithms and have built a compiler in a situation like the one above, you can help me:

  1. What do I gain and what do I lose by choosing an LL parser rather than an LR parser (or vice versa)?

  2. Am I overlooking other important algorithms that might be better suited than these two in some situations?

  3. How do they compare in terms of grammar complexity, ease of implementation and maintenance, and execution speed?

I do not want to know which one is better: if one were clearly better than the other, the other would only be mentioned for historical reasons. Which one suits my case is something only I can determine; what I need is input to make that decision.

5 answers

29


LL

LL works by trying to predict which production rule comes next and then applying that rule to the next symbols read from the input. If the prediction fails, it is necessary to return to the state before the prediction, put the symbols that were read back into the input (backtracking) and make a new prediction. The parser fails when all possible predictions have been tested (that is, when the initial assumption that the input is valid for the language proves false).

It is a fairly easy type of parser to write by hand because of its recursive nature. It becomes trivial if the language is regular or never needs backtracking (in that case it is LL(1)). The parser code also closely mirrors the grammar that produced it and can be read easily.

However, the running time is difficult to predict (although it is not far from linear in the input size in most practical cases) and not all grammars can be processed by an LL parser. A classic example is left recursion:

Num    →   [0-9]+
Base   →   Num | '(' Sum ')'
Prod   →   (Prod '*')? Base
Sum    →   (Sum '+')? Prod

This grammar cannot be processed directly: when the LL parser tries to read a Sum, it assumes that what comes first is another Sum and tries that possibility, entering an infinite loop. For this to work you must first transform the grammar into a form the LL algorithm accepts, like this:

Num    →   [0-9]+
Base   →   Num | '(' Sum ')'
Prod   →   (Base '*')* Base
Sum    →   (Prod '+')* Prod

It still accepts exactly the same language, but it will not produce the same syntax tree (Sum and Prod become n-ary operations instead of binary ones) and additional processing is needed to adjust it.
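To make this concrete, here is a minimal recursive-descent sketch in Python for the transformed grammar (my own illustration, not code from the original answer): each rule becomes one function, the iterative forms become loops, and the fold helper is the "additional processing" that turns the n-ary result back into binary nodes.

import re

def tokenize(src):
    return re.findall(r"\d+|[()+*]", src)

class Parser:
    def __init__(self, tokens):
        self.toks = tokens
        self.pos = 0

    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None

    def eat(self, tok):
        if self.peek() != tok:
            raise SyntaxError(f"expected {tok!r}, found {self.peek()!r}")
        self.pos += 1

    def num(self):                      # Num  -> [0-9]+
        tok = self.peek()
        if tok is None or not tok.isdigit():
            raise SyntaxError(f"expected a number, found {tok!r}")
        self.pos += 1
        return int(tok)

    def base(self):                     # Base -> Num | '(' Sum ')'
        if self.peek() == '(':
            self.eat('(')
            node = self.sum()
            self.eat(')')
            return node
        return self.num()

    def prod(self):                     # Prod -> (Base '*')* Base, done as Base ('*' Base)*
        parts = [self.base()]
        while self.peek() == '*':
            self.eat('*')
            parts.append(self.base())
        return fold('*', parts)

    def sum(self):                      # Sum  -> (Prod '+')* Prod, done as Prod ('+' Prod)*
        parts = [self.prod()]
        while self.peek() == '+':
            self.eat('+')
            parts.append(self.prod())
        return fold('+', parts)

def fold(op, parts):
    # turn the n-ary list back into left-associative binary nodes
    node = parts[0]
    for part in parts[1:]:
        node = (op, node, part)
    return node

print(Parser(tokenize("1+2*(3+4)")).sum())
# ('+', 1, ('*', 2, ('+', 3, 4)))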

LR

LR operates on a list (used as a stack): at each step one of the following operations is executed:

  • Shift: a symbol is read from the input and added to the end of the list as a terminal.
  • Reduce: a production rule is applied to the last N terminal or non-terminal symbols in the list, turning them into a single non-terminal.

The operation is repeated until the input is empty and the list contains only one non-terminal, which represents the entire input. If this state cannot be reached (the input ends and no rule can be applied to reduce the list), the parser fails.

The interesting thing here is that an input symbol is never read more than once, and the processing time is clearly linear in the size of the input (assuming the number of production rules is not absurdly large, which is true in practical cases). This mechanism handles recursion with no trouble. Some variants are LALR and SLR.
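To illustrate the shift/reduce mechanics, here is a minimal sketch in Python (my own illustration, not part of the original answer). It is not a real table-driven LR parser: a real one consults a state table to decide between shifting and reducing, while here that decision is hard-coded for a tiny, unambiguous left-recursive grammar (Expr → Expr '+' Num | Num, Num → [0-9]+), which an LL parser could not use directly.

def parse(tokens):
    stack = []                                  # holds tokens and tree nodes
    for tok in tokens:
        stack.append(tok)                       # shift
        while True:                             # reduce while some rule matches
            if isinstance(stack[-1], str) and stack[-1].isdigit():
                stack[-1] = ('Num', stack[-1])                      # Num  -> [0-9]+
            elif len(stack) == 1 and stack[0][0] == 'Num':
                # Expr -> Num; a real LR parser would know from its state table
                # when to apply this, here the len(stack) == 1 test plays that role
                stack[0] = ('Expr', stack[0])
            elif len(stack) >= 3 and stack[-3][0] == 'Expr' \
                    and stack[-2] == '+' and stack[-1][0] == 'Num':
                stack[-3:] = [('Expr', stack[-3], '+', stack[-1])]  # Expr -> Expr '+' Num
            else:
                break                           # nothing to reduce: shift again
    if len(stack) == 1 and stack[0][0] == 'Expr':
        return stack[0]                         # accept
    raise SyntaxError("input could not be reduced to a single Expr")

print(parse(['1', '+', '2', '+', '3']))
# ('Expr', ('Expr', ('Expr', ('Num', '1')), '+', ('Num', '2')), '+', ('Num', '3'))

Note how the left-recursive rule naturally produces a left-associative tree, and each input token is read exactly once.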

The LR parser accepts a larger set of grammars and has no problem with left recursion. The following grammar, for example, is perfectly acceptable:

Num    →   [0-9]+
Expr   →   Expr '+' Expr | Num

As this is an ambiguous grammar, the rightmost derivation is used (hence the R in LR): 1+2+3 is processed as 1+(2+3).

It is a type of parser that is difficult to write by hand, because its logic is mostly encoded as transition tables rather than code. On the other hand, it is easy to have it generated from the grammar by another program, a compiler compiler. Examples are Bison and Yacc, which produce LALR parsers.

Conclusion

If you intend to write the parser yourself, without the help of any tool, I would immediately tend to choose LL, especially if the language is regular or simple enough to be LL(1). It is an easy model to program, and in most cases the performance difference is too small to be relevant. (There are grammars that can make an LL parser take exponential time in the input size, but I am assuming real, practical cases.) It is a good choice for someone who is starting out on the subject, because it lets you really understand what is happening in the code.

But if you plan a more complex parser, or if efficiency is important, the ideal is to rely on a tool to generate the parser for you. In most cases it will be able to apply simplifications to the grammar that make the final parser code unreadable, but faster. Almost all such tools produce LALR parsers, although some use LL (such as ANTLR). In that case you describe the grammar, usually in a BNF- or EBNF-like notation. The diagnostics for syntax errors also tend to be better.

However, you may need fine-grained control in your parser, for example to deal with some unconventional syntax element for which you cannot write a grammar directly. An example of this is the C/C++ preprocessor. You can do the processing in two stages, feeding the preprocessed code to the parser, but then you lose important diagnostic information and the error messages may have nothing to do with the code as originally written. The alternative is to integrate the preprocessor into the parser; in that case there is no way to write a grammar for it, and you need to write the parser in code.

I personally choose to write a parser by hand (LL) only when the language is extremely simple. For any real project I use some tool to generate the parser.


  • 5

    It’s always good to see that we don’t only have web developers here :) You’ve answered things I thought no one here would answer, and about which I hadn’t even found good material in English.

  • In the production Expr → Expr '+' Expr it seems to me that the text would be better if you didn’t use an ambiguous grammar (Expr → Expr '+' Num). Ambiguity will produce unnecessary and semantically dangerous conflicts. It seems to me that you could review the passage that explains LR.

  • @Jjoao In fact the LL example was incorrect. Corrected. As for the LR one, the intention was precisely to show an ambiguous grammar, to demonstrate that this is possible.

  • 1

    @Guilhermebernal, sorry if I insist. The idea behind the R in LR is not quite that: it is more about waiting for the right corner of the derivation tree before making decisions. With ambiguous grammars like this one, all the tools I have used (without an extra definition of associativity) complain about reduce-reduce conflicts and warn that the semantics will come out wrong.

  • 1

    @Jjoao I will review this calmly and read more on the subject. Later on I will make the appropriate edits. Thank you very much for pointing out the error; I will study it.

18

First you need to determine whether a formal grammar is even the best way to represent your language. If it is, check whether that grammar can be implemented by one algorithm or the other (because, as Guilherme Bernal pointed out, there are subtle differences between the grammars accepted by LL, LR and their variants) - or whether some other technique will be necessary. It is also important to determine how much efficiency you need from the compiler (nowadays there is not so much need to "squeeze every CPU cycle").

A "compiler of compilers" can be interesting, although there are languages - from syntax to simple - that cannot be implemented in this way. An example is Prolog, that has instructions to define new operators and, in the following code line, already allows the programmer to use them as part of the program:

:-op(500, yfx, mais).
dobro(X, Y) :-
    Y is X mais X. /* Naturally, 'mais/3' has to be defined (before or after)... */

That is, although it is possible to define the Prolog syntax with a context-free grammar, it would be necessary to modify its derivation rules during the compilation itself. Compiler compilers in general are not that flexible... (another language that also allows a lot of flexibility at compile time is Lisp, with its reader macros)

Languages with these characteristics would be better expressed in another way (for example, with the top-down operator precedence technique). But let’s assume you opted for a formal grammar. The next step would be to determine whether it fits the requirements of one of these automatic generation methods (such as LALR parsers, or perhaps PEG, suggested by @Victor). Very likely this will be possible, although in some cases the way you express your derivation rules gets a little "weird" (e.g. when the syntax for generics was introduced in Java and C#, a special rule was needed to handle Classe<Tipo1<Tipo2>> - the angle brackets needed to be balanced, but the symbol >> was not recognized as two > because of the "longest match" principle).
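A small illustration of that "longest match" issue (my own sketch, not part of the original answer): a tokenizer that always prefers the longest operator returns ">>" as a single token, so the grammar cannot simply expect two separate ">" tokens to close Classe<Tipo1<Tipo2>>.

import re

TOKEN = re.compile(r"\s*(>>|>|<|\w+)")   # ">>" is listed before ">", so the longer operator wins

def tokenize(src):
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN.match(src, pos)
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

print(tokenize("Classe<Tipo1<Tipo2>>"))
# ['Classe', '<', 'Tipo1', '<', 'Tipo2', '>>']  -- one ">>" token, not two ">"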

If either kind of parser requires a lot of change in the way you express your rules, that can hurt the maintainability of your compiler, so it should be avoided.

Finally, if there are still two or more parser options, consider whether there is really a need to optimize for performance: the advantage of using a ready-made generator - or, conversely, of manually implementing a simpler algorithm - may be getting a working version sooner, one you can use, test and get feedback on - in short, gaining more hands-on language-creation experience. A second, more efficient version can always be made later.

  • Sometimes these grammars are called "operator precedence grammars" and belong to the context-free grammar family.

14

Instead of using LR or LL, use PEGs (Parsing Expression Grammars; see here and here).

A PEG is defined similarly to a CFG (context-free grammar), with the following differences:

  • A PEG does not have the non-deterministic choice of a CFG (rules of the type A ← b | c). Instead, PEGs have ordered (deterministic) choices (rules like A ← b / c). In a deterministic choice of the form A ← b / c / d, the non-terminal A first attempts to derive b (and the corresponding parser tries this first too). If A cannot derive b, the parser then attempts to derive a c, and if that also fails, a d. The rule is defined this way in order to eliminate the ambiguity of choice (which typically exists in CFGs), but it also makes some cases problematic, such as A ← x / xy, where xy will never be derived. An interesting practical consequence is that a rule of the type S ← 'if' C 'then' S 'else' S / 'if' C 'then' S does not suffer from the dangling-else problem.

  • A PEG can have an and-predicate. This predicate works as follows: a rule A ← b &c means that A derives b if, and only if, b is immediately followed by a c (the c is not consumed).

  • A PEG can have a not-predicate. This predicate works as follows: a rule A ← b !c means that A derives b if, and only if, b is not immediately followed by a c.

In this way, the first difference reduces the recognition power of PEGs relative to CFGs, and the other two increase it. As a result there are languages that are PEG but are not context-free, such as the language {aⁿbⁿcⁿ | n > 0}:

S ← &(A 'c') 'a'+ B !('a'/'b'/'c')
A ← 'a' A? 'b'
B ← 'b' B? 'c'

On the other hand, PEGs are not able to recognize inherently ambiguous context-free languages, and also some that are not ambiguous.

The advantages of PEGs are:

  • PEGs are simple to define and implement. The implementation is similar to that of an LL parser, only slightly more sophisticated in some respects: it is necessary to backtrack in the input when dealing with the and- and not-predicates, and for the implementation to be efficient it is necessary to implement memoization. Error recovery is handled similarly.
  • PEGs are free of ambiguities.
  • PEGs are recognized (with the Packrat algorithm) in time linear in the input size (as long as there is no left recursion).
  • PEGs are a superset of all regular grammars and of LR(k) and LL(k) grammars, for any value of k.
  • Programming languages implemented with PEGs usually do not need to separate lexical analysis from syntactic analysis: everything can be done in a single grammar over the characters of the source code, eliminating the need for tokenization. Not that this is necessarily impossible with LR or LL, but parsers for LR and LL languages usually have a separate tokenization step, while this is not usually considered necessary with PEGs.

A naive implementation of a PEG recognizer would take exponential time in the worst case, mainly because of the and- and not-predicates, which involve a lot of backtracking, and it would not be able to handle left recursion. Using memoization (which is what the Packrat algorithm does), the time can be reduced to linear, and the left-recursion problems can be eliminated with guards/sentinels that detect them.
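As a rough illustration of how Packrat memoization works (my own sketch, not code from the original answer), the grammar for {aⁿbⁿcⁿ | n > 0} shown above can be recognized like this: each non-terminal becomes a function from an input position to a new position (or None on failure), and memoizing those functions is what keeps the running time linear. Here lru_cache stands in for the memo table and the input is a fixed module-level string, just to keep the sketch short.

from functools import lru_cache

TEXT = "aabbcc"                  # sample input from the language { a^n b^n c^n | n > 0 }

def lit(pos, ch):                # match one literal character, return the next position
    return pos + 1 if pos < len(TEXT) and TEXT[pos] == ch else None

@lru_cache(maxsize=None)         # memoization: the "packrat" part
def A(pos):                      # A <- 'a' A? 'b'
    p = lit(pos, 'a')
    if p is None:
        return None
    q = A(p)                     # optional inner A
    if q is not None:
        p = q
    return lit(p, 'b')

@lru_cache(maxsize=None)
def B(pos):                      # B <- 'b' B? 'c'
    p = lit(pos, 'b')
    if p is None:
        return None
    q = B(p)
    if q is not None:
        p = q
    return lit(p, 'c')

def S(pos):                      # S <- &(A 'c') 'a'+ B !('a'/'b'/'c')
    p = A(pos)                   # &-predicate: try "A 'c'" but consume nothing
    if p is None or lit(p, 'c') is None:
        return None
    p = pos
    while lit(p, 'a') is not None:      # 'a'+
        p = lit(p, 'a')
    if p == pos:
        return None
    p = B(p)
    if p is None:
        return None
    if any(lit(p, c) is not None for c in 'abc'):   # !-predicate
        return None
    return p

print(S(0) == len(TEXT))         # True: "aabbcc" belongs to the language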

In addition, there are already several implementations of PEG parsers in various programming languages, and a quick Google search will bring up many results, including parser generators.

There is only one major drawback: the memory consumption of a Packrat parser is higher than that of an LL or LR parser. This is because of the heavy use of memoization needed to guarantee linear time.

An important observation about PEGs is that they are a relatively new approach. PEGs were introduced by Bryan Ford in 2002, which in the area of formal languages is very recent - most of formal language theory was created in the 1960s and 1970s. For this reason you will hardly find any reference to PEGs in compiler textbooks, and it is also why you will rarely see people using this approach.

14

LL and LR can be contrasted on a number of criteria:

Complexity

LL wins here, easily. You can comfortably write an LL parser by hand. In fact, it is commonly done: the Microsoft C# compiler is a hand-written recursive-descent parser (source here, look for a comment by Patrick Kristiansen - the blog is very interesting too).

LR parsing uses a rather counter-intuitive method to parse text. It works, but it takes some time to understand exactly how. Writing such a parser by hand is therefore difficult: you would be more or less implementing an LR parser generator.

Generality

LR wins here: all LL languages are LR languages, but there are more LR languages than LL languages (a language is LL if it can be parsed with an LL parser, and a language is LR if it can be parsed with an LR parser).

LL has some nuisances that will get in the way of implementing virtually any programming language. See the Wikipedia article for an overview.

There are unambiguous languages that are not LR languages, but they are very rare; you will almost never come across them. However, LALR has some problems:

  • LALR is more or less a "gambiarra" (a hack) applied to LR parsers to get smaller tables. The tables for an LR parser can grow enormous. LALR parsers give up the ability to parse all LR languages in exchange for smaller tables. Most LR parsers actually use LALR (although you can usually find out exactly which variant is implemented).

LALR can have difficulties with shift-reduce and reduce-reduce conflicts. These are caused by the table hack: it "merges" similar entries, which works because most entries are empty, but when they are not empty a conflict appears. These kinds of errors are not natural, they are difficult to understand, and the fixes usually end up rather strange.

Compiler errors and error recovery

LL wins here. With LL parsing it is usually very easy to emit useful compiler errors, especially with hand-written parsers. You know what you are expecting next, so if that is not what arrives you usually know what went wrong and what the most sensible error message is.

Also, with LL parsing, error recovery is much easier. If part of the input does not parse correctly, you can try to skip a little ahead and check whether the rest of the input parses. If, for example, some statement of the program is incorrect, you can move on and parse the next statement, so you can report more than one error.
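A tiny sketch of this "skip ahead" recovery (my own illustration, with a made-up one-statement grammar): on a syntax error the parser discards tokens until a likely statement boundary and continues, so later statements can still be checked.

def parse_program(tokens):
    errors, i = [], 0
    while i < len(tokens):
        try:
            i = parse_statement(tokens, i)
        except SyntaxError as e:
            errors.append(f"error near token {i}: {e}")
            while i < len(tokens) and tokens[i] != ';':   # panic mode: skip to ';'
                i += 1
            i += 1                                        # also skip the ';'
    return errors

def parse_statement(tokens, i):
    # made-up statement grammar: statement -> NAME '=' NUMBER ';'
    if (i + 3 < len(tokens) and tokens[i].isidentifier()
            and tokens[i + 1] == '=' and tokens[i + 2].isdigit()
            and tokens[i + 3] == ';'):
        return i + 4
    raise SyntaxError("expected 'name = number ;'")

print(parse_program(['x', '=', '1', ';', 'y', '+', '2', ';', 'z', '=', '3', ';']))
# ["error near token 4: expected 'name = number ;'"]  -- the statements before and after still parse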

With an LR parser this is much more difficult. You can try to augment your grammar so that it accepts the erroneous input and print errors at the places where things went wrong, but this is usually very hard to do. The chance of ending up with a non-LR (or non-LL) grammar also increases.

Speed

Speed is not really an issue that depends much on whether you parse your input with LL or LR, but rather on the quality of the generated code and on the use of tables (tables can be used for both LL and LR). LL and LR are therefore comparable in this respect.


10

Some notes on this question and the answers:

Note 1: the top-down/LL techniques are being treated slightly unfairly:

Usually top-down tools accept EBNF grammars (they are easy to implement top-down).

Coverage: it is easy to get coverage similar to that of LR techniques at the cost of using EBNF.

Thus, the left-factorization and left-recursion issues are handled by writing the grammar differently (iteratively). E.g.:

a --> a OPMul f
  | f

is written as

a --> f ( OPMul f)*
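In a hand-written top-down parser, that iterative form maps directly onto a loop; a minimal sketch in Python (my own illustration, with '*' standing in for OPMul and a placeholder for f):

def parse_a(tokens):
    node = parse_f(tokens)                 # first f
    while tokens and tokens[0] == '*':     # (OPMul f)*
        tokens.pop(0)
        node = ('*', node, parse_f(tokens))
    return node

def parse_f(tokens):
    return tokens.pop(0)                   # placeholder: a real f would recurse further

print(parse_a(['2', '*', '3', '*', '4']))  # ('*', ('*', '2', '3'), '4'): left-associative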

Tools such as ANTLR use LL(k) with potentially unbounded k, are used to write real compilers, make many real grammars available through their communities, and cover more than just parsing. ANTLR4 even accepts (a good subset of) left recursion.

Top-down methods make it easier to implement inherited attributes. (OK, OK, with LR you can do it too.)

Note 2: Bottom-up is not just Yacc

Remember that Bison nowadays generates not only LALR parsers but also LR(1), IELR and GLR.

Many bottom-up tools include directives tied to "operator precedence grammars" (definition of operators, priorities and associativities), which greatly helps the writing of certain types of processors (for example, it is excellent for expressions).

There are many more bottom-up parser-generating tools!

Note 3: more than a parser is needed (choice of the generated language)

Besides syntactic recognition, we have to build the semantic and pragmatic levels: we should choose a generator that produces code in a language we like!

For example, there are Yacc-like variants for almost every language out there.

I like Perl: using Parse::Yapp (which generates LALR parsers) I can write semantic actions in Perl much more quickly than in C or Java. (This point is critical.)

There are also tools connected to: (1) traversal and decoration of parse trees; (2) code generation/optimization; (3) templating.

Note 4: creating DSLs is more frequent than writing complete compilers
