Should I use a parser generator or should I write my own code to parse and lex?


Obviously the question should not be taken literally. I do not want anyone to decide for me; I need to know the advantages and disadvantages of using one method or the other.

When should I use a parser generator, and when should I develop my code from scratch?

What do I gain or lose by choosing one approach or the other?

  • Related: http://answall.com/questions/2044/quais-as-principais-vantagens-e-desvantagens-de-se-usar-um-parser-ll-ou-um-lr

2 answers



In fact there are three options, each preferable in different situations. The third can be seen in the excellent original answer on Software Engineering by Alex ten Brink.

Option 1: parser generators, or "you need to parse some language and you just want to make it work, nothing more"

You need to build a parser for some old data format right now. Or you need your parser ready quickly. Or you need your parser to be easy to maintain.

In such cases, it is probably best to use a parser generator. You don't have to mess with the details, you don't need a lot of complicated code to get things working properly: you just write the grammar the input text must adhere to, write some handling code, and you're done: instant parser.

The advantages are clear:

 - It is (usually) very easy to write a specification, especially if the input format is not too weird (option 2 would be better if it is).
 - You end up with a product that is very easy to maintain and easy to understand: a grammar definition usually reads much more naturally than code.
 - Parsers generated by good parser generators are usually much faster than hastily handwritten code. Handwritten code can be faster, but only if you know your material well - which is why most compilers use a handwritten recursive descent parser.

There is something you have to be careful about with parser generators: they can sometimes reject your grammar. For an overview of the different types of parsers and how they can bite you, you can start here. Here you can find an overview of a number of implementations and the types of grammars they accept.
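As a rough illustration of option 1, here is a minimal sketch in Python using the third-party lark library (the toy arithmetic grammar, rule names, and input are my own assumptions, not something from the question): the grammar is declared as data and the library builds the parser for you.

```python
# Minimal parser-generator sketch (assumes the third-party "lark" package is
# installed: pip install lark). The grammar is a toy arithmetic language,
# chosen only to show how little code is needed besides the grammar itself.
from lark import Lark

grammar = r"""
    ?start: expr
    ?expr: expr "+" term   -> add
         | expr "-" term   -> sub
         | term
    ?term: term "*" factor -> mul
         | factor
    ?factor: NUMBER
           | "(" expr ")"

    %import common.NUMBER
    %import common.WS
    %ignore WS
"""

parser = Lark(grammar)                  # the generator does the heavy lifting
tree = parser.parse("1 + 2 * (3 - 4)")
print(tree.pretty())                    # a parse tree, ready for your handling code
```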

Option 2: handwritten parsers, or "you want to build your own parser, and you care about being user-friendly"

Parser generators are good, but they are not very friendly to the end user of the compiler (as opposed to the compiler writer). You usually cannot give good error messages, nor can you provide error recovery. Maybe your language is strange enough that parser generators reject your grammar, or you need more control than a generator gives.

In these cases, using a hand-written recursive descent parser is probably the best way. Although getting it right can be tricky, you have full control over your parser, so you can do all kinds of cool things that you cannot do with parser generators, such as good error messages and even error recovery (try removing all the semicolons from a C# file: the C# compiler will complain, but it will still detect most other errors regardless of the missing semicolons).
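To make option 2 concrete, here is a small hand-written recursive descent parser sketch in Python (a toy arithmetic grammar of my own, not code from the answer). Note how the error message in expect() is entirely under your control, which is exactly what generated parsers make difficult.

```python
# A minimal hand-written recursive descent parser sketch (hypothetical example).
import re

TOKEN_RE = re.compile(r"\s*(?:(\d+)|(.))")  # numbers or single-char operators

def tokenize(text):
    tokens = []
    for number, op in TOKEN_RE.findall(text):
        tokens.append(("NUM", int(number)) if number else ("OP", op))
    tokens.append(("EOF", None))
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos]

    def expect(self, kind, value=None):
        tok = self.tokens[self.pos]
        if tok[0] != kind or (value is not None and tok[1] != value):
            # Full control over the error message: the main selling point here.
            raise SyntaxError(f"expected {value or kind}, found {tok[1]!r} at token {self.pos}")
        self.pos += 1
        return tok

    def parse_expr(self):          # expr := term (('+'|'-') term)*
        value = self.parse_term()
        while self.peek() in (("OP", "+"), ("OP", "-")):
            op = self.expect("OP")[1]
            rhs = self.parse_term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def parse_term(self):          # term := factor ('*' factor)*
        value = self.parse_factor()
        while self.peek() == ("OP", "*"):
            self.expect("OP", "*")
            value *= self.parse_factor()
        return value

    def parse_factor(self):        # factor := NUM | '(' expr ')'
        if self.peek()[0] == "NUM":
            return self.expect("NUM")[1]
        self.expect("OP", "(")
        value = self.parse_expr()
        self.expect("OP", ")")
        return value

print(Parser(tokenize("1 + 2 * (3 + 4)")).parse_expr())  # -> 15
```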

Hand-written parsers also generally perform better than generated ones, assuming the parser's quality is high enough. On the other hand, if you cannot write a good parser - usually due to some combination of lack of experience, knowledge, or proper design - performance is usually worse. For lexers the opposite is true: generated lexers usually use lookup tables, making them faster than (most) handwritten ones.
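The lookup-table point can be illustrated with a toy sketch (my own illustration, ASCII only): the lexer classifies each character by indexing a precomputed table instead of re-testing it against a chain of conditions, a much simplified version of the transition tables that generated lexers compile their regular expressions into.

```python
# Toy table-driven lexer sketch (hypothetical, ASCII-only illustration).
CHAR_CLASS = ["other"] * 256
for c in "0123456789":
    CHAR_CLASS[ord(c)] = "digit"
for c in "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_":
    CHAR_CLASS[ord(c)] = "letter"
for c in " \t\r\n":
    CHAR_CLASS[ord(c)] = "space"

def lex(text):
    tokens, i = [], 0
    while i < len(text):
        cls = CHAR_CLASS[ord(text[i])]
        start = i
        if cls == "space":
            i += 1
        elif cls == "letter":
            # identifiers: a letter followed by letters or digits
            while i < len(text) and CHAR_CLASS[ord(text[i])] in ("letter", "digit"):
                i += 1
            tokens.append(("NAME", text[start:i]))
        elif cls == "digit":
            while i < len(text) and CHAR_CLASS[ord(text[i])] == "digit":
                i += 1
            tokens.append(("NUMBER", text[start:i]))
        else:
            tokens.append(("SYMBOL", text[i]))
            i += 1
    return tokens

print(lex("x1 + 42"))  # [('NAME', 'x1'), ('SYMBOL', '+'), ('NUMBER', '42')]
```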

Writing your own parser will also teach you more than using a generator: after all, you will have to write more complicated code and understand exactly how the language is analyzed. On the other hand, if you want to learn how to create your own language (i.e. gain language-design experience), option 1 is preferable: if you are designing a language, it will probably change a lot, and option 1 makes that easier.

There is an additional objective reason for doing it yourself: creating the parser and lexer is only a fraction of the work required to create a really useful language. Since manual creation has other advantages, worrying about the time saved by a generator is almost pointless in most cases. Relevant information can be found in an article by Walter Bright, creator of the D language, in Dr. Dobb's.

Finally, there is a subjective point. There are several reports, including the original answer on Software Engineering, that doing it by hand is endless fun, at least for those who enjoy the subject.

Option 3: handwritten parser generators, or "you're trying to learn a lot from this project and you wouldn't mind ending up with a bunch of handy code that you can keep reusing"

Although useful, it is a little outside the scope of the question. Anyone interested can look at the original answer.

  • Although I have posted an answer, the others can be quite useful. In fact, the original post on Programmers has other very interesting answers describing good experiences. What is yours?

  • luaparse is a great example of a handwritten parser; you can learn a lot just from reading its code. I have even managed to modify it to support things not implemented in Lua. esprima is also a great example and is easy to maintain. I fully agree with option 2.


Generally speaking, a domain-specific tool will be easier to use, and potentially more efficient, than a general-purpose tool. That is the logic behind "domain-specific languages" (DSLs), and also the reason why hardly anyone "programs a game" directly, preferring instead to program a game engine and then develop the game on top of that engine. Likewise, implementing a parser from scratch in a general-purpose programming language will most likely be harder than doing so with a parser generator.

On the other hand, it may be that the existing parser generators - or the languages they use to represent grammars (formal grammars) - are not the most appropriate abstraction for what you want to do. Sometimes the problem is so simple that a full parse is not justified and a few simple replacements suffice (e.g. Markdown, used on this very site, is implemented largely as a sequence of regex substitutions). Other times the desired features are unorthodox, and expressing them as simple production rules becomes impractical (e.g. in Prolog the operator table is not fixed, so during parsing itself you can encounter directives that change it, taking effect for the rest of the source code). The target platform may also simply not have many good ready-made generators (e.g. in a personal project I had to write a parser in JavaScript, since the desired system characteristics ruled out any server-side code).
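For the "few simple replacements" case, here is a sketch of the idea in Python (the rules below are illustrative inventions, not the actual Markdown implementation): each pattern is applied in sequence, and no grammar or parse tree is involved.

```python
# Sketch of the "sequence of regex substitutions" approach (hypothetical rules).
import re

RULES = [
    (re.compile(r"\*\*(.+?)\*\*"), r"<strong>\1</strong>"),       # **bold**
    (re.compile(r"\*(.+?)\*"), r"<em>\1</em>"),                   # *italic*
    (re.compile(r"\[(.+?)\]\((.+?)\)"), r'<a href="\2">\1</a>'),  # [text](url)
]

def render(text):
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text

print(render("See the **docs** at [the site](http://example.com)"))
# -> See the <strong>docs</strong> at <a href="http://example.com">the site</a>
```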

If you are in one of the situations above - or if your goal is really to learn more about how a compiler works (in my opinion, an exercise that greatly broadens your understanding of how a computer works under the hood) - then it may be worth writing your parser by hand. I only advise making it as modular and extensible as possible, almost to the point of writing a "compiler of compilers". That is, avoid hardcoded instructions, preferring to parameterize everything feasible: the lexer, the operator table, and so on. That way, although the result is not a "generic" parser (i.e. one able to interpret a large subset of context-free languages), it can still be extended to interpret languages more or less similar to yours, but with subtle differences. A sketch of this idea follows.
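As a sketch of that "parameterize everything" advice, here is a toy precedence-climbing parser (my own illustration) whose operator table is plain data, so changing or extending the language's operators requires no change to the parsing code, much like the mutable Prolog operator table mentioned above:

```python
# Toy precedence-climbing parser with a data-driven operator table (hypothetical).
OPERATORS = {          # operator -> (precedence, function)
    "+": (1, lambda a, b: a + b),
    "-": (1, lambda a, b: a - b),
    "*": (2, lambda a, b: a * b),
    "/": (2, lambda a, b: a / b),
}

def parse(tokens, operators, min_prec=1):
    """Evaluate a token list like [3, '+', 4, '*', 2], consuming it left to right."""
    left = tokens.pop(0)   # operand (already a number in this toy example)
    while tokens and tokens[0] in operators and operators[tokens[0]][0] >= min_prec:
        prec, func = operators[tokens.pop(0)]
        right = parse(tokens, operators, prec + 1)
        left = func(left, right)
    return left

print(parse([3, "+", 4, "*", 2], OPERATORS))      # 11
# Extending the language is a data change, not a code change:
OPERATORS["^"] = (3, lambda a, b: a ** b)
print(parse([2, "*", 3, "^", 2], OPERATORS))      # 18
```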

P.S. There are cases where a general-purpose programming language has features of its own that make it easier to write parsers and/or create DSLs that fit certain patterns. See the reader macros of Lisp, the definite clause grammars of Prolog (excellent for round-trip engineering), and, for simpler cases, the Groovy language.
