How is a compiler made?


What language is a compiler written in? Are all compilers for a given language equal, or can one be better than another?

1 answer



To keep this from getting too broad I will give a summary (yes, this really is a summary :P), and I will make some simplifications. More specific questions would certainly fit to go deeper into some of these points.

Various techniques

Compilers, like any application, can be built in many ways. Several well-studied techniques are known to work well, but there is no consensus on which is best, and each one may suit a different language and goal better. This is one of the most studied subjects in computing.

What is known is that this is a very specific problem that attracts many people, and there are always novelties: recently there was a revolution in the way compilers are built, although the foundations remain the same.

Some examples of ways to implement one of the main stages of compilation can be found in another question of mine.

The parser can be written by hand (most good compilers do this) or generated by specialized software, as also described in another question.

Choosing the right technique can make a huge difference in speed, quality of error detection, ease of maintenance, etc.

There are variations in how the tree analysis is done, for example starting from the top down (top-down) or from the bottom up (bottom-up).

Compilers make extensive use of recursion. This is one of the domains where recursion is genuinely useful.
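To illustrate why recursion fits so naturally here, consider a hypothetical top-down sketch (not taken from any real compiler): a grammar rule that references itself becomes a function that calls itself. The rule here is `expr → NUMBER | "(" expr ")"`:

```python
# Hypothetical sketch of top-down recursion: parse a number wrapped in
# arbitrarily many parentheses, e.g. "((42))". The grammar rule
#   expr -> NUMBER | "(" expr ")"
# maps directly onto a function that calls itself.
def parse_expr(text, pos=0):
    if text[pos] == "(":
        value, pos = parse_expr(text, pos + 1)  # recurse into the inner expr
        assert text[pos] == ")", "expected closing parenthesis"
        return value, pos + 1
    # otherwise read a number literal
    end = pos
    while end < len(text) and text[end].isdigit():
        end += 1
    return int(text[pos:end]), end

print(parse_expr("((42))")[0])  # 42
```

Each level of nesting in the input corresponds to one level of recursion in the parser, which is exactly why this family of parsers is called recursive descent.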

Languages used

They can be written in any language, like any problem that can be solved computationally. The compiler for a language can even be written in that language itself.

Wait, how can that be possible? Isn't that the chicken-and-egg problem?

This is called bootstrapping. The compiler for a language starts out written in another language and then, gradually or at the end of the process, is rewritten in the very language it compiles. The problem is solved the same way as the chicken and the egg: there was no first one, there was mutation (the religious view explains nothing, i.e. it is what today we call a "best practice": follow what I said and stop asking).

One day there was only machine code; it was used to write an assembler for assembly language (which is similar to a compiler, only simpler), which was used to create the first mainstream high-level language, Fortran, which was later used to create other compilers, and so on.

An example of this is the C# compiler, which was originally written in C++ and more recently rewritten in C#. C obviously did this, and many languages today do the same, even to show how capable they are. Low-performance languages avoid doing it, because compilers need to be fast. By the way, most compilers are written in C or C++, since they are powerful, ubiquitous languages that allow high performance.

How it works

The compiler takes text, reads it character by character, tries to find patterns recognizable by a pre-established grammar, analyzes whether everything makes sense, and generates a set of data that will allow the executable program to be created. There is no great secret; the basic idea is very simple. Of course, the relations between all the elements of the language start to complicate things. A compiler's complexity is mainly due to the complexity of the language.

The compilation phases can vary from compiler to compiler, as can whether it is done in one or more passes (phase and pass are different things: the phases may be distinct yet still occur in a single pass). This depends on the compiler's architecture and the language's needs. This is why writing an assembler is much simpler than writing a compiler: an assembler works with very simple rules.

The compiler is often divided into a front-end (which deals with the source language) and a back-end (which handles the target platform), and sometimes even a middle-end, splitting the phases further.

[Image: compilation flow]

Lexical analysis

The basic analysis (scanning), in which the compiler tries to find certain patterns, is called lexical analysis. It searches for recognizable tokens. These tokens are elements that can be used to produce some understanding of the code. Examples of tokens: operators, keywords, identifiers, literals, etc. Finding text fragments that cannot be considered tokens already allows the compiler to generate a lexical error.

At this stage each character is read to see whether it can be part of something recognizable. For example:

  • it finds an i and doesn't know what to do with it yet;
  • then it finds an f; it still can't be sure this is something useful;
  • then it finds a space, and now it can tell that the preceding if is probably a keyword;
  • next it finds an x, and doesn't know what to do with it yet;
  • it finds a = and still can't make a decision; since there is a possible identifier before it, a broader analysis is needed to be sure;
  • it finds another = and still can't be sure what this is about;
  • it finds a 1; now it knows that what it found before was an operator, and it can even identify which one;
  • finally it finds a line break, and it can be sure that the 1 was a numeric literal.
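The steps above can be sketched with a tiny, hypothetical scanner (real lexers are usually hand-written state machines or generated, but the idea is the same: classify runs of characters into tokens):

```python
import re

# Hypothetical lexer sketch: token categories and the patterns that
# recognize them. Order matters: "if" must be tried before identifiers,
# and "==" before "=".
TOKEN_SPEC = [
    ("KEYWORD",  r"\bif\b"),
    ("NUMBER",   r"\d+"),
    ("IDENT",    r"[A-Za-z_]\w*"),
    ("OPERATOR", r"==|="),
    ("NEWLINE",  r"\n"),
    ("SKIP",     r"[ \t]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(code):
    # A real lexer would report a lexical error on unmatched characters;
    # this sketch simply skips them.
    tokens = []
    for match in MASTER.finditer(code):
        kind = match.lastgroup
        if kind not in ("SKIP", "NEWLINE"):
            tokens.append((kind, match.group()))
    return tokens

print(tokenize("if x == 1\n"))
# [('KEYWORD', 'if'), ('IDENT', 'x'), ('OPERATOR', '=='), ('NUMBER', '1')]
```

Note how the ambiguity described above (is `=` an assignment or half of `==`?) is resolved here simply by trying the longer pattern first.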

Preprocessing

Here there may be a prior processing step, which is a very simplified translation of some tokens into others. It is a very simple substitution, usually without any further checks.
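A hypothetical sketch of such a substitution, in the spirit of a C-style `#define` (the macro table below is an invented example): tokens are swapped one for another with no deeper checks.

```python
# Hypothetical preprocessing sketch: replace defined tokens before the
# real analysis starts, without checking whether the result makes sense.
DEFINES = {"MAX": "100", "PI": "3.14"}  # assumed macro table

def preprocess(tokens):
    return [DEFINES.get(tok, tok) for tok in tokens]

print(preprocess(["x", "=", "MAX"]))  # ['x', '=', '100']
```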

Syntactic analysis

Then comes syntactic analysis (parsing), which organizes these tokens into constructions that can be understood in isolation. It uses the divide-and-conquer method. Usually a tree of tokens, called an AST (abstract syntax tree), is built. In this process it is verified whether each construction found is valid for the language. If it is not, the compiler has to generate a syntax error.

Good compilers do a deeper analysis in order to generate more accurate errors that help the programmer understand what is wrong without causing confusion. Among these analyses, the compiler checks whether a place that expects a variable actually has one, whether an expected expression is really there, etc.

Using the previous example, at this stage the compiler will see whether the if fits in that place, whether an expression follows it, as it should, check whether the expression has a correct shape (for example, whether it has an operator with an operand on each side), and check that nothing else is getting in the way. Of course, this is a huge simplification of the process.
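A minimal sketch of this stage (hypothetical names, drastically simplified grammar): given the tokens from the scanning example, the parser checks the expected shape of the `if` and builds a small tree, or raises a syntax error if the shape is wrong.

```python
# Hypothetical parser sketch: recognize "if <ident> == <number>" and
# build a tiny AST. Anything else is rejected with a syntax error.
def parse_if(tokens):
    kinds = [kind for kind, _ in tokens]
    if kinds != ["KEYWORD", "IDENT", "OPERATOR", "NUMBER"]:
        raise SyntaxError("malformed if statement")
    _, (_, name), (_, op), (_, num) = tokens
    # AST node: ("if", condition); condition: (op, left, right)
    return ("if", (op, ("var", name), ("lit", int(num))))

tokens = [("KEYWORD", "if"), ("IDENT", "x"), ("OPERATOR", "=="), ("NUMBER", "1")]
print(parse_if(tokens))  # ('if', ('==', ('var', 'x'), ('lit', 1)))
```

The nested tuples are the divide-and-conquer at work: the `if` node only knows it has a condition; the condition node only knows it has an operator and two operands.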

Parsing

The idea here is to use good old divide and conquer, which is used for everything in computing and is the best way to solve most problems, even for humans (algorithms serve humans as well), though many people do not realize it (solve one part at a time); unless the problem is division itself, in which case union is the solution. Sorry for the off-topic, but some humans have solved the "problem" of the masses with divide and conquer, political and religious for example, so it is getting easier for them to run the world: we are increasingly divided.

Semantic analysis

Then comes semantic analysis, which checks whether the relations between the tokens are as the language expects. It analyzes whether the data types are suitable in each place, whether an operation is possible, whether anything needed to complete an operation is missing, etc. If any of this fails, the compiler will generate a semantic error. At this stage, additional information is also collected to help with optimization and code generation, and the syntax tree may be reorganized and simplified.

Following the previous example, one of the many analyses made here is whether the condition expression results in a boolean, which is required for every if. It will also verify that the operator == can be used with x, according to its type, and with the numeric literal. Note that these checks, although related and dependent on one another, are done separately.

If there were multiple operators, precedence would be resolved, which would change the tree.
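For instance (a hypothetical sketch, using the same tuple-based trees as before), precedence decides which operator becomes the root of the subtree:

```python
# Hypothetical sketch: build an AST for "2 + 3 * 4" where "*" binds
# tighter than "+", so the tree is ('+', 2, ('*', 3, 4))
# and NOT ('*', ('+', 2, 3), 4).
def parse(tokens):
    def term(pos):                       # handles "*" (higher precedence)
        node, pos = int(tokens[pos]), pos + 1
        while pos < len(tokens) and tokens[pos] == "*":
            right, pos = int(tokens[pos + 1]), pos + 2
            node = ("*", node, right)
        return node, pos

    node, pos = term(0)                  # then "+" (lower precedence)
    while pos < len(tokens) and tokens[pos] == "+":
        right, pos = term(pos + 1)
        node = ("+", node, right)
    return node

print(parse(["2", "+", "3", "*", "4"]))  # ('+', 2, ('*', 3, 4))
```

Encoding each precedence level as its own function is the standard trick in recursive-descent parsers: higher-precedence operators end up deeper in the tree automatically.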

Example of an abstract syntax tree:

[Image: abstract syntax tree]

Typing analysis

It may be a variation of semantic analysis. In languages with complex typing it may be useful to keep this as a separate phase.

Code optimization

An optional next stage is optimization. Here some changes can be made to the syntax tree (although this phase can occur at another time and modify another representation). These changes aim to make the code better, saving processing and memory: performing the same operation and obtaining the same result while making better use of the target machine's processing capacity.

See commonly used optimization techniques.
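One classic example is constant folding (a hypothetical sketch over the same tuple-based AST used above): an operation whose operands are already known is computed at compile time, shrinking the tree.

```python
import operator

# Hypothetical constant-folding sketch: AST nodes are (op, left, right)
# tuples with int leaves. Subtrees whose operands are both constants
# are evaluated now, so the generated program never computes them.
OPS = {"+": operator.add, "*": operator.mul}

def fold(node):
    if isinstance(node, int):
        return node
    op, left, right = node
    left, right = fold(left), fold(right)
    if isinstance(left, int) and isinstance(right, int):
        return OPS[op](left, right)      # computed at "compile time"
    return (op, left, right)

print(fold(("+", 2, ("*", 3, 4))))  # 14
```

A tree containing a variable, such as `("+", ("var", "x"), 1)` in this notation, would be left alone, since its value is only known at run time.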

Code generation

The last phase is code generation. From the analyzed and transformed syntax tree it is now possible to generate the output, which in general is code in another language, such as assembly, or machine code for a specific processor. In general it is a sequence of bytes that the processor or a virtual machine understands.

Using our example above, hypothetically, it could generate an instruction for the processor to perform a numeric comparison (CMP) followed by a conditional jump (JZ).
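A hypothetical sketch of that step: walking the `if` AST from earlier and emitting pseudo-assembly (the instruction names and operand syntax are assumptions for illustration, loosely inspired by x86; here JNZ skips the body when the comparison fails).

```python
# Hypothetical code-generation sketch: walk the AST for "if x == 1"
# and emit pseudo-assembly.
def gen_if(ast, body_label="then", end_label="end_if"):
    # unpack ("if", ("==", ("var", name), ("lit", value)))
    _, (_, (_, name), (_, value)) = ast
    return [
        f"CMP {name}, {value}",   # numeric comparison
        f"JNZ {end_label}",       # jump past the body if not equal
        f"{body_label}:",         # body instructions would go here
        f"{end_label}:",
    ]

ast = ("if", ("==", ("var", "x"), ("lit", 1)))
for line in gen_if(ast):
    print(line)
```

This is the point where all the earlier phases pay off: by the time we reach code generation, the tree is guaranteed well-formed, so emitting output is a simple traversal.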

In some cases this generation produces an intermediate code, which can then go through another round of optimization before the final code is generated. This technique helps a lot when the compiler must be built to target several different platforms.

See an example of a compiler flow.

Other phases can be used; there is no rule that compilers need to be like this. Modern compilers are more modular, so that other tools that need just some of these phases can reuse them. This is great for syntax highlighters, static analyzers, refactoring tools, AOP tools, code generators, REPLs, formatters, etc.

Variations

When a compiler generates code for another high-level language in this last phase, it is usually called a transpiler.

When this phase executes commands instead of generating binary or source code in another language, it acts as an interpreter. Normally interpretation works a little differently, since compilation and execution occur in small blocks. But this has been changing; interpreters are benefiting from more global analysis.

I haven't even mentioned other kinds, such as the Just-in-time (JIT) compiler.

Conclusion

Every programmer should learn how a compiler works, even if they never write one. There is a huge difference in quality between the programmer who understands this (and other subjects) and the one who does not. You don't need to be a deep expert, but you need to know this in order to use your day-to-day tools better. And you need to avoid cargo-cult programming.

I see a lot of programmers suffering and doing bad things because they don't understand the workings of mathematics, of languages, of science, of the computer, of the operating system, of compilers, of data structures and algorithms, of paradigms, of methodologies, etc. And the worst thing is that some, even knowing none of this, think they already know everything they need and feel no need to learn more, to improve. Do you want to know what you're doing? Be dedicated, and don't assume you already know everything. There is always something to learn. And compilation is one of the most fascinating areas of computing.

References

  • Additional information
  • Wikipedia
  • The canonical book on the subject is the Dragon Book (considered obsolete by many).
  • Example of a grammar - The C language - Example of a compiler
  • Modern compiler construction [video]

  • Excellent "summary"!! hahaha It's even worth turning into a wiki.

  • @bigown In this case, wouldn't the program be the "assembler", and "Assembly" the language? I may be wrong, not sure...

  • @Magichat precisely.

  • The Dragon Book is good, but it leaves much to be desired on the compiler's back-end. I recently read Engineering a Compiler and it's quite complete, going from A to Z. I believe it's a great book for anyone who wants to study the subject more deeply.
