How to Make a Simple Assembly Assembler

Viewed 1,832 times

4

I’m learning about processor architecture and I intend to write an assembler. What are the steps for writing a program that translates assembly source code into machine code?

The assembler can target ARM or 8086 and can be written in C.

  • 2

    Welcome to Stack Overflow, Anderson. Take the tour to better understand how our community works. What exactly is your question on the subject? See how to ask a good question.

  • 5

    Is it about writing the assembly grammar? Writing the parser? Emitting the binary code? Linking more than one file? Saving the result in an executable format? What exactly is your question?

  • My main question is how to write the parser.

1 answer

6

What you are asking for is a simple compiler: a program that takes source code as input and produces machine code.

Structure of a compiler

The compilation process is normally divided into the following steps:

  • Lexical analysis.
  • Syntactic analysis.
  • Semantic analysis.
  • Intermediate code generation.
  • Object code generation.
  • Code optimization.

Typical compilers often handle a few more things as well:

  • Register allocation.
  • Error handling and recovery.
  • Type checking and inference.
  • Dependency management and/or linking.

However, since your project is not a major commercial product and the compilation process only has to translate each assembly instruction 1-to-1 into a machine code instruction, the structure of your compiler will be much leaner.


Starting up your compiler:

First, choose a small set of instructions that your compiler will accept. Start small and then grow. In your code, create a struct or something similar to represent an instruction. Basically, this structure will have one field for the type of the instruction and other fields for its parameters/arguments/operands. With a little extra effort, you can fit labels and directives into this struct as well.
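As a rough illustration of the struct described above, here is a minimal sketch. Every name in it (Instruction, Operand, the enum values, make_mov_reg_imm) and the chosen instruction subset are my own assumptions, not something fixed by any architecture; adapt them to whatever you decide to support.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_OPERANDS 3
#define MAX_NAME 32

/* What a source line turned out to be. */
typedef enum { LINE_INSTRUCTION, LINE_LABEL, LINE_DIRECTIVE } LineKind;

/* Hypothetical starting subset of instructions. */
typedef enum { OP_NONE, OP_MOV, OP_ADD, OP_PUSH, OP_POP } Opcode;

typedef enum { OPERAND_REGISTER, OPERAND_IMMEDIATE, OPERAND_LABEL } OperandKind;

typedef struct {
    OperandKind kind;
    int reg;               /* register number, for OPERAND_REGISTER */
    int32_t value;         /* constant, for OPERAND_IMMEDIATE */
    char name[MAX_NAME];   /* referenced label, for OPERAND_LABEL */
} Operand;

typedef struct {
    LineKind kind;
    Opcode opcode;                   /* OP_NONE for labels and directives */
    char name[MAX_NAME];             /* label or directive name, if any */
    Operand operands[MAX_OPERANDS];
    int operand_count;
    int source_line;                 /* kept for error messages */
} Instruction;

/* Helper for this sketch: builds "MOV <register>, <immediate>". */
static Instruction make_mov_reg_imm(int reg, int32_t value, int source_line) {
    assert(reg >= 0);
    Instruction ins;
    memset(&ins, 0, sizeof ins);
    ins.kind = LINE_INSTRUCTION;
    ins.opcode = OP_MOV;
    ins.operand_count = 2;
    ins.operands[0].kind = OPERAND_REGISTER;
    ins.operands[0].reg = reg;
    ins.operands[1].kind = OPERAND_IMMEDIATE;
    ins.operands[1].value = value;
    ins.source_line = source_line;
    return ins;
}
```

Keeping the source line number in the struct costs almost nothing and makes the error messages of the later stages much more useful.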


Lexical and syntactic analysis:

Ideally, you would write a complete lexer and a complete parser. Normally you would use a regular grammar for the lexical part and a context-free grammar for the syntactic part. But unless you already have a tool at hand and have mastered this kind of knowledge, doing so would be a lot of work. So I propose a simpler approach:

  1. Divide your program into a sequence of lines. Basically, read the entire source and split it wherever you find a \n.
  2. Divide each line into a sequence of words. Basically, use the spaces to split the line into several "words".
  3. Discard empty "words" (zero length). Detect where the comments are and discard them as well. Discard lines that turn out to be completely empty.
  4. The words left on the line correspond to the parts of your instruction (mnemonic and operands).
  5. Analyze the remaining words to see whether you need to subdivide them into smaller words because of things like parentheses, brackets, commas, etc.
  6. Identify the mnemonic of the instruction and create an instance of your struct to store what was read on the line.
  7. If no mnemonic is recognized on a particular line, issue an error and stop. As your project is simple, leave error handling and recovery aside (at least for now) and abort the build on the first error found.
  8. Repeat this process for each line, building a list of instructions that represents the whole program.
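Steps 2 to 5 of the list above could be sketched roughly as follows. This is only one possible way to do it, under assumptions of mine: comments start with ';', commas and whitespace only separate words, and square brackets become words of their own. The function name tokenize_line and the fixed limits are hypothetical too.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_TOKENS 8
#define MAX_TOKEN_LEN 32

/* True for characters that may appear inside an ordinary word. */
static int is_word_char(char c) {
    return c != '\0' && c != ';' && c != ',' && c != '[' && c != ']'
        && !isspace((unsigned char)c);
}

/* Splits one source line into words.
   Returns the number of words found (0 for an empty or comment-only line),
   or -1 if the line has too many words or a word is too long. */
static int tokenize_line(const char *line,
                         char tokens[MAX_TOKENS][MAX_TOKEN_LEN]) {
    assert(line != NULL && tokens != NULL);
    int count = 0;
    const char *p = line;
    while (*p != '\0' && *p != ';') {          /* ';' starts a comment */
        if (isspace((unsigned char)*p) || *p == ',') {
            p++;                               /* skip separators */
            continue;
        }
        if (count == MAX_TOKENS) return -1;
        size_t len = 0;
        if (*p == '[' || *p == ']') {
            tokens[count][len++] = *p++;       /* bracket is its own word */
        } else {
            while (is_word_char(*p)) {
                if (len + 1 >= MAX_TOKEN_LEN) return -1;
                tokens[count][len++] = *p++;
            }
        }
        tokens[count][len] = '\0';
        count++;
    }
    return count;
}
```

With these rules, a line such as "  mov ax, [bx] ; load" yields the five words mov, ax, [, bx and ]. The first word would then be compared against your table of mnemonics (step 6).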

Semantic analysis:

Once you have the list of program instructions, each one in its own struct, check that all referenced labels and routines exist. Check that the arguments, registers and operands used in each instruction are valid, compatible with each other and with the instruction, and that they appear in the right number and in the right order, as expected by the corresponding instruction. Check everything that is relevant. If you find something wrong, issue an error and stop.

To do this, you will probably need to write a specialized function in your C code to analyze each distinct type of instruction. Something like verificar_MOV(), verificar_POP(), verificar_ADD(), etc.
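To give an idea of what one of these checking functions might look like, here is a sketch of verificar_MOV(). The rules it enforces are simplifying assumptions of mine (eight registers numbered 0 to 7, destination must be a register, immediates must fit in 16 bits), and the trimmed-down Instruction struct is redefined here only so the example stands alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

typedef enum { OPERAND_REGISTER, OPERAND_IMMEDIATE } OperandKind;
typedef struct { OperandKind kind; int reg; long value; } Operand;
typedef struct {
    Operand operands[2];
    int operand_count;
    int source_line;
} Instruction;

#define REGISTER_COUNT 8   /* assumption: registers are numbered 0..7 */

static bool is_valid_register(const Operand *op) {
    return op->kind == OPERAND_REGISTER
        && op->reg >= 0 && op->reg < REGISTER_COUNT;
}

/* Checks "MOV destination, source" against this sketch's rules.
   Prints a message to stderr and returns false on the first problem. */
static bool verificar_MOV(const Instruction *ins) {
    assert(ins != NULL);
    if (ins->operand_count != 2) {
        fprintf(stderr, "line %d: MOV expects 2 operands, got %d\n",
                ins->source_line, ins->operand_count);
        return false;
    }
    const Operand *dest = &ins->operands[0];
    const Operand *src = &ins->operands[1];
    if (!is_valid_register(dest)) {
        fprintf(stderr, "line %d: MOV destination must be a register\n",
                ins->source_line);
        return false;
    }
    if (src->kind == OPERAND_IMMEDIATE) {
        /* Accept anything representable in 16 bits, signed or unsigned. */
        if (src->value < -32768 || src->value > 65535) {
            fprintf(stderr, "line %d: immediate %ld does not fit in 16 bits\n",
                    ins->source_line, src->value);
            return false;
        }
        return true;
    }
    if (!is_valid_register(src)) {
        fprintf(stderr, "line %d: MOV source register is invalid\n",
                ins->source_line);
        return false;
    }
    return true;
}
```

The other checkers would follow the same shape, so a table mapping each instruction type to its function pointer keeps the dispatching code short.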


Code generation:

First, you will have to calculate the size of each instruction. Since you have already done the semantic analysis, all the instructions should be valid and well-formed at this point, unless you have made a mistake. With the sizes in hand, you can work out the offset of every instruction and label, which is what you need to compute the addresses used by jumps, calls and other references. You should have a table or manual that explains how to convert each instruction into its corresponding machine code. At this stage you do that conversion, storing the resulting machine codes in a list that matches 1-to-1 the instructions you have stored in your structs. If you want, you can store the machine code inside the struct itself.
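The offset calculation described above can be sketched as a single pass that accumulates instruction sizes, followed by a lookup by label name. The Item struct and the function names assign_addresses and find_label are hypothetical, and I am assuming the size of each instruction has already been filled in by whatever encodes it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_NAME 32

/* One entry of the program list, reduced to what this step needs. */
typedef struct {
    int is_label;          /* 1 if this entry only defines a label */
    char name[MAX_NAME];   /* label name when is_label is 1 */
    size_t size;           /* encoded size in bytes (ignored for labels) */
    size_t address;        /* offset from program start, filled in below */
} Item;

/* Gives every entry its offset and returns the total code size in bytes.
   A label takes no space: it gets the offset of whatever follows it. */
static size_t assign_addresses(Item *items, size_t count) {
    assert(items != NULL || count == 0);
    size_t offset = 0;
    for (size_t i = 0; i < count; i++) {
        items[i].address = offset;
        if (!items[i].is_label) {
            offset += items[i].size;
        }
    }
    return offset;
}

/* Returns the offset of the named label, or -1 if it was never defined. */
static long find_label(const Item *items, size_t count, const char *name) {
    for (size_t i = 0; i < count; i++) {
        if (items[i].is_label && strcmp(items[i].name, name) == 0) {
            return (long)items[i].address;
        }
    }
    return -1;
}
```

For example, a 3-byte instruction followed by the label loop and then a 2-byte instruction puts loop at offset 3. Doing this pass before emitting any bytes is what lets an instruction refer to a label defined further down in the file.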

Again, you will probably need a specialized code generation function for each instruction type, such as gerar_codigo_MOV(), gerar_codigo_PUSH(), gerar_codigo_ADD(), etc.
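As an example for the 8086, here is a sketch of two such functions. It covers only two forms: MOV of a 16-bit immediate into a 16-bit register (opcode 0xB8 plus the register number, followed by the immediate with the low byte first) and PUSH of a 16-bit register (opcode 0x50 plus the register number). The function signatures are my own choice; a real gerar_codigo_MOV() would have to handle the other operand combinations as well.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 8086 16-bit registers, in the order used by the opcode encoding. */
enum { REG_AX, REG_CX, REG_DX, REG_BX, REG_SP, REG_BP, REG_SI, REG_DI };

/* Encodes "MOV r16, imm16" into out (which must hold at least 3 bytes).
   Returns the number of bytes written. */
static size_t gerar_codigo_MOV(int reg, uint16_t imm, uint8_t *out) {
    assert(reg >= REG_AX && reg <= REG_DI && out != NULL);
    out[0] = (uint8_t)(0xB8 + reg);
    out[1] = (uint8_t)(imm & 0xFF);   /* low byte first: little-endian */
    out[2] = (uint8_t)(imm >> 8);
    return 3;
}

/* Encodes "PUSH r16" into out (which must hold at least 1 byte).
   Returns the number of bytes written. */
static size_t gerar_codigo_PUSH(int reg, uint8_t *out) {
    assert(reg >= REG_AX && reg <= REG_DI && out != NULL);
    out[0] = (uint8_t)(0x50 + reg);
    return 1;
}
```

With this, MOV BX, 0x1234 becomes the bytes BB 34 12 and PUSH AX becomes the single byte 50. Returning the number of bytes written lets the caller advance its output position without knowing each instruction's size.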

Once this is done, all you have to do is write the machine codes sequentially to the output file. Depending on the executable format you are targeting, you may also have to add things like headers to the file.
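For the simplest case, a flat binary with no header at all, the writing step could look like the sketch below. The function name write_flat_binary is hypothetical, and I am assuming all the machine code has already been collected into one contiguous buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Writes size bytes of machine code to path as a raw, header-less file.
   Returns 0 on success and -1 on any I/O failure. */
static int write_flat_binary(const char *path, const uint8_t *code,
                             size_t size) {
    assert(path != NULL && (code != NULL || size == 0));
    FILE *f = fopen(path, "wb");      /* "b": no newline translation */
    if (f == NULL) {
        return -1;
    }
    size_t written = fwrite(code, 1, size, f);
    int close_failed = (fclose(f) != 0);
    return (written == size && !close_failed) ? 0 : -1;
}
```

Opening the file in binary mode matters on platforms that would otherwise translate the byte 0x0A when writing. A format that needs headers would write them before the code, using the total size computed earlier.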


Conclusion:

At this point you should have your simple compiler working. There is no code optimization. There is no intermediate code generation (which is useful for generating code for different architectures). Error handling and recovery is minimal. Since your program is monolithic so far, there is no dependency management and/or linkage. All these features can be added later incrementally one at a time, if you want.

In addition, you should be working with a minimal set of instructions, and for semantic analysis and code generation your compiler will probably have a fair amount of code specific to each instruction. Once you have a minimal compiler implemented, and if you have built it in a modular way, adding support for new instruction types one at a time should not be too difficult.

Finally, since assembly language is very low level, it probably makes no sense to talk about type checking and inference, and register allocation becomes the user's problem rather than the compiler's, so you probably won’t need to worry about these aspects.
