What is lexical analysis?


40

I was taking a look at the source code of a well-known PHP library called Twig (a template engine with its own syntax), and I came across classes, interfaces and methods such as Lexical, Lex and LexInterface.

I did a little research and I realized it was about the term lexical analysis.

Though I understood some things, others confused me.

For example, I’m used to seeing the terms parser and parsing when one piece of data is transformed into another.

  • What is lexical analysis?

  • Are lexical analysis and parsing the same thing, or are they actually different?

Sorry if the question is confusing, but I think the community will help me with a good and enlightening answer.

  • Maybe this answer helps: http://answall.com/a/104818/91; around the middle of it there is an explanation of lexical analysis.

  • @rray that is already a good start. I wouldn’t have known what specific term to search for to find that answer.

  • You hear that term a lot in automata theory.

  • You can’t "protect" your question against votes; protection only prevents new users from answering it, so it doesn’t make much sense to protect your question at the moment ;-)

  • I know, @Math. I confess that I protect some questions just to use the feature, haha.

  • @Wallacedesouza: the link in rray’s comment has an excellent answer to this question. Do you expect a more "practical" answer? A simple example of building a compiler? (I answered a question on the subject here: Parser misidentifying tokens.) In short, what do you expect from a good answer?

  • @Gomiero when I put the bounty on it, the question didn’t have so many details yet. I will only award the bounty when there is little time left; that way the answers stay visible for other people to see and form their own opinion.

  • @Wallacedesouza: Ok! Thank you :)


6 answers

33


Definition

Lexical analysis is the process performed on a text, say a computer program or a markup language such as HTML, that divides the text into lexemes and converts them into a sequence of tokens, which are used to feed a parser. A program that performs lexical analysis is usually called a lexer, scanner or tokenizer.

tokens / lexemes

Lexemes are the syntactic units that are relevant in the lexer’s context. For example, for a lexer aimed at a certain programming language, some lexemes could be: 1, "hello", for, ==, variableName, function.

A token is a structure that categorizes a lexeme: it contains an abstract name that represents the lexeme’s group, and an optional value for the cases where the group does not correspond to a single lexeme. Example of some tokens (each token is represented in the format < token-name, optional value > and "#" starts a comment):

< digit, 1 > # any numeric value
< literal, "olá" > # values enclosed in double quotes
< for > # the characters f,o,r
< comparison, == > # the symbols <, >, <=, >=, == and !=
< id, variableName > # a sequence of letters that represents a variable
< function > # the letters f,u,n,c,t,i,o,n
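
In code, a token can be represented as a simple structure with a name and an optional value. Below is a minimal JavaScript sketch of the tokens above (the object shape is illustrative, not taken from Twig or any particular library):

var tokens = [
	{name: 'digit', value: '1'},          // any numeric value
	{name: 'literal', value: 'olá'},      // value enclosed in double quotes
	{name: 'for'},                        // keyword: the name alone is enough
	{name: 'comparison', value: '=='},    // one of <, >, <=, >=, == and !=
	{name: 'id', value: 'variableName'},  // a variable name
	{name: 'function'}                    // keyword: no value needed
];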

difference from parsing

Lexers and parsers are closely linked, yet they are distinct concepts. A lexer specializes in extracting the relevant portions of a text and transforming them into "meaningful" structures, in this case tokens. A parser has the function of analyzing the syntactic structure of a text, for example saying whether, in a given programming language, the expression "olá" 1 == "outroliteral" is syntactically valid or invalid. The link between the two is that the structures produced by the lexer are what the parser uses as its source to perform the parsing, i.e., the parser works with tokens. This is not mandatory; nothing prevents you from building a parser that works directly over plain text. However, the separation of the two tasks brings some advantages (a minimal parser sketch follows the list):

  1. Design simplification. A parser that also performs the work of a lexer is significantly more complex.
  2. Optimization. By separating the two tasks (lexing and parsing) you have more freedom to apply optimization techniques that are specific to each task.
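
To make the separation concrete, here is a minimal sketch of a parser, hypothetical code rather than Twig’s API, that only checks whether a token sequence produced by a lexer matches the rule expression -> literal comparison literal:

// Checks whether a token list matches: expression -> literal comparison literal
function parseComparison(tokens) {
	var expected = ['literal', 'comparison', 'literal'];
	if (tokens.length !== expected.length) {
		return false;
	}
	for (var i = 0; i < expected.length; i++) {
		if (tokens[i].name !== expected[i]) {
			return false; // token out of place: syntactically invalid
		}
	}
	return true;
}

// "olá" == "outroliteral" -> valid
parseComparison([{name: 'literal'}, {name: 'comparison'}, {name: 'literal'}]); // true

// "olá" 1 == "outroliteral" -> invalid
parseComparison([{name: 'literal'}, {name: 'digit'}, {name: 'comparison'}, {name: 'literal'}]); // false

Note that the parser never touches the raw text; it reasons only over the token names that the lexer produced.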

a lexer in practice

The theory is essential, yet nothing beats real code to aid understanding. Here I show a handwritten JavaScript lexer that handles a subset of CSS; more specifically, it handles this example:

h1 {
    color: red;
    font-size: 30px;
}

body {
    background-color: yellow;
}

div {
    margin: 10px;
    padding: 10px;
}

The code can be executed and will display the list of tokens generated after processing our target CSS:

// our CSS
var css = "h1 { \n\
	color: red; \n\
	font-size: 30px; \n\
} \n\
 \n\
body { \n\
	background-color: yellow; \n\
} \n\
 \n\
div { \n\
	margin: 10px; \n\
	padding: 10px; \n\
} \n\
";

/**
* List defining our tokens and the respective regular expressions that match them in the text.
*/
var TOKENS = [
	{name: 'EMPTY', regex: /^(\s+)/ },
	{name: 'RULE_SET_START', regex: /^({)/ },
	{name: 'RULE_SET_END', regex: /^(})/ },
	{name: 'RULE_DEFINITION', regex: /^(:)/ },
	{name: 'RULE_END', regex: /^(;)/ },
	{name: 'SELECTOR', regex: /^([a-zA-Z]+[a-zA-Z0-9]*)(?=\s*{)/ },
	{name: 'RULE', regex: /^([a-z][-a-z]+)(?=\s*:)/ },
	{name: 'RULE_VALUE', regex: /^(\w+)(?=\s*(;|}))/ }
];


var text = css;
var outputTokenList = [];
while(text !== '') { // while there is text left to consume

	var hasMatch = false;
	
	/**
	 * We iterate over the whole TOKENS list until we find one whose pattern matches the beginning of our text.
	 * When a "match" occurs, we add the lexeme with its respective token to the lexer's output token list.
	 * If no pattern matches the text, an exception is thrown printing the line that contains the error.
	 */
	for (var i=0; i<TOKENS.length; i++) {
		var obj = TOKENS[i];
		var tokenName = obj.name;
		var tokenRegex = obj.regex;
		
		var matched = text.match(tokenRegex);
		if (!matched) {
			continue;
		}
		
		hasMatch = true;
		var lexeme = matched[1];
		
		// we remove the matched lexeme from the text
		// so that the next one can be considered
		// in the next iteration
		text = text.substring(lexeme.length);

		if (tokenName in {'SELECTOR': 1, 'RULE': 1, 'RULE_VALUE': 1}) {
			// we add both the token name and the lexeme
			outputTokenList.push([tokenName, lexeme]);
		}
		else if (tokenName in {'EMPTY': 1}) {
			// discard it; we don't care about spaces and line breaks.
		}
		else {
			// in these cases the relationship between the token name
			// and the lexeme is 1<->1, so we don't need to store the lexeme.
			outputTokenList.push([tokenName]);			
		}
		
		break;
	}

	if (!hasMatch) {
		throw 'Invalid pattern at:\n' + (text.substring(0, text.indexOf('\n')));
	}
}

outputTokenList.forEach(function(token) {
	document.write('< ' + token[0]);
	if (token.length > 1) {
		document.write(' , ' + token[1]);
	}
	document.write(' ><br>');
});

I won’t go into explanations about lexer implementations, because that is an extensive subject and not directly related to the question. Note that this is just an illustration of how a lexer can work: it reads a target text and generates tokens as output. The code is neither efficient nor robust.

  • I liked it! (+1) And I will steal your example in my answer :)

12

What is lexical analysis?

It is the analysis of a sequence of characters to create symbols (tokens) that make it easier to understand what is written there.

Are lexical analysis and parsing the same thing, or are they actually different?

They are complementary.

The first step is the generation of symbols (tokens), called lexical analysis, by which the input character stream is divided into significant symbols defined by a grammar of regular expressions.

The second step is syntactic analysis, which checks that the tokens form an allowed expression.

The final phase is semantic analysis, which works out the implications of the validated expression and takes the appropriate action.
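
A minimal JavaScript sketch of the three steps chained together; the tiny sum language and the function names are illustrative assumptions, not taken from any real compiler:

// 1. Lexical analysis: character stream -> tokens
function lex(input) {
	return (input.match(/\d+|\+/g) || []).map(function (lexeme) {
		// (anything other than digits and '+' is ignored in this tiny sketch)
		return /^\d+$/.test(lexeme) ? {name: 'NUMBER', value: lexeme}
		                            : {name: 'PLUS'};
	});
}

// 2. Syntactic analysis: do the tokens form an allowed expression?
function parse(tokens) {
	var shape = tokens.map(function (t) { return t.name; }).join(' ');
	if (shape !== 'NUMBER PLUS NUMBER') {
		throw 'Syntax error: ' + shape;
	}
	return tokens;
}

// 3. Semantic analysis / action: act on the validated expression
function evaluate(tokens) {
	return Number(tokens[0].value) + Number(tokens[2].value);
}

evaluate(parse(lex('1 + 2'))); // 3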

  • "First stage, second stage, final stage" stages/phases of what? I imagine they are the front end of a compiler, but this was not specified in the text.

  • Steps of an analyzer (a parser; I don’t know a good translation). It may indeed be part of a compiler, since a compiler needs to do this, but it may also be a parser analyzing XML or JSON communication, for example.

10

Lexical and syntactic analysis (parsing) are in fact very similar things, since the algorithms that work on these two fronts operate in a similar way: both process similar input and produce similar kinds of results.

As mentioned in the comments on the question, we indeed hear this expression a lot when studying automata, precisely because lexical analyzers deal only with regular languages. In practice, a lexical analyzer acts as a token recognizer: given the grammar of any regular language, a lexical analyzer can determine whether a given string belongs to that language or not. Hence the famous regular expressions.

Parsers, in contrast, work at a more complex level of language, because they deal with context-free languages. Also in practice, parsers usually process a sequence of tokens (usually recognized by lexers) and determine whether those tokens satisfy the structures defined in the grammar of the language.

Therefore, lexical analysis consists of validating tokens according to the formation rules of a regular language; if the language is not regular, then a parser is required. In the case of Twig, since it is a template engine, I believe that lexical analysis occurs in the recognition of special markers such as {{, else, {%, etc.
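
For illustration only, a minimal sketch of what recognizing such markers could look like; this is an assumed toy, not Twig’s actual lexer:

// Hypothetical recognition of Twig-style markers; not Twig's real implementation.
var MARKERS = [
	{name: 'VAR_START',   regex: /^\{\{/ },
	{name: 'VAR_END',     regex: /^\}\}/ },
	{name: 'BLOCK_START', regex: /^\{%/ },
	{name: 'BLOCK_END',   regex: /^%\}/ },
	{name: 'TEXT',        regex: /^[^{%}]+/ }
];

function lexTemplate(text) {
	var out = [];
	while (text !== '') {
		var consumed = false;
		for (var i = 0; i < MARKERS.length; i++) {
			var m = text.match(MARKERS[i].regex);
			if (m) {
				out.push({name: MARKERS[i].name, lexeme: m[0]});
				text = text.substring(m[0].length);
				consumed = true;
				break;
			}
		}
		if (!consumed) { // a lone {, % or }: consume one character as text
			out.push({name: 'TEXT', lexeme: text[0]});
			text = text.substring(1);
		}
	}
	return out;
}

lexTemplate('{{ name }}');
// [ {name: 'VAR_START', lexeme: '{{'}, {name: 'TEXT', lexeme: ' name '}, {name: 'VAR_END', lexeme: '}}'} ]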


I’m **updating** my answer because I believe it wasn’t broad enough, and also because I think the other answers were either too generic or tied lexers too closely to parsers.

First, the fundamental similarities between lexers and parsers:

  1. Both take as input symbols from some alphabet. Usually the symbols of a lexer are ASCII or Unicode characters, while parsers process tokens (the terminal symbols of the grammar they are validating).

  2. They analyze these symbols against a grammar to recognize them as members of a language. As I explained above, lexers validate only regular languages, whereas parsers operate on context-free languages. These are levels 3 and 2 of the Chomsky hierarchy, respectively.

  3. Both produce as output sentences of the language they are evaluating. Lexers distinguish tokens in a string given as input, while parsers generate syntax trees.

With respect to the last point, although the two are complementary in the vast majority of cases, this does not mean that a parser will always receive its tokenized input from a lexer. Since a parser is capable of generating multiple sentences of some language L1, we can obtain a language L2 whose tokens are L1 sentences; parsers can therefore act as tokenizers for other parsers.

In natural language processing, parsers get their tokenized input from a series of steps that may involve manual editing, statistical analysis and machine learning.

Therefore, the fundamental difference between the two is the one I explained above: the level of language at which they operate and, consequently, the complexity needed to operate at that level. While finite-state automata are sufficient for regular languages, context-free languages require more complex mechanisms such as pushdown automata, and all the huge variety of existing parsers (bottom-up, top-down, tabular, etc.).
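
A minimal sketch of that difference in power, with hypothetical code; the recursion below plays the role of the pushdown automaton’s stack:

// Regular (level 3): a finite automaton, here a regular expression,
// is enough to recognize a token such as a number.
function isNumber(s) {
	return /^[0-9]+$/.test(s);
}

// Context-free (level 2): balanced parentheses cannot be recognized by any
// finite automaton; a stack is needed. Here the call stack of the recursion
// plays that role, as in a recursive descent parser.
// Returns the index reached after consuming a balanced prefix, or -1 on error.
function consumeBalanced(s, i) {
	while (i < s.length && s[i] === '(') {
		i = consumeBalanced(s, i + 1);  // descend into the nested group
		if (i < 0 || s[i] !== ')') {
			return -1;                  // unmatched '('
		}
		i++;
	}
	return i;
}

function isBalanced(s) {
	return consumeBalanced(s, 0) === s.length;
}

isNumber('123');    // true  -> a job for the lexer
isBalanced('(())'); // true  -> a job for the parser
isBalanced('(()');  // false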

  • There are erroneous statements there, Roney: "lexical analysis consists of validating tokens". Lexical analysis produces tokens, it does not validate them, because tokens do not exist before the lexer generates them; validation happens on the lexemes.

  • @Brunorb lexers are not generating mechanisms, so they do not produce tokens; they validate input as tokens or not. You may see that as production because you are still tied to their use in the front end of a compiler, but they are, in short, recognizers of their input against a regular reference grammar. Distinguishing, in an input string, which are the tokens of the reference language does not imply that such tokens are produced. I made explicit in the answer that tokens are nothing more than the atomic symbols of a regular language, so there is no reason to produce what already exists.

  • "...the main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens for each lexeme in the source program" (Compilers: Principles, Techniques, and Tools).

  • @Brunorb yes man, it’s one definition among many. The distinction between tokens and lexemes is terminology created to better present the concepts of interest for their application in compilers. I use a different definition, in which the term token means the "atomic constituent of a language". As long as the ideas presented are consistent with the definition, there is no point in debating; I could call them a ball, if that were useful. I believe I have been consistent with the definition I put forward, so I see no point in extending this non-dialogue.

  • @Brunorb with your comments I noticed a possible contradiction in my answer, and I edited it to make my understanding of the subject more explicit. Thank you.

8

An interpreter/compiler does not understand raw text directly; it needs the data well organized and defined.

For example, let’s assume that our interpreter needs to interpret this simple sum:

42 + 42

How do we solve this? That’s where the lexical analyzer comes in.

The programmer who created our imaginary interpreter defined that a number is one digit followed by other digits, and that a sum is simply the symbol '+'.

[0-9]+ returns NUMBER;
"+"    returns PLUS;
" "    ;

Now what? Well, let’s look at what happens as it analyzes the input; the dot marks the character it is currently examining:

1) . 42 (Our analyzer reads the digit 4 into the buffer; it knows that, by definition, a number is one digit followed by more digits.)

2) 4 . 2 (So far it has 4 as a number, and it continues, storing 2 in the buffer.)

3) 42 . (This is the end of the number, so our analyzer emits a NUMBER. It then finds a blank, which by definition returns nothing.)

For now, we know this:

<NUMBER, 42> . + 42

The analyzer is now on the '+'; it knows that, by definition, this is a PLUS:

<NUMBER, 42> <PLUS, '+'> . 42

And the same process applies to the second 42:

<NUMBER, 42> <PLUS, '+'> <NUMBER, 42>

The analysis is complete; now what? Our interpreter can consume this data consistently. Let’s assume that our interpreter uses a very simple and restrictive grammar for sums:

sum -> NUMBER PLUS NUMBER

It can use the output tokens of the lexical analyzer and focus only on parsing. As the output consists of NUMBER PLUS NUMBER, it matches a valid sum.
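
A minimal JavaScript sketch of this walkthrough (hypothetical code, following only the definitions above): the lexer walks the input character by character, buffering digits until a number ends, and the parser then just checks the token names against the rule sum -> NUMBER PLUS NUMBER.

// Lexer: character stream -> [<NUMBER, 42>, <PLUS, '+'>, <NUMBER, 42>]
function lexSum(input) {
	var tokens = [];
	var buffer = '';
	for (var i = 0; i < input.length; i++) {
		var c = input[i];
		if (c >= '0' && c <= '9') {
			buffer += c;                            // still inside a number
			continue;
		}
		if (buffer !== '') {                        // the number ended
			tokens.push({name: 'NUMBER', value: buffer});
			buffer = '';
		}
		if (c === '+') {
			tokens.push({name: 'PLUS', value: '+'});
		} else if (c !== ' ') {
			throw 'Unexpected character: ' + c;     // lexical error
		}
	}
	if (buffer !== '') {
		tokens.push({name: 'NUMBER', value: buffer});
	}
	return tokens;
}

// Parser: does the token sequence fit  sum -> NUMBER PLUS NUMBER ?
function parseSum(tokens) {
	var shape = tokens.map(function (t) { return t.name; }).join(' ');
	return shape === 'NUMBER PLUS NUMBER';
}

parseSum(lexSum('42 + 42')); // true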

7

The answers are very good and have already covered the whole technical side of the subject. So, just to complement them, here is an explanation focused only on the etymology of the words.


Lexical Analysis

Refers to the vocabulary (words) of a language. In a simple way, it is a dictionary analysis: it checks whether the terms/words exist within the language.
For example, "carnaval", "ovo" and "bola" are part of the lexicon of the Portuguese language, while "party", "egg" and "ball" are part of the lexicon of the English language. A lexical analysis does not concern itself with the order or meaning of the terms, only with the terms themselves.

You can see a technical example here.

Parsing

Refers to the grammar rules of the language, that is, how we can organize the terms of the language so that they make sense. It can be viewed as the way a command must be structured to perform an action, or as the rules for forming a sentence.

Technical example here.

Semantic Analysis

Here we are talking about the meaning/sense employed in the phrase/command. For example, the Portuguese phrase Me inclui fora dessa ("include me out of this"), where the words and the syntax are correct, but the semantics are not.

Technical example here.


These definitions are part of linguistics as a whole and are used as a way to organize and ease the understanding of how a piece of code/process proposes to work.

I chose not to use a technical approach and to give examples from the Portuguese language, as they are equally valid when you think of programming languages and they make the terms simpler to understand.

  • (+1) Well put: it brings in the pragmatic side!

6

Everything has already been said; however, I like language processors and I cannot resist giving an example here, in this case using Lex+Yacc.

Problem statement: given a (simplified) CSS file, taking as an example the case presented by @Brunorp, compute how many properties each tag has.

Translator grammar: grammar + semantic actions

Syntactic analyzer = parser (yacc)

%{
#include <stdio.h>
%}

%union {char *s; int i;}
%token <s> ID STR MEA
%type  <i> props

%%
css  :                                      // bloco*
     | css bloco
     ;
bloco: ID '{' props '}'  { printf("%s - %d\n",$1,$3);}
     ;
props:                   { $$ = 0    ;}     // prop*
     | props prop        { $$ = $1+1 ;}     // count the props
     ;
prop : ID ':' v ';'      ;                  // property
v    : ID | STR | MEA    ;                  // value
%%
#include "lex.yy.c"

yyerror(char *s){fprintf(stderr,"ERROR (line %d):'%s'\n-%s",yylineno, yytext,s);}

int main() { yyparse(); return 0; }

Lexical analyser (flex)

%option yylineno
%option noyywrap

%%
[ \n\t]+              {  } // ignore whitespace
#.*                   {  } // ignore comments

[{}:;]                { return yytext[0];} // special characters

[a-zA-Z][-a-zA-Z0-9]* { yylval.s=strdup(yytext);  return ID;  }
\"[^\"]*\"            { yylval.s=strdup(yytext);  return STR; }
[0-9]+(px|pt|mm)      { yylval.s=strdup(yytext);  return MEA; }

.  {fprintf(stderr,"%d: Error - invalid character '%c'\n",yylineno,yytext[0]);}

%%

What lexical analysis does:

  1. skips spaces and comments
  2. returns the codes of reserved words and special characters
  3. groups the characters that form an ID, etc.; returns their code and value
  4. deals with lexical errors

Compiling and testing:

$ flex proc.l
$ yacc proc.y
$ cc -o proc y.tab.c
$ proc < ex.css      ## ex.css is the CSS example from @BrunoRP

h1 - 2
body - 1
div - 2
