Definition
Lexical analysis is the process of taking a text, such as a computer program or a markup language document (e.g. HTML), dividing it into lexemes, and converting those lexemes into a sequence of tokens, which are then fed to a parser. A program that performs lexical analysis is usually called a lexer, scanner, or tokenizer.
tokens / lexemes
Lexemes are the syntactic units relevant to the lexer's context. For example, for a lexer targeting a certain programming language, some lexemes could be: 1, "hello", for, ==, variableName, Function.
A token is a structure that categorizes a lexeme. It contains an abstract name representing the lexeme's group and, when that group covers more than one lexeme, an optional value. Examples of some tokens (each token is written as < token name, optional value > and "#" starts a comment):
< digit, 1 > # any numeric value
< literal, "olá" > # values enclosed in double quotes
< for > # the characters f,o,r
< comparison, == > # the symbols <, >, <=, >=, == and !=
< id, variableName > # a sequence of letters representing a variable
< function > # the characters f,u,n,c,t,i,o,n
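The examples above can be sketched in code. This is just one possible representation (the names come from the examples above, not from any real compiler): a token as a plain object with a category name and an optional value.

```javascript
// A token modeled as a plain object: a category name plus an optional
// value (the value is only needed when the category groups many lexemes).
var exampleTokens = [
  { name: 'digit', value: '1' },
  { name: 'literal', value: '"olá"' },
  { name: 'for' },                      // 1-to-1 with its lexeme: no value needed
  { name: 'comparison', value: '==' },
  { name: 'id', value: 'variableName' }
];

// Print each token in the < name, optional value > format used above.
exampleTokens.forEach(function (token) {
  console.log('< ' + token.name + (token.value ? ', ' + token.value : '') + ' >');
});
```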
difference from parsing
Lexers and parsers are closely linked, yet they are distinct concepts. A lexer specializes in extracting the relevant portions of a text and transforming them into "meaningful" structures, in this case tokens. A parser analyzes the syntactic structure of a text, for example to decide whether, in a given programming language, the expression "olá" 1 == "outroliteral"
is syntactically valid or invalid. The link between the two is that the structures produced by the lexer are what the parser consumes as its input, i.e., the parser works with tokens. This is not mandatory; nothing prevents you from building a parser that works over plain text. However, separating the two tasks brings some advantages:
- Design simplification. A parser that also performs the work of a lexer is significantly more complex.
- Optimization. By separating the two tasks (lexing and parsing) you have more freedom to apply optimization techniques specific to each one.
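To make this division of labor concrete, here is a hypothetical, deliberately tiny "parser" for illustration only: it never looks at raw text, only at a token stream a lexer would produce, and it accepts exactly the shape literal comparison literal. So the tokens for `"olá" == "outroliteral"` pass, while the tokens for `"olá" 1 == "outroliteral"` fail.

```javascript
// Hypothetical minimal "parser": works only on tokens, accepting exactly
// the sequence <literal> <comparison> <literal>.
function isValidComparison(tokens) {
  return tokens.length === 3 &&
    tokens[0].name === 'literal' &&
    tokens[1].name === 'comparison' &&
    tokens[2].name === 'literal';
}

// "olá" == "outroliteral" → three tokens in the expected order → valid
var valid = isValidComparison([
  { name: 'literal', value: '"olá"' },
  { name: 'comparison', value: '==' },
  { name: 'literal', value: '"outroliteral"' }
]);

// "olá" 1 == "outroliteral" → an extra digit token → invalid
var invalid = isValidComparison([
  { name: 'literal', value: '"olá"' },
  { name: 'digit', value: '1' },
  { name: 'comparison', value: '==' },
  { name: 'literal', value: '"outroliteral"' }
]);

console.log(valid, invalid); // true false
```

A real parser would of course build a syntax tree rather than return a boolean, but the input it consumes is the same: tokens, not characters.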
a lexer in practice
The theory is essential, yet nothing beats real code to aid understanding. Here I show a handmade JavaScript lexer that handles a small subset of CSS; more specifically, it handles this example:
h1 {
color: red;
font-size: 30px;
}
body {
background-color: yellow;
}
div {
margin: 10px;
padding: 10px;
}
The code can be executed and will display the list of tokens generated after processing our target CSS:
// our target CSS
var css = "h1 { \n\
color: red; \n\
font-size: 30px; \n\
} \n\
\n\
body { \n\
background-color: yellow; \n\
} \n\
\n\
div { \n\
margin: 10px; \n\
padding: 10px; \n\
} \n\
";
/**
* List that defines our tokens and the regular expressions that match them in the text.
*/
var TOKENS = [
{name: 'EMPTY', regex: /^(\s+)/ },
{name: 'RULE_SET_START', regex: /^({)/ },
{name: 'RULE_SET_END', regex: /^(})/ },
{name: 'RULE_DEFINITION', regex: /^(:)/ },
{name: 'RULE_END', regex: /^(;)/ },
{name: 'SELECTOR', regex: /^([a-zA-Z]+[a-zA-Z0-9]*)(?=\s*{)/ },
{name: 'RULE', regex: /^([a-z][-a-z]+)(?=\s*:)/ },
{name: 'RULE_VALUE', regex: /^(\w+)(?=\s*(;|}))/ }
];
var text = css;
var outputTokenList = [];
while(text !== '') { // while there is text left to consume
var hasMatch = false;
/**
* We iterate over the whole TOKENS list until we find one whose pattern
* matches the start of our text. When a match occurs, we add the lexeme
* with its token to the lexer's output token list.
* If no pattern matches the text, an exception is thrown printing the
* line that contains the error.
*/
for (var i=0; i<TOKENS.length; i++) {
var obj = TOKENS[i];
var tokenName = obj.name;
var tokenRegex = obj.regex;
var matched = text.match(tokenRegex);
if (!matched) {
continue;
}
hasMatch = true;
var lexeme = matched[1];
// remove the matched lexeme from the text
// so that another one can be considered
// in the next iteration
text = text.substring(lexeme.length);
if (tokenName in {'SELECTOR': 1, 'RULE': 1, 'RULE_VALUE': 1}) {
// add both the token name and the lexeme
outputTokenList.push([tokenName, lexeme]);
}
else if (tokenName in {'EMPTY': 1}) {
// discard: we don't care about spaces and line breaks.
}
else {
// in these cases the relationship between the token name
// and the lexeme is 1-to-1, so we don't need to add the lexeme.
outputTokenList.push([tokenName]);
}
break;
}
if (!hasMatch) {
var newlineIndex = text.indexOf('\n');
// show the offending line (or the rest of the text if it has no newline)
throw 'Invalid pattern at:\n' + (newlineIndex === -1 ? text : text.substring(0, newlineIndex));
}
}
outputTokenList.forEach(function(token) {
document.write('< ' + token[0]);
if (token.length > 1) {
document.write(' , ' + token[1]);
}
document.write(' ><br>');
});
I won't go into lexer implementation details, because that is an extensive subject not directly related to the question. Note that this is just an illustration of how a lexer can work: it reads a target text and produces tokens as output. The code is neither efficient nor robust.
Maybe the answer http://answall.com/a/104818/91 helps; more or less in the middle of that answer there is an explanation of lexical analysis.
– rray
@rray that is already a good lead. I wouldn't have known which specific term to search for to arrive at that answer.
– Wallace Maxters
You hear that term a lot in automata theory.
– rray
You can't "protect" your question against votes; protection only prevents new users from answering it, so it doesn't make much sense to protect your question at the moment ;-)
– Math
I know, @Math. I confess I protect some just to use the feature, haha.
– Wallace Maxters
@Wallacedesouza: the link posted in rray's comment has an excellent answer to this question. Do you expect a more "practical" answer? A simple example of building a compiler? (I answered a question on the subject here: Parser misidentifying tokens.) In short, what do you expect from a good answer?
– Gomiero
@Gomiero when I put the bounty on it, the question didn't have so many details yet. I will only award the bounty when it is about to expire. That way the answers stay visible, so other people can see them and form their own opinion too.
– Wallace Maxters
@Wallacedesouza: Ok! Thank you :)
– Gomiero