Can a regular expression become Assembly?

Asked

Viewed 489 times

-5

Regular expressions are known to be character patterns that associate character sequences in the text. Would it be possible to do the same for Assembly? Make there are expressions that would translate sequences or steps as it is in Assembly with opcodes?

Here is an example of a regular expression: [A-Za-z0-9]. This says it can be uppercase, lowercase, and there are numbers that are 0 through 9. So, can a regular expression turn into Assembly? So: mov a, mov b, mov c ... etc.

If it were instead of writing to compilers and interpreters we could write natively on the machine, the opcodes as: mov a, mov b etc. would just be a regular expression. It would be like a specific rule for the processor.

And it would be quick debug since the interruptions would work on any platform and as possible and well. What are the references on this subject?

  • It is not trivial to port a directed graph of colored edges to be processed directly by the processor

  • Why? I could explain. Your answer is subtle, but I don’t understand.

  • If you tell me what strange words I mentioned above, I can give you a more targeted response =) If you want, you can include in hall also: regular grammar, formal language, Chomsky hierarchy, state machine, deterministic automaton

  • Have you tried checking which are the points from distance 15 to up to 3 jumps from an origin? Have you checked how much operation you need to know? In assembly, that has only logic-arithmetic operations, jumps, register transfer, reading and writing in the main memory, imagine the effort needed to do this.

  • I don’t understand your last comment

  • 1

    I did not understand the comments on Pythagorean numerology, either because they were written or because they were excluded

  • 1

    I believe that the new edition took away the original spirit of the issue, I took it to a discussion on the goal

  • Edited again, see if it’s good. I’ll make it easier to understand.

Show 3 more comments

2 answers

10

To answer your question, A regular expression can become Assembly?, we need to understand 2 concepts a priori:

  1. regular expression
  2. Assembly

Regular expression

I am not taking as a basis modern regular expressions with look-Behind, atomic cluster and rearview mirrors; I’m focusing on the most mathematically speaking regular part. The possibility of including these functionalities in regex disfigure the finite automaton, as they require the presence of some memory.

By the way, the presence of rear-view mirrors implies that it solves context-sensitive problems, two levels lower in the hierarchy of Chmosky.

A regular expression is a scheme of writing a regular grammar on a single line. It can also be seen as a very compact way to assemble a state machine without memory, with one-way reading on the tape.

Regular grammar

A regular grammar is obtained through a grammar that has only regular productions. A production is called regular if it is from a non-terminal to one of the following targets:

  1. empty string
  2. other non-terminal
  3. a single terminal
  4. a pair: terminal and non-terminal

See more in that meta-language response grammar for constructing grammars. See that other answer that talks about formal languages with intense affection, especially the initial parts.

From a regular grammar, we can assemble a directed graph that represents it, where the vertices are non-terminal (with the possible exception of the accepting vertex) and the edges represent a grammatical production. Take the following grammar as an example, with q0 for the initial symbol and 0,1 as the terminals:

q0 -> 1 q1
q1 -> 0 q2
q2 -> 1 q5
q2 -> 1 q3
q3 -> 0 q3
q3 -> 1 q3
q3 -> 0 q4
q4 -> 1 q5
q5 -> 0 q6
q6 -> 1 q7
q7 -> 0 q8

It can be interpreted as the following graph:

um grafo da gramática regular

Image source: https://stackoverflow.com/a/22366722/4438007

A word p is said accepted in grammar G if it is possible, through the initial non-terminal, to obtain the word p using only the transactions of G. p, to be considered a word, it must be composed only of terminals.

State machine

A state machine is a mathematical abstraction that represents how a program will behave. In this case, we are dealing with a finite state automaton. To be regular, this automaton only contains its states, its current state and the reading word.

Every automaton starts in an initial state. To get out of one state and into another, there are only two hypotheses:

  1. if transaction is transaction-lambda (a.ka. tranation-Epsilon);
  2. if the letter of the word available to be consumed is one of the options of the transaction, situation in which the letter will be consumed (that is, it can never be consulted again by the regular expression) and will walk in the transaction.

A state machine can be very well represented by a graph. Vertices represent the states and edges transactions. The vertices can be colored to indicate accepting states (multiple accepting states are possible) and initial (only one initial state is allowed). Edges are colored with the letter that is consumed in the transaction, or receive lambda color if they represent lambda transitions.

A graphically represented state machine can be this:

um grafo da máquina de estados

A word p is recognized by the finite automaton if possible, using its states and the two transaction rules described above, for in one of the accepting states without there being one more letter to be consumed.

Note that automata do not need to be deterministic; by the way, the very existence of lambda transactions requires the automaton to be nondeterministic. However, there is an algorithm that transforms any nondeterministic finite automaton into a deterministic finite automaton.

The above automaton is nondeterministic: take the word 1010, in what state will it stop? Options are: q6, q4 and q3.

Note that, in a finite automaton, walking a path in a graph implies:

  1. make decisions
  2. loop

The decision to leave a state x any depends only on itself and the next letter available in the word; any set of previous letters and triggered transactions that result in the same state then become idempotents.

There is a useful concept in the finite automaton that is called well. When a character not expected by the automaton is found, the next state is well. When regular expression is of type belongs to the word, and not the word must be identical to the expression, the well has a transaction-lambda for the initial state. Note: all states so implicit transactions for the well with all other letters of the alphabet not described in their transactions. Due to this, usually the well is ignored from the drawing.

Regular language

A regular language is the set of words recognized by a grammar (or automaton) that describes it. In this context, grammar (or automaton) can be considered generative. In this context, instead of consuming the input word, transactions are triggered generating terminals, until you reach the end and you have a new word of language. I talk more about languages and words in that reply. (Automata can also be generative, behaving similarly to a Markov chain).

Regular expression, the return

Regular expression, then, is a scheme to describe a regular language. It is easy to see through the grammar graph how to construct a regular expression.

In general, the main operators of regular expression are:

  1. Kleene star * used to represent loops occurring within the graph
  2. Selection | used to represent diverse paths that can be traced from a common origin
  3. Grouping () used to demonstrate a series of transactions that have been triggered
  4. List [], used to match a multi-color transaction, such as q3->q3
  5. Escape \, force the following character to have literal meaning, losing its powers as a meta-character

In view of these rules of formation, we can transform the non-deterministic lambda-free automaton into the following regular expression:

10(1|1[01]*01)010

More modern regular expressions gained other more modern operators, such as the optional ? and the kleene cross +, but those operators are easily transformed into the previous operators.

Compiling regular expressions with context-free grammars

Regular expressions cannot be identified by regular expressions. This is because, in its most basic definition, regular expressions suffer self-nesting. More about this in that reply.

A quick demonstration of the self-nesting of regular expressions:

11100(0101(11111[01][01]*[01]00000)*00)*11100

The general grammar for a regular expression is (in BNF-simile notation):

S -> EXP
EXP -> EXP . EXP
EXP -> NQEXP
EXP -> NQEXP . QUANT
QUANT -> '*'
NQEXP -> '(' . EXP . ')'
NQEXP -> '[' . LETRAS_LISTA . ']'
NQEXP -> '\' . qualquer caracter
NQEXP -> LETRA
LETRAS_LISTA -> ']' . LETRAS_LISTA2
LETRAS_LISTA -> LETRAS_LISTA2
LETRAS_LISTA2 -> todas as outras, exceto ']'
LETRA -> qualquer caracter não especial

Note:

  • S of start
  • EXP of expression
  • NQEXP non-quantified expression
  • QUANT of quantified

Assembly

Tag :

Generally, a mnemonic is a symbolic name for a single instruction in executable machine language (an operation code), and there is at least one mnemonic optode defined for each machine language instruction. Each instruction typically consists of an operation code operation and zero or more operands. Most instructions refer to a single value, or a pair of values. Operands can be immediate (value encoded in the instruction itself), records specified in the instruction or implicit, or addresses of data located elsewhere in the storage. This is determined by the underlying processor architecture: the assembler only reflects how this architecture works.

Typically, these operators perform the following computations:

  1. load memory value in register
  2. write in memory recorded value
  3. logical operations with operators and constants
  4. arithmetic operations with operators and constants
  5. operations bitwise with operators and constants
  6. transaction flow

It is important to emphasize that the logical/attitudmatic operations/bitwise are of fixed dwelling to maintain asymptotic efficiency and complexity.

Pointer Dereferencing Operations is an operation to load value from memory to a logger.

Implementation of a programme

Assembly is done deterministically, each execution being planned (directly or indirectly) by the programmer. The same data set is executed step by step by the processor.

The only nondeterminism option I can find for non-quantum processors is the processing relying on random data of uncontrollable origin (I’m not talking about the random() of languages such as or , but variables that come from outside the computer; random() is deterministic to the past seed, while normally the seed is dependent on the time of the system). Even so, this behavior may well be simulated as an oracle function, the behavior of the system (ignoring the inner workings of the oracle) becomes deterministic.

Assembly for Regex

How to define an Assembly for regular expressions? Well, Assembly depends on a set of instructions that are followed deterministically. Simply porting what is written in regex without major translations into a machine language will not make it deterministic.

An alternative, therefore, would be to transform the regular expression into a finite automaton and then transform it into a deterministic finite automaton. Commonly, this is implemented as a multiple deviation for each state and each state attempts a multiple deviation for the relevant letters. States with unforeseen transactions would fall into the pit. Implementation of the meta-character . and the denied list falls on multiple deviations as well.

The automaton of the examples can be described by the following unambiguous deterministic regular grammar:

q0 -> 1 q1
q1 -> 0 q2
q2 -> 1 q_{5,3}
q_{5_3} -> 1 q3
q3 -> 0 q_{3_4}
q_{3_4} -> 0 q3
q3 -> 1 q3
q_{5_3} -> 0 q_{6_3_4}
q_{3_4} -> 1 q_{5_3}
q_{6_3_4} -> 0 q3
q_{6_3_4} -> 1 q_{7_5_3}
q_{7_5_3} -> 1 q3
q_{7_5_3} -> 0 q_{8_6_3_4}
q_{8_6_3_4} -> 0 q_{3_4}
q_{8_6_3_4} -> 1 q_{7_5_3}

This results in the following graph (with q0 the incial state and q_{8_6_3_4} the only state of acceptance):

autômato finito não-ambíguo determístico

Demonstration that is unambiguous and deterministic:

  1. check all edges emerging from a given vertex; they all have different shooting conditions for whatever it is;
  2. no transitions-lambda.

Therefore, it is unambiguous and deterministic.

To represent in atomic units, the following transactions could turn an Assembly more or less as follows:

  1. sets the state table with labels; whether the automaton is in the state q_{5_3}, it will go to the content of the label q_{5_3} with a detour;
  2. the step contained in início is a deviation to the correct label; this can be done by elaborating a vector of labels relating each position to a label; in this case, the deviation would be:

    # sejam os trechos começados por "#" comentários até o final da linha
    # sejam "reg_a", "reg_b", "reg_c" registradores de propósito geral
    # seja "reg_est" o registrador específico para guardar o estado do autômato
    # seja strings começadas com "$" constantes
    # seja a operação "set a, b" a operação que seta em "a" o valor de "b"
    # seja "add y, x, a" a operação que salva em "y" o valor da soma de "x" e "a"
    # seja "[a]" o acesso ao local de memória apontado por "a", portanto retornando o conteúdo apontado pelo ponteiro "a"
    # seja "jmp a" a operação de salto incondicional para a posição "a" da pilha de execução
    # seja "label a" o rótulo chamado "a", usado para futura referência de desvios incondicionais
    # seja "<a>" a referência ao rótulo "a"
    
    label inicio:
    set reg_a, $pos_vetor_rotulos
    add reg_b, reg_a, reg_est
    set reg_a, [reg_b]
    jmp reg_est
    
  3. for each state block end, there is an unconditional deviation to the início;

  4. for each finite affirmative transition fired, has a conditional by changing the state to the vertex at the end of the edge of the transition fired; if accepted under any of these conditions, jump to the end of the block;
  5. for the above transition, if you have not jumped to the end of the block, the next state will be the well;

    # exemplo para o estado q_{5_3}
    # seja "reg_c" o registrador que contém o caracter sendo consumido da palavra
    # seja "jneq dest, v1, v2" o salto condicionado para "dest" caso "v1" e "v2" sejam não iguais
    label q_{5_3}:
    jneq <q_{5_3}_1>, reg_c, '0'
    set reg_est, <q_{6_3_4}>
    jmp <fim_q_{5_3}>
    
    label q_{5_3}_1:
    jneq <q_{5_3}_2>, reg_c, '1'
    set reg_est, <q_3>
    jmp <fim_q_{5_3}>
    
    label q_{5_3}_2:
    set reg_est, <poço>
    label fim_q_{5_3}:
    jmp <inicio>
    
  6. for each element of the list transition denied, has a conditional with jump to the well;

  7. for wildcard transitions (usually called by the metacharacter . or by valid content in a linked list), has the given jump to the next state of that transition.

With this, we would have an Assembly-simile code of regular expressions.

Completion

Using the limitations of an assembly language

  1. load memory value in register
  2. write in memory recorded value
  3. logical operations with operators and constants
  4. arithmetic operations with operators and constants
  5. bitwise operations with constant operators
  6. transaction flow

the code representing a regular expression would be too long and unreadable. For debugging purposes, it would be better to transform into the unambiguous deterministic finite automaton and show in which state is the recognition of the past pattern.

Graphically, it is very interesting to check the current state of processing highlighted in AFD, as well as it would be interesting to follow in the word passed to find the pattern which position is recognized at that time. It would also be interesting to map the state of the deterministic automaton to its equivalent positions in regular expression.

6

The question as a whole is very broad, what you can answer is that it is possible yes, anything can be transformed into Assembly. Or machine code that is probably the desired.

Of course what you have to do is build a compiler, can even be a simple one since it only needs regular expression. But on the other hand what is the usefulness of regular expression so if the rest is not?

Why do this? You must be thinking about performance? Forget it, Assembly is not something magical that makes things go faster. If you make a bad compiler you will have Assembly that will generate a processing worse than a good regular expression interpreter ready. Assembly is always the extraordinarily naive solution to performance problems. Just think about it that is absurdly far from being able to do something on this level. Whoever can do it knows that this is not the solution.

Want more performance? It’s simple, do not use regular expression. Generally it is something slow and does not bring the benefits that most people believe in much of the cases.

What gives performance is to use the right algorithm, they produce almost unbelievable gains. One of the simplest and most known can shorten the run time by a fraction of millions. And it can be written in any slow language that will get this result. This algorithm does not apply to Regex, but little can be done to have a huge win in this case.

I’m not saying you won’t win, but it’s not worth the effort. What’s out there is enough.

Where to start is what I always say. Start with the basics. Don’t burn steps, don’t think you can take the easy way out, don’t believe it’s just asking random people on the Internet and you’ll know what to do, because this isn’t learning anything.

Learning is construction, one brick at a time and you can’t skip any. You can’t start without foundation. Foundation in the case is to know well what comes before knowing programming. Some people think it is possible to program well without knowing communication and expression or mathematics well before everything. It is not possible, I have never seen an even reasonable programmer case that did not have a formal education at reasonable level. Obviously programmers who only decorate a lot of revenue are not good programmers either. Good programmers don’t ask basic things because (I’m not saying there’s anything wrong with this question) because they’ve learned to find the information on their own. Asking is useful, it’s good, but good questions are those that help build knowledge, not those that ask for everything how to do.

In this case, we still need to understand what Assembly is, how to transform one thing into another, so the distance to achieve this is so great that the question is mere curiosity. The final part of the question doesn’t even make sense.

More specific questions may be more useful, but without the foundation of the computer dominated nothing will do.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.