How is the find function implemented?

Asked

Viewed 330 times

4

Unlike match() looking for an occurrence of a language y in the beginning of the string, the find() makes a quest for y inside that string. I could use the algorithm of Knuth-Morris-Pratt to search for the first occurrence n0, n1, ..., n of X(string) in S(string) in complexity O(n+k), being n the size of X and k the size of S, but this is unfeasible in implementing the search automaton because of the state transitions (i believe, because the algorithm only looks at the beginning of the string, I need to implement it so that a certain language, can be found at both the beginning/middle/end).

Can someone bring me a light on this?

  • 1

    I don’t quite understand your question. A deterministic finite automaton will process the input string in O(n) time and the KMP algorithm is efficient precisely because it constructs a finite automaton instead of doing a naive search...

  • Yes, that’s obvious, the problem is HANDLING THE DFA/NFA STATES of the language I’m looking for within S. My problem is this, I have no idea how to implement this.

  • Maybe, I can call KMP(char) in every state of my DFA and so go riding the acceptable returns. This would be expensive because I would have O(n+1) to each char, unless, I change the KMP to maintain the current position in the buffer.

  • 2

    Did you find a solution? Poste as an answer to help other people.

  • Boyer-Moore, not Knuth-Morris-Pratt? http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html

  • 2

    Do you want to find a fixed string inside another string, or find a regex inside the string? And asking "how is it implemented" without specifying a language/platform is complicated to answer, as each can do it in a different way (there are several string search methods, as well as various methods of pattern marriage, and not always the "most efficient" - if any - is used in practice, for reasons that vary from case to case).

  • @mgibsonbr, I put the tag pseudocode so.

  • 1

    Why not build the classic DFA/NFA for .*expressao-regular ?

Show 3 more comments

2 answers

1


Build a parser LR from the production:

s -> .*minha_regex
minha_regex -> sua regex original

When production minha_regex is reduced for the first time, you will have found.

Converting parts of regex into productions is relatively simple:

  • a -> x* is the same as:

    a -> ε
    a -> a x
    
  • a -> x | y is the same as:

    a -> x
    a -> y
    
  • a -> (x | y) z* is the same as:

    a -> a0 a1
    a0 -> x
    a0 -> y
    a1 -> ε
    a1 -> a1 z
    

0

Regex - IN JAVA (you didn’t say the language, I hope it helps!)

The regular expressions(regex), is a resource where we can be searching with only simple characters (sequence of letters, digits, etc.) or with metacharacters. Therefore, to use such advantages we obviously need to indicate a source and the search characters.

De Matcher the methods: find(), start(), group(), end();

We have as source: mouse of the blue color. So I want to find where the word color is in our sentence (in bold). And now!?

import java.util.regex.Matcher; 
import java.util.regex.Pattern; 

public class Regex { 

public static void main(String[] args) { 

String fonte = “mouse da cor azul”; 

String queremosIsso = “cor”; 

/*
* Nesse momento obteremos uma instância de Pattern, através do * método 
* static compile(String regex), o qual recebe uma String que é a 
* expressção regular 
*/ 
Pattern p = Pattern.compile(queremosIsso); 

/* 
* Aqui, através da instancia de Pattern, chamando o método * matcher() e passando a fonte de busca 
*/ 

Matcher m = p.matcher(fonte); 

//Ligando o motor 
while (m.find()) { 

//Obtendo o inicio do que foi encontrado 
System.out.println(m.start()); 

} 

} 

} 

By executing this code, we will get 9(nine) as a response. But, why 9?

First let’s see what happens.

After obtaining the instances of Pattern and Matcher, in a loop(while) we call the method m.find(). When we call this guy, the regex engine is fired, that is, the search task is effectively started. Vale remembers that the regex engine, runs the search from left to right and while it finds an occurrence, is no longer participating in the search process. Just below in the code, I called the m.start() method. This method returns an integer, where it represents the initial index of each sequence found, which in this case would be 9(nine). Reminder that starts from scratch. Below follows the letters and their indexes (where there are two quotes without content, represent spaces).

0-m 1-o 2-u 3-s 4-e 5-’ ‘ 6-d 7-a 8-’ ‘ 9-c

mouse of the blue color

There is also as I said at the beginning, the group() method. It is responsible for bringing the group of characters that was found in the search.

Ex: //Starting the engine

while (m.find()) { 

//Obtendo o inicio do que foi encontrado 

System.out.println(m.start() +” “+ m.group() +” “+ m.end()); 

} 

Will have as output: 9 color 12

Then the end() method, where this is the counter start( ), bringing the index of the last character found, but this the index starts at the one(1).

Source:http://www.devmedia.com.br/introducao-a-regex/15597

Browser other questions tagged

You are not signed in. Login or sign up in order to post.