Clean up a string with regex

4

I have the following array in PHP:

[
"Opcao 01 - Frase aleatória - Menu Superior",
"Opcao 02 - Outra Frase aleatória - Menu Su",
"Opcao 03 - Mais 01 Frase - Menu",
"Opcao 04 - Mais Frase -",
"Opcao 05 - Frase Simples",
]

I need to clean it up so that it looks like this:

"01 - Frase aleatória",
"02 - Outra Frase aleatória",
"03 - Mais 01 Frase",
"04 - Mais Frase",
"05 - Frase Simples",

And I have to do this with regex. What would the pattern for this filter look like?

  • Have you tried this, without the quotation marks? "\d+\s+[-]\s+\w+\s+\d+"

  • Which language? Do you already have some code that did not work?

  • @Park I just tried it and it did not work; it returns nothing. It worked up to the \w, but after that it fails. Thanks.

  • @Martinsluan I will use it in PHP.

  • Please click on [Edit] and describe the criteria better. Will it always be "Option number - Item number - Top Menu"? Can the texts vary? And so on.

  • 1

    @hkotsubo Done, my friend. I think it is better now. Thanks for the tip.


3 answers

5

You can use the function preg_replace to make the replacement.

The regex can be something like ^\w+ (\d+ -[^\-]+)( -.*)?$:

  • the markers ^ and $ are, respectively, the beginning and end of the string. This ensures that I am checking the entire string.
  • the shortcut \w means "letters (from A to Z, uppercase or lowercase), numbers (from 0 to 9) or the character _"
  • the shortcut \d means "any digit from 0 to 9"
  • the quantifier + means "one or more occurrences".
  • [^\-] is "any character that is not a hyphen"
  • .* is "zero or more occurrences of any character", and the ? right after the parentheses makes the group ( -.*) optional (that is, the string may or may not end with a space, a hyphen and "anything")

So the regex starts with \w+ (one or more occurrences of letters, numbers or _), followed by a space, then one or more digits (\d+), a space, a hyphen, one or more characters that are not hyphens (this ensures that it only matches up to the next hyphen), optionally followed by a space, a hyphen and .* (zero or more occurrences of anything), and finally the end of the string.

The excerpt \d+ -[^\-]+ is in parentheses, which forms a capturing group. This means that the text matched by this part can be referenced later.

In this case, since it is the first pair of parentheses, the captured text is available in the special variable $1, which I can use in the second parameter of preg_replace:

$textos = array(
  "Opcao 01 - Frase aleatória - Menu Superior",
  "Opcao 02 - Outra Frase aleatória - Menu Su",
  "Opcao 03 - Mais 01 Frase - Menu",
  "Opcao 04 - Mais Frase -",
  "Opcao 05 - Frase Simples");
foreach($textos as $texto) {
    echo preg_replace('/^\w+ (\d+ -[^\-]+)( -.*)?$/', '$1', $texto), PHP_EOL;
}

The result is:

01 - Frase aleatória
02 - Outra Frase aleatória
03 - Mais 01 Frase
04 - Mais Frase
05 - Frase Simples
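
To see what each capturing group holds, a minimal sketch with preg_match can help. The sample string is one of the entries from the question; the variable names are only illustrative:

```php
<?php
// Inspect what each capturing group holds for one sample string.
$pattern = '/^\w+ (\d+ -[^\-]+)( -.*)?$/';
$sample  = 'Opcao 03 - Mais 01 Frase - Menu';

if (preg_match($pattern, $sample, $groups) === 1) {
    // $groups[0] is the whole match, $groups[1] the first pair of parentheses,
    // $groups[2] the optional trailing part (unset when it did not take part).
    echo 'Group 1: ', $groups[1], PHP_EOL;
    echo 'Group 2: ', $groups[2] ?? '(not matched)', PHP_EOL;
}
```

Group 1 is exactly what '$1' refers to in the replacement string of preg_replace.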


If you want, you can pass the entire array to preg_replace, and the return value will be another array with the replacements made:

$textos = array(
    "Opcao 01 - Frase aleatória - Menu Superior",
    "Opcao 02 - Outra Frase aleatória - Menu Su",
    "Opcao 03 - Mais 01 Frase - Menu",
    "Opcao 04 - Mais Frase -",
    "Opcao 05 - Frase Simples");
var_dump(preg_replace('/^\w+ (\d+ -[^\-]+)( -.*)?$/', '$1', $textos));

Output:

array(5) {
  [0]=>
  string(21) "01 - Frase aleatória"
  [1]=>
  string(27) "02 - Outra Frase aleatória"
  [2]=>
  string(18) "03 - Mais 01 Frase"
  [3]=>
  string(15) "04 - Mais Frase"
  [4]=>
  string(18) "05 - Frase Simples"
}

Accented characters

In the above regex, \w does not consider accented characters, so if the string starts with "Opção", for example, it will not work. Another detail is that \w also matches digits and the character _. If you only want letters, one option is to use Unicode properties (using the category L, which covers all letters, including other alphabets such as Japanese, Korean, Cyrillic, etc.), not forgetting the modifier u (right after the second / in the regex):

// replace \w with \p{L} and add the "u" option to the regex, so that accented letters are considered
var_dump(preg_replace('/^\p{L}+ (\d+ -[^\-]+)( -.*)?$/u', '$1', $textos));

If you want to keep the \w, just add the option u (remembering that \w also matches digits and the _):

var_dump(preg_replace('/^\w+ (\d+ -[^\-]+)( -.*)?$/u', '$1', $textos));
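
A small check of both variants on a string that starts with an accented word. The sample string is an assumption, built from the question's first entry with "Opção" in place of "Opcao":

```php
<?php
// Hypothetical sample that starts with an accented word.
$texto = 'Opção 01 - Frase aleatória - Menu Superior';

// \p{L} with the "u" modifier: only Unicode letters are accepted before the number.
$comLetras = preg_replace('/^\p{L}+ (\d+ -[^\-]+)( -.*)?$/u', '$1', $texto);

// \w with the "u" modifier: letters, digits and _ are accepted before the number.
$comW = preg_replace('/^\w+ (\d+ -[^\-]+)( -.*)?$/u', '$1', $texto);

echo $comLetras, PHP_EOL;
echo $comW, PHP_EOL;
```

Both calls should give "01 - Frase aleatória"; the difference only shows when the first word contains digits or _, which \p{L} rejects.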

Spaces

The above options work when there is exactly one space separating words, numbers and hyphens.

But if there can be more than one space separating these parts, you can use \s+ (one or more whitespace characters). I also modified the regex a little to handle more than one space before the second hyphen (for example, "Opção 01 - Frase aleatória - Menu Superior" with extra spaces between the parts):

var_dump(preg_replace('/^\w+\s+(\d+\s+-\s+([^\-\s]+(\s+[^\-\s]+)*))(\s+-.*)?$/u', '$1', $textos));

For the part between the two hyphens, I used [^\-\s]+(\s+[^\-\s]+)*:

  • [^\-\s]+: one or more occurrences of anything other than a hyphen or whitespace
  • (\s+[^\-\s]+)*: zero or more occurrences of "whitespace, followed by one or more characters other than hyphens or whitespace"
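
A quick check of this version on an input with repeated spaces. The sample string is an assumption, and the final normalization step is an optional extra that is not part of the regex above:

```php
<?php
// Hypothetical input with several spaces between the parts.
$texto = 'Opção  01   -  Frase   aleatória    -   Menu Superior';

$regex = '/^\w+\s+(\d+\s+-\s+([^\-\s]+(\s+[^\-\s]+)*))(\s+-.*)?$/u';

// Group 1 keeps the inner spacing as it is, but stops before the spaces
// that precede the second hyphen.
$limpo = preg_replace($regex, '$1', $texto);

// Optional extra step (an assumption about what you may want):
// collapse every run of whitespace into a single space.
$normalizado = preg_replace('/\s+/u', ' ', $limpo);

var_dump($limpo, $normalizado);
```

Note that the captured text preserves the repeated spaces between words; only the whitespace before the second hyphen is left out.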

With that, I capture the text "Frase aleatória" without the risk of matching the second hyphen or the spaces before it (while still matching the spaces between the words). It might seem like an unnecessary complication, and you might think "why not use .*?, which is simpler?". Which brings me to another topic: the efficiency of a regex.


Efficiency

Just for comparison with @Park's answer (which is also correct; I am not criticizing, only comparing the solutions), the regex I suggested is more efficient. regex101.com has a debugging tool that is very useful for seeing how a regex behaves.

In the case of the regex (\d+\s+[-]\s+.*?(?=\s+-)|\d+\s+[-].*), it takes between 67 and 137 steps (depending on the string) to find a match. The regex I suggested takes at most 21 steps. The second version, with \s+ instead of a space, takes at most 29 steps.

And if the string does not match, the regex I suggested also needs fewer steps to realize that and report a no-match (53 steps against 149; the second version, with \s+, also needs 53 steps).

That said, these results are obviously estimates, and the exact numbers depend on how PHP's internal engine is implemented. Both the regex101.com links I posted and the preg_xxx functions use a PCRE (Perl Compatible Regular Expressions) engine, but depending on the regex and the strings used, some languages apply internal optimizations in certain cases. Even with the same regex flavor (PCRE), the numbers can vary from one language, engine or tool to another.

But in any case the real numbers should not change much. The regex in the other answer uses alternation (|), which makes the engine try the alternatives one by one until one of them works, and .*?, which makes the engine test many possibilities (after all, it means "zero or more occurrences of any character"). This causes several additional steps; for strings that do not satisfy the regex, the engine has to test all possibilities before it can be sure that there is no match.

My regex does use .*, but only at the end of the string and inside an optional group, which reduces this overhead a bit. It also has no alternation (a single alternative to test instead of two) and uses [^\-] (any character that is not a hyphen), which creates fewer possibilities than the . (which is "any character", and increases the possibilities exponentially, since the hyphen itself may be included if the regex finds it necessary).

The fact that I used ^ and $ also helps here: without them, the regex is tried again at each position of the string until it finds a point where it matches. Using ^ eliminates these extra steps, because the engine already knows that it should only try from the beginning of the string. That is, several details that seem "silly" on their own, but together make a difference.

Of course, for a small number of short strings the difference will be insignificant (probably milliseconds or even less), but when dealing with a large amount of data it may matter. In fact, using .* can cause an exponential increase in steps, depending on the case. For example, adding 3 characters to the invalid string adds just 3 more steps in my regex (from 53 to 56, also in the second version with \s+), while in the other it adds 15 (it jumps from 149 to 164).

I do not know whether you will handle enough data for this to make any difference in performance, but either way the alternative is on record.
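
If you want to measure this on your own machine instead of relying on step counts, a rough timing sketch could look like the one below. The helper medirTempo and the repetition count are assumptions made for illustration, and the comparison is not perfectly fair (one regex is used for replacement on the whole array, the other for extraction in a loop), so treat the numbers only as an indication:

```php
<?php
$textos = [
    'Opcao 01 - Frase aleatória - Menu Superior',
    'Opcao 02 - Outra Frase aleatória - Menu Su',
    'Opcao 03 - Mais 01 Frase - Menu',
    'Opcao 04 - Mais Frase -',
    'Opcao 05 - Frase Simples',
];

// Hypothetical helper: runs $fn $repeticoes times and returns the elapsed milliseconds.
function medirTempo(callable $fn, int $repeticoes = 20000): float
{
    $inicio = hrtime(true); // nanoseconds, available since PHP 7.3
    for ($i = 0; $i < $repeticoes; $i++) {
        $fn();
    }
    return (hrtime(true) - $inicio) / 1e6;
}

// Anchored regex from this answer, used as a replacement.
$comAncoras = function () use ($textos) {
    return preg_replace('/^\w+ (\d+ -[^\-]+)( -.*)?$/', '$1', $textos);
};

// Regex with alternation and .*? from the other answer, used as an extraction.
$comAlternancia = function () use ($textos) {
    $saida = [];
    foreach ($textos as $texto) {
        preg_match('/(\d+\s+[-]\s+.*?(?=\s+-)|\d+\s+[-].*)/', $texto, $m);
        $saida[] = $m[0] ?? $texto; // keep the original text if nothing matched
    }
    return $saida;
};

printf("anchored:    %.2f ms\n", medirTempo($comAncoras));
printf("alternation: %.2f ms\n", medirTempo($comAlternancia));
```

Both closures produce the same cleaned strings for these five inputs, so only the elapsed time differs. Results will vary with the PHP version and with whether the PCRE JIT is enabled.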

  • 1

    You always surprise me with regex. I would give more than one upvote if it were possible! Kk

2


Try the following:

\d+\s+-[^-]*

Since you want it without the trailing space, try this:

\d+\s+-[^-]*(?=\s)

There are other ways you can do this:

  • 1) \d+\s+-.*?(?=\s+-|$)

  • 2) \d+\s+-(?:\s.*?(?=\s+-)|.+)

  • 3) (\d+\s+[-]\s+.*?(?=\s+-)|\d+\s+[-].*)

  • 4) \d+\s+-(?:\s+\S*(?:\s(?!\s*-)\S*)*|.+) (more efficient than the others)

The fourth option is very efficient because the chunks between whitespace characters are matched in "batches", and the check for a hyphen is made only when a whitespace character is found.
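
Since these patterns extract the wanted part instead of replacing the rest, they fit preg_match better than preg_replace. A minimal sketch, assuming the fourth pattern reads \d+\s+-(?:\s+\S*(?:\s(?!\s*-)\S*)*|.+) once its backslashes are in place; the u modifier is my addition because the sample strings contain accented letters:

```php
<?php
$textos = [
    'Opcao 01 - Frase aleatória - Menu Superior',
    'Opcao 02 - Outra Frase aleatória - Menu Su',
    'Opcao 03 - Mais 01 Frase - Menu',
    'Opcao 04 - Mais Frase -',
    'Opcao 05 - Frase Simples',
];

// Fourth pattern from the list above (assumed reconstruction), with the "u" modifier added.
$regex = '/\d+\s+-(?:\s+\S*(?:\s(?!\s*-)\S*)*|.+)/u';

$limpos = [];
foreach ($textos as $texto) {
    // $m[0] holds the whole match, which is the cleaned text here.
    if (preg_match($regex, $texto, $m) === 1) {
        $limpos[] = $m[0];
    }
}

print_r($limpos);
```

Strings that do not match are simply skipped in this sketch; you may prefer to keep them unchanged.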

  • Man, it worked. The only thing is that it is capturing the last space before the -. Is there a way to remove this space from the end?

  • @Joao Nivaldo Yes, there is. I updated the answer.

  • 1

    It works now. Thank you very much.

-1

You need to use the function str_replace. I made an example for you:

<?php
// Strips the fixed prefix and suffix from strings such as "Opcao 01 - Item 01 - Menu Superior".
function limpar_stringl($string) {
    $string = str_replace('Opcao', '', $string);
    $string = str_replace(' - Menu Superior', '', $string);
    return trim($string); // drop the space left over after removing "Opcao"
}
echo limpar_stringl("Opcao 01 - Item 01 - Menu Superior");

  1. Whenever you want to strip another character, word or set of characters, you need to add a new line to the function.
  2. Repeat the line $string = str_replace('character or word', '', $string); for each item.

  • I need to do it with regex, as I said. Thank you.
