Replace with a regular expression (regex) ignoring accents

I recently switched to friendly URLs, and because of that the query string used for searches is now filtered to remove accents. The database queries are fine: the word is found whether or not it has an accent. But I used to use a Replace to highlight the searched words:

    Replace(texto, palavra, "<b>" & palavra & "</b>")

In short, how can I use ereg_replace() so that it ignores accents? Example:

    Texto="Este é um filme de Ação e tem muita ação. "
    Palavra="acao"
    texto=Ereg_replace(condição_RegEx_que_ignora_Acentos-e-case, texto, palavra,"<b>"& palavra & "</b>")
    'Result I need:
    "Este é um filme de <b>Ação</b> e tem muita <b>ação</b>."

Thanks for your attention

@EDIT to explain better:

I have a word without accents coming from the query string (a search term). What I need is a substitution that highlights this word in the text, whether or not it is accented there. So a text containing toquio, Toquio, tóquio or Tóquio, found by searching for any of those forms, needs to have that word replaced by itself wrapped in the <b>...</b> tag, like this:

it was like this: Fui pra Toquio.
it needs to become: Fui pra <b>Toquio</b>.

but if it is like this: Fui pra Tóquio.
it needs to become: Fui pra <b>Tóquio</b>.

or also: Fui pra tóquio.
it needs to become: Fui pra <b>tóquio</b>.

or also: Fui pra toquio.
it needs to become: Fui pra <b>toquio</b>.

regardless of whether the user typed toquio, Toquio, tóquio or Tóquio in the search.

Is it clearer now, @david?

  • I think you're a little confused. As I understand it, you need an algorithm that, on finding the word in the text, wraps it in the right <b> </b> tags?

  • Exactly. Independent of case (upper or lower) and accents, just like the example I gave.

  • It just got confusing: what you asked for was to "ignore accents", and if that's not it, you would have to rephrase your question.

  • I don't know ASP, but ereg_replace probably accepts capture groups, so if you pass the regex "(" & padrao & ")" as the first argument, the substitution can be "<b>$1</b>" and it will insert the word that was found (and not the pattern) into the result. What remains is working out how to make this pattern ignore accents and capitalization.

  • @david but that's exactly what I want: to replace a word, accented or not, with the same word, accented or not, adding a tag, just that. I believe the term I used is right and the example is clear.

  • @mgibsonbr, that's it. I found a regex function for ASP but I'm not sure how to write a pattern that does this. I found a similar need on the English Stack Overflow, but it's in PHP and I couldn't get it to work, because I don't know much PHP, let alone in English... http://stackoverflow.com/questions/10477213/regex-to-ignore-Accents-php

  • I don't know ASP... :( Take a look at this other question (for JavaScript) - in particular this answer - and see if it helps. The idea is to replace each letter in the pattern with a set of "similar" characters before using it in ereg_replace. So if your word is acao, its final regex would be [aAáÁâÂàÀãÃ][cCçÇ][aAáÁâÂàÀãÃ][oOóÓôÔòÒõÕ]. You would need to write a function that builds this regex from the search term.

  • P.S. Just to clarify one thing: you really are using classic ASP, not ASP.NET, right?

  • @mgibsonbr, yes, classic ASP.

  • @mgibsonbr, I had found that page, but I couldn't convert it into an equivalent ASP function. This page https://github.com/chiquitto/FuncoesASP/blob/master/lb.string.asp has the function ereg_replace, which I believe does what I need, but I don't know exactly how to use it, especially what the pattern should be for it to do what I need.

  • @David I edited the question to explain in more detail; see if it's clear to you now.

1 answer

Note: I don't know ASP very well, so I'll answer with the proposed logic, but the code may not be 100% correct.

I suggest building a regex from the search term by replacing each letter of the word with a character class containing all the variations of that letter (uppercase/lowercase/accented), and then using this regex in ereg_replace.

    ' Builds an accent- and case-insensitive pattern from the search term
    Function CriarRegex(palavra)
        palavra = eregi_replace("a", "[aAáÁâÂàÀäÄãÃ]", palavra)
        palavra = eregi_replace("e", "[eEéÉêÊèÈëË]", palavra)
        palavra = eregi_replace("i", "[iIíÍîÎìÌïÏ]", palavra)
        palavra = eregi_replace("o", "[oOóÓôÔòÒöÖõÕ]", palavra)
        palavra = eregi_replace("u", "[uUúÚûÛùÙüÜ]", palavra)
        palavra = eregi_replace("c", "[cCçÇ]", palavra)
        ' Wrap in a capture group so the replacement can refer to the match as $1
        CriarRegex = "(" & palavra & ")"
    End Function

(eregi_replace should ignore uppercase/lowercase; however, since I don't know whether that also applies to accented letters, I listed all the combinations in the code above.)

When using it, replace the match with the contents of the first capture group, so that the word inserted in the result is the word found in the text, not the search term:

    Texto="Este é um filme de Ação e tem muita ação. "
    Palavra="acao"
    Regex = CriarRegex(Palavra)
    texto=eregi_replace(Regex, "<b>$1</b>", texto)
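The same logic can be sketched with a lookup table and a single loop instead of one replace per letter. This is only an illustration in Python, not ASP: the names `VARIANTES`, `criar_regex` and `destacar` are hypothetical, and it assumes the search term arrives without accents, as described in the question.

```python
import re

# Hypothetical table: each unaccented letter maps to the variants it should match.
VARIANTES = {
    "a": "aAáÁâÂàÀäÄãÃ",
    "e": "eEéÉêÊèÈëË",
    "i": "iIíÍîÎìÌïÏ",
    "o": "oOóÓôÔòÒöÖõÕ",
    "u": "uUúÚûÛùÙüÜ",
    "c": "cCçÇ",
}

def criar_regex(palavra):
    """Build a capture-group pattern that matches palavra regardless of accents."""
    partes = []
    for letra in palavra:
        variantes = VARIANTES.get(letra.lower())
        if variantes:
            partes.append("[" + variantes + "]")
        else:
            # Escape everything else so regex metacharacters in the
            # search term are treated literally.
            partes.append(re.escape(letra))
    return "(" + "".join(partes) + ")"

def destacar(texto, palavra):
    """Wrap every match in <b>...</b>, keeping the word as found in the text."""
    # IGNORECASE covers the letters that are not in the table (t, q, ...).
    return re.sub(criar_regex(palavra), r"<b>\1</b>", texto, flags=re.IGNORECASE)

print(destacar("Este é um filme de Ação e tem muita ação. ", "acao"))
```

Escaping the letters that are not in the table is an addition to the original logic; it guards against search terms that contain characters such as `.` or `(`.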

By the way, the code above will also highlight things like "criação", which may or may not be what you want. If you only want to highlight whole words, use a word boundary in the regex (if it is supported by the library you mentioned):

    CriarRegex= "(\b" & palavra & "\b)"
    CriarRegex= "(\\b" & palavra & "\\b)"

(I don’t know if in ASP you need to "escape" the backslash or not, so I posted the two variations)
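To see the effect of the word boundary, here is a small sketch in Python (not ASP) that reuses the pattern for acao from the comments on the question. Note that Python's `\b` is Unicode-aware by default, so it treats accented letters as word characters; an engine where `\w` is ASCII-only may behave differently. The names `PADRAO`, `sem_limite` and `com_limite` are hypothetical.

```python
import re

# Pattern for "acao", written out by hand as in the comments on the question.
PADRAO = "[aAáÁâÂàÀãÃ][cCçÇ][aAáÁâÂàÀãÃ][oOóÓôÔòÒõÕ]"

texto = "A criação do filme de ação"

# Without boundaries, the tail of "criação" is highlighted too.
sem_limite = re.sub("(" + PADRAO + ")", r"<b>\1</b>", texto)

# With \b on both sides, only the whole word is highlighted.
com_limite = re.sub(r"\b(" + PADRAO + r")\b", r"<b>\1</b>", texto)

print(sem_limite)
print(com_limite)
```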

  • It is true that this kind of thing is arguably a case for string substitution rather than regex, but since that is what the question asks for, it is well solved. +1

  • @Bacco I'm not sure; to me it is a regex problem, because the OP wants to find in the text a word "similar" to the search term and replace it with something that is a function of the word found (and not of the search term). That seems to me a proper application of regex. Or are you referring to the function CriarRegex itself? (which could perhaps be written better using a loop)

  • @mgibsonbr is quite right. But I think it still needs some adjustments for ASP, especially at the end where it says "<b>$1</b>"; I think in ASP all of that will be treated as literal text.

  • @mgibsonbr I imagine that a simple substitution table plus a single loop does everything much better than regex, considering that the character set involved is small. But it depends a lot on each implementation. I use something similar to this for accent-insensitive search in SQLite, for performance.

  • @Supermax That $1 is a reference to the first capture group. It is the responsibility of the regex library, not the language itself, to replace it with the actual word. And I see that ASP supports this: in the library you pointed to there are several examples of substitutions of this kind throughout the code. (In other words, it is indeed supposed to be literal text.)

  • @mgibsonbr, man, awesome, it worked first time. Thank you very much, that was it. And I also agree with you, it was a regex problem and not just a replace; in this case it is a regex inside a replace. Thank you not only for the excellent answer but for fully understanding the problem and solving it. Big hug and Merry Christmas.

  • Good idea. Although the SQL query already does this filtering correctly, some situation may come up where a word such as "vote" appears in the middle of another word. I haven't tried it, but in that case your solution with \b should already solve it. Thanks once more.

  • A problem has come up when delimiting the whole word: when the accent is on the first or last letter, the word is not matched. See a short example: http://www.regexr.com/3cfp7

  • Very strange indeed... I don't know what's causing it, but I got a similar result using lookarounds: (?<!\w)([eé]t[eé]st[eé])(?!\w). The problem is that they (mainly the lookbehind) are not supported by all regex engines. regexr.com, for example, accepted the lookahead but not the lookbehind. Rubular accepted both. You'd have to test it on your platform and see if it works. P.S. This regex isn't perfect: áéteste or etesteá, for example, should not be matched but are.

  • Thank you again for getting back to me. It didn't work for me, but I modified it like this: (\b|\s|-)([eé]t[eé]st[eé])(?!\w) http://www.regexr.com/3cfr6 and changed the replacement to "$1<b>$2</b>". That solved almost everything (I think); the only exception I've noticed so far is the second one you mentioned: etesteá. Even if there is no solution, I think I can live with it.

  • @Supermax The difficulty is that \w does not match accented characters, so you would have to replace it with the full list of characters. For example, if you replace the last (?!\w) with (?![\wé]), it stops matching the last two remaining words. You would then have to list everything to catch what is missing: (?![\wáâàäãÁ...ÜçÇ]). By the way, does your system support possessive quantifiers? A "safer" way around the lack of lookarounds would be to use ([^\wá...Ç]?+)(regex)([^\wá...Ç]?+) with the replacement $1<b>$2</b>$3 (but the ?+ has to be possessive for this to work; a plain ? is not enough).

  • ?+ did not run, it gives an error, but ( |! b| s|-)([eé]t[eé]st[eé])([ wáâàäãéêèëíîìïóôòöõúùüç-]|$) solved practically everything. The only exception is when the word is repeated right next to itself, e.g.: bla bla eteste eteste bla bla. But I find it unlikely that this will actually happen; I just hope a case with two words like eteste and ateste side by side doesn't come up, and I think it won't. It also fails to match when the word is in quotes: "eteste", but I wouldn't expect that situation to occur either. I just wonder whether the length the pattern has ended up with could hurt performance? Thanks again.

  • @Supermax In this case no, because the regex consumes one character at a time and there is no backtracking. Even though it is "long", performance should be fine.
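The boundary problem discussed in these comments can be reproduced with a short sketch. It is written in Python (not ASP) and uses the `re.ASCII` flag to imitate an engine where `\w` and `\b` only know ASCII letters, which is an assumption about the engine in the thread. The names `ACENTOS`, `PALAVRA`, `com_b` and `com_lookarounds` are hypothetical, and lookbehind support on the ASP side would still need to be checked.

```python
import re

# Accented letters listed by hand, because ASCII-only \w does not cover them.
ACENTOS = "áâàäãéêèëíîìïóôòöõúùüçÁÂÀÄÃÉÊÈËÍÎÌÏÓÔÒÖÕÚÙÜÇ"
PALAVRA = "[eé]t[eé]st[eé]"

# re.ASCII makes \w and \b ASCII-only, imitating the behaviour seen above.
com_b = re.compile(r"\b(" + PALAVRA + r")\b", re.ASCII)

# Lookarounds whose character class adds the accented letters explicitly.
com_lookarounds = re.compile(
    "(?<![\\w" + ACENTOS + "])(" + PALAVRA + ")(?![\\w" + ACENTOS + "])",
    re.ASCII,
)

# With ASCII-only \b, words that start or end with an accented letter are missed.
apenas_b = com_b.findall("eteste éteste etesté")

# The explicit class finds all three words and rejects áéteste and etesteá.
destacado = com_lookarounds.sub(
    r"<b>\1</b>", "eteste éteste etesté áéteste etesteá"
)

print(apenas_b)
print(destacado)
```

Because the lookarounds only exclude letters, a word wrapped in quotes or followed by punctuation is still matched.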
