The regex does not accept __foo_bar__ because of the negated character class [^_], which means "any character other than _". Since foo_bar contains a _, no match is found.
To accept this case, we have to add another alternative: "the character _, as long as it is not followed by another _". We can do this with a negative lookahead: _(?!_).
So this part becomes [^_]|_(?!_) (a character that is not _, or a _ that is not followed by another _).
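A minimal sketch of the difference, assuming simplified patterns in which only the body between the delimiters changes (they are not the full regex from this answer, and the variable names are illustrative):

```php
<?php
// Simplified patterns for illustration only: just the body between the
// delimiters differs. These are not the full regex from this answer.
$onlyNonUnderscore     = '#__(?:[^_])+__#';
$allowSingleUnderscore = '#__(?:[^_]|_(?!_))+__#';

$input = '__foo_bar__';

// [^_] alone cannot get past the _ inside foo_bar, so nothing matches.
var_dump(preg_match($onlyNonUnderscore, $input));     // int(0)

// _(?!_) accepts a _ that is not followed by another _, so the match succeeds.
var_dump(preg_match($allowSingleUnderscore, $input)); // int(1)
```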
Another detail is that the regex does not accept cases with only one character between the delimiters, such as __a__. That's because [^\s_] appears twice: once after the opening delimiter and once before the closing delimiter. The regex therefore requires at least two characters between the delimiters.
We can solve this with a negative lookahead right after the opening delimiter: I replace the first [^\s_] with (?![\s_]) (that is, the delimiter cannot be followed by whitespace or _). The difference is that the lookahead only checks what lies ahead without consuming the character (so if there is only one character, it is consumed by the following parts of the regex, which allows the single-character case).
We can also replace the checks at the beginning and end with a lookbehind and a lookahead, so that they do not create extra groups that would have to be put back in the substitution. Finally, we keep only one capture group, for the content between the delimiters, and turn the others into non-capturing groups by changing the ( to (?:. That leaves a single group, so the substitution can be just <i>$1</i>.
It looks like this:
(?<=^|[^_])__((?![\s_])(?:(?:[^_]|_(?!_))+?)?[^\s_])__(?=[^_]|$)
In short:
- (?<=^|[^_]): lookbehind that checks that the delimiter is preceded by the start of the string or by a character that is not _
- (?![\s_]): negative lookahead that checks that the delimiter is not followed by whitespace or _
- [^_]|_(?!_): a character that is not _, or a _ that is not followed by another _
- [^\s_]: a character that is neither whitespace nor _
- (?=[^_]|$): lookahead that checks that the delimiter is followed by the end of the string or by a character that is not _
For the ** delimiters, just use the same logic. Remember that * has a special meaning in regex (it is a quantifier meaning "zero or more occurrences") and must be escaped with \ (becoming \*), unless it is inside square brackets. That is:
(?<=^|[^*])\*\*((?![\s*])(?:(?:[^*]|\*(?!\*))+?)?[^\s*])\*\*(?=[^*]|$)
Testing:
$q = "
Boa __tarde__ **Bacco**, isto é um **teste** com diversos **negritos** e __sublinhados__
__**um** dois__
**__um__ dois**
-- funciona
__foo bar__
__f o o b a r__
**teste __lorem ipsum__ dolor sit**
__teste **lorem ipsum** dolor sit__
abc__xyz__teste
-- não funcionava na primeira versão
__foo_bar__
__a__
__*__
__foo_*bar__
-- não é para funcionar
__ foo bar__
__foo bar __
_foo__
__ foo __
_foo_
__foo_
__ __
__a____
____a__
_____
-- corner case: aninhado (não tratei pois não sei o que deveria fazer)
__abc__xyz__123__
__abc __xyz__ 123__
-- corner case: escapes
__\___
__a\___
__\_abc__
";
$q = preg_replace('#(?<=^|[^*])\*\*((?![\s*])(?:(?:[^*]|\*(?!\*))+?)?[^\s*])\*\*(?=[^*]|$)#', '<b>$1</b>', $q);
$q = preg_replace('#(?<=^|[^_])__((?![\s_])(?:(?:[^_]|_(?!_))+?)?[^\s_])__(?=[^_]|$)#', '<i>$1</i>', $q);
var_dump($q);
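Output (the section labels inside the test string are in Portuguese: "funciona" means "works", "não funcionava na primeira versão" means "did not work in the first version", "não é para funcionar" means "not supposed to work", and "aninhado (não tratei pois não sei o que deveria fazer)" means "nested (not handled, since I don't know what it should do)"):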
string(637) "
Boa <i>tarde</i> <b>Bacco</b>, isto é um <b>teste</b> com diversos <b>negritos</b> e <i>sublinhados</i>
<i><b>um</b> dois</i>
<b><i>um</i> dois</b>
-- funciona
<i>foo bar</i>
<i>f o o b a r</i>
<b>teste <i>lorem ipsum</i> dolor sit</b>
<i>teste <b>lorem ipsum</b> dolor sit</i>
abc<i>xyz</i>teste
-- não funcionava na primeira versão
<i>foo_bar</i>
<i>a</i>
<i>*</i>
<i>foo_*bar</i>
-- não é para funcionar
__ foo bar__
__foo bar __
_foo__
__ foo __
_foo_
__foo_
__ __
__a____
____a__
_____
-- corner case: aninhado (não tratei pois não sei o que deveria fazer)
<i>abc</i>xyz<i>123</i>
__abc <i>xyz</i> 123__
-- corner case: escapes
__\___
__a\___
<i>\_abc</i>
"
As you can see, there are still corner cases to handle:
- nested delimiters: __abc__xyz__123__ and __abc __xyz__ 123__, which were replaced by <i>abc</i>xyz<i>123</i> and __abc <i>xyz</i> 123__ (I don't know how parsers usually treat such cases)
- escapes: \_ should be treated as an ordinary character; for example, __a\___ should become <i>a_</i>. A first attempt to solve this was unsuccessful, and I believe the final solution will be much more complicated (a rather complicated one follows below)
Anyway, the ideal is to use a Markdown parser; regex is not always the best solution.
Just for the record, here is an alternative that handles escapes:
$regex = '#(?<=^|[^_])__(?![\s_])(?=[^_])((?>[^\\\\_]*)(?>(?:(?>\\\\_)|(?>(?!__)_[^_]*))[^\\\\_]*)*)(?<![\s\\\\]|(?=\\\\)_)__(?=[^_]|$)#';
$q = preg_replace($regex, '<i>$1</i>', $q);
$q = preg_replace('/\\\\([*_])/', '$1', $q);
Starting with the second regex, which is easier: it checks for a \ followed by _ or *, and removes the \. The really tricky one is the first, which checks whether a _ is escaped. Breaking it into pieces (with some spaces added to make it easier to follow):
(?<=^|[^_]) __ (?![\s_])   <-- opening delimiter
(?=[^_])                   <-- checks that there is at least one character ahead (other than _)
(                          <-- starts the capture group (the content that will go between the tags)
  (?> [^\\\\_]* )          <-- any characters that are neither _ nor \
  (?>
    (?:
      (?> \\\\_) |         <-- an escape (\_), or
      (?> (?!__) _ [^_]* ) <-- a _ that is not a delimiter, followed by characters that are not _
    )
    [^\\\\_]*              <-- any characters that are neither _ nor \
  )*
)
(?<![\s\\\\]|(?=\\\\)_) __(?=[^_]|$)   <-- closing delimiter
If you want, you can put it in the code laid out as above, using the x flag so that spaces and line breaks in the pattern are ignored:
$regex = '#(?<=^|[^_]) __ (?![\s_])
(?=[^_])
(
(?> [^\\\\_]* )
(?>
(?:
(?> \\\\_) |
(?> (?!__) _ [^_]* )
)
[^\\\\_]*
)*
)
(?<![\s\\\\]|(?=\\\\)_) __(?=[^_]|$)#x';
This regex follows the "unrolling the loop" technique, in which you must identify the following parts:
- opening and closing delimiters: I used the same idea as in the previous regex, which is to use lookaheads and lookbehinds to check what comes before and after the __ (that is, check that there is no whitespace after the opening __, etc.). I just added a few more cases to check that one of the _ is not escaped with \
- "normal": what occurs most often between the delimiters. In this case, the characters that are neither \ nor _
- "special": what is not normal. In this case, an escape (\_), or a lone _ (as long as it is not followed by another _)
The general structure of the regex is:
delimiter normal* (?:special normal*)* delimiter
Since both normal and special are marked with * (zero or more occurrences), the regex would accept cases such as ____, so I added the lookahead (?=[^_]) right after the opening delimiter to ensure there is at least one character.
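The structure can be sketched by assembling the pattern from its parts. This is a simplified illustration, not the full regex from this answer: the delimiter checks are reduced to a bare __, and the variable names ($normal, $special) are assumptions:

```php
<?php
// Simplified sketch of: delimiter normal* (?:special normal*)* delimiter
// The delimiter checks are reduced to a bare __ for illustration.
$normal  = '[^\\\\_]';           // a character that is neither \ nor _
$special = '(?:\\\\_|_(?!_))';   // an escape (\_), or a lone _
$regex   = '#__(?=[^_])(' . $normal . '*(?:' . $special . $normal . '*)*)__#';

preg_match($regex, '__foo_bar__', $m);
echo $m[1], PHP_EOL; // foo_bar

preg_match($regex, '__a\___', $m);
echo $m[1], PHP_EOL; // a\_

// (?=[^_]) requires some content, so four underscores do not match.
var_dump(preg_match($regex, '____')); // int(0)
```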
Atomic groups (marked by (?>) are also used to reduce backtracking.
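As a generic illustration of what an atomic group does (the patterns here are unrelated to the Markdown regex and were chosen only to make the effect visible):

```php
<?php
// Generic illustration of atomic groups: once (?>...) has matched,
// the engine cannot backtrack into it.
$subject = 'aaab';

// a+ first takes all three a's, then gives one back so that "ab" can match.
var_dump(preg_match('/^a+ab/', $subject));     // int(1)

// The atomic group keeps all three a's, so "ab" can never match.
var_dump(preg_match('/^(?>a+)ab/', $subject)); // int(0)
```

The trade-off is that an atomic group can only be used where giving characters back would never lead to a match anyway; otherwise it changes what the regex accepts.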
This regex skips over escapes (\_) but does not remove them, so I needed another regex to remove them afterwards.
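The unescape step can be tried on its own. The sample string below is illustrative; the pattern is the second preg_replace shown earlier:

```php
<?php
// The unescape step in isolation: a backslash followed by _ or * is
// replaced by the character itself. The sample string is illustrative.
$text = '<i>a\_</i> and 2 \* 3';

// In a single-quoted PHP string, \\\\ is two backslashes, which the regex
// engine reads as one literal backslash.
$clean = preg_replace('/\\\\([*_])/', '$1', $text);

echo $clean, PHP_EOL; // <i>a_</i> and 2 * 3
```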
Although it works, there are probably other corner cases it doesn't cover. But even if there weren't, I still don't think it's worth it, and using a parser remains the best option.
@Sam it wouldn't work in a sentence with consecutive Markdowns...
– MagicHat
@Magichat well said. Markdown is not regular, it is context-free.
– Jefferson Quesado
@GuilhermeNascimento, does it need to be PHP? Not JS (no third-party library)?
– Jefferson Quesado
I read a while ago that the Markdown specification is very messy. Apart from study purposes, I don't see much reason to do this, considering there's a large number of Markdown parsers in PHP to use or be inspired by.
– gmsantos
@gmsantos I'm someone who likes to build my own things for a number of reasons. If it's worth building, I build it; if it isn't, I won't reinvent the wheel. In this case I don't know whether it's worth it or not; besides, my goal is more study than solutions :)
– Guilherme Nascimento
My goal in answering is also to study. I'm not very good with regex and this question helped me improve.
– Andrei Coelho