How to create a simple markdown with PHP?

I would like to create a simple Markdown converter, handling only bold and italic for now, for example:

  • **foo** becomes <b>foo</b>
  • __bar__ becomes <i>bar</i>

Of course, some details need attention. For example, this must not become italic:

 __ foo __

Because the text is separated from the delimiters: the first and last characters must be joined to the delimiters. However, these would be valid:

 __foo bar__     => <i>foo bar</i>
 __f o o b a r__ => <i>f o o b a r</i>

Because spaces between the first and the last character are accepted.

So far I have created this:

  • Bold:

    $str = preg_replace('#(^|[^\*])\*\*([^\s\*]([^\*]+?)?[^\s\*])\*\*([^\*]|$)#', '$1<b>$2</b>$4', $str);
    
  • Italic:

    $str = preg_replace('#(^|[^_])__([^\s_]([^_]+?)?[^\s_])__([^_]|$)#', '$1<i>$2</i>$4', $str);
    

Both are very similar and seem to work well. To explain the regex better:

(^|[^_])__([^\s_]([^_]+?)?[^\s_])__([^_]|$)

  ^     ^   ^     ^        ^     ^   ^
  |     |   |     |        |     |   |
  |     |   |     |        |     |   |
  |     |   |     |        |     |   |
  |     |   |     |        |     |   |
  |     |   |     |        |     |   +-- checks that what follows the delimiter is not an underscore, or that it is the end of the string
  |     |   |     |        |     |
  |     |   |     |        |     +-- checks that the delimiter is 2 underscores
  |     |   |     |        |
  |     |   |     |        +-- the last character before the delimiter cannot be a space or an underscore
  |     |   |     |
  |     |   |     +-- matches anything that is not an underscore; this group is optional
  |     |   |
  |     |   +-- checks that what follows the first delimiter is neither a space nor an underscore
  |     |
  |     +-- checks that the delimiter is 2 underscores
  |
  +-- checks whether it is the start of the string or whether what precedes the delimiter is not an underscore _

Example on ideone: https://ideone.com/PL8nTA

However, the way I did it, this is not possible:

__foo_bar__

And neither is this:

**foo*bar**
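
The behavior described above can be reproduced with a short script. This is only a sketch: the function name simpleMarkdown is a hypothetical wrapper, while the two patterns are copied unchanged from the question.

```php
<?php
// Hypothetical wrapper around the two replacements posted in the question.
function simpleMarkdown(string $str): string
{
    // Bold: **text**
    $str = preg_replace('#(^|[^\*])\*\*([^\s\*]([^\*]+?)?[^\s\*])\*\*([^\*]|$)#', '$1<b>$2</b>$4', $str);
    // Italic: __text__
    $str = preg_replace('#(^|[^_])__([^\s_]([^_]+?)?[^\s_])__([^_]|$)#', '$1<i>$2</i>$4', $str);
    return $str;
}

// Converted: the text touches the delimiters.
echo simpleMarkdown('**foo**'), PHP_EOL;      // <b>foo</b>
echo simpleMarkdown('__foo bar__'), PHP_EOL;  // <i>foo bar</i>

// Left unchanged: spaces next to the delimiters, or a delimiter character inside.
echo simpleMarkdown('__ foo __'), PHP_EOL;    // __ foo __
echo simpleMarkdown('__foo_bar__'), PHP_EOL;  // __foo_bar__
echo simpleMarkdown('**foo*bar**'), PHP_EOL;  // **foo*bar**
```

The last two lines are the cases the question asks to support.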

I would like suggestions for improving this, or else something totally different, even if it does not use regex.

  • 3

    @Sam that wouldn’t work in a sentence with consecutive Markdown markers...

  • 1

    @Magichat well said. Markdown is not regular, it is context-free.

  • @Guilhermenascimento, does it need to be PHP? Not JS (with no third-party library)?

  • I read a while ago that the Markdown specification is quite messy. Apart from study purposes, I don’t see much reason to do this, considering there is a large number of Markdown parsers in PHP to use or be inspired by.

  • 1

    @gmsantos I am someone who likes to build my own things, for a number of reasons. If it is worth building, I build it; if not, I won’t reinvent the wheel. In this case I don’t know whether it is worth it or not, and besides, my goal is more study than a solution :)

  • My goal in answering is also to study. I’m not very good with regex and this question helped me improve.

2 answers

5

The regex does not accept __foo_bar__ because of the negated character class [^_], which means "any character that is not _". Since foo_bar contains a _, no match is found.

To accept this case, we have to add another condition: "the character _, as long as it is not followed by another _". We can do this with a negative lookahead: _(?!_).

So this part becomes [^_]|_(?!_) (a character that is not _, or a _ as long as it is not followed by another _).
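
To isolate the effect of this change, here is a minimal sketch with two simplified, anchored patterns. They are illustrations only, not the full regex: $strict and $relaxed are assumed names, and the only difference between them is the inner alternative.

```php
<?php
// Simplified inner part of the original pattern: no underscore allowed inside.
$strict = '#^__((?:[^_])+)__$#';
// Inner part described above: a single _ is allowed when the next character is not _.
$relaxed = '#^__((?:[^_]|_(?!_))+)__$#';

var_dump(preg_match($strict, '__foo_bar__'));    // int(0)
var_dump(preg_match($relaxed, '__foo_bar__'));   // int(1)

// A double underscore inside is still rejected by both.
var_dump(preg_match($relaxed, '__foo__bar__'));  // int(0)
```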


Another detail is that the regex does not accept cases with only one character between the delimiters, such as __a__. That is because [^\s_] appears twice: once after the opening delimiter and once before the closing delimiter. Therefore the regex requires at least two characters between the delimiters.

We can solve this with a negative lookahead right after the opening delimiter: I replace the [^\s_] with (?![\s_]) (that is, what follows cannot be a space or a _). The difference is that the lookahead only looks at what is ahead without consuming the character (so if there is only one character, it is consumed by the following parts of the regex, which allows the single-character case).

We can also replace the start and end checks with a lookbehind and a lookahead, so as not to create unnecessary groups that would have to be included in the replacement. Finally, I keep only one capture group, for the content between the delimiters, and turn the others into non-capturing groups by changing ( to (?:. That way there is only one group, and the replacement can be just <i>$1</i>.

It would look like this:

(?<=^|[^_])__((?![\s_])(?:(?:[^_]|_(?!_))+?)?[^\s_])__(?=[^_]|$)

See here the regex working

In short:

  • (?<=^|[^_]): lookbehind that checks that the delimiter is preceded by the start of the string or by a character that is not _
  • (?![\s_]): negative lookahead that checks that the delimiter is not followed by a space or a _
  • [^_]|_(?!_): a character that is not _, or a _ as long as it is not followed by another _
  • [^\s_]: a character that is neither whitespace nor _
  • (?=[^_]|$): lookahead that checks that the delimiter is followed by the end of the string or by a character that is not _
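
Beyond the simpler replacement string, the lookarounds have a visible side effect when two italic spans are separated by a single character. A minimal sketch comparing the pattern from the question with the one above (the variable names and the sample text are illustrative):

```php
<?php
// Pattern from the question: the characters around the delimiters are consumed.
$consuming = '#(^|[^_])__([^\s_]([^_]+?)?[^\s_])__([^_]|$)#';
// Pattern from this answer: the surroundings are only asserted, never consumed.
$lookaround = '#(?<=^|[^_])__((?![\s_])(?:(?:[^_]|_(?!_))+?)?[^\s_])__(?=[^_]|$)#';

$text = '__ab__ __cd__';

// The first match consumes the space, so nothing is left to precede the second __.
$withConsuming = preg_replace($consuming, '$1<i>$2</i>$4', $text);
// The space is only looked at, so it is still available to the second match.
$withLookaround = preg_replace($lookaround, '<i>$1</i>', $text);

echo $withConsuming, PHP_EOL;   // <i>ab</i> __cd__
echo $withLookaround, PHP_EOL;  // <i>ab</i> <i>cd</i>
```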

For the ** delimiters, just use the same logic. Remember that * has a special meaning in regex (it is a quantifier meaning "zero or more occurrences") and must be escaped with \ (becoming \*), unless it is inside square brackets. That is:

(?<=^|[^*])\*\*((?![\s*])(?:(?:[^*]|\*(?!\*))+?)?[^\s*])\*\*(?=[^*]|$)
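
Since the two patterns differ only in the delimiter character, they could be generated from a single template. A sketch, assuming a hypothetical helper named delimiterPattern (not part of the original answer) and relying on PHP's preg_quote for the escaping. Unlike the hand-written pattern, it also escapes * inside the square brackets, which is harmless.

```php
<?php
// Hypothetical helper: builds the pattern for a doubled one-character delimiter.
function delimiterPattern(string $char): string
{
    // 'C' is a placeholder for the (escaped) delimiter character.
    $template = '#(?<=^|[^C])CC((?![\sC])(?:(?:[^C]|C(?!C))+?)?[^\sC])CC(?=[^C]|$)#';
    // preg_quote() escapes regex metacharacters such as *; '#' is the pattern delimiter.
    return str_replace('C', preg_quote($char, '#'), $template);
}

$text = 'A **bold*star** and __under_score__ here';
$text = preg_replace(delimiterPattern('*'), '<b>$1</b>', $text);
$text = preg_replace(delimiterPattern('_'), '<i>$1</i>', $text);

echo $text, PHP_EOL; // A <b>bold*star</b> and <i>under_score</i> here
```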

Testing:

$q = "
Boa __tarde__ **Bacco**, isto é um **teste** com diversos **negritos** e __sublinhados__

__**um** dois__
**__um__ dois**

-- funciona
__foo bar__
__f o o b a r__
**teste __lorem ipsum__ dolor sit**
__teste **lorem ipsum** dolor sit__
abc__xyz__teste

-- não funcionava na primeira versão
__foo_bar__
__a__
__*__
__foo_*bar__


-- não é para funcionar
__ foo bar__
__foo bar __
_foo__
__ foo __
_foo_
__foo_
__ __
__a____
____a__
_____

-- corner case: aninhado (não tratei pois não sei o que deveria fazer)
__abc__xyz__123__
__abc __xyz__ 123__

-- corner case: escapes
__\___
__a\___
__\_abc__
";

$q = preg_replace('#(?<=^|[^*])\*\*((?![\s*])(?:(?:[^*]|\*(?!\*))+?)?[^\s*])\*\*(?=[^*]|$)#', '<b>$1</b>', $q);
$q = preg_replace('#(?<=^|[^_])__((?![\s_])(?:(?:[^_]|_(?!_))+?)?[^\s_])__(?=[^_]|$)#', '<i>$1</i>', $q);

var_dump($q);

Output:

string(637) "
Boa <i>tarde</i> <b>Bacco</b>, isto é um <b>teste</b> com diversos <b>negritos</b> e <i>sublinhados</i>

<i><b>um</b> dois</i>
<b><i>um</i> dois</b>

-- funciona
<i>foo bar</i>
<i>f o o b a r</i>
<b>teste <i>lorem ipsum</i> dolor sit</b>
<i>teste <b>lorem ipsum</b> dolor sit</i>
abc<i>xyz</i>teste

-- não funcionava na primeira versão
<i>foo_bar</i>
<i>a</i>
<i>*</i>
<i>foo_*bar</i>


-- não é para funcionar
__ foo bar__
__foo bar __
_foo__
__ foo __
_foo_
__foo_
__ __
__a____
____a__
_____

-- corner case: aninhado (não tratei pois não sei o que deveria fazer)
<i>abc</i>xyz<i>123</i>
__abc <i>xyz</i> 123__

-- corner case: escapes
__\___
__a\___
<i>\_abc</i>
"

As you can see, there are still corner cases to handle:

  • nested delimiters: __abc__xyz__123__ and __abc __xyz__ 123__, which were replaced by <i>abc</i>xyz<i>123</i> and __abc <i>xyz</i> 123__ (I don’t know how parsers usually treat such cases)
  • escapes: \_ should be treated as an ordinary character; for example, __a\___ should become <i>a_</i>. An earlier attempt of mine to solve this was unsuccessful, and I believe the final solution is much more complicated (a rather complicated one follows below)

Anyway, the ideal is to use a Markdown parser; regex is not always the best solution.


Just for the record, here is an alternative that handles escapes:

$regex = '#(?<=^|[^_])__(?![\s_])(?=[^_])((?>[^\\\\_]*)(?>(?:(?>\\\\_)|(?>(?!__)_[^_]*))[^\\\\_]*)*)(?<![\s\\\\]|(?=\\\\)_)__(?=[^_]|$)#';
$q = preg_replace($regex, '<i>$1</i>', $q);
$q = preg_replace('/\\\\([*_])/', '$1', $q);

Let's start with the second regex, which is easier. It checks whether there is a \ followed by _ or *, and removes the \. The really tricky one is the first, which checks whether a _ is escaped. Breaking it into pieces (and adding some spaces to make it easier to read):

(?<=^|[^_]) __ (?![\s_])  <-- opening delimiter
(?=[^_])  <-- checks that there is at least one character ahead (that is not _)
(  <-- starts the capture group (the content that will go between the tags)
 (?> [^\\\\_]* )  <-- any characters that are neither _ nor \
 (?>
  (?:
    (?> \\\\_) |          <-- an escape (\_), or
    (?> (?!__) _ [^_]* )  <-- a _ that is not a delimiter, followed by characters that are not _
  )
  [^\\\\_]*   <-- any characters that are neither _ nor \
 )*
)
(?<![\s\\\\]|(?=\\\\)_) __(?=[^_]|$)  <-- closing delimiter

If you want, you can put it in the code as above, using the flag x so that spaces and line breaks are ignored:

$regex = '#(?<=^|[^_]) __ (?![\s_])
(?=[^_])
(
 (?> [^\\\\_]* )
 (?>
  (?:
    (?> \\\\_) |
    (?> (?!__) _ [^_]* )
  )
  [^\\\\_]*
 )*
)
(?<![\s\\\\]|(?=\\\\)_) __(?=[^_]|$)#x';

This regex follows the "unrolling the loop" technique, in which you must identify the following parts:

  • opening and closing delimiters: I used the same idea as in the previous regex, which is to use lookaheads and lookbehinds to check what comes before and after the __ (that is, check that there is no space after the opening __, etc). I just added a few more cases to check that none of the _ is escaped with \
  • "normal": what occurs most often between the delimiters. In this case, the characters that are neither \ nor _
  • "special": what is not normal. In this case, an escape (\_), or a single _ (as long as it is not followed by another _)

The general structure of the regex is:

delimiter normal* (?:special normal*)* delimiter

Since both the normal and the special parts are marked with * (zero or more occurrences), the regex would accept cases such as ____, so I added the lookahead (?=[^_]) right after the opening delimiter to ensure that there is at least one character.

Atomic groups (marked by (?>...)) are also used, to reduce backtracking.

This regex takes the escapes (\_) into account but does not remove the backslashes, so I needed another regex to remove them afterward.
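
The two steps can be seen on the escape corner cases from the test above. A minimal sketch: both patterns are copied from this answer, while the function name italicWithEscapes and the choice to return both stages are assumptions made for illustration.

```php
<?php
// Hypothetical wrapper: returns the text after each of the two steps.
function italicWithEscapes(string $text): array
{
    $italic = '#(?<=^|[^_])__(?![\s_])(?=[^_])((?>[^\\\\_]*)(?>(?:(?>\\\\_)|(?>(?!__)_[^_]*))[^\\\\_]*)*)(?<![\s\\\\]|(?=\\\\)_)__(?=[^_]|$)#';
    $unescape = '/\\\\([*_])/';

    // Step 1: wrap the content in tags; the \_ escapes are still present.
    $tagged = preg_replace($italic, '<i>$1</i>', $text);
    // Step 2: drop the backslash in front of each escaped _ or *.
    $clean = preg_replace($unescape, '$1', $tagged);

    return [$tagged, $clean];
}

// Single-quoted strings keep the backslash literally.
foreach (['__a\___', '__\_abc__'] as $input) {
    [$tagged, $clean] = italicWithEscapes($input);
    echo $input, ' => ', $tagged, ' => ', $clean, PHP_EOL;
}
```

For these two inputs the final results should be <i>a_</i> and <i>_abc</i>.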

See this regex working here. Although it works, there are probably other corner cases that it does not handle. But even if there were none, I still don’t think it is worth it, and using a parser remains the best option.

3

After several tests I came up with a solution that I believe covers all cases, telling the correct strings from the incorrect ones. I started from the following premise:

Cases that are right:

Input:

__correto__
__c o r r e t o__
__c_o_r_r_e_t_o__
__cor   re  to__
__co rre _to__
__a__

Output:

  • correto
  • c o r r e t o
  • c_o_r_r_e_t_o
  • cor   re  to
  • co rre _to
  • a

Cases that are wrong:

__errado __
__ errado__
__errado___
___errado__

The same applies to bold.

Using this regex:

(.?)(__([^_\s]+\s*_?)*[^\s_]+__)([^_]|$)

Along with PHP's preg_match_all, we can analyze the groups as follows:

(.?) ---> matches any single character (or none) before the next group

(__([^_\s]+\s*_?)*[^\s_]+__)  ([^_]|$) --> checks that what follows the delimiter is not an underscore, or that it is the end of the string
^   ^             ^      ^
.   .             .      ----------> closes the group with 2 underscores
.   .             .
.   .             ----------> matches one or more characters other than space and underscore
.   .
.   ------------------> optional, repeatable group: one or more characters other than underscore and space, followed by zero or more spaces, followed by an optional underscore
.
----------------> opens the group with 2 underscores

With the help of PHP, we will do this:

    $string = "Boa __tarde__ **Bacco**, isto é um **teste** com diversos **negritos** e __sublinha_dos__

    __**um** dois__  **__um__ dois**
    __aqui nao_funciona __ __ nem_aqui,pois está errado__
    __aqui está certo__ ___errado__ __certo__";

    preg_match_all("/(.?)(\*\*([^\*\s]+\s*\*?)*[^\s\*]+\*\*)([^\*]|$)/", $string, $resultNegrito);

    $negrito = $resultNegrito[2];
    $iniNegrito = $resultNegrito[1]; // values of the (.?) group
    for($x = 0; $x < count($negrito); $x++){
        if($iniNegrito[$x] != "*"){
            $res = "<b>".substr($negrito[$x],2,strlen($negrito[$x]) -4)."</b>";
            $string = str_replace($negrito[$x],$res,$string);
        }
    }

    preg_match_all("/(.?)(__([^_\s]+\s*_?)*[^\s_]+__)([^_]|$)/", $string, $resultSublinhado);

    $sublinhado = $resultSublinhado[2];
    $iniSublinhado = $resultSublinhado[1]; // values of the (.?) group
    for($x = 0; $x < count($sublinhado); $x++){
        if($iniSublinhado[$x] != "_"){
            $res = "<u>".substr($sublinhado[$x],2,strlen($sublinhado[$x]) -4)."</u>";
            $string = str_replace($sublinhado[$x],$res,$string);
        }
    }

    echo $string;


ESCAPES..

In this script you can use backslashes to underline or bold non-standard text. Imagine the user wants to underline this: __METHOD__. To do so, just write: __\_\_METHOD_\_\__

Using PHP's stripslashes, you remove the backslashes that were used, leaving the text clean.
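
As a small illustration of that last step only, here is a sketch in which $tagged is an assumed intermediate value, that is, the text after the <u> tags were applied and with the backslashes still in place:

```php
<?php
// Assumed intermediate result: tags already applied, escapes still present.
// In a single-quoted PHP string, \_ is kept as a literal backslash plus underscore.
$tagged = 'Use <u>\_\_METHOD_\_\_</u> here';

// stripslashes() removes each escaping backslash (and turns \\ into \).
$clean = stripslashes($tagged);

echo $clean, PHP_EOL; // Use <u>__METHOD__</u> here
```

Note that stripslashes acts on every backslash in the text, not only on those placed before _ or *, so a backslash meant to be displayed has to be written more than once.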

An example with the text:

Para gerar um construtor nas recentes **\*versões do php*\**, 
usa-se o __**\_\_construct()**__ 
**esta é a __forma correta__** para usar. 

No **php** Existe a possibilidade de usar a 
contante mágica __\_\_FUNCTION_\_\__ para pegar o nome da função. 

__Neste script__, se eu quiser usar um **\*escape*\** para a 
barra inversa dentro de um sublinhado ou negrito, basta 
multiplica-lo por 3. Assim:

            __\\\_teste_\\\__

The output using echo stripslashes($string) will be:

Para gerar um construtor nas recentes <b>*versões do php*</b>, usa-se o <u><b>__construct()</b></u> <b>esta é a <u>forma correta</u></b> para usar. 

No <b>php</b> Existe a possibilidade de usar a contante mágica <u>__FUNCTION__</u> para pegar o nome da função. 

<u>Neste script</u>, se eu quiser usar um <b>*escape*</b> para a barra inversa dentro de um sublinhado ou negrito, basta multiplica-lo por 3. Assim:

<u>\_teste_\</u>

  • Dear Andrei, first, thank you for answering. Second, I’m sorry, but I need to be honest: this use of _{1} has no effect, it is the same as without it, and it does not solve the problem described at the end of the question. I will wait for your edit. See you soon.

  • @Guilhermenascimento thank you for your patience. You are right: the {1} has no effect. However, it works for the case __fo_bar__, see: https://regexr.com/3t45l

  • @Guilhermenascimento I made the change and added a link. Take a look. Maybe I didn’t quite understand the problem.

  • 1

    Dear Andrei, it does work, but only more or less: it allows things like __a_______, probably because of the end of the "b group". The intention is to avoid that: the delimiters must always be exactly 2 underscores and be followed by any character that is not _. The same goes for spaces, since the regex allows __foo __. But keep trying, I’m sure something good will come out of it :) Thanks in advance!

  • @Guilhermenascimento you’re right! It was worth a try.... =/ ... I will try other things, and if anything comes up I will post it here. Cheers!

  • @Guilhermenascimento I made changes... It no longer allows __a_______ and no longer allows __foo __, take a look

  • @Guilhermenascimento take a look when you can. I had done some testing with the old code and it wasn’t working perfectly. But now it’s very close to ideal.
