Retrieve content between custom tags using Regex

Viewed 3,727 times

0

I need to capture the content between custom tags whose opening tag has a fixed prefix, for example: <:item>Conteúdo</item>. However, I can't get the closing tag to follow the name of the opening tag. So far it only works like this: <:item>Conteúdo</end>, with the same fixed closing tag for every tag.

Current regex:

preg_match_all("~<:(.*?)>(.*?)</end>~si", $conteudo, $retorno);

What regular expression would find an opening tag and its corresponding closing tag, even when tags with the same name are nested as parents and children?

  • Do you intend to use the tag <:item> itself, or only its content?

  • If you are not required to use regex, why not use SimpleXML?

  • How do you build this string? It follows no standard, neither XML nor HTML...

2 answers

1


1) If you intend to use ONLY the content of the tag, you can use the regex below, which removes everything between < and >:

$conteudo = '<:item>Conteúdo</item>';
print_r( preg_replace("/<.*?>/", "", $conteudo) );

Example available on ideone


2) If you intend to use the tag itself and the content, you can use the regex below:

$conteudo = '<:item>Conteúdo</item>';
preg_match_all( '~<.+?>(.+?)<\/.+?>~' , $conteudo , $retorno );
echo $retorno[1][0];

Example available on ideone
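
If the closing tag has to carry the same name as the opening tag, a backreference is one option. This is only a sketch, assuming tag names consist of word characters (\w+). The variable $nomes is illustrative and not part of the original code. It shows \1 tying the closing tag to the name captured in the opening tag:

```php
<?php
$conteudo = '<:header>HEADER</header><:main>MAIN <:item>ITEM</item></main>';

// (\w+) captures the tag name; \1 requires the closing tag to repeat that name.
// The "s" flag lets "." cross line breaks; "u" treats the subject as UTF-8.
preg_match_all('~<:(\w+)>(.*?)</\1>~su', $conteudo, $retorno);

// Map each captured tag name to its content
// (assumption: names are unique at the top level, otherwise later ones overwrite earlier ones).
$nomes = array_combine($retorno[1], $retorno[2]);
print_r($nomes);
```

Child tags with a different name stay inside the parent's content. A child with the same name as its parent would still end the match too early, so this sketch does not cover that case.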


Update

Step 1) Replace every <...> with a marker |
Result: |HEADER||MAIN|ITEM||

Step 2) Collapse repeated markers || into a single |
Result: |HEADER|MAIN|ITEM|

Step 3) Split the string on the markers and filter out the empty values
Result: array( 1 => 'HEADER' , 2 => 'MAIN' , 3 => 'ITEM' )

$string = '<:header>HEADER</header><:main>MAIN<:item>ITEM</item></main>';

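// step 1: replace every tag with the marker
$string = preg_replace( '/<.*?>/' , '|' , $string );

// step 2: collapse runs of markers into one
$string = preg_replace('/\|+/', '|', $string);

// step 3: split on the marker and drop empty entries
$string = array_filter( explode( '|' , $string ) );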

Note that this is NOT ideal; it just solves the problem. The way you generate this string is inappropriate. See a demo on ideone
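
If the tag names need to be kept and tags may be nested, including a child with the same name as its parent, a recursive pattern is one possible alternative to the marker approach. The following is a minimal sketch, not a definitive solution. It assumes the input is well formed, that tag names match \w+, and that the text itself contains no < character. The function name extrairTags is hypothetical:

```php
<?php
// Hypothetical helper: returns [tag name, inner content] pairs for every tag, children included.
function extrairTags(string $texto): array
{
    // Content is either plain text ([^<]++) or a complete nested element;
    // (?R) re-applies the whole pattern, and \1 must repeat the opening name.
    $padrao = '~<:(\w+)>((?:[^<]++|(?R))*)</\1>~u';
    $achados = [];
    if (preg_match_all($padrao, $texto, $m, PREG_SET_ORDER)) {
        foreach ($m as $grupo) {
            $achados[] = [$grupo[1], $grupo[2]];
            // Descend into the content to pick up child tags.
            $achados = array_merge($achados, extrairTags($grupo[2]));
        }
    }
    return $achados;
}

$string = '<:header>HEADER</header><:main>MAIN<:item>ITEM<:item>SUB</item></item></main>';
print_r(extrairTags($string));
```

Each parent is listed before its children, and a parent's content still includes the markup of its children.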

  • The second example is pretty much what I need, but it doesn’t work for child tags, for example: $conteudo = '<:header>HEADER</header> <:main>MAIN <:item>ITEM</item></main>'; preg_match_all( '~<.+?>(.+?)<\/.+?>~' , $conteudo , $retorno ); var_dump($retorno); Returns: array(2) { [0]=> array(2) { [0]=> string(24) "<:header>HEADER</header>" [1]=> string(30) "<:main>MAIN <:item>ITEM</item>" } [1]=> array(2) { [0]=> string(6) "HEADER" [1]=> string(16) "MAIN <:item>ITEM" } }

  • 1

    Add this information to the question!

0

I don’t know if this is the way you want it, but it works:

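$html = "<:item>ConteúdoA</item><:valor>ConteúdoB</valor><:tag>ConteúdoC</tag><:teste>ConteúdoD</teste>";
// Collect the name of every opening tag (<:name>)
preg_match_all("/<:(.*?)>/", $html, $arrTag);

foreach ($arrTag[1] as $tag)
{
    echo $tag;
    // Escape the name so it is treated literally inside the pattern
    $nome = preg_quote($tag, '/');
    // Capture the content between <:name> and </name>
    preg_match_all('/<:'.$nome.'>(.+?)<\/'.$nome.'>/sm', $html, $conteudo);
    print_r($conteudo);
}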
  • The problem is that the tag names cannot be hard-coded. Only the pattern <:TAG_AQUI></TAG_AQUI> is fixed, so the closing tag has to be found from the name used in the opening tag.

  • Hi, Motonio, this is what happens with answers that only have code: http://i.stack.Imgur.com/Wnj5w.png . It is good to add an explanation, even a brief one, of why your code solves the problem.

  • I did it differently, I hope it’s the way you want it

  • If you set the tag name in the pattern inside the loop, you don’t need the .*? in <:'.$tag.'.*?>.

  • @Papacharlie I did it this way in case the author needs to know which tag he’s getting
