I am trying to write a parser for robots.txt using a regular expression, but I can't get the expression right. I ran several tests on Regex101 and still haven't gotten the expected result.
My regular expression:
/user-agent: (bot|\*)\n*((disallow:\s*(?<disallow>.*)|allow:\s*(?<allow>.*)|sitemap:\s*(?<sitemap>.*))\n*)+/gi
My test input:
User-agent: *
Disallow: /exemplo/
Allow: /dolor/
Disallow: /sit/
Allow: /amet/
Sitemap: http://www.loremipsum.com/sitemap.xml
The image shows the result that Regex101 returns and the result I wanted.
Can you explain exactly what you want to do with the regex? That would be easier than working it out from the colors in your example.
– Molx
I want to put the values of disallow, allow and sitemap into arrays of the same name. For example, /amet/ would be inside the allow array.
– hsbpedro
You might want to rethink how you are going to use this regex. I don't think a single group can hold multiple results, or that multiple groups can share the same name. An easier alternative is to do it in stages. For example, you can extract only the Disallow values using (?<=disallow:)\s?(.*) and then do the same for Allow and the other elements of robots.txt.
– Molx
@Molx There's just one problem: that regular expression will catch every allow and disallow in the file. I only want the rules that belong to the right user-agent (such as * or bot).
– hsbpedro
Can it be in Perl?
– JJoao
No need. I got it. Thank you very much!
– hsbpedro