Regex: extract a substring delimited by an end pattern that repeats in the text (start text end, start text end)

Viewed 68 times

1

I want to extract the substring delimited by ENTER and Tempo de viagem total:4h 05m from the text shown further down.

For this I built the following regular expression: ENTER[\S\s]+Tempo de viagem total:.*h .*m (meant to be general enough to work on any text)

However, when I run it against the text below:

ENTERAZUL Linhas Aereas Brasileiras - 4884
ENTERAerospatiale/Alenia ATR 72
ENTERFev
ENTER1
ENTERBauru, SP, BR
ENTERSao Paulo, SP, BR
ENTER19:50JTC
ENTER20:50VCP
ENTERMoussa Nakhl Tobias Airport
ENTERViracopos International Airport
ENTER1h 00m
ENTEReconômica
ENTERExecutiva
ENTER1h 25m escala · Sao Paulo, SP, BR
ENTERAZUL Linhas Aereas Brasileiras - 4663
ENTERUnknown Aircraft
ENTERFev
ENTER1
ENTERSao Paulo, SP, BR
ENTERBrasilia, DF, BR
ENTER22:15VCP
ENTER23:55BSB
ENTERViracopos International Airport
ENTERBrasilia International Airport
ENTER1h 40m
ENTEReconômica
ENTERExecutiva
ENTEREmissões de CO2:econômica/Econômica "Premium": 154kg
ENTERExecutiva: 195kg
ENTERTempo de viagem total:4h 05m
ENTERAZUL Linhas Aereas Brasileiras - 4399
ENTERUnknown Aircraft
ENTERFev
ENTER5
ENTERBrasilia, DF, BR
ENTERSao Paulo, SP, BR
ENTER05:25BSB
ENTER07:05VCP
ENTERBrasilia International Airport
ENTERViracopos International Airport
ENTER1h 40m
ENTEReconômica
ENTERExecutiva
ENTER2h 25m escala · Sao Paulo, SP, BR
ENTERAZUL Linhas Aereas Brasileiras - 4530
ENTERAerospatiale/Alenia ATR 72
ENTERFev
ENTER5
ENTERSao Paulo, SP, BR
ENTERBauru, SP, BR
ENTER09:30VCP
ENTER10:35JTC
ENTERViracopos International Airport
ENTERMoussa Nakhl Tobias Airport
ENTER1h 05m
ENTEReconômica
ENTERExecutiva
ENTEREmissões de CO2:econômica/Econômica "Premium": 154kg
ENTERExecutiva: 195kg
ENTERTempo de viagem total:5h 10m

The end delimiter is matched at Tempo de viagem total:5h 10m instead of Tempo de viagem total:4h 05m, so the whole text gets selected, as if the search for the end delimiter ran from the end of the text toward the beginning.

Is there any way to do this kind of search so that it stops at the first occurrence of the end delimiter? (In this example, the first occurrence of Tempo de viagem total:.*h .*m)

I’m using the site https://regexr.com/ to test.

2 answers

2


First, an explanation about [\S\s]: this is a well-known "trick" to match "any character". Usually we use the dot for "any character", but by default the dot does not match line breaks, so [\S\s] ends up being an alternative: it combines \s (a shortcut for spaces, tabs, line breaks, among other characters) and \S (everything that is not \s). In other words, [\S\s] matches any character, including line breaks.

As for the quantifier +, it means "one or more occurrences", but by default quantifiers are "greedy" and try to match as many characters as possible. And what is the largest possible run of characters that is still followed by Tempo de viagem total etc.? Everything up to the last occurrence in the string, which is why the match goes all the way to 5h 10m.

One way to solve this is to make the quantifier + "lazy" by putting a ? right after it: [\S\s]+?. Now it matches as few characters as possible, which in this case means stopping at 4h 05m.
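
To see the difference outside of Regexr, here is a minimal sketch in Python (my choice of language, the question does not mention one). The text variable is a shortened stand-in for the itinerary in the question, not the full text:

```python
import re

# Shortened stand-in for the itinerary in the question (two records).
text = (
    "ENTERAZUL Linhas Aereas Brasileiras - 4884\n"
    "ENTERExecutiva: 195kg\n"
    "ENTERTempo de viagem total:4h 05m\n"
    "ENTERAZUL Linhas Aereas Brasileiras - 4399\n"
    "ENTERExecutiva: 195kg\n"
    "ENTERTempo de viagem total:5h 10m\n"
)

# Greedy: [\S\s]+ grabs as much as it can, so the match ends at the LAST delimiter.
greedy = re.search(r"ENTER[\S\s]+Tempo de viagem total:.*h .*m", text)

# Lazy: [\S\s]+? grabs as little as it can, so the match ends at the FIRST delimiter.
lazy = re.search(r"ENTER[\S\s]+?Tempo de viagem total:.*h .*m", text)

print(greedy.group().endswith("5h 10m"))  # True
print(lazy.group().endswith("4h 05m"))    # True
```

The only difference between the two patterns is the ? after the +.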

Another detail is that you are using .*h .*m, but .* means "zero or more characters", which means it would also match things like h m or abch xyzm. If you want to restrict it to digits only, you can use \d+h \d+m (\d+ means "one or more digits"), or, if you want to control the quantity, \d{1,2}h \d{1,2}m (here \d{1,2} means "at least 1, at most 2 digits"). That way you guarantee there is at least one digit.
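
A quick way to check this claim, again as a Python sketch (the sample strings are the ones mentioned above; fullmatch is used so that the whole string has to fit the pattern):

```python
import re

loose = re.compile(r".*h .*m")             # accepts almost anything around "h " and "m"
strict = re.compile(r"\d{1,2}h \d{1,2}m")  # requires 1 or 2 digits before "h" and "m"

for sample in ["4h 05m", "h m", "abch xyzm"]:
    print(
        repr(sample),
        "loose:", bool(loose.fullmatch(sample)),
        "strict:", bool(strict.fullmatch(sample)),
    )
```

Only 4h 05m passes the strict pattern, while all three samples pass the loose one.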

It might even "work" with .*, but often you do not want "anything", but something more specific (in this case, it seems to me that only digits would be valid, so it is not "anything"), and in regex it’s better to be specific about what you want and also what you don’t want.

In short, the regex would be ENTER[\S\s]+?Tempo de viagem total:\d+h \d+m.


In the case of sites like Regexr, you could also use the dot instead of [\S\s], and enable the single line flag:

[Image: RegExr with the single line flag enabled]

With this flag enabled, the dot also matches line breaks, so it can replace [\S\s]. In this case, the expression would be ENTER.+?Tempo de viagem total:\d+h \d+m.
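
If you later move from Regexr to code, the same idea exists in most regex flavors. As an illustration only, in Python the corresponding flag is re.DOTALL; the text below is a made-up, shortened version of the itinerary:

```python
import re

# Shortened, made-up stand-in for the itinerary text.
text = (
    "ENTERFev\n"
    "ENTER1\n"
    "ENTERTempo de viagem total:4h 05m\n"
    "ENTERFev\n"
    "ENTERTempo de viagem total:5h 10m\n"
)

pattern = r"ENTER.+?Tempo de viagem total:\d+h \d+m"

# Without the flag, the dot stops at line breaks, so nothing matches here.
without_flag = re.search(pattern, text)

# With re.DOTALL, the dot also matches line breaks.
with_flag = re.search(pattern, text, flags=re.DOTALL)

print(without_flag)       # None
print(with_flag.group())  # first record, ending at "4h 05m"
```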


To better understand greedy and lazy quantifiers, read here, here and here.


Keep in mind that this problem would be, in my opinion, easier to solve without regex, using a programming language. You could read the text line by line, check whether each line is the beginning or the end of a record (for example, whether it contains "Tempo de viagem total"), and keep concatenating the lines into a single record. Regex is not always the best solution.
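
A rough sketch of that line-by-line idea in Python. The function name split_records and the sample text are mine, just for illustration, and I am assuming that every record ends on a line containing "Tempo de viagem total:":

```python
def split_records(text, end_marker="Tempo de viagem total:"):
    """Group lines into records; a record ends on a line containing end_marker."""
    records = []
    current = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        current.append(line)
        if end_marker in line:
            records.append("\n".join(current))
            current = []
    # Lines after the last end marker (an incomplete record) are discarded.
    return records


# Shortened stand-in for the itinerary in the question.
text = (
    "ENTERAZUL Linhas Aereas Brasileiras - 4884\n"
    "ENTERExecutiva: 195kg\n"
    "ENTERTempo de viagem total:4h 05m\n"
    "ENTERAZUL Linhas Aereas Brasileiras - 4399\n"
    "ENTERExecutiva: 195kg\n"
    "ENTERTempo de viagem total:5h 10m\n"
)

records = split_records(text)
print(len(records))  # 2
print(records[0])    # first record, ending at "4h 05m"
```

No quantifiers are involved, so there is no greedy or lazy behavior to worry about.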

1

In this case, all you need to do is insert a question mark after the plus sign:

ENTER[\S\s]+?Tempo de viagem total:.*h .*m

This makes the regex engine evaluate the expression in lazy mode instead of greedy mode.

Then, if you want to get all the matches (2 in the text you shared), use the global flag in your expression. If you only want the first match, don’t use the global flag.
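
The global flag is how Regexr (JavaScript) handles this. For comparison only: Python has no global flag, and the rough counterparts are search (first match) and findall (all matches). A sketch with a shortened stand-in text:

```python
import re

# Shortened stand-in for the itinerary in the question.
text = (
    "ENTERAZUL Linhas Aereas Brasileiras - 4884\n"
    "ENTERTempo de viagem total:4h 05m\n"
    "ENTERAZUL Linhas Aereas Brasileiras - 4399\n"
    "ENTERTempo de viagem total:5h 10m\n"
)

pattern = re.compile(r"ENTER[\S\s]+?Tempo de viagem total:.*h .*m")

# First match only (like leaving the global flag off).
first = pattern.search(text).group()

# All non-overlapping matches (like turning the global flag on).
all_matches = pattern.findall(text)

print(first.endswith("4h 05m"))  # True
print(len(all_matches))          # 2
```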
