What is the "line break" in a Regex?

Viewed 4,338 times

18

The language I use is R.

And, as the theory of Regular Expressions suggests, each language deals differently with line breaks (\n).

Consider the following string:

text_1 <- c('Olá, meu nome é Stack OverFlow. Sou um site de programação para
        entusiastas e profissionais')

text_1
[1] "Olá, meu nome é Stack OverFlow. Sou um site de programação para\n            entusiastas e profissionais"

Now consider the dot metacharacter (.). By definition, the dot matches any character, including a literal dot, with the exception of line breaks.

But the following call searches the whole text, including what comes after the line break (\n):

library(stringr)

str_extract_all(text_1, 
                regex(pattern = 'f', 
                ignore_case = TRUE, 
                multiline = FALSE))

[[1]]
[1] "F" "f"

I don't understand the concept of a "line break". From what I read, the dot (.) should match nothing after the break (\n). Yet even with the multiline argument set to FALSE, the function matched the f that comes after the break (\n), in the word profissionais.

My questions:

  • What exactly is a line break (\n)?
  • How does it work in the R language?

2

    multiline means that ^ and $ match the beginning and the end, respectively, of each line. Without multiline, they match the beginning and the end, respectively, of the entire string (text), regardless of whether it has line breaks or not. I have no idea how line breaks work in R regex. Let's wait for @hkotsubo to clarify this (you have to summon him). Hahaha

3 answers

20


First, let's see what the documentation of the str_extract_all function says:

Extract all pieces of a string that match a pattern.

Well, the regex you used matches only the letter f:

regex(pattern = 'f', ...

An important point to note is that this regex does not contain the dot metacharacter. It contains only the letter f, which means that str_extract_all will return every letter f in the string. And since the ignore_case option is enabled, it returns both uppercase and lowercase letters. That's why your code returns F and f.


To see the dot "in action", you could use something like:

str_extract_all(text_1, 
                regex(pattern = 'f.*', 
                ignore_case = TRUE, 
                multiline = FALSE))

See this example running on Ideone.com. Now the regex is f.* (the letter f followed by zero or more characters). The result is:

[1] "Flow. Sou um site de programação para"
[2] "fissionais" 

Since the ignore_case option is still enabled, the regex matches both f and F. And the .* matches zero or more characters (any character other than a line break).

The first match begins at the F and goes up to the line break (just after the word "para"). The second match begins at the f and goes to the end of the string (since there are no more line breaks).


Note that in both cases the str_extract_all function traverses the entire string looking for chunks that match the regex.

The first regex is just the letter f, so it only searches for the letters f or F (since ignore_case is enabled). As it goes through the string, whether there are line breaks or not, all it cares about is whether it finds an f.

In the second regex we have f.*, so it goes through the string looking for a letter f followed by .* (zero or more occurrences of any character). But the dot does not match line breaks, so each match runs from the letter f (or F) only up to the next line break. After finding a match, the regex keeps traversing the string to see whether there is another chunk with an f followed by zero or more characters (and it doesn't matter if it runs into a line break along the way; what matters is finding an f and then taking the characters that match .*).
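
The same behavior can be checked on a smaller string. This is only a sketch: the string x is hypothetical, and it assumes the stringr package is installed.

```r
library(stringr)  # assumes the stringr package is installed

# Hypothetical two-line string, used only for illustration
x <- "abc\ndef"

# The dot stops at the line break, so the match ends at "abc"
until_break <- str_extract(x, "a.*")

# "c.d" would need the dot to consume "\n", so there is no match
crosses_break <- str_detect(x, "c.d")

# The search itself still continues past the line break
after_break <- str_extract_all(x, "[cd]")[[1]]

print(until_break)    # "abc"
print(crosses_break)  # FALSE
print(after_break)    # "c" "d"
```

The line break limits what the dot can consume inside one match, not how far the function scans for matches.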


If you want, you can use the dotall option, which makes the dot match line breaks as well:

str_extract_all(text_1, 
                regex(pattern = 'f.*', 
                ignore_case = TRUE, 
                dotall = TRUE))

See this example running on Ideone.com. The result is:

[1] "Flow. Sou um site de programação para\n        entusiastas e profissionais"

Now the dot also matches line breaks. That means the regex f.* takes the first f (and since ignore_case is enabled, the first one found is the F), and then takes all the characters (including line breaks) up to the end of the string.

Just remember that the * quantifier is greedy and tries to take as many characters as possible. Since the dot now matches any character, including line breaks, it ends up going to the end of the string and taking everything.
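
To make the greedy behavior visible, it can be contrasted with the lazy form *?, which the text above does not cover and is shown here only for comparison. A minimal sketch, assuming stringr is installed and using a hypothetical string x:

```r
library(stringr)  # assumes the stringr package is installed

# Hypothetical three-line string, each line ending in a literal dot
x <- "fa.\nfb.\nfc."

# Greedy: with dotall, .* runs to the end and backtracks to the LAST literal dot
greedy <- str_extract(x, regex("f.*\\.", dotall = TRUE))

# Lazy: .*? stops at the FIRST literal dot, so each line yields its own match
lazy <- str_extract_all(x, regex("f.*?\\.", dotall = TRUE))[[1]]

print(greedy)  # "fa.\nfb.\nfc."
print(lazy)    # "fa." "fb." "fc."
```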


Note that the multiline option makes no difference in this case, as already explained in Mark's answer.

And to answer the question in the title: the line break is a character like any other. What changes for a regex is what certain things mean under certain settings: ignore_case changes the meaning of f (it also matches F) and dotall changes the meaning of the dot (it starts to match line breaks).
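
As a quick illustration that the line break can be counted, split on, and replaced like any other character, here is a small sketch (the string x is hypothetical; stringr is assumed to be installed):

```r
library(stringr)  # assumes the stringr package is installed

# Hypothetical two-line string, used only for illustration
x <- "abc\ndef"

# The line break can be matched directly, just like a letter
n_breaks <- str_count(x, "\n")

# It can be used as a separator
parts <- str_split(x, "\n")[[1]]

# And it can be replaced by something else
joined <- str_replace_all(x, "\n", " ")

print(n_breaks)  # 1
print(parts)     # "abc" "def"
print(joined)    # "abc def"
```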

9

even with the multiline argument set to FALSE, it matched the f after the break (\n), in the word profissionais.

The multiline argument simply changes the behavior of ^ and $ in a regular expression, so that the pattern between them must start at the beginning and end at the end of each line. By default, they anchor the pattern to the beginning and the end of the whole string.

So in your example the function will still search for matches of the pattern f throughout the string, in the same way.

Ex. 1: With multiline = FALSE, if we tried to capture the pattern ^.*$, for example, we wouldn't get any match. That's because there is a \n character between the beginning and the end of the string, and .* cannot match it, so the pattern cannot stretch from the start to the end of the string.

Ex. 2: With multiline = TRUE, the pattern ^.*$ now looks for .* starting at the beginning and ending at the end of each line (note that this is different from "look for this pattern somewhere between the beginning and the end of each line"). In that case, we get two matches: Olá, meu nome é Stack OverFlow. Sou um site de programação para and entusiastas e profissionais (the second one preceded by the spaces that indent that line).
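
The two examples can be reproduced with a shorter string. This is a sketch under assumptions: the string x is hypothetical and stringr is assumed to be installed. The dotall case is added only for comparison.

```r
library(stringr)  # assumes the stringr package is installed

# Hypothetical two-line string, used only for illustration
x <- "first line\nsecond line"

# Ex. 1: ^ and $ anchor to the whole string and the dot cannot cross "\n"
no_multiline <- str_extract_all(x, regex("^.*$", multiline = FALSE))[[1]]

# Ex. 2: ^ and $ anchor to each line, so every line is a match
with_multiline <- str_extract_all(x, regex("^.*$", multiline = TRUE))[[1]]

# For comparison: with dotall the dot crosses "\n" and the whole string matches
with_dotall <- str_extract(x, regex("^.*$", dotall = TRUE))

print(no_multiline)    # character(0)
print(with_multiline)  # "first line" "second line"
print(with_dotall)     # "first line\nsecond line"
```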

What exactly is a line break (\n)? How does it work in the R language?

It is a special character that represents the end of a line.

Taking your context as an example, with multiline = TRUE, the regular expression "knows" that it has reached the end of a line when it "bumps into" the \n character.

The behavior of the \n character does not vary between programming languages or between regular expression engines. It is a character, just like a or b. In fact, its code is 10 in decimal (0x0A) in the ASCII table.
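
The code of the character can be checked in R itself. A short sketch using only base R functions:

```r
# "\n" is a single character whose code point is 10
code_point <- utf8ToInt("\n")
raw_byte   <- charToRaw("\n")
n_chars    <- nchar("\n")

# Going the other way: code 10 gives back the line break
round_trip <- identical(intToUtf8(10), "\n")

print(code_point)  # 10
print(raw_byte)    # 0a
print(n_chars)     # 1
print(round_trip)  # TRUE
```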

I also recommend a cool site where you can experiment and learn more about regex: RegExr.

-3

I don’t know much about Regexp, but I think I could try something like this:

"(.|\s)*" or \n
