First, let's see what the documentation for str_extract_all says:
Extract All Pieces Of A String That Match A Pattern.
In other words, it extracts every part of a string that matches a pattern.
Well, the regex you used is just the letter f:
regex(pattern = 'f', ...
An important point to note is that this regex does not contain the dot metacharacter. It consists only of the letter f, which means that str_extract_all will return every letter f in the string. And since the ignore_case option is enabled, it matches both uppercase and lowercase letters. That's why your code returns F and f.
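As a minimal sketch of this first case: the snippet below assumes a sample value for text_1 (the original value is not shown in this answer, so the string is only a guess chosen to be consistent with the outputs quoted here) and extracts every f regardless of case.

```r
library(stringr)

# Assumed sample text; the original text_1 is not shown in this answer.
text_1 <- "Stack OverFlow. Sou um site de programação para\n entusiastas e profissionais"

# The pattern is only the letter f; ignore_case makes it match F as well.
matches <- str_extract_all(text_1, regex(pattern = 'f', ignore_case = TRUE))

# str_extract_all returns a list with one character vector per input string.
print(matches[[1]])
```

With this sample, the only letters matched are the F in "OverFlow" and the f in "profissionais"; the line break plays no role because the pattern has no dot.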
To see the dot in action, you could use something like:
str_extract_all(text_1,
                regex(pattern = 'f.*',
                      ignore_case = TRUE,
                      multiline = FALSE))
See this example running on Ideone.com.
Now the regex is f.* (the letter f followed by zero or more characters). The return is:
[1] "Flow. Sou um site de programação para"
[2] "fissionais"
Since the ignore_case option is still enabled, the regex matches both f and F. And .* matches zero or more characters (any character other than a line break).
The first match begins at the F and goes up to the line break (just after the word "para"). The second match begins at the f and goes to the end of the string (since there are no more line breaks).
Note that in both cases, the function str_extract_all traverses the entire string looking for chunks that match the regex.
The first regex is just the letter f, so it only searches for the letters f or F (since ignore_case is enabled). As it goes through the string, it makes no difference whether there are line breaks or not; it only cares whether there is an f.
In the second regex we have f.*, so it goes through the string looking for a letter f followed by .* (zero or more occurrences of any character). But by default the dot does not match line breaks, so the regex takes everything from the letter f (or F) up to the next line break. After finding a match, the regex keeps traversing the string to see whether there is another chunk with an f followed by zero or more characters (and it doesn't matter if it runs into a line break along the way; what matters is finding an f and then taking the characters that match .*).
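To illustrate that the line break only matters because of the dot, here is a small comparison. It is a sketch under the assumption that text_1 looks like the sample below (the original value is not shown in this answer), and text_no_break is a name made up for this example.

```r
library(stringr)

# Assumed sample text; the original text_1 is not shown in this answer.
text_1 <- "Stack OverFlow. Sou um site de programação para\n entusiastas e profissionais"

# Hypothetical variant with the line break replaced by a space.
text_no_break <- str_replace_all(text_1, "\n", " ")

pattern <- regex(pattern = 'f.*', ignore_case = TRUE)

with_break <- str_extract_all(text_1, pattern)[[1]]
without_break <- str_extract_all(text_no_break, pattern)[[1]]

# With the line break, the dot stops there and the search resumes afterwards.
print(length(with_break))
# Without it, nothing stops the dot before the end of the string.
print(length(without_break))
```

In the version with the line break there are two matches, and neither contains the line break itself. In the version without it, the first match already runs to the end, so there is nothing left for a second one.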
If you want, you can use the dotall option, which makes the dot match line breaks as well:
str_extract_all(text_1,
                regex(pattern = 'f.*',
                      ignore_case = TRUE,
                      dotall = TRUE))
See this example running on Ideone.com. The return is:
[1] "Flow. Sou um site de programação para\n entusiastas e profissionais"
Now the dot matches line breaks too. That means the regex f.* finds the first f (and since ignore_case is enabled, the first one found is the F), and then takes all the characters (including line breaks) up to the end of the string.
Just remember that the * quantifier is greedy and tries to match as many characters as possible. Since the dot now matches any character, including line breaks, it ends up going to the end of the string and taking everything.
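One way to see the greedy behavior is to compare it with the lazy form of the same quantifier (.*?), which matches as few characters as possible. This is only an aside to illustrate the point, again assuming the sample text_1 below (the original value is not shown in this answer).

```r
library(stringr)

# Assumed sample text; the original text_1 is not shown in this answer.
text_1 <- "Stack OverFlow. Sou um site de programação para\n entusiastas e profissionais"

# Greedy: .* takes as many characters as it can, line breaks included.
greedy <- str_extract_all(text_1,
                          regex(pattern = 'f.*',
                                ignore_case = TRUE,
                                dotall = TRUE))[[1]]

# Lazy: .*? takes as few characters as it can, which here is zero.
lazy <- str_extract_all(text_1,
                        regex(pattern = 'f.*?',
                              ignore_case = TRUE,
                              dotall = TRUE))[[1]]

print(greedy)
print(lazy)
```

With this sample, the greedy version produces a single match that runs to the end of the string, while the lazy version behaves like the plain f regex and yields only the letters F and f.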
Note that the multiline option makes no difference in this case, as already explained in the answer from Mark.
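A quick way to check this is to run the same regex with multiline on and off and compare the results. The sketch below also includes a pattern with the $ anchor, which is where multiline does change the outcome. It assumes the sample text_1 shown (the original value is not given in this answer), and the \\w+$ pattern is just an illustration, not part of the question.

```r
library(stringr)

# Assumed sample text; the original text_1 is not shown in this answer.
text_1 <- "Stack OverFlow. Sou um site de programação para\n entusiastas e profissionais"

# Same regex, only the multiline flag differs.
without_ml <- str_extract_all(text_1,
                              regex('f.*', ignore_case = TRUE, multiline = FALSE))[[1]]
with_ml <- str_extract_all(text_1,
                           regex('f.*', ignore_case = TRUE, multiline = TRUE))[[1]]
print(identical(without_ml, with_ml))

# Illustrative pattern with an anchor: the last word before each "end".
end_of_string <- str_extract_all(text_1, regex('\\w+$', multiline = FALSE))[[1]]
end_of_lines <- str_extract_all(text_1, regex('\\w+$', multiline = TRUE))[[1]]
print(end_of_string)
print(end_of_lines)
```

Since f.* has no ^ or $, both calls return the same matches. With \\w+$, the default setting only finds the last word of the whole string, while multiline = TRUE also finds the last word of the first line.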
And to answer the question in the title: the line break is a character like any other. What changes for a regex is what each element means under certain settings: ignore_case changes the meaning of f (it also matches F) and dotall changes the meaning of the dot (it starts matching line breaks).
multiline means that ^ and $ will match the start and end, respectively, of each line. Without multiline, they match the beginning and end, respectively, of the entire string (text), regardless of whether it has line breaks or not. I have no idea how line breaks work in [tag:r] regex. Let's wait for @hkotsubo to clarify this (you have to summon him). Haha – LipESprY