Complementing Andrei's answer, here are some alternatives (assuming the whole string is on a single line):
let str = '2019-03-07T02:48:17-03:00<br /><br />domain: 100anosdemusica.com.br<br />expires: 20190421<br /><br />country: BR';
let regex = /<br \/>([\w-]+):\s*([^<]+)/gi;
let match;
while (match = regex.exec(str)) { // match[1] is the field name, match[2] is the value
  console.log(`${match[1]} = ${match[2]}`);
}
The regex is very similar:
- it begins with <br /> (the slash must be escaped with \ so that it is not confused with the regex delimiter)
- [\w-]+ is "one or more occurrences of \w (letters, digits and the character _) or hyphen"
- :\s* is a colon followed by zero or more whitespace characters
- [^<]+: here is the difference. Instead of . (meaning "any character"), I used a negated character class. Everything between [^ and ] is negated, so it means "one or more occurrences of any character other than <". This ensures that I get all the characters up to the next tag
If this were more complex HTML, I would add more checks to the regex, or use a real HTML parser. But since I "know" that the string only contains <br /> (in fact I am assuming it has the format shown in the question, so there are no other tags), I can use [^<]+ without trouble.
Using .+ does not work because the quantifier + is greedy and tries to grab as many characters as possible. Since the dot matches any character, it ends up consuming the following <br /> tags as well. With [^<]+, on the other hand, the regex is guaranteed to stop when it reaches a tag. Since I am assuming there are no tags other than br, that is enough to make it work.
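To make the difference concrete, here is a small sketch comparing the two value patterns on a shortened string. The sample value example.com.br is made up for illustration; only the two regexes come from the explanation above.

```javascript
// Hypothetical shortened input with two fields on one line.
const sample = '<br />domain: example.com.br<br />expires: 20190421';

// Greedy dot: the value swallows the next <br /> and everything after it.
const greedy = /<br \/>([\w-]+):\s*(.+)/.exec(sample);
console.log(greedy[2]); // example.com.br<br />expires: 20190421

// Negated class: the value stops right before the next "<".
const negated = /<br \/>([\w-]+):\s*([^<]+)/.exec(sample);
console.log(negated[2]); // example.com.br
```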
In fact, the default behavior of the dot is to match any character except line breaks, which is why .+ works when the fields in the string are separated by line breaks (by the way, [^<]+ also works when there are line breaks).
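A quick sketch of that behavior, assuming a made-up two-line input. One thing it suggests is that [^<]+ keeps the line break as part of the value, so a trim() may be needed in that case.

```javascript
// Hypothetical input where each field is on its own line.
const multiline = '<br />domain: example.com.br\n<br />expires: 20190421';

// The dot stops at the line break, so each value ends at the end of its line.
const withDot = [...multiline.matchAll(/<br \/>([\w-]+):\s*(.+)/g)].map(m => m[2]);
console.log(JSON.stringify(withDot)); // ["example.com.br","20190421"]

// [^<] matches line breaks too, so the first value keeps its trailing "\n".
const withNegated = [...multiline.matchAll(/<br \/>([\w-]+):\s*([^<]+)/g)].map(m => m[2]);
console.log(JSON.stringify(withNegated)); // ["example.com.br\n","20190421"]
console.log(JSON.stringify(withNegated.map(v => v.trim()))); // ["example.com.br","20190421"]
```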
So the regex takes everything after a <br /> up to the next one.
One detail about [\w-]+: \w is a shortcut for "letters from A to Z (uppercase or lowercase), digits from 0 to 9, or the character _", so [\w-]+ also accepts strings such as ---_1.
Since the string seems to have a well-defined format, these odd cases may never occur. But if you want to be more specific, you can use something like [a-z]+(?:-[a-z]+)* (letters, followed by zero or more occurrences of "hyphen followed by letters"). This ensures that names such as domain and nic-hdl-br are accepted, but ---_1 is not.
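The two name patterns can be compared in isolation with a short sketch. The ^ and $ anchors and the i flag are added here only so that each candidate name is tested as a whole; they are not part of the expressions above.

```javascript
// Anchored copies of the two field-name patterns, for testing whole strings.
const looseName = /^[\w-]+$/;
const strictName = /^[a-z]+(?:-[a-z]+)*$/i;

const results = ['domain', 'nic-hdl-br', '---_1'].map(name => ({
  name,
  loose: looseName.test(name),
  strict: strictName.test(name),
}));

for (const r of results) {
  console.log(`${r.name}: loose=${r.loose}, strict=${r.strict}`);
}
// domain: loose=true, strict=true
// nic-hdl-br: loose=true, strict=true
// ---_1: loose=true, strict=false
```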
And because the regex uses the i flag (note the gi after the closing slash), it matches both uppercase and lowercase letters. You can remove the i if you only want lowercase, and use [a-z0-9] if you also want to allow digits.
Then it would look like this:
let str = '2019-03-07T02:48:17-03:00<br /><br />domain: 100anosdemusica.com.br<br />expires: 20190421<br /><br />country: BR';
let regex = /<br \/>([a-z]+(?:-[a-z]+)*):\s*([^<]+)/gi;
let match;
while (match = regex.exec(str)) { // match[1] is the field name, match[2] is the value
  console.log(`${match[1]} = ${match[2]}`);
}
Notice that I used (?:, which makes the parentheses a non-capturing group. If I had used only (, they would form another capturing group and shift the existing ones (match[2] would become match[3], since there would now be 3 pairs of capturing parentheses). Since this is a group I don't want to capture, I use the non-capturing syntax and the existing groups stay intact.
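The index shift can be seen in a small sketch. The input value ABC123 is a made-up placeholder; the two regexes differ only in (?: versus (.

```javascript
// Hypothetical single-field input.
const input = '<br />nic-hdl-br: ABC123';

// Non-capturing inner group: the value stays in group 2.
const nonCapturing = /<br \/>([a-z]+(?:-[a-z]+)*):\s*([^<]+)/i.exec(input);
console.log(nonCapturing[1]); // nic-hdl-br
console.log(nonCapturing[2]); // ABC123

// Capturing inner group: it becomes group 2 and pushes the value to group 3.
const capturing = /<br \/>([a-z]+(-[a-z]+)*):\s*([^<]+)/i.exec(input);
console.log(capturing[2]); // -br (the last repetition of the inner group)
console.log(capturing[3]); // ABC123
```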
split
Another alternative is to use split, for which I also offer a variation:
let str = '2019-03-07T02:48:17-03:00<br /><br />domain: 100anosdemusica.com.br<br />expires: 20190421<br /><br />country: BR';
str.split(/(?:<br \/>)+/i).forEach(element => {
  let v = element.split(/:\s*/); // v[0] is the field name, v[1] is the value
  if (v.length === 2) console.log(`${v[0]} = ${v[1]}`);
});
The difference from the other answer is that I used (?:<br \/>)+, which means one or more occurrences of <br />. I had to use (?: to make the parentheses a non-capturing group. If I had used plain parentheses instead, as in (<br \/>)+, the captured text would be included in the result (that is, the resulting array would also contain <br /> entries). I also used the i flag in case there is a tag such as <BR /> (if you are sure that br is always lowercase, you can remove the i).
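A minimal sketch of how split treats the two kinds of group, using a made-up short string:

```javascript
// Hypothetical input with one doubled and one single <br />.
const text = 'a<br /><br />b<br />c';

// Non-capturing group: separators are dropped from the result.
const clean = text.split(/(?:<br \/>)+/i);
console.log(clean); // [ 'a', 'b', 'c' ]

// Capturing group: the captured text is spliced into the result array.
const withSeparators = text.split(/(<br \/>)+/i);
console.log(withSeparators); // [ 'a', '<br />', 'b', '<br />', 'c' ]
```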
The split separates the string at each <br />. Then, for each element, I do another split on "colon followed by zero or more whitespace characters" to get the name and value of each field separately. I also test whether the result of this split is an array of length 2, because splitting the date (2019-03-07T02:48:17-03:00) yields an array of 4 elements; checking the array length discards most of these false positives.
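The length check can be illustrated with a short sketch. The third input, a value that itself contains a colon, is a hypothetical case not present in the question; it is here only to show that such a field would also be discarded by this check.

```javascript
const bySeparator = s => s.split(/:\s*/);

// The date has three colons, so it splits into 4 pieces and is discarded.
const dateParts = bySeparator('2019-03-07T02:48:17-03:00');
console.log(dateParts.length); // 4

// A regular field splits into exactly name and value.
const fieldParts = bySeparator('domain: 100anosdemusica.com.br');
console.log(fieldParts); // [ 'domain', '100anosdemusica.com.br' ]

// Hypothetical field whose value contains a colon: 3 pieces, also discarded.
const urlParts = bySeparator('site: http://example.com');
console.log(urlParts.length); // 3
```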
Of course, I could also use a regex similar to the previous one, but without the <br />, to extract the name and value:
let str = '2019-03-07T02:48:17-03:00<br /><br />domain: 100anosdemusica.com.br<br />expires: 20190421<br /><br />country: BR';
str.split(/(?:<br \/>)+/i).forEach(element => {
  let match = /([a-z]+(?:-[a-z]+)*):\s*(.+)/.exec(element);
  if (match) console.log(`${match[1]} = ${match[2]}`);
});
In that case I used the dot instead of [^<], since all occurrences of <br /> were removed by split (and I am assuming that the string has no other tags).
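If the values need to be stored rather than printed, the same idea could be wrapped in a function that returns an object. This is only a sketch: the name parseFields is hypothetical, the ^ anchor and the i flag are additions, and it assumes the single-line format from the question with no tags other than <br />.

```javascript
// Hypothetical helper: collects "name: value" pairs into a plain object.
function parseFields(text) {
  const fields = {};
  for (const part of text.split(/(?:<br \/>)+/i)) {
    // The anchor makes the name start at the beginning of the piece.
    const match = /^([a-z]+(?:-[a-z]+)*):\s*(.+)/i.exec(part);
    if (match) fields[match[1]] = match[2].trim();
  }
  return fields;
}

const whois = '2019-03-07T02:48:17-03:00<br /><br />domain: 100anosdemusica.com.br<br />expires: 20190421<br /><br />country: BR';
const parsed = parseFields(whois);
console.log(parsed);
// { domain: '100anosdemusica.com.br', expires: '20190421', country: 'BR' }
```

Note that if the same field name appears more than once, this version keeps only the last value; collecting the values in an array would be one way around that.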
Which language? What result do you expect to get? What will you do with this result?
– Isac
I am using JavaScript. I want to get the result of each line, for example: person: Ana Flavia Miziara, and I will feed a database with these values.
– Leticia Fatima
If you do a split on <br> you already get all the lines, but something tells me you're only interested in a few of them, and in a specific format
– Isac
That's it. However, I want to remove the <br /> and get the value that comes after it. For example, from the line <br />person: Ana Flavia Miziara, I only want person: Ana Flavia Miziara. I'm doing this in a spreadsheet: one cell contains the whois, and I will use JS to write a script that reads the whois from the cell, uses a regex to split it, and writes the value X to another cell.
– Leticia Fatima