Complementing Andrei's answer, here are some alternatives (assuming the whole string is on a single line):
let str = '2019-03-07T02:48:17-03:00<br /><br />domain: 100anosdemusica.com.br<br />expires: 20190421<br /><br />country: BR';
let regex = /<br \/>([\w-]+):\s*([^<]+)/gi;
let match;
while (match = regex.exec(str)) { // match[1] is the field name, match[2] is the value
  console.log(`${match[1]} = ${match[2]}`);
}
The regex is very similar to the one in that answer:
- it begins with <br /> (the slash must be escaped with \ so that it is not confused with the regex delimiter)
- [\w-]+ is "one or more occurrences of \w (letters, digits and the character _) or a hyphen"
- :\s* is a colon followed by zero or more whitespace characters
- [^<]+ is where the difference lies. Instead of . (meaning "any character"), I used a negated character class. Everything between [^ and ] is excluded, so it means "one or more occurrences of any character other than <". This ensures that I get all the characters up to the next tag
If this were more complex HTML I would add more checks to the regex, or even use an HTML parser, but since I "know" that the string only contains <br /> (I am actually assuming that it has the format shown in the question, so there are no other tags), I can use [^<]+ without trouble.
Using .+ does not work because the + quantifier is greedy and tries to grab as many characters as possible. And since the dot matches any character, it ends up consuming the following <br /> tags as well. With [^<]+, on the other hand, I guarantee that the regex stops when it reaches a tag, and since I am assuming that there are no tags other than br, that is enough to make it work.
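To make the difference concrete, here is a minimal sketch that runs both versions against a shortened copy of the string (the shortened string and the variable names are mine, just for illustration):

```javascript
const shortStr = '<br />domain: 100anosdemusica.com.br<br />expires: 20190421';

// Greedy dot: .+ runs to the end of the line and swallows the next <br />.
const greedy = /<br \/>([\w-]+):\s*(.+)/.exec(shortStr);
console.log(greedy[2]); // 100anosdemusica.com.br<br />expires: 20190421

// Negated class: [^<]+ stops right before the next tag.
const negated = /<br \/>([\w-]+):\s*([^<]+)/.exec(shortStr);
console.log(negated[2]); // 100anosdemusica.com.br
```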
In fact, the default behavior of the dot is to match any character except line breaks, which is why .+ works when there are line breaks in the string (by the way, [^<] also works when there are line breaks).
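A small sketch of that behavior, assuming a hypothetical variant of the string with a line break before the second tag. One thing to watch for: since [^<] also matches the line break itself, the captured value may end with it, so a trim() can be useful in that case.

```javascript
// Hypothetical variant of the string, with a line break before the second tag.
const multiline = '<br />domain: 100anosdemusica.com.br\n<br />expires: 20190421';

// The dot does not match \n, so .+ stops at the end of the line.
const dotMatch = /<br \/>([\w-]+):\s*(.+)/.exec(multiline);
console.log(JSON.stringify(dotMatch[2])); // "100anosdemusica.com.br"

// [^<]+ stops at the next tag, but the line break is part of the match.
const negatedMatch = /<br \/>([\w-]+):\s*([^<]+)/.exec(multiline);
console.log(JSON.stringify(negatedMatch[2])); // "100anosdemusica.com.br\n"
console.log(JSON.stringify(negatedMatch[2].trim())); // "100anosdemusica.com.br"
```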
So the regex takes everything after a <br />, up to the next one.
Just one detail about [\w-]+. The \w is a shortcut for "letters from A to Z (uppercase or lowercase), digits from 0 to 9 or the character _", so [\w-]+ accepts strings such as ---_1, for example.
Since the string has a format that seems to be well defined, these strange cases may never occur, but if you want to be more specific, you can use something like [a-z]+(?:-[a-z]+)* (letters, followed by zero or more occurrences of "hyphen followed by letters"). This ensures that names such as domain and nic-hdl-br are accepted, but ---_1 is not.
And since the regex uses the i flag (look after the closing slash, where you have gi), it matches both uppercase and lowercase letters (you can remove the i if you only want lowercase, for example, and use [a-z0-9] if you also want digits).
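A quick way to compare the two name patterns in isolation is sketched below. The ^ and $ anchors are my addition, only so that each name is tested as a whole; they are not part of the regex used in the answer.

```javascript
// Anchored copies of the two name patterns (anchors added for this test only).
const looseName = /^[\w-]+$/;
const strictName = /^[a-z]+(?:-[a-z]+)*$/i;

const results = ['domain', 'nic-hdl-br', '---_1'].map(name => ({
  name,
  loose: looseName.test(name),
  strict: strictName.test(name),
}));

for (const r of results) {
  console.log(`${r.name}: loose=${r.loose}, strict=${r.strict}`);
}
// domain: loose=true, strict=true
// nic-hdl-br: loose=true, strict=true
// ---_1: loose=true, strict=false
```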
Then it would look like this:
let str = '2019-03-07T02:48:17-03:00<br /><br />domain: 100anosdemusica.com.br<br />expires: 20190421<br /><br />country: BR';
let regex = /<br \/>([a-z]+(?:-[a-z]+)*):\s*([^<]+)/gi;
let match;
while (match = regex.exec(str)) { // match[1] is the field name, match[2] is the value
  console.log(`${match[1]} = ${match[2]}`);
}
Notice that I used (?:, which makes the parentheses a non-capturing group. If I had used only (, they would form another capturing group, interfering with the existing ones (match[2] would become match[3], since there would now be 3 pairs of capturing parentheses). Since this is a group I don't want to capture, I use the non-capturing syntax, and the existing groups remain intact.
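The shift in group numbers can be seen in the sketch below. The field value ABC123 is made up for illustration; only the group numbering matters here.

```javascript
// Hypothetical field, just to exercise the hyphenated-name part of the regex.
const field = '<br />nic-hdl-br: ABC123';

// Plain parentheses: the inner group becomes group 2 and pushes the value to group 3.
const capturing = /<br \/>([a-z]+(-[a-z]+)*):\s*([^<]+)/i.exec(field);
console.log(capturing[1]); // nic-hdl-br
console.log(capturing[2]); // -br (the last repetition of the inner group)
console.log(capturing[3]); // ABC123

// Non-capturing (?: ... ): the value stays in group 2.
const nonCapturing = /<br \/>([a-z]+(?:-[a-z]+)*):\s*([^<]+)/i.exec(field);
console.log(nonCapturing[1]); // nic-hdl-br
console.log(nonCapturing[2]); // ABC123
```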
split
Another alternative is to use split, as the other answer does; here is a variation of it:
let str = '2019-03-07T02:48:17-03:00<br /><br />domain: 100anosdemusica.com.br<br />expires: 20190421<br /><br />country: BR';
str.split(/(?:<br \/>)+/i).forEach(element => {
  let v = element.split(/:\s*/); // v[0] is the field name, v[1] is the value
  if (v.length == 2) console.log(`${v[0]} = ${v[1]}`);
});
The difference from the other answer is that I used (?:<br \/>)+: one or more occurrences of <br />. I had to use (?: to make the parentheses a non-capturing group. If I didn't do this (and used plain parentheses, as in (<br \/>)+), the captured groups would be included in the result (that is, the resulting array would also contain <br /> elements). I also used the i flag in case there is a <BR /> tag, for example (if you're sure the br is always lowercase, you can remove the i).
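A minimal sketch of that effect on a tiny made-up string (the string 'a<br /><br />b<br />c' is only an illustration):

```javascript
const sample = 'a<br /><br />b<br />c';

// Non-capturing group: only the pieces between the separators are returned.
const clean = sample.split(/(?:<br \/>)+/i);
console.log(clean); // [ 'a', 'b', 'c' ]

// Capturing group: the captured text is spliced into the resulting array.
const withTags = sample.split(/(<br \/>)+/i);
console.log(withTags); // [ 'a', '<br />', 'b', '<br />', 'c' ]
```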
The split separates the string at each <br />. Then, for each element, I do another split on "a colon followed by zero or more whitespace characters", to get the name and value of each field separately. I also check that the result of this split is an array of size 2, because splitting the date (2019-03-07T02:48:17-03:00) yields an array of 4 elements (by checking the size of the array I already discard most of these false positives).
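The size check can be illustrated as follows. The last case (a value that itself contains a colon) is a hypothetical input that does not appear in the question; it shows a limitation of this approach, since such a field would be discarded too.

```javascript
const bySeparator = /:\s*/;

// The date contains three colons, so it splits into 4 pieces and is discarded.
const dateParts = '2019-03-07T02:48:17-03:00'.split(bySeparator);
console.log(dateParts); // [ '2019-03-07T02', '48', '17-03', '00' ]

// A regular field splits into exactly 2 pieces: name and value.
const fieldParts = 'domain: 100anosdemusica.com.br'.split(bySeparator);
console.log(fieldParts); // [ 'domain', '100anosdemusica.com.br' ]

// Hypothetical field whose value contains a colon: 3 pieces, so it fails the length check.
const urlParts = 'url: http://example.com'.split(bySeparator);
console.log(urlParts); // [ 'url', 'http', '//example.com' ]
```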
Obviously I could also use a regex similar to the previous one, but without the <br />, to extract the name and value:
let str = '2019-03-07T02:48:17-03:00<br /><br />domain: 100anosdemusica.com.br<br />expires: 20190421<br /><br />country: BR';
str.split(/(?:<br \/>)+/i).forEach(element => {
  let match = /([a-z]+(?:-[a-z]+)*):\s*(.+)/.exec(element);
  if (match) console.log(`${match[1]} = ${match[2]}`);
});
In that case I used the dot instead of [^<], since all occurrences of <br /> were removed by the split (and I'm assuming that the string has no other tags).
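If the goal is to store the values somewhere rather than print them, the same idea can be wrapped in a function that returns an object. This is only a sketch: parseFields is a hypothetical name, and the ^ anchor and the i flag are my additions (so that the name must start the element and uppercase names are accepted). Note that a repeated field name would overwrite the previous value.

```javascript
// Hypothetical helper: collects "name: value" pairs into a plain object.
function parseFields(text) {
  const fields = {};
  for (const element of text.split(/(?:<br \/>)+/i)) {
    // ^ and the i flag are assumptions added for this sketch.
    const match = /^([a-z]+(?:-[a-z]+)*):\s*(.+)/i.exec(element);
    if (match) fields[match[1]] = match[2].trim();
  }
  return fields;
}

const whois = '2019-03-07T02:48:17-03:00<br /><br />domain: 100anosdemusica.com.br<br />expires: 20190421<br /><br />country: BR';
const parsed = parseFields(whois);
console.log(parsed);
// { domain: '100anosdemusica.com.br', expires: '20190421', country: 'BR' }
```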
Which language? What result do you expect to get? What will you do with this result?
– Isac
I am using JavaScript. I want to get the result of each line, for example: person: Ana Flavia Miziara, and I will feed a database with these values.
– Leticia Fatima
If you do a split on <br> you already get all the lines, but something tells me you're only interested in a few, and in a specific format
– Isac
That's right; however, I want to remove the <br /> and get the value that comes after it. For example: in the line <br />person: Ana Flavia Miziara, I only want the value person: Ana Flavia Miziara. I'm doing this through a spreadsheet. One cell contains the whois; I will use JS to write a script that reads the whois from the cell, uses a regex to split it, and puts the value X in another cell.
– Leticia Fatima