Splitting a whois response using a regular expression

I have a response containing whois data.

I’d like to split it, taking the values after each <br /> up to the end of that whois line.

For example: from the line <br />person: Ana Flavia Miziara, I only want the value person: Ana Flavia Miziara. I’m doing this in a spreadsheet. One cell contains the whois; I will write a JS script that reads the whois from the cell, uses a regex to do the split, and writes each value to another cell.

Here is an example of the whois:

{"whois":
"<br />% Copyright (c) Nic.br<br />%  The use of the data below is only permitted as described in
<br />%  full by the terms of use at https://registro.br/termo/en.html ,
<br />%  being prohibited its distribution, commercialization or
<br />%  reproduction, in particular, to use it for advertising or
<br />%  any similar purpose.
<br />%  2019-03-07T02:48:17-03:00
<br />
<br />domain:      100anosdemusica.com.br
<br />owner:       BMGV Music Software Net Editora Ltda.
<br />ownerid:     66.587.684/0001-88
<br />responsible: Ana Flavia Miziara
<br />country:     BR
<br />owner-c:     AFM3
<br />admin-c:     AFM3
<br />tech-c:      AFM3
<br />billing-c:   AFM3
<br />nserver:     ns1.locaweb.com.br
<br />nsstat:      20190304 TIMEOUT
<br />nslastaa:    20190301
<br />nserver:     ns2.locaweb.com.br
<br />nsstat:      20190304 AA
<br />nslastaa:    20190304
<br />created:     20000421 #322333
<br />changed:     20180330
<br />expires:     20190421
<br />status:      published
<br />
<br />nic-hdl-br:  AFM3
<br />person:      Ana Flavia Miziara
<br />e-mail:      [email protected]
<br />country:     BR
<br />created:     19980128
<br />changed:     20031218
<br />
<br />% Security and mail abuse issues should also be addressed to
<br />% cert.br, http://www.cert.br/ , respectivelly to [email protected]
<br />% and [email protected]
<br />%
<br />% whois.registro.br accepts only direct match queries. Types
<br />% of queries are: domain (.br), registrant (tax ID), ticket,
<br />% provider, contact handle (ID), CIDR block, IP and ASN.
<br />","error":false}
  • Which language? What result do you expect to get? What will you do with this result?

  • I am using JavaScript. I want to get the content of each line, for example person: Ana Flavia Miziara, and I will feed a database with these values.

  • If you split on <br /> you already get all the lines, but something tells me you’re only interested in a few of them, and in a specific format.

  • The thing is, I want to remove the <br /> and get the value that comes after it. For example: from the line <br />person: Ana Flavia Miziara, I only want the value person: Ana Flavia Miziara. I’m doing this in a spreadsheet. One cell contains the whois; I will write a JS script that reads the whois from the cell, uses a regex to split it, and writes each value to another cell.

2 answers

If you only want the lines that contain a : (colon), such as:

person: Ana Flavia Miziara

You can capture them with this regex:

<br \/>([\w-]+:\s*.+)

What this regex means:

Beginning:

  • <br \/> matches the tag that starts each line

Group (the information you want will be here):

  • [\w-]+ one or more letters, digits, underscores or hyphens
  • : exactly one colon
  • \s* zero or more whitespace characters
  • .+ everything that comes after, up to the end of the line

And then, you can use it as follows:

var str = '{"whois":\n'+
'"<br />% Copyright (c) Nic.br<br />%  The use of the data below is only permitted as described in\n'+
'<br />%  full by the terms of use at https://registro.br/termo/en.html ,\n' +
'<br />%  being prohibited its distribution, commercialization or\n'+
'<br />%  reproduction, in particular, to use it for advertising or\n'+
'<br />%  any similar purpose.\n' +
'<br />%  2019-03-07T02:48:17-03:00\n' +
'<br />\n'+
'<br />domain:      100anosdemusica.com.br\n'+
'<br />owner:       BMGV Music Software Net Editora Ltda.\n'+
'<br />responsible: Ana Flavia Miziara\n'+
'<br />country:     BR\n'+
'<br />owner-c:     AFM3\n';

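var regex = /<br \/>([\w-]+:\s*.+)/gi;
// With the g flag, each exec() call resumes where the previous match ended.
var match = regex.exec(str);
while (match != null) {
  console.log(match[1]); // e.g. "domain:      100anosdemusica.com.br"
  match = regex.exec(str);
}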

If you need more information than that, splitting on <br /> is the best approach, as @Isac said.

As suggested by @hkotsubo:

If you need the name and the value separately, such as:

"person" - "Ana Flavia Miziara"

This regex...

<br \/>([\w-]+):\s*(.+)

... will separate the values automatically. You can use it as follows:

var regex = /<br \/>([\w-]+):\s*(.+)/gi;
var match = regex.exec(str);
while (match != null) {
  console.log(match[1]); // field name, e.g. "person"
  console.log(match[2]); // field value, e.g. "Ana Flavia Miziara"
  match = regex.exec(str);
}

Edit

The previous regexes do not work because, unlike in the question, the real string has no line breaks: it is a single line. That makes it harder to write a regex for it, since .+ consumes the entire rest of the line after the first occurrence of <br />.

The solution, going back to what @Isac commented, is to use split to break up the lines and recover the values:

var str = '2019-03-07T02:48:17-03:00<br /><br />domain:      100anosdemusica.com.br<br />expires:     20190421<br /><br />country:     BR';
var strArray = str.split("<br /><br />");
var elementsArray1 = strArray[1].split("<br />");
var elementsArray2 = strArray[2].split("<br />");
elementsArray1.forEach(function(element, index, array){
	console.log(element);
});
elementsArray2.forEach(function(element, index, array){
	console.log(element);
});

But if you still prefer a regex, I believe this one meets your needs:

<br \/>([\w-]+:\s*[\/\w\.\-\#\s]+)

The code then becomes:

var regex = /<br \/>([\w-]+:\s*[\/\w\.\-\#\s]+)/gi;
var match = regex.exec(str);
while(match != null){
    console.log(match[1]);
    match = regex.exec(str);
}

Note: the problem with this regex is that if a value contains any unexpected character, the match stops there and the rest of that line's information is lost.
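To make that limitation concrete, here is a minimal sketch. The e-mail address is a made-up placeholder (it is not taken from the whois above); it is there only because @ is not in the character class, so the value gets cut off at that point.

```javascript
// Placeholder input: the address below is invented for illustration only.
const str = '<br />e-mail:      ana@example.com<br />country:     BR';
const regex = /<br \/>([\w-]+:\s*[\/\w\.\-\#\s]+)/gi;

const results = [];
let match;
while ((match = regex.exec(str)) !== null) {
  results.push(match[1]);
}

// The first entry stops right before the "@"; the second line is still found.
console.log(results);
```

If values like this can appear in the data, a negated class such as [^<]+ or a split on <br /> avoids having to list every allowed character.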

  • {1} is not necessary: (anything){1} is the same as (anything). One suggestion is to switch to <br \/>([\w-]+):\s*(.+), so that group 1 (match[1]) is the name of the field (domain, owner, etc.) and group 2 (match[2]) is the corresponding value.

  • @hkotsubo thank you very much! I will edit and insert it as you suggested! As for the groups, from what I understand... she wants it all together! But I’ll add that version on the side in case she needs it.

  • I understood that the goal was to capture them separately :-)

  • @hkotsubo I may well have misunderstood... =) ... But thank you anyway!

  • @hkotsubo I knew you would show up! XD Lucky I got here first! Thanks for the +1

  • Haha, I was late :-) But now, rereading the question, I’m not sure whether they should be together or separate... Anyway, I still think it’s better to separate them, because that also eliminates the spaces after the :

  • @hkotsubo agreed! But it’s one more option!

  • Hello Andrei, thank you for your reply! However, when I try to reproduce it, the regex only handles the first <br />

  • @Leticiafatima I made an edit. Put the whole whois string in the variable str.

  • @Leticiafatima I made another edit.

  • I couldn’t resist and ended up posting an answer :-)

To complement Andrei's answer, here are some alternatives (for the case where the whole string is on a single line):

let str = '2019-03-07T02:48:17-03:00<br /><br />domain:      100anosdemusica.com.br<br />expires:     20190421<br /><br />country:     BR';

let regex = /<br \/>([\w-]+):\s*([^<]+)/gi;
let match;
while (match = regex.exec(str)) { // match[1] is the field name, match[2] is the value
  console.log(`${match[1]} = ${match[2]}`);
}

The regex is very similar:

  • it begins with <br /> (the slash must be escaped with \ so that it is not confused with the regex delimiter)
  • [\w-]+ is "one or more occurrences of \w (letters, digits and the character _) or a hyphen"
  • :\s* is a colon followed by zero or more whitespace characters
  • [^<]+ is where the difference lies. Instead of . (meaning "any character"), I used a negated character class. Everything between [^ and ] is excluded, so it means "one or more occurrences of any character other than <". This ensures that I get all the characters up to the next tag

If this were more complex HTML, I would add more checks to the regex, or simply use an HTML parser. But since I "know" that the string only contains <br /> (I am actually assuming that it has the format shown in the question, so there is no other tag), I can use [^<]+ without any trouble.

Using .+ does not work because the + quantifier is greedy and tries to grab as many characters as possible. And since the dot matches any character, it ends up consuming the other <br /> tags as well. With [^<]+, on the other hand, the regex is guaranteed to stop when it reaches a tag, and since I am assuming there are no tags other than br, that is enough to make it work.

In fact, by default the dot matches any character except line breaks, which is why .+ works when the string contains line breaks (incidentally, [^<] also works when there are line breaks).

So the regex takes everything after a <br /> until it finds the next one.
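The difference between the greedy .+ and the negated class [^<]+ can be checked side by side on the single-line string. This is only an illustrative sketch; the helper name collect is made up for the example.

```javascript
const str = '2019-03-07T02:48:17-03:00<br /><br />domain:      100anosdemusica.com.br<br />expires:     20190421<br /><br />country:     BR';

// Hypothetical helper: runs a global regex and collects "name = value" strings.
function collect(regex, input) {
  const found = [];
  let match;
  while ((match = regex.exec(input)) !== null) {
    found.push(`${match[1]} = ${match[2]}`);
  }
  return found;
}

// Greedy dot: the first value swallows every following tag and field.
const greedy = collect(/<br \/>([\w-]+):\s*(.+)/gi, str);
// Negated class: each value stops at the next "<".
const negated = collect(/<br \/>([\w-]+):\s*([^<]+)/gi, str);

console.log(greedy.length, greedy);
console.log(negated.length, negated);
```

With the greedy version there is a single match whose value still contains the remaining <br /> tags; with [^<]+ there is one match per field.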

Just one detail about [\w-]+. The \w is a shortcut for "letters from A to Z (uppercase or lowercase), digits from 0 to 9 or the character _", so [\w-]+ also accepts strings such as ---_1, for example.

Since the string has a format that seems to be well defined, these strange situations may never occur, but if you want to be more specific, you can use something like [a-z]+(?:-[a-z]+)* (letters, followed by zero or more occurrences of "hyphen followed by letters"). This ensures that names such as domain and nic-hdl-br are accepted, but ---_1 is not.

And since the regex uses the i flag (note the gi after the closing slash), it matches both uppercase and lowercase letters (you can remove the i if you only want lowercase letters, for example, and use [a-z0-9] if you also want digits).
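As a quick check of the two field-name patterns, the sketch below tests them in isolation. The ^ and $ anchors are added here only so that each pattern has to match the whole name; they are not part of the regexes used in the answer.

```javascript
// Anchored copies of the two name patterns (anchors added for this test only).
const loose = /^[\w-]+$/;
const strict = /^[a-z]+(?:-[a-z]+)*$/;

const checks = ['domain', 'nic-hdl-br', '---_1'].map(name => ({
  name,
  loose: loose.test(name),
  strict: strict.test(name),
}));

// Both patterns accept the real field names; only the loose one accepts "---_1".
console.log(checks);
```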

Then it would look like this:

let str = '2019-03-07T02:48:17-03:00<br /><br />domain:      100anosdemusica.com.br<br />expires:     20190421<br /><br />country:     BR';

let regex = /<br \/>([a-z]+(?:-[a-z]+)*):\s*([^<]+)/gi;
let match;
while (match = regex.exec(str)) { // match[1] is the field name, match[2] is the value
  console.log(`${match[1]} = ${match[2]}`);
}

Notice that I used (?:, which makes the parentheses a non-capturing group. If I had used only (, they would form another capturing group and interfere with the existing ones (match[2] would become match[3], since there would now be 3 pairs of capturing parentheses). Since this is a group I don’t want to capture, I use the non-capturing syntax, and the existing groups remain intact.
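The effect on the group numbering can be seen by running both variants against one line. This is a small sketch using a single field from the question as input.

```javascript
const line = '<br />nic-hdl-br:  AFM3';

// Plain parentheses around -[a-z]+ create an extra capturing group.
const capturing = /<br \/>([a-z]+(-[a-z]+)*):\s*([^<]+)/i.exec(line);
// (?: ... ) groups without capturing, so the numbering is unchanged.
const nonCapturing = /<br \/>([a-z]+(?:-[a-z]+)*):\s*([^<]+)/i.exec(line);

console.log(capturing.slice(1));    // three groups: the value is now at index 3
console.log(nonCapturing.slice(1)); // two groups: name at 1, value at 2
```

In the capturing variant, the extra group holds only the last repetition of the "hyphen followed by letters" part, which is rarely useful on its own.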


split

Another option is to use split, for which I also offer a variation:

let str = '2019-03-07T02:48:17-03:00<br /><br />domain:      100anosdemusica.com.br<br />expires:     20190421<br /><br />country:     BR';
str.split(/(?:<br \/>)+/i).forEach(element => {
    let v = element.split(/:\s*/); // v[0] is the field name, v[1] is the value
    if (v.length === 2) console.log(`${v[0]} = ${v[1]}`);
});

The difference from the other answer is that I used (?:<br \/>)+, that is, one or more occurrences of <br />. I had to use (?: to make the parentheses a non-capturing group. If I didn’t do this (and used plain parentheses, as in (<br \/>)+), the captured groups would be included in the result (that is, the resulting array would also contain <br />). I also used the i flag in case there is a <BR /> tag, for example (if you’re sure the tag is always a lowercase br, you can drop the i).

The split separates the string at each <br />. Then, for each element, I do another split on "a colon followed by zero or more whitespace characters" to get the name and value of each field separately. I also test whether the result of this second split is an array of length 2, because splitting the date (2019-03-07T02:48:17-03:00) produces an array of 4 elements (checking the array length discards most of these false positives).
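Both points can be verified in isolation. The sketch below uses a shortened string, with the placeholder letters a and b standing in for real lines.

```javascript
// 1) Capturing vs. non-capturing group in the separator regex.
const sample = 'a<br /><br />b';
const withCapture = sample.split(/(<br \/>)+/i);      // captured text leaks into the result
const withoutCapture = sample.split(/(?:<br \/>)+/i); // only the pieces between tags

console.log(withCapture);
console.log(withoutCapture);

// 2) Why the length check filters out the date line.
const dateParts = '2019-03-07T02:48:17-03:00'.split(/:\s*/);
const fieldParts = 'domain:      100anosdemusica.com.br'.split(/:\s*/);

console.log(dateParts.length, dateParts);
console.log(fieldParts.length, fieldParts);
```

Note that the same length check would also discard a real field whose value itself contains a colon, which is one reason to prefer the regex-based extraction when values are less predictable.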

Obviously I could also use a regex similar to the previous one, but without the <br />, to extract the name and value:

let str = '2019-03-07T02:48:17-03:00<br /><br />domain:      100anosdemusica.com.br<br />expires:     20190421<br /><br />country:     BR';
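str.split(/(?:<br \/>)+/i).forEach(element => {
    // ^ anchors the field name at the start of the element, so text in the
    // middle of a line (such as the "https:" of a URL) is not taken as a field
    let match = /^([a-z]+(?:-[a-z]+)*):\s*(.+)/.exec(element);
    if (match) console.log(`${match[1]} = ${match[2]}`);
});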

In this case I used the dot (.) instead of [^<], since all occurrences of <br /> were removed by the split (and I’m assuming that the string has no other tags).
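Since the goal stated in the question is to feed a database with the values, one possible way to package the regex approach is a function that returns name/value pairs. This is only a sketch: the function name parseWhoisFields and the shape of the returned objects are assumptions, not something from the question. An array is used instead of a plain object because some fields (such as nserver) appear more than once.

```javascript
// Hypothetical helper: returns [{ name, value }, ...] for each "name: value" field.
function parseWhoisFields(whois) {
  const regex = /<br \/>([a-z]+(?:-[a-z]+)*):\s*([^<]+)/gi;
  const fields = [];
  let match;
  while ((match = regex.exec(whois)) !== null) {
    // trim() removes any whitespace or line break left before the next tag
    fields.push({ name: match[1], value: match[2].trim() });
  }
  return fields;
}

const whois = '<br />nserver:     ns1.locaweb.com.br<br />nsstat:      20190304 TIMEOUT<br />nserver:     ns2.locaweb.com.br<br />nsstat:      20190304 AA';
const fields = parseWhoisFields(whois);
console.log(fields);
```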

  • I knew you wouldn’t be able to resist... XD Great answer!
