Help to parse data and turn it into a dictionary

0

I need help turning a log file into a key/value dictionary for later use, but I'm stuck on the code.

The data looks like this:

default: T (add header): [10.78/15.00] [SURBL_VERYBAD(5.00){test.abuse.dnsbl;},HTML_SHORT_LINK_IMG_1(2.00){},IP_SCORE(1.99){ip: (1.56), ipnet: 123.245.3.34/19(2.57), asn: 12876(2.45), country: FR(0.06);},HAS_INTERSPIRE_SIG(1.00){},MID_RHS_WWW(0.50){},MIME_HTML_ONLY(0.20){},BAD_REP_POLICIES(0.10){},HAS_LIST_UNSUB(-0.01){},ARC_NA(0.00){},ASN(0.00){asn:12876, ipnet:123.245.3.34/19, ipnet:123.245.3.34/19, country:FR;},DKIM_TRACE(0.00){test.com:+;},DMARC_POLICY_ALLOW(0.00){teste.com;none;},FROM_EQ_ENVFROM(0.00){},FROM_HAS_DN(0.00){},HAS_REPLYTO(0.00){[email protected];},MIME_TRACE(0.00){0:~;},PREVIOUSLY_DELIVERED(0.00){[email protected];},RCPT_COUNT_ONE(0.00){1;},RCVD_COUNT_TWO(0.00){2;},RCVD_TLS_LAST(0.00){},REPLYTO_ADDR_EQ_FROM(0.00){},R_DKIM_ALLOW(0.00){test.com:s=dkim;},R_SPF_ALLOW(0.00){+ptr;},TO_DN_NONE(0.00){},TO_MATCH_ENVRCPT_ALL(0.00){}])

I would like to get the following output:

{"default: T (add header)": "10.78/15.00",
"SURBL_VERYBAD": "5.00",
"HTML_SHORT_LINK_IMG_1": "2.00",
"IP_SCORE": "1.99",
"ip": "1.56",
"ipnet: 123.245.3.34/19": "2.57",
"asn": "2.45",
"country": "FR",
"HAS_INTERSPIRE_SIG": "1.00",
"MID_RHS_WWW": "0.50",
"MIME_HTML_ONLY": "0.20",
"BAD_REP_POLICIES": "0.10",
"HAS_LIST_UNSUB": "-0.01",
"ARC_NA": "0.00",
"DMARC_POLICY_ALLOW": "0.00",
"FROM_EQ_ENVFROM": "0.00",
"HAS_REPLYTO": "5.00"
}

Keep in mind that I will be using a file with several lines like this one, each to be turned into a dictionary.

This is what I have so far, adapted from a solution on Stack Overflow itself, but it is not what I need yet:

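#!/usr/bin/env python
# coding: utf-8

from itertools import tee

# Read only the first line of the file and split it on commas.
with open('new.txt', 'r') as arquivo:
    dados = arquivo.readline().split(',')


def pairwise(iterable):
    # Yield overlapping pairs: (s0, s1), (s1, s2), ...
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)


# Maps each comma-separated piece to the piece that precedes it.
name_map = {number: name for name, number in pairwise(dados)}


print(name_map)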

It gives me this output:

{" ' qid: <48FVCS2HX2zRhRN>'": "['<[email protected]>'", " ' ip: 123.83.149.223'": " ' qid: <48FVCS2HX2zRhRN>'", " ' from: <[email protected]>'": " ' ip: 123.83.149.223'", " ' (default: T (add header): [10.78/15.00] [SURBL_VERYBAD(5.00)'": " ' from: <[email protected]>'", " 'HTML_SHORT_LINK_IMG_1(2.00){}'": " ' (default: T (add header): [10.78/15.00] [SURBL_VERYBAD(5.00)'", " 'IP_SCORE(1.99){ip: (1.56)'": " 'HTML_SHORT_LINK_IMG_1(2.00){}'", " ' ipnet: 123.83.128.0/19(2.57)'": " 'IP_SCORE(1.99){ip: (1.56)'", " ' asn: 12876(2.45)'": " ' ipnet: 123.83.128.0/19(2.57)'", " ' country: FR(0.06);}'": " ' asn: 12876(2.45)'", " 'HAS_INTERSPIRE_SIG(1.00){}'": " ' country: FR(0.06);}'", " 'MID_RHS_WWW(0.50){}'": " 'HAS_INTERSPIRE_SIG(1.00){}'", " 'MIME_HTML_ONLY(0.20){}'": " 'MID_RHS_WWW(0.50){}'", " 'BAD_REP_POLICIES(0.10){}'": " 'MIME_HTML_ONLY(0.20){}'", " 'HAS_LIST_UNSUB(-0.01){}'": " 'BAD_REP_POLICIES(0.10){}'", " 'ARC_NA(0.00){}'": " 'HAS_LIST_UNSUB(-0.01){}'", " 'ASN(0.00){asn:12876'": " 'ARC_NA(0.00){}'", " ' ipnet:123.83.128.0/19'": " 'ASN(0.00){asn:12876'", " ' country:FR;}'": " ' ipnet:123.83.128.0/19'", " 'DKIM_TRACE(0.00){teste.net.br:+;}'": " ' country:FR;}'", " 'DMARC_POLICY_ALLOW(0.00){teste.net.br;none;}'": " 'DKIM_TRACE(0.00){teste.net.br:+;}'", " 'FROM_EQ_ENVFROM(0.00){}'": " 'DMARC_POLICY_ALLOW(0.00){teste.net.br;none;}'", " 'FROM_HAS_DN(0.00){}'": " 'FROM_EQ_ENVFROM(0.00){}'", " 'HAS_REPLYTO(0.00){[email protected];}'": " 'FROM_HAS_DN(0.00){}'", " 'MIME_TRACE(0.00){0:~;}'": " 'HAS_REPLYTO(0.00){[email protected];}'", " 'PREVIOUSLY_DELIVERED(0.00){[email protected];}'": " 'MIME_TRACE(0.00){0:~;}'", " 'RCPT_COUNT_ONE(0.00){1;}'": " 'PREVIOUSLY_DELIVERED(0.00){[email protected];}'", " 'RCVD_COUNT_TWO(0.00){2;}'": " 'RCPT_COUNT_ONE(0.00){1;}'", " 'RCVD_TLS_LAST(0.00){}'": " 'RCVD_COUNT_TWO(0.00){2;}'", " 'REPLYTO_ADDR_EQ_FROM(0.00){}'": " 'RCVD_TLS_LAST(0.00){}'", " 'R_DKIM_ALLOW(0.00){teste.net.br:s=dkim;}'": " 'REPLYTO_ADDR_EQ_FROM(0.00){}'", " 
'R_SPF_ALLOW(0.00){+ptr;}'": " 'R_DKIM_ALLOW(0.00){teste.net.br:s=dkim;}'", " 'TO_DN_NONE(0.00){}'": " 'R_SPF_ALLOW(0.00){+ptr;}'", " 'TO_MATCH_ENVRCPT_ALL(0.00){}])'": " 'TO_DN_NONE(0.00){}'", " ' len: 2351'": " 'TO_MATCH_ENVRCPT_ALL(0.00){}])'", " ' time: 235.999ms real'": " ' len: 2351'", " ' 44.667ms virtual'": " ' time: 235.999ms real'", " ' dns req: 39'": " ' 44.667ms virtual'", " ' digest: <accefae22f4a22bfc94217189668f964>'": " ' dns req: 39'", " ' rcpts: <[email protected]>'": " ' digest: <accefae22f4a22bfc94217189668f964>'", " ' mime_rcpts: <[email protected]>\\n": " ' rcpts: <[email protected]>'"}
  • Leandro, would it be possible to change the structure of this data in some way? As it stands, there is no pattern to follow.

  • It would be much easier if your data followed a set pattern or were stored as JSON. The only solution I can see is to parse each field manually, but that would make your code very large and is certainly not what you want.

  • @Jeanextreme002, unfortunately that is not possible, because the structure is defined by the application. I also see only this alternative, but it is not really what I wanted because it is not an elegant option. I'm racking my brain over this! Another option would be to turn the scores into a list and the tags into the columns of a CSV, which would also be viable.

2 answers

3


You can combine string operations with regular expressions to build a parser that extracts the information in the desired format. Here is an example:

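import re

def parse(s):
    ret = {}
    # The first [...] group is the total score, the second is the symbol list.
    val, data = re.findall(r'\[(.*?) *\]', s)
    # Everything before the first '[' (minus the trailing colon) is the first key.
    key = s.split('[', 1)[0].rsplit(':', 1)[0]
    ret[key] = val
    # Drop the {...} option blocks so only NAME(score) pairs remain.
    data = re.sub(r'\{.*?\}', '', data)
    # Capture each name together with its score so the pairs cannot get misaligned.
    ret.update(re.findall(r'([^,(]+?) *\(([^)]*)\)', data))
    return ret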

See it working on Repl.it
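Since the question mentions a file with several such lines, here is a minimal sketch of how a parser of this kind might be run over every line. The names `parse_line` and `parse_file` are hypothetical. The sketch assumes each relevant line has the same layout as the sample in the question: a label, a `[score/limit]` group, then a `[...]` group of symbols. Lines that do not fit that layout are skipped.

```python
import re

# Assumed layout: "<label>: [total/limit] [NAME(score){options},...]"
LINE_RE = re.compile(r'^(.*?):\s*\[([^\]]*)\]\s*\[(.*)\]')
# A symbol is a name followed by "(score)"; option blocks are removed first.
PAIR_RE = re.compile(r'([^,(]+?)\s*\(([^)]*)\)')


def parse_line(line):
    """Return a dict for one log line, or None if the line does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    label, total, symbols = match.groups()
    result = {label.strip(): total}
    # Remove the {...} option blocks so only NAME(score) pairs remain.
    symbols = re.sub(r'\{.*?\}', '', symbols)
    result.update(PAIR_RE.findall(symbols))
    return result


def parse_file(path):
    """Parse every matching line of the file at `path` into a list of dicts."""
    with open(path, encoding='utf-8') as handle:
        parsed = (parse_line(line) for line in handle)
        return [item for item in parsed if item is not None]
```

Calling `parse_file('new.txt')` (the file name used in the question) would then give one dictionary per log line, which can be processed further or written out as needed.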

  • Thank you so much, @Lacobus. I will try this approach and report back soon.

1

@Lacobus, thank you so much. Thanks to your help and @Jeanextreme002's, I arrived at the solution below, which suits my needs.

import re

entrada = """default: T (add header): [10.78/15.00] [SURBL_VERYBAD(5.00){test.abuse.dnsbl;},HTML_SHORT_LINK_IMG_1(2.00){},IP_SCORE(1.99){ip: (1.56), ipnet: 123.245.3.34/19(2.57), asn: 12876(2.45), country: FR(0.06);},HAS_INTERSPIRE_SIG(1.00){},MID_RHS_WWW(0.50){},MIME_HTML_ONLY(0.20){},BAD_REP_POLICIES(0.10){},HAS_LIST_UNSUB(-0.01){},ARC_NA(0.00){},ASN(0.00){asn:12876, ipnet:123.245.3.34/19, ipnet:123.245.3.34/19, country:FR;},DKIM_TRACE(0.00){test.com:+;},DMARC_POLICY_ALLOW(0.00){teste.com;none;},FROM_EQ_ENVFROM(0.00){},FROM_HAS_DN(0.00){},HAS_REPLYTO(0.00){[email protected];},MIME_TRACE(0.00){0:~;},PREVIOUSLY_DELIVERED(0.00){[email protected];},RCPT_COUNT_ONE(0.00){1;},RCVD_COUNT_TWO(0.00){2;},RCVD_TLS_LAST(0.00){},REPLYTO_ADDR_EQ_FROM(0.00){},R_DKIM_ALLOW(0.00){test.com:s=dkim;},R_SPF_ALLOW(0.00){+ptr;},TO_DN_NONE(0.00){},TO_MATCH_ENVRCPT_ALL(0.00){}])"""
# Capture everything between "] [" and the closing "])".
regexp = r'\] \[(.*)\]\)'


def gather_metrics(metrics):
    metrics_temp = {}
    for metric in metrics.split(','):
        metrica = metric.split('(')[0]
        # Symbol names are upper case; this skips the lower-case option
        # fragments (ip, ipnet, asn, country) that the comma split also yields.
        if metrica.isupper():
            score = metric.split('(')[1].split(')')[0]
            metrics_temp[metrica] = score
    return metrics_temp


all_sig = re.search(regexp, entrada).group(1)
metrics = gather_metrics(all_sig)
print(metrics)
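Splitting on commas also cuts through the `{...}` option blocks, which is why the `isupper()` check is needed. As an alternative sketch, one regular expression can pick out the pairs directly, assuming every top-level symbol is written as `NAME(score){` with an upper-case name and a decimal score. That holds for the sample line, but it is an assumption about the log format in general. The names `SYMBOL_RE` and `symbol_scores` are hypothetical, and the sample below is a shortened version of the line from the question.

```python
import re

# Assumption: top-level symbols look like NAME(score){ ... }, where NAME uses
# upper-case letters, digits and underscores. Requiring the trailing "{" keeps
# nested option entries such as "FR(0.06);" from matching.
SYMBOL_RE = re.compile(r'([A-Z][A-Z0-9_]*)\((-?\d+(?:\.\d+)?)\)\{')


def symbol_scores(line):
    """Map each top-level symbol in `line` to its score as a float."""
    return {name: float(score) for name, score in SYMBOL_RE.findall(line)}


sample = ('default: T (add header): [10.78/15.00] '
          '[SURBL_VERYBAD(5.00){test.abuse.dnsbl;},'
          'IP_SCORE(1.99){ip: (1.56), asn: 12876(2.45), country: FR(0.06);},'
          'HAS_LIST_UNSUB(-0.01){},ASN(0.00){asn:12876, country:FR;}])')
print(symbol_scores(sample))
```

The scores are converted to `float` here so they can be compared or summed later; dropping the `float()` call keeps them as strings, as in the output requested in the question.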
