Formatting a dictionary using regex, based on a large database


Let’s say I have the following sample from a more extensive database:

146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622
197.109.77.178 - kertzmann3129 [21/Jun/2019:15:45:25 -0700] "DELETE /virtual/solutions/target/web+services HTTP/2.0" 203 26554
156.127.178.177 - okuneva5222 [21/Jun/2019:15:45:27 -0700] "DELETE /interactive/transparent/niches/revolutionize HTTP/1.1" 416 14701
100.32.205.59 - ortiz8891 [21/Jun/2019:15:45:28 -0700] "PATCH /architectures HTTP/1.0" 204 6048
168.95.156.240 - stark2413 [21/Jun/2019:15:45:31 -0700] "GET /engage HTTP/2.0" 201 9645
71.172.239.195 - dooley1853 [21/Jun/2019:15:45:32 -0700] "PUT /cutting-edge HTTP/2.0" 406 24498
180.95.121.94 - mohr6893 [21/Jun/2019:15:45:34 -0700] "PATCH /extensible/reinvent HTTP/1.1" 201 27330
144.23.247.108 - auer7552 [21/Jun/2019:15:45:35 -0700] "POST /extensible/infrastructures/one-to-one/enterprise HTTP/1.1" 100 22921

I need to organize the information as follows, creating a dictionary:

example_dict = {"host":"146.204.224.152", 
            "user_name":"feest6811", 
            "time":"21/Jun/2019:15:45:24 -0700",
            "request":"POST /incentivize HTTP/1.1"}

This is my current progress:

import re

match = []

def logs():
    with open("assets/logdata.txt", "r") as file:
        logdata = file.read()
        match = {"host": re.findall(r"\d{1,3}.\d{1,3}.\d{1,3}", logdata)}
    return match

logs()

1 answer

First, if you just create a dictionary like this:

example_dict = {"host":"146.204.224.152", 
            "user_name":"feest6811", 
            "time":"21/Jun/2019:15:45:24 -0700",
            "request":"POST /incentivize HTTP/1.1"}

It will only hold a single record. If the idea is to get this structure for all records, then it is best to create a list of dictionaries (each one corresponding to a record).

And as I said in your other question, read() loads the entire contents of the file into memory, which may not be a good idea if the file is too large. After all, we will already be storing everything in a large list of dictionaries, which uses a lot of memory, so let’s at least save on reading the file: since each record is on its own line, reading the file line by line seems a better alternative to me.

If the format is fixed (IP - username [time] "request"), we can do it like this:

import re

r = re.compile(r'(\S+) - (\S+) \[([^]]+)\] "([^"]+)"')
registros = []
with open("logdata.txt") as arquivo:
    for linha in arquivo:  # for each line of the file
        dados = r.match(linha)
        if dados:  # if a match was found
            host, username, data, request = dados.groups()
            registros.append({'host': host, 'user_name': username, 'time': data, 'request': request})

The shortcut \S is any character that is not \s (which in turn corresponds to spaces, line breaks, among others), and the quantifier + indicates "one or more occurrences". Therefore, \S+ is one or more characters that are not whitespace, so I take everything up to the first space (then the regex has a space, the hyphen, and another space).

Then we have another occurrence of \S+, because I understand that the username cannot have spaces.

Next we have the brackets (which must be escaped with \), and within them we have [^]]+, which is "one or more characters that are not ]". This guarantees that I take everything between the brackets.

Then we have the quotes, and inside them we have [^"]+ (one or more characters that are not "), so I take everything that’s in quotes.

Each of these parts is in parentheses to form capture groups, so I can retrieve them later with the groups() method.
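
Just to make the groups concrete, here is a quick check against the first sample line from the question (a minimal sketch, using the same pattern compiled above):

import re

r = re.compile(r'(\S+) - (\S+) \[([^]]+)\] "([^"]+)"')
linha = '146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622'
print(r.match(linha).groups())
# ('146.204.224.152', 'feest6811', '21/Jun/2019:15:45:24 -0700',
#  'POST /incentivize HTTP/1.1')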

And I compile the regex only once, before the for loop, and reuse it inside the loop, instead of creating it again on every iteration. The documentation itself says:

"saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program"
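
If you want to measure this yourself, a rough comparison with timeit could look like the sketch below (the numbers are machine-dependent, and re keeps an internal cache of compiled patterns, so the gap is smaller than it would otherwise be):

import re
import timeit

linha = '146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622'
padrao = r'(\S+) - (\S+) \[([^]]+)\] "([^"]+)"'
compilado = re.compile(padrao)

# compiled once, reused on every call
print(timeit.timeit(lambda: compilado.match(linha), number=100_000))
# pattern passed as a string on every call (still cached internally by re)
print(timeit.timeit(lambda: re.match(padrao, linha), number=100_000))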

At the end, the list registros will contain several dictionaries, each corresponding to one line of the file, with the keys "host", "user_name", "time" and "request" and their respective values.
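
For example, running the loop above over the first two sample lines (using io.StringIO here only as a stand-in for the real file):

import io
import re

r = re.compile(r'(\S+) - (\S+) \[([^]]+)\] "([^"]+)"')
arquivo = io.StringIO(
    '146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622\n'
    '197.109.77.178 - kertzmann3129 [21/Jun/2019:15:45:25 -0700] "DELETE /virtual/solutions/target/web+services HTTP/2.0" 203 26554\n'
)
registros = []
for linha in arquivo:
    dados = r.match(linha)
    if dados:
        host, username, data, request = dados.groups()
        registros.append({'host': host, 'user_name': username, 'time': data, 'request': request})

print(registros[0])
# {'host': '146.204.224.152', 'user_name': 'feest6811',
#  'time': '21/Jun/2019:15:45:24 -0700', 'request': 'POST /incentivize HTTP/1.1'}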


This is a "lazy" way, because if I already "know" that the file has this specific format, I don’t need to validate the information. But if you want to be more specific and use a slightly more "guaranteed" format, you can use something like this:

r = re.compile(r'(\d{1,3}(?:\.\d{1,3}){3}) - ([a-zA-Z0-9]+) \[(\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2} [-+]\d{4})\] "((?:POST|DELETE|PUT|PATCH|GET) [^"]+)"')

For the IP, I used \d{1,3}(?:\.\d{1,3}){3}: 1 to 3 digits, followed by "a dot and 1 to 3 digits" (and this whole chunk repeats 3 times). But for these parentheses not to create another group, I had to use (?: to make it a non-capturing group (otherwise the corresponding chunk would also be returned by groups() and mess everything up).

For the username I used [a-zA-Z0-9]+ (one or more letters or digits). For the date I specified the exact number of digits and letters in each part, and for the request I listed the accepted methods, followed by one or more characters other than quotation marks.
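
As a quick sanity check, the stricter pattern accepts a well-formed line and rejects one with, say, an unknown method (the bad line below is made up for illustration):

import re

r = re.compile(r'(\d{1,3}(?:\.\d{1,3}){3}) - ([a-zA-Z0-9]+) '
               r'\[(\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2} [-+]\d{4})\] '
               r'"((?:POST|DELETE|PUT|PATCH|GET) [^"]+)"')

ok = '168.95.156.240 - stark2413 [21/Jun/2019:15:45:31 -0700] "GET /engage HTTP/2.0" 201 9645'
ruim = '168.95.156.240 - stark2413 [21/Jun/2019:15:45:31 -0700] "HEAD /engage HTTP/2.0" 201 9645'

print(r.match(ok).groups()[3])  # GET /engage HTTP/2.0
print(r.match(ruim))            # None, because HEAD is not among the accepted methods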

It might be an unnecessary complication if you don’t need to validate the file’s data (for example, if it was generated from a source that "guarantees" that the format and information are correct, you wouldn’t need to be so strict in the regex).

Even this regex is still "naive": validating an IP properly is a little more complicated than that, and for dates it’s even worse. In fact, if you really need to validate the data, I would validate each field separately, using the appropriate tool for each case (for IPs, it’s better not to use regex; for dates, ditto, etc.). But if it’s just to read the file and assemble the dictionaries, the first option already seems enough.
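
For instance, the standard library already covers both cases; a minimal sketch, using values taken from the first sample record:

import ipaddress
from datetime import datetime

# raises ValueError if it is not a valid IPv4/IPv6 address
ip = ipaddress.ip_address('146.204.224.152')

# raises ValueError if the date does not match the log format
# (note that %b is locale-dependent; it expects "Jun" in an English locale)
quando = datetime.strptime('21/Jun/2019:15:45:24 -0700', '%d/%b/%Y:%H:%M:%S %z')

print(ip, quando.isoformat())  # 146.204.224.152 2019-06-21T15:45:24-07:00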


Regex-free

But maybe you don’t even need regex. You can use the partition method, which splits the string into parts, to get each piece you need:

def extrair_registros(linha):
    host, _, resto = linha.partition(' - ')
    username, _, resto = resto.partition(' [')
    data, _, resto = resto.partition('] "')
    request, _, resto = resto.partition('"')
    return { 'host': host, 'user_name': username, 'time': data, 'request': request }

registros = []
with open("logdata.txt") as arquivo:
    for linha in arquivo:  # for each line of the file
        registros.append(extrair_registros(linha))

For example, when doing linha.partition(' - '), the return is a tuple containing 3 strings: the part before ' - ', the separator ' - ' itself, and the part that comes after. So when doing:

host, _, resto = linha.partition(' - ')

I take the IP and the rest of the line (the variable _ will contain the separator ' - ' itself; using _ is a Python convention to indicate that the variable will not be used).

Then I do another partition, using the separator ' [', so I get the username, and the rest of the string now starts at the date. And so I go on, each time using a different separator to get the piece I want.
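
Step by step, on the first sample line, each partition call returns a 3-tuple like this:

linha = '146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622'

print(linha.partition(' - '))
# ('146.204.224.152', ' - ', 'feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622')

resto = linha.partition(' - ')[2]
print(resto.partition(' ['))
# ('feest6811', ' [', '21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622')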

You could make the function extrair_registros more generic, receiving the separators and the respective fields that must be created:

def extrair_registros(linha, separadores):
    registro = {}
    resto = linha
    for sep, campo in separadores.items():
        registro[campo], _, resto = resto.partition(sep)
    return registro

separadores = {  # map each separator to its respective field
    ' - ': 'host',
    ' [': 'user_name',
    '] "': 'time',
    '"': 'request'
}
registros = []
with open("logdata.txt") as arquivo:
    for linha in arquivo:  # for each line of the file
        registros.append(extrair_registros(linha, separadores))
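
Note that this version relies on separadores keeping its insertion order, which regular dicts guarantee since Python 3.7 (before that you would use collections.OrderedDict). A quick check on one sample line, assuming the function and the separadores dict defined above:

linha = '100.32.205.59 - ortiz8891 [21/Jun/2019:15:45:28 -0700] "PATCH /architectures HTTP/1.0" 204 6048'
print(extrair_registros(linha, separadores))
# {'host': '100.32.205.59', 'user_name': 'ortiz8891',
#  'time': '21/Jun/2019:15:45:28 -0700', 'request': 'PATCH /architectures HTTP/1.0'}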

And to build the list, it is also possible to use a list comprehension:

def extrair_registros(linha, separadores):
    registro = {}
    resto = linha
    for sep, campo in separadores.items():
        registro[campo], _, resto = resto.partition(sep)
    return registro

separadores = {  # map each separator to its respective field
    ' - ': 'host',
    ' [': 'user_name',
    '] "': 'time',
    '"': 'request'
}
with open("logdata.txt") as arquivo:
    registros = [ extrair_registros(linha, separadores) for linha in arquivo ]

It is very clear now with this explanation, thank you again, friend!
