How to extract data from simple non-standard texts?

I would like to extract fields for a database from text files. However, the fields are laid out differently in each file, which makes it hard to obtain the values with common methods. For example:

file 1:

PROVA: 2º Corta Mato    

LOCAL:  Pinhal da Paz
ORGANIZAÇÃO:    
AAP 

ESTADO TEMPO: Bom   
DATA:   28-01-2007  

file 2:

PROVA: MEGA SPRINTER         LOCAL: E.B.I. DE ARRIFES
ASSOCIACAO: AASM/SDSM
TEMPO: Nublado c/ vento
DIA: 22 de Março de 2006

file 3:

AASM
ESTADO TEMPO: Nublado/Ventoso c/ alguma chuva
DATA: 19 de Novembro de 2005
1º Triatlo Técnico + P. de Preparação
C. D. DAS LARANJEIRAS

There are thousands of files, each with multiple fields, and each field can have one or more values per file, so extracting the data by hand is out of the question.

1 answer

For this purpose I created the package MassTextExtractor. To use it, install it with pip from the command line:

sudo pip install MassTextExtractor

An example of its use for the "prova" (event) and "local" (venue) fields of the sample files shown in the question would be:

from MassTextExtractor import TextsParser

# flag the lines that belong to the "prova" (event) field
file_dirs = ["./ficheiro_1.txt", "./ficheiro_2.txt", "./ficheiro_3.txt"]
flags = ["Triatlo", "PROVA:"]
prova = TextsParser(file_dirs, flags)

# clean up parts of the line
prova.switchers = [("PROVA:", "")]
prova.switch_texts_field_lines()

# split the line and keep one part
prova.breakers = [("LOCAL", 0)]
prova.break_texts_field_lines()


# flag the lines that belong to the "local" (venue) field
file_dirs = ["./ficheiro_1.txt", "./ficheiro_2.txt", "./ficheiro_3.txt"]
flags = ["LARANJEIRAS", "LOCAL:"]
local = TextsParser(file_dirs, flags)

# split the line and keep one part
local.breakers = [("LOCAL:", 1)]
local.break_texts_field_lines()

# clean up parts of the line
local.switchers = [("LOCAL:", "")]
local.switch_texts_field_lines()


# parentheses make these print calls valid in both Python 2 and Python 3
print(prova.return_texts_field_lines())
print(local.return_texts_field_lines())

It may seem overly fussy, but I believe it can be quite useful as a last resort for getting data out of large amounts of semi-structured text.
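
To make the three steps more concrete (flagging lines, breaking them, and switching text), here is a minimal plain-Python sketch that does not use the package. The function names `flag_lines`, `break_line`, `switch_text` and `extract_field` are hypothetical, and reading a breaker as "split at the marker and keep the part with the given index" is an assumption based on the example above, not a description of how MassTextExtractor works internally.

```python
def flag_lines(lines, flags):
    """Keep only the lines that contain at least one of the flag strings."""
    return [line for line in lines if any(flag in line for flag in flags)]


def break_line(line, breakers):
    """Split the line at each marker and keep one side of the split.

    Assumption: index 0 keeps the text before the marker and index 1 the
    text after it. Lines without the marker are left unchanged.
    """
    for marker, index in breakers:
        if marker in line:
            before, _, after = line.partition(marker)
            line = before if index == 0 else after
    return line


def switch_text(line, switchers):
    """Replace each (old, new) pair, then trim surrounding whitespace."""
    for old, new in switchers:
        line = line.replace(old, new)
    return line.strip()


def extract_field(lines, flags, breakers, switchers):
    """Flag the relevant lines, then break and clean each of them."""
    return [
        switch_text(break_line(line, breakers), switchers)
        for line in flag_lines(lines, flags)
    ]


# Contents of "file 2" from the question, one string per line.
file_2 = [
    "PROVA: MEGA SPRINTER         LOCAL: E.B.I. DE ARRIFES",
    "ASSOCIACAO: AASM/SDSM",
    "TEMPO: Nublado c/ vento",
    "DIA: 22 de Março de 2006",
]

prova = extract_field(file_2, ["Triatlo", "PROVA:"], [("LOCAL", 0)], [("PROVA:", "")])
local = extract_field(file_2, ["LARANJEIRAS", "LOCAL:"], [("LOCAL:", 1)], [("LOCAL:", "")])

print(prova)  # ['MEGA SPRINTER']
print(local)  # ['E.B.I. DE ARRIFES']
```

The same flags also pick up the unlabelled lines of file 3 ("1º Triatlo Técnico + P. de Preparação" and "C. D. DAS LARANJEIRAS"), which pass through unchanged because they contain neither marker. That is why the flag lists mix field labels with distinctive words from the values themselves.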
