Identify a numerical sequence in a text file

Asked 7 years, 9 months ago

I’m new to Python and I’m stuck on a problem I can’t find a solution to. I have a folder with about 10k .txt files (written in many different ways). I need to extract the FIRST sequence of 17 digits found in the first lines of each file and rename the file with the extracted sequence.

This sequence sometimes appears concatenated and sometimes separated by dots and a hyphen (e.g. 00273200844202003, 00588.2007.011.02.00-9). PS: the text contains other numerical sequences, some with 17 digits and some with a different length, but the one I need is always the first 17-digit sequence that appears.

I stored the current document names in a list and tried to find the sequence of numbers in the text using the NLTK package, but without success.

import os

pasta_de_documentos = (r'''C:\Users\mateus.ferreira\Desktop\Estudos\Python\Doc_Classifier\TXT''')
documentos = os.listdir(pasta_de_documentos)

If anyone knows a better approach or can point me to a way to keep attacking the problem, I would appreciate it. (I’m using Python 3.)

When the number is separated by dots and hyphens, are these characters always in the same positions, or can they vary?

– Woss

2017/11/16 at 11:27
@Andersoncarloswoss from what I checked by hand, the characters always appear in the same positions when they are present.

– stacker

2017/11/16 at 11:43
And why does the number have 20 digits when there are separator characters? Shouldn’t it always be 17?

– Woss

2017/11/16 at 12:10
I just noticed that I copied the wrong sequence of numbers into my example; it is corrected now, thank you. The correct example is 00588.2007.011.02.00-9

– stacker

2017/11/16 at 12:25

3 answers

One solution is to search for the value with a regular expression. To cover both formats, you can make the dots and the hyphen between the digit groups optional. It would look something like:

r'(\d{5}\.?\d{4}\.?\d{3}\.?\d{2}\.?\d{2}\-?\d)'

The r prefix marks the string as raw. The parentheses create a capture group in the regular expression, and the group is defined as:

Sequence of 5 digits;
Optionally followed by a dot;
Sequence of 4 digits;
Optionally followed by a dot;
Sequence of 3 digits;
Optionally followed by a dot;
Sequence of 2 digits;
Optionally followed by a dot;
Sequence of 2 digits;
Optionally followed by a hyphen;
Sequence of 1 digit;
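
The breakdown above can be checked with a small sketch that applies the same pattern to the two sample values from the question. The names PATTERN and digits_only are hypothetical, introduced only for this illustration; stripping the separators afterwards is an assumption about how the file name should look.

```python
import re

# Same pattern as above: the separators between digit groups are optional.
PATTERN = re.compile(r'(\d{5}\.?\d{4}\.?\d{3}\.?\d{2}\.?\d{2}\-?\d)')

def digits_only(text):
    """Return the first match without dots and hyphens, or None (hypothetical helper)."""
    match = PATTERN.search(text)
    if match is None:
        return None
    # Assumption: the separators are not wanted in the final value.
    return re.sub(r'[.\-]', '', match.group(0))

print(digits_only('Processo 00588.2007.011.02.00-9 de 2007'))  # 00588200701102009
print(digits_only('Ref 00273200844202003 fim'))                # 00273200844202003
```

Both formats end up as the same kind of 17-digit string, which is convenient if the value is going to be used as a file name.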

In Python, you can use the re module to apply the regular expression to the contents of the file:

import re

with open('data.txt') as content:
    search = re.search(r'(\d{5}\.?\d{4}\.?\d{3}\.?\d{2}\.?\d{2}\-?\d)', content.read())
    if search is not None:
        print(search.group(0))

Thus the value of search.group(0) will be the first 17-digit value, with or without separators, found in the file data.txt. If you have multiple files, just go through all of them and run the same logic. Take the opportunity to read about the glob module as well; it may be useful to you.
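
As a rough sketch of "go through all of them and run the same logic", the function below combines the pattern with glob and os.rename. It is not part of the original answer: the name rename_by_sequence, the UTF-8 decoding with errors='replace', and the choice to skip a file when the target name already exists are all assumptions made for the illustration.

```python
import glob
import os
import re

PATTERN = re.compile(r'\d{5}\.?\d{4}\.?\d{3}\.?\d{2}\.?\d{2}\-?\d')

def rename_by_sequence(folder):
    """Rename each .txt in folder after its first 17-digit sequence; return the new paths."""
    renamed = []
    for path in sorted(glob.glob(os.path.join(folder, '*.txt'))):
        # Assumption: files decode as UTF-8; undecodable bytes are replaced.
        with open(path, encoding='utf-8', errors='replace') as content:
            match = PATTERN.search(content.read())
        if match is None:
            continue  # no sequence found: leave the file untouched
        digits = re.sub(r'[.\-]', '', match.group(0))
        target = os.path.join(folder, digits + '.txt')
        if os.path.exists(target):
            continue  # assumption: never overwrite an existing file
        os.rename(path, target)
        renamed.append(target)
    return renamed
```

Calling rename_by_sequence on a copy of the folder first is a cheap way to check the result before touching the real files.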

by jsbueno • **30,668** points · Answer 1 · 2017-11-16T12:45:54+00:00

You can use regular expressions for this.

I use a regular expression that finds every run of at least 17 characters made up of digits, "-" and ".". It would be possible to refine the expression until it finds exactly 17 digits by itself, but I think that gets too complex, so I prefer to combine the regular expression with some logic in Python.

Since the files are small (10 KB, and the same would hold if they were 30 times larger), there is no need to read only part of the file and search there. But nothing prevents you from reading only the first 4 KB of each file if the sequence is always there (~400 lines if the lines are not long).

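import os, re

def encontra_nome(pasta, nome_do_arquivo):
    # Read only the first 4 KB; the sequence is expected near the top of the file
    with open(os.path.join(pasta, nome_do_arquivo)) as arquivo:
        dados = arquivo.read(4096)
    # No space inside {17,35}: with a space, re treats the braces as literal text
    sequencias = re.findall(r"[0-9\.\-]{17,35}", dados)
    for seq in sequencias:
        # Strip the separators and keep only the digits
        sequencia_limpa = re.sub(r"\-|\.", "", seq)
        if len(sequencia_limpa) >= 17:
            return sequencia_limpa
    raise ValueError("17-digit sequence not found")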

The regular expression r"[0-9\.\-]{17,35}" matches, as described, any run of 17 to 35 characters drawn from digits, "-" and ".". This allows up to one separator after each digit, so it should cover all possible formats. I preferred this to complicating the regular expression, since regular expressions are neither especially readable nor easy to write, just to "count only the digits, ignore the other characters, and find exactly 17". A single regular expression for this would certainly be possible. Instead, once all candidates have been found, I go through them with a for loop and strip the "-" and "." from each one, this time with a simple regular expression that replaces every "-" and "." with "".
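
For comparison, here is one possible shape of the single regular expression mentioned above. It is only a sketch, not the code from this answer, and it rests on two assumptions: at most one "." or "-" sits between consecutive digits, and the match must not be glued to extra digits on either side. The names EXATOS_17 and primeira_sequencia are hypothetical.

```python
import re

# One digit, then exactly 16 more digits, each optionally preceded by one separator.
# The lookarounds reject runs that are part of a longer digit string.
EXATOS_17 = re.compile(r'(?<!\d)\d(?:[.\-]?\d){16}(?!\d)')

def primeira_sequencia(texto):
    """Return the first 17-digit sequence without separators, or None."""
    achado = EXATOS_17.search(texto)
    if achado is None:
        return None
    return re.sub(r'[.\-]', '', achado.group(0))

print(primeira_sequencia('a 00588.2007.011.02.00-9 b'))  # 00588200701102009
print(primeira_sequencia('id 123456789012345678 x'))      # None (18 digits)
```

Whether this is easier to maintain than the two-step version is a matter of taste; it mostly illustrates why the answer prefers to keep the counting in Python.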

I sometimes prefer two calls to the strings' replace method instead, but since we are already using regular expressions, there is no reason not to use one more: there is no performance barrier or anything like that, only the "uh-oh, here comes a regular expression" barrier for the people maintaining the code.
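
The alternative with two replace calls could look like the sketch below; limpa_sequencia is a hypothetical name, and the function simply does the same cleanup as the re.sub call without a regular expression.

```python
def limpa_sequencia(seq):
    """Remove dots and hyphens with two str.replace calls instead of re.sub."""
    return seq.replace('.', '').replace('-', '')

print(limpa_sequencia('00588.2007.011.02.00-9'))  # 00588200701102009
```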

by Lacobus • **13,510** points · Answer 2 · 2017-11-16T13:25:02+00:00

You can use the glob module to get a list with the names of all the .txt files in a given directory.

Iterating over this list, you can open each file, read only its first line, and extract only the digits from it:

linha = entrada.readline()
digitos = (''.join(s for s in linha if s.isdigit()))

Of the digits read, only the first 17 are kept and concatenated with the .txt extension:

destino = digitos[:17] + '.txt'

Once the output file name has been built, you can use the shutil module to copy the file under the new name.

Here is an example that can solve your problem:

import shutil
import glob


# Get the list of all .txt files in a given directory
lista_arquivos = glob.glob('/tmp/teste/*.txt')

# For each file in the list
for origem in lista_arquivos:

    # Open the source file for reading in text mode
    with open(origem) as entrada:

        # Read only the first line of the source file
        linha = entrada.readline()

        # Keep only the digits of the line that was read
        digitos = ''.join(s for s in linha if s.isdigit())

        # Build the name of the destination file
        destino = digitos[:17] + '.txt'

    # Show the processing status
    print("{} -> {}".format(origem, destino))

    # Copy the source file to the destination (relative to the current directory)
    shutil.copyfile(origem, destino)
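
The example above takes whatever digits the first line holds, so a first line with fewer than 17 digits would produce a short or empty file name. A small guard, sketched below with the hypothetical helper nome_destino, returns None in that case so the caller can skip the file; treating such files as "skip" is an assumption, not something the question specifies.

```python
def nome_destino(linha):
    """Build the destination name from the first 17 digits of linha, or return None."""
    digitos = ''.join(s for s in linha if s.isdigit())
    if len(digitos) < 17:
        return None  # assumption: not enough digits means the file is skipped
    return digitos[:17] + '.txt'

print(nome_destino('Processo 00588.2007.011.02.00-9'))  # 00588200701102009.txt
print(nome_destino('sem numero suficiente 123'))        # None
```

Inside the loop, the copy would then run only when nome_destino returns a name.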