Extract data from all rows of a file and create a dataframe

I have a .txt file with 2000 lines (a WhatsApp chat) from which I need to extract the date, time and sender of each message into a pandas dataframe. I can do this for a single line with the function below:

import re

def parse(file):
    # search for the date, the time and the sender anywhere in the line
    data = re.search(r'\d{2}/\d{2}/\d{4}', file)
    hora = re.search(r'\d{2}:\d{2}', file)
    pessoa = re.search(r'(?<=\-)(.*?)(?=\:)', file)
    return data.group(0), hora.group(0), pessoa.group(0)

which works perfectly for a line of the type:

    file = ('20/05/2020 20:35 - Rodrigo Toledo:')
    parse(file)

But I want a way to apply the parse function to all lines of the .txt file and then turn the results into a dataframe.

  • Could you give an example of an error, or of another kind of data that your code should handle?

  • The code should always work with a .txt file whose lines follow the pattern ('20/05/2020 20:35 - Rodrigo Toledo:'). The file has 2000 lines, so the parse function needs to go through all 2000 of them, saving the result of each line in another file that will serve as the basis for creating a dataframe.

1 answer

If the format is always this, you can use a single regex to extract all the data at once, save the results in a list as you go, and create the dataframe at the end:

import pandas as pd
import re

r = re.compile(r'(\d{2}/\d{2}/\d{4}) (\d{2}:\d{2}) - ([^:]+)')
itens = []
with open('dados.txt') as arq:
    for linha in arq: # for each line of the file
        m = r.match(linha)
        if m: # if the regex found a match, add it to the list
            itens.append(m.groups())

# create the dataframe
df = pd.DataFrame(itens, columns=['data', 'hora', 'nome'])

In the regex I put the parts corresponding to the date, time and name. For the date and time, I used the same patterns you were already using: the number of digits and the separators.

For the name, I used [^:]+, which is "one or more characters that are not :". So I take everything after the hyphen up to the next :.

Each of these parts is in parentheses to form a capture group, so I can get them all at once with the groups method, which returns a tuple with all the groups.
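As a minimal sketch of what match and groups return for one line (the sample line and the variable name campos are made up for illustration):

```python
import re

r = re.compile(r'(\d{2}/\d{2}/\d{4}) (\d{2}:\d{2}) - ([^:]+)')

# hypothetical sample line in the format described in the question
linha = '20/05/2020 20:35 - Rodrigo Toledo: hello'

m = r.match(linha)      # match() only looks at the start of the string
campos = m.groups()     # tuple with the three capture groups
print(campos)

# a line that does not start with a date gives no match at all
print(r.match('just some text'))
```

Since match returns None when the line does not fit the pattern, the `if m:` check in the loop is what skips such lines.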

At the end of the loop, the list itens will hold several tuples, each containing the date, time and name.

Then just create the dataframe and choose the column names (as an example, I used the creative names "data", "hora" and "nome").
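To try the whole thing without a real file, here is a small sketch that feeds the same loop from an in-memory text. The sample lines (including the sender "Ana Souza") and the name amostra are invented for the example; io.StringIO only stands in for open('dados.txt'):

```python
import io
import re

import pandas as pd

r = re.compile(r'(\d{2}/\d{2}/\d{4}) (\d{2}:\d{2}) - ([^:]+)')

# hypothetical sample data standing in for the real chat file
amostra = io.StringIO(
    '20/05/2020 20:35 - Rodrigo Toledo: first message\n'
    'second line of the same message\n'
    '21/05/2020 08:10 - Ana Souza: another message\n'
)

itens = []
for linha in amostra:
    m = r.match(linha)
    if m:  # lines that do not start with a date are skipped
        itens.append(m.groups())

df = pd.DataFrame(itens, columns=['data', 'hora', 'nome'])
print(df)
```

The second sample line does not start with a date, so it produces no row; only the two lines that follow the pattern end up in the dataframe.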

  • It worked perfectly. Thank you very much. One last little problem: each line has a second :. How do I capture everything after that second :?

  • @Pouchewar If you want to capture everything up to the next :, just add one more capture group: re.compile(r'(\d{2}/\d{2}/\d{4}) (\d{2}:\d{2}) - ([^:]+):([^:]+)') (remember that when creating the dataframe you also have to name the fourth column). But if you want everything up to the end of the line, use re.compile(r'(\d{2}/\d{2}/\d{4}) (\d{2}:\d{2}) - ([^:]+):(.+)'). In this case .+ matches everything up to the end of the line, so which one to use depends on what you need

  • Thank you very much. It helped a lot.
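For reference, a short sketch comparing the two regexes from the comment above on one made-up line. The names r_ate_dois_pontos and r_ate_fim, the sample line and the fourth column name "mensagem" are assumptions for illustration only:

```python
import re

import pandas as pd

# variant 1: the fourth group stops at the next ':'
r_ate_dois_pontos = re.compile(
    r'(\d{2}/\d{2}/\d{4}) (\d{2}:\d{2}) - ([^:]+):([^:]+)')
# variant 2: the fourth group goes to the end of the line
r_ate_fim = re.compile(
    r'(\d{2}/\d{2}/\d{4}) (\d{2}:\d{2}) - ([^:]+):(.+)')

# hypothetical line whose message itself contains a ':'
linha = '20/05/2020 20:35 - Rodrigo Toledo: see you at 21:00'

curta = r_ate_dois_pontos.match(linha).group(4)
completa = r_ate_fim.match(linha).group(4)
print(repr(curta))
print(repr(completa))

# with four groups, the dataframe needs four column names
df = pd.DataFrame([r_ate_fim.match(linha).groups()],
                  columns=['data', 'hora', 'nome', 'mensagem'])
# the captured text keeps the space after ':', so strip it if unwanted
df['mensagem'] = df['mensagem'].str.strip()
```

If messages can contain a colon (a time, a URL), the first variant cuts the text short, which is probably why the .+ version is the safer choice for the message body.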